Voice Messages for AI Chatbots

Add voice input and output to your AI bot. Set up speech-to-text and text-to-speech for OpenClaw on your VPS.

Published: 27/01/2025

Overview

Transform your AI chatbot into a voice assistant. Send voice messages and receive spoken responses - all through Telegram or Discord. Your VPS handles the transcription and speech synthesis, making hands-free AI interaction a reality.

How It Works

Voice Message (Telegram/Discord)
    ↓
[Your VPS]
    ↓
[Whisper API - Speech to Text]
    ↓
[AI Processes Request]
    ↓
[ElevenLabs - Text to Speech]
    ↓
Voice Response Sent Back
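
The flow above can be sketched as one async function. This is only an illustration, not OpenClaw's actual implementation: `handleVoiceMessage` and the three injected steps (`transcribe`, `askAI`, `synthesize`) are hypothetical names, and in practice each step would wrap a real API call.

```javascript
// Hypothetical sketch of the voice pipeline: audio in, audio out.
// The steps are injected so each one can wrap any provider.
async function handleVoiceMessage(audioBuffer, steps) {
  const { transcribe, askAI, synthesize } = steps;

  // 1. Speech to text (Whisper via Groq or OpenAI)
  const transcript = await transcribe(audioBuffer);
  if (!transcript || !transcript.trim()) {
    throw new Error("Transcription returned no text");
  }

  // 2. The AI produces a text reply
  const replyText = await askAI(transcript);

  // 3. Text to speech (for example ElevenLabs)
  const replyAudio = await synthesize(replyText);

  return { transcript, replyText, replyAudio };
}
```

Keeping the three steps separate makes it easy to swap a provider or to test the flow with stub functions.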

Why Voice?

  • Hands-free: Use while driving, cooking, exercising
  • Faster input: Speak faster than you type
  • Natural interaction: Feels like a real assistant
  • Accessibility: Easier for some users
  • Multilingual: Speak in any language that Whisper and your TTS provider support

Setup Guide

Prerequisites

  • OpenClaw running on a VPS
  • Telegram or Discord configured
  • API keys for speech services

Step 1: Get API Keys

For Speech-to-Text (Whisper):

  • Option A: Groq - Fast and affordable
  • Option B: OpenAI - Original Whisper

For Text-to-Speech:

  • ElevenLabs - Best quality voices
  • Alternative: OpenAI TTS (simpler, lower quality)

Step 2: Configure Environment

# Speech-to-Text (Transcription)
VOICE_INPUT_ENABLED=true
WHISPER_PROVIDER=groq  # or 'openai'
GROQ_API_KEY=your-groq-key
# or
OPENAI_API_KEY=your-openai-key

# Text-to-Speech (Voice Output)
VOICE_OUTPUT_ENABLED=true
TTS_PROVIDER=elevenlabs
ELEVENLABS_API_KEY=your-elevenlabs-key
ELEVENLABS_VOICE_ID=your-chosen-voice

# Voice Settings
VOICE_RESPONSE_THRESHOLD=50  # Respond with voice if input was voice
AUTO_VOICE_REPLY=true  # Voice input = voice output
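
A missing key is easier to catch at startup than on the first voice message. The sketch below shows one way to validate the variables above. `loadVoiceConfig` is a hypothetical helper, not part of OpenClaw, and the fallback providers (`groq`, `elevenlabs`) are assumptions taken from the example values.

```javascript
// Hypothetical loader: reads the variables shown above from an env-like object
// and fails early if the selected provider has no API key.
function loadVoiceConfig(env) {
  const sttProvider = env.WHISPER_PROVIDER || "groq";      // assumed default
  const ttsProvider = env.TTS_PROVIDER || "elevenlabs";    // assumed default

  const config = {
    inputEnabled: env.VOICE_INPUT_ENABLED === "true",
    outputEnabled: env.VOICE_OUTPUT_ENABLED === "true",
    autoVoiceReply: env.AUTO_VOICE_REPLY === "true",
    sttProvider,
    ttsProvider,
  };

  const missing = [];
  if (config.inputEnabled) {
    const key = sttProvider === "groq" ? "GROQ_API_KEY" : "OPENAI_API_KEY";
    if (!env[key]) missing.push(key);
  }
  if (config.outputEnabled && ttsProvider === "elevenlabs") {
    for (const key of ["ELEVENLABS_API_KEY", "ELEVENLABS_VOICE_ID"]) {
      if (!env[key]) missing.push(key);
    }
  }
  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
  return config;
}
```

Typical usage would be `const config = loadVoiceConfig(process.env);` once at startup.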

Step 3: Choose a Voice

ElevenLabs offers many voices. Find your voice ID:

  1. Go to ElevenLabs
  2. Browse Voice Library
  3. Click a voice → copy Voice ID

Popular choices:

  • Rachel: Warm, professional female
  • Adam: Clear, friendly male
  • Bella: Expressive, natural female

ELEVENLABS_VOICE_ID=21m00Tcm4TlvDq8ikWAM  # Rachel

Step 4: Test Voice Features

Send a voice message in Telegram: 🎤 "What's the weather like in London?"

The bot should reply with a voice message containing the answer.

Voice Configuration Options

Smart Voice Detection

Reply with voice only when the user sends a voice message:

# Match input format
AUTO_VOICE_REPLY=true

# Or always use text
AUTO_VOICE_REPLY=false
PREFER_TEXT_RESPONSE=true
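
The two settings above boil down to a small decision. The sketch below is one possible reading of them: `chooseReplyFormat` is a hypothetical helper, and letting `PREFER_TEXT_RESPONSE` win over `AUTO_VOICE_REPLY` when both are set is an assumption.

```javascript
// Hypothetical decision helper for the settings above.
// settings.preferTextResponse mirrors PREFER_TEXT_RESPONSE,
// settings.autoVoiceReply mirrors AUTO_VOICE_REPLY.
function chooseReplyFormat(inputWasVoice, settings) {
  if (settings.preferTextResponse) return "text";               // assumed to take priority
  if (settings.autoVoiceReply && inputWasVoice) return "voice"; // match the input format
  return "text";
}
```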

Voice Quality Settings

# ElevenLabs model
ELEVENLABS_MODEL=eleven_turbo_v2_5  # Fast
# or
ELEVENLABS_MODEL=eleven_multilingual_v2  # Best quality

# Voice settings
VOICE_STABILITY=0.5
VOICE_SIMILARITY_BOOST=0.75
VOICE_STYLE=0.5
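
To show where these values end up, here is a sketch that builds a request for the ElevenLabs text-to-speech endpoint without sending it. The endpoint, the `xi-api-key` header and the `model_id` and `voice_settings` fields come from the ElevenLabs API. `buildTtsRequest` and the way the `VOICE_*` variables map onto `voice_settings` are assumptions about how a bot might forward them.

```javascript
// Builds (but does not send) a text-to-speech request from the settings above.
function buildTtsRequest(text, env) {
  return {
    url:
      "https://api.elevenlabs.io/v1/text-to-speech/" +
      encodeURIComponent(env.ELEVENLABS_VOICE_ID),
    method: "POST",
    headers: {
      "xi-api-key": env.ELEVENLABS_API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text,
      model_id: env.ELEVENLABS_MODEL || "eleven_turbo_v2_5",
      voice_settings: {
        // Fallbacks mirror the example values in this guide
        stability: Number(env.VOICE_STABILITY ?? 0.5),
        similarity_boost: Number(env.VOICE_SIMILARITY_BOOST ?? 0.75),
        style: Number(env.VOICE_STYLE ?? 0.5),
      },
    }),
  };
}
```

On Node 18 or later the result can be passed to the built-in `fetch`: `const { url, ...init } = buildTtsRequest("Hello", process.env); const res = await fetch(url, init);`. The response body is the audio itself.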

Language Support

Whisper and ElevenLabs support multiple languages:

# Auto-detect language (recommended)
WHISPER_LANGUAGE=auto

# Or force specific language
WHISPER_LANGUAGE=en

Multilingual conversations:

  • Speak in Italian, get response in Italian
  • Mix languages in the same conversation
  • No manual language switching when WHISPER_LANGUAGE=auto
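
As a sketch of how the language setting could reach the transcription API: both Groq and OpenAI expose an OpenAI-compatible `/audio/transcriptions` endpoint with an optional `language` field. `transcriptionParams` is a hypothetical helper, and treating `auto` as "omit the field" is an assumption about how the bot interprets `WHISPER_LANGUAGE`.

```javascript
// Endpoints and Whisper model names for the two providers in this guide.
const STT_PROVIDERS = {
  groq: {
    url: "https://api.groq.com/openai/v1/audio/transcriptions",
    model: "whisper-large-v3",
  },
  openai: {
    url: "https://api.openai.com/v1/audio/transcriptions",
    model: "whisper-1",
  },
};

// Returns the URL and form fields for a transcription request.
function transcriptionParams(provider, whisperLanguage) {
  const target = STT_PROVIDERS[provider];
  if (!target) throw new Error("Unknown WHISPER_PROVIDER: " + provider);

  const fields = { model: target.model };
  // "auto" or unset: leave the field out so Whisper detects the language itself
  if (whisperLanguage && whisperLanguage !== "auto") {
    fields.language = whisperLanguage; // ISO 639-1 code such as "en" or "it"
  }
  return { url: target.url, fields };
}
```

The audio file itself would be added to the same multipart form under the `file` field.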

Creating Custom Voices

Clone Your Own Voice

ElevenLabs allows voice cloning:

  1. Record 1-5 minutes of clear speech
  2. Upload to ElevenLabs
  3. Use the cloned voice ID

ELEVENLABS_VOICE_ID=your-cloned-voice-id

Use cases:

  • Bot speaks in your voice
  • Create branded voice for business
  • Fun personalized assistant

Voice Personas

Different voices for different contexts:

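// In bot configuration (the voice IDs below are placeholders)
const voices = {
  default: 'rachel_voice_id',
  formal: 'professional_voice_id',
  casual: 'friendly_voice_id',
  alerts: 'urgent_voice_id'
};

// Pick a voice for the current context; alerts take priority
function selectVoice(context = {}) {
  if (context.isAlert) return voices.alerts;
  if (context.isBusinessHours) return voices.formal;
  if (context.isCasual) return voices.casual;  // hypothetical flag for the casual voice
  return voices.default;
}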

Use Cases

Morning Briefing

Wake up to a spoken summary:

MORNING_BRIEFING_VOICE=true
MORNING_BRIEFING_TIME=07:00

Bot sends audio at 7 AM: 🎤 "Good morning! Today is Monday, January 27th. You have 3 meetings: team standup at 10, client call at 2, and dentist at 4:30. Weather is 8 degrees and cloudy. Have a great day!"
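
If you want to see how a daily time such as `MORNING_BRIEFING_TIME` can be turned into a timer, here is a small sketch. `msUntilNext` is a hypothetical helper and uses the server's local time zone, which is an assumption about how the setting is interpreted.

```javascript
// Milliseconds from `now` until the next occurrence of a local "HH:MM" time.
function msUntilNext(timeStr, now = new Date()) {
  const match = /^(\d{1,2}):(\d{2})$/.exec(timeStr);
  if (!match) throw new Error("Expected HH:MM, got: " + timeStr);

  const hours = Number(match[1]);
  const minutes = Number(match[2]);
  if (hours > 23 || minutes > 59) throw new Error("Time out of range: " + timeStr);

  const next = new Date(now.getTime());
  next.setHours(hours, minutes, 0, 0);
  // If the time has already passed today, schedule it for tomorrow
  if (next.getTime() <= now.getTime()) next.setDate(next.getDate() + 1);
  return next.getTime() - now.getTime();
}
```

A bot could then call `setTimeout(sendBriefing, msUntilNext("07:00"))`, where `sendBriefing` is whatever function produces and sends the audio, and reschedule after each run.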

Voice-Controlled Home

Speak to control your home:

🎤 "Turn off all the lights and set the thermostat to 20 degrees"

Bot responds with voice confirmation: 🎤 "Done! All lights are off and thermostat set to 20 degrees."

Hands-Free Tasks

While cooking: 🎤 "Set a timer for 15 minutes"

While driving: 🎤 "Read my last 3 emails"

While exercising: 🎤 "What's next on my todo list?"

Language Learning

Practice conversations: 🎤 "Let's practice French. Ask me questions about my day."

Bot responds in French with pronunciation you can hear.

Cost Analysis

Speech-to-Text (Whisper)

| Provider | Cost per Hour |
|----------|---------------|
| Groq | ~£0.05 |
| OpenAI | ~£0.36 |

Typical usage: 5-10 minutes/day is 2.5-5 hours/month, which is roughly £0.13-0.25 on Groq or £0.90-1.80 on OpenAI at the rates above.

Text-to-Speech (ElevenLabs)

| Plan | Characters/month | Cost |
|------|------------------|------|
| Free | 10,000 | £0 |
| Starter | 30,000 | ~£4 |
| Creator | 100,000 | ~£18 |

Typical usage: 500-1000 chars/response × 50 responses/month = 25,000-50,000 chars/month, which fits the Starter plan at the low end and needs Creator at the high end.

Total Voice Costs

Light usage: £5-10/month

Heavy usage: £15-25/month
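
To estimate costs for your own usage, the arithmetic behind the two tables can be written as a short sketch. The rates and plan sizes are copied from the tables above and will drift over time. The function names are hypothetical.

```javascript
// Approximate rates in GBP per hour of audio, from the table above.
const STT_COST_PER_HOUR = { groq: 0.05, openai: 0.36 };

// Monthly speech-to-text cost for a given number of minutes per day.
function monthlySttCost(provider, minutesPerDay, daysPerMonth = 30) {
  const hours = (minutesPerDay * daysPerMonth) / 60;
  return hours * STT_COST_PER_HOUR[provider];
}

// ElevenLabs plans from the table above, smallest first.
const TTS_PLANS = [
  { name: "Free", chars: 10000, cost: 0 },
  { name: "Starter", chars: 30000, cost: 4 },
  { name: "Creator", chars: 100000, cost: 18 },
];

// Smallest listed plan that covers the monthly character count, or null if none does.
function cheapestTtsPlan(charsPerResponse, responsesPerMonth) {
  const needed = charsPerResponse * responsesPerMonth;
  return TTS_PLANS.find((plan) => plan.chars >= needed) || null;
}
```

For example, 10 minutes a day on OpenAI is 5 hours a month, or about £1.80, and 50 responses of 500 characters (25,000 characters) fit the Starter plan.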

Performance Optimization

Reduce Latency

# Use fastest models
ELEVENLABS_MODEL=eleven_turbo_v2_5
WHISPER_PROVIDER=groq  # Groq is faster

# Stream responses (if supported)
VOICE_STREAMING=true

Cache Common Responses

# Cache frequently used phrases
VOICE_CACHE_ENABLED=true
VOICE_CACHE_SIZE=100

Greetings, confirmations, and common responses are cached to avoid regeneration.
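
One way such a cache could work is a small least-recently-used (LRU) map keyed by voice and text. This is a sketch of the general technique, not OpenClaw's implementation, and `VoiceCache` is a hypothetical name. Its size limit plays the role of `VOICE_CACHE_SIZE`.

```javascript
// Minimal LRU cache for synthesized audio. A Map keeps insertion order,
// so the first key is always the least recently used entry.
class VoiceCache {
  constructor(maxEntries = 100) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  // The same text spoken by a different voice is a different entry
  static key(voiceId, text) {
    return voiceId + "|" + text.trim();
  }

  get(voiceId, text) {
    const k = VoiceCache.key(voiceId, text);
    if (!this.entries.has(k)) return undefined;
    const audio = this.entries.get(k);
    // Re-insert to mark the entry as recently used
    this.entries.delete(k);
    this.entries.set(k, audio);
    return audio;
  }

  set(voiceId, text, audio) {
    const k = VoiceCache.key(voiceId, text);
    this.entries.delete(k);
    this.entries.set(k, audio);
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}
```

Before calling the TTS API, the bot would check `cache.get(voiceId, text)` and only synthesize on a miss.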

Troubleshooting

No voice response

# Check the logs for voice or ElevenLabs errors (for example, a rejected API key)
pm2 logs openclaw | grep -i "elevenlabs\|voice"

# Test ElevenLabs directly; the response is audio, so save it to a file
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID" \
  -H "xi-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}' \
  --output test.mp3

Poor transcription quality

  • Speak clearly and not too fast
  • Reduce background noise
  • Check WHISPER_LANGUAGE setting

Voice sounds robotic

  • Try different ElevenLabs voices
  • Adjust stability and similarity settings
  • Use the eleven_multilingual_v2 model for better quality

High latency

  • Switch to Groq for Whisper (faster)
  • Use the eleven_turbo_v2_5 model for TTS
  • Ensure VPS has good network to APIs

Security Considerations

Voice Data Privacy

# Don't store voice files permanently
VOICE_RETENTION_MINUTES=5

# Process and delete
DELETE_VOICE_AFTER_TRANSCRIPTION=true
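
A retention window like `VOICE_RETENTION_MINUTES` amounts to a periodic cleanup of the folder where voice files are stored. The sketch below shows the idea with Node's `fs` module. `purgeOldVoiceFiles` and the idea of a single voice directory are assumptions, not OpenClaw's documented behaviour.

```javascript
const fs = require("fs");
const path = require("path");

// Deletes files in `dir` whose last modification is older than the retention window.
// Returns the names of the files that were removed.
function purgeOldVoiceFiles(dir, retentionMinutes, now = Date.now()) {
  const cutoff = now - retentionMinutes * 60 * 1000;
  const removed = [];
  for (const name of fs.readdirSync(dir)) {
    const filePath = path.join(dir, name);
    const stats = fs.statSync(filePath);
    if (stats.isFile() && stats.mtimeMs < cutoff) {
      fs.unlinkSync(filePath);
      removed.push(name);
    }
  }
  return removed;
}
```

Running it every minute with `setInterval` would keep stored audio within the configured window.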

Rate Limiting

# Prevent API abuse
VOICE_MESSAGES_PER_MINUTE=5
VOICE_MESSAGES_PER_DAY=100
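
For illustration, limits like these can be enforced per user with a sliding window over recent message timestamps. This is a sketch of the general technique under assumed names (`VoiceRateLimiter`, `allow`), kept in memory, so counts reset when the bot restarts.

```javascript
// Sliding-window limiter: at most `perMinute` messages in any 60 seconds
// and `perDay` messages in any 24 hours, tracked per user.
class VoiceRateLimiter {
  constructor(perMinute = 5, perDay = 100) {
    this.perMinute = perMinute;
    this.perDay = perDay;
    this.history = new Map(); // userId -> timestamps (ms) of accepted messages
  }

  allow(userId, now = Date.now()) {
    const dayAgo = now - 24 * 60 * 60 * 1000;
    const minuteAgo = now - 60 * 1000;

    // Drop timestamps older than a day, then count those in the last minute
    const recent = (this.history.get(userId) || []).filter((t) => t > dayAgo);
    const lastMinute = recent.filter((t) => t > minuteAgo).length;

    const allowed = lastMinute < this.perMinute && recent.length < this.perDay;
    if (allowed) recent.push(now);
    this.history.set(userId, recent);
    return allowed;
  }
}
```

The bot would call `limiter.allow(userId)` before transcribing and reply with a short text notice when it returns `false`.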

Alternative TTS Options

OpenAI TTS

Simpler setup, lower quality:

TTS_PROVIDER=openai
OPENAI_API_KEY=your-key
OPENAI_TTS_MODEL=tts-1
OPENAI_TTS_VOICE=alloy  # alloy, echo, fable, onyx, nova, shimmer

Local TTS (Free)

For privacy-focused setups:

TTS_PROVIDER=local
LOCAL_TTS_ENGINE=piper  # or espeak

Lower quality but no API costs.

Need Help?

Voice integration involves multiple APIs and careful configuration. Our premium setup service includes voice features fully configured and tested.
