Voice Models & Languages Configuration
Select the perfect AI voice and configure language preferences to create the ideal caller experience for your business. Understanding voice model differences, language options, and performance impacts will help you make the best choices for your AI receptionist.
Voice Processing Models
Real-time Voice Models
Real-time models process audio streams continuously, enabling natural conversation flow with minimal latency.
OpenAI Realtime API:
- Latency: 200-400ms end-to-end
- Technology: Direct audio-to-audio processing
- Experience: Natural conversation interruptions and overlapping speech
- Quality: High-quality with optimized real-time performance
- Best for: Interactive conversations requiring immediate responses
Traditional TTS/STT Models
Traditional models use separate Speech-to-Text (STT) and Text-to-Speech (TTS) processing with AI reasoning in between.
Processing Flow: caller audio, then Speech-to-Text, then AI reasoning on the text, then Text-to-Speech, then the spoken response.
Advantages:
- Higher accuracy: More processing time allows for better transcription
- Better reasoning: AI has full text context for complex decision making
- Flexibility: Can modify, analyze, and optimize text before speech synthesis
- Debugging: Full visibility into conversation transcripts
Disadvantages:
- Higher latency: Multiple processing steps create longer response times
- Less natural flow: Pauses between caller speech and AI response
- Interruption handling: More difficult to handle natural conversation overlaps
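The higher latency of the traditional approach follows from its stages running one after another, so the caller waits for the sum of all of them. A minimal sketch of that arithmetic, assuming hypothetical per-stage figures (the `Stage` type, the stage names, and the millisecond values are illustrative placeholders, not measurements from any provider):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    latency_ms: int  # illustrative placeholder, not a measured value


def total_latency_ms(stages: List[Stage]) -> int:
    """Stages run one after another, so their latencies add up."""
    return sum(stage.latency_ms for stage in stages)


# Hypothetical per-stage figures, chosen only to show the arithmetic.
pipeline = [
    Stage("speech_to_text", 300),
    Stage("ai_reasoning", 400),
    Stage("text_to_speech", 300),
]

print(total_latency_ms(pipeline))  # 1000
```

Shortening any single stage helps, but the total can never drop below the slowest stage on its own, which is why a direct audio-to-audio model can respond faster.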
Technology Comparison
| Aspect | Real-time Models | TTS/STT Models |
|---|---|---|
| End-to-end Latency | 200-500ms | 800-1500ms |
| Conversation Flow | Natural, interruptible | Turn-based with pauses |
| Processing Visibility | Limited (audio-to-audio) | Full (complete transcripts) |
| Accuracy | Good (optimized for speed) | Excellent (optimized for precision) |
| Complex Reasoning | Limited (real-time constraints) | Superior (full context available) |
| Interruption Handling | Native support | Requires special handling |
| Debugging | Audio-based only | Full text-based analysis |
| Cost | Higher (specialized models) | Lower (standard APIs) |
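One way to apply the comparison is to treat it as a filter: start from a latency budget and a transcript requirement, and see which model family still qualifies. The sketch below assumes the latency ranges from the table; the `MODELS` structure and the `candidate_models` function are hypothetical names for illustration, not part of any product API.

```python
from typing import List

# Figures copied from the comparison table (milliseconds, end to end).
MODELS = {
    "realtime": {"latency_ms": (200, 500), "full_transcripts": False},
    "tts_stt": {"latency_ms": (800, 1500), "full_transcripts": True},
}


def candidate_models(max_latency_ms: int, needs_full_transcripts: bool = False) -> List[str]:
    """Return model families whose worst-case latency fits the budget."""
    matches = []
    for name, props in MODELS.items():
        worst_case = props["latency_ms"][1]
        if worst_case > max_latency_ms:
            continue  # too slow for this budget
        if needs_full_transcripts and not props["full_transcripts"]:
            continue  # cannot provide complete transcripts
        matches.append(name)
    return matches


print(candidate_models(600))         # ['realtime']
print(candidate_models(2000, True))  # ['tts_stt']
print(candidate_models(600, True))   # []
```

An empty result signals a real trade-off: a tight latency budget and full transcript visibility cannot both be met, so one requirement has to be relaxed.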
Voice Provider Technologies
Real-time Voice Providers
OpenAI Realtime (Premium):
- Ultra-low latency conversational AI
- Native interruption and overlap handling
- Emotional tone and inflection awareness
- Direct audio processing without text intermediary
- Latency: 200-400ms
- Best for: High-end customer service, complex consultation

- Ultra-realistic AI voices with emotional nuance
- High-accuracy speech recognition
- Premium quality with higher processing time
- Combined Latency: 900-1200ms
- Best for: High-quality interactions where accuracy is critical

- Fast, reliable speech processing
- Good quality with optimized performance
- Consistent delivery across different content types
- Combined Latency: 700-1000ms
- Best for: Balanced quality and performance needs
Accessing Voice & Language Configuration

- Navigate to your Basic Settings dashboard
- Select Voice Selection settings
- You’ll see options for:
  - Voice model selection
  - Language preferences
  - Regional accent settings
  - Performance optimization options
Voice Model Selection
Available Voice Options
Professional Voices (Recommended for Business)
Sarah (ElevenLabs) - Professional Female
- Characteristics: Clear, authoritative, friendly
- Best for: Medical practices, legal offices, corporate services
- Latency: ~800ms
- Languages: English (US, UK, AU)

- Characteristics: Confident, reassuring, articulate
- Best for: Financial services, consulting, technical support
- Latency: ~850ms
- Languages: English (US, UK)

- Characteristics: Natural, friendly, efficient
- Best for: Restaurants, retail, general customer service
- Latency: ~400ms
- Languages: Multiple languages supported
Specialized Voices
Isabella (ElevenLabs) - Warm Female
- Characteristics: Caring, empathetic, gentle
- Best for: Healthcare, counseling, senior services
- Latency: ~900ms
- Languages: English, Spanish

- Characteristics: Deep, commanding, trustworthy
- Best for: Legal, insurance, high-stakes services
- Latency: ~600ms
- Languages: 20+ languages with regional variants

- Characteristics: Enthusiastic, upbeat, engaging
- Best for: Entertainment, events, creative services
- Latency: ~450ms
- Languages: Multiple languages with emotional range
Troubleshooting Voice Issues
Common Voice Problems
Issue: Voice Sounds Robotic
Symptoms:
- Monotone delivery
- Unnatural pauses
- Lack of emotional variation
- Mechanical pronunciation
Solutions:
- Switch to premium voice models (ElevenLabs)
- Add natural punctuation to your content
- Use contractions and conversational language
- Adjust speaking speed to a more natural pace
- Enable advanced prosody settings
Issue: High Latency Affecting Conversations
Symptoms:
- Long pauses before AI responds
- Callers speaking over the AI
- Conversation flow interruptions
- Caller frustration with delays
Solutions:
- Switch to a lower-latency voice provider (OpenAI)
- Enable response streaming
- Pre-cache common responses
- Optimize text length before voice synthesis
- Consider regional voice server selection
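Pre-caching works because greetings, hold messages, and closing lines repeat across calls, so their audio only needs to be synthesized once. The sketch below illustrates the idea under stated assumptions: `ResponseCache` is a hypothetical helper, and `fake_synthesize` stands in for whatever text-to-speech call your provider exposes (it is not a real provider API).

```python
from typing import Callable, Dict, Iterable, List


class ResponseCache:
    """Stores synthesized audio keyed by normalized response text."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._audio: Dict[str, bytes] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Ignore case and extra whitespace so near-identical phrases share an entry.
        return " ".join(text.lower().split())

    def warm(self, phrases: Iterable[str]) -> None:
        """Synthesize common phrases ahead of time, before any call arrives."""
        for phrase in phrases:
            self.get(phrase)

    def get(self, text: str) -> bytes:
        key = self._key(text)
        if key not in self._audio:
            self._audio[key] = self._synthesize(text)  # slow path, runs once per phrase
        return self._audio[key]


# Stand-in for a real text-to-speech call (assumption for this sketch).
calls: List[str] = []


def fake_synthesize(text: str) -> bytes:
    calls.append(text)
    return text.encode("utf-8")


cache = ResponseCache(fake_synthesize)
cache.warm(["Thank you for calling.", "Please hold."])
cache.get("thank you for calling.")  # served from the cache, no new synthesis
print(len(calls))  # 2
```

Only fixed phrases benefit from this. Responses that include caller-specific details still need live synthesis, which is where streaming and shorter text help.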
Issue: Pronunciation Errors
Symptoms:
- Business name mispronounced
- Technical terms spoken incorrectly
- Names and places pronounced wrong
- Industry jargon not recognized
Solutions:
- Add terms to pronunciation dictionary
- Use phonetic spelling in content
- Test voice with your specific vocabulary
- Configure industry-specific voice models
- Provide alternative text representations
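A pronunciation dictionary can be thought of as a text substitution applied just before speech synthesis: each problem term is swapped for a phonetic spelling the voice reads correctly. The following is a rough sketch of that idea, not the product's actual mechanism; `apply_pronunciations` and the sample spellings are hypothetical and only for illustration.

```python
import re
from typing import Dict


def apply_pronunciations(text: str, dictionary: Dict[str, str]) -> str:
    """Replace whole-word terms with their phonetic spellings, ignoring case."""
    if not dictionary:
        return text
    # Longest terms first, so multi-word entries win over their shorter parts.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): spoken for term, spoken in dictionary.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


# Illustrative entries; the right phonetic spelling depends on the voice in use.
pronunciations = {"Nguyen": "Win", "SQL": "sequel"}

print(apply_pronunciations("Dr. Nguyen will review your SQL report.", pronunciations))
# Dr. Win will review your sequel report.
```

The word-boundary match keeps short entries from rewriting the inside of longer words, but it assumes terms begin and end with a letter or digit. Testing each entry with your chosen voice remains necessary, since voices differ in how they read the same spelling.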
Issue: Language Detection Failures
Symptoms:
- Wrong language selected for caller
- Mixing languages inappropriately
- Defaulting to wrong language
- Confusion in multilingual scenarios
Solutions:
- Adjust detection confidence threshold
- Add explicit language selection option
- Improve greeting language indicators
- Test with various accents and dialects
- Configure better fallback strategies
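The confidence threshold and the fallback strategy work together: a detected language is used only when the detector is sure enough, and otherwise the call stays in a default language while the caller is offered an explicit choice. A minimal sketch of that decision, assuming the detector returns a confidence score between 0 and 1 per language (the function name, the 0.8 threshold, and the language codes are illustrative assumptions):

```python
from typing import Dict, Set, Tuple


def select_language(
    detections: Dict[str, float],
    supported: Set[str],
    threshold: float = 0.8,
    fallback: str = "en",
) -> Tuple[str, bool]:
    """Return (language, confident).

    When confident is False, the caller should be offered an explicit
    language choice instead of being switched automatically.
    """
    usable = {lang: score for lang, score in detections.items() if lang in supported}
    if not usable:
        return fallback, False
    best = max(usable, key=usable.get)
    if usable[best] >= threshold:
        return best, True
    return fallback, False


supported_languages = {"en", "es"}

print(select_language({"es": 0.92, "en": 0.05}, supported_languages))  # ('es', True)
print(select_language({"es": 0.55, "en": 0.40}, supported_languages))  # ('en', False)
print(select_language({"fr": 0.99}, supported_languages))              # ('en', False)
```

Raising the threshold reduces wrong-language switches at the cost of more explicit prompts, so it is worth tuning against recordings that include the accents and dialects your callers actually use.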
Best Practices for Voice & Language
Business Type Recommendations
Healthcare Practices:
- Voice: Warm, empathetic female voice (Isabella/ElevenLabs)
- Speed: Slightly slower for complex medical terms
- Languages: Match patient demographics
- Accent: Local regional accent for familiarity

- Voice: Authoritative, clear male or female (James/Google)
- Speed: Moderate with clear articulation
- Languages: Professional language variants
- Accent: Neutral or prestigious local accent

- Voice: Friendly, enthusiastic (Emma/OpenAI)
- Speed: Natural conversational pace
- Languages: Local community languages
- Accent: Welcoming local or neutral

- Voice: Clear, patient, knowledgeable
- Speed: Moderate with technical term emphasis
- Languages: International English variants
- Accent: Clear, internationally understood