Voice Configuration
Configure text-to-speech providers, voice selection, speech-to-text settings, interruption handling, and audio behavior for your voice agents.
How your agent sounds defines the caller experience. thinnestAI gives you full control over text-to-speech, speech-to-text, interruption behavior, and conversational timing. This guide covers every setting you can tune.
Choosing a TTS Provider
thinnestAI supports 8 text-to-speech providers — from ultra-low-latency engines for real-time voice to specialized providers for Indian languages. Pick the one that fits your use case.
Sarvam
India-first voice AI. Best-in-class support for Indian languages and accents. If your agents serve Indian customers, Sarvam is the clear choice.
Strengths:
- 10+ Indian languages (Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, Gujarati, Malayalam, Punjabi, Odia)
- Native Indian English accents
- Code-switching support (Hindi-English, Tamil-English, etc.)
- Optimized for Indian telecom networks
Models:
| Model | Languages | Quality | Best For |
|---|---|---|---|
| `bulbul:v1` | 10+ Indian languages | Highest | Production Indian voice agents |
| `bulbul:v1-turbo` | 10+ Indian languages | High | Low-latency Indian deployments |
Configuration:
{
  "tts": {
    "provider": "sarvam",
    "model": "bulbul:v1",
    "voice": "ananya",
    "language": "hi-IN",
    "speed": 1.0
  }
}
Popular Sarvam voices:
| Voice | Language | Description |
|---|---|---|
| `ananya` | Hindi | Clear, professional female |
| `arjun` | Hindi | Warm, conversational male |
| `meera` | Tamil | Natural female |
| `priya` | Telugu | Professional female |
| `advika` | English (Indian) | Natural Indian English accent |
Supported Indian languages:
| Language | Code | Code-Switching |
|---|---|---|
| Hindi | hi-IN | Hindi-English |
| Tamil | ta-IN | Tamil-English |
| Telugu | te-IN | Telugu-English |
| Kannada | kn-IN | Kannada-English |
| Bengali | bn-IN | Bengali-English |
| Marathi | mr-IN | Marathi-English |
| Gujarati | gu-IN | — |
| Malayalam | ml-IN | — |
| Punjabi | pa-IN | — |
| Odia | or-IN | — |
Cartesia
Blazing-fast, high-quality voices with fine-grained emotion control. Great for expressive, dynamic conversations.
Strengths:
- Sub-100ms latency
- Emotion and speed control mid-sentence
- Streaming word-level timestamps
- Multilingual support
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| `sonic-2` | ~80ms | Highest | Expressive voice agents |
| `sonic-2-turbo` | ~50ms | High | Ultra-low latency |
| `sonic-multilingual` | ~100ms | High | Non-English deployments |
Configuration:
{
  "tts": {
    "provider": "cartesia",
    "model": "sonic-2",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "output_format": "pcm_16000",
    "speed": "normal",
    "emotion": ["positivity:high", "curiosity:medium"]
  }
}
Deepgram TTS
Fast, affordable text-to-speech from the leaders in speech AI. Good balance of quality and cost.
Strengths:
- Very low latency
- Simple API
- Competitive pricing
- Good for high-volume use cases
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| `aura-asteria-en` | ~100ms | High | Female English voice |
| `aura-orion-en` | ~100ms | High | Male English voice |
| `aura-luna-en` | ~100ms | High | Soft female English |
| `aura-arcas-en` | ~100ms | High | Authoritative male |
Configuration:
{
  "tts": {
    "provider": "deepgram",
    "model": "aura-asteria-en"
  }
}
ElevenLabs
The most natural-sounding voices with ultra-realistic intonation. Best for customer-facing agents where voice quality is the top priority.
Strengths:
- Ultra-realistic voices with natural emotion
- Voice cloning (use your own voice)
- 29+ languages
- Extensive voice library
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| `eleven_turbo_v2_5` | ~150ms | Highest | Production voice agents |
| `eleven_multilingual_v2` | ~200ms | Highest | Multilingual deployments |
| `eleven_flash_v2_5` | ~100ms | High | Low-latency needs |
Configuration:
{
  "tts": {
    "provider": "elevenlabs",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_turbo_v2_5",
    "stability": 0.5,
    "similarity_boost": 0.75
  }
}
Popular ElevenLabs voices:
| Voice | ID | Description |
|---|---|---|
| Rachel | 21m00Tcm4TlvDq8ikWAM | Calm, professional female |
| Adam | pNInz6obpgDQGcFmaJgB | Deep, authoritative male |
| Bella | EXAVITQu4vr4xnSDxMaL | Warm, friendly female |
| Antoni | ErXwobaYiN019PkySvjV | Smooth, conversational male |
Rime
High-quality, low-latency voices with a focus on accuracy and consistency.
Strengths:
- Consistent voice quality across calls
- Good pronunciation accuracy
- Fast inference
- Simple integration
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| `mist` | ~120ms | High | General voice agents |
| `v1` | ~150ms | Good | Cost-optimized use |
Configuration:
{
  "tts": {
    "provider": "rime",
    "model": "mist",
    "voice": "cove",
    "speed": 1.0
  }
}
Inworld
Specialized for character-driven, immersive voice experiences. Great for agents with strong personas.
Strengths:
- Character-consistent voices
- Emotional expressiveness
- Real-time voice modulation
- Persona-driven design
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| `inworld-v2` | ~130ms | High | Character-driven agents |
| `inworld-v1` | ~160ms | Good | Standard agents |
Configuration:
{
  "tts": {
    "provider": "inworld",
    "model": "inworld-v2",
    "voice": "default",
    "emotion": "friendly"
  }
}
Google Cloud TTS
Wide language support with consistent quality. Best for multilingual deployments.
Strengths:
- 220+ voices across 40+ languages
- Neural2 and Studio voices for premium quality
- SSML support
- Predictable pricing
Models:
| Model | Quality | Best For |
|---|---|---|
| Neural2 | Highest | Production use |
| Studio | Highest | Premium applications |
| Standard | Good | Cost-optimized |
Configuration:
{
  "tts": {
    "provider": "google",
    "voice_name": "en-US-Neural2-F",
    "language_code": "en-US",
    "speaking_rate": 1.0,
    "pitch": 0.0
  }
}
Recommended Google voices:
| Voice | Language | Description |
|---|---|---|
| `en-US-Neural2-F` | English (US) | Natural female |
| `en-US-Neural2-D` | English (US) | Natural male |
| `en-GB-Neural2-A` | English (UK) | British female |
| `es-ES-Neural2-A` | Spanish | Female |
| `hi-IN-Neural2-A` | Hindi | Female |
Azure Speech
Enterprise-grade with fine-grained control. Best for regulated industries.
Strengths:
- High-quality neural voices
- Custom neural voice training
- SSML with extensive control
- Strong enterprise compliance
Models:
| Model | Quality | Best For |
|---|---|---|
| Neural | Highest | Production use |
| Custom Neural | Highest | Brand-specific voice |
Configuration:
{
  "tts": {
    "provider": "azure",
    "voice_name": "en-US-JennyNeural",
    "style": "friendly",
    "speaking_rate": "1.0"
  }
}
Provider Comparison
| Provider | Latency | Languages | Indian Languages | Best For |
|---|---|---|---|---|
| Aero (recommended) | ~60-100ms | English+ | No | General voice agents |
| Sarvam | ~100ms | 10+ Indian | Yes (best) | Indian market |
| Cartesia | ~50-100ms | 10+ | No | Expressive conversations |
| Deepgram | ~100ms | English | No | High-volume, cost-effective |
| ElevenLabs | ~100-200ms | 29+ | No | Premium voice quality |
| Rime | ~120ms | English+ | No | Consistent, reliable |
| Inworld | ~130ms | English+ | No | Character-driven agents |
| Google | ~150ms | 40+ | Hindi, Bengali, Tamil | Multilingual |
| Azure | ~150ms | 60+ | Hindi, Tamil, Telugu | Enterprise, compliance |
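If you choose providers programmatically, the comparison can be treated as plain data. The sketch below is illustrative only. The `PROVIDERS` list copies the upper latency bound and the Indian-language column from the table, and `shortlist` is a hypothetical helper, not part of the thinnestAI API.

```python
from typing import NamedTuple


class Provider(NamedTuple):
    name: str
    max_latency_ms: int      # upper bound of the latency range in the table
    indian_languages: bool   # True when the table lists any Indian language


# Values transcribed from the comparison table; not fetched from any API.
PROVIDERS = [
    Provider("aero", 100, False),
    Provider("sarvam", 100, True),
    Provider("cartesia", 100, False),
    Provider("deepgram", 100, False),
    Provider("elevenlabs", 200, False),
    Provider("rime", 120, False),
    Provider("inworld", 130, False),
    Provider("google", 150, True),
    Provider("azure", 150, True),
]


def shortlist(max_latency_ms: int, need_indian_languages: bool = False) -> list[str]:
    """Return provider names that fit the latency budget and language need."""
    return [
        p.name
        for p in PROVIDERS
        if p.max_latency_ms <= max_latency_ms
        and (p.indian_languages or not need_indian_languages)
    ]


candidates = shortlist(150, need_indian_languages=True)
```

Latency figures in the table are approximate, so treat the result as a starting list to test rather than a final choice.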
Voice Selection and Customization
Selecting a Voice in the Dashboard
- Open your voice agent configuration.
- Go to the Voice tab.
- Select a TTS provider.
- Browse available voices and click Preview to hear samples.
- Adjust settings (speed, pitch, stability) with the sliders.
- Click Save.
Voice Selection via API
curl -X PATCH "https://api.thinnest.ai/agents/agent_xyz" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_config": {
      "tts_provider": "elevenlabs",
      "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
      "tts_model": "eleven_turbo_v2_5",
      "tts_stability": 0.5,
      "tts_similarity_boost": 0.75
    }
  }'
Matching Voice to Use Case
| Use Case | Recommended Provider | Voice Style |
|---|---|---|
| General voice agents | Aero | Warm, professional |
| Indian market | Sarvam | Native language voices |
| Customer support | Aero or ElevenLabs | Warm, patient, professional |
| Sales calls | Cartesia or ElevenLabs | Energetic, confident |
| High-volume campaigns | Deepgram | Clear, cost-effective |
| Appointment reminders | Aero | Neutral, clear |
| Medical/legal | Azure | Professional, calm |
| Multilingual support | Google or Sarvam | Language-specific voices |
| Character-driven agents | Inworld | Persona-consistent |
| Brand voice | ElevenLabs (clone) | Your custom voice |
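Once the table points you to a provider, the update itself can be scripted. The sketch below rebuilds the PATCH request from Voice Selection via API using only Python's standard library. The URL, headers, and `voice_config` fields are copied from that curl example; the helper name `build_voice_update` is made up for illustration, and the response format is not covered here.

```python
import json
import urllib.request

API_BASE = "https://api.thinnest.ai"  # base URL taken from the curl example


def build_voice_update(agent_id: str, api_key: str, voice_config: dict) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that updates an agent's voice."""
    body = json.dumps({"voice_config": voice_config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/agents/{agent_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_voice_update(
    "agent_xyz",
    "YOUR_API_KEY",
    {
        "tts_provider": "elevenlabs",
        "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
        "tts_model": "eleven_turbo_v2_5",
        "tts_stability": 0.5,
        "tts_similarity_boost": 0.75,
    },
)
# To send it, pass the object to urllib.request.urlopen(request).
```

Building the request separately from sending it makes the payload easy to inspect or log before it reaches the API.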
Speech-to-Text (STT) Configuration
Speech-to-text converts the caller's audio into text for your AI agent to process. thinnestAI supports multiple STT providers with native support for Indian and global languages.
Deepgram (Default)
Industry-leading speech recognition with the best accuracy and lowest latency. Our default and recommended STT provider.
Models:
| Model | Accuracy | Latency | Best For |
|---|---|---|---|
| `nova-3` | Highest | ~100ms | Production voice agents |
| `nova-2` | High | ~120ms | General use |
| `nova-2-phonecall` | High | ~120ms | Phone call audio (optimized) |
| `nova-2-conversationalai` | High | ~120ms | Conversational agents |
| `enhanced` | Good | ~150ms | Cost-optimized |
| `base` | Standard | ~100ms | High-volume, budget |
{
  "stt": {
    "provider": "deepgram",
    "model": "nova-3",
    "language": "en-US",
    "punctuate": true,
    "smart_format": true,
    "profanity_filter": false
  }
}
Sarvam STT
Best-in-class for Indian languages. If your callers speak Hindi, Tamil, Telugu, or other Indian languages, Sarvam delivers the highest accuracy.
Models:
| Model | Languages | Best For |
|---|---|---|
| `saarika:v2` | 10+ Indian languages | Production Indian voice agents |
| `saarika:v2-turbo` | 10+ Indian languages | Low-latency Indian deployments |
{
  "stt": {
    "provider": "sarvam",
    "model": "saarika:v2",
    "language": "hi-IN"
  }
}
Supports code-switching (e.g., Hindi-English mixed speech) out of the box.
Google Speech-to-Text
Strong multilingual support with good accuracy across 125+ languages. Includes a built-in denoiser that filters background noise for cleaner transcriptions in noisy environments like call centers or outdoor calls.
Models:
| Model | Accuracy | Best For |
|---|---|---|
| `latest_long` | High | Long conversations |
| `latest_short` | High | Short utterances |
| `telephony` | High | Phone call audio |
| `medical_conversation` | High | Healthcare |
{
  "stt": {
    "provider": "google",
    "model": "telephony",
    "language": "en-US",
    "denoiser": true
  }
}
Set `"denoiser": true` to enable automatic background noise filtering. This is especially useful for phone call audio where ambient noise can degrade transcription accuracy.
Azure Speech-to-Text
Enterprise-grade with custom model training. Best for regulated industries.
Models:
| Model | Accuracy | Best For |
|---|---|---|
| `default` | High | General use |
| `custom` | Highest | Domain-specific vocabulary |
| `conversation` | High | Multi-speaker calls |
{
  "stt": {
    "provider": "azure",
    "model": "default",
    "language": "en-US"
  }
}
Cartesia STT
Real-time transcription with word-level timestamps.
{
  "stt": {
    "provider": "cartesia",
    "model": "ink",
    "language": "en"
  }
}
AssemblyAI
High-accuracy transcription with built-in speaker diarization and entity detection.
Models:
| Model | Accuracy | Best For |
|---|---|---|
| `best` | Highest | Maximum accuracy |
| `nano` | Good | Cost-optimized, low latency |
{
  "stt": {
    "provider": "assemblyai",
    "model": "best",
    "language": "en"
  }
}
STT Provider Comparison
| Provider | Languages | Indian Languages | Self-Hosted | Best For |
|---|---|---|---|---|
| Vega (thinnestAI) | 200+ | All 22 scheduled | Yes | Indian market, data sovereignty |
| Deepgram (default) | 30+ | Hindi | No | General voice agents |
| Sarvam | 10+ Indian | 10+ Indian | No | Indian languages |
| Google | 125+ | Hindi, Tamil, Telugu+ | No | Multilingual |
| Azure | 100+ | Hindi, Tamil, Telugu+ | No | Enterprise, custom models |
| Cartesia | 10+ | No | No | Timestamps, real-time |
| AssemblyAI | 20+ | No | No | Maximum accuracy |
Multi-Language Support
For agents that handle multiple languages:
{
  "stt": {
    "language": "multi",
    "detect_language": true,
    "supported_languages": ["en", "es", "fr", "hi", "ta"]
  }
}
The agent will detect the caller's language and respond accordingly.
STT/TTS Fallback
For production voice agents, you can configure fallback providers that automatically take over if the primary provider fails or times out. This ensures your agent stays responsive even during provider outages.
How Fallback Works
- The agent sends audio to the primary provider.
- If the primary provider fails (timeout, error, or degraded quality), the system automatically switches to the fallback provider.
- The switch is seamless — callers experience a brief pause at most, not a dropped call.
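The platform performs this switch for you, but the control flow is easy to picture. The following is a conceptual sketch only, not thinnestAI's implementation: `ProviderError`, `synthesize_with_fallback`, and the two stand-in provider functions are hypothetical names used to show the "try primary, then fallback" order described above.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised when a provider call fails (hypothetical error type)."""


def synthesize_with_fallback(
    text: str,
    providers: Sequence[tuple[str, Callable[[str], bytes]]],
) -> tuple[str, bytes]:
    """Try each (name, synthesize) pair in order and return the first success."""
    errors = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except (ProviderError, TimeoutError) as exc:
            # Record the failure and move on to the next provider in the list.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in provider functions for demonstration only.
def flaky_primary(text: str) -> bytes:
    raise TimeoutError("primary timed out")


def steady_fallback(text: str) -> bytes:
    return text.encode("utf-8")


used, audio = synthesize_with_fallback(
    "Hello", [("elevenlabs", flaky_primary), ("aero", steady_fallback)]
)
```

Here the primary times out, so `used` ends up naming the fallback provider. The caller only notices the extra time spent on the failed attempt.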
Configuring TTS Fallback
{
  "tts": {
    "provider": "elevenlabs",
    "model": "eleven_turbo_v2_5",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "fallback": {
      "provider": "aero",
      "model": "aero-2",
      "voice": "maya"
    }
  }
}
Configuring STT Fallback
{
  "stt": {
    "provider": "deepgram",
    "model": "nova-3",
    "fallback": {
      "provider": "google",
      "model": "telephony"
    }
  }
}
Recommended Fallback Pairs
| Primary | Fallback | Reason |
|---|---|---|
| ElevenLabs (TTS) | Aero (TTS) | Premium quality primary, fast fallback |
| Deepgram (STT) | Google (STT) | Best accuracy primary, wide language fallback |
| Cartesia (TTS) | Deepgram (TTS) | Ultra-low latency primary, reliable fallback |
| Sarvam (STT) | Google (STT) | Best Indian language primary, multilingual fallback |
Tip: Choose fallback providers that are hosted on different infrastructure than your primary to maximize resilience. For example, pairing a third-party provider with Google Cloud provides good redundancy.
Interruption Handling
Interruption handling determines what happens when the caller speaks while the agent is talking.
Modes
Allow interruptions (default):
The agent stops speaking when the caller starts talking. This feels natural — like a real conversation.
{
  "interruption": {
    "enabled": true,
    "sensitivity": "medium"
  }
}
Block interruptions:
The agent finishes its response before listening. Use this for critical information that must be delivered completely (e.g., legal disclaimers).
{
  "interruption": {
    "enabled": false
  }
}
Sensitivity Levels
| Level | Description | Best For |
|---|---|---|
| `low` | Only interrupts on sustained speech (1+ seconds) | Noisy environments |
| `medium` | Balanced detection | General use |
| `high` | Interrupts on brief sounds | Quick, responsive conversations |
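One way to reason about these levels is as a minimum duration of caller speech before the agent yields. The sketch below assumes exactly that model. Only the 1-second figure for `low` comes from the table; the `medium` and `high` thresholds are illustrative guesses, and `should_interrupt` is a hypothetical function, not a platform API.

```python
# Minimum continuous caller speech (in seconds) before the agent stops talking.
# Only the "low" value is documented; "medium" and "high" are assumed values.
MIN_SPEECH_SECONDS = {"low": 1.0, "medium": 0.5, "high": 0.2}


def should_interrupt(enabled: bool, sensitivity: str, speech_seconds: float) -> bool:
    """Decide whether caller speech of the given length interrupts the agent."""
    if not enabled:
        # Corresponds to "enabled": false, where the agent always finishes.
        return False
    return speech_seconds >= MIN_SPEECH_SECONDS[sensitivity]
```

Under these assumed thresholds, a 0.6-second cough would interrupt the agent at `high` or `medium` but not at `low`, which is why `low` suits noisy environments.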
Per-Message Interruption Control
You can control interruption behavior per message in your agent's system prompt:
When reading the terms and conditions, do not allow interruptions.
For all other responses, allow normal interruptions.
The agent will dynamically adjust based on what it is saying.
Silent Timeout Settings
Silent timeouts control what happens when the caller stops speaking.
{
  "timeouts": {
    "silence_timeout_seconds": 10,
    "silence_prompt": "Are you still there? I'm happy to help if you have any questions.",
    "max_silence_prompts": 2,
    "disconnect_after_max_silence": true,
    "disconnect_message": "It seems like you may have stepped away. Feel free to call back anytime. Goodbye!"
  }
}
| Parameter | Default | Description |
|---|---|---|
| `silence_timeout_seconds` | 10 | Seconds of silence before prompting |
| `silence_prompt` | none | What the agent says after silence |
| `max_silence_prompts` | 2 | How many times to prompt before disconnecting |
| `disconnect_after_max_silence` | true | Disconnect after max prompts exceeded |
| `disconnect_message` | none | Message before disconnecting |
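To see how these parameters interact, the sketch below simulates one unbroken stretch of silence. It rests on two assumptions that the settings do not spell out: the silence timer restarts after each prompt, and the disconnect happens one further timeout after the last prompt. `silence_actions` is a hypothetical helper for reasoning about a configuration, not a platform function.

```python
def silence_actions(silent_seconds: float, timeouts: dict) -> list[str]:
    """List what the agent says or does during one unbroken stretch of silence."""
    timeout = timeouts.get("silence_timeout_seconds", 10)
    max_prompts = timeouts.get("max_silence_prompts", 2)
    # Number of full timeout windows that have passed (assumes the timer
    # restarts after each prompt and ignores the time spent speaking).
    elapsed_timeouts = int(silent_seconds // timeout)

    actions = []
    for _ in range(min(elapsed_timeouts, max_prompts)):
        actions.append("say: " + timeouts.get("silence_prompt", ""))
    if elapsed_timeouts > max_prompts and timeouts.get("disconnect_after_max_silence", True):
        actions.append("say: " + timeouts.get("disconnect_message", ""))
        actions.append("disconnect")
    return actions
```

With the defaults (10 seconds, 2 prompts), this model prompts at roughly 10 and 20 seconds and disconnects at about 30 seconds of continuous silence.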
Recommended Timeout Settings
| Use Case | Timeout | Notes |
|---|---|---|
| Quick transactions | 5-8s | Fast-paced conversations |
| Customer support | 10-15s | Callers may be looking up info |
| Technical support | 15-20s | Callers may be following instructions |
| Sales calls | 8-12s | Keep momentum going |
Voice Formatting and Pronunciation
Pronunciation Overrides
If your agent mispronounces specific words (brand names, technical terms), add pronunciation overrides:
{
  "pronunciation": {
    "overrides": [
      {
        "word": "thinnestAI",
        "pronunciation": "thin-est A.I."
      },
      {
        "word": "PostgreSQL",
        "pronunciation": "post-gres-Q-L"
      },
      {
        "word": "API",
        "pronunciation": "A.P.I."
      }
    ]
  }
}
Number and Date Formatting
Control how the agent reads numbers and dates:
{
  "formatting": {
    "phone_numbers": "digit_by_digit",
    "currency": "natural",
    "dates": "natural",
    "times": "12_hour"
  }
}
| Setting | Options | Example |
|---|---|---|
| `phone_numbers` | `digit_by_digit`, `grouped` | "4-1-5-5-5-5-1-2-3-4" vs "415-555-1234" |
| `currency` | `natural`, `formal` | "twenty-three fifty" vs "twenty-three dollars and fifty cents" |
| `dates` | `natural`, `formal` | "March fifth" vs "March 5th, 2026" |
| `times` | `12_hour`, `24_hour` | "2 PM" vs "14 hundred hours" |
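Both pronunciation overrides and number formatting amount to rewriting text before it reaches the TTS engine. The sketch below shows one plausible way that rewriting could work, not how thinnestAI actually does it. The function names are hypothetical, matching is assumed to be case-sensitive and whole-word, and the 3-3-4 grouping is assumed to apply only to 10-digit numbers.

```python
import re


def apply_overrides(text: str, overrides: list[dict]) -> str:
    """Replace each whole-word match with its configured spoken form."""
    for item in overrides:
        pattern = r"\b" + re.escape(item["word"]) + r"\b"
        spoken = item["pronunciation"]
        # A function replacement avoids backslash escapes being interpreted.
        text = re.sub(pattern, lambda _match, s=spoken: s, text)
    return text


def speak_phone_number(number: str, style: str = "digit_by_digit") -> str:
    """Render a phone number in one of the two documented styles."""
    digits = re.sub(r"\D", "", number)  # drop spaces, dashes, parentheses
    if style == "digit_by_digit":
        return "-".join(digits)
    if style == "grouped":
        if len(digits) == 10:  # assumed 3-3-4 grouping for 10-digit numbers
            return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
        return digits
    raise ValueError(f"unknown phone number style: {style}")


overrides = [
    {"word": "thinnestAI", "pronunciation": "thin-est A.I."},
    {"word": "PostgreSQL", "pronunciation": "post-gres-Q-L"},
    {"word": "API", "pronunciation": "A.P.I."},
]
spoken_text = apply_overrides("thinnestAI exposes an API on PostgreSQL.", overrides)
spoken_number = speak_phone_number("(415) 555-1234")
```

Whole-word matching matters here: without it, the `API` override could fire inside longer identifiers that happen to contain those letters.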
Filler Words and Pauses
Make your agent sound more natural by eliminating awkward silence. thinnestAI supports two types of fillers:
Instant Filler Words
Short, natural sounds that play immediately when the user stops speaking — before the AI even starts thinking. This creates a human-like conversational feel, similar to how people say "hmm" or "aha" while listening.
{
  "fillerWordsEnabled": true,
  "fillerWords": ["hmm", "aha", "um", "okay"],
  "fillerWordsMinChars": 10
}
- `fillerWords`: Custom filler words. If empty, uses language-appropriate defaults (Hindi: "hmm", "acchha", "haan"; English: "hmm", "uh huh", "um", "right").
- `fillerWordsMinChars`: Only play fillers when the user said more than N characters. Short replies like "yes" or "no" skip fillers to avoid sounding unnatural.
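The interaction between these settings can be sketched as a small decision function. This is an illustration under assumptions, not the platform's logic: `pick_filler` is a hypothetical name, random selection among the configured words is a guess, and a missing `fillerWordsMinChars` is treated as 0. The default word lists are the ones given above.

```python
import random
from typing import Optional

# Language defaults as listed in the setting descriptions above.
DEFAULT_FILLERS = {
    "hi": ["hmm", "acchha", "haan"],
    "en": ["hmm", "uh huh", "um", "right"],
}


def pick_filler(user_text: str, config: dict, language: str = "en") -> Optional[str]:
    """Return a filler word to play, or None when no filler should be played."""
    if not config.get("fillerWordsEnabled", False):
        return None
    # Fillers play only when the user said MORE than N characters.
    if len(user_text.strip()) <= config.get("fillerWordsMinChars", 0):
        return None
    # An empty or missing list falls back to the language defaults.
    words = config.get("fillerWords") or DEFAULT_FILLERS.get(language, DEFAULT_FILLERS["en"])
    return random.choice(words)
```

With `fillerWordsMinChars` set to 10, a reply of "yes" returns `None`, while a full sentence returns one of the configured words.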
Thinking Phrases
Longer phrases spoken when the agent needs time to process a tool call or complex request:
{
  "silenceFillersEnabled": true,
  "silenceFillerPhrases": [
    "Let me check that for you.",
    "One moment please.",
    "Good question, let me look into that."
  ]
}
These are LLM-driven: the agent decides when to use them based on context.
Testing Your Configuration
After making changes, test thoroughly:
- Use the web call test — Click Test Call in the dashboard to hear your changes immediately.
- Test edge cases — Try interrupting, staying silent, speaking quickly, and using unusual words.
- Test on a real phone — Web call audio quality differs from phone audio. Always test over a real phone line before going live.
- Compare providers — Try the same conversation with different TTS providers to find the best fit.
- Get feedback — Have someone unfamiliar with the system test the call and provide honest feedback.
Next Steps
- Inbound Calls — Apply your voice configuration to inbound call handling
- Outbound Calls — Use your configured voice for outbound campaigns
- Call Recording — Record calls to review voice quality over time