Voice Configuration

Configure text-to-speech providers, voice selection, speech-to-text settings, interruption handling, and audio behavior for your voice agents.

How your agent sounds defines the caller experience. thinnestAI gives you full control over text-to-speech, speech-to-text, interruption behavior, and conversational timing. This guide covers every setting you can tune.

Choosing a TTS Provider

thinnestAI supports 8 text-to-speech providers — from ultra-low-latency engines for real-time voice to specialized providers for Indian languages. Pick the one that fits your use case.

Sarvam

India-first voice AI. Best-in-class support for Indian languages and accents. If your agents serve Indian customers, Sarvam is the clear choice.

Strengths:

  • 10+ Indian languages (Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, Gujarati, Malayalam, Punjabi, Odia)
  • Native Indian English accents
  • Code-switching support (Hindi-English, Tamil-English, etc.)
  • Optimized for Indian telecom networks

Models:

| Model | Languages | Quality | Best For |
| --- | --- | --- | --- |
| bulbul:v1 | 10+ Indian languages | Highest | Production Indian voice agents |
| bulbul:v1-turbo | 10+ Indian languages | High | Low-latency Indian deployments |

Configuration:

{
  "tts": {
    "provider": "sarvam",
    "model": "bulbul:v1",
    "voice": "ananya",
    "language": "hi-IN",
    "speed": 1.0
  }
}

Popular Sarvam voices:

| Voice | Language | Description |
| --- | --- | --- |
| ananya | Hindi | Clear, professional female |
| arjun | Hindi | Warm, conversational male |
| meera | Tamil | Natural female |
| priya | Telugu | Professional female |
| advika | English (Indian) | Natural Indian English accent |

Supported Indian languages:

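| Language | Code | Code-Switching |
| --- | --- | --- |
| Hindi | hi-IN | Hindi-English |
| Tamil | ta-IN | Tamil-English |
| Telugu | te-IN | Telugu-English |
| Kannada | kn-IN | Kannada-English |
| Bengali | bn-IN | Bengali-English |
| Marathi | mr-IN | Marathi-English |
| Gujarati | gu-IN | |
| Malayalam | ml-IN | |
| Punjabi | pa-IN | |
| Odia | or-IN | |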

Cartesia

Blazing-fast, high-quality voices with fine-grained emotion control. Great for expressive, dynamic conversations.

Strengths:

  • Sub-100ms latency
  • Emotion and speed control mid-sentence
  • Streaming word-level timestamps
  • Multilingual support

Models:

| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| sonic-2 | ~80ms | Highest | Expressive voice agents |
| sonic-2-turbo | ~50ms | High | Ultra-low latency |
| sonic-multilingual | ~100ms | High | Non-English deployments |

Configuration:

{
  "tts": {
    "provider": "cartesia",
    "model": "sonic-2",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "output_format": "pcm_16000",
    "speed": "normal",
    "emotion": ["positivity:high", "curiosity:medium"]
  }
}

Deepgram TTS

Fast, affordable text-to-speech from the leaders in speech AI. Good balance of quality and cost.

Strengths:

  • Very low latency
  • Simple API
  • Competitive pricing
  • Good for high-volume use cases

Models:

| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| aura-asteria-en | ~100ms | High | Female English voice |
| aura-orion-en | ~100ms | High | Male English voice |
| aura-luna-en | ~100ms | High | Soft female English |
| aura-arcas-en | ~100ms | High | Authoritative male |

Configuration:

{
  "tts": {
    "provider": "deepgram",
    "model": "aura-asteria-en"
  }
}

ElevenLabs

The most natural-sounding voices with ultra-realistic intonation. Best for customer-facing agents where voice quality is the top priority.

Strengths:

  • Ultra-realistic voices with natural emotion
  • Voice cloning (use your own voice)
  • 29+ languages
  • Extensive voice library

Models:

| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| eleven_turbo_v2_5 | ~150ms | Highest | Production voice agents |
| eleven_multilingual_v2 | ~200ms | Highest | Multilingual deployments |
| eleven_flash_v2_5 | ~100ms | High | Low-latency needs |

Configuration:

{
  "tts": {
    "provider": "elevenlabs",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_turbo_v2_5",
    "stability": 0.5,
    "similarity_boost": 0.75
  }
}

Popular ElevenLabs voices:

| Voice | ID | Description |
| --- | --- | --- |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Calm, professional female |
| Adam | pNInz6obpgDQGcFmaJgB | Deep, authoritative male |
| Bella | EXAVITQu4vr4xnSDxMaL | Warm, friendly female |
| Antoni | ErXwobaYiN019PkySvjV | Smooth, conversational male |

Rime

High-quality, low-latency voices with a focus on accuracy and consistency.

Strengths:

  • Consistent voice quality across calls
  • Good pronunciation accuracy
  • Fast inference
  • Simple integration

Models:

| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| mist | ~120ms | High | General voice agents |
| v1 | ~150ms | Good | Cost-optimized use |

Configuration:

{
  "tts": {
    "provider": "rime",
    "model": "mist",
    "voice": "cove",
    "speed": 1.0
  }
}

Inworld

Specialized for character-driven, immersive voice experiences. Great for agents with strong personas.

Strengths:

  • Character-consistent voices
  • Emotional expressiveness
  • Real-time voice modulation
  • Persona-driven design

Models:

| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| inworld-v2 | ~130ms | High | Character-driven agents |
| inworld-v1 | ~160ms | Good | Standard agents |

Configuration:

{
  "tts": {
    "provider": "inworld",
    "model": "inworld-v2",
    "voice": "default",
    "emotion": "friendly"
  }
}

Google Cloud TTS

Wide language support with consistent quality. Best for multilingual deployments.

Strengths:

  • 220+ voices across 40+ languages
  • Neural2 and Studio voices for premium quality
  • SSML support
  • Predictable pricing

Models:

| Model | Quality | Best For |
| --- | --- | --- |
| Neural2 | Highest | Production use |
| Studio | Highest | Premium applications |
| Standard | Good | Cost-optimized |

Configuration:

{
  "tts": {
    "provider": "google",
    "voice_name": "en-US-Neural2-F",
    "language_code": "en-US",
    "speaking_rate": 1.0,
    "pitch": 0.0
  }
}

Recommended Google voices:

| Voice | Language | Description |
| --- | --- | --- |
| en-US-Neural2-F | English (US) | Natural female |
| en-US-Neural2-D | English (US) | Natural male |
| en-GB-Neural2-A | English (UK) | British female |
| es-ES-Neural2-A | Spanish | Female |
| hi-IN-Neural2-A | Hindi | Female |

Azure Speech

Enterprise-grade with fine-grained control. Best for regulated industries.

Strengths:

  • High-quality neural voices
  • Custom neural voice training
  • SSML with extensive control
  • Strong enterprise compliance

Models:

| Model | Quality | Best For |
| --- | --- | --- |
| Neural | Highest | Production use |
| Custom Neural | Highest | Brand-specific voice |

Configuration:

{
  "tts": {
    "provider": "azure",
    "voice_name": "en-US-JennyNeural",
    "style": "friendly",
    "speaking_rate": "1.0"
  }
}

Provider Comparison

| Provider | Latency | Languages | Indian Languages | Best For |
| --- | --- | --- | --- | --- |
| Aero (recommended) | ~60-100ms | English+ | No | General voice agents |
| Sarvam | ~100ms | 10+ Indian | Yes (best) | Indian market |
| Cartesia | ~50-100ms | 10+ | No | Expressive conversations |
| Deepgram | ~100ms | English | No | High-volume, cost-effective |
| ElevenLabs | ~100-200ms | 29+ | No | Premium voice quality |
| Rime | ~120ms | English+ | No | Consistent, reliable |
| Inworld | ~130ms | English+ | No | Character-driven agents |
| Google | ~150ms | 40+ | Hindi, Bengali, Tamil | Multilingual |
| Azure | ~150ms | 60+ | Hindi, Tamil, Telugu | Enterprise, compliance |

Voice Selection and Customization

Selecting a Voice in the Dashboard

  1. Open your voice agent configuration.
  2. Go to the Voice tab.
  3. Select a TTS provider.
  4. Browse available voices and click Preview to hear samples.
  5. Adjust settings (speed, pitch, stability) with the sliders.
  6. Click Save.

Voice Selection via API

curl -X PATCH "https://api.thinnest.ai/agents/agent_xyz" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_config": {
      "tts_provider": "elevenlabs",
      "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
      "tts_model": "eleven_turbo_v2_5",
      "tts_stability": 0.5,
      "tts_similarity_boost": 0.75
    }
  }'
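
The same update can be issued from a script. The following is a minimal Python sketch using only the standard library, assuming the endpoint and field names shown in the curl example above. The helper name build_voice_patch is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.thinnest.ai"  # base URL taken from the curl example


def build_voice_patch(agent_id: str, api_key: str, voice_config: dict) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that updates an agent's voice."""
    body = json.dumps({"voice_config": voice_config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/agents/{agent_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_voice_patch(
    "agent_xyz",
    "YOUR_API_KEY",
    {
        "tts_provider": "elevenlabs",
        "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
        "tts_model": "eleven_turbo_v2_5",
        "tts_stability": 0.5,
        "tts_similarity_boost": 0.75,
    },
)
# To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the API.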

Matching Voice to Use Case

| Use Case | Recommended Provider | Voice Style |
| --- | --- | --- |
| General voice agents | Aero | Warm, professional |
| Indian market | Sarvam | Native language voices |
| Customer support | Aero or ElevenLabs | Warm, patient, professional |
| Sales calls | Cartesia or ElevenLabs | Energetic, confident |
| High-volume campaigns | Deepgram | Clear, cost-effective |
| Appointment reminders | Aero | Neutral, clear |
| Medical/legal | Azure | Professional, calm |
| Multilingual support | Google or Sarvam | Language-specific voices |
| Character-driven agents | Inworld | Persona-consistent |
| Brand voice | ElevenLabs (clone) | Your custom voice |

Speech-to-Text (STT) Configuration

Speech-to-text converts the caller's audio into text for your AI agent to process. thinnestAI supports multiple STT providers with native support for Indian and global languages.

Deepgram (Default)

Industry-leading speech recognition with the best accuracy and lowest latency. Our default and recommended STT provider.

Models:

| Model | Accuracy | Latency | Best For |
| --- | --- | --- | --- |
| nova-3 | Highest | ~100ms | Production voice agents |
| nova-2 | High | ~120ms | General use |
| nova-2-phonecall | High | ~120ms | Phone call audio (optimized) |
| nova-2-conversationalai | High | ~120ms | Conversational agents |
| enhanced | Good | ~150ms | Cost-optimized |
| base | Standard | ~100ms | High-volume, budget |

{
  "stt": {
    "provider": "deepgram",
    "model": "nova-3",
    "language": "en-US",
    "punctuate": true,
    "smart_format": true,
    "profanity_filter": false
  }
}

Sarvam STT

Best-in-class for Indian languages. If your callers speak Hindi, Tamil, Telugu, or other Indian languages, Sarvam delivers the highest accuracy.

Models:

| Model | Languages | Best For |
| --- | --- | --- |
| saarika:v2 | 10+ Indian languages | Production Indian voice agents |
| saarika:v2-turbo | 10+ Indian languages | Low-latency Indian deployments |

{
  "stt": {
    "provider": "sarvam",
    "model": "saarika:v2",
    "language": "hi-IN"
  }
}

Supports code-switching (e.g., Hindi-English mixed speech) out of the box.

Google Speech-to-Text

Strong multilingual support with good accuracy across 125+ languages. Includes a built-in denoiser that filters background noise for cleaner transcriptions in noisy environments like call centers or outdoor calls.

Models:

| Model | Accuracy | Best For |
| --- | --- | --- |
| latest_long | High | Long conversations |
| latest_short | High | Short utterances |
| telephony | High | Phone call audio |
| medical_conversation | High | Healthcare |

{
  "stt": {
    "provider": "google",
    "model": "telephony",
    "language": "en-US",
    "denoiser": true
  }
}

Set "denoiser": true to enable automatic background noise filtering. This is especially useful for phone call audio where ambient noise can degrade transcription accuracy.

Azure Speech-to-Text

Enterprise-grade with custom model training. Best for regulated industries.

Models:

| Model | Accuracy | Best For |
| --- | --- | --- |
| default | High | General use |
| custom | Highest | Domain-specific vocabulary |
| conversation | High | Multi-speaker calls |

{
  "stt": {
    "provider": "azure",
    "model": "default",
    "language": "en-US"
  }
}

Cartesia STT

Real-time transcription with word-level timestamps.

{
  "stt": {
    "provider": "cartesia",
    "model": "ink",
    "language": "en"
  }
}

AssemblyAI

High-accuracy transcription with built-in speaker diarization and entity detection.

Models:

| Model | Accuracy | Best For |
| --- | --- | --- |
| best | Highest | Maximum accuracy |
| nano | Good | Cost-optimized, low latency |

{
  "stt": {
    "provider": "assemblyai",
    "model": "best",
    "language": "en"
  }
}

STT Provider Comparison

| Provider | Languages | Indian Languages | Self-Hosted | Best For |
| --- | --- | --- | --- | --- |
| Vega (ThinnestAI) | 200+ | All 22 scheduled | Yes | Indian market, data sovereignty |
| Deepgram (default) | 30+ | Hindi | No | General voice agents |
| Sarvam | 10+ Indian | 10+ Indian | No | Indian languages |
| Google | 125+ | Hindi, Tamil, Telugu+ | No | Multilingual |
| Azure | 100+ | Hindi, Tamil, Telugu+ | No | Enterprise, custom models |
| Cartesia | 10+ | No | No | Timestamps, real-time |
| AssemblyAI | 20+ | No | No | Maximum accuracy |

Multi-Language Support

For agents that handle multiple languages:

{
  "stt": {
    "language": "multi",
    "detect_language": true,
    "supported_languages": ["en", "es", "fr", "hi", "ta"]
  }
}

The agent will detect the caller's language and respond accordingly.

STT/TTS Fallback

For production voice agents, you can configure fallback providers that automatically take over if the primary provider fails or times out. This ensures your agent stays responsive even during provider outages.

How Fallback Works

  1. The agent sends each request to the primary provider (caller audio for STT, response text for TTS).
  2. If the primary provider fails (timeout, error, or degraded quality), the system automatically switches to the fallback provider.
  3. The switch is seamless — callers experience a brief pause at most, not a dropped call.
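
The steps above can be sketched as an ordered list of providers that are tried in turn. This is an illustration only, not the platform's implementation: ProviderError, call_with_fallback, and the two stand-in provider functions are hypothetical names, and the sketch treats only raised errors as failures (it does not detect degraded quality).

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider call times out or fails."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    payload: str,
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, result) from the first success."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(payload)
        except ProviderError as exc:
            last_error = exc  # remember the failure and move on to the next provider
    raise ProviderError(f"all providers failed: {last_error}")


def failing_primary(text: str) -> str:
    # Stand-in for a primary provider that is currently unavailable.
    raise ProviderError("primary timed out")


def working_fallback(text: str) -> str:
    # Stand-in for a healthy fallback provider.
    return f"audio:{text}"


result = call_with_fallback(
    [("elevenlabs", failing_primary), ("aero", working_fallback)],
    "hello",
)
```

Here the primary raises, so the call is answered by the second entry in the list without the caller of call_with_fallback needing to know which provider responded.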

Configuring TTS Fallback

{
  "tts": {
    "provider": "elevenlabs",
    "model": "eleven_turbo_v2_5",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "fallback": {
      "provider": "aero",
      "model": "aero-2",
      "voice": "maya"
    }
  }
}

Configuring STT Fallback

{
  "stt": {
    "provider": "deepgram",
    "model": "nova-3",
    "fallback": {
      "provider": "google",
      "model": "telephony"
    }
  }
}

| Primary | Fallback | Reason |
| --- | --- | --- |
| ElevenLabs (TTS) | Aero (TTS) | Premium quality primary, fast fallback |
| Deepgram (STT) | Google (STT) | Best accuracy primary, wide language fallback |
| Cartesia (TTS) | Deepgram (TTS) | Ultra-low latency primary, reliable fallback |
| Sarvam (STT) | Google (STT) | Best Indian language primary, multilingual fallback |

Tip: Choose fallback providers that are hosted on different infrastructure than your primary to maximize resilience. For example, pairing a third-party provider with Google Cloud provides good redundancy.

Interruption Handling

Interruption handling determines what happens when the caller speaks while the agent is talking.

Modes

Allow interruptions (default):

The agent stops speaking when the caller starts talking. This feels natural — like a real conversation.

{
  "interruption": {
    "enabled": true,
    "sensitivity": "medium"
  }
}

Block interruptions:

The agent finishes its response before listening. Use this for critical information that must be delivered completely (e.g., legal disclaimers).

{
  "interruption": {
    "enabled": false
  }
}

Sensitivity Levels

| Level | Description | Best For |
| --- | --- | --- |
| low | Only interrupts on sustained speech (1+ seconds) | Noisy environments |
| medium | Balanced detection | General use |
| high | Interrupts on brief sounds | Quick, responsive conversations |
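
One way to picture the sensitivity levels is as a minimum duration of caller speech before the agent yields. The sketch below is a rough model under stated assumptions: only the 1-second threshold for low comes from the table, while the medium and high thresholds, the default of medium, and the function name should_interrupt are invented for illustration.

```python
# Minimum seconds of caller speech before the agent stops talking.
MIN_SPEECH_SECONDS = {
    "low": 1.0,     # from the table: sustained speech of 1+ seconds
    "medium": 0.5,  # assumed value, for illustration only
    "high": 0.2,    # assumed value, for illustration only
}


def should_interrupt(config: dict, speech_seconds: float) -> bool:
    """Decide whether detected caller speech should cut off the agent."""
    interruption = config.get("interruption", {})
    if not interruption.get("enabled", True):
        return False  # interruptions blocked: the agent finishes its response
    level = interruption.get("sensitivity", "medium")
    return speech_seconds >= MIN_SPEECH_SECONDS[level]


noisy_line = {"interruption": {"enabled": True, "sensitivity": "low"}}
quick_chat = {"interruption": {"enabled": True, "sensitivity": "high"}}
disclaimer = {"interruption": {"enabled": False}}
```

With these assumed thresholds, a 0.4-second noise burst is ignored on the low setting but interrupts the agent on the high setting.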

Per-Message Interruption Control

You can control interruption behavior per message in your agent's system prompt:

When reading the terms and conditions, do not allow interruptions.
For all other responses, allow normal interruptions.

The agent will dynamically adjust based on what it is saying.

Silent Timeout Settings

Silent timeouts control what happens when the caller stops speaking.

{
  "timeouts": {
    "silence_timeout_seconds": 10,
    "silence_prompt": "Are you still there? I'm happy to help if you have any questions.",
    "max_silence_prompts": 2,
    "disconnect_after_max_silence": true,
    "disconnect_message": "It seems like you may have stepped away. Feel free to call back anytime. Goodbye!"
  }
}

| Parameter | Default | Description |
| --- | --- | --- |
| silence_timeout_seconds | 10 | Seconds of silence before prompting |
| silence_prompt | | What the agent says after silence |
| max_silence_prompts | 2 | How many times to prompt before disconnecting |
| disconnect_after_max_silence | true | Disconnect after max prompts exceeded |
| disconnect_message | | Message before disconnecting |

Suggested timeouts by use case:

| Use Case | Timeout | Notes |
| --- | --- | --- |
| Quick transactions | 5-8s | Fast-paced conversations |
| Customer support | 10-15s | Callers may be looking up info |
| Technical support | 15-20s | Callers may be following instructions |
| Sales calls | 8-12s | Keep momentum going |
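
The interaction between these settings can be sketched as a small state tracker. This is a simplified model, not the platform's implementation: the SilenceTracker class is hypothetical, and it assumes the silence timer restarts after each prompt and that the prompt counter resets whenever the caller speaks.

```python
from dataclasses import dataclass


@dataclass
class SilenceTracker:
    # Defaults mirror the parameter defaults documented above.
    silence_timeout_seconds: float = 10
    max_silence_prompts: int = 2
    disconnect_after_max_silence: bool = True
    prompts_sent: int = 0

    def on_silence(self, silent_seconds: float) -> str:
        """Return 'wait', 'prompt', or 'disconnect' for the current stretch of silence."""
        if silent_seconds < self.silence_timeout_seconds:
            return "wait"
        if self.prompts_sent < self.max_silence_prompts:
            self.prompts_sent += 1
            return "prompt"  # speak silence_prompt
        if self.disconnect_after_max_silence:
            return "disconnect"  # speak disconnect_message, then hang up
        return "wait"

    def on_speech(self) -> None:
        """Assumed behavior: any caller speech resets the prompt counter."""
        self.prompts_sent = 0


tracker = SilenceTracker()
actions = [tracker.on_silence(s) for s in (3, 10, 10, 10)]
```

With the defaults, the caller is prompted twice and disconnected on the third timeout.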

Voice Formatting and Pronunciation

Pronunciation Overrides

If your agent mispronounces specific words (brand names, technical terms), add pronunciation overrides:

{
  "pronunciation": {
    "overrides": [
      {
        "word": "thinnestAI",
        "pronunciation": "thin-est A.I."
      },
      {
        "word": "PostgreSQL",
        "pronunciation": "post-gres-Q-L"
      },
      {
        "word": "API",
        "pronunciation": "A.P.I."
      }
    ]
  }
}
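
Conceptually, an override replaces the written form of a word with its spoken form before the text reaches the TTS engine. The sketch below shows one way that substitution could work. It assumes case-sensitive, whole-word matching, which may differ from how the platform actually applies overrides, and apply_pronunciation_overrides is a hypothetical helper.

```python
import re


def apply_pronunciation_overrides(text: str, overrides: list[dict]) -> str:
    """Replace each overridden word with its spoken form in a single pass."""
    mapping = {item["word"]: item["pronunciation"] for item in overrides}
    if not mapping:
        return text
    # Longest entries first, so a longer word wins over a shorter one it contains.
    words = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    # A single substitution pass avoids re-processing text that was already replaced.
    return pattern.sub(lambda match: mapping[match.group(1)], text)


overrides = [
    {"word": "thinnestAI", "pronunciation": "thin-est A.I."},
    {"word": "PostgreSQL", "pronunciation": "post-gres-Q-L"},
    {"word": "API", "pronunciation": "A.P.I."},
]
spoken = apply_pronunciation_overrides("The thinnestAI API uses PostgreSQL.", overrides)
# spoken == "The thin-est A.I. A.P.I. uses post-gres-Q-L."
```

Whole-word matching means a word such as "APIs" is left untouched in this sketch; add a separate override if plural forms need their own pronunciation.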

Number and Date Formatting

Control how the agent reads numbers and dates:

{
  "formatting": {
    "phone_numbers": "digit_by_digit",
    "currency": "natural",
    "dates": "natural",
    "times": "12_hour"
  }
}

| Setting | Options | Example |
| --- | --- | --- |
| phone_numbers | digit_by_digit, grouped | "4-1-5-5-5-5-1-2-3-4" vs "415-555-1234" |
| currency | natural, formal | "twenty-three fifty" vs "twenty-three dollars and fifty cents" |
| dates | natural, formal | "March fifth" vs "March 5th, 2026" |
| times | 12_hour, 24_hour | "2 PM" vs "14 hundred hours" |
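
To make the phone_numbers options concrete, the sketch below reproduces the two example outputs from the table. The function name speak_phone_number is hypothetical, and the 3-3-4 grouping is an assumption based on the 10-digit example; other number lengths would need their own grouping rules.

```python
def speak_phone_number(number: str, style: str = "digit_by_digit") -> str:
    """Render a phone number the way each phone_numbers option would read it."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if style == "digit_by_digit":
        return "-".join(digits)
    if style == "grouped":
        if len(digits) != 10:
            raise ValueError("this sketch only groups 10-digit numbers")
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    raise ValueError(f"unknown style: {style}")


by_digit = speak_phone_number("(415) 555-1234", "digit_by_digit")
grouped = speak_phone_number("(415) 555-1234", "grouped")
```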

Filler Words and Pauses

Make your agent sound more natural by eliminating awkward silence. thinnestAI supports two types of fillers:

Instant Filler Words

Short, natural sounds that play immediately when the user stops speaking — before the AI even starts thinking. This creates a human-like conversational feel, similar to how people say "hmm" or "aha" while listening.

{
  "fillerWordsEnabled": true,
  "fillerWords": ["hmm", "aha", "um", "okay"],
  "fillerWordsMinChars": 10
}

  • fillerWords: Custom filler words. If the list is empty, language-appropriate defaults are used (Hindi: "hmm", "acchha", "haan"; English: "hmm", "uh huh", "um", "right").
  • fillerWordsMinChars: Fillers play only when the caller's utterance is longer than N characters. Short replies like "yes" or "no" skip fillers so the agent does not sound unnatural.
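
The selection rule described by these two settings can be sketched as follows. This is an illustration, not the platform's code: pick_filler is a hypothetical helper, the language keys "hi" and "en" are assumed, and choosing a filler at random is an assumption about behavior the settings do not specify.

```python
import random

# Language defaults as listed above; the "hi"/"en" keys are assumed for this sketch.
DEFAULT_FILLERS = {
    "hi": ["hmm", "acchha", "haan"],
    "en": ["hmm", "uh huh", "um", "right"],
}


def pick_filler(config: dict, utterance: str, language: str = "en"):
    """Return a filler word to play, or None when no filler should be played."""
    if not config.get("fillerWordsEnabled", False):
        return None
    # Fillers play only when the caller said more than fillerWordsMinChars characters.
    if len(utterance) <= config.get("fillerWordsMinChars", 0):
        return None
    words = config.get("fillerWords") or DEFAULT_FILLERS.get(language, DEFAULT_FILLERS["en"])
    return random.choice(words)


config = {
    "fillerWordsEnabled": True,
    "fillerWords": ["hmm", "aha", "um", "okay"],
    "fillerWordsMinChars": 10,
}
short_reply = pick_filler(config, "yes")
long_reply = pick_filler(config, "I would like to book an appointment")
```

A three-character reply such as "yes" falls under the 10-character minimum, so no filler is played for it.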

Thinking Phrases

Longer phrases spoken when the agent needs time to process a tool call or complex request:

{
  "silenceFillersEnabled": true,
  "silenceFillerPhrases": [
    "Let me check that for you.",
    "One moment please.",
    "Good question, let me look into that."
  ]
}

These are LLM-driven — the agent decides when to use them based on context.

Testing Your Configuration

After making changes, test thoroughly:

  1. Use the web call test — Click Test Call in the dashboard to hear your changes immediately.
  2. Test edge cases — Try interrupting, staying silent, speaking quickly, and using unusual words.
  3. Test on a real phone — Web call audio quality differs from phone audio. Always test over a real phone line before going live.
  4. Compare providers — Try the same conversation with different TTS providers to find the best fit.
  5. Get feedback — Have someone unfamiliar with the system test the call and provide honest feedback.

Next Steps

  • Inbound Calls — Apply your voice configuration to inbound call handling
  • Outbound Calls — Use your configured voice for outbound campaigns
  • Call Recording — Record calls to review voice quality over time
