Voice Configuration

How your agent sounds defines the caller experience. thinnestAI gives you full control over text-to-speech, speech-to-text, interruption behavior, and conversational timing. This guide covers every setting you can tune.

Choosing a TTS Provider

thinnestAI supports 8 text-to-speech providers — from ultra-low-latency engines for real-time voice to specialized providers for Indian languages. Pick the one that fits your use case.

Sarvam

India-first voice AI. Best-in-class support for Indian languages and accents. If your agents serve Indian customers, Sarvam is the clear choice. Strengths:
  • 10+ Indian languages (Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, Gujarati, Malayalam, Punjabi, Odia)
  • Native Indian English accents
  • Code-switching support (Hindi-English, Tamil-English, etc.)
  • Optimized for Indian telecom networks
Models:
| Model | Languages | Quality | Best For |
| --- | --- | --- | --- |
| bulbul:v3 | 10+ Indian languages | Highest (recommended) | Production Indian voice agents |
| bulbul:v3-beta | 10+ Indian languages | High (experimental) | Trying upcoming voices |
| bulbul:v2 | 10+ Indian languages | Stable | Legacy production deployments |
bulbul:v3 is the current default. It accepts three additional knobs the older v2 doesn’t:
  • temperature (0.01–1.0, default 0.5): the default gives assistant-style prosody; 0.6 is more expressive but adds prosody jitter on long turns.
  • min_buffer_size (30–200 chars, default 50): how much text to buffer before the first audio chunk is emitted. Lower values give a faster time to first audio (TTFA) but more fragmented prosody.
  • max_chunk_length (50–500 chars, default 150): the length at which long sentences are split for streaming.
These are surfaced automatically when you pick a bulbul:v3 model — no extra config needed. Configuration:
{
  "tts": {
    "provider": "sarvam",
    "model": "bulbul:v3",
    "voice": "shubh",
    "language": "hi-IN",
    "speed": 1.0
  }
}
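If you set the three bulbul:v3 knobs yourself, a small range check can catch out-of-range values before they reach the provider. The sketch below is a minimal illustration that uses only the ranges and defaults listed above; the helper name `build_bulbul_v3_options` is hypothetical, and where these options sit in the agent config is not specified in this guide.

```python
# Ranges and defaults as documented above for bulbul:v3.
V3_RANGES = {
    "temperature": (0.01, 1.0),
    "min_buffer_size": (30, 200),
    "max_chunk_length": (50, 500),
}
V3_DEFAULTS = {"temperature": 0.5, "min_buffer_size": 50, "max_chunk_length": 150}


def build_bulbul_v3_options(**overrides):
    """Merge overrides into the documented defaults and range-check each value.

    Hypothetical helper: it only validates numbers, it does not call Sarvam.
    """
    options = {**V3_DEFAULTS, **overrides}
    for name, value in options.items():
        if name not in V3_RANGES:
            raise KeyError(f"unknown bulbul:v3 option: {name}")
        low, high = V3_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
    return options


# A slightly more expressive voice with a smaller first-chunk buffer.
options = build_bulbul_v3_options(temperature=0.6, min_buffer_size=30)
```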
Popular Sarvam voices:
| Voice | Sarvam version | Language | Description |
| --- | --- | --- | --- |
| shubh | v3 (default) | Hindi / Indian English | Warm, conversational male |
| ritu | v3 | Hindi / Indian English | Clear professional female |
| pooja, kavya, ishita, priya | v3 | Hindi / Indian English | Female, varied tone |
| rahul, amit, dev, kabir, aditya | v3 | Hindi / Indian English | Male, varied tone |
| anushka | v2 (default) | Hindi | Female, baseline production voice |
| manisha, vidya, arya | v2 | Hindi | Female alternates |
| abhilash, karun, hitesh | v2 | Hindi | Male alternates |
The full v3 roster is ~25 speakers covering Hindi, Tamil, Telugu, Marathi, Bengali, English, and code-switched accents. Pick voices in the agent studio’s TTS picker — switching between v2 and v3 automatically filters to compatible speakers. Supported Indian languages:
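| Language | Code | Code-Switching |
| --- | --- | --- |
| Hindi | hi-IN | Hindi-English |
| Tamil | ta-IN | Tamil-English |
| Telugu | te-IN | Telugu-English |
| Kannada | kn-IN | Kannada-English |
| Bengali | bn-IN | Bengali-English |
| Marathi | mr-IN | Marathi-English |
| Gujarati | gu-IN | |
| Malayalam | ml-IN | |
| Punjabi | pa-IN | |
| Odia | or-IN | |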

Cartesia

Blazing-fast, high-quality voices with fine-grained emotion control. Great for expressive, dynamic conversations. Strengths:
  • Sub-100ms latency
  • Emotion and speed control mid-sentence
  • Streaming word-level timestamps
  • Multilingual support
Models:
| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| sonic-2 | ~80ms | Highest | Expressive voice agents |
| sonic-2-turbo | ~50ms | High | Ultra-low latency |
| sonic-multilingual | ~100ms | High | Non-English deployments |
Configuration:
{
  "tts": {
    "provider": "cartesia",
    "model": "sonic-2",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "output_format": "pcm_16000",
    "speed": "normal",
    "emotion": ["positivity:high", "curiosity:medium"]
  }
}

Deepgram TTS

Fast, affordable text-to-speech from the leaders in speech AI. Good balance of quality and cost. Strengths:
  • Very low latency
  • Simple API
  • Competitive pricing
  • Good for high-volume use cases
Models:
| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| aura-asteria-en | ~100ms | High | Female English voice |
| aura-orion-en | ~100ms | High | Male English voice |
| aura-luna-en | ~100ms | High | Soft female English |
| aura-arcas-en | ~100ms | High | Authoritative male |
Configuration:
{
  "tts": {
    "provider": "deepgram",
    "model": "aura-asteria-en"
  }
}

ElevenLabs

The most natural-sounding voices with ultra-realistic intonation. Best for customer-facing agents where voice quality is the top priority. Strengths:
  • Ultra-realistic voices with natural emotion
  • Voice cloning (use your own voice)
  • 29+ languages
  • Extensive voice library
Models:
| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| eleven_turbo_v2_5 | ~150ms | Highest | Production voice agents |
| eleven_multilingual_v2 | ~200ms | Highest | Multilingual deployments |
| eleven_flash_v2_5 | ~100ms | High | Low-latency needs |
Configuration:
{
  "tts": {
    "provider": "elevenlabs",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_turbo_v2_5",
    "stability": 0.5,
    "similarity_boost": 0.75
  }
}
Popular ElevenLabs voices:
| Voice | ID | Description |
| --- | --- | --- |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Calm, professional female |
| Adam | pNInz6obpgDQGcFmaJgB | Deep, authoritative male |
| Bella | EXAVITQu4vr4xnSDxMaL | Warm, friendly female |
| Antoni | ErXwobaYiN019PkySvjV | Smooth, conversational male |

Rime

High-quality, low-latency voices with a focus on accuracy and consistency. Strengths:
  • Consistent voice quality across calls
  • Good pronunciation accuracy
  • Fast inference
  • Simple integration
Models:
| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| mist | ~120ms | High | General voice agents |
| v1 | ~150ms | Good | Cost-optimized use |
Configuration:
{
  "tts": {
    "provider": "rime",
    "model": "mist",
    "voice": "cove",
    "speed": 1.0
  }
}

Inworld

Specialized for character-driven, immersive voice experiences. Great for agents with strong personas. Strengths:
  • Character-consistent voices
  • Emotional expressiveness
  • Real-time voice modulation
  • Persona-driven design
Models:
| Model | Latency | Quality | Best For |
| --- | --- | --- | --- |
| inworld-v2 | ~130ms | High | Character-driven agents |
| inworld-v1 | ~160ms | Good | Standard agents |
Configuration:
{
  "tts": {
    "provider": "inworld",
    "model": "inworld-v2",
    "voice": "default",
    "emotion": "friendly"
  }
}

Google Cloud TTS

Wide language support with consistent quality. Best for multilingual deployments. Strengths:
  • 220+ voices across 40+ languages
  • Neural2 and Studio voices for premium quality
  • SSML support
  • Predictable pricing
Models:
| Model | Quality | Best For |
| --- | --- | --- |
| Neural2 | Highest | Production use |
| Studio | Highest | Premium applications |
| Standard | Good | Cost-optimized |
Configuration:
{
  "tts": {
    "provider": "google",
    "voice_name": "en-US-Neural2-F",
    "language_code": "en-US",
    "speaking_rate": 1.0,
    "pitch": 0.0
  }
}
Recommended Google voices:
| Voice | Language | Description |
| --- | --- | --- |
| en-US-Neural2-F | English (US) | Natural female |
| en-US-Neural2-D | English (US) | Natural male |
| en-GB-Neural2-A | English (UK) | British female |
| es-ES-Neural2-A | Spanish | Female |
| hi-IN-Neural2-A | Hindi | Female |

Azure Speech

Enterprise-grade with fine-grained control. Best for regulated industries. Strengths:
  • High-quality neural voices
  • Custom neural voice training
  • SSML with extensive control
  • Strong enterprise compliance
Models:
| Model | Quality | Best For |
| --- | --- | --- |
| Neural | Highest | Production use |
| Custom Neural | Highest | Brand-specific voice |
Configuration:
{
  "tts": {
    "provider": "azure",
    "voice_name": "en-US-JennyNeural",
    "style": "friendly",
    "speaking_rate": "1.0"
  }
}

Provider Comparison

| Provider | Latency | Languages | Indian Languages | Best For |
| --- | --- | --- | --- | --- |
| Sarvam | ~100ms | 10+ Indian | Yes (best) | Indian market |
| Cartesia (recommended) | ~50-100ms | 10+ | No | Expressive conversations |
| Deepgram | ~100ms | English | No | High-volume, cost-effective |
| ElevenLabs | ~100-200ms | 29+ | No | Premium voice quality |
| Rime | ~120ms | English+ | No | Consistent, reliable |
| Inworld | ~130ms | English+ | No | Character-driven agents |
| Google | ~150ms | 40+ | Hindi, Bengali, Tamil | Multilingual |
| Azure | ~150ms | 60+ | Hindi, Tamil, Telugu | Enterprise, compliance |

Voice Selection and Customization

Selecting a Voice in the Dashboard

  1. Open your voice agent configuration.
  2. Go to the Voice tab.
  3. Select a TTS provider.
  4. Browse available voices and click Preview to hear samples.
  5. Adjust settings (speed, pitch, stability) with the sliders.
  6. Click Save.

Voice Selection via API

curl -X PATCH "https://api.thinnest.ai/agents/agent_xyz" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_config": {
      "tts_provider": "elevenlabs",
      "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
      "tts_model": "eleven_turbo_v2_5",
      "tts_stability": 0.5,
      "tts_similarity_boost": 0.75
    }
  }'
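The same request can be built from Python with only the standard library. This is a minimal sketch that mirrors the curl call above (same URL, headers, and body); the helper name `build_voice_patch_request` is hypothetical, and the request is constructed but not sent so the snippet runs without network access.

```python
import json
import urllib.request

API_BASE = "https://api.thinnest.ai"  # base URL taken from the curl example above


def build_voice_patch_request(agent_id, api_key, voice_config):
    """Build (but do not send) the PATCH request that updates an agent's voice."""
    body = json.dumps({"voice_config": voice_config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/agents/{agent_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_voice_patch_request(
    "agent_xyz",
    "YOUR_API_KEY",
    {
        "tts_provider": "elevenlabs",
        "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
        "tts_model": "eleven_turbo_v2_5",
        "tts_stability": 0.5,
        "tts_similarity_boost": 0.75,
    },
)
# To send it: urllib.request.urlopen(request), which raises HTTPError on a non-2xx reply.
```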

Matching Voice to Use Case

| Use Case | Recommended Provider | Voice Style |
| --- | --- | --- |
| General voice agents | Cartesia | Warm, professional |
| Indian market | Sarvam | Native language voices |
| Customer support | ElevenLabs or Cartesia | Warm, patient, professional |
| Sales calls | Cartesia or ElevenLabs | Energetic, confident |
| High-volume campaigns | Deepgram | Clear, cost-effective |
| Appointment reminders | Deepgram | Neutral, clear |
| Medical/legal | Azure | Professional, calm |
| Multilingual support | Google or Sarvam | Language-specific voices |
| Character-driven agents | Inworld | Persona-consistent |
| Brand voice | ElevenLabs (clone) | Your custom voice |

Speech-to-Text (STT) Configuration

Speech-to-text converts the caller’s audio into text for your AI agent to process. thinnestAI supports multiple STT providers with native support for Indian and global languages.

Deepgram (Default)

Industry-leading speech recognition with the best accuracy and lowest latency. Our default and recommended STT provider. Models:
| Model | Accuracy | Latency | Best For |
| --- | --- | --- | --- |
| flux-general-en | High | ~200ms + native turn detection | Voice agents (English); fastest end-to-end with built-in EOU |
| nova-3 | Highest | ~100ms | Production voice agents (multilingual) |
| nova-2 | High | ~120ms | General use |
| nova-2-phonecall | High | ~120ms | Phone call audio (optimized) |
| nova-2-conversationalai | High | ~120ms | Conversational agents |
| enhanced | Good | ~150ms | Cost-optimized |
| base | Standard | ~100ms | High-volume, budget |
Flux models use Deepgram’s v2 streaming API with native turn detection: the model itself decides when the user has stopped speaking, eliminating the need for a separate VAD silence timer. Pair Flux with Turn Detection Mode → STT Endpointing for the lowest end-to-end latency (~400ms from the user’s end of speech to the agent’s reply). When Flux is selected, the interim_results, punctuate, and language UI settings are managed by Flux internally and are not forwarded to the STT plugin. The example below configures nova-3:
{
  "stt": {
    "provider": "deepgram",
    "model": "nova-3",
    "language": "en-US",
    "punctuate": true,
    "smart_format": true,
    "profanity_filter": false
  }
}

Sarvam STT

Best-in-class for Indian languages. If your callers speak Hindi, Tamil, Telugu, or other Indian languages, Sarvam delivers the highest accuracy. Models:
| Model | Languages | Best For |
| --- | --- | --- |
| saarika:v2 | 10+ Indian languages | Production Indian voice agents |
| saarika:v2-turbo | 10+ Indian languages | Low-latency Indian deployments |
{
  "stt": {
    "provider": "sarvam",
    "model": "saarika:v2",
    "language": "hi-IN"
  }
}
Supports code-switching (e.g., Hindi-English mixed speech) out of the box.

Google Speech-to-Text

Strong multilingual support with good accuracy across 125+ languages. Includes a built-in denoiser that filters background noise for cleaner transcriptions in noisy environments like call centers or outdoor calls. Models:
| Model | Accuracy | Best For |
| --- | --- | --- |
| latest_long | High | Long conversations |
| latest_short | High | Short utterances |
| telephony | High | Phone call audio |
| medical_conversation | High | Healthcare |
{
  "stt": {
    "provider": "google",
    "model": "telephony",
    "language": "en-US",
    "denoiser": true
  }
}
Set "denoiser": true to enable automatic background noise filtering. This is especially useful for phone call audio where ambient noise can degrade transcription accuracy.

Azure Speech-to-Text

Enterprise-grade with custom model training. Best for regulated industries. Models:
| Model | Accuracy | Best For |
| --- | --- | --- |
| default | High | General use |
| custom | Highest | Domain-specific vocabulary |
| conversation | High | Multi-speaker calls |
{
  "stt": {
    "provider": "azure",
    "model": "default",
    "language": "en-US"
  }
}

Cartesia STT

Real-time transcription with word-level timestamps.
{
  "stt": {
    "provider": "cartesia",
    "model": "ink",
    "language": "en"
  }
}

AssemblyAI

High-accuracy transcription with built-in speaker diarization and entity detection. Models:
| Model | Accuracy | Best For |
| --- | --- | --- |
| best | Highest | Maximum accuracy |
| nano | Good | Cost-optimized, low latency |
{
  "stt": {
    "provider": "assemblyai",
    "model": "best",
    "language": "en"
  }
}

STT Provider Comparison

| Provider | Languages | Indian Languages | Self-Hosted | Best For |
| --- | --- | --- | --- | --- |
| Deepgram (default) | 30+ | Hindi | No | General voice agents |
| Sarvam | 10+ Indian | 10+ Indian | No | Indian languages |
| Google | 125+ | Hindi, Tamil, Telugu+ | No | Multilingual |
| Azure | 100+ | Hindi, Tamil, Telugu+ | No | Enterprise, custom models |
| Cartesia | 10+ | No | No | Timestamps, real-time |
| AssemblyAI | 20+ | No | No | Maximum accuracy |

Multi-Language Support

For agents that handle multiple languages:
{
  "stt": {
    "language": "multi",
    "detect_language": true,
    "supported_languages": ["en", "es", "fr", "hi", "ta"]
  }
}
The agent will detect the caller’s language and respond accordingly.

Turn Detection Mode

Turn detection decides when the user has stopped speaking so the agent can take its turn. Three modes are available, each with different latency and accuracy trade-offs.
| Mode | id | Mechanism | Best For |
| --- | --- | --- | --- |
| Ensemble | turn_detector | Combines silence timing, the live transcript, and an AI end-of-turn model; tells a mid-thought pause apart from a finished question | Recommended default: most accurate across languages, including callers who pause mid-sentence |
| STT Endpointing | stt_endpointing | The STT provider’s own end-of-utterance signal | Lowest latency with a provider that has a strong native signal (Deepgram Flux, Sarvam Saaras) |
| VAD Only | vad_only | Silence-based detection only; fires after min_silence_duration of silence | Fastest possible turn-end; short, simple utterances |
Ensemble is the default for new agents. Rather than relying on a single signal, it reads what the caller said alongside how long they paused — so it waits through a natural mid-sentence pause (“I’d like to know about…”) and ends the turn promptly on a complete thought (“…your pricing”).

Your choice is respected

Turn Detection Mode is exactly what you set on the Detection tab. Choosing a speech-to-text provider no longer overrides it — the Ensemble model layers on top of any provider, so there’s no need to force a single-signal mode for Sarvam or Deepgram Flux. If you want the lowest possible latency and your provider has a strong native end-of-utterance signal (Deepgram Flux, Sarvam Saaras), pick STT Endpointing explicitly.

Configuration via API

{
  "voice_config": {
    "turn_detection_mode": "stt_endpointing",
    "min_silence_duration": 0.3,
    "min_speech_duration": 0.1,
    "min_endpointing_delay": 300,
    "max_endpointing_delay": 1.5
  }
}
| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| turn_detection_mode | string | "turn_detector" | One of turn_detector, stt_endpointing, vad_only |
| min_silence_duration | float (seconds) | 0.3 | VAD silence threshold before declaring speech ended |
| min_speech_duration | float (seconds) | 0.1 | Minimum sound duration to count as valid speech (filters short noises) |
| min_endpointing_delay | int (ms) | 300 | Buffer added after silence detection before processing |
| max_endpointing_delay | float (seconds) | 1.5 | Hard cutoff for turn end regardless of context |
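Because min_endpointing_delay is given in milliseconds while max_endpointing_delay is given in seconds, it is easy to set them inconsistently. The sketch below is a hypothetical client-side check, not part of the thinnestAI API: it validates the mode id and normalizes both delays to seconds before comparing them, using the defaults from the table.

```python
VALID_MODES = ("turn_detector", "stt_endpointing", "vad_only")


def endpointing_window_seconds(voice_config):
    """Return (min_delay_s, max_delay_s) after normalizing the mixed units.

    Per the field table, min_endpointing_delay is in milliseconds and
    max_endpointing_delay is in seconds. Missing fields use the listed defaults.
    """
    mode = voice_config.get("turn_detection_mode", "turn_detector")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown turn_detection_mode: {mode!r}")
    min_s = voice_config.get("min_endpointing_delay", 300) / 1000.0
    max_s = float(voice_config.get("max_endpointing_delay", 1.5))
    if min_s > max_s:
        raise ValueError(
            f"min_endpointing_delay ({min_s}s) exceeds max_endpointing_delay ({max_s}s)"
        )
    return min_s, max_s


window = endpointing_window_seconds(
    {
        "turn_detection_mode": "stt_endpointing",
        "min_endpointing_delay": 300,
        "max_endpointing_delay": 1.5,
    }
)
```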

STT/TTS Fallback

For production voice agents, you can configure fallback providers that automatically take over if the primary provider fails or times out. This ensures your agent stays responsive even during provider outages.

How Fallback Works

  1. The agent sends audio to the primary provider.
  2. If the primary provider fails (timeout, error, or degraded quality), the system automatically switches to the fallback provider.
  3. The switch is seamless — callers experience a brief pause at most, not a dropped call.
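The steps above amount to a try-primary-then-fallback pattern. The sketch below illustrates that pattern only; it is not thinnestAI's implementation. The names `call_with_fallback`, `flaky_primary`, and `stable_fallback` are hypothetical stand-ins for real provider clients.

```python
def call_with_fallback(primary, fallback, payload):
    """Try the primary provider; on any failure, retry once on the fallback.

    `primary` and `fallback` are hypothetical callables standing in for
    provider clients. Returns (result, label of the provider that answered).
    """
    try:
        return primary(payload), "primary"
    except Exception:  # timeout, API error, and so on
        return fallback(payload), "fallback"


def flaky_primary(text):
    """Simulates a provider outage."""
    raise TimeoutError("primary provider timed out")


def stable_fallback(text):
    """Simulates a healthy provider returning synthesized audio."""
    return f"audio:{text}"


result, used = call_with_fallback(flaky_primary, stable_fallback, "Hello")
```

A production version would typically also enforce a timeout on the primary call and log which provider served each request.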

Configuring TTS Fallback

{
  "tts": {
    "provider": "elevenlabs",
    "model": "eleven_turbo_v2_5",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "fallback": {
      "provider": "deepgram",
      "model": "aura-asteria-en"
    }
  }
}

Configuring STT Fallback

{
  "stt": {
    "provider": "deepgram",
    "model": "nova-3",
    "fallback": {
      "provider": "google",
      "model": "telephony"
    }
  }
}
| Primary | Fallback | Reason |
| --- | --- | --- |
| ElevenLabs (TTS) | Deepgram (TTS) | Premium quality primary, fast fallback |
| Deepgram (STT) | Google (STT) | Best accuracy primary, wide language fallback |
| Cartesia (TTS) | Deepgram (TTS) | Ultra-low latency primary, reliable fallback |
| Sarvam (STT) | Google (STT) | Best Indian language primary, multilingual fallback |
Tip: Choose fallback providers that are hosted on different infrastructure than your primary to maximize resilience. For example, pairing a third-party provider with Google Cloud provides good redundancy.

Noise Cancellation

Noise cancellation runs on the audio coming into the agent (the caller’s microphone) before it reaches STT — cleaner input means more accurate transcripts and fewer false interruptions on noisy phone lines. Pick from the Voice → Noise Cancellation dropdown in the agent studio:
| Option | Engine | Latency | Best for |
| --- | --- | --- | --- |
| DTLN (default) | Dual-LSTM denoiser, ~3 MB model | ~8 ms | Lightweight baseline. Strips traffic, hum, AC noise. |
| Hush | DeepFilterNet3 + Auxiliary Separation Head (Weya AI) | ~1 ms / frame | Strongest on-device option. Also suppresses background voices, TV, and music. |
| None | | 0 | Disable when the caller’s environment is already clean (studio, headset). |
Both engines (DTLN and Hush) are self-hosted and run on CPU, so they behave the same on every deployment. None is honored at runtime; before this was fixed, an explicit “off” silently fell back to DTLN.

Hush — install notes

Hush requires a separate model bundle (DeepFilterNet3 weights + the auxiliary speaker-separation ONNX) at $HUSH_MODEL_DIR (default /app/hush_model). Folder structure expected by the wrapper:
hush_model/
├── lib/
├── onnx/
└── model_best.ckpt
Download from huggingface.co/weya-ai/hush and place under that path before starting the agent worker. If the bundle is missing, the loader logs a warning and falls back to DTLN — your agent still gets noise cancellation, just not Hush-grade.
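Before starting the worker, you can check that the bundle is in place. The sketch below is a hypothetical pre-flight check, not the actual loader: it looks for the three entries shown in the folder structure above under $HUSH_MODEL_DIR and mirrors the documented fallback to DTLN when any of them is missing.

```python
import os
from pathlib import Path

# Entries expected inside the bundle, per the folder structure above.
REQUIRED_ENTRIES = ("lib", "onnx", "model_best.ckpt")
DEFAULT_MODEL_DIR = "/app/hush_model"


def resolve_noise_canceller(requested, env=None):
    """Return the engine that would actually run for a requested option.

    Hypothetical helper: "hush" is downgraded to "dtln" when the model
    bundle is incomplete; "dtln" and "none" are returned unchanged.
    """
    env = os.environ if env is None else env
    if requested != "hush":
        return requested
    model_dir = Path(env.get("HUSH_MODEL_DIR", DEFAULT_MODEL_DIR))
    missing = [name for name in REQUIRED_ENTRIES if not (model_dir / name).exists()]
    if missing:
        print(f"warning: Hush bundle incomplete, missing {missing}; using DTLN")
        return "dtln"
    return "hush"


engine = resolve_noise_canceller("hush", {"HUSH_MODEL_DIR": "/nonexistent/hush_model"})
```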

Configuration via API

{
  "voice": {
    "noise_cancellation": "hush"
  }
}
Accepts: dtln | hush | none.

Voice Style Preamble

Every voice agent automatically gets a short “you are speaking, not writing” preamble prepended to its system prompt at runtime. This steers even chat-tuned models (GPT-4o, Sarvam-30B, etc.) toward phone-friendly output:
  • 1–2 short sentences per reply (~30 words)
  • Use contractions (“I’m”, “you’re”)
  • Never use markdown, bullets, headings, code blocks, or symbols
  • Spell out currency, dates, units (“twenty dollars”, not “$20”)
  • Don’t read URLs or email addresses character by character
  • Ask one question at a time
Without this preamble, chat-trained LLMs frequently emit **bold**, bullet lists, or $10 as “dollar sign one zero” on calls. The preamble catches all of that.

Editing the preamble

Open the agent’s Behavior panel — the Voice Style Preamble card sits at the top of the right column. The textarea is pre-populated with the built-in default text; edit any line to customize for your domain (e.g., legal/medical disclaimer scripts), or click Reset to default to revert. Toggle the switch off if your system prompt already encodes voice style constraints — the preamble will be skipped at runtime.

Configuration via API

{
  "hooks": {
    "voice_style_preamble": {
      "enabled": true,
      "text": ""
    }
  }
}
  • enabled: false — skip the preamble entirely
  • text: "" (or omitted) — use the built-in default
  • text: "<custom text>" — use your text instead of the default
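The three rules above can be read as a small resolution function. The sketch below is illustrative only: `DEFAULT_PREAMBLE` is a placeholder for the built-in text (which is not reproduced in this guide), the function names are hypothetical, and treating a missing `enabled` key as true is an assumption based on the preamble being applied automatically.

```python
# Placeholder: the real built-in default text is not shown in this guide.
DEFAULT_PREAMBLE = "<built-in voice style preamble>"


def resolve_preamble(hooks):
    """Return the preamble text to prepend, or None when it is disabled."""
    cfg = hooks.get("voice_style_preamble", {})
    if not cfg.get("enabled", True):  # assumption: enabled when the key is omitted
        return None
    text = cfg.get("text") or ""
    return text if text.strip() else DEFAULT_PREAMBLE


def build_system_prompt(system_prompt, hooks):
    """Prepend the resolved preamble (if any) to the agent's system prompt."""
    preamble = resolve_preamble(hooks)
    if preamble is None:
        return system_prompt
    return f"{preamble}\n\n{system_prompt}"


prompt = build_system_prompt(
    "You are a booking assistant.",
    {"voice_style_preamble": {"enabled": True, "text": ""}},
)
```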

Interruption Handling

Interruption handling determines what happens when the caller speaks while the agent is talking.

Modes

Allow interruptions (default): The agent stops speaking when the caller starts talking. This feels natural — like a real conversation.
{
  "interruption": {
    "enabled": true,
    "sensitivity": "medium"
  }
}
Block interruptions: The agent finishes its response before listening. Use this for critical information that must be delivered completely (e.g., legal disclaimers).
{
  "interruption": {
    "enabled": false
  }
}

Sensitivity Levels

| Level | Description | Best For |
| --- | --- | --- |
| low | Only interrupts on sustained speech (1+ seconds) | Noisy environments |
| medium | Balanced detection | General use |
| high | Interrupts on brief sounds | Quick, responsive conversations |

Per-Message Interruption Control

You can control interruption behavior per message in your agent’s system prompt:
When reading the terms and conditions, do not allow interruptions.
For all other responses, allow normal interruptions.
The agent will dynamically adjust based on what it is saying.

Silent Timeout Settings

Silent timeouts control what happens when the caller stops speaking.
{
  "timeouts": {
    "silence_timeout_seconds": 10,
    "silence_prompt": "Are you still there? I'm happy to help if you have any questions.",
    "max_silence_prompts": 2,
    "disconnect_after_max_silence": true,
    "disconnect_message": "It seems like you may have stepped away. Feel free to call back anytime. Goodbye!"
  }
}
| Parameter | Default | Description |
| --- | --- | --- |
| silence_timeout_seconds | 10 | Seconds of silence before prompting |
| silence_prompt | | What the agent says after silence |
| max_silence_prompts | 2 | How many times to prompt before disconnecting |
| disconnect_after_max_silence | true | Disconnect after max prompts exceeded |
| disconnect_message | | Message before disconnecting |

| Use Case | Timeout | Notes |
| --- | --- | --- |
| Quick transactions | 5-8s | Fast-paced conversations |
| Customer support | 10-15s | Callers may be looking up info |
| Technical support | 15-20s | Callers may be following instructions |
| Sales calls | 8-12s | Keep momentum going |
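One way to read how the timeout parameters interact is as a decision made each time silence_timeout_seconds elapses. The sketch below is an interpretation, not the platform's code: the function name `next_silence_action` is hypothetical, and it assumes the agent prompts up to max_silence_prompts times and disconnects on the next timeout.

```python
def next_silence_action(prompts_sent, timeouts):
    """Decide what happens when the silence timeout elapses again.

    `prompts_sent` is how many silence prompts were already spoken in this
    stretch of silence. Returns (action, text_to_speak).
    """
    max_prompts = timeouts.get("max_silence_prompts", 2)
    if prompts_sent < max_prompts:
        return ("prompt", timeouts.get("silence_prompt", ""))
    if timeouts.get("disconnect_after_max_silence", True):
        return ("disconnect", timeouts.get("disconnect_message", ""))
    return ("wait", "")  # keep the call open without speaking


timeouts = {
    "silence_timeout_seconds": 10,
    "silence_prompt": "Are you still there?",
    "max_silence_prompts": 2,
    "disconnect_after_max_silence": True,
    "disconnect_message": "Feel free to call back anytime. Goodbye!",
}
actions = [next_silence_action(n, timeouts)[0] for n in range(3)]
```

With these settings and continuous silence, the caller hears a prompt at roughly 10 and 20 seconds and is disconnected at roughly 30 seconds.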

Wait Before Speaking

A small pause between user end-of-turn and the agent starting to speak can make responses feel more human. Counterintuitively, instant replies often sound robotic — a 300-500ms “thinking beat” reads as natural conversation. Default is 0 (instant reply, lowest latency); dial up if you want the conversational feel.
| Setting | Default | Range | Effect |
| --- | --- | --- | --- |
| wait_before_speak_seconds | 0 | 0 - 2 seconds | Delay before agent’s reply starts streaming |

When to tune

| Value | Use Case |
| --- | --- |
| 0.0s (default, instant) | Latency-critical bots (IVR, ad-tech, status checks) and most voice agents; feels snappy |
| 0.3-0.5s | More human-feeling conversations; adds a “thinking beat” |
| 0.8-1.5s (deliberate) | Agents handling complex requests where a longer pause reinforces “thinking” |

Configuration via API

{
  "voice_config": {
    "wait_before_speak_seconds": 0.4
  }
}
The wait is measured by LiveKit as on_user_turn_completed_delay in EOU metrics and is visible in production logs for verification.
With preemptive generation enabled (default), the LLM call starts on interim transcripts before the user finishes. The wait_before_speak delay applies between EOU and TTS start — by then the LLM has often already finished, so the wait is the only added latency. Total perceived gap: ~EOU + max(LLM TTFT, wait_before_speak) + TTS TTFB.
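The perceived-gap estimate can be worked through with concrete numbers. The sketch below simply evaluates the formula above in integer milliseconds; the function name and the sample timings are illustrative assumptions, not measurements.

```python
def perceived_gap_ms(eou_ms, llm_ttft_ms, wait_before_speak_ms, tts_ttfb_ms):
    """Estimate the gap between the caller stopping and the agent's audio starting.

    Implements: EOU + max(LLM TTFT, wait_before_speak) + TTS TTFB.
    The LLM call and the wait overlap, so only the longer of the two counts.
    """
    return eou_ms + max(llm_ttft_ms, wait_before_speak_ms) + tts_ttfb_ms


# Fast LLM (250 ms TTFT): the 400 ms wait dominates the middle term.
fast_llm_gap = perceived_gap_ms(300, 250, 400, 150)

# Slow LLM (900 ms TTFT): the wait is fully hidden behind the LLM latency.
slow_llm_gap = perceived_gap_ms(300, 900, 400, 150)
```

In the first case the wait adds 150 ms over an instant reply (850 ms versus 700 ms); in the second it adds nothing.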

Voice Formatting and Pronunciation

Pronunciation Overrides

If your agent mispronounces specific words (brand names, technical terms), add pronunciation overrides:
{
  "pronunciation": {
    "overrides": [
      {
        "word": "thinnestAI",
        "pronunciation": "thin-est A.I."
      },
      {
        "word": "PostgreSQL",
        "pronunciation": "post-gres-Q-L"
      },
      {
        "word": "API",
        "pronunciation": "A.P.I."
      }
    ]
  }
}
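To preview what the TTS engine would receive, the overrides can be applied to a sample sentence. This is a minimal sketch of whole-word substitution; the platform's actual matching rules (case sensitivity, word boundaries) are not documented here, so case-sensitive, whole-word matching is an assumption, and `apply_pronunciation_overrides` is a hypothetical helper.

```python
import re


def apply_pronunciation_overrides(text, overrides):
    """Replace each overridden word with its pronunciation in a single pass.

    Assumes case-sensitive, whole-word matching. A single regex pass means
    replacement text is never re-scanned for further matches.
    """
    mapping = {item["word"]: item["pronunciation"] for item in overrides}
    if not mapping:
        return text
    # Longest words first so a longer entry wins over a shorter one.
    words = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


overrides = [
    {"word": "thinnestAI", "pronunciation": "thin-est A.I."},
    {"word": "PostgreSQL", "pronunciation": "post-gres-Q-L"},
    {"word": "API", "pronunciation": "A.P.I."},
]
spoken = apply_pronunciation_overrides("The thinnestAI API talks to PostgreSQL.", overrides)
```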

Number and Date Formatting

Control how the agent reads numbers and dates:
{
  "formatting": {
    "phone_numbers": "digit_by_digit",
    "currency": "natural",
    "dates": "natural",
    "times": "12_hour"
  }
}
| Setting | Options | Example |
| --- | --- | --- |
| phone_numbers | digit_by_digit, grouped | “4-1-5-5-5-5-1-2-3-4” vs “415-555-1234” |
| currency | natural, formal | “twenty-three fifty” vs “twenty-three dollars and fifty cents” |
| dates | natural, formal | “March fifth” vs “March 5th, 2026” |
| times | 12_hour, 24_hour | “2 PM” vs “14 hundred hours” |

Filler Words and Pauses

Make your agent sound more natural by eliminating awkward silence. thinnestAI supports two types of fillers:

Instant Filler Words

Short, natural sounds that play immediately (~200 ms after the user stops speaking) — before the LLM has even produced a token. The real reply slides in behind the filler as it arrives. This masks the LLM + KB latency that would otherwise be perceived as a 1-3 second silence. Where to find it: Voice Configuration → Advanced tab → Filler Words card. Toggle defaults to on for new agents.
{
  "fillerWordsEnabled": true,
  "fillerWords": ["hmm", "aha", "um", "okay"],
  "fillerWordsMinChars": 10,
  "fillerMinTtftMs": 800
}
  • fillerWordsEnabled — Master toggle. Default true for new agents.
  • fillerWords — Custom filler words. Leave empty to use language-appropriate defaults:
    • English: hmm, uh huh, um, right, okay
    • Hindi: hmm, acchha, haan, theek hai, ji
    • Other: hmm, aha, um, okay
  • fillerWordsMinChars — Skip the filler when the user’s transcript is shorter than this. Default 10. Stops “yes” / “no” replies from getting an unnecessary “hmm” in front.
  • fillerMinTtftMs — Collision guard. Only fire the filler if the previous turn’s LLM TTFT was at least this slow (in ms). Default 800. On a fast turn (TTFT < 800 ms) the real reply would arrive before the filler audio finishes, causing audio overlap even with cross-fade enabled. The first turn always probes (no prior TTFT data to gate on).
The filler runs through session.say() with allow_interruptions=true and add_to_chat_ctx=false — when the real LLM reply arrives, it preempts the filler instead of queueing behind it, and the filler text never reaches the model’s conversation history.
Filler Words apply in both Cascaded and Speech-to-Speech modes. In S2S, the filler is spoken by the cascaded TTS path (it’s a pre-LLM cue, not part of the realtime model’s stream).
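The gating rules for instant fillers can be summarized as one decision per turn. The sketch below restates the documented settings as a function; it is an illustration rather than the runtime's code, and `should_play_filler` is a hypothetical name.

```python
def should_play_filler(transcript, previous_ttft_ms, settings):
    """Decide whether to speak an instant filler for this turn.

    `previous_ttft_ms` is the LLM time-to-first-token of the previous turn,
    or None on the first turn of the call.
    """
    if not settings.get("fillerWordsEnabled", True):
        return False
    # Short replies such as "yes" or "no" do not get a filler.
    if len(transcript.strip()) < settings.get("fillerWordsMinChars", 10):
        return False
    # First turn: there is no TTFT history to gate on, so always probe.
    if previous_ttft_ms is None:
        return True
    # Collision guard: only fire when the last reply was slow enough.
    return previous_ttft_ms >= settings.get("fillerMinTtftMs", 800)


settings = {
    "fillerWordsEnabled": True,
    "fillerWordsMinChars": 10,
    "fillerMinTtftMs": 800,
}
```

For example, a long question after a 1200 ms turn gets a filler, while the same question after a 500 ms turn does not.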

Thinking Phrases

Longer phrases spoken when the agent needs time to process a tool call or complex request:
{
  "silenceFillersEnabled": true,
  "silenceFillerPhrases": [
    "Let me check that for you.",
    "One moment please.",
    "Good question, let me look into that."
  ]
}
These are LLM-driven — the agent decides when to use them based on context.

Turn Detection

How the agent decides the caller has stopped talking. Three modes are exposed under Detection in the voice panel.
| Mode | What it does | When to use |
| --- | --- | --- |
| Ensemble (default) | Fuses VAD + STT + a semantic end-of-utterance model that understands meaning across 23+ languages. Knows the difference between a mid-thought pause and a finished sentence. | Default. Almost always the right choice, especially for multilingual or Indic agents. |
| STT Endpointing | Uses the STT provider’s own end-of-phrase signal. Silence-based, no semantics. Fast. | English-only chitchat where every millisecond counts and the STT (Deepgram, Sarvam) emits a reliable endpointing signal. |
| VAD Only | Raw silence-based detection. Fastest and crudest. | Simple, scripted flows where the caller’s turns are short and predictable. |
The semantic model runs locally inside the bot image — no per-call network round-trip, no third-party API. It falls back automatically to LiveKit’s bundled multilingual model, and then to single-method VAD/STT, if the upgraded model can’t load.

Adaptive EOU Confidence

When you pick Ensemble, an extra slider appears: Adaptive EOU Confidence. This is the threshold the semantic model uses to decide whether the caller has actually finished.
  • Higher (0.85–0.95) — waits for more certainty before ending the turn. Fewer mid-thought cut-offs; the agent is more patient. Good for hesitant callers, technical Q&A, and IVRs that read out long IDs.
  • Lower (0.40–0.60) — snappier. The agent responds faster but is more likely to interrupt a thought-pause. Good for terse, transactional chitchat.
  • 0.70 (default) is balanced and works for most agents.
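To see how the threshold changes behavior, consider the same model output under different settings. The sketch below is a simplified illustration: how the semantic model's probability is actually combined with the other Ensemble signals is not described in this guide, so the single comparison plus the max_endpointing_delay hard cutoff is an assumption, and `should_end_turn` is a hypothetical name.

```python
def should_end_turn(eou_probability, silence_seconds, threshold=0.70, max_endpointing_delay=1.5):
    """Return True when the agent should treat the caller's turn as finished.

    Assumption: the turn ends when the semantic model's end-of-utterance
    probability reaches the threshold, or when silence hits the hard cutoff.
    """
    if silence_seconds >= max_endpointing_delay:
        return True  # hard cutoff regardless of context
    return eou_probability >= threshold


# The model is 75% sure the caller has finished after 0.4 s of silence.
ends_at_default = should_end_turn(0.75, 0.4)                  # default 0.70
ends_when_patient = should_end_turn(0.75, 0.4, threshold=0.90)
ends_at_cutoff = should_end_turn(0.75, 1.5, threshold=0.90)
```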

Smart Interruption

Under Advanced → Interruption Handling, the Smart interruption toggle uses the same semantic model to recognise filler words and incomplete utterances on the fly. When the caller says “uh huh”, “yeah okay”, or trails off mid-word, the agent ignores it and keeps talking — instead of cutting itself off. Strictly suppresses false interruptions; it never causes the agent to miss a real one. On by default. No-ops gracefully if the semantic model isn’t loaded (the bot falls back to the static never-interrupt phrase list).

Proactive Re-engagement

Under Detection → Proactive Re-engagement, an opt-in toggle for one of the most asked-for behaviours in long-form support and onboarding calls. When the caller pauses mid-sentence after a semantically incomplete utterance — “My ticket number is…” (silence) — the agent fires a short, language-aware nudge (“I’m listening, go ahead”) in the caller’s own language, instead of either prematurely answering or sitting silent until the 30-second abandonment poke. Configure:
  • Pause delay (1.5–8.0 s) — how long the caller can pause mid-thought before the agent nudges. Default 3.0 s.
  • Max nudges per pause (1–3) — cap so the agent doesn’t keep prodding. Default 1.
  • Explicit phrases (optional) — leave empty for an LLM-generated nudge in the caller’s language. Adding phrases here locks the wording and is not auto-translated.
Off by default — terse or IVR-style agents usually don’t want it.
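The settings above combine into a simple rule for when a nudge fires. The sketch below illustrates that rule under stated assumptions: the incompleteness flag would come from the semantic model, the function name `should_nudge` is hypothetical, and the range checks reuse the limits listed above.

```python
def should_nudge(utterance_incomplete, pause_seconds, nudges_sent, pause_delay=3.0, max_nudges=1):
    """Return True when the agent should speak a re-engagement nudge.

    `utterance_incomplete` stands in for the semantic model's judgment that
    the caller stopped mid-thought (for example, "My ticket number is...").
    """
    if not 1.5 <= pause_delay <= 8.0:
        raise ValueError("pause_delay must be between 1.5 and 8.0 seconds")
    if not 1 <= max_nudges <= 3:
        raise ValueError("max_nudges must be between 1 and 3")
    return bool(
        utterance_incomplete
        and pause_seconds >= pause_delay
        and nudges_sent < max_nudges
    )


nudge_now = should_nudge(True, pause_seconds=3.2, nudges_sent=0)
```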

Testing Your Configuration

After making changes, test thoroughly:
  1. Use the web call test — Click Test Call in the dashboard to hear your changes immediately.
  2. Test edge cases — Try interrupting, staying silent, speaking quickly, and using unusual words.
  3. Test on a real phone — Web call audio quality differs from phone audio. Always test over a real phone line before going live.
  4. Compare providers — Try the same conversation with different TTS providers to find the best fit.
  5. Get feedback — Have someone unfamiliar with the system test the call and provide honest feedback.

Next Steps

  • Inbound Calls — Apply your voice configuration to inbound call handling
  • Outbound Calls — Use your configured voice for outbound campaigns
  • Call Recording — Record calls to review voice quality over time