Voice Configuration
How your agent sounds defines the caller experience. thinnestAI gives you full control over text-to-speech, speech-to-text, interruption behavior, and conversational timing. This guide covers every setting you can tune.
Choosing a TTS Provider
thinnestAI supports 8 text-to-speech providers — from ultra-low-latency engines for real-time voice to specialized providers for Indian languages. Pick the one that fits your use case.
Sarvam
India-first voice AI. Best-in-class support for Indian languages and accents. If your agents serve Indian customers, Sarvam is the clear choice.
Strengths:
- 10+ Indian languages (Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, Gujarati, Malayalam, Punjabi, Odia)
- Native Indian English accents
- Code-switching support (Hindi-English, Tamil-English, etc.)
- Optimized for Indian telecom networks
Models:
| Model | Languages | Quality | Best For |
|---|---|---|---|
| bulbul:v3 | 10+ Indian languages | Highest (recommended) | Production Indian voice agents |
| bulbul:v3-beta | 10+ Indian languages | High (experimental) | Trying upcoming voices |
| bulbul:v2 | 10+ Indian languages | Stable | Legacy production deployments |
bulbul:v3 is the current default. It accepts three additional knobs that
the older v2 doesn’t:
- temperature (0.01–1.0, default 0.5): the default of 0.5 gives
assistant-style prosody; 0.6 is more expressive but adds prosody jitter
on long turns.
- min_buffer_size (30–200 chars, default 50): how much text to buffer
before the first audio chunk is emitted. Lower values give a faster time
to first audio (TTFA) but more fragmented prosody.
- max_chunk_length (50–500 chars, default 150): splits long sentences
for streaming.
These are surfaced automatically when you pick a bulbul:v3 model; no
extra config is needed.
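If you set these three values programmatically, a range check before saving can catch typos early. The sketch below is a minimal illustration: the `BulbulV3Options` class and its `validate` method are hypothetical helpers, not part of any thinnestAI or Sarvam SDK, and only the ranges and defaults come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BulbulV3Options:
    """Hypothetical container for the three bulbul:v3-only settings."""

    temperature: float = 0.5     # documented range: 0.01-1.0
    min_buffer_size: int = 50    # documented range: 30-200 characters
    max_chunk_length: int = 150  # documented range: 50-500 characters

    def validate(self) -> list[str]:
        """Return readable problems; an empty list means every value is in range."""
        problems = []
        if not 0.01 <= self.temperature <= 1.0:
            problems.append(f"temperature {self.temperature} is outside 0.01-1.0")
        if not 30 <= self.min_buffer_size <= 200:
            problems.append(f"min_buffer_size {self.min_buffer_size} is outside 30-200")
        if not 50 <= self.max_chunk_length <= 500:
            problems.append(f"max_chunk_length {self.max_chunk_length} is outside 50-500")
        return problems


print(BulbulV3Options().validate())                 # defaults pass the check
print(BulbulV3Options(temperature=1.5).validate())  # one problem reported
```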
Configuration:
{
"tts": {
"provider": "sarvam",
"model": "bulbul:v3",
"voice": "shubh",
"language": "hi-IN",
"speed": 1.0
}
}
Popular Sarvam voices:
| Voice | Sarvam version | Language | Description |
|---|---|---|---|
| shubh | v3 (default) | Hindi / Indian English | Warm, conversational male |
| ritu | v3 | Hindi / Indian English | Clear professional female |
| pooja, kavya, ishita, priya | v3 | Hindi / Indian English | Female, varied tone |
| rahul, amit, dev, kabir, aditya | v3 | Hindi / Indian English | Male, varied tone |
| anushka | v2 (default) | Hindi | Female, baseline production voice |
| manisha, vidya, arya | v2 | Hindi | Female alternates |
| abhilash, karun, hitesh | v2 | Hindi | Male alternates |
The full v3 roster is ~25 speakers covering Hindi, Tamil, Telugu, Marathi,
Bengali, English, and code-switched accents. Pick voices in the agent
studio’s TTS picker — switching between v2 and v3 automatically filters to
compatible speakers.
Supported Indian languages:
| Language | Code | Code-Switching |
|---|---|---|
| Hindi | hi-IN | Hindi-English |
| Tamil | ta-IN | Tamil-English |
| Telugu | te-IN | Telugu-English |
| Kannada | kn-IN | Kannada-English |
| Bengali | bn-IN | Bengali-English |
| Marathi | mr-IN | Marathi-English |
| Gujarati | gu-IN | — |
| Malayalam | ml-IN | — |
| Punjabi | pa-IN | — |
| Odia | or-IN | — |
Cartesia
Blazing-fast, high-quality voices with fine-grained emotion control. Great for expressive, dynamic conversations.
Strengths:
- Sub-100ms latency
- Emotion and speed control mid-sentence
- Streaming word-level timestamps
- Multilingual support
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| sonic-2 | ~80ms | Highest | Expressive voice agents |
| sonic-2-turbo | ~50ms | High | Ultra-low latency |
| sonic-multilingual | ~100ms | High | Non-English deployments |
Configuration:
{
"tts": {
"provider": "cartesia",
"model": "sonic-2",
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
"output_format": "pcm_16000",
"speed": "normal",
"emotion": ["positivity:high", "curiosity:medium"]
}
}
Deepgram TTS
Fast, affordable text-to-speech from the leaders in speech AI. Good balance of quality and cost.
Strengths:
- Very low latency
- Simple API
- Competitive pricing
- Good for high-volume use cases
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| aura-asteria-en | ~100ms | High | Female English voice |
| aura-orion-en | ~100ms | High | Male English voice |
| aura-luna-en | ~100ms | High | Soft female English |
| aura-arcas-en | ~100ms | High | Authoritative male |
Configuration:
{
"tts": {
"provider": "deepgram",
"model": "aura-asteria-en"
}
}
ElevenLabs
The most natural-sounding voices with ultra-realistic intonation. Best for customer-facing agents where voice quality is the top priority.
Strengths:
- Ultra-realistic voices with natural emotion
- Voice cloning (use your own voice)
- 29+ languages
- Extensive voice library
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| eleven_turbo_v2_5 | ~150ms | Highest | Production voice agents |
| eleven_multilingual_v2 | ~200ms | Highest | Multilingual deployments |
| eleven_flash_v2_5 | ~100ms | High | Low-latency needs |
Configuration:
{
"tts": {
"provider": "elevenlabs",
"voice_id": "21m00Tcm4TlvDq8ikWAM",
"model": "eleven_turbo_v2_5",
"stability": 0.5,
"similarity_boost": 0.75
}
}
Popular ElevenLabs voices:
| Voice | ID | Description |
|---|---|---|
| Rachel | 21m00Tcm4TlvDq8ikWAM | Calm, professional female |
| Adam | pNInz6obpgDQGcFmaJgB | Deep, authoritative male |
| Bella | EXAVITQu4vr4xnSDxMaL | Warm, friendly female |
| Antoni | ErXwobaYiN019PkySvjV | Smooth, conversational male |
Rime
High-quality, low-latency voices with a focus on accuracy and consistency.
Strengths:
- Consistent voice quality across calls
- Good pronunciation accuracy
- Fast inference
- Simple integration
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| mist | ~120ms | High | General voice agents |
| v1 | ~150ms | Good | Cost-optimized use |
Configuration:
{
"tts": {
"provider": "rime",
"model": "mist",
"voice": "cove",
"speed": 1.0
}
}
Inworld
Specialized for character-driven, immersive voice experiences. Great for agents with strong personas.
Strengths:
- Character-consistent voices
- Emotional expressiveness
- Real-time voice modulation
- Persona-driven design
Models:
| Model | Latency | Quality | Best For |
|---|---|---|---|
| inworld-v2 | ~130ms | High | Character-driven agents |
| inworld-v1 | ~160ms | Good | Standard agents |
Configuration:
{
"tts": {
"provider": "inworld",
"model": "inworld-v2",
"voice": "default",
"emotion": "friendly"
}
}
Google Cloud TTS
Wide language support with consistent quality. Best for multilingual deployments.
Strengths:
- 220+ voices across 40+ languages
- Neural2 and Studio voices for premium quality
- SSML support
- Predictable pricing
Models:
| Model | Quality | Best For |
|---|---|---|
| Neural2 | Highest | Production use |
| Studio | Highest | Premium applications |
| Standard | Good | Cost-optimized |
Configuration:
{
"tts": {
"provider": "google",
"voice_name": "en-US-Neural2-F",
"language_code": "en-US",
"speaking_rate": 1.0,
"pitch": 0.0
}
}
Recommended Google voices:
| Voice | Language | Description |
|---|---|---|
| en-US-Neural2-F | English (US) | Natural female |
| en-US-Neural2-D | English (US) | Natural male |
| en-GB-Neural2-A | English (UK) | British female |
| es-ES-Neural2-A | Spanish | Female |
| hi-IN-Neural2-A | Hindi | Female |
Azure Speech
Enterprise-grade with fine-grained control. Best for regulated industries.
Strengths:
- High-quality neural voices
- Custom neural voice training
- SSML with extensive control
- Strong enterprise compliance
Models:
| Model | Quality | Best For |
|---|---|---|
| Neural | Highest | Production use |
| Custom Neural | Highest | Brand-specific voice |
Configuration:
{
"tts": {
"provider": "azure",
"voice_name": "en-US-JennyNeural",
"style": "friendly",
"speaking_rate": "1.0"
}
}
Provider Comparison
| Provider | Latency | Languages | Indian Languages | Best For |
|---|---|---|---|---|
| Sarvam | ~100ms | 10+ Indian | Yes (best) | Indian market |
| Cartesia (recommended) | ~50-100ms | 10+ | No | Expressive conversations |
| Deepgram | ~100ms | English | No | High-volume, cost-effective |
| ElevenLabs | ~100-200ms | 29+ | No | Premium voice quality |
| Rime | ~120ms | English+ | No | Consistent, reliable |
| Inworld | ~130ms | English+ | No | Character-driven agents |
| Google | ~150ms | 40+ | Hindi, Bengali, Tamil | Multilingual |
| Azure | ~150ms | 60+ | Hindi, Tamil, Telugu | Enterprise, compliance |
Voice Selection and Customization
Selecting a Voice in the Dashboard
- Open your voice agent configuration.
- Go to the Voice tab.
- Select a TTS provider.
- Browse available voices and click Preview to hear samples.
- Adjust settings (speed, pitch, stability) with the sliders.
- Click Save.
Voice Selection via API
curl -X PATCH "https://api.thinnest.ai/agents/agent_xyz" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"voice_config": {
"tts_provider": "elevenlabs",
"tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
"tts_model": "eleven_turbo_v2_5",
"tts_stability": 0.5,
"tts_similarity_boost": 0.75
}
}'
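For scripts that manage several agents, the same PATCH call can be built in Python. This is a minimal sketch using only the standard library; the endpoint, header names, and field names are copied from the curl example above, while the `build_voice_patch` helper itself is a hypothetical name introduced for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.thinnest.ai"  # base URL taken from the curl example


def build_voice_patch(agent_id: str, api_key: str, voice_config: dict) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that updates an agent's voice_config."""
    body = json.dumps({"voice_config": voice_config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/agents/{agent_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_voice_patch(
    "agent_xyz",
    "YOUR_API_KEY",
    {
        "tts_provider": "elevenlabs",
        "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
        "tts_model": "eleven_turbo_v2_5",
        "tts_stability": 0.5,
        "tts_similarity_boost": 0.75,
    },
)
print(request.get_method(), request.full_url)
# To send it for real (requires a valid API key):
#   with urllib.request.urlopen(request) as response: print(response.status)
```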
Matching Voice to Use Case
| Use Case | Recommended Provider | Voice Style |
|---|---|---|
| General voice agents | Cartesia | Warm, professional |
| Indian market | Sarvam | Native language voices |
| Customer support | ElevenLabs or Cartesia | Warm, patient, professional |
| Sales calls | Cartesia or ElevenLabs | Energetic, confident |
| High-volume campaigns | Deepgram | Clear, cost-effective |
| Appointment reminders | Deepgram | Neutral, clear |
| Medical/legal | Azure | Professional, calm |
| Multilingual support | Google or Sarvam | Language-specific voices |
| Character-driven agents | Inworld | Persona-consistent |
| Brand voice | ElevenLabs (clone) | Your custom voice |
Speech-to-Text (STT) Configuration
Speech-to-text converts the caller’s audio into text for your AI agent to process. thinnestAI supports multiple STT providers with native support for Indian and global languages.
Deepgram (Default)
Industry-leading speech recognition with the best accuracy and lowest latency. Our default and recommended STT provider.
Models:
| Model | Accuracy | Latency | Best For |
|---|---|---|---|
| flux-general-en | High | ~200ms + native turn detection | Voice agents (English); fastest end-to-end with built-in EOU |
| nova-3 | Highest | ~100ms | Production voice agents (multilingual) |
| nova-2 | High | ~120ms | General use |
| nova-2-phonecall | High | ~120ms | Phone call audio (optimized) |
| nova-2-conversationalai | High | ~120ms | Conversational agents |
| enhanced | Good | ~150ms | Cost-optimized |
| base | Standard | ~100ms | High-volume, budget |
Flux models use Deepgram’s v2 streaming API with native turn detection: the model itself decides when the user has stopped speaking, eliminating the need for a separate VAD silence timer. Pair with Turn Detection Mode → STT Endpointing for the lowest end-to-end latency (~400ms total from user end-of-speech to agent reply).
When Flux is selected, the interim_results, punctuate, and language UI settings are managed by Flux internally and are not forwarded to the STT plugin.
{
"stt": {
"provider": "deepgram",
"model": "nova-3",
"language": "en-US",
"punctuate": true,
"smart_format": true,
"profanity_filter": false
}
}
Sarvam STT
Best-in-class for Indian languages. If your callers speak Hindi, Tamil, Telugu, or other Indian languages, Sarvam delivers the highest accuracy.
Models:
| Model | Languages | Best For |
|---|---|---|
| saarika:v2 | 10+ Indian languages | Production Indian voice agents |
| saarika:v2-turbo | 10+ Indian languages | Low-latency Indian deployments |
{
"stt": {
"provider": "sarvam",
"model": "saarika:v2",
"language": "hi-IN"
}
}
Supports code-switching (e.g., Hindi-English mixed speech) out of the box.
Google Speech-to-Text
Strong multilingual support with good accuracy across 125+ languages. Includes a built-in denoiser that filters background noise for cleaner transcriptions in noisy environments like call centers or outdoor calls.
Models:
| Model | Accuracy | Best For |
|---|---|---|
| latest_long | High | Long conversations |
| latest_short | High | Short utterances |
| telephony | High | Phone call audio |
| medical_conversation | High | Healthcare |
{
"stt": {
"provider": "google",
"model": "telephony",
"language": "en-US",
"denoiser": true
}
}
Set "denoiser": true to enable automatic background noise filtering. This is especially useful for phone call audio where ambient noise can degrade transcription accuracy.
Azure Speech-to-Text
Enterprise-grade with custom model training. Best for regulated industries.
Models:
| Model | Accuracy | Best For |
|---|---|---|
| default | High | General use |
| custom | Highest | Domain-specific vocabulary |
| conversation | High | Multi-speaker calls |
{
"stt": {
"provider": "azure",
"model": "default",
"language": "en-US"
}
}
Cartesia STT
Real-time transcription with word-level timestamps.
{
"stt": {
"provider": "cartesia",
"model": "ink",
"language": "en"
}
}
AssemblyAI
High-accuracy transcription with built-in speaker diarization and entity detection.
Models:
| Model | Accuracy | Best For |
|---|---|---|
| best | Highest | Maximum accuracy |
| nano | Good | Cost-optimized, low latency |
{
"stt": {
"provider": "assemblyai",
"model": "best",
"language": "en"
}
}
STT Provider Comparison
| Provider | Languages | Indian Languages | Self-Hosted | Best For |
|---|---|---|---|---|
| Deepgram (default) | 30+ | Hindi | No | General voice agents |
| Sarvam | 10+ Indian | 10+ Indian | No | Indian languages |
| Google | 125+ | Hindi, Tamil, Telugu+ | No | Multilingual |
| Azure | 100+ | Hindi, Tamil, Telugu+ | No | Enterprise, custom models |
| Cartesia | 10+ | No | No | Timestamps, real-time |
| AssemblyAI | 20+ | No | No | Maximum accuracy |
Multi-Language Support
For agents that handle multiple languages:
{
"stt": {
"language": "multi",
"detect_language": true,
"supported_languages": ["en", "es", "fr", "hi", "ta"]
}
}
The agent will detect the caller’s language and respond accordingly.
Turn Detection Mode
Turn detection decides when the user has stopped speaking so the agent can take its turn. Three modes are available, each with different latency and accuracy trade-offs.
| Mode | id | Mechanism | Best For |
|---|---|---|---|
| Ensemble | turn_detector | Combines silence timing, the live transcript, and an AI end-of-turn model — tells a mid-thought pause apart from a finished question | Recommended default — most accurate across languages, including callers who pause mid-sentence |
| STT Endpointing | stt_endpointing | The STT provider’s own end-of-utterance signal | Lowest latency with a provider that has a strong native signal (Deepgram Flux, Sarvam Saaras) |
| VAD Only | vad_only | Silence-based detection only — fires after min_silence_duration of silence | Fastest possible turn-end; short, simple utterances |
Ensemble is the default for new agents. Rather than relying on a single signal, it reads what the caller said alongside how long they paused — so it waits through a natural mid-sentence pause (“I’d like to know about…”) and ends the turn promptly on a complete thought (“…your pricing”).
Your choice is respected
Turn Detection Mode is exactly what you set on the Detection tab. Choosing a speech-to-text provider no longer overrides it — the Ensemble model layers on top of any provider, so there’s no need to force a single-signal mode for Sarvam or Deepgram Flux.
If you want the lowest possible latency and your provider has a strong native end-of-utterance signal (Deepgram Flux, Sarvam Saaras), pick STT Endpointing explicitly.
Configuration via API
{
"voice_config": {
"turn_detection_mode": "stt_endpointing",
"min_silence_duration": 0.3,
"min_speech_duration": 0.1,
"min_endpointing_delay": 300,
"max_endpointing_delay": 1.5
}
}
| Field | Type | Default | Notes |
|---|---|---|---|
| turn_detection_mode | string | "turn_detector" | One of turn_detector, stt_endpointing, vad_only |
| min_silence_duration | float (seconds) | 0.3 | VAD silence threshold before declaring speech ended |
| min_speech_duration | float (seconds) | 0.1 | Minimum sound duration to count as valid speech (filters short noises) |
| min_endpointing_delay | int (ms) | 300 | Buffer added after silence detection before processing |
| max_endpointing_delay | float (seconds) | 1.5 | Hard cutoff for turn end regardless of context |
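Note that the two endpointing fields use different units in the table above: min_endpointing_delay is in milliseconds, while max_endpointing_delay is in seconds. The sketch below is a hypothetical sanity check (the `endpointing_window_seconds` helper is not a platform API) that converts both to seconds and rejects a minimum that exceeds the hard cutoff, which is an easy mistake to make when the units are mixed.

```python
def endpointing_window_seconds(voice_config: dict) -> tuple[float, float]:
    """Return (min_delay, max_delay) in seconds, falling back to the documented defaults."""
    min_delay_s = voice_config.get("min_endpointing_delay", 300) / 1000.0  # ms -> s
    max_delay_s = float(voice_config.get("max_endpointing_delay", 1.5))    # already in s
    if min_delay_s > max_delay_s:
        raise ValueError(
            f"min_endpointing_delay ({min_delay_s}s) exceeds "
            f"max_endpointing_delay ({max_delay_s}s); check the units"
        )
    return min_delay_s, max_delay_s


config = {
    "turn_detection_mode": "stt_endpointing",
    "min_endpointing_delay": 300,  # milliseconds
    "max_endpointing_delay": 1.5,  # seconds
}
print(endpointing_window_seconds(config))
```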
STT/TTS Fallback
For production voice agents, you can configure fallback providers that automatically take over if the primary provider fails or times out. This ensures your agent stays responsive even during provider outages.
How Fallback Works
- The agent sends audio to the primary provider.
- If the primary provider fails (timeout, error, or degraded quality), the system automatically switches to the fallback provider.
- The switch is seamless — callers experience a brief pause at most, not a dropped call.
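The switching logic can be pictured as a simple try-then-fall-back wrapper. The sketch below is illustrative only: `Synthesizer`, `synthesize_with_fallback`, and the two stub providers are hypothetical names, and it covers only hard failures (exceptions such as timeouts), not the degraded-quality detection the platform performs internally.

```python
from typing import Callable

# Hypothetical provider signature: text in, audio bytes out.
Synthesizer = Callable[[str], bytes]


def synthesize_with_fallback(
    text: str, primary: Synthesizer, fallback: Synthesizer
) -> tuple[bytes, str]:
    """Try the primary provider; on any error, retry once with the fallback."""
    try:
        return primary(text), "primary"
    except Exception:
        # Timeout, HTTP error, etc. The caller hears a brief pause, not a dropped call.
        return fallback(text), "fallback"


def flaky_primary(text: str) -> bytes:
    """Stub that simulates a provider outage."""
    raise TimeoutError("primary provider timed out")


def steady_fallback(text: str) -> bytes:
    """Stub that 'synthesizes' by returning the encoded text."""
    return text.encode("utf-8")


audio, used = synthesize_with_fallback("hello", flaky_primary, steady_fallback)
print(used)  # prints "fallback"
```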
Configuring TTS Fallback
{
"tts": {
"provider": "elevenlabs",
"model": "eleven_turbo_v2_5",
"voice_id": "21m00Tcm4TlvDq8ikWAM",
"fallback": {
"provider": "deepgram",
"model": "aura-asteria-en"
}
}
}
Configuring STT Fallback
{
"stt": {
"provider": "deepgram",
"model": "nova-3",
"fallback": {
"provider": "google",
"model": "telephony"
}
}
}
Recommended Fallback Pairs
| Primary | Fallback | Reason |
|---|---|---|
| ElevenLabs (TTS) | Deepgram (TTS) | Premium quality primary, fast fallback |
| Deepgram (STT) | Google (STT) | Best accuracy primary, wide language fallback |
| Cartesia (TTS) | Deepgram (TTS) | Ultra-low latency primary, reliable fallback |
| Sarvam (STT) | Google (STT) | Best Indian language primary, multilingual fallback |
Tip: Choose fallback providers that are hosted on different infrastructure than your primary to maximize resilience. For example, pairing a third-party provider with Google Cloud provides good redundancy.
Noise Cancellation
Noise cancellation runs on the audio coming into the agent (the
caller’s microphone) before it reaches STT — cleaner input means more
accurate transcripts and fewer false interruptions on noisy phone lines.
Pick from the Voice → Noise Cancellation dropdown in the agent studio:
| Option | Engine | Latency | Best for |
|---|---|---|---|
| DTLN (default) | Dual-LSTM denoiser, ~3 MB model | ~8 ms | Lightweight baseline. Strips traffic, hum, AC noise. |
| Hush | DeepFilterNet3 + Auxiliary Separation Head (Weya AI) | ~1 ms / frame | Strongest on-device option. Also suppresses background voices, TV, and music. |
| None | — | 0 | Disable when the caller’s environment is already clean (studio, headset). |
Both engines are self-hosted (CPU), so they work the same on every
deployment. Selecting None is honored at runtime; in earlier versions,
an explicit “off” silently fell back to DTLN.
Hush — install notes
Hush requires a separate model bundle (DeepFilterNet3 weights + the
auxiliary speaker-separation ONNX) at $HUSH_MODEL_DIR (default
/app/hush_model). Folder structure expected by the wrapper:
hush_model/
├── lib/
├── onnx/
└── model_best.ckpt
Download from huggingface.co/weya-ai/hush and place under that path
before starting the agent worker. If the bundle is missing, the loader
logs a warning and falls back to DTLN — your agent still gets noise
cancellation, just not Hush-grade.
Configuration via API
{
"voice": {
"noise_cancellation": "hush"
}
}
Accepts: dtln | hush | none.
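The selection and fallback rules described in this section can be summarized in a few lines. This is a hedged sketch, not the actual loader: `resolve_noise_engine` is a hypothetical helper, and detecting the bundle by looking for model_best.ckpt is an assumption based on the folder structure shown earlier.

```python
import logging
import os
from typing import Optional

VALID_OPTIONS = {"dtln", "hush", "none"}


def resolve_noise_engine(option: str, hush_model_dir: Optional[str] = None) -> Optional[str]:
    """Return the engine that would run: "dtln", "hush", or None when disabled."""
    option = option.lower()
    if option not in VALID_OPTIONS:
        raise ValueError(f"noise_cancellation must be one of {sorted(VALID_OPTIONS)}")
    if option == "none":
        return None  # an explicit "off" is honored; no silent fallback to DTLN
    if option == "hush":
        model_dir = hush_model_dir or os.environ.get("HUSH_MODEL_DIR", "/app/hush_model")
        # Assumption: the checkpoint file is a good proxy for "bundle is installed".
        if not os.path.isfile(os.path.join(model_dir, "model_best.ckpt")):
            logging.warning("Hush bundle not found in %s; falling back to DTLN", model_dir)
            return "dtln"
    return option


print(resolve_noise_engine("none"))
print(resolve_noise_engine("hush", "/path/without/hush/bundle"))
```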
Voice Style Preamble
Every voice agent automatically gets a short “you are speaking, not
writing” preamble prepended to its system prompt at runtime. This steers
even chat-tuned models (GPT-4o, Sarvam-30B, etc.) toward phone-friendly
output:
- 1–2 short sentences per reply (~30 words)
- Use contractions (“I’m”, “you’re”)
- Never use markdown, bullets, headings, code blocks, or symbols
- Spell out currency, dates, units (“twenty dollars”, not “$20”)
- Don’t read URLs or email addresses character by character
- Ask one question at a time
Without this preamble, chat-trained LLMs frequently emit **bold**,
bullet lists, or $10 as “dollar sign one zero” on calls. The preamble
catches all of that.
Editing the preamble
Open the agent’s Behavior panel — the Voice Style Preamble card
sits at the top of the right column. The textarea is pre-populated with
the built-in default text; edit any line to customize for your domain
(e.g., legal/medical disclaimer scripts), or click Reset to default
to revert.
Toggle the switch off if your system prompt already encodes voice style
constraints — the preamble will be skipped at runtime.
Configuration via API
{
"hooks": {
"voice_style_preamble": {
"enabled": true,
"text": ""
}
}
}
enabled: false — skip the preamble entirely
text: "" (or omitted) — use the built-in default
text: "<custom text>" — use your text instead of the default
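The three cases above reduce to a small resolution rule. The sketch below shows one way to express it; `resolve_preamble` and the placeholder `DEFAULT_PREAMBLE` text are hypothetical (the real built-in text is longer), and treating a missing `enabled` key as true is an assumption based on the preamble being applied automatically.

```python
from typing import Optional

# Placeholder only: stands in for the platform's built-in default text.
DEFAULT_PREAMBLE = "You are speaking, not writing."


def resolve_preamble(hook: dict, default_text: str = DEFAULT_PREAMBLE) -> Optional[str]:
    """Return the preamble to prepend, or None when the hook is disabled."""
    if not hook.get("enabled", True):  # assumption: enabled unless set to false
        return None
    custom = hook.get("text") or ""    # empty or omitted text means "use the default"
    return custom if custom else default_text


print(resolve_preamble({"enabled": True, "text": ""}))
print(resolve_preamble({"enabled": True, "text": "Keep answers under two sentences."}))
print(resolve_preamble({"enabled": False}))
```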
Interruption Handling
Interruption handling determines what happens when the caller speaks while the agent is talking.
Modes
Allow interruptions (default):
The agent stops speaking when the caller starts talking. This feels natural — like a real conversation.
{
"interruption": {
"enabled": true,
"sensitivity": "medium"
}
}
Block interruptions:
The agent finishes its response before listening. Use this for critical information that must be delivered completely (e.g., legal disclaimers).
{
"interruption": {
"enabled": false
}
}
Sensitivity Levels
| Level | Description | Best For |
|---|---|---|
| low | Only interrupts on sustained speech (1+ seconds) | Noisy environments |
| medium | Balanced detection | General use |
| high | Interrupts on brief sounds | Quick, responsive conversations |
Per-Message Interruption Control
You can control interruption behavior per message in your agent’s system prompt:
When reading the terms and conditions, do not allow interruptions.
For all other responses, allow normal interruptions.
The agent will dynamically adjust based on what it is saying.
Silent Timeout Settings
Silent timeouts control what happens when the caller stops speaking.
{
"timeouts": {
"silence_timeout_seconds": 10,
"silence_prompt": "Are you still there? I'm happy to help if you have any questions.",
"max_silence_prompts": 2,
"disconnect_after_max_silence": true,
"disconnect_message": "It seems like you may have stepped away. Feel free to call back anytime. Goodbye!"
}
}
| Parameter | Default | Description |
|---|---|---|
| silence_timeout_seconds | 10 | Seconds of silence before prompting |
| silence_prompt | (none) | What the agent says after silence |
| max_silence_prompts | 2 | How many times to prompt before disconnecting |
| disconnect_after_max_silence | true | Disconnect after max prompts exceeded |
| disconnect_message | (none) | Message before disconnecting |
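To see how these parameters interact, it can help to lay out the timeline for a caller who never speaks. The sketch below is a rough model under stated assumptions: `silence_events` is a hypothetical helper, the silence timer is assumed to restart after each prompt, and the time the agent spends speaking the prompt is ignored.

```python
def silence_events(timeouts: dict) -> list[tuple[int, str]]:
    """List (seconds of accumulated silence, action) pairs for a silent caller."""
    step = timeouts.get("silence_timeout_seconds", 10)
    max_prompts = timeouts.get("max_silence_prompts", 2)
    # One prompt per elapsed timeout, up to the configured maximum.
    events = [(step * n, "prompt") for n in range(1, max_prompts + 1)]
    # One more timeout after the last prompt triggers the disconnect, if enabled.
    if timeouts.get("disconnect_after_max_silence", True):
        events.append((step * (max_prompts + 1), "disconnect"))
    return events


print(silence_events({"silence_timeout_seconds": 10, "max_silence_prompts": 2}))
# prints [(10, 'prompt'), (20, 'prompt'), (30, 'disconnect')]
```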
Recommended Timeout Settings
| Use Case | Timeout | Notes |
|---|---|---|
| Quick transactions | 5-8s | Fast-paced conversations |
| Customer support | 10-15s | Callers may be looking up info |
| Technical support | 15-20s | Callers may be following instructions |
| Sales calls | 8-12s | Keep momentum going |
Wait Before Speaking
A small pause between user end-of-turn and the agent starting to speak can make responses feel more human. Counterintuitively, instant replies often sound robotic — a 300-500ms “thinking beat” reads as natural conversation. Default is 0 (instant reply, lowest latency); dial up if you want the conversational feel.
| Setting | Default | Range | Effect |
|---|---|---|---|
| wait_before_speak_seconds | 0 | 0–2 seconds | Delay before agent’s reply starts streaming |
When to tune
| Value | Use Case |
|---|---|
| 0.0s (default, instant) | Most voice agents, which should feel snappy; latency-critical bots (IVR, ad-tech, status checks) |
| 0.3-0.5s | More human-feeling conversations; adds a “thinking beat” |
| 0.8-1.5s (deliberate) | Agents handling complex requests where a longer pause reinforces “thinking” |
Configuration via API
{
"voice_config": {
"wait_before_speak_seconds": 0.4
}
}
The wait is measured by LiveKit as on_user_turn_completed_delay in EOU metrics, which is visible in production logs for verification.
With preemptive generation enabled (default), the LLM call starts on interim transcripts before the user finishes. The wait_before_speak delay applies between EOU and TTS start — by then the LLM has often already finished, so the wait is the only added latency. Total perceived gap: ~EOU + max(LLM TTFT, wait_before_speak) + TTS TTFB.
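The perceived-gap formula is easy to check with concrete numbers. The sketch below simply evaluates it; `perceived_gap_ms` is a hypothetical helper and the sample latencies are made-up illustrative values, not measurements.

```python
def perceived_gap_ms(
    eou_ms: float, llm_ttft_ms: float, wait_before_speak_s: float, tts_ttfb_ms: float
) -> float:
    """Evaluate EOU + max(LLM TTFT, wait_before_speak) + TTS TTFB, all in milliseconds."""
    wait_ms = wait_before_speak_s * 1000.0
    # With preemptive generation, the LLM runs during the wait, so only the
    # slower of the two contributes to the gap.
    return eou_ms + max(llm_ttft_ms, wait_ms) + tts_ttfb_ms


# Illustrative values: 300 ms EOU, 350 ms LLM TTFT, 100 ms TTS TTFB.
print(perceived_gap_ms(300, 350, 0.0, 100))  # prints 750.0 (LLM TTFT dominates)
print(perceived_gap_ms(300, 350, 0.5, 100))  # prints 900.0 (the wait dominates)
```

In the second case the 0.5 s wait adds only 150 ms over the instant setting, because the first 350 ms overlap with LLM generation.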
Pronunciation Overrides
If your agent mispronounces specific words (brand names, technical terms), add pronunciation overrides:
{
"pronunciation": {
"overrides": [
{
"word": "thinnestAI",
"pronunciation": "thin-est A.I."
},
{
"word": "PostgreSQL",
"pronunciation": "post-gres-Q-L"
},
{
"word": "API",
"pronunciation": "A.P.I."
}
]
}
}
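Conceptually, an override list is a word-level substitution applied to the text before it reaches TTS. The sketch below shows one plausible way to do that; `apply_overrides` is a hypothetical helper, and the case-sensitive whole-word matching is an assumption, since the platform's exact matching rules are not described here.

```python
import re

OVERRIDES = [
    {"word": "thinnestAI", "pronunciation": "thin-est A.I."},
    {"word": "PostgreSQL", "pronunciation": "post-gres-Q-L"},
    {"word": "API", "pronunciation": "A.P.I."},
]


def apply_overrides(text: str, overrides: list[dict]) -> str:
    """Replace each whole-word match with its pronunciation in a single pass."""
    mapping = {item["word"]: item["pronunciation"] for item in overrides}
    if not mapping:
        return text
    # Longest entries first so a longer word wins over a shorter one it contains.
    words = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    # A single pass means replacement text is never re-scanned for further matches.
    return pattern.sub(lambda match: mapping[match.group(1)], text)


print(apply_overrides("The API runs on PostgreSQL", OVERRIDES))
# prints "The A.P.I. runs on post-gres-Q-L"
```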
Control how the agent reads numbers and dates:
{
"formatting": {
"phone_numbers": "digit_by_digit",
"currency": "natural",
"dates": "natural",
"times": "12_hour"
}
}
| Setting | Options | Example |
|---|---|---|
| phone_numbers | digit_by_digit, grouped | “4-1-5-5-5-5-1-2-3-4” vs “415-555-1234” |
| currency | natural, formal | “twenty-three fifty” vs “twenty-three dollars and fifty cents” |
| dates | natural, formal | “March fifth” vs “March 5th, 2026” |
| times | 12_hour, 24_hour | “2 PM” vs “14 hundred hours” |
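As an illustration of the phone_numbers options, the sketch below reproduces the two example renderings from the table. `speak_phone_number` is a hypothetical helper, and the 3-3-4 grouping assumes a 10-digit North American number; how the platform groups other formats is not specified here.

```python
def speak_phone_number(raw: str, style: str) -> str:
    """Render a phone number the way the TTS input would be grouped."""
    digits = "".join(ch for ch in raw if ch.isdigit())  # drop spaces, dashes, parentheses
    if style == "digit_by_digit":
        return "-".join(digits)
    if style == "grouped":
        # Assumption: 10-digit number grouped as 3-3-4.
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    raise ValueError(f"unknown phone_numbers style: {style!r}")


print(speak_phone_number("(415) 555-1234", "digit_by_digit"))  # prints 4-1-5-5-5-5-1-2-3-4
print(speak_phone_number("(415) 555-1234", "grouped"))         # prints 415-555-1234
```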
Filler Words and Pauses
Make your agent sound more natural by eliminating awkward silence. thinnestAI supports two types of fillers:
Instant Filler Words
Short, natural sounds that play immediately (~200 ms after the user stops speaking) — before the LLM has even produced a token. The real reply slides in behind the filler as it arrives. This masks the LLM + KB latency that would otherwise be perceived as a 1-3 second silence.
Where to find it: Voice Configuration → Advanced tab → Filler Words card. Toggle defaults to on for new agents.
{
"fillerWordsEnabled": true,
"fillerWords": ["hmm", "aha", "um", "okay"],
"fillerWordsMinChars": 10,
"fillerMinTtftMs": 800
}
fillerWordsEnabled — Master toggle. Default true for new agents.
fillerWords — Custom filler words. Leave empty to use language-appropriate defaults:
- English: hmm, uh huh, um, right, okay
- Hindi: hmm, acchha, haan, theek hai, ji
- Other: hmm, aha, um, okay
fillerWordsMinChars — Skip the filler when the user’s transcript is shorter than this. Default 10. Stops “yes” / “no” replies from getting an unnecessary “hmm” in front.
fillerMinTtftMs — Collision guard. Only fire the filler if the previous turn’s LLM TTFT was at least this slow (in ms). Default 800. On a fast turn (TTFT < 800 ms) the real reply would arrive before the filler audio finishes, causing audio overlap even with cross-fade enabled. The first turn always probes (no prior TTFT data to gate on).
The filler runs through session.say() with allow_interruptions=true and add_to_chat_ctx=false — when the real LLM reply arrives, it preempts the filler instead of queueing behind it, and the filler text never reaches the model’s conversation history.
Filler Words apply in both Cascaded and Speech-to-Speech modes. In S2S, the filler is spoken by the cascaded TTS path (it’s a pre-LLM cue, not part of the realtime model’s stream).
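The gating rules for instant fillers can be condensed into a single predicate. This is a minimal sketch of the behavior described above, not the runtime's actual code: `should_play_filler` is a hypothetical name, and only the config keys and defaults come from this page.

```python
from typing import Optional


def should_play_filler(transcript: str, previous_ttft_ms: Optional[float], config: dict) -> bool:
    """Decide whether to speak a filler before the LLM reply arrives."""
    if not config.get("fillerWordsEnabled", True):
        return False
    # Short replies such as "yes" or "no" do not need a filler in front.
    if len(transcript.strip()) < config.get("fillerWordsMinChars", 10):
        return False
    # First turn: no TTFT history yet, so the filler always probes.
    if previous_ttft_ms is None:
        return True
    # Collision guard: only fire when the previous turn's LLM was slow enough.
    return previous_ttft_ms >= config.get("fillerMinTtftMs", 800)


config = {"fillerWordsEnabled": True, "fillerWordsMinChars": 10, "fillerMinTtftMs": 800}
print(should_play_filler("yes", 1200, config))                        # prints False
print(should_play_filler("what are your opening hours", 500, config)) # prints False
print(should_play_filler("what are your opening hours", 900, config)) # prints True
```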
Thinking Phrases
Longer phrases spoken when the agent needs time to process a tool call or complex request:
{
"silenceFillersEnabled": true,
"silenceFillerPhrases": [
"Let me check that for you.",
"One moment please.",
"Good question, let me look into that."
]
}
These are LLM-driven — the agent decides when to use them based on context.
Turn Detection
How the agent decides the caller has stopped talking. Three modes are exposed under Detection in the voice panel.
| Mode | What it does | When to use |
|---|---|---|
| Ensemble (default) | Fuses VAD + STT + a semantic end-of-utterance model that understands meaning across 23+ languages. Knows the difference between a mid-thought pause and a finished sentence. | Default. Almost always the right choice — especially for multilingual or Indic agents. |
| STT Endpointing | Uses the STT provider’s own end-of-phrase signal. Silence-based, no semantics. Fast. | English-only chitchat where every millisecond counts and the STT (Deepgram, Sarvam) emits a reliable endpointing signal. |
| VAD Only | Raw silence-based detection. Fastest and crudest. | Simple, scripted flows where the caller’s turns are short and predictable. |
The semantic model runs locally inside the bot image — no per-call network round-trip, no third-party API. It falls back automatically to LiveKit’s bundled multilingual model, and then to single-method VAD/STT, if the upgraded model can’t load.
Adaptive EOU Confidence
When you pick Ensemble, an extra slider appears: Adaptive EOU Confidence. This is the threshold the semantic model uses to decide whether the caller has actually finished.
- Higher (0.85–0.95) — waits for more certainty before ending the turn. Fewer mid-thought cut-offs; the agent is more patient. Good for hesitant callers, technical Q&A, and IVRs that read out long IDs.
- Lower (0.40–0.60) — snappier. The agent responds faster but is more likely to interrupt a thought-pause. Good for terse, transactional chitchat.
- 0.70 (default) is balanced and works for most agents.
Smart Interruption
Under Advanced → Interruption Handling, the Smart interruption toggle uses the same semantic model to recognise filler words and incomplete utterances on the fly. When the caller says “uh huh”, “yeah okay”, or trails off mid-word, the agent ignores it and keeps talking — instead of cutting itself off. Strictly suppresses false interruptions; it never causes the agent to miss a real one.
On by default. No-ops gracefully if the semantic model isn’t loaded (the bot falls back to the static never-interrupt phrase list).
Proactive Re-engagement
Under Detection → Proactive Re-engagement, an opt-in toggle for one of the most asked-for behaviours in long-form support and onboarding calls.
When the caller pauses mid-sentence after a semantically incomplete utterance — “My ticket number is…” (silence) — the agent fires a short, language-aware nudge (“I’m listening, go ahead”) in the caller’s own language, instead of either prematurely answering or sitting silent until the 30-second abandonment poke.
Configure:
- Pause delay (1.5–8.0 s) — how long the caller can pause mid-thought before the agent nudges. Default 3.0 s.
- Max nudges per pause (1–3) — cap so the agent doesn’t keep prodding. Default 1.
- Explicit phrases (optional) — leave empty for an LLM-generated nudge in the caller’s language. Adding phrases here locks the wording and is not auto-translated.
Off by default — terse or IVR-style agents usually don’t want it.
Testing Your Configuration
After making changes, test thoroughly:
- Use the web call test — Click Test Call in the dashboard to hear your changes immediately.
- Test edge cases — Try interrupting, staying silent, speaking quickly, and using unusual words.
- Test on a real phone — Web call audio quality differs from phone audio. Always test over a real phone line before going live.
- Compare providers — Try the same conversation with different TTS providers to find the best fit.
- Get feedback — Have someone unfamiliar with the system test the call and provide honest feedback.
Next Steps