Voice Configuration

How your agent sounds defines the caller experience. thinnestAI gives you full control over text-to-speech, speech-to-text, interruption behavior, and conversational timing. This guide covers every setting you can tune.

The Tasks tab (Tasks & Task Groups) is deprecated and hidden — it’s superseded by Voice Workflows, which collect structured data with Collector/Conversation nodes and add branching, knowledge, and tools.

Choosing a TTS Provider

thinnestAI supports 8 text-to-speech providers — from ultra-low-latency engines for real-time voice to specialized providers for Indian languages. Pick the one that fits your use case.

Sarvam

India-first voice AI. Best-in-class support for Indian languages and accents. If your agents serve Indian customers, Sarvam is the clear choice. Strengths:

10+ Indian languages (Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, Gujarati, Malayalam, Punjabi, Odia)
Native Indian English accents
Code-switching support (Hindi-English, Tamil-English, etc.)
Optimized for Indian telecom networks

Models:

Model	Languages	Quality	Best For
`bulbul:v3`	10+ Indian languages	Highest (recommended)	Production Indian voice agents
`bulbul:v3-beta`	10+ Indian languages	High (experimental)	Trying upcoming voices
`bulbul:v2`	10+ Indian languages	Stable	Legacy production deployments

bulbul:v3 is the current default. It accepts three additional knobs the older v2 doesn’t:

temperature (0.01–1.0, default 0.5) — assistant-style prosody. We default to 0.5; 0.6 is more expressive but adds prosody jitter on long turns.
min_buffer_size (30–200 chars, default 50) — how much text to buffer before the first audio chunk emits. Lower = faster TTFA, more fragmented prosody.
max_chunk_length (50–500 chars, default 150) — splits long sentences for streaming.

These are surfaced automatically when you pick a bulbul:v3 model — no extra config needed. Configuration:

{
  "tts": {
    "provider": "sarvam",
    "model": "bulbul:v3",
    "voice": "shubh",
    "language": "hi-IN",
    "speed": 1.0
  }
}
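As a rough illustration of the documented ranges, the sketch below builds a Sarvam TTS config and rejects out-of-range values for the three bulbul:v3 knobs. The helper name `build_sarvam_tts` is hypothetical, and placing the knobs directly inside the `tts` object is an assumption; this guide does not show where they go in the payload.

```python
# Documented ranges for the bulbul:v3-only knobs: (min, max, default).
BULBUL_V3_KNOBS = {
    "temperature": (0.01, 1.0, 0.5),
    "min_buffer_size": (30, 200, 50),
    "max_chunk_length": (50, 500, 150),
}


def build_sarvam_tts(voice="shubh", language="hi-IN", **knobs):
    """Return a tts config dict, rejecting out-of-range v3 knobs."""
    tts = {
        "provider": "sarvam",
        "model": "bulbul:v3",
        "voice": voice,
        "language": language,
        "speed": 1.0,
    }
    for name, (low, high, default) in BULBUL_V3_KNOBS.items():
        value = knobs.pop(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        # Assumption: the knobs sit beside "model" in the tts object.
        tts[name] = value
    if knobs:
        raise TypeError(f"unknown knobs: {sorted(knobs)}")
    return {"tts": tts}


config = build_sarvam_tts(temperature=0.6, min_buffer_size=40)
```

Checking ranges client-side simply surfaces typos before a request is sent; the platform may clamp or reject values differently.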

Popular Sarvam voices:

Voice	Sarvam version	Language	Description
`shubh`	v3 (default)	Hindi / Indian English	Warm, conversational male
`ritu`	v3	Hindi / Indian English	Clear professional female
`pooja`, `kavya`, `ishita`, `priya`	v3	Hindi / Indian English	Female, varied tone
`rahul`, `amit`, `dev`, `kabir`, `aditya`	v3	Hindi / Indian English	Male, varied tone
`anushka`	v2 (default)	Hindi	Female, baseline production voice
`manisha`, `vidya`, `arya`	v2	Hindi	Female alternates
`abhilash`, `karun`, `hitesh`	v2	Hindi	Male alternates

The full v3 roster is ~25 speakers covering Hindi, Tamil, Telugu, Marathi, Bengali, English, and code-switched accents. Pick voices in the agent studio’s TTS picker — switching between v2 and v3 automatically filters to compatible speakers. Supported Indian languages:

Language	Code	Code-Switching
Hindi	`hi-IN`	Hindi-English
Tamil	`ta-IN`	Tamil-English
Telugu	`te-IN`	Telugu-English
Kannada	`kn-IN`	Kannada-English
Bengali	`bn-IN`	Bengali-English
Marathi	`mr-IN`	Marathi-English
Gujarati	`gu-IN`	—
Malayalam	`ml-IN`	—
Punjabi	`pa-IN`	—
Odia	`or-IN`	—

Cartesia

Blazing-fast, high-quality voices with fine-grained emotion control. Great for expressive, dynamic conversations. Strengths:

Sub-100ms latency
Emotion and speed control mid-sentence
Streaming word-level timestamps
Multilingual support

Models:

Model	Latency	Quality	Best For
`sonic-2`	~80ms	Highest	Expressive voice agents
`sonic-2-turbo`	~50ms	High	Ultra-low latency
`sonic-multilingual`	~100ms	High	Non-English deployments

Configuration:

{
  "tts": {
    "provider": "cartesia",
    "model": "sonic-2",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "output_format": "pcm_16000",
    "speed": "normal",
    "emotion": ["positivity:high", "curiosity:medium"]
  }
}

Deepgram TTS

Fast, affordable text-to-speech from the leaders in speech AI. Good balance of quality and cost. Strengths:

Very low latency
Simple API
Competitive pricing
Good for high-volume use cases

Models:

Model	Latency	Quality	Best For
`aura-asteria-en`	~100ms	High	Female English voice
`aura-orion-en`	~100ms	High	Male English voice
`aura-luna-en`	~100ms	High	Soft female English
`aura-arcas-en`	~100ms	High	Authoritative male

Configuration:

{
  "tts": {
    "provider": "deepgram",
    "model": "aura-asteria-en"
  }
}

ElevenLabs

The most natural-sounding voices with ultra-realistic intonation. Best for customer-facing agents where voice quality is the top priority. Strengths:

Ultra-realistic voices with natural emotion
Voice cloning (use your own voice)
29+ languages
Extensive voice library

Models:

Model	Latency	Quality	Best For
`eleven_turbo_v2_5`	~150ms	Highest	Production voice agents
`eleven_multilingual_v2`	~200ms	Highest	Multilingual deployments
`eleven_flash_v2_5`	~100ms	High	Low-latency needs

Configuration:

{
  "tts": {
    "provider": "elevenlabs",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_turbo_v2_5",
    "stability": 0.5,
    "similarity_boost": 0.75
  }
}

Popular ElevenLabs voices:

Voice	ID	Description
Rachel	`21m00Tcm4TlvDq8ikWAM`	Calm, professional female
Adam	`pNInz6obpgDQGcFmaJgB`	Deep, authoritative male
Bella	`EXAVITQu4vr4xnSDxMaL`	Warm, friendly female
Antoni	`ErXwobaYiN019PkySvjV`	Smooth, conversational male

Rime

High-quality, low-latency voices with a focus on accuracy and consistency. Strengths:

Consistent voice quality across calls
Good pronunciation accuracy
Fast inference
Simple integration

Models:

Model	Latency	Quality	Best For
`mist`	~120ms	High	General voice agents
`v1`	~150ms	Good	Cost-optimized use

Configuration:

{
  "tts": {
    "provider": "rime",
    "model": "mist",
    "voice": "cove",
    "speed": 1.0
  }
}

Inworld

Specialized for character-driven, immersive voice experiences. Great for agents with strong personas. Strengths:

Character-consistent voices
Emotional expressiveness
Real-time voice modulation
Persona-driven design

Models:

Model	Latency	Quality	Best For
`inworld-v2`	~130ms	High	Character-driven agents
`inworld-v1`	~160ms	Good	Standard agents

Configuration:

{
  "tts": {
    "provider": "inworld",
    "model": "inworld-v2",
    "voice": "default",
    "emotion": "friendly"
  }
}

Google Cloud TTS

Wide language support with consistent quality. Best for multilingual deployments. Strengths:

220+ voices across 40+ languages
Neural2 and Studio voices for premium quality
SSML support
Predictable pricing

Models:

Model	Quality	Best For
`Neural2`	Highest	Production use
`Studio`	Highest	Premium applications
`Standard`	Good	Cost-optimized

Configuration:

{
  "tts": {
    "provider": "google",
    "voice_name": "en-US-Neural2-F",
    "language_code": "en-US",
    "speaking_rate": 1.0,
    "pitch": 0.0
  }
}

Recommended Google voices:

Voice	Language	Description
`en-US-Neural2-F`	English (US)	Natural female
`en-US-Neural2-D`	English (US)	Natural male
`en-GB-Neural2-A`	English (UK)	British female
`es-ES-Neural2-A`	Spanish	Female
`hi-IN-Neural2-A`	Hindi	Female

Azure Speech

Enterprise-grade with fine-grained control. Best for regulated industries. Strengths:

High-quality neural voices
Custom neural voice training
SSML with extensive control
Strong enterprise compliance

Models:

Model	Quality	Best For
`Neural`	Highest	Production use
`Custom Neural`	Highest	Brand-specific voice

Configuration:

{
  "tts": {
    "provider": "azure",
    "voice_name": "en-US-JennyNeural",
    "style": "friendly",
    "speaking_rate": "1.0"
  }
}

Provider Comparison

Provider	Latency	Languages	Indian Languages	Best For
Sarvam	~100ms	10+ Indian	Yes (best)	Indian market
Cartesia (recommended)	~50-100ms	10+	No	Expressive conversations
Deepgram	~100ms	English	No	High-volume, cost-effective
ElevenLabs	~100-200ms	29+	No	Premium voice quality
Rime	~120ms	English+	No	Consistent, reliable
Inworld	~130ms	English+	No	Character-driven agents
Google	~150ms	40+	Hindi, Bengali, Tamil	Multilingual
Azure	~150ms	60+	Hindi, Tamil, Telugu	Enterprise, compliance

Voice Selection and Customization

Selecting a Voice in the Dashboard

Open your voice agent configuration.
Go to the Voice tab.
Select a TTS provider.
Browse available voices and click Preview to hear samples.
Adjust settings (speed, pitch, stability) with the sliders.
Click Save.

Voice Selection via API

curl -X PATCH "https://api.thinnest.ai/agents/agent_xyz" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_config": {
      "tts_provider": "elevenlabs",
      "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
      "tts_model": "eleven_turbo_v2_5",
      "tts_stability": 0.5,
      "tts_similarity_boost": 0.75
    }
  }'
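For scripted updates, the same PATCH request can be sketched in Python with only the standard library. The endpoint, headers, and body mirror the curl example; the helper name `build_voice_patch` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.thinnest.ai"


def build_voice_patch(agent_id, api_key, voice_config):
    """Build (but do not send) the PATCH request from the curl example."""
    body = json.dumps({"voice_config": voice_config}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/agents/{agent_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_voice_patch(
    "agent_xyz",
    "YOUR_API_KEY",
    {
        "tts_provider": "elevenlabs",
        "tts_voice_id": "21m00Tcm4TlvDq8ikWAM",
        "tts_model": "eleven_turbo_v2_5",
        "tts_stability": 0.5,
        "tts_similarity_boost": 0.75,
    },
)
# To send it: urllib.request.urlopen(request)
```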

Matching Voice to Use Case

Use Case	Recommended Provider	Voice Style
General voice agents	Cartesia	Warm, professional
Indian market	Sarvam	Native language voices
Customer support	ElevenLabs or Cartesia	Warm, patient, professional
Sales calls	Cartesia or ElevenLabs	Energetic, confident
High-volume campaigns	Deepgram	Clear, cost-effective
Appointment reminders	Deepgram	Neutral, clear
Medical/legal	Azure	Professional, calm
Multilingual support	Google or Sarvam	Language-specific voices
Character-driven agents	Inworld	Persona-consistent
Brand voice	ElevenLabs (clone)	Your custom voice

Speech-to-Text (STT) Configuration

Speech-to-text converts the caller’s audio into text for your AI agent to process. thinnestAI supports multiple STT providers with native support for Indian and global languages.

Deepgram (Default)

Industry-leading speech recognition with the best accuracy and lowest latency. Our default and recommended STT provider. Models:

Model	Accuracy	Latency	Best For
`flux-general-en`	High	~200ms + native turn detection	Voice agents (English) — fastest end-to-end with built-in EOU
`nova-3`	Highest	~100ms	Production voice agents (multilingual)
`nova-2`	High	~120ms	General use
`nova-2-phonecall`	High	~120ms	Phone call audio (optimized)
`nova-2-conversationalai`	High	~120ms	Conversational agents
`enhanced`	Good	~150ms	Cost-optimized
`base`	Standard	~100ms	High-volume, budget

Flux models use Deepgram’s v2 streaming API with native turn detection: the model itself decides when the user has stopped speaking, eliminating the need for a separate VAD silence timer. Pair with Turn Detection Mode → STT Endpointing for the lowest end-to-end latency (~400ms total from user end-of-speech to agent reply).

When Flux is selected, the interim_results, punctuate, and language UI settings are managed by Flux internally and are not forwarded to the STT plugin.

{
  "stt": {
    "provider": "deepgram",
    "model": "nova-3",
    "language": "en-US",
    "punctuate": true,
    "smart_format": true,
    "profanity_filter": false
  }
}

Sarvam STT

Best-in-class for Indian languages. If your callers speak Hindi, Tamil, Telugu, or other Indian languages, Sarvam delivers the highest accuracy. Models:

Model	Languages	Best For
`saarika:v2`	10+ Indian languages	Production Indian voice agents
`saarika:v2-turbo`	10+ Indian languages	Low-latency Indian deployments

{
  "stt": {
    "provider": "sarvam",
    "model": "saarika:v2",
    "language": "hi-IN"
  }
}

Supports code-switching (e.g., Hindi-English mixed speech) out of the box.

Google Speech-to-Text

Strong multilingual support with good accuracy across 125+ languages. Includes a built-in denoiser that filters background noise for cleaner transcriptions in noisy environments like call centers or outdoor calls. Models:

Model	Accuracy	Best For
`latest_long`	High	Long conversations
`latest_short`	High	Short utterances
`telephony`	High	Phone call audio
`medical_conversation`	High	Healthcare

{
  "stt": {
    "provider": "google",
    "model": "telephony",
    "language": "en-US",
    "denoiser": true
  }
}

Set "denoiser": true to enable automatic background noise filtering. This is especially useful for phone call audio where ambient noise can degrade transcription accuracy.

Azure Speech-to-Text

Enterprise-grade with custom model training. Best for regulated industries. Models:

Model	Accuracy	Best For
`default`	High	General use
`custom`	Highest	Domain-specific vocabulary
`conversation`	High	Multi-speaker calls

{
  "stt": {
    "provider": "azure",
    "model": "default",
    "language": "en-US"
  }
}

Cartesia STT

Real-time transcription with word-level timestamps.

{
  "stt": {
    "provider": "cartesia",
    "model": "ink",
    "language": "en"
  }
}

AssemblyAI

High-accuracy transcription with built-in speaker diarization and entity detection. Models:

Model	Accuracy	Best For
`best`	Highest	Maximum accuracy
`nano`	Good	Cost-optimized, low latency

{
  "stt": {
    "provider": "assemblyai",
    "model": "best",
    "language": "en"
  }
}

STT Provider Comparison

Provider	Languages	Indian Languages	Self-Hosted	Best For
Deepgram (default)	30+	Hindi	No	General voice agents
Sarvam	10+ Indian	10+ Indian	No	Indian languages
Google	125+	Hindi, Tamil, Telugu+	No	Multilingual
Azure	100+	Hindi, Tamil, Telugu+	No	Enterprise, custom models
Cartesia	10+	No	No	Timestamps, real-time
AssemblyAI	20+	No	No	Maximum accuracy

Multi-Language Support

For agents that handle multiple languages:

{
  "stt": {
    "language": "multi",
    "detect_language": true,
    "supported_languages": ["en", "es", "fr", "hi", "ta"]
  }
}

The agent will detect the caller’s language and respond accordingly.

Turn Detection Mode

Turn detection decides when the user has stopped speaking so the agent can take its turn. Three modes are available, each with different latency and accuracy trade-offs.

Mode	id	Mechanism	Best For
Ensemble	`turn_detector`	Combines silence timing, the live transcript, and an AI end-of-turn model — tells a mid-thought pause apart from a finished question	Recommended default — most accurate across languages, including callers who pause mid-sentence
STT Endpointing	`stt_endpointing`	The STT provider’s own end-of-utterance signal	Lowest latency with a provider that has a strong native signal (Deepgram Flux, Sarvam Saaras)
VAD Only	`vad_only`	Silence-based detection only — fires after `min_silence_duration` of silence	Fastest possible turn-end; short, simple utterances

Ensemble is the default for new agents. Rather than relying on a single signal, it reads what the caller said alongside how long they paused — so it waits through a natural mid-sentence pause (“I’d like to know about…”) and ends the turn promptly on a complete thought (“…your pricing”).

Your choice is respected

Turn Detection Mode is exactly what you set on the Detection tab. Choosing a speech-to-text provider no longer overrides it — the Ensemble model layers on top of any provider, so there’s no need to force a single-signal mode for Sarvam or Deepgram Flux. If you want the lowest possible latency and your provider has a strong native end-of-utterance signal (Deepgram Flux, Sarvam Saaras), pick STT Endpointing explicitly.

Configuration via API

{
  "voice_config": {
    "turn_detection_mode": "stt_endpointing",
    "min_silence_duration": 0.3,
    "min_speech_duration": 0.1,
    "min_endpointing_delay": 300,
    "max_endpointing_delay": 1.5
  }
}

Field	Type	Default	Notes
`turn_detection_mode`	string	`"turn_detector"`	One of `turn_detector`, `stt_endpointing`, `vad_only`
`min_silence_duration`	float (seconds)	`0.3`	VAD silence threshold before declaring speech ended
`min_speech_duration`	float (seconds)	`0.1`	Minimum sound duration to count as valid speech (filters short noises)
`min_endpointing_delay`	int (ms)	`300`	Buffer added after silence detection before processing
`max_endpointing_delay`	float (seconds)	`1.5`	Hard cutoff for turn end regardless of context
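Note that the two endpointing delays use different units in the table above (milliseconds for the minimum, seconds for the maximum). A small sanity check along these lines can catch a unit mix-up before a config is pushed; the function name is hypothetical and the check itself is only a sketch, not part of the platform.

```python
VALID_MODES = {"turn_detector", "stt_endpointing", "vad_only"}


def endpointing_window_seconds(voice_config):
    """Return (min_delay, max_delay) in seconds after checking the config."""
    mode = voice_config.get("turn_detection_mode", "turn_detector")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown turn_detection_mode: {mode!r}")
    # min_endpointing_delay is documented in milliseconds.
    min_delay = voice_config.get("min_endpointing_delay", 300) / 1000.0
    # max_endpointing_delay is documented in seconds.
    max_delay = voice_config.get("max_endpointing_delay", 1.5)
    if min_delay > max_delay:
        raise ValueError(
            f"min delay {min_delay}s exceeds max delay {max_delay}s; "
            "check that the minimum is in ms and the maximum in seconds"
        )
    return min_delay, max_delay


window = endpointing_window_seconds(
    {"turn_detection_mode": "stt_endpointing",
     "min_endpointing_delay": 300,
     "max_endpointing_delay": 1.5}
)
```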

STT/TTS Fallback

For production voice agents, you can configure fallback providers that automatically take over if the primary provider fails or times out. This ensures your agent stays responsive even during provider outages.

How Fallback Works

The agent sends audio to the primary provider.
If the primary provider fails (timeout, error, or degraded quality), the system automatically switches to the fallback provider.
The switch is seamless — callers experience a brief pause at most, not a dropped call.
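Conceptually, the steps above amount to a try-primary-then-fallback wrapper. The sketch below is not the platform's implementation; `synthesize_with_fallback` and the provider callables are hypothetical stand-ins, and detecting degraded quality is not modelled.

```python
def synthesize_with_fallback(text, primary, fallback):
    """Try the primary provider; on any error or timeout use the fallback.

    Returns a (source, audio) tuple so callers can log which path served.
    """
    try:
        return "primary", primary(text)
    except Exception:
        # TimeoutError and provider errors both land here.
        return "fallback", fallback(text)


def flaky_primary(text):
    """Stand-in for a provider that is currently timing out."""
    raise TimeoutError("primary provider timed out")


def steady_fallback(text):
    """Stand-in for a healthy fallback provider."""
    return b"audio:" + text.encode("utf-8")


source, audio = synthesize_with_fallback("Hello", flaky_primary, steady_fallback)
```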

Configuring TTS Fallback

{
  "tts": {
    "provider": "elevenlabs",
    "model": "eleven_turbo_v2_5",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "fallback": {
      "provider": "aero",
      "model": "aero-2",
      "voice": "maya"
    }
  }
}

Configuring STT Fallback

{
  "stt": {
    "provider": "deepgram",
    "model": "nova-3",
    "fallback": {
      "provider": "google",
      "model": "telephony"
    }
  }
}

Recommended Fallback Pairs

Primary	Fallback	Reason
ElevenLabs (TTS)	Deepgram (TTS)	Premium quality primary, fast fallback
Deepgram (STT)	Google (STT)	Best accuracy primary, wide language fallback
Cartesia (TTS)	Deepgram (TTS)	Ultra-low latency primary, reliable fallback
Sarvam (STT)	Google (STT)	Best Indian language primary, multilingual fallback

Tip: Choose fallback providers that are hosted on different infrastructure than your primary to maximize resilience. For example, pairing a third-party provider with Google Cloud provides good redundancy.

Noise Cancellation

Noise cancellation runs on the audio coming into the agent (the caller’s microphone) before it reaches STT — cleaner input means more accurate transcripts and fewer false interruptions on noisy phone lines. Pick from the Voice → Noise Cancellation dropdown in the agent studio:

Option	Engine	Latency	Best for
DTLN (default)	Dual-LSTM denoiser, ~3 MB model	~8 ms	Lightweight baseline. Strips traffic, hum, AC noise.
Hush	DeepFilterNet3 + Auxiliary Separation Head (Weya AI)	~1 ms / frame	Strongest on-device option. Also suppresses background voices, TV, and music.
None	—	0	Disable when the caller’s environment is already clean (studio, headset).

Both options are self-hosted (CPU), so they work the same on every deployment. Selecting None is now honored at runtime; before the fix, an explicit “off” silently fell back to DTLN.

Hush — install notes

Hush requires a separate model bundle (DeepFilterNet3 weights + the auxiliary speaker-separation ONNX) at $HUSH_MODEL_DIR (default /app/hush_model). Folder structure expected by the wrapper:

hush_model/
├── lib/
├── onnx/
└── model_best.ckpt

Download from huggingface.co/weya-ai/hush and place under that path before starting the agent worker. If the bundle is missing, the loader logs a warning and falls back to DTLN — your agent still gets noise cancellation, just not Hush-grade.
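A pre-flight check for the bundle might look like the sketch below. It mirrors the folder layout and the DTLN fallback described above, but `pick_noise_canceller` is a hypothetical helper, not the actual loader.

```python
import os
from pathlib import Path


def pick_noise_canceller(requested, model_dir=None):
    """Return the engine that would run, given the requested option."""
    if requested != "hush":
        return requested  # "dtln" and "none" need no extra files
    root = Path(model_dir or os.environ.get("HUSH_MODEL_DIR", "/app/hush_model"))
    bundle_present = (
        (root / "lib").is_dir()
        and (root / "onnx").is_dir()
        and (root / "model_best.ckpt").is_file()
    )
    # A missing bundle falls back to DTLN, as described above.
    return "hush" if bundle_present else "dtln"
```

Running a check like this at deploy time makes a silent fallback to DTLN visible before the first call.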

Configuration via API

{
  "voice": {
    "noise_cancellation": "hush"
  }
}

Accepts: dtln | hush | none.

Voice Style Preamble

Every voice agent automatically gets a short “you are speaking, not writing” preamble prepended to its system prompt at runtime. This steers even chat-tuned models (GPT-4o, Sarvam-30B, etc.) toward phone-friendly output:

1–2 short sentences per reply (~30 words)
Use contractions (“I’m”, “you’re”)
Never use markdown, bullets, headings, code blocks, or symbols
Spell out currency, dates, units (“twenty dollars”, not “$20”)
Don’t read URLs or email addresses character by character
Ask one question at a time

Without this preamble, chat-trained LLMs frequently emit **bold**, bullet lists, or $10 as “dollar sign one zero” on calls. The preamble catches all of that.

Editing the preamble

Open the agent’s Behavior panel — the Voice Style Preamble card sits at the top of the right column. The textarea is pre-populated with the built-in default text; edit any line to customize for your domain (e.g., legal/medical disclaimer scripts), or click Reset to default to revert. Toggle the switch off if your system prompt already encodes voice style constraints — the preamble will be skipped at runtime.

Configuration via API

{
  "hooks": {
    "voice_style_preamble": {
      "enabled": true,
      "text": ""
    }
  }
}

enabled: false — skip the preamble entirely
text: "" (or omitted) — use the built-in default
text: "<custom text>" — use your text instead of the default
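The three rules above can be read as a small resolution function. In this sketch `DEFAULT_PREAMBLE` is a placeholder (the real built-in text is not reproduced here), the function names are hypothetical, and treating an omitted `enabled` as true is an assumption based on the preamble being applied automatically.

```python
# Placeholder only; the real built-in default text lives in the platform.
DEFAULT_PREAMBLE = "<built-in default preamble>"


def resolve_preamble(hooks):
    """Return the preamble text to prepend, or None when it is disabled."""
    cfg = hooks.get("voice_style_preamble", {})
    if not cfg.get("enabled", True):  # assumption: omitted means enabled
        return None
    # Empty or omitted text selects the built-in default.
    return cfg.get("text") or DEFAULT_PREAMBLE


def build_system_prompt(system_prompt, hooks):
    """Prepend the resolved preamble to the agent's system prompt."""
    preamble = resolve_preamble(hooks)
    if preamble is None:
        return system_prompt
    return f"{preamble}\n\n{system_prompt}"
```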

Interruption Handling

Interruption handling determines what happens when the caller speaks while the agent is talking.

Modes

Allow interruptions (default): The agent stops speaking when the caller starts talking. This feels natural — like a real conversation.

{
  "interruption": {
    "enabled": true,
    "sensitivity": "medium"
  }
}

Block interruptions: The agent finishes its response before listening. Use this for critical information that must be delivered completely (e.g., legal disclaimers).

{
  "interruption": {
    "enabled": false
  }
}

Sensitivity Levels

Level	Description	Best For
`low`	Only interrupts on sustained speech (1+ seconds)	Noisy environments
`medium`	Balanced detection	General use
`high`	Interrupts on brief sounds	Quick, responsive conversations

Per-Message Interruption Control

You can control interruption behavior per message in your agent’s system prompt:

When reading the terms and conditions, do not allow interruptions.
For all other responses, allow normal interruptions.

The agent will dynamically adjust based on what it is saying.

Silent Timeout Settings

Silent timeouts control what happens when the caller stops speaking.

{
  "timeouts": {
    "silence_timeout_seconds": 10,
    "silence_prompt": "Are you still there? I'm happy to help if you have any questions.",
    "max_silence_prompts": 2,
    "disconnect_after_max_silence": true,
    "disconnect_message": "It seems like you may have stepped away. Feel free to call back anytime. Goodbye!"
  }
}

Parameter	Default	Description
`silence_timeout_seconds`	10	Seconds of silence before prompting
`silence_prompt`	—	What the agent says after silence
`max_silence_prompts`	2	How many times to prompt before disconnecting
`disconnect_after_max_silence`	true	Disconnect after max prompts exceeded
`disconnect_message`	—	Message before disconnecting
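One way to read these parameters is as a small state machine that fires each time the silence timer expires. The sketch below assumes the agent prompts up to `max_silence_prompts` times and disconnects on the next timeout; `on_silence_timeout` is a hypothetical name and the exact runtime behavior may differ.

```python
def on_silence_timeout(prompts_sent, timeouts):
    """Decide what to do when silence_timeout_seconds elapses.

    Returns (action, message, prompts_sent) where action is
    "prompt", "disconnect", or "wait".
    """
    if prompts_sent < timeouts.get("max_silence_prompts", 2):
        return "prompt", timeouts.get("silence_prompt"), prompts_sent + 1
    if timeouts.get("disconnect_after_max_silence", True):
        return "disconnect", timeouts.get("disconnect_message"), prompts_sent
    return "wait", None, prompts_sent


timeouts = {
    "silence_timeout_seconds": 10,
    "silence_prompt": "Are you still there?",
    "max_silence_prompts": 2,
    "disconnect_after_max_silence": True,
    "disconnect_message": "Feel free to call back anytime. Goodbye!",
}

# Three consecutive timeouts with no speech in between.
actions = []
sent = 0
for _ in range(3):
    action, message, sent = on_silence_timeout(sent, timeouts)
    actions.append(action)
```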

Recommended Timeout Settings

Use Case	Timeout	Notes
Quick transactions	5-8s	Fast-paced conversations
Customer support	10-15s	Callers may be looking up info
Technical support	15-20s	Callers may be following instructions
Sales calls	8-12s	Keep momentum going

Wait Before Speaking

A small pause between user end-of-turn and the agent starting to speak can make responses feel more human. Counterintuitively, instant replies often sound robotic — a 300-500ms “thinking beat” reads as natural conversation. Default is 0 (instant reply, lowest latency); dial up if you want the conversational feel.

Setting	Default	Range	Effect
`wait_before_speak_seconds`	`0`	`0` - `2` seconds	Delay before agent’s reply starts streaming

When to tune

Value	Use Case
`0.0s` (Default, Instant)	Latency-critical bots (IVR, ad-tech, status checks) and most voice agents; replies feel snappy
`0.3-0.5s`	More human-feeling conversations — adds a “thinking beat”
`0.8-1.5s` (Deliberate)	Agents handling complex requests where a longer pause reinforces “thinking”

Configuration via API

{
  "voice_config": {
    "wait_before_speak_seconds": 0.4
  }
}

The wait is measured by LiveKit as on_user_turn_completed_delay in EOU metrics, which are visible in production logs for verification.

With preemptive generation enabled (default), the LLM call starts on interim transcripts before the user finishes. The wait_before_speak delay applies between EOU and TTS start — by then the LLM has often already finished, so the wait is the only added latency. Total perceived gap: ~EOU + max(LLM TTFT, wait_before_speak) + TTS TTFB.
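The perceived-gap estimate above is simple arithmetic, and writing it out shows when the wait actually costs anything. The function name and the sample timings are illustrative only.

```python
def perceived_gap_ms(eou_ms, llm_ttft_ms, wait_before_speak_s, tts_ttfb_ms):
    """Estimate: EOU + max(LLM TTFT, wait_before_speak) + TTS TTFB, in ms."""
    wait_ms = wait_before_speak_s * 1000
    return eou_ms + max(llm_ttft_ms, wait_ms) + tts_ttfb_ms


# Fast LLM: the 0.5 s wait dominates the middle term.
fast_llm = perceived_gap_ms(eou_ms=300, llm_ttft_ms=250,
                            wait_before_speak_s=0.5, tts_ttfb_ms=100)

# Slow LLM: the wait is fully hidden behind the LLM's time to first token.
slow_llm = perceived_gap_ms(eou_ms=300, llm_ttft_ms=700,
                            wait_before_speak_s=0.5, tts_ttfb_ms=100)
```

With these sample numbers the fast-LLM case comes to 900 ms and the slow-LLM case to 1100 ms, so the wait only adds latency when the LLM is quicker than the configured pause.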

Voice Formatting and Pronunciation

Pronunciation Overrides

If your agent mispronounces specific words (brand names, technical terms), add pronunciation overrides:

{
  "pronunciation": {
    "overrides": [
      {
        "word": "thinnestAI",
        "pronunciation": "thin-est A.I."
      },
      {
        "word": "PostgreSQL",
        "pronunciation": "post-gres-Q-L"
      },
      {
        "word": "API",
        "pronunciation": "A.P.I."
      }
    ]
  }
}
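To preview what the TTS engine would receive, the overrides can be approximated as a single-pass text substitution. This is only a sketch: `apply_overrides` is a hypothetical helper, and case-sensitive whole-word matching is an assumption about how the platform applies the list.

```python
import re

OVERRIDES = [
    {"word": "thinnestAI", "pronunciation": "thin-est A.I."},
    {"word": "PostgreSQL", "pronunciation": "post-gres-Q-L"},
    {"word": "API", "pronunciation": "A.P.I."},
]


def apply_overrides(text, overrides):
    """Swap each configured word for its spoken form in a single pass."""
    if not overrides:
        return text
    spoken = {o["word"]: o["pronunciation"] for o in overrides}
    # Longest words first so a longer entry wins over a shorter one.
    words = sorted(spoken, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    # One pass means replacement text is never re-scanned for matches.
    return pattern.sub(lambda m: spoken[m.group(1)], text)


preview = apply_overrides("The thinnestAI API uses PostgreSQL.", OVERRIDES)
```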

Number and Date Formatting

Control how the agent reads numbers and dates:

{
  "formatting": {
    "phone_numbers": "digit_by_digit",
    "currency": "natural",
    "dates": "natural",
    "times": "12_hour"
  }
}

Setting	Options	Example
`phone_numbers`	`digit_by_digit`, `grouped`	“4-1-5-5-5-5-1-2-3-4” vs “415-555-1234”
`currency`	`natural`, `formal`	“twenty-three fifty” vs “twenty-three dollars and fifty cents”
`dates`	`natural`, `formal`	“March fifth” vs “March 5th, 2026”
`times`	`12_hour`, `24_hour`	“2 PM” vs “14 hundred hours”
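As a rough picture of the two `phone_numbers` styles, the sketch below reproduces the example strings from the table. The helper name is hypothetical, and the 3-3-4 grouping assumes a ten-digit number; other number plans would need different grouping.

```python
def format_phone_number(number, style="digit_by_digit"):
    """Render a phone number in one of the two documented styles."""
    digits = [ch for ch in number if ch.isdigit()]
    if style == "digit_by_digit":
        return "-".join(digits)
    if style == "grouped":
        joined = "".join(digits)
        # Assumption: ten digits, grouped 3-3-4.
        return f"{joined[:3]}-{joined[3:6]}-{joined[6:]}"
    raise ValueError(f"unknown style: {style!r}")
```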

Filler Words and Pauses

Make your agent sound more natural by eliminating awkward silence. thinnestAI supports two types of fillers:

Instant Filler Words

Short, natural sounds that play immediately (~200 ms after the user stops speaking) — before the LLM has even produced a token. The real reply slides in behind the filler as it arrives. This masks the LLM + KB latency that would otherwise be perceived as a 1-3 second silence. Where to find it: Voice Configuration → Advanced tab → Filler Words card. Toggle defaults to on for new agents.

{
  "fillerWordsEnabled": true,
  "fillerWords": ["hmm", "aha", "um", "okay"],
  "fillerWordsMinChars": 10,
  "fillerMinTtftMs": 800
}

fillerWordsEnabled — Master toggle. Default true for new agents.
fillerWords — Custom filler words. Leave empty to use language-appropriate defaults:
- English — hmm, uh huh, um, right, okay
- Hindi — hmm, acchha, haan, theek hai, ji
- Other — hmm, aha, um, okay
fillerWordsMinChars — Skip the filler when the user’s transcript is shorter than this. Default 10. Stops “yes” / “no” replies from getting an unnecessary “hmm” in front.
fillerMinTtftMs — Collision guard. Only fire the filler if the previous turn’s LLM TTFT was at least this slow (in ms). Default 800. On a fast turn (TTFT < 800 ms) the real reply would arrive before the filler audio finishes, causing audio overlap even with cross-fade enabled. The first turn always probes (no prior TTFT data to gate on).

The filler runs through session.say() with allow_interruptions=true and add_to_chat_ctx=false — when the real LLM reply arrives, it preempts the filler instead of queueing behind it, and the filler text never reaches the model’s conversation history.

Filler Words apply in both Cascaded and Speech-to-Speech modes. In S2S, the filler is spoken by the cascaded TTS path (it’s a pre-LLM cue, not part of the realtime model’s stream).
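The gating rules for instant fillers (master toggle, minimum transcript length, and the TTFT collision guard) can be summarized in one predicate. This is a sketch of the documented rules, not the runtime code; `should_play_filler` is a hypothetical name.

```python
def should_play_filler(transcript, prev_ttft_ms, *, enabled=True,
                       min_chars=10, min_ttft_ms=800):
    """Decide whether to play an instant filler for this turn.

    prev_ttft_ms is the previous turn's LLM time to first token,
    or None on the first turn.
    """
    if not enabled:
        return False
    # Short replies like "yes" / "no" do not get a filler.
    if len(transcript.strip()) < min_chars:
        return False
    # The first turn always probes: there is no prior TTFT to gate on.
    if prev_ttft_ms is None:
        return True
    # Collision guard: only fire if the last turn was slow enough.
    return prev_ttft_ms >= min_ttft_ms
```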

Thinking Phrases

Longer phrases spoken when the agent needs time to process a tool call or complex request:

{
  "silenceFillersEnabled": true,
  "silenceFillerPhrases": [
    "Let me check that for you.",
    "One moment please.",
    "Good question, let me look into that."
  ]
}

These are LLM-driven — the agent decides when to use them based on context.

Turn Detection

How the agent decides the caller has stopped talking. Three modes are exposed under Detection in the voice panel.

Mode	What it does	When to use
Ensemble (default)	Fuses VAD + STT + a semantic end-of-utterance model that understands meaning across 23+ languages. Knows the difference between a mid-thought pause and a finished sentence.	Default. Almost always the right choice — especially for multilingual or Indic agents.
STT Endpointing	Uses the STT provider’s own end-of-phrase signal. Silence-based, no semantics. Fast.	English-only chitchat where every millisecond counts and the STT (Deepgram, Sarvam) emits a reliable endpointing signal.
VAD Only	Raw silence-based detection. Fastest and crudest.	Simple, scripted flows where the caller’s turns are short and predictable.

The semantic model runs locally inside the bot image — no per-call network round-trip, no third-party API. It falls back automatically to LiveKit’s bundled multilingual model, and then to single-method VAD/STT, if the upgraded model can’t load.

Adaptive EOU Confidence

When you pick Ensemble, an extra slider appears: Adaptive EOU Confidence. This is the threshold the semantic model uses to decide whether the caller has actually finished.

Higher (0.85–0.95) — waits for more certainty before ending the turn. Fewer mid-thought cut-offs; the agent is more patient. Good for hesitant callers, technical Q&A, and IVRs that read out long IDs.
Lower (0.40–0.60) — snappier. The agent responds faster but is more likely to interrupt a thought-pause. Good for terse, transactional chitchat.
0.70 (default) is balanced and works for most agents.
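The effect of the slider can be pictured as a threshold on the semantic model's end-of-utterance score. In the sketch below, `eou_probability` and `turn_is_over` are hypothetical names, and combining the threshold with the `max_endpointing_delay` hard cutoff in this way is an assumption.

```python
def turn_is_over(eou_probability, silence_s, *, threshold=0.70,
                 max_endpointing_delay=1.5):
    """Return True when the agent should take its turn."""
    # Hard cutoff: end the turn regardless of what the model thinks.
    if silence_s >= max_endpointing_delay:
        return True
    # Otherwise wait until the model is confident enough.
    return eou_probability >= threshold
```

With the same score of 0.75, a threshold of 0.70 ends the turn while a threshold of 0.90 keeps waiting, which is the patience trade-off described above.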

Smart Interruption

Under Advanced → Interruption Handling, the Smart interruption toggle uses the same semantic model to recognise filler words and incomplete utterances on the fly. When the caller says “uh huh”, “yeah okay”, or trails off mid-word, the agent ignores it and keeps talking instead of cutting itself off. The toggle only suppresses false interruptions; it never causes the agent to miss a real one. It is on by default and no-ops gracefully if the semantic model isn’t loaded (the bot falls back to the static never-interrupt phrase list).

Proactive Re-engagement

Under Detection → Proactive Re-engagement is an opt-in toggle for one of the most requested behaviours in long-form support and onboarding calls. When the caller pauses mid-sentence after a semantically incomplete utterance, such as “My ticket number is…” followed by silence, the agent fires a short, language-aware nudge (“I’m listening, go ahead”) in the caller’s own language, instead of either answering prematurely or sitting silent until the 30-second abandonment poke. Configure:

Pause delay (1.5–8.0 s) — how long the caller can pause mid-thought before the agent nudges. Default 3.0 s.
Max nudges per pause (1–3) — cap so the agent doesn’t keep prodding. Default 1.
Explicit phrases (optional) — leave empty for an LLM-generated nudge in the caller’s language. Adding phrases here locks the wording and is not auto-translated.

Off by default — terse or IVR-style agents usually don’t want it.
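Taken together, the pause delay and nudge cap suggest a decision like the one sketched below. The function and parameter names are hypothetical, and how the runtime judges an utterance to be incomplete is outside this sketch.

```python
def should_nudge(utterance_complete, pause_s, nudges_sent, *,
                 pause_delay_s=3.0, max_nudges=1):
    """Decide whether to send a re-engagement nudge during a pause."""
    if utterance_complete:
        return False  # a finished thought gets a normal reply instead
    if nudges_sent >= max_nudges:
        return False  # cap reached; do not keep prodding
    return pause_s >= pause_delay_s
```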

Testing Your Configuration

After making changes, test thoroughly:

Use the web call test — Click Test Call in the dashboard to hear your changes immediately.
Test edge cases — Try interrupting, staying silent, speaking quickly, and using unusual words.
Test on a real phone — Web call audio quality differs from phone audio. Always test over a real phone line before going live.
Compare providers — Try the same conversation with different TTS providers to find the best fit.
Get feedback — Have someone unfamiliar with the system test the call and provide honest feedback.

Next Steps

Inbound Calls — Apply your voice configuration to inbound call handling
Outbound Calls — Use your configured voice for outbound campaigns
Call Recording — Record calls to review voice quality over time
