Transcribe Audio
Transcribe audio files using Vega or Dhara STT engines.
POST /api/stt/transcribe

Transcribe an audio file. Supports WAV, MP3, FLAC, OGG, and WebM formats.
This endpoint takes no query or path parameters; all inputs are sent as multipart form fields.
Request (multipart/form-data)
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| audio | file | Yes | — | Audio file (WAV, MP3, FLAC, OGG, WebM). Max 40s per request. |
| language | string | Yes | — | Language code (e.g. hi, en, ta, bn). See Languages. |
| model | string | No | vega | STT engine: vega (~100ms, live + file) or dhara (~400ms, file only, word timestamps). |
| punctuate | boolean | No | true | Add punctuation and capitalization automatically. |
| kenlm | boolean | No | true | Use the KenLM language model for higher accuracy. Adds ~50ms. Available for 25 Indian languages. |
| timestamps | boolean | No | false | Include word-level timestamps (Dhara only). |
| diarize | boolean | No | false | Speaker diarization via pyannote. File only. |
| sentiment | boolean | No | false | Detect overall sentiment (positive/negative/neutral). |
| format_numbers | boolean | No | false | Auto-format numbers, currency, and percentages in the transcript. |
| vad_filter | boolean | No | true | Voice Activity Detection — skip silence segments (Dhara only). Auto-disabled for audio < 10s. |
| bgm_filter | boolean | No | false | Background music removal via spectral subtraction (Dhara only). |
| enhance | boolean | No | true | Auto Gain Control — normalize volume for speakerphone/far-field audio. Critical for Indian conditions. |
| denoise | boolean | No | false | DeepFilterNet3 neural noise suppression — removes traffic, crowds, TV, and music in the background. ~10ms overhead. |
| redact | boolean | No | false | Redact Indian PII: Aadhaar, PAN, phone, credit card, email, UPI. |
| keywords | string | No | — | Comma-separated vocabulary-boosting terms (company names, jargon). |
| utterances | boolean | No | false | Speaker-segmented utterances (requires diarize=true). |
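The form fields above map straight onto a multipart request. A minimal client-side sketch in Python using `requests` (the file name, API key, and keyword values are placeholders; boolean flags travel as form strings):

```python
import requests

# Build (but do not send) a multipart transcription request.
# The key and file bytes below are placeholders, not real credentials/audio.
req = requests.Request(
    "POST",
    "https://api.thinnest.ai/api/stt/transcribe",
    headers={"Authorization": "Bearer thns_sk_your_key_here"},
    files={"audio": ("meeting.wav", b"<wav bytes>", "audio/wav")},
    data={
        "language": "hi",            # required
        "model": "vega",             # "dhara" for timestamps/vad_filter/bgm_filter
        "punctuate": "true",
        "kenlm": "true",
        "keywords": "Thinnest,UPI",  # vocabulary boosting (illustrative terms)
    },
)
prepared = req.prepare()
# resp = requests.Session().send(prepared)  # uncomment to actually call the API
```

Preparing the request without sending it makes the multipart encoding easy to inspect before spending quota.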
Audio Enhancement Pipeline
All audio is processed through an enhancement pipeline before ASR inference. This dramatically improves accuracy in real-world Indian conditions.
Auto Gain Control (enhance=true, default ON)
Normalizes audio volume to a target RMS level. Essential for:
- Speakerphone calls — far-field mic picks up very quiet audio
- Volume variation — different devices, network conditions
- Indian call centers — agents with headsets vs customers on speakerphone
Without AGC, quiet audio produces garbage: "नत ेरा न आ श". With AGC: "नमस्ते, मेरा नाम आशुतोष है".
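The AGC step can be pictured as simple RMS normalization: measure the signal's energy, then scale it toward a target level. A toy sketch in pure Python (not the server's actual DSP; the target level is an arbitrary assumption):

```python
import math

def apply_agc(samples, target_rms=0.1):
    """Scale float samples (-1..1) so their RMS hits target_rms.
    Toy illustration of auto gain control, not the production algorithm."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return samples  # pure silence: nothing to scale
    gain = target_rms / rms
    # Clamp to avoid clipping after the gain is applied
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

quiet = [0.001, -0.001, 0.002, -0.002]  # far-field speakerphone level
boosted = apply_agc(quiet)              # ~63x gain brings it up to target
```

Real AGC tracks level over time with attack/release smoothing rather than scaling a whole buffer at once, but the energy-normalization idea is the same.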
Neural Noise Suppression (denoise=true, opt-in)
DeepFilterNet3 removes background noise while preserving speech:
- Traffic noise — auto-rickshaws, honking, construction
- Crowd noise — markets, multi-speaker households
- TV/music — Bollywood, news channels running in background
- Wind — outdoor/agricultural recordings
Adds ~10ms latency. Enable for noisy environments.
Telephony Detection (automatic)
Auto-detects 8kHz narrowband audio (2G/GSM calls from Airtel, BSNL, Vi) and optimizes processing. No configuration needed — the server detects sample rate and adjusts automatically.
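For a WAV upload, the sample rate the server inspects lives in the file header. A sketch of reading it from a canonical 44-byte RIFF/WAVE header (the server's actual detection logic may differ, e.g. for other containers):

```python
import struct

def wav_sample_rate(header: bytes) -> int:
    """Read the sample rate from a canonical 44-byte RIFF/WAVE header.
    The rate field sits at byte offset 24 in the fmt chunk."""
    if header[0:4] != b"RIFF" or header[8:12] != b"WAVE":
        raise ValueError("not a canonical WAV header")
    return struct.unpack_from("<I", header, 24)[0]

# Minimal fake 8 kHz header for demonstration
hdr = bytearray(44)
hdr[0:4] = b"RIFF"
hdr[8:12] = b"WAVE"
struct.pack_into("<I", hdr, 24, 8000)
is_telephony = wav_sample_rate(bytes(hdr)) == 8000  # narrowband GSM-style audio
```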
Echo Cancellation
AEC (Acoustic Echo Cancellation) is handled at the transport layer by LiveKit/WebRTC before audio reaches the STT server. This is the correct architecture — the media pipeline removes echo before transcription.
Response
```json
{
  "text": "नमस्ते, मेरा नाम अशुतोष है।",
  "language": "hi",
  "duration_seconds": 5.2,
  "processing_ms": 372.1,
  "model": "vega",
  "provider": "thinnestai",
  "confidence": 0.94,
  "words": [
    { "word": "नमस्ते", "start": 0.0, "end": 0.52, "confidence": 0.97 }
  ],
  "speakers": [
    { "speaker": "A", "start": 0.0, "end": 2.1, "text": "नमस्ते" }
  ],
  "sentiment": { "label": "positive", "score": 0.82 },
  "kenlm_used": true,
  "text_raw": "namaste mera naam ashutosh hai"
}
```

| Field | Type | Description |
|---|---|---|
| text | string | Final transcribed text with punctuation and formatting applied. |
| language | string | Detected or requested language code. |
| duration_seconds | number | Audio duration in seconds. |
| processing_ms | number | Server-side processing time in milliseconds. |
| model | string | Engine used (vega or dhara). |
| confidence | number | Overall confidence score (0-1). Dhara only. |
| words | array | Word-level timestamps. Only when timestamps=true (Dhara). |
| speakers | array | Speaker diarization segments. Only when diarize=true. |
| sentiment | object | Sentiment analysis result. Only when sentiment=true. |
| kenlm_used | boolean | Whether the KenLM language model was applied. |
| text_raw | string | Raw transcript before PII redaction. Only when redact=true. |
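Because most of these fields only appear when the matching request flag was set, clients should treat them as optional when parsing. A small sketch using an illustrative payload shaped like the Response section above:

```python
import json

# Illustrative payload mirroring the Response example (values are not real)
payload = json.loads("""{
  "text": "नमस्ते, मेरा नाम अशुतोष है।",
  "language": "hi",
  "duration_seconds": 5.2,
  "processing_ms": 372.1,
  "model": "vega",
  "confidence": 0.94,
  "kenlm_used": true
}""")

# Guard every flag-dependent field instead of assuming it is present
words = payload.get("words", [])        # only with timestamps=true (Dhara)
speakers = payload.get("speakers", [])  # only with diarize=true
sentiment = payload.get("sentiment")    # only with sentiment=true
summary = f"[{payload['model']}] {payload['language']}: {payload['text']}"
```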
Models
| Model | Latency | Best For | Languages |
|---|---|---|---|
| Vega | ~100ms | Live voice agents, real-time streaming | 200+ (22 Indian + international) |
| Dhara | ~400ms | File transcription, meetings, word timestamps | 200+ |
Authentication
Include your API key in the Authorization header:
```shell
curl -X POST https://api.thinnest.ai/api/stt/transcribe \
  -H "Authorization: Bearer thns_sk_your_key_here" \
  -F "audio=@meeting.wav" \
  -F "language=hi" \
  -F "model=vega" \
  -F "punctuate=true" \
  -F "kenlm=true"
```

Pricing
| Component | Cost |
|---|---|
| Transcription | ₹0.25/min |
| All features (timestamps, diarization, sentiment, PII redaction) | Included |
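Since every feature is included, cost is just audio duration times the flat rate. A quick sanity check, assuming duration-proportional billing with no per-request minimum (the table doesn't specify rounding rules):

```python
RATE_INR_PER_MIN = 0.25  # flat transcription rate from the pricing table

def transcription_cost(duration_seconds: float) -> float:
    """Cost in INR at the flat per-minute transcription rate."""
    return round(duration_seconds / 60 * RATE_INR_PER_MIN, 4)

clip_cost = transcription_cost(40)         # one max-length (40s) request
ten_minutes = transcription_cost(10 * 60)  # e.g. audio split across requests
```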