Speech-to-Text

Transcribe Audio

Transcribe audio files using Vega or Dhara STT engines.

POST /api/stt/transcribe

Transcribe an audio file. Supports WAV, MP3, FLAC, OGG, and WebM formats.

No query or path parameters for this endpoint; all inputs are sent in the multipart request body.


Request (multipart/form-data)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio | file | Yes | | Audio file (WAV, MP3, FLAC, OGG, WebM). Max 40s per request. |
| language | string | Yes | | Language code (e.g. hi, en, ta, bn). See Languages. |
| model | string | No | vega | STT engine: vega (~100ms, live + file) or dhara (~400ms, file only, word timestamps). |
| punctuate | boolean | No | true | Add punctuation and capitalization automatically. |
| kenlm | boolean | No | true | Use KenLM language model for higher accuracy. Adds ~50ms. Available for 25 Indian languages. |
| timestamps | boolean | No | false | Include word-level timestamps (Dhara only). |
| diarize | boolean | No | false | Speaker diarization via pyannote. File only. |
| sentiment | boolean | No | false | Detect overall sentiment (positive/negative/neutral). |
| format_numbers | boolean | No | false | Auto-format numbers, currency, percentages in transcript. |
| vad_filter | boolean | No | true | Voice Activity Detection — skip silence segments (Dhara only). Auto-disabled for audio < 10s. |
| bgm_filter | boolean | No | false | Background music removal via spectral subtraction (Dhara only). |
| enhance | boolean | No | true | Auto Gain Control — normalize volume for speakerphone/far-field audio. Critical for Indian conditions. |
| denoise | boolean | No | false | DeepFilterNet3 neural noise suppression — removes traffic, crowds, TV, music background. ~10ms overhead. |
| redact | boolean | No | false | Redact Indian PII: Aadhaar, PAN, phone, credit card, email, UPI. |
| keywords | string | No | | Comma-separated vocabulary boosting terms (company names, jargon). |
| utterances | boolean | No | false | Speaker-segmented utterances (requires diarize=true). |
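As a client-side sketch, the form fields above can be assembled and sanity-checked before upload. The helper below is hypothetical (not part of any official SDK); the allowed formats and the 40s cap come from this page, and booleans are assumed to be sent as lowercase strings in form data.

```python
# Hypothetical helper for assembling form fields for POST /api/stt/transcribe.
# Field names mirror the request table; validation limits are from this page.

ALLOWED_EXTENSIONS = {"wav", "mp3", "flac", "ogg", "webm"}
MAX_DURATION_S = 40.0

def build_transcribe_fields(filename, language, duration_s,
                            model="vega", **options):
    """Return a dict of form fields, raising on obvious client-side errors."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext}")
    if duration_s > MAX_DURATION_S:
        raise ValueError("audio exceeds the 40s per-request limit")
    fields = {"language": language, "model": model}
    # Boolean flags (punctuate, kenlm, ...) are assumed to be sent
    # as lowercase strings in multipart form data.
    for key, value in options.items():
        fields[key] = str(value).lower() if isinstance(value, bool) else str(value)
    return fields
```

The audio file itself would be attached separately as the `audio` part of the multipart body.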

Audio Enhancement Pipeline

All audio is processed through an enhancement pipeline before ASR inference. This dramatically improves accuracy in real-world Indian conditions.

Auto Gain Control (enhance=true, default ON)

Normalizes audio volume to a target RMS level. Essential for:

  • Speakerphone calls — far-field mic picks up very quiet audio
  • Volume variation — different devices, network conditions
  • Indian call centers — agents with headsets vs customers on speakerphone

Without AGC, quiet audio produces garbage: "नत ेरा न आ श". With AGC: "नमस्ते, मेरा नाम आशुतोष है".
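Conceptually, AGC amounts to scaling samples toward a target RMS level. The sketch below illustrates the idea only; the server's actual target level and gain cap are not documented here, so the -20 dBFS target and 30x cap are assumed placeholders.

```python
# Illustrative sketch of Auto Gain Control as RMS normalization.
# Target level (-20 dBFS) and gain cap (30x) are assumptions, not
# the server's documented values.
import math

TARGET_RMS = 10 ** (-20 / 20)  # -20 dBFS, assumed target

def apply_agc(samples, target_rms=TARGET_RMS, max_gain=30.0):
    """Scale float samples in [-1, 1] toward a target RMS level."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)
    gain = min(target_rms / rms, max_gain)  # cap gain so near-silence isn't blown up
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```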

Neural Noise Suppression (denoise=true, opt-in)

DeepFilterNet3 removes background noise while preserving speech:

  • Traffic noise — auto-rickshaws, honking, construction
  • Crowd noise — markets, multi-speaker households
  • TV/music — Bollywood, news channels running in background
  • Wind — outdoor/agricultural recordings

Adds ~10ms latency. Enable for noisy environments.

Telephony Detection (automatic)

Auto-detects 8kHz narrowband audio (2G/GSM calls from Airtel, BSNL, Vi) and optimizes processing. No configuration needed — the server detects sample rate and adjusts automatically.
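The server performs this detection itself, but a client can replicate the sample-rate check for logging or routing, e.g. with the Python stdlib `wave` module. The 8 kHz threshold matches the GSM/2G telephony rate mentioned above; this helper is illustrative, not part of any SDK.

```python
# Client-side replica of narrowband detection: read the WAV header
# and compare the sample rate against the 8 kHz telephony rate.
import io
import wave

def is_narrowband_wav(wav_bytes, threshold_hz=8000):
    """True if the WAV file's sample rate is at or below telephony rates."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getframerate() <= threshold_hz
```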

Echo Cancellation

AEC (Acoustic Echo Cancellation) is handled at the transport layer by LiveKit/WebRTC before audio reaches the STT server. This is the correct architecture — the media pipeline removes echo before transcription.


Response

{
  "text": "नमस्ते, मेरा नाम अशुतोष है।",
  "language": "hi",
  "duration_seconds": 5.2,
  "processing_ms": 372.1,
  "model": "vega",
  "provider": "thinnestai",
  "confidence": 0.94,
  "words": [
    { "word": "नमस्ते", "start": 0.0, "end": 0.52, "confidence": 0.97 }
  ],
  "speakers": [
    { "speaker": "A", "start": 0.0, "end": 2.1, "text": "नमस्ते" }
  ],
  "sentiment": { "label": "positive", "score": 0.82 },
  "kenlm_used": true,
  "text_raw": "namaste mera naam ashutosh hai"
}
| Field | Type | Description |
| --- | --- | --- |
| text | string | Final transcribed text with punctuation and formatting applied. |
| language | string | Detected or requested language code. |
| duration_seconds | number | Audio duration in seconds. |
| processing_ms | number | Server-side processing time in milliseconds. |
| model | string | Engine used (vega or dhara). |
| provider | string | Provider identifier. |
| confidence | number | Overall confidence score (0-1). Dhara only. |
| words | array | Word-level timestamps. Only when timestamps=true (Dhara). |
| speakers | array | Speaker diarization segments. Only when diarize=true. |
| sentiment | object | Sentiment analysis result. Only when sentiment=true. |
| kenlm_used | boolean | Whether the KenLM language model was applied. |
| text_raw | string | Raw transcript before PII redaction. Only when redact=true. |
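Since several response fields are conditional on request flags, a client should treat them as optional when unpacking the payload. A minimal sketch, assuming the field names above:

```python
# Unpack a transcription response, keeping conditional sections
# (words, speakers, sentiment, text_raw, confidence) only when present.
import json

def parse_transcription(payload):
    data = json.loads(payload)
    result = {
        "text": data["text"],
        "language": data["language"],
        "model": data["model"],
        "duration_seconds": data["duration_seconds"],
    }
    # Optional sections: present only when the matching request flag
    # was set (or, for confidence, when the Dhara engine was used).
    for optional in ("words", "speakers", "sentiment", "text_raw", "confidence"):
        if optional in data:
            result[optional] = data[optional]
    return result
```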

Models

| Model | Latency | Best For | Languages |
| --- | --- | --- | --- |
| Vega | ~100ms | Live voice agents, real-time streaming | 200+ (22 Indian + international) |
| Dhara | ~400ms | File transcription, meetings, word timestamps | 200+ |
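The trade-off above can be encoded as a simple selection rule: Vega for live, low-latency use; Dhara when word timestamps (or other file-only features) are needed. This helper is a hypothetical illustration, not part of any SDK.

```python
# Pick an STT engine from the constraints documented in the Models
# table: Dhara provides word timestamps but is file-only.
def choose_model(live=False, need_timestamps=False):
    if need_timestamps:
        if live:
            raise ValueError("word timestamps require dhara, which is file-only")
        return "dhara"
    return "vega"
```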

Authentication

Include your API key in the Authorization header:

curl -X POST https://api.thinnest.ai/api/stt/transcribe \
  -H "Authorization: Bearer thns_sk_your_key_here" \
  -F "audio=@meeting.wav" \
  -F "language=hi" \
  -F "model=vega" \
  -F "punctuate=true" \
  -F "kenlm=true"

Pricing

| Component | Cost |
| --- | --- |
| Transcription | ₹0.25/min |
| All features (timestamps, diarization, sentiment, PII redaction) | Included |
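At the listed rate, cost scales linearly with audio duration. Whether billing is prorated per second or rounded up per minute is not stated on this page; the estimator below assumes per-second proration.

```python
# Cost estimate at the listed ₹0.25/min rate. Per-second proration
# is an assumption; actual billing granularity is not documented here.
def estimate_cost_inr(duration_seconds, rate_per_min=0.25):
    return round(duration_seconds / 60 * rate_per_min, 4)
```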
