Supported Formats & Limits

This page covers the technical details of how thinnestAI processes your knowledge sources — supported formats, size constraints, chunking strategy, and tips for getting the best results.

File Format Reference

Format	Extension	Max Size	Notes
PDF	`.pdf`	50 MB	Text-based PDFs work best. Scanned PDFs require OCR preprocessing.
Word Document	`.docx`	25 MB	Full formatting support. Headings, lists, and tables are preserved.
Plain Text	`.txt`	10 MB	Simple and reliable. Best for clean, unformatted content.
Markdown	`.md`, `.mdx`	10 MB	Formatting (headings, lists, code blocks) is preserved during chunking.
Excel	`.xlsx`	25 MB	Structured data with column headers. Each row becomes searchable content.
CSV	`.csv`	10 MB	Tabular data. Column headers are used as field names.
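A client-side check against the table above can catch unsupported or oversized files before upload. This is a minimal sketch, not part of any thinnestAI SDK: `MAX_SIZE_MB` and `check_upload` are hypothetical names, and it assumes 1 MB means 1024 × 1024 bytes, which this page does not specify.

```python
from pathlib import Path

# Per-extension limits in MB, copied from the File Format Reference table.
MAX_SIZE_MB = {
    ".pdf": 50, ".docx": 25, ".txt": 10, ".md": 10,
    ".mdx": 10, ".xlsx": 25, ".csv": 10,
}

def check_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, reason) for a file name and size. Hypothetical helper."""
    ext = Path(filename).suffix.lower()
    limit_mb = MAX_SIZE_MB.get(ext)
    if limit_mb is None:
        return False, f"unsupported format: {ext or 'no extension'}"
    # Assumption: 1 MB = 1024 * 1024 bytes.
    if size_bytes > limit_mb * 1024 * 1024:
        return False, f"{ext} files are limited to {limit_mb} MB"
    return True, "ok"
```

For example, `check_upload("notes.txt", 11 * 1024 * 1024)` rejects the file because plain text is capped at 10 MB.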

Source Type Limits

Source Type	Limit	Notes
File Upload	Up to 50 MB per file, depending on format	Batch upload supported (multiple files at once)
URL	Single page	One page per URL source. Add multiple URLs for multi-page sites.
YouTube	Videos with captions	Transcript must be available (auto-generated or manual)
Text Input	100,000 characters	Roughly 15,000 to 20,000 words of English text
GitHub	100 MB per repo import	Filter by path and file type for large repos
Azure Blob	50 MB per file	Same file format requirements as direct upload
SharePoint	50 MB per file	Same file format requirements as direct upload
Excel	25 MB, 100,000 rows	Large workbooks may take longer to process

Knowledge Base Limits

Limit	Value
Sources per knowledge base	500
Knowledge bases per agent	10
Total storage	Depends on your plan

Chunking and Processing

When you add a source, thinnestAI breaks the content into smaller pieces called chunks. This is essential for accurate retrieval — the agent finds the specific chunk that answers the question, rather than searching through an entire document.

How Chunking Works

1. Content extraction: Text is extracted from the source format (PDF parsing, HTML scraping, etc.).
2. Cleaning: Headers, footers, navigation elements, and other noise are removed.
3. Splitting: Content is split into chunks at natural boundaries:
   - Headings and section breaks
   - Paragraph boundaries
   - Sentence boundaries (for very long paragraphs)
4. Overlap: Adjacent chunks share a small overlap to preserve context across boundaries.

Chunk Parameters

Parameter	Default	Description
Chunk size	~1,000 tokens	Target size for each chunk
Chunk overlap	~200 tokens	Overlap between consecutive chunks
Splitting strategy	Semantic	Splits at headings and paragraphs first, then sentences
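To make the interaction between chunk size and overlap concrete, here is a simplified sketch of a paragraph-first chunker. It is not thinnestAI's implementation: `chunk_text` is a hypothetical function, whitespace-separated words stand in for tokens, and overlong paragraphs are cut at word boundaries instead of sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Greedy paragraph-first chunker. Words stand in for tokens (assumption)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # room left for new words after the carried tail
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    for words in paragraphs:
        # Cut paragraphs that cannot fit into one chunk into smaller pieces.
        pieces = [words[i:i + step] for i in range(0, len(words), step)]
        for piece in pieces:
            if current and len(current) + len(piece) > chunk_size:
                chunks.append(current)
                # Start the next chunk with the tail of the previous one.
                current = current[-overlap:] if overlap else []
            current = current + piece
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]
```

With `chunk_size=10` and `overlap=2`, three six-word paragraphs produce three chunks, and each chunk after the first begins with the last two words of the one before it.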

Why Chunk Size Matters

Too large — Chunks contain mixed topics, making retrieval less precise.
Too small — Chunks lack context, making answers less comprehensive.
Just right — Each chunk covers one topic or concept completely.

The default settings work well for most content. Because the splitter follows headings and paragraphs first, chunk boundaries adapt to your document structure rather than falling at a fixed length.

Embedding Models

thinnestAI converts each chunk into a vector embedding — a numerical representation that captures the semantic meaning of the text. These embeddings power semantic search.

How Embeddings Work

"What is your return policy?"
        │
        ▼
   Embedding Model
        │
        ▼
[0.023, -0.156, 0.891, ...]  ← 1536-dimensional vector

Similar concepts produce similar vectors, so a search for “refund process” will find chunks about “return policy” even though the words are different.
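As an illustration of what "similar vectors" means, the sketch below compares made-up three-dimensional vectors using cosine similarity. Cosine similarity is one common measure; this page does not state which metric thinnestAI uses, and the vectors are invented for the example (real embeddings here have 1536 dimensions).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, for illustration only.
return_policy = [0.9, 0.1, 0.2]
refund_process = [0.8, 0.2, 0.3]
shipping_times = [0.1, 0.9, 0.1]

related = cosine_similarity(return_policy, refund_process)
unrelated = cosine_similarity(return_policy, shipping_times)
print(related > unrelated)  # the related pair scores higher
```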

Embedding Details

Property	Value
Model	OpenAI `text-embedding-3-small` (default)
Dimensions	1536
Max tokens per chunk	8,191

Vector Storage

Embeddings are stored in a PostgreSQL database with the pgvector extension, enabling fast similarity search at scale. For larger deployments, thinnestAI also supports LanceDB and ChromaDB as vector storage backends.

Search & Retrieval

By default, the system runs a hybrid search when your agent queries the knowledge base. The table below compares it with the two search types it combines:

Search Type	How It Works	Best For
Semantic search	Converts the query to a vector and finds the closest chunks by meaning	Conversational queries, synonym matching
Full-text search	Keyword matching using PostgreSQL full-text search	Exact terms, names, codes, acronyms
Hybrid (default)	Combines both and reranks results	Best overall accuracy
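This page does not say how the two result lists are combined. One widely used technique is reciprocal rank fusion (RRF), sketched below purely as an illustration; the function name, the chunk IDs, and the constant `k = 60` are assumptions, not documented thinnestAI behavior.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; items ranked high in several lists win."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the chunk's total score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# Hypothetical result lists, best match first.
semantic = ["c3", "c1", "c7"]
full_text = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([semantic, full_text])
```

Chunks that appear in both lists (`c1`, `c3`) end up ahead of chunks found by only one search type.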

Retrieval Parameters

Parameter	Default	Description
Top K	5	Number of chunks returned per search
Similarity threshold	0.7	Minimum relevance score (0 to 1) a chunk needs to be returned
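The two parameters can be pictured as a filter followed by a cut-off. The sketch below assumes the threshold is applied before Top K and that a score equal to the threshold passes; neither detail is stated on this page, and `select_chunks` is a hypothetical helper.

```python
def select_chunks(
    scored: list[tuple[str, float]], top_k: int = 5, threshold: float = 0.7
) -> list[tuple[str, float]]:
    """Keep chunks scoring at least `threshold`, then return the best `top_k`."""
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

# Hypothetical (chunk ID, relevance score) pairs.
candidates = [
    ("a", 0.91), ("b", 0.65), ("c", 0.78), ("d", 0.70),
    ("e", 0.88), ("f", 0.95), ("g", 0.72),
]
selected = select_chunks(candidates)
```

Here `b` is dropped by the threshold, and `d` passes the threshold but falls outside the top five.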

Best Practices for Knowledge Quality

Document Structure

Good structure produces better chunks and more accurate retrieval.

Do:

- Use clear headings (H1, H2, H3) to organize content.
- Write one topic per section.
- Use bullet points and numbered lists for key information.
- Include a table of contents for long documents.

Don’t:

- Put multiple unrelated topics in one section.
- Use images as the primary way to convey information (images aren’t indexed).
- Include large blocks of unformatted text without breaks.

Content Quality

Be specific — “Our PAYG plan charges ₹1.50/min for voice and ₹0.50/message for chat” is better than “See our pricing page for details.”
Be complete — Each section should be self-contained. Don’t rely on context from other sections.
Be current — Remove outdated content. Old information can cause incorrect answers.
Avoid contradictions — If two sources say different things, the agent may give inconsistent answers.

Organizing Knowledge Bases

Approach	When to Use
Single knowledge base	Small content set (under 50 sources), all related topics
Multiple knowledge bases	Large content sets, distinct topic areas, different teams
Per-agent knowledge	Each agent handles a specific domain with its own knowledge

Example organization for a SaaS company:

Knowledge Base	Content
Product Docs	Feature guides, tutorials, API reference
Company Policies	Refund policy, terms of service, privacy policy
Sales & Pricing	Plan details, pricing tables, comparison charts
Internal FAQ	Common customer questions and answers

Testing Your Knowledge

After adding sources, test with real questions:

Ask direct questions — “What’s the refund policy?” Should return exact information.
Ask rephrased questions — “Can I get my money back?” Should find the same content.
Ask edge cases — “What happens if I cancel after 60 days?” Should find the closest relevant policy.
Check for gaps — If the agent says “I don’t have information about that,” you may need to add more content.

Troubleshooting

Issue	Cause	Solution
Agent says “I don’t know”	Content not in knowledge base	Add the missing information as a source
Wrong or outdated answers	Stale sources	Delete and re-add updated content
PDF not processing	Scanned image PDF	Use an OCR tool to make the PDF text-searchable first
Slow processing	Very large file	Split into smaller files or filter to relevant sections
Irrelevant search results	Content too broad or mixed	Improve document structure with clear headings and focused sections

Next Steps

Knowledge Sources — Step-by-step guide for each source type.
Knowledge Overview — Creating and assigning knowledge bases.
