Supported Formats & Limits
Reference for supported file formats, size limits, chunking behavior, embedding models, and best practices for knowledge quality.
This page covers the technical details of how thinnestAI processes your knowledge sources — supported formats, size constraints, chunking strategy, and tips for getting the best results.
File Format Reference
| Format | Extension | Max Size | Notes |
|---|---|---|---|
| PDF | .pdf | 50 MB | Text-based PDFs work best. Scanned PDFs require OCR preprocessing. |
| Word Document | .docx | 25 MB | Full formatting support. Headings, lists, and tables are preserved. |
| Plain Text | .txt | 10 MB | Simple and reliable. Best for clean, unformatted content. |
| Markdown | .md, .mdx | 10 MB | Formatting (headings, lists, code blocks) is preserved during chunking. |
| Excel | .xlsx | 25 MB | Structured data with column headers. Each row becomes searchable content. |
| CSV | .csv | 10 MB | Tabular data. Column headers are used as field names. |
Source Type Limits
| Source Type | Limit | Notes |
|---|---|---|
| File Upload | 50 MB per file | Batch upload supported (multiple files at once) |
| URL | Single page | One page per URL source. Add multiple URLs for multi-page sites. |
| YouTube | Videos with captions | Transcript must be available (auto-generated or manual) |
| Text Input | 100,000 characters | Approximately 25,000 words |
| GitHub | 100 MB per repo import | Filter by path and file type for large repos |
| Azure Blob | 50 MB per file | Same file format requirements as direct upload |
| SharePoint | 50 MB per file | Same file format requirements as direct upload |
| Excel | 25 MB, 100,000 rows | Large workbooks may take longer to process |
Knowledge Base Limits
| Limit | Value |
|---|---|
| Sources per knowledge base | 500 |
| Knowledge bases per agent | 10 |
| Total storage | Depends on your plan |
Chunking and Processing
When you add a source, thinnestAI breaks the content into smaller pieces called chunks. This is essential for accurate retrieval — the agent finds the specific chunk that answers the question, rather than searching through an entire document.
How Chunking Works
- Content extraction — Text is extracted from the source format (PDF parsing, HTML scraping, etc.).
- Cleaning — Headers, footers, navigation elements, and other noise are removed.
- Splitting — Content is split into chunks at natural boundaries:
- Headings and section breaks
- Paragraph boundaries
- Sentence boundaries (for very long paragraphs)
- Overlap — Adjacent chunks share a small overlap to preserve context across boundaries.
Chunk Parameters
| Parameter | Default | Description |
|---|---|---|
| Chunk size | ~1,000 tokens | Target size for each chunk |
| Chunk overlap | ~200 tokens | Overlap between consecutive chunks |
| Splitting strategy | Semantic | Splits at headings and paragraphs first, then sentences |
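The effect of chunk size and overlap can be sketched with a simple sliding window. This is not thinnestAI's actual splitter — the real pipeline counts tokens and prefers heading, paragraph, and sentence boundaries — but a character-based sketch shows how the overlap preserves context across chunk boundaries:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows.

    Illustrative only: sizes are in characters here, while the real
    chunker measures tokens and splits at semantic boundaries first.
    """
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

doc = "x" * 2500
parts = chunk_text(doc)
# The last 200 characters of one chunk repeat as the first 200 of the next,
# so a sentence that straddles a boundary is fully contained in one chunk.
```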
Why Chunk Size Matters
- Too large — Chunks contain mixed topics, making retrieval less precise.
- Too small — Chunks lack context, making answers less comprehensive.
- Just right — Each chunk covers one topic or concept completely.
The default settings work well for most content. The system automatically adapts to your document structure.
Embedding Models
thinnestAI converts each chunk into a vector embedding — a numerical representation that captures the semantic meaning of the text. These embeddings power semantic search.
How Embeddings Work
```
"What is your return policy?"
            │
            ▼
     Embedding Model
            │
            ▼
[0.023, -0.156, 0.891, ...]  ← 1536-dimensional vector
```

Similar concepts produce similar vectors, so a search for "refund process" will find chunks about "return policy" even though the words are different.
Embedding Details
| Property | Value |
|---|---|
| Model | OpenAI text-embedding-3-small (default) |
| Dimensions | 1536 |
| Max tokens per chunk | 8,191 |
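Similarity between embeddings is typically measured with cosine similarity. A minimal sketch with toy 3-dimensional vectors (real embeddings have 1,536 dimensions, and these values are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Semantically close texts end up with nearby vectors.
return_policy  = [0.9, 0.1, 0.2]
refund_process = [0.8, 0.2, 0.3]
shipping_times = [0.1, 0.9, 0.1]

# "refund process" scores far closer to "return policy" than an
# unrelated topic does, which is why synonym queries still match.
```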
Vector Storage
Embeddings are stored in a PostgreSQL database with the pgvector extension, enabling fast similarity search at scale. For larger deployments, thinnestAI also supports LanceDB and ChromaDB as vector storage backends.
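thinnestAI's schema isn't documented here, but assuming a hypothetical `chunks` table with an `embedding vector(1536)` column, a pgvector nearest-neighbor query looks roughly like the following (pgvector's `<=>` operator computes cosine distance, so `1 - distance` gives a similarity score):

```python
# Sketch of the similarity query a retrieval layer might issue.
# Table and column names are assumptions, not thinnestAI's real schema.
query = """
SELECT id, content, 1 - (embedding <=> %(query_vec)s) AS similarity
FROM chunks
WHERE 1 - (embedding <=> %(query_vec)s) >= %(threshold)s
ORDER BY embedding <=> %(query_vec)s
LIMIT %(top_k)s;
"""
params = {"query_vec": "[0.023, -0.156, ...]", "threshold": 0.7, "top_k": 5}
```

An index such as HNSW or IVFFlat on the `embedding` column is what keeps this fast at scale.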
Search & Retrieval
When your agent searches the knowledge base, the system runs a hybrid search:
| Search Type | How It Works | Best For |
|---|---|---|
| Semantic search | Converts the query to a vector and finds the closest chunks by meaning | Conversational queries, synonym matching |
| Full-text search | Keyword matching using PostgreSQL full-text search | Exact terms, names, codes, acronyms |
| Hybrid (default) | Combines both and reranks results | Best overall accuracy |
Retrieval Parameters
| Parameter | Default | Description |
|---|---|---|
| Top K | 5 | Number of chunks returned per search |
| Similarity threshold | 0.7 | Minimum relevance score (0-1) |
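How the pieces fit together can be sketched as a score-fusion step. Production hybrid search usually fuses results with reciprocal rank fusion or a cross-encoder reranker; this sketch uses a simple weighted average, with the Top K and similarity-threshold parameters from the table above:

```python
def hybrid_rank(semantic: dict[str, float], keyword: dict[str, float],
                top_k: int = 5, threshold: float = 0.7,
                weight: float = 0.5) -> list[tuple[str, float]]:
    """Blend two score maps (chunk id -> score in [0, 1]) and keep the best.

    Illustrative fusion only; the real reranker is not specified here.
    """
    ids = set(semantic) | set(keyword)
    fused = {i: weight * semantic.get(i, 0.0) + (1 - weight) * keyword.get(i, 0.0)
             for i in ids}
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    # Drop anything below the similarity threshold, then take Top K.
    return [(i, s) for i, s in ranked if s >= threshold][:top_k]

semantic = {"chunk-a": 0.92, "chunk-b": 0.81, "chunk-c": 0.40}
keyword  = {"chunk-a": 0.70, "chunk-c": 0.95, "chunk-d": 0.60}
results = hybrid_rank(semantic, keyword)
# Only chunk-a scores well on BOTH signals, so it alone clears the threshold.
```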
Best Practices for Knowledge Quality
Document Structure
Good structure produces better chunks and more accurate retrieval.
Do:
- Use clear headings (H1, H2, H3) to organize content.
- Write one topic per section.
- Use bullet points and numbered lists for key information.
- Include a table of contents for long documents.
Don't:
- Put multiple unrelated topics in one section.
- Use images as the primary way to convey information (images aren't indexed).
- Include large blocks of unformatted text without breaks.
Content Quality
- Be specific — "Our PAYG plan charges ₹1.50/min for voice and ₹0.50/message for chat" is better than "See our pricing page for details."
- Be complete — Each section should be self-contained. Don't rely on context from other sections.
- Be current — Remove outdated content. Old information can cause incorrect answers.
- Avoid contradictions — If two sources say different things, the agent may give inconsistent answers.
Organizing Knowledge Bases
| Approach | When to Use |
|---|---|
| Single knowledge base | Small content set (under 50 sources), all related topics |
| Multiple knowledge bases | Large content sets, distinct topic areas, different teams |
| Per-agent knowledge | Each agent handles a specific domain with its own knowledge |
Example organization for a SaaS company:
| Knowledge Base | Content |
|---|---|
| Product Docs | Feature guides, tutorials, API reference |
| Company Policies | Refund policy, terms of service, privacy policy |
| Sales & Pricing | Plan details, pricing tables, comparison charts |
| Internal FAQ | Common customer questions and answers |
Testing Your Knowledge
After adding sources, test with real questions:
- Ask direct questions — "What's the refund policy?" Should return exact information.
- Ask rephrased questions — "Can I get my money back?" Should find the same content.
- Ask edge cases — "What happens if I cancel after 60 days?" Should find the closest relevant policy.
- Check for gaps — If the agent says "I don't have information about that," you may need to add more content.
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Agent says "I don't know" | Content not in knowledge base | Add the missing information as a source |
| Wrong or outdated answers | Stale sources | Delete and re-add updated content |
| PDF not processing | Scanned image PDF | Use an OCR tool to make the PDF text-searchable first |
| Slow processing | Very large file | Split into smaller files or filter to relevant sections |
| Irrelevant search results | Content too broad or mixed | Improve document structure with clear headings and focused sections |
Next Steps
- Knowledge Sources — Step-by-step guide for each source type.
- Knowledge Overview — Creating and assigning knowledge bases.