Supported Formats & Limits
This page covers the technical details of how thinnestAI processes your knowledge sources: supported formats, size constraints, chunking strategy, and tips for getting the best results.
File Format Reference
| Format | Extension | Max Size | Notes |
|---|---|---|---|
| PDF | .pdf | 50 MB | Text-based PDFs work best. Scanned PDFs require OCR preprocessing. |
| Word Document | .docx | 25 MB | Full formatting support. Headings, lists, and tables are preserved. |
| Plain Text | .txt | 10 MB | Simple and reliable. Best for clean, unformatted content. |
| Markdown | .md, .mdx | 10 MB | Formatting (headings, lists, code blocks) is preserved during chunking. |
| Excel | .xlsx | 25 MB | Structured data with column headers. Each row becomes searchable content. |
| CSV | .csv | 10 MB | Tabular data. Column headers are used as field names. |
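
The limits in the table can be checked locally before uploading. The following is a minimal sketch, not part of thinnestAI: the function name `check_upload` is hypothetical, and it assumes 1 MB means 1024 × 1024 bytes, which this page does not specify.

```python
from pathlib import Path

# Per-format limits copied from the table above, in megabytes.
MAX_SIZE_MB = {
    ".pdf": 50,
    ".docx": 25,
    ".txt": 10,
    ".md": 10,
    ".mdx": 10,
    ".xlsx": 25,
    ".csv": 10,
}


def check_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, reason), judging a file only by its extension and size."""
    ext = Path(filename).suffix.lower()
    limit_mb = MAX_SIZE_MB.get(ext)
    if limit_mb is None:
        return False, f"unsupported extension: {ext or '(none)'}"
    # Assumption: 1 MB = 1024 * 1024 bytes.
    if size_bytes > limit_mb * 1024 * 1024:
        return False, f"{ext} files are limited to {limit_mb} MB"
    return True, "ok"


print(check_upload("handbook.docx", 30 * 1024 * 1024))
```

A check like this only catches extension and size problems. It cannot tell whether a PDF is scanned and needs OCR first.
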
Source Type Limits
| Source Type | Limit | Notes |
|---|---|---|
| File Upload | 50 MB per file | Batch upload supported (multiple files at once) |
| URL | Single page | One page per URL source. Add multiple URLs for multi-page sites. |
| YouTube | Videos with captions | Transcript must be available (auto-generated or manual) |
| Text Input | 100,000 characters | Roughly 25,000 tokens, or about 15,000 to 20,000 English words |
| GitHub | 100 MB per repo import | Filter by path and file type for large repos |
| Azure Blob | 50 MB per file | Same file format requirements as direct upload |
| SharePoint | 50 MB per file | Same file format requirements as direct upload |
| Excel | 25 MB, 100,000 rows | Large workbooks may take longer to process |
Knowledge Base Limits
| Limit | Value |
|---|---|
| Sources per knowledge base | 500 |
| Knowledge bases per agent | 10 |
| Total storage | Depends on your plan |
Chunking and Processing
When you add a source, thinnestAI breaks the content into smaller pieces called chunks. This is essential for accurate retrieval: the agent finds the specific chunk that answers the question, rather than searching through an entire document.
How Chunking Works
- Content extraction — Text is extracted from the source format (PDF parsing, HTML scraping, etc.).
- Cleaning — Headers, footers, navigation elements, and other noise are removed.
- Splitting — Content is split into chunks at natural boundaries:
  - Headings and section breaks
  - Paragraph boundaries
  - Sentence boundaries (for very long paragraphs)
- Overlap — Adjacent chunks share a small overlap to preserve context across boundaries.
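
The splitting order described above (headings first, then paragraphs, then sentences for long paragraphs) can be sketched as follows. This illustrates the idea and is not thinnestAI's actual splitter. The function name `split_at_boundaries` is hypothetical, word count stands in for token count, and headings are assumed to be Markdown `#` lines.

```python
import re


def split_at_boundaries(text: str, max_words: int = 50) -> list[str]:
    """Split text at headings, then paragraphs, then sentences.

    Assumption: word count approximates token count, to keep the
    sketch free of tokenizer dependencies.
    """
    pieces: list[str] = []
    # 1. Headings: start a new section before every Markdown heading line.
    sections = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    for section in sections:
        # 2. Paragraphs: blank lines separate paragraphs.
        for paragraph in re.split(r"\n\s*\n", section):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            if len(paragraph.split()) <= max_words:
                pieces.append(paragraph)
                continue
            # 3. Sentences: only used when a paragraph is too long.
            sentences = re.split(r"(?<=[.!?])\s+", paragraph)
            pieces.extend(s.strip() for s in sentences if s.strip())
    return pieces


sample = "# Intro\n\nShort paragraph.\n\n# Details\n\nOne. Two. Three."
print(split_at_boundaries(sample, max_words=2))
```

A real pipeline would then pack these pieces into chunks close to the target size and add overlap between neighbors.
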
Chunk Parameters
| Parameter | Default | Description |
|---|---|---|
| Chunk size | ~1,000 tokens | Target size for each chunk |
| Chunk overlap | ~200 tokens | Overlap between consecutive chunks |
| Splitting strategy | Semantic | Splits at headings and paragraphs first, then sentences |
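
To see how chunk size and overlap interact, here is a minimal sketch using fixed-size windows. It deliberately ignores the semantic splitting strategy and shows only the overlap arithmetic. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(
    tokens: list[str], size: int = 1000, overlap: int = 200
) -> list[list[str]]:
    """Cut a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new chunk starts this many tokens later
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [str(i) for i in range(2600)]
chunks = chunk_with_overlap(tokens)
print(len(chunks), [len(c) for c in chunks])
```

With the defaults, each chunk starts 800 tokens after the previous one, so a 2,600-token document yields three chunks, and each neighboring pair shares 200 tokens.
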
Why Chunk Size Matters
- Too large — Chunks contain mixed topics, making retrieval less precise.
- Too small — Chunks lack context, making answers less comprehensive.
- Just right — Each chunk covers one topic or concept completely.
Embedding Models
thinnestAI converts each chunk into a vector embedding, a numerical representation that captures the semantic meaning of the text. These embeddings power semantic search.
How Embeddings Work
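
Texts with similar meaning get vectors that point in similar directions, and that is what lets a search compare a query to a chunk. The sketch below assumes cosine similarity as the closeness measure, which this page does not name. It uses made-up three-dimensional vectors in place of real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
query = [0.9, 0.1, 0.0]
refund_chunk = [0.8, 0.2, 0.1]
shipping_chunk = [0.0, 0.2, 0.9]

print(cosine_similarity(query, refund_chunk) > cosine_similarity(query, shipping_chunk))
```
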
Embedding Details
| Property | Value |
|---|---|
| Model | OpenAI text-embedding-3-small (default) |
| Dimensions | 1536 |
| Max tokens per chunk | 8,191 |
Vector Storage
Embeddings are stored in a PostgreSQL database with the pgvector extension, enabling fast similarity search at scale. For larger deployments, thinnestAI also supports LanceDB and ChromaDB as vector storage backends.
Search & Retrieval
When your agent searches the knowledge base, the system runs a hybrid search:
| Search Type | How It Works | Best For |
|---|---|---|
| Semantic search | Converts the query to a vector and finds the closest chunks by meaning | Conversational queries, synonym matching |
| Full-text search | Keyword matching using PostgreSQL full-text search | Exact terms, names, codes, acronyms |
| Hybrid (default) | Combines both and reranks results | Best overall accuracy |
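
This page does not say how the two result lists are combined. One common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of thinnestAI's reranker. The chunk IDs are made up, and `k = 60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each chunk scores 1 / (k + rank) per list it appears in, so chunks
    ranked highly by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic_hits = ["c2", "c1", "c3"]   # hypothetical semantic search order
fulltext_hits = ["c1", "c4", "c2"]   # hypothetical full-text search order
print(reciprocal_rank_fusion([semantic_hits, fulltext_hits]))
```

Chunks `c1` and `c2` appear in both lists, so they outrank `c3` and `c4`, which each appear in only one.
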
Retrieval Parameters
| Parameter | Default | Description |
|---|---|---|
| Top K | 5 | Number of chunks returned per search |
| Similarity threshold | 0.7 | Minimum relevance score (0-1) |
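
The two parameters work together: the threshold drops weak matches, and Top K caps how many of the rest are returned. The sketch below assumes the threshold is applied before the Top K cut and that a score equal to the threshold is kept. This page does not confirm either detail. The function name and scores are invented.

```python
def select_chunks(
    scored: list[tuple[str, float]], top_k: int = 5, threshold: float = 0.7
) -> list[tuple[str, float]]:
    """Keep chunks scoring at least `threshold`, best first, at most `top_k`."""
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Made-up relevance scores for seven candidate chunks.
candidates = [
    ("a", 0.91), ("b", 0.65), ("c", 0.78), ("d", 0.70),
    ("e", 0.82), ("f", 0.88), ("g", 0.74),
]
print([chunk_id for chunk_id, _ in select_chunks(candidates)])
```

Here chunk `b` falls below the threshold, and chunk `d` passes it but is cut because five stronger chunks fill the Top K slots.
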
Best Practices for Knowledge Quality
Document Structure
Good structure produces better chunks and more accurate retrieval.
Do:
- Use clear headings (H1, H2, H3) to organize content.
- Write one topic per section.
- Use bullet points and numbered lists for key information.
- Include a table of contents for long documents.
Don’t:
- Put multiple unrelated topics in one section.
- Use images as the primary way to convey information (images aren’t indexed).
- Include large blocks of unformatted text without breaks.
Content Quality
- Be specific — “Our PAYG plan charges ₹1.50/min for voice and ₹0.50/message for chat” is better than “See our pricing page for details.”
- Be complete — Each section should be self-contained. Don’t rely on context from other sections.
- Be current — Remove outdated content. Old information can cause incorrect answers.
- Avoid contradictions — If two sources say different things, the agent may give inconsistent answers.
Organizing Knowledge Bases
| Approach | When to Use |
|---|---|
| Single knowledge base | Small content set (under 50 sources), all related topics |
| Multiple knowledge bases | Large content sets, distinct topic areas, different teams |
| Per-agent knowledge | Each agent handles a specific domain with its own knowledge |
Example of splitting content across multiple knowledge bases:
| Knowledge Base | Content |
|---|---|
| Product Docs | Feature guides, tutorials, API reference |
| Company Policies | Refund policy, terms of service, privacy policy |
| Sales & Pricing | Plan details, pricing tables, comparison charts |
| Internal FAQ | Common customer questions and answers |
Testing Your Knowledge
After adding sources, test with real questions:
- Ask direct questions: “What’s the refund policy?” should return the exact information.
- Ask rephrased questions: “Can I get my money back?” should find the same content.
- Ask edge cases: “What happens if I cancel after 60 days?” should find the closest relevant policy.
- Check for gaps: if the agent says “I don’t have information about that,” you may need to add more content.
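
Checks like these can be kept in a small script and rerun whenever sources change. The sketch below is hypothetical throughout: `run_checks` is not a thinnestAI function, `fake_agent` stands in for however you call your agent, and the canned answers are invented.

```python
from typing import Callable


def run_checks(
    ask: Callable[[str], str], cases: dict[str, list[str]]
) -> dict[str, bool]:
    """Ask each question; pass if every expected phrase is in the answer."""
    results: dict[str, bool] = {}
    for question, expected_phrases in cases.items():
        answer = ask(question).lower()
        results[question] = all(p.lower() in answer for p in expected_phrases)
    return results


# Stand-in for a real agent call; replace with your own client code.
def fake_agent(question: str) -> str:
    canned = {
        "What's the refund policy?": "Refunds are available within 30 days of purchase.",
        "Can I get my money back?": "Yes. Refunds are available within 30 days of purchase.",
    }
    return canned.get(question, "I don't have information about that.")


cases = {
    "What's the refund policy?": ["30 days"],         # direct question
    "Can I get my money back?": ["30 days"],          # rephrased question
    "What happens if I cancel after 60 days?": ["refund"],  # edge case
}

for question, passed in run_checks(fake_agent, cases).items():
    print("PASS" if passed else "GAP ", question)
```

A failing case points to either missing content or content that is phrased too differently from the question to be retrieved.
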
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Agent says “I don’t know” | Content not in knowledge base | Add the missing information as a source |
| Wrong or outdated answers | Stale sources | Delete and re-add updated content |
| PDF not processing | Scanned image PDF | Use an OCR tool to make the PDF text-searchable first |
| Slow processing | Very large file | Split into smaller files or filter to relevant sections |
| Irrelevant search results | Content too broad or mixed | Improve document structure with clear headings and focused sections |
Next Steps
- Knowledge Sources — Step-by-step guide for each source type.
- Knowledge Overview — Creating and assigning knowledge bases.

