Supported Formats & Limits

Reference for supported file formats, size limits, chunking behavior, embedding models, and best practices for knowledge quality.

This page covers the technical details of how thinnestAI processes your knowledge sources — supported formats, size constraints, chunking strategy, and tips for getting the best results.

File Format Reference

| Format | Extension | Max Size | Notes |
| --- | --- | --- | --- |
| PDF | .pdf | 50 MB | Text-based PDFs work best. Scanned PDFs require OCR preprocessing. |
| Word Document | .docx | 25 MB | Full formatting support. Headings, lists, and tables are preserved. |
| Plain Text | .txt | 10 MB | Simple and reliable. Best for clean, unformatted content. |
| Markdown | .md, .mdx | 10 MB | Formatting (headings, lists, code blocks) is preserved during chunking. |
| Excel | .xlsx | 25 MB | Structured data with column headers. Each row becomes searchable content. |
| CSV | .csv | 10 MB | Tabular data. Column headers are used as field names. |
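
If you upload files programmatically, it can help to check them against these limits before sending them. The following is a minimal sketch, not part of thinnestAI: `check_upload` and `MAX_BYTES` are hypothetical names, and it assumes 1 MB means 1024 × 1024 bytes, which this page does not specify.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes; the docs do not say which unit is used

# Per-format size limits copied from the format table (extension -> max bytes).
MAX_BYTES = {
    ".pdf": 50 * MB,
    ".docx": 25 * MB,
    ".txt": 10 * MB,
    ".md": 10 * MB,
    ".mdx": 10 * MB,
    ".xlsx": 25 * MB,
    ".csv": 10 * MB,
}


def check_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, reason) for a file name and size. Hypothetical helper."""
    ext = Path(filename).suffix.lower()
    limit = MAX_BYTES.get(ext)
    if limit is None:
        return False, f"unsupported extension: {ext or '(none)'}"
    if size_bytes > limit:
        return False, f"{ext} files are limited to {limit // MB} MB"
    return True, "ok"


if __name__ == "__main__":
    print(check_upload("handbook.pdf", 12 * MB))
    print(check_upload("contract.docx", 30 * MB))
```

A local check like this only saves a round trip; the service remains the authority on what it accepts.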

Source Type Limits

| Source Type | Limit | Notes |
| --- | --- | --- |
| File Upload | 50 MB per file | Batch upload supported (multiple files at once) |
| URL | Single page | One page per URL source. Add multiple URLs for multi-page sites. |
| YouTube | Videos with captions | Transcript must be available (auto-generated or manual) |
| Text Input | 100,000 characters | Approximately 25,000 words |
| GitHub | 100 MB per repo import | Filter by path and file type for large repos |
| Azure Blob | 50 MB per file | Same file format requirements as direct upload |
| SharePoint | 50 MB per file | Same file format requirements as direct upload |
| Excel | 25 MB, 100,000 rows | Large workbooks may take longer to process |

Knowledge Base Limits

| Limit | Value |
| --- | --- |
| Sources per knowledge base | 500 |
| Knowledge bases per agent | 10 |
| Total storage | Depends on your plan |

Chunking and Processing

When you add a source, thinnestAI breaks the content into smaller pieces called chunks. This is essential for accurate retrieval — the agent finds the specific chunk that answers the question, rather than searching through an entire document.

How Chunking Works

  1. Content extraction — Text is extracted from the source format (PDF parsing, HTML scraping, etc.).
  2. Cleaning — Headers, footers, navigation elements, and other noise are removed.
  3. Splitting — Content is split into chunks at natural boundaries:
    • Headings and section breaks
    • Paragraph boundaries
    • Sentence boundaries (for very long paragraphs)
  4. Overlap — Adjacent chunks share a small overlap to preserve context across boundaries.
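
The splitting and overlap steps above can be illustrated with a small sketch. This is not thinnestAI's implementation: it approximates tokens with whitespace-separated words, handles only paragraph and sentence boundaries (no heading detection), and the function names `split_units` and `chunk_text` are hypothetical.

```python
import re


def split_units(text: str, max_words: int) -> list[str]:
    """Split on blank lines; break oversized paragraphs into sentences."""
    units = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # normalize internal whitespace
        if not para:
            continue
        if len(para.split()) <= max_words:
            units.append(para)
        else:
            # Very long paragraph: fall back to sentence boundaries.
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)
    return units


def chunk_text(text: str, max_words: int = 1000, overlap: int = 200) -> list[str]:
    """Greedily pack units into chunks, repeating the last `overlap` words.

    `max_words` is a soft target: a chunk can exceed it when a single
    unit plus the carried-over overlap is longer than the target.
    """
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # words in `current` that are not carried-over overlap
    for unit in split_units(text, max_words):
        words = unit.split()
        if fresh and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
            fresh = 0
        current.extend(words)
        fresh += len(words)
    if fresh:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "First paragraph about returns.\n\nSecond paragraph about refunds."
    for chunk in chunk_text(sample, max_words=6, overlap=2):
        print(repr(chunk))
```

Packing whole paragraphs first, and splitting into sentences only when a paragraph is too long, keeps each chunk aligned with the boundaries a human reader would recognize.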

Chunk Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| Chunk size | ~1,000 tokens | Target size for each chunk |
| Chunk overlap | ~200 tokens | Overlap between consecutive chunks |
| Splitting strategy | Semantic | Splits at headings and paragraphs first, then sentences |
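
To get a rough sense of how many chunks a source will produce, note that each chunk after the first contributes about (chunk size − overlap) new tokens. The sketch below applies that arithmetic to the defaults; it assumes a fixed-size sliding window, so treat the result as an estimate, since semantic splitting produces chunks of varying length. `estimate_chunks` is a hypothetical helper.

```python
import math


def estimate_chunks(total_tokens: int, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Estimate the chunk count for a document under a fixed-window assumption."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens contributed by each later chunk
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


if __name__ == "__main__":
    # A 10,000-token document: 1 + ceil(9000 / 800) chunks.
    print(estimate_chunks(10_000))
```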

Why Chunk Size Matters

  • Too large — Chunks contain mixed topics, making retrieval less precise.
  • Too small — Chunks lack context, making answers less comprehensive.
  • Just right — Each chunk covers one topic or concept completely.

The default settings work well for most content. Because splitting is semantic, chunk boundaries follow your document's headings and paragraphs rather than fixed character counts.

Embedding Models

thinnestAI converts each chunk into a vector embedding — a numerical representation that captures the semantic meaning of the text. These embeddings power semantic search.

How Embeddings Work

"What is your return policy?"
             ↓
      Embedding Model
             ↓
[0.023, -0.156, 0.891, ...]  ← 1536-dimensional vector

Similar concepts produce similar vectors, so a search for "refund process" will find chunks about "return policy" even though the words are different.
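
To make "similar vectors" concrete, here is a minimal sketch that compares vectors with cosine similarity. Cosine similarity is a common choice for embedding search, but this page does not state which metric thinnestAI uses, so treat it as an assumption. The three-dimensional vectors are made up for illustration; real embeddings from the model have 1536 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real embeddings.
return_policy = [0.9, 0.1, 0.0]
refund_process = [0.8, 0.2, 0.1]
shipping_times = [0.1, 0.9, 0.3]

if __name__ == "__main__":
    print("return vs refund:  ", round(cosine_similarity(return_policy, refund_process), 3))
    print("return vs shipping:", round(cosine_similarity(return_policy, shipping_times), 3))
```

With these toy values, the "refund process" vector scores much closer to "return policy" than the "shipping times" vector does, which is the behavior semantic search relies on.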

Embedding Details

| Property | Value |
| --- | --- |
| Model | OpenAI text-embedding-3-small (default) |
| Dimensions | 1536 |
| Max tokens per chunk | 8,191 |

Vector Storage

Embeddings are stored in a PostgreSQL database with the pgvector extension, enabling fast similarity search at scale. For larger deployments, thinnestAI also supports LanceDB and ChromaDB as vector storage backends.

Search & Retrieval

When your agent searches the knowledge base, the system runs a hybrid search by default, combining the two search types below:

| Search Type | How It Works | Best For |
| --- | --- | --- |
| Semantic search | Converts the query to a vector and finds the closest chunks by meaning | Conversational queries, synonym matching |
| Full-text search | Keyword matching using PostgreSQL full-text search | Exact terms, names, codes, acronyms |
| Hybrid (default) | Combines both and reranks results | Best overall accuracy |
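
This page does not say how the two result lists are combined. One widely used technique is reciprocal rank fusion (RRF), sketched below purely as an illustration of the idea; it is not confirmed to be what thinnestAI does. The chunk IDs and the constant `k = 60` (a conventional default for RRF) are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; items ranked high in several lists win."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    semantic_hits = ["c3", "c1", "c7"]  # hypothetical semantic search ranking
    keyword_hits = ["c1", "c9", "c3"]   # hypothetical full-text search ranking
    print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Rank-based fusion avoids having to put vector distances and keyword scores on a common scale, which is why it is a popular default for hybrid search.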

Retrieval Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| Top K | 5 | Number of chunks returned per search |
| Similarity threshold | 0.7 | Minimum relevance score (0 to 1) |
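
The two parameters interact: the threshold removes weak matches, and Top K caps how many of the remaining chunks are returned. The sketch below assumes the threshold is inclusive and applied before the cap; the actual order and comparison used by thinnestAI are not documented here, and `select_chunks` is a hypothetical helper.

```python
def select_chunks(
    scored: list[tuple[str, float]], top_k: int = 5, threshold: float = 0.7
) -> list[tuple[str, float]]:
    """Keep chunks scoring at or above `threshold`, best first, at most `top_k`."""
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    # Hypothetical (chunk ID, relevance score) pairs.
    results = [("a", 0.91), ("b", 0.65), ("c", 0.78), ("d", 0.70),
               ("e", 0.88), ("f", 0.95), ("g", 0.72), ("h", 0.81)]
    for chunk_id, score in select_chunks(results):
        print(chunk_id, score)
```

Note that a search can return fewer than Top K chunks, or none at all, when few candidates clear the threshold.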

Best Practices for Knowledge Quality

Document Structure

Good structure produces better chunks and more accurate retrieval.

Do:

  • Use clear headings (H1, H2, H3) to organize content.
  • Write one topic per section.
  • Use bullet points and numbered lists for key information.
  • Include a table of contents for long documents.

Don't:

  • Put multiple unrelated topics in one section.
  • Use images as the primary way to convey information (images aren't indexed).
  • Include large blocks of unformatted text without breaks.

Content Quality

  • Be specific — "Our PAYG plan charges ₹1.50/min for voice and ₹0.50/message for chat" is better than "See our pricing page for details."
  • Be complete — Each section should be self-contained. Don't rely on context from other sections.
  • Be current — Remove outdated content. Old information can cause incorrect answers.
  • Avoid contradictions — If two sources say different things, the agent may give inconsistent answers.

Organizing Knowledge Bases

| Approach | When to Use |
| --- | --- |
| Single knowledge base | Small content set (under 50 sources), all related topics |
| Multiple knowledge bases | Large content sets, distinct topic areas, different teams |
| Per-agent knowledge | Each agent handles a specific domain with its own knowledge |

Example organization for a SaaS company:

| Knowledge Base | Content |
| --- | --- |
| Product Docs | Feature guides, tutorials, API reference |
| Company Policies | Refund policy, terms of service, privacy policy |
| Sales & Pricing | Plan details, pricing tables, comparison charts |
| Internal FAQ | Common customer questions and answers |

Testing Your Knowledge

After adding sources, test with real questions:

  1. Ask direct questions: "What's the refund policy?" should return the exact information.
  2. Ask rephrased questions: "Can I get my money back?" should find the same content.
  3. Ask edge cases: "What happens if I cancel after 60 days?" should find the closest relevant policy.
  4. Check for gaps: if the agent says "I don't have information about that," you may need to add more content.

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| Agent says "I don't know" | Content not in knowledge base | Add the missing information as a source |
| Wrong or outdated answers | Stale sources | Delete and re-add updated content |
| PDF not processing | Scanned image PDF | Use an OCR tool to make the PDF text-searchable first |
| Slow processing | Very large file | Split into smaller files or filter to relevant sections |
| Irrelevant search results | Content too broad or mixed | Improve document structure with clear headings and focused sections |
