Scraping Hub

Extract content from any website using multiple scraping engines — Firecrawl, Crawl4AI, JinaReader, and more. Power your agent's knowledge base with web data.

The Scraping Hub provides multiple web scraping engines for extracting content from websites. Use it to build knowledge bases from web content, monitor competitors, or give your agents access to real-time web data.

Available Engines

| Engine | Speed | JS Support | Anti-Bot | API Key | Best For |
|---|---|---|---|---|---|
| Crawl4AI | Fast | Yes | Basic | No | General-purpose scraping, free tier |
| Firecrawl | Medium | Yes | Advanced | Yes | JavaScript-heavy sites, SPAs |
| JinaReader | Fast | Limited | Basic | Yes | Clean text extraction, articles |
| Spider | Fast | Yes | Advanced | Yes | Large-scale crawling |
| Newspaper | Fast | No | None | No | News articles and blog posts |
| Scrapling | Medium | Yes | Good | No | Complex sites with dynamic content |

Using the Scraping Hub

From the Dashboard

  1. Navigate to Knowledge for your agent.
  2. Click Add Source > URL.
  3. Enter the URL(s) to scrape.
  4. Select a scraping engine (or use auto-select).
  5. Configure options:
    • Crawl depth — How many levels of links to follow
    • Max pages — Maximum pages to scrape
    • Include/exclude patterns — URL patterns to filter
  6. Click Start Scraping.

The scraped content is automatically chunked, embedded, and added to your agent's knowledge base.

Scraping Options

| Option | Description | Default |
|---|---|---|
| Engine | Which scraping engine to use | Auto-select |
| Crawl Depth | Levels of links to follow (0 = single page) | 0 |
| Max Pages | Maximum pages to scrape | 10 |
| Include Patterns | Only scrape URLs matching these patterns | None |
| Exclude Patterns | Skip URLs matching these patterns | None |
| Wait for JS | Wait for JavaScript to render before scraping | Engine default |
| Custom Headers | HTTP headers to send with requests | None |
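Include/exclude patterns are glob-style filters matched against URL paths (e.g. `/docs/*`). As an illustration of how such filtering typically behaves (this is a local sketch, not the platform's actual matcher, and `should_scrape` is a hypothetical helper name):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_scrape(url, include_patterns=None, exclude_patterns=None):
    """Decide whether a URL passes glob-style include/exclude patterns.

    Patterns are matched against the URL path, e.g. "/docs/*".
    Exclude patterns win over include patterns.
    """
    path = urlparse(url).path
    if exclude_patterns and any(fnmatch(path, p) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatch(path, p) for p in include_patterns)
    return True  # no include patterns: everything not excluded passes
```

With `include_patterns=["/docs/*"]`, a URL like `https://example.com/docs/intro` passes while `https://example.com/admin/login` is skipped.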

Engine Selection Guide

Crawl4AI

Best for most use cases: open-source, no API key needed, and handles JavaScript rendering.

# Scrape a single page
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs",
    "engine": "crawl4ai",
    "agent_id": "agent_abc123"
  }'

Firecrawl

Best for JavaScript-heavy sites like SPAs, dashboards, and modern web apps. Handles anti-bot measures.

curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://app.example.com/dashboard",
    "engine": "firecrawl",
    "agent_id": "agent_abc123",
    "options": {
      "wait_for_selector": ".content-loaded"
    }
  }'

Requires a Firecrawl API key in your integration settings.

JinaReader

Best for clean text extraction from articles, documentation, and blog posts. Returns markdown-formatted content.

curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://blog.example.com/my-article",
    "engine": "jina",
    "agent_id": "agent_abc123"
  }'

Newspaper

Purpose-built for news articles. Extracts the title, authors, publish date, and article text.

curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.example.com/breaking-story",
    "engine": "newspaper",
    "agent_id": "agent_abc123"
  }'

Sitemap Crawling

For full-site indexing, provide a sitemap URL:

curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/sitemap.xml",
    "type": "sitemap",
    "agent_id": "agent_abc123",
    "options": {
      "max_pages": 50,
      "include_patterns": ["/docs/*", "/blog/*"]
    }
  }'
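To preview which pages a sitemap crawl would cover before submitting the job, you can parse the sitemap locally with Python's standard library. This is a client-side sketch independent of the platform; `max_pages` here simply mirrors the option of the same name:

```python
import xml.etree.ElementTree as ET

# Sitemaps use this XML namespace, per the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text, max_pages=50):
    """Extract page URLs from sitemap XML, capped at max_pages."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc")]
    return locs[:max_pages]

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/hello</loc></url>
</urlset>"""

print(sitemap_urls(sample))
```

You could then apply your include/exclude patterns to the resulting list before deciding what to submit.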

Multi-URL Scraping

Scrape multiple URLs in a single request:

curl -X POST https://api.thinnest.ai/api/knowledge/scrape/batch \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page-1",
      "https://example.com/page-2",
      "https://example.com/page-3"
    ],
    "engine": "crawl4ai",
    "agent_id": "agent_abc123"
  }'

URL Discovery

The platform can discover additional URLs from a starting page:

  1. Scrape the initial URL.
  2. Extract all internal links.
  3. Present discovered URLs for selection.
  4. Scrape selected URLs.

This is useful when you want to index a specific section of a website without a sitemap.
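The "extract all internal links" step can be approximated locally with the standard library. A minimal sketch (illustrative only, not the platform's implementation): collect `href` attributes, resolve them against the base URL, and keep only same-host links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def internal_links(base_url, html):
    """Return absolute URLs for links that stay on the same host."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolves relative paths
        if urlparse(absolute).netloc == base_host:
            links.append(absolute)
    return links
```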

Content Processing Pipeline

After scraping, content goes through an automated pipeline:

Raw HTML → Clean Text → Chunking → Embedding → Vector Storage

Chunking Strategies

| Strategy | Description | Best For |
|---|---|---|
| Recursive | Splits by paragraphs, then sentences, then words | General content |
| Fixed-size | Splits into fixed character-length chunks | Uniform retrieval |
| Semantic | Splits by topic/meaning boundaries | Technical documentation |
| Agentic | AI determines optimal chunk boundaries | Complex documents |
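As a rough illustration of the Recursive strategy, here is a simplified sketch that splits by paragraphs, then sentences, then words, falling back to hard character splits. The platform's actual chunker and its defaults may differ:

```python
def recursive_chunks(text, max_chars=200, separators=("\n\n", ". ", " ")):
    """Split text recursively: paragraphs first, then sentences, then words.

    Each chunk is at most max_chars long; empty fragments are dropped.
    """
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard split at max_chars boundaries.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        chunks.extend(recursive_chunks(part, max_chars, rest))
    return chunks
```

Note this sketch drops the separators themselves; production chunkers usually preserve them and add overlap between adjacent chunks.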

Integration Keys

Some engines require API keys. Configure them in Settings > Integrations:

| Engine | Key Name | Get a Key |
|---|---|---|
| Firecrawl | FIRECRAWL_API_KEY | firecrawl.dev |
| JinaReader | JINA_API_KEY | jina.ai |
| Spider | SPIDER_API_KEY | spider.cloud |

Best Practices

  • Start with Crawl4AI — Free, fast, and handles most sites. Switch engines only if needed.
  • Use include/exclude patterns — Avoid scraping irrelevant pages (login, admin, etc.).
  • Set reasonable max pages — Start small (10-20 pages) and increase as needed.
  • Check content quality — After scraping, review the extracted content in the Knowledge tab.
  • Re-scrape periodically — Web content changes. Set up periodic re-scraping for dynamic sites.
  • Respect robots.txt — Be a good web citizen. Don't scrape sites that disallow it.
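On the last point: you can check a site's robots.txt rules client-side before submitting a scrape job, using Python's standard `urllib.robotparser`. The robots.txt content and `MyScraperBot` user agent below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; in practice, fetch it from
# https://<host>/robots.txt before crawling.
robots_txt = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyScraperBot", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("MyScraperBot", "https://example.com/admin/users"))  # False
```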
