Scraping Hub

Extract content from any website using multiple scraping engines — Firecrawl, Crawl4AI, JinaReader, and more. Power your agent's knowledge base with web data.

The Scraping Hub provides multiple web scraping engines for extracting content from websites. Use it to build knowledge bases from web content, monitor competitors, or give your agents access to real-time web data.

Available Engines

| Engine | Speed | JS Support | Anti-Bot | API Key | Best For |
|---|---|---|---|---|---|
| Crawl4AI | Fast | Yes | Basic | No | General-purpose scraping, free tier |
| Firecrawl | Medium | Yes | Advanced | Yes | JavaScript-heavy sites, SPAs |
| JinaReader | Fast | Limited | Basic | Yes | Clean text extraction, articles |
| Spider | Fast | Yes | Advanced | Yes | Large-scale crawling |
| Newspaper | Fast | No | None | No | News articles and blog posts |
| Scrapling | Medium | Yes | Good | No | Complex sites with dynamic content |

Using the Scraping Hub

From the Dashboard

  1. Navigate to Knowledge for your agent.
  2. Click Add Source > URL.
  3. Enter the URL(s) to scrape.
  4. Select a scraping engine (or use auto-select).
  5. Configure options:
    • Crawl depth — How many levels of links to follow
    • Max pages — Maximum pages to scrape
    • Include/exclude patterns — URL patterns to filter
  6. Click Start Scraping.

The scraped content is automatically chunked, embedded, and added to your agent's knowledge base.

Scraping Options

| Option | Description | Default |
|---|---|---|
| Engine | Which scraping engine to use | Auto-select |
| Crawl Depth | Levels of links to follow (0 = single page) | 0 |
| Max Pages | Maximum pages to scrape | 10 |
| Include Patterns | Only scrape URLs matching these patterns | None |
| Exclude Patterns | Skip URLs matching these patterns | None |
| Wait for JS | Wait for JavaScript to render before scraping | Engine default |
| Custom Headers | HTTP headers to send with requests | None |
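Include/exclude patterns are glob-style filters matched against URL paths (e.g. `/docs/*`). As an illustration of how such filtering typically behaves (this is a local sketch, not the platform's actual matcher, and `should_scrape` is a hypothetical helper name):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_scrape(url, include_patterns=None, exclude_patterns=None):
    """Decide whether a URL passes glob-style include/exclude patterns.

    Patterns are matched against the URL path, e.g. "/docs/*".
    Exclude patterns win over include patterns.
    """
    path = urlparse(url).path
    if exclude_patterns and any(fnmatch(path, p) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatch(path, p) for p in include_patterns)
    return True  # no include patterns: everything not excluded passes
```

With `include_patterns=["/docs/*"]`, a URL like `https://example.com/docs/intro` passes while `https://example.com/admin/login` is skipped.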

Engine Selection Guide

Crawl4AI

Best for most use cases: open-source, no API key needed, and handles JavaScript rendering.

# Scrape a single page
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs",
    "engine": "crawl4ai",
    "agent_id": "agent_abc123"
  }'

Firecrawl

Best for JavaScript-heavy sites like SPAs, dashboards, and modern web apps. Handles anti-bot measures.

curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://app.example.com/dashboard",
    "engine": "firecrawl",
    "agent_id": "agent_abc123",
    "options": {
      "wait_for_selector": ".content-loaded"
    }
  }'

Requires a Firecrawl API key in your integration settings.

JinaReader

Best for clean text extraction from articles, documentation, and blog posts. Returns markdown-formatted content.

curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://blog.example.com/my-article",
    "engine": "jina",
    "agent_id": "agent_abc123"
  }'

Newspaper

Purpose-built for news articles. Extracts the title, authors, publish date, and article text.

curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.example.com/breaking-story",
    "engine": "newspaper",
    "agent_id": "agent_abc123"
  }'

Sitemap Crawling

For full-site indexing, provide a sitemap URL:

curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/sitemap.xml",
    "type": "sitemap",
    "agent_id": "agent_abc123",
    "options": {
      "max_pages": 50,
      "include_patterns": ["/docs/*", "/blog/*"]
    }
  }'
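To preview which pages a sitemap crawl would cover before submitting the job, you can parse the sitemap locally with Python's standard library. This is a client-side sketch independent of the platform; `max_pages` here simply mirrors the option of the same name:

```python
import xml.etree.ElementTree as ET

# Sitemaps use this XML namespace, per the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text, max_pages=50):
    """Extract page URLs from sitemap XML, capped at max_pages."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc")]
    return locs[:max_pages]

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/hello</loc></url>
</urlset>"""

print(sitemap_urls(sample))
```

You could then apply your include/exclude patterns to the resulting list before deciding what to submit.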

Multi-URL Scraping

Scrape multiple URLs in a single request:

curl -X POST https://api.thinnest.ai/api/knowledge/scrape/batch \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page-1",
      "https://example.com/page-2",
      "https://example.com/page-3"
    ],
    "engine": "crawl4ai",
    "agent_id": "agent_abc123"
  }'

URL Discovery

The platform can discover additional URLs from a starting page:

  1. Scrape the initial URL.
  2. Extract all internal links.
  3. Present discovered URLs for selection.
  4. Scrape selected URLs.

This is useful when you want to index a specific section of a website without a sitemap.
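The "extract all internal links" step can be approximated locally with the standard library. A minimal sketch (illustrative only, not the platform's implementation): collect `href` attributes, resolve them against the base URL, and keep only same-host links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def internal_links(base_url, html):
    """Return absolute URLs for links that stay on the same host."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolves relative paths
        if urlparse(absolute).netloc == base_host:
            links.append(absolute)
    return links
```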

Content Processing Pipeline

After scraping, content goes through an automated pipeline:

Raw HTML → Clean Text → Chunking → Embedding → Vector Storage

Chunking Strategies

| Strategy | Description | Best For |
|---|---|---|
| Recursive | Splits by paragraphs, then sentences, then words | General content |
| Fixed-size | Splits into fixed character-length chunks | Uniform retrieval |
| Semantic | Splits by topic/meaning boundaries | Technical documentation |
| Agentic | AI determines optimal chunk boundaries | Complex documents |
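As a rough illustration of the Recursive strategy, here is a simplified sketch that splits by paragraphs, then sentences, then words, falling back to hard character splits. The platform's actual chunker and its defaults may differ:

```python
def recursive_chunks(text, max_chars=200, separators=("\n\n", ". ", " ")):
    """Split text recursively: paragraphs first, then sentences, then words.

    Each chunk is at most max_chars long; empty fragments are dropped.
    """
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard split at max_chars boundaries.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        chunks.extend(recursive_chunks(part, max_chars, rest))
    return chunks
```

Note this sketch drops the separators themselves; production chunkers usually preserve them and add overlap between adjacent chunks.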

Integration Keys

Some engines require API keys. Configure them in Settings > Integrations:

| Engine | Key Name | Get a Key |
|---|---|---|
| Firecrawl | FIRECRAWL_API_KEY | firecrawl.dev |
| JinaReader | JINA_API_KEY | jina.ai |
| Spider | SPIDER_API_KEY | spider.cloud |

Best Practices

  • Start with Crawl4AI — Free, fast, and handles most sites. Switch engines only if needed.
  • Use include/exclude patterns — Avoid scraping irrelevant pages (login, admin, etc.).
  • Set reasonable max pages — Start small (10-20 pages) and increase as needed.
  • Check content quality — After scraping, review the extracted content in the Knowledge tab.
  • Re-scrape periodically — Web content changes. Set up periodic re-scraping for dynamic sites.
  • Respect robots.txt — Be a good web citizen. Don't scrape sites that disallow it.
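On the last point: you can check a site's robots.txt rules client-side before submitting a scrape job, using Python's standard `urllib.robotparser`. The robots.txt content and `MyScraperBot` user agent below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; in practice, fetch it from
# https://<host>/robots.txt before crawling.
robots_txt = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyScraperBot", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("MyScraperBot", "https://example.com/admin/users"))  # False
```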
