Scraping Hub
Extract content from any website using multiple scraping engines — Firecrawl, Crawl4AI, JinaReader, and more. Power your agent's knowledge base with web data.
The Scraping Hub provides multiple web scraping engines for extracting content from websites. Use it to build knowledge bases from web content, monitor competitors, or give your agents access to real-time web data.
Available Engines
| Engine | Speed | JS Support | Anti-Bot | API Key | Best For |
|---|---|---|---|---|---|
| Crawl4AI | Fast | Yes | Basic | No | General-purpose scraping, free tier |
| Firecrawl | Medium | Yes | Advanced | Yes | JavaScript-heavy sites, SPAs |
| JinaReader | Fast | Limited | Basic | Yes | Clean text extraction, articles |
| Spider | Fast | Yes | Advanced | Yes | Large-scale crawling |
| Newspaper | Fast | No | None | No | News articles and blog posts |
| Scrapling | Medium | Yes | Good | No | Complex sites with dynamic content |
Using the Scraping Hub
From the Dashboard
- Navigate to Knowledge for your agent.
- Click Add Source > URL.
- Enter the URL(s) to scrape.
- Select a scraping engine (or use auto-select).
- Configure options:
- Crawl depth — How many levels of links to follow
- Max pages — Maximum pages to scrape
- Include/exclude patterns — URL patterns to filter
- Click Start Scraping.
The scraped content is automatically chunked, embedded, and added to your agent's knowledge base.
Scraping Options
| Option | Description | Default |
|---|---|---|
| Engine | Which scraping engine to use | Auto-select |
| Crawl Depth | Levels of links to follow (0 = single page) | 0 |
| Max Pages | Maximum pages to scrape | 10 |
| Include Patterns | Only scrape URLs matching these patterns | None |
| Exclude Patterns | Skip URLs matching these patterns | None |
| Wait for JS | Wait for JavaScript to render before scraping | Engine default |
| Custom Headers | HTTP headers to send with requests | None |
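As a sketch, the options above can be set in the `options` object of a scrape request. Note that the snake_case key names here (`crawl_depth`, `max_pages`, `exclude_patterns`, `headers`) are assumed from the style of the API examples in this page; verify them against the API reference.

```bash
# Hypothetical example: option key names are assumed, not confirmed
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs",
    "agent_id": "agent_abc123",
    "options": {
      "crawl_depth": 1,
      "max_pages": 20,
      "exclude_patterns": ["/login/*", "/admin/*"],
      "headers": {"User-Agent": "MyAgent/1.0"}
    }
  }'
```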
Engine Selection Guide
Crawl4AI (Recommended Default)
Best for most use cases. Open-source, no API key needed, handles JavaScript rendering.
```bash
# Scrape a single page
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs",
    "engine": "crawl4ai",
    "agent_id": "agent_abc123"
  }'
```
Firecrawl
Best for JavaScript-heavy sites like SPAs, dashboards, and modern web apps. Handles anti-bot measures.
```bash
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://app.example.com/dashboard",
    "engine": "firecrawl",
    "agent_id": "agent_abc123",
    "options": {
      "wait_for_selector": ".content-loaded"
    }
  }'
```
Requires a Firecrawl API key in your integration settings.
JinaReader
Best for clean text extraction from articles, documentation, and blog posts. Returns markdown-formatted content.
```bash
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://blog.example.com/my-article",
    "engine": "jina",
    "agent_id": "agent_abc123"
  }'
```
Newspaper
Best specifically for news articles. Extracts title, authors, publish date, and article text.
```bash
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.example.com/breaking-story",
    "engine": "newspaper",
    "agent_id": "agent_abc123"
  }'
```
Sitemap Crawling
For full-site indexing, provide a sitemap URL:
```bash
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/sitemap.xml",
    "type": "sitemap",
    "agent_id": "agent_abc123",
    "options": {
      "max_pages": 50,
      "include_patterns": ["/docs/*", "/blog/*"]
    }
  }'
```
Multi-URL Scraping
Scrape multiple URLs in a single request:
```bash
curl -X POST https://api.thinnest.ai/api/knowledge/scrape/batch \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page-1",
      "https://example.com/page-2",
      "https://example.com/page-3"
    ],
    "engine": "crawl4ai",
    "agent_id": "agent_abc123"
  }'
```
URL Discovery
The platform can discover additional URLs from a starting page:
- Scrape the initial URL.
- Extract all internal links.
- Present discovered URLs for selection.
- Scrape selected URLs.
This is useful when you want to index a specific section of a website without a sitemap.
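The link-extraction step can be approximated locally. This sketch runs on a sample HTML string (swap in `curl -s <url>` to fetch a live page); it uses a crude `grep`/`sed` pipeline, whereas a real implementation would use an HTML parser and resolve relative URLs.

```bash
# Sketch of the "extract internal links" step, run on sample HTML.
html='<a href="/docs/a">A</a> <a href="https://other.com/x">X</a> <a href="/docs/b">B</a>'
printf '%s' "$html" \
  | grep -oE 'href="[^"]+"' \
  | sed -E 's/^href="//; s/"$//' \
  | grep '^/' \
  | sort -u
# prints:
# /docs/a
# /docs/b
```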
Content Processing Pipeline
After scraping, content goes through an automated pipeline:
Raw HTML → Clean Text → Chunking → Embedding → Vector Storage
Chunking Strategies
| Strategy | Description | Best For |
|---|---|---|
| Recursive | Splits by paragraphs, then sentences, then words | General content |
| Fixed-size | Splits into fixed character-length chunks | Uniform retrieval |
| Semantic | Splits by topic/meaning boundaries | Technical documentation |
| Agentic | AI determines optimal chunk boundaries | Complex documents |
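If the API exposes strategy selection on the scrape request, it would presumably look like the sketch below; `chunking_strategy` is a hypothetical option name, so check the API reference before relying on it.

```bash
# Hypothetical: "chunking_strategy" is an assumed option key
curl -X POST https://api.thinnest.ai/api/knowledge/scrape \
  -H "Authorization: Bearer $THINNESTAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/guide",
    "engine": "crawl4ai",
    "agent_id": "agent_abc123",
    "options": {
      "chunking_strategy": "semantic"
    }
  }'
```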
Integration Keys
Some engines require API keys. Configure them in Settings > Integrations:
| Engine | Key Name | Get a Key |
|---|---|---|
| Firecrawl | FIRECRAWL_API_KEY | firecrawl.dev |
| JinaReader | JINA_API_KEY | jina.ai |
| Spider | SPIDER_API_KEY | spider.cloud |
Best Practices
- Start with Crawl4AI — Free, fast, and handles most sites. Switch engines only if needed.
- Use include/exclude patterns — Avoid scraping irrelevant pages (login, admin, etc.).
- Set reasonable max pages — Start small (10-20 pages) and increase as needed.
- Check content quality — After scraping, review the extracted content in the Knowledge tab.
- Re-scrape periodically — Web content changes. Set up periodic re-scraping for dynamic sites.
- Respect robots.txt — Be a good web citizen. Don't scrape sites that disallow it.
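For the robots.txt check, this sketch extracts the `Disallow` paths from the `User-agent: *` block of a sample file (fetch a live one with `curl -s https://example.com/robots.txt`). It is deliberately simplified: full robots.txt parsing also handles `Allow` rules, wildcards, and per-bot user-agent groups.

```bash
# Sketch: list paths disallowed for all user agents in a robots.txt
robots='User-agent: *
Disallow: /admin/
Disallow: /login

User-agent: Googlebot
Disallow: /private/'
printf '%s\n' "$robots" \
  | awk '/^User-agent: \*/{f=1; next} /^User-agent:/{f=0} f && /^Disallow:/{print $2}'
# prints:
# /admin/
# /login
```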