Firecrawl: a website scraping tool

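Overview
In short, Firecrawl is a seriously capable website crawler. It supports pure-JS sites, meaning the now-popular Vue-style sites that ship little or no static HTML. It works by bundling a headless Chrome browser and scraping a page only after it has finished rendering.
Features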

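Whole-site crawling, single-page scraping, and pure-JS site scraping
Output as LLM-friendly markdown; scraping the raw HTML is, of course, supported as a basic operation
Extracts only the main content of each page and skips meaningless repeated content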

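Related docs and reference links

Official docs: https://docs.firecrawl.dev/introduction

Much of the material online covers the v0 API, but Firecrawl has since moved to v1, and v0 is scheduled to go offline on April 1, 2025: https://docs.firecrawl.dev/v1-welcome

GitHub source code: https://github.com/mendableai/firecrawl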

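Installation
After downloading the source code, run docker-compose build to build the images, then docker-compose up -d to start the services.
playwright-service
This is the service that handles pure-JS sites. Since so many sites today render their content dynamically, it is effectively required.
A minimal .env file (it actually runs without one too)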
# Core settings
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html

# Database and other optional settings
USE_DB_AUTHENTICATION=false

API calls (GPT is very capable these days, so ask it about anything that is unclear)

Crawl an entire site

curl -X POST http://localhost:3002/v1/crawl \
 -H 'Content-Type: application/json' \
 -H 'Authorization: Bearer YOUR_API_KEY' \
 -d '{
 "url": "https://docs.firecrawl.dev",
 "limit": 100,
 "scrapeOptions": {
 "formats": ["markdown", "html"]
 }
 }'
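
The same crawl request can be sent from a script. Below is a minimal Python sketch using only the standard library. The helper names (`build_crawl_request`, `start_crawl`) are made up for illustration, and the idea that the response carries the job id under an `id` key is an assumption to check against the official docs.

```python
import json
import urllib.request

# Self-hosted instance, same address as in the curl example.
BASE_URL = "http://localhost:3002"


def build_crawl_request(target_url, limit=100, formats=("markdown", "html"),
                        api_key="YOUR_API_KEY"):
    """Build (but do not send) the POST /v1/crawl request (hypothetical helper)."""
    payload = {
        "url": target_url,
        "limit": limit,
        "scrapeOptions": {"formats": list(formats)},
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def start_crawl(target_url, **kwargs):
    """Send the crawl request and return the decoded JSON response."""
    request = build_crawl_request(target_url, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `start_crawl("https://docs.firecrawl.dev")` against a running instance should return a JSON object; if the assumption above holds, `result["id"]` is the job id used in the status call.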

Get the crawl status (this call also returns the crawl results)

curl -X GET http://localhost:3002/v1/crawl/<jobid>
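
A crawl runs asynchronously, so in practice the status endpoint gets polled until the job is done. Here is a rough Python sketch of such a loop. The set of terminal `status` values is an assumption on my part, and `wait_for_crawl` and `fetch_crawl_status` are illustrative names, not part of any Firecrawl SDK.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:3002"

# Assumption: these are the values of "status" that mean the job has stopped.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def fetch_crawl_status(job_id):
    """GET /v1/crawl/<jobid> and return the decoded JSON body."""
    url = f"{BASE_URL}/v1/crawl/{job_id}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def wait_for_crawl(job_id, fetch=fetch_crawl_status, interval=2.0,
                   max_polls=150, sleep=time.sleep):
    """Poll until the job reports a terminal status, or give up after max_polls."""
    for attempt in range(max_polls):
        body = fetch(job_id)
        if body.get("status") in TERMINAL_STATUSES:
            return body
        # No need to sleep after the final attempt.
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running server.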

Scrape a single URL

curl -X POST http://localhost:3002/v1/scrape \
 -H 'Content-Type: application/json' \
 -H 'Authorization: Bearer YOUR_API_KEY' \
 -d '{
 "url": "https://docs.firecrawl.dev",
 "formats": ["markdown", "html"]
 }'
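
Once the scrape returns, the markdown usually needs to be pulled out of the JSON. A small Python sketch of that step follows. It assumes the response looks like `{"success": true, "data": {"markdown": "..."}}`; that shape and the helper names are assumptions, so verify them against a real response first.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3002"


def build_scrape_request(url, formats=("markdown", "html"), api_key="YOUR_API_KEY"):
    """Build the POST /v1/scrape request with the same body as the curl example."""
    payload = {"url": url, "formats": list(formats)}
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def scrape(url, **kwargs):
    """Send the scrape request and return the decoded JSON response."""
    with urllib.request.urlopen(build_scrape_request(url, **kwargs), timeout=60) as response:
        return json.load(response)


def extract_markdown(response):
    """Return the markdown text from a scrape response.

    Assumed shape: {"success": true, "data": {"markdown": "..."}}.
    """
    if not response.get("success"):
        raise RuntimeError(f"scrape failed: {response.get('error', 'unknown error')}")
    return response["data"]["markdown"]
```

With a running instance, `extract_markdown(scrape("https://docs.firecrawl.dev"))` would give the page text ready to hand to an LLM.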

Get the site map

curl -X POST http://localhost:3002/v1/map \
 -H 'Content-Type: application/json' \
 -H 'Authorization: Bearer YOUR_API_KEY' \
 -d '{
 "url": "https://firecrawl.dev"
 }'
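
The map call tends to return a lot of links, and usually only part of the site is interesting. A tiny Python sketch for narrowing the list down is below. It assumes the response has the form `{"success": true, "links": ["https://...", ...]}` with plain string URLs; `links_under` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def links_under(map_response, path_prefix):
    """Return de-duplicated links whose URL path starts with path_prefix.

    Assumed response shape: {"success": true, "links": ["https://...", ...]}.
    Order of first appearance is preserved.
    """
    seen = set()
    selected = []
    for link in map_response.get("links", []):
        if urlparse(link).path.startswith(path_prefix) and link not in seen:
            seen.add(link)
            selected.append(link)
    return selected
```

For example, `links_under(response, "/blog/")` keeps only the blog pages, which can then be scraped one by one.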

Run a search

curl -X POST http://localhost:3002/v1/search \
 -H 'Content-Type: application/json' \
 -H 'Authorization: Bearer YOUR_API_KEY' \
 -d '{
 "query": "AI tools",
 "limit": 5,
 "scrapeOptions": {
 "formats": ["markdown"]
 }
 }'

Extract structured data

curl -X POST http://localhost:3002/v1/extract \
 -H 'Content-Type: application/json' \
 -H 'Authorization: Bearer YOUR_API_KEY' \
 -d '{
 "urls": ["https://example.com"],
 "schema": {
 "type": "object",
 "properties": {
 "title": {"type": "string"},
 "price": {"type": "number"}
 }
 }
 }'

Batch scrape

curl -X POST http://localhost:3002/v1/batch/scrape \
 -H 'Content-Type: application/json' \
 -H 'Authorization: Bearer YOUR_API_KEY' \
 -d '{
 "urls": ["https://example1.com", "https://example2.com"],
 "formats": ["markdown"]
 }'