A tool for turning websites into LLM training data: FireCrawl, the automated crawler that earned 4K stars in two weeks!

jc_ipec, published 2024-05-23 from Hubei


https://mp.weixin.qq.com/s/KS93gpz73X20AD8L-3zz2Q

Turn entire websites into Markdown or structured data suitable for LLM training. Scrape, crawl, search, and extract with a single API.

Hello everyone, I'm Aitrainee. Today I'd like to introduce Firecrawl, a practical web crawling tool.

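What is Firecrawl?

Firecrawl works like a smart robot: starting from a page you give it, it automatically finds and visits every other page on the site. It extracts the main content of each page, strips out ads and other unwanted material, and organizes the result so it is easy to use. It also does not need a sitemap file from the site to find those pages.

Starting from the page you specify, Firecrawl automatically visits every subpage of the site that can be opened. It is like clicking a link and then continuing to click every link on the resulting page until all pages have been visited. As long as a page is not blocked by the site's configuration (for example, not disallowed in robots.txt), Firecrawl can crawl it.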
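To make the link-following idea concrete, here is a minimal sketch of a breadth-first crawl that respects robots.txt. It is not Firecrawl's implementation: the in-memory SITE graph, the ROBOTS_TXT rules, the crawl function, and the "example-bot" agent name are all hypothetical, and no network access is involved.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory site: each path maps to the paths it links to.
SITE = {
    "/": ["/docs", "/blog", "/private/admin"],
    "/docs": ["/docs/api", "/"],
    "/docs/api": [],
    "/blog": ["/docs"],
    "/private/admin": [],
}

# Hypothetical robots.txt that blocks everything under /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(start, site, robots_txt, agent="example-bot"):
    """Visit pages breadth-first, skipping paths that robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    seen, queue, visited = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        if not rules.can_fetch(agent, page):
            continue  # blocked by robots.txt, so do not visit it
        visited.append(page)
        for link in site.get(page, []):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/", SITE, ROBOTS_TXT))
```

The seen set is what keeps the crawl from looping when pages link back to each other, as "/docs" does to "/" here.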

In addition, Firecrawl extracts the useful information from each page, removes unimportant content (such as ads and navigation bars), and organizes the data into an easy-to-use format such as Markdown.

What is a sitemap?

A sitemap is a file provided by a website that lists the pages on the site. It helps search engines and crawlers find and visit those pages faster. A sitemap is usually an XML file containing links to the site's pages.
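For readers who have not seen one, the sketch below parses a tiny sitemap with Python's standard library. The example.com URLs and the sitemap_urls helper are made up for illustration; the XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up sitemap in the sitemaps.org format.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>
"""

# Namespace prefix used in the XPath-style lookups below.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


print(sitemap_urls(SITEMAP_XML))
```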

To summarize:

1. Firecrawl starts from the page you give it, follows the links on the site, and crawls every page it can access.
2. It removes clutter, extracts the useful data, and organizes it.
3. Firecrawl can find and crawl all pages even without a sitemap.

Demo video

By a YouTube creator (channel name given in the source as 開發者文稿); subtitles translated by Aitrainee. The link is here:

https://www./watch?v=fDSM7chMo5E

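The rest of this article covers the official documentation, related resources, and deployment instructions, so that you can put the tool to use.

Firecrawl

We offer a hosted version with an easy-to-use API. You can find the demo and documentation here. You can also self-host the backend service.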

· API
· Python SDK
· Node SDK
· Langchain integration
· Llama Index integration
· Langchain JS integration
· Want another SDK or integration? Let us know by opening an issue.

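To run locally, see the guide.

API key

To use the API, you need to sign up at Firecrawl and get an API key.

Crawling

Used to crawl a URL and all of its accessible subpages. This submits a crawl job and returns a job ID that you can use to check the crawl status.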

curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://"
    }'

Returns a job ID:

{ "jobId": "1234-5678-9101" }
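The same request can be assembled from Python using only the standard library. The sketch below builds the request object without sending it, so it runs without a key or network access. The endpoint and headers come from the curl command above; the build_crawl_request helper and the example.com target are illustrative assumptions.

```python
import json
import urllib.request

# Base URL taken from the curl example in this article.
API_BASE = "https://api.firecrawl.dev/v0"


def build_crawl_request(url, api_key):
    """Build (but do not send) a POST request that submits a crawl job."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/crawl",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# example.com is a placeholder target, not a URL from the article.
req = build_crawl_request("https://example.com", "YOUR_API_KEY")
print(req.get_method(), req.full_url)

# Sending it needs a real key and network access:
#   with urllib.request.urlopen(req) as resp:
#       job_id = json.load(resp)["jobId"]
```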

Checking a crawl job

Used to check the status of a crawl job and get its results.

curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY'

{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content ",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www./"
      }
    }
  ]
}
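Because a crawl runs as a background job, a client usually polls the status endpoint until the job is done. The sketch below shows one way to structure that loop. The status-fetching function is injected so the example runs offline against canned responses. Only the "completed" status appears in this article; the "active" and "failed" values, the wait_for_crawl helper, and its parameters are assumptions made for illustration.

```python
import time


def wait_for_crawl(fetch_status, job_id, interval=2.0, max_checks=30,
                   sleep=time.sleep):
    """Poll fetch_status(job_id) until the job completes; return its data."""
    for _ in range(max_checks):
        status = fetch_status(job_id)
        if status["status"] == "completed":
            return status["data"]
        if status["status"] == "failed":  # assumed failure value
            raise RuntimeError(f"crawl job {job_id} failed")
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"crawl job {job_id} did not finish in time")


# Canned responses standing in for the real status endpoint.
_responses = iter([
    {"status": "active", "current": 5, "total": 22},  # assumed in-progress value
    {"status": "completed", "current": 22, "total": 22,
     "data": [{"markdown": "# Markdown Content"}]},
])


def fake_fetch(job_id):
    return next(_responses)


pages = wait_for_crawl(fake_fetch, "1234-5678-9101", sleep=lambda s: None)
print([page["markdown"] for page in pages])
```

Capping the number of checks keeps a stuck job from blocking the caller forever.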

Scraping

Used to scrape a single URL and get its content.

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://"
    }'

Response:

{
  "success": true,
  "data": {
    "content": "Raw Content ",
    "markdown": "# Markdown Content",
    "provider": "web-scraper",
    "metadata": {
      "title": "Mendable | AI for CX and Sales",
      "description": "AI for CX and Sales",
      "language": null,
      "sourceURL": "https://www./"
    }
  }
}
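A response of this shape is straightforward to consume. The sketch below parses the sample body shown above (with quotes normalized to valid JSON) and pulls out the fields most useful for building a dataset. The summarize_scrape helper is a hypothetical name, not part of any SDK.

```python
import json

# Sample body mirroring the response in this article, as valid JSON.
SCRAPE_RESPONSE = """
{
  "success": true,
  "data": {
    "content": "Raw Content ",
    "markdown": "# Markdown Content",
    "provider": "web-scraper",
    "metadata": {
      "title": "Mendable | AI for CX and Sales",
      "description": "AI for CX and Sales",
      "language": null,
      "sourceURL": "https://www./"
    }
  }
}
"""


def summarize_scrape(raw_json):
    """Return the title, Markdown, and language from a scrape response."""
    payload = json.loads(raw_json)
    if not payload.get("success"):
        raise ValueError("scrape request did not succeed")
    data = payload["data"]
    return {
        "title": data["metadata"]["title"],
        "markdown": data["markdown"],
        "language": data["metadata"]["language"],  # JSON null becomes None
    }


print(summarize_scrape(SCRAPE_RESPONSE))
```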

Search (Beta)

Used to search the web, get the most relevant results, scrape each page, and return the data as Markdown.

# Set fetchPageContent to false to return search-engine results quickly.
curl -X POST https://api.firecrawl.dev/v0/search \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "query": "firecrawl",
      "pageOptions": {
        "fetchPageContent": true
      }
    }'

{
  "success": true,
  "data": [
    {
      "url": "https://",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www./"
      }
    }
  ]
}

Intelligent extraction (Beta)

Used to extract structured data from scraped pages.

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://www./",
      "extractorOptions": {
        "mode": "llm-extraction",
        "extractionPrompt": "Based on the information on the page, extract the information from the schema. ",
        "extractionSchema": {
          "type": "object",
          "properties": {
            "company_mission": {
              "type": "string"
            },
            "supports_sso": {
              "type": "boolean"
            },
            "is_open_source": {
              "type": "boolean"
            },
            "is_in_yc": {
              "type": "boolean"
            }
          },
          "required": [
            "company_mission",
            "supports_sso",
            "is_open_source",
            "is_in_yc"
          ]
        }
      }
    }'

{
  "success": true,
  "data": {
    "content": "Raw Content",
    "metadata": {
      "title": "Mendable",
      "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "robots": "follow, index",
      "ogTitle": "Mendable",
      "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "ogUrl": "https:///",
      "ogImage": "https:///mendable_new_og1.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Mendable",
      "sourceURL": "https:///"
    },
    "llm_extraction": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    }
  }
}
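Since the extracted fields come from an LLM, it can be worth checking them against the schema before using them. The sketch below is a deliberately small check that covers only the "string" and "boolean" types and the "required" list used in this example. It is not a full JSON Schema validator, and the check_extraction helper is a hypothetical name.

```python
# Schema copied from the request in this article.
SCHEMA = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
        "is_open_source": {"type": "boolean"},
        "is_in_yc": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"],
}

# Only the two JSON Schema types that appear in SCHEMA are handled.
TYPE_MAP = {"string": str, "boolean": bool}


def check_extraction(result, schema):
    """Return a list of problems; an empty list means the result fits."""
    problems = []
    for name in schema.get("required", []):
        if name not in result:
            problems.append(f"missing: {name}")
    for name, spec in schema["properties"].items():
        if name in result and not isinstance(result[name], TYPE_MAP[spec["type"]]):
            problems.append(f"wrong type: {name}")
    return problems


extraction = {
    "company_mission": "Train a secure AI on your technical resources",
    "supports_sso": True,
    "is_open_source": False,
    "is_in_yc": True,
}
print(check_extraction(extraction, SCHEMA))
```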

Using the Python SDK

Installing the Python SDK

pip install firecrawl-py

Crawling a website

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key='YOUR_API_KEY')

crawl_result = app.crawl_url('', {'crawlerOptions': {'excludes': ['blog/*']}})

# Get the Markdown content of each crawled page
for result in crawl_result:
    print(result['markdown'])

Scraping a URL

To scrape a single URL, use the scrape_url method. It takes the URL as a parameter and returns the scraped data as a dictionary.

url = 'https://'
scraped_data = app.scrape_url(url)

Extracting structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. We support Pydantic models to make this easier. Here is how to use it:

from typing import List

from pydantic import BaseModel, Field

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    # max_length is the Pydantic v2 name for the older max_items constraint
    top: List[ArticleSchema] = Field(..., max_length=5, description='Top 5 stories')

# app is the FirecrawlApp instance created earlier
data = app.scrape_url('https://news.', {
    'extractorOptions': {
        'extractionSchema': TopArticlesSchema.model_json_schema(),
        'mode': 'llm-extraction'
    },
    'pageOptions': {
        'onlyMainContent': True
    }
})
print(data['llm_extraction'])

Search queries

Runs a web search, gets the top results, extracts the data from each page, and returns their content as Markdown.

query = 'What is Mendable?'
search_result = app.search(query)

Using the Node SDK

Installation

To install the Firecrawl Node SDK, use npm:

npm install @mendable/firecrawl-js

Usage

1. Get an API key from firecrawl.dev.
2. Set the API key as the environment variable FIRECRAWL_API_KEY, or pass it as a parameter to the FirecrawlApp class.
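The two ways of supplying the key can be pictured as a simple lookup. The sketch below is not the SDK's own code: the resolve_api_key helper is hypothetical, and the choice to let an explicitly passed key take precedence over the environment variable is an assumption.

```python
import os


def resolve_api_key(explicit=None, env=os.environ):
    """Pick an explicitly passed key first, else FIRECRAWL_API_KEY from env."""
    key = explicit or env.get("FIRECRAWL_API_KEY")
    if not key:
        raise ValueError(
            "No API key: pass one explicitly or set FIRECRAWL_API_KEY"
        )
    return key


# A plain dict stands in for the real environment in this demo.
print(resolve_api_key(env={"FIRECRAWL_API_KEY": "fc-from-env"}))
print(resolve_api_key("fc-explicit", env={"FIRECRAWL_API_KEY": "fc-from-env"}))
```

Reading the key from the environment keeps it out of source control, which is usually the safer default.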

Scraping a URL

To scrape a single URL with error handling, use the scrapeUrl method. It takes the URL as a parameter and returns the scraped data as a dictionary.

try {
    const url = 'https://';
    const scrapedData = await app.scrapeUrl(url);
    console.log(scrapedData);
} catch (error) {
    console.error(
        'Error occurred while scraping:',
        error.message
    );
}

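Crawling a website

To crawl a website with error handling, use the crawlUrl method. It takes the starting URL and optional parameters as arguments. The params argument lets you specify additional options for the crawl job, such as the maximum number of pages to crawl, the allowed domains, and the output format.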

const crawlUrl = 'https://';
const params = {
    crawlerOptions: {
        excludes: ['blog/'],
        includes: [], // leave empty to include all pages
        limit: 1000,
    },
    pageOptions: {
        onlyMainContent: true
    }
};
const waitUntilDone = true;
const timeout = 5;
const crawlResult = await app.crawlUrl(
    crawlUrl,
    params,
    waitUntilDone,
    timeout
);

Checking crawl status

To check the status of a crawl job with error handling, use the checkCrawlStatus method. It takes the job ID as a parameter and returns the current status of the crawl job.

const status = await app.checkCrawlStatus(jobId);
console.log(status);

Extracting structured data from a URL

With LLM extraction, you can easily extract structured data from any URL. We support zod schemas to make this easier. Here is how to use it:

import FirecrawlApp from '@mendable/firecrawl-js';
import { z } from 'zod';

const app = new FirecrawlApp({
    apiKey: 'fc-YOUR_API_KEY',
});

// Define the schema for the content to extract
const schema = z.object({
    top: z
        .array(
            z.object({
                title: z.string(),
                points: z.number(),
                by: z.string(),
                commentsURL: z.string(),
            })
        )
        .length(5)
        .describe('Top 5 stories on Hacker News'),
});

const scrapeResult = await app.scrapeUrl('https://news.', {
    extractorOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data['llm_extraction']);

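Search queries

With the search method, you can run a query against a search engine and get the top results along with the page content of each result. The method takes the query as a parameter and returns the search results.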

const query = 'what is mendable?';
const searchResults = await app.search(query, {
    pageOptions: {
        fetchPageContent: true // fetch the page content of each search result
    }
});

Reference link:
[1] https://github.com/mendableai/firecrawl
