diff --git a/README-zh.md b/README-zh.md
index 9ed4c2ae..57f4af47 100644
--- a/README-zh.md
+++ b/README-zh.md
@@ -193,37 +193,43 @@ Redis是非常受欢迎的Key-Value数据库，在Crawlab中主要实现节点
 
 ## 与其他框架的集成
 
+[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) 提供了一些 `helper` 方法，让您的爬虫更好地集成到 Crawlab 中，例如保存结果数据到 Crawlab 中等。
+
+### 集成 Scrapy
+
+在 `settings.py` 中找到 `ITEM_PIPELINES`（`dict` 类型的变量），在其中添加如下内容。
+
+```python
+ITEM_PIPELINES = {
+    'crawlab.pipelines.CrawlabMongoPipeline': 888,
+}
+```
+
+然后，启动 Scrapy 爬虫。运行完成之后，您就应该能看到抓取结果出现在 **任务详情-结果** 里。
+
+### 通用 Python 爬虫
+
+将下列代码加入到您爬虫中的结果保存部分。
+
+```python
+# 引入保存结果方法
+from crawlab import save_item
+
+# 这是一个结果，需要为 dict 类型
+result = {'name': 'crawlab'}
+
+# 调用保存结果方法
+save_item(result)
+```
+
+然后，启动爬虫。运行完成之后，您就应该能看到抓取结果出现在 **任务详情-结果** 里。
+
+### 其他框架和语言
+
 爬虫任务本质上是由一个shell命令来实现的。任务ID将以环境变量`CRAWLAB_TASK_ID`的形式存在于爬虫任务运行的进程中，并以此来关联抓取数据。另外，`CRAWLAB_COLLECTION`是Crawlab传过来的所存放collection的名称。
 
 在爬虫程序中，需要将`CRAWLAB_TASK_ID`的值以`task_id`为字段，随抓取数据一同存入`CRAWLAB_COLLECTION`对应的collection中。这样Crawlab就知道如何将爬虫任务与抓取数据关联起来。当前，Crawlab只支持MongoDB。
 
-### 集成Scrapy
-
-以下是Crawlab跟Scrapy集成的例子，利用了Crawlab传过来的task_id和collection_name。
-
-```python
-import os
-from pymongo import MongoClient
-
-MONGO_HOST = '192.168.99.100'
-MONGO_PORT = 27017
-MONGO_DB = 'crawlab_test'
-
-# scrapy example in the pipeline
-class JuejinPipeline(object):
-    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
-    db = mongo[MONGO_DB]
-    col_name = os.environ.get('CRAWLAB_COLLECTION')
-    if not col_name:
-        col_name = 'test'
-    col = db[col_name]
-
-    def process_item(self, item, spider):
-        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
-        self.col.save(item)
-        return item
-```
-
 ## 与其他框架比较
 
 现在已经有一些爬虫管理框架了，因此为啥还要用Crawlab？
diff --git a/README.md b/README.md
index e90df67f..3f6224a8 100644
--- a/README.md
+++ b/README.md
@@ -192,35 +192,43 @@ Frontend is a SPA based on
 
 ## Integration with Other Frameworks
 
-A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
+[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides some `helper` methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.
+
+⚠️Note: make sure you have already installed `crawlab-sdk` using pip.
 
 ### Scrapy
 
-Below is an example to integrate Crawlab with Scrapy in pipelines.
+In `settings.py` in your Scrapy project, find the variable named `ITEM_PIPELINES` (a `dict` variable) and add the content below.
 
 ```python
-import os
-from pymongo import MongoClient
-
-MONGO_HOST = '192.168.99.100'
-MONGO_PORT = 27017
-MONGO_DB = 'crawlab_test'
-
-# scrapy example in the pipeline
-class JuejinPipeline(object):
-    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
-    db = mongo[MONGO_DB]
-    col_name = os.environ.get('CRAWLAB_COLLECTION')
-    if not col_name:
-        col_name = 'test'
-    col = db[col_name]
-
-    def process_item(self, item, spider):
-        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
-        self.col.save(item)
-        return item
+ITEM_PIPELINES = {
+    'crawlab.pipelines.CrawlabMongoPipeline': 888,
+}
 ```
 
+Then, start the Scrapy spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**.
+
+### General Python Spider
+
+Please add the content below to your spider files to save results.
+
+```python
+# import result saving method
+from crawlab import save_item
+
+# this is a result record, must be dict type
+result = {'name': 'crawlab'}
+
+# call result saving method
+save_item(result)
+```
+
+Then, start the spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**.
+
+### Other Frameworks / Languages
+
+A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of an environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
+
 ## Comparison with Other Frameworks
 
 There are existing spider management frameworks. So why use Crawlab?
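As a reference for the **Other Frameworks / Languages** route described above, here is a minimal sketch of how a spider written without the SDK might use those two environment variables. It assumes `pymongo` is installed and reuses the placeholder MongoDB connection details from the removed pipeline example; adjust host, port, and database name for your own deployment.

```python
import os
from pymongo import MongoClient

# Environment variables injected by Crawlab into the task process
task_id = os.environ.get('CRAWLAB_TASK_ID')
col_name = os.environ.get('CRAWLAB_COLLECTION') or 'test'

# Placeholder connection details -- replace with your own MongoDB settings
client = MongoClient(host='192.168.99.100', port=27017)
col = client['crawlab_test'][col_name]

# Tag each scraped record with the task ID so Crawlab can relate it to the task
record = {'name': 'crawlab', 'task_id': task_id}
col.insert_one(record)
```

The same pattern applies to any language or framework: read the two environment variables, write results into the named collection, and include a `task_id` field on every record (currently, only MongoDB is supported).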