updated README.md
README-zh.md
@@ -193,37 +193,43 @@ Redis is a very popular key-value database; in Crawlab it is mainly used for node
## Integration with Other Frameworks

[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides some `helper` methods to make it easier to integrate your spiders with Crawlab, e.g. saving result data to Crawlab.

### Scrapy Integration

In `settings.py`, find the variable `ITEM_PIPELINES` (a `dict`) and add the following to it.

```python
ITEM_PIPELINES = {
    'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
```

Then, start the Scrapy spider. After it finishes, you should be able to see the scraped results in **Task Detail -> Result**.
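
For context, here is a minimal sketch of a Scrapy spider whose yielded items would be persisted by the pipeline registered above; the class name, spider name, and URL are hypothetical placeholders.

```python
import scrapy


class QuoteSpider(scrapy.Spider):
    # Hypothetical spider: name and start URL are placeholders.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Each yielded dict passes through ITEM_PIPELINES,
        # so the Crawlab pipeline can save it as a result record.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
```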
### General Python Spider

Add the following code to the result-saving part of your spider.

```python
# import the result-saving method
from crawlab import save_item

# this is a result record, which must be a dict
result = {'name': 'crawlab'}

# call the result-saving method
save_item(result)
```

Then, start the spider. After it finishes, you should be able to see the scraped results in **Task Detail -> Result**.
### Other Frameworks and Languages

A crawling task is essentially executed through a shell command. The task ID is passed to the spider process as the environment variable `CRAWLAB_TASK_ID` and is used to associate the scraped data with the task. In addition, `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection in which to store the results.

In your spider, store the value of `CRAWLAB_TASK_ID` in a `task_id` field of each record saved to the `CRAWLAB_COLLECTION` collection. This is how Crawlab associates a spider task with its scraped data. Currently, Crawlab only supports MongoDB.
### Scrapy Integration

Below is an example of integrating Crawlab with Scrapy, using the `task_id` and collection name passed by Crawlab.

```python
import os
from pymongo import MongoClient

MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

# Scrapy pipeline example
class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        # tag the record with the task ID so Crawlab can relate it to the task
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        # insert_one replaces the deprecated Collection.save of older pymongo versions
        self.col.insert_one(dict(item))
        return item
```

## Comparison with Other Frameworks

There are already some spider management frameworks out there, so why use Crawlab?
README.md
@@ -192,35 +192,43 @@ Frontend is a SPA based on
## Integration with Other Frameworks

[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides some `helper` methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.

⚠️ Note: make sure you have already installed `crawlab-sdk` using pip.

### Scrapy

In `settings.py` in your Scrapy project, find the variable named `ITEM_PIPELINES` (a `dict` variable) and add the content below.

```python
ITEM_PIPELINES = {
    'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
```

Below is an example of integrating Crawlab with Scrapy in pipelines, using the `task_id` and collection name passed by Crawlab.

```python
import os
from pymongo import MongoClient

MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

# Scrapy pipeline example
class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        # tag the record with the task ID so Crawlab can relate it to the task
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        # insert_one replaces the deprecated Collection.save of older pymongo versions
        self.col.insert_one(dict(item))
        return item
```

Then, start the Scrapy spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**.
### General Python Spider

Please add the content below to your spider files to save results.

```python
# import result saving method
from crawlab import save_item

# this is a result record, must be dict type
result = {'name': 'crawlab'}

# call result saving method
save_item(result)
```

Then, start the spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**.
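
For a slightly fuller picture, here is a hedged sketch of a general Python spider that combines `requests` with `save_item`; the target URL and field names are made up for illustration.

```python
import requests

from crawlab import save_item

# Hypothetical listing endpoint; replace with the site you actually crawl.
START_URL = 'https://example.com/api/products'


def crawl():
    resp = requests.get(START_URL, timeout=10)
    resp.raise_for_status()
    # Assume the endpoint returns a JSON array of product objects.
    for product in resp.json():
        # Each record passed to save_item must be a dict.
        save_item({
            'name': product.get('name'),
            'price': product.get('price'),
        })


if __name__ == '__main__':
    crawl()
```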
### Other Frameworks / Languages

A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process as an environment variable named `CRAWLAB_TASK_ID`; this is how the scraped data is related to a task. In addition, the environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection in which to store the results data.
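
As a sketch of what this looks like outside Scrapy, the snippet below reads both environment variables and writes one record to MongoDB with `pymongo`; the MongoDB address, database name, and result fields are assumptions for illustration.

```python
import os

from pymongo import MongoClient

# Assumed MongoDB connection details; adjust to your deployment.
client = MongoClient(host='localhost', port=27017)
db = client['crawlab_test']

# Collection name and task ID are provided by Crawlab via environment variables.
col_name = os.environ.get('CRAWLAB_COLLECTION', 'test')
task_id = os.environ.get('CRAWLAB_TASK_ID')

# A scraped record; 'task_id' is what relates it to the Crawlab task.
record = {'name': 'crawlab', 'task_id': task_id}
db[col_name].insert_one(record)
```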
## Comparison with Other Frameworks

There are existing spider management frameworks. So why use Crawlab?