updated README

This commit is contained in:
Marvin Zhang
2019-06-20 12:42:10 +08:00
parent 212c291a05
commit 312ba656cd
2 changed files with 32 additions and 55 deletions

View File

@@ -10,7 +10,7 @@
Celery-based distributed crawler management platform that supports multiple programming languages and multiple spider frameworks.
[View Demo](http://114.67.75.98:8080) | [Documentation](https://tikazyq.github.io/crawlab)
[View Demo](http://114.67.75.98:8080) | [Documentation](https://tikazyq.github.io/crawlab-docs)
## Requirements
- Python 3.6+
@@ -29,7 +29,7 @@
#### Home Page
![](https://user-gold-cdn.xitu.io/2019/3/6/169524d4c7f117f7?imageView2/0/w/1280/h/960/format/webp/ignore-error/1)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/home.png)
#### Spider List
@@ -37,12 +37,20 @@
#### Spider Detail - Overview
![](https://user-gold-cdn.xitu.io/2019/3/6/169524e0794d6be1?imageView2/0/w/1280/h/960/format/webp/ignore-error/1)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/spider-detail-overview.png)
#### Spider Detail - Analytics
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/spider-detail-analytics.png)
#### Task Detail - Results
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/task-detail-results.png)
#### Cron Schedule
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/schedule-generate-cron.png)
## Architecture
Crawlab's architecture is very similar to Celery's, but adds extra modules, including the frontend, spiders, and Flower, to support spider management. The architecture diagram is shown below.

View File

@@ -10,7 +10,7 @@
Celery-based web crawler admin platform for managing distributed web spiders regardless of languages and frameworks.
[Demo](http://114.67.75.98:8080) | [Documentation](https://tikazyq.github.io/crawlab)
[Demo](http://114.67.75.98:8080) | [Documentation](https://tikazyq.github.io/crawlab-docs)
## Prerequisites
- Python 3.6+
@@ -20,49 +20,42 @@ Celery-based web crawler admin platform for managing distributed web spiders reg
## Installation
```bash
# install the requirements for backend
pip install -r requirements.txt
```
```bash
# install frontend node modules
cd frontend
npm install
```
## Configure
Please edit the configuration file `config.py` to configure the API and database connections.
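As an illustration, a `config.py` might look like the sketch below. The exact setting names (other than `PROJECT_SOURCE_FILE_FOLDER`, which is referenced later in this README) are assumptions; check the shipped `config.py` for the real keys.

```python
# config.py -- minimal sketch; setting names other than
# PROJECT_SOURCE_FILE_FOLDER are illustrative assumptions.
FLASK_HOST = '0.0.0.0'                   # API bind address
FLASK_PORT = 8000                        # API port
MONGO_HOST = 'localhost'                 # MongoDB for spider/task metadata
MONGO_PORT = 27017
BROKER_URL = 'redis://localhost:6379/0'  # Celery message broker
PROJECT_SOURCE_FILE_FOLDER = '/path/to/spiders'  # where spider projects live
```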
## Quick Start
```bash
python manage.py serve
```
Three methods:
1. [Docker](https://tikazyq.github.io/crawlab/Installation/Docker.md) (Recommended)
2. [Direct Deploy](https://tikazyq.github.io/crawlab/Installation/Direct.md)
3. [Preview](https://tikazyq.github.io/crawlab/Installation/Direct.md) (Quick start)
## Screenshot
#### Home Page
![](https://user-gold-cdn.xitu.io/2019/3/6/169524d4c7f117f7?imageView2/0/w/1280/h/960/format/webp/ignore-error/1)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/home.png)
#### Spider List
![](https://user-gold-cdn.xitu.io/2019/3/6/169524daf9c8ccef?imageView2/0/w/1280/h/960/format/webp/ignore-error/1)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/spider-list.png)
#### Spider Detail - Overview
![](https://user-gold-cdn.xitu.io/2019/3/6/169524e0794d6be1?imageView2/0/w/1280/h/960/format/webp/ignore-error/1)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/spider-detail-overview.png)
#### Spider Detail - Analytics
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/spider-detail-analytics.png)
#### Task Detail - Results
![](https://user-gold-cdn.xitu.io/2019/3/6/169524e4064c7f0a?imageView2/0/w/1280/h/960/format/webp/ignore-error/1)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/task-detail-results.png)
#### Cron Schedule
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/schedule-generate-cron.png)
## Architecture
Crawlab's architecture is very similar to Celery's, but a few more modules, including the frontend, spiders, and Flower, are added to support the crawling management functionality.
![crawlab-architecture](./docs/img/crawlab-architecture.png)
![](https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/architecture.png)
### Nodes
@@ -70,16 +63,7 @@ Nodes are actually the workers defined in Celery. A node is running and connecte
### Spiders
##### Auto Discovery
In the `config.py` file, set `PROJECT_SOURCE_FILE_FOLDER` to the directory where the spider projects are located. The web app will discover spider projects automatically. How simple is that!
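Conceptually, auto discovery could be as simple as the sketch below: scan `PROJECT_SOURCE_FILE_FOLDER` and treat each subdirectory as a spider project. Crawlab's actual discovery logic may differ; this only illustrates the idea.

```python
# Sketch of directory-based spider discovery (illustrative, not Crawlab's code).
import os

def discover_spiders(folder):
    """Return the names of candidate spider projects under `folder`."""
    if not os.path.isdir(folder):
        return []
    # Each subdirectory is treated as one spider project.
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isdir(os.path.join(folder, name))
    )
```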
##### Deploy Spiders
All spiders need to be deployed to a specific node before crawling. Simply click the "Deploy" button on the spider detail page and the spiders will be deployed to all active nodes.
##### Run Spiders
After deploying the spider, you can click the "Run" button on the spider detail page and select a specific node to start crawling. This triggers a crawling task, which you can view in detail on the tasks page.
The spider source code and configured crawling rules are stored on the `App`, and need to be deployed to each `worker` node.
### Tasks
@@ -146,26 +130,11 @@ Crawlab is easy to use, general enough to adapt spiders in any language and any
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | Admin Platform | Y | Y | Y
| [Scrapyd](https://github.com/scrapy/scrapyd) | Web Service | Y | N | N/A
## TODOs
##### Backend
- [ ] File Management
- [ ] MySQL Database Support
- [ ] Task Restart
- [ ] Node Monitoring
- [ ] More spider examples
##### Frontend
- [x] Task Stats/Analytics
- [x] Table Filters
- [x] Multi-Language Support (中文)
- [ ] Login & User Management
- [ ] General Search
## Community & Sponsorship
If you feel Crawlab could benefit your daily work or your company, please add the author's WeChat account with the note "Crawlab" to join the discussion group. Or scan the Alipay QR code below to give us a reward to upgrade our teamwork software or buy us a coffee.
<p align="center">
<img src="https://user-gold-cdn.xitu.io/2019/3/15/169814cbd5e600e9?imageslim" height="360">
<img src="https://raw.githubusercontent.com/tikazyq/crawlab/master/docs/img/payment.jpg" height="360">
<img src="https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/qrcode.png" height="360">
<img src="https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/payment.jpg" height="360">
</p>