Crawlab

Celery-based web crawler admin platform for managing distributed web spiders, regardless of language or framework.

Prerequisites

  • Python3
  • MongoDB
  • Redis

Installation

# install the requirements for backend
pip install -r ./crawlab/requirements.txt
# install frontend node modules
cd frontend
npm install

Configure

Please edit the configuration file config.py to configure the API and database connections.
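
For reference, a minimal config.py might look like the sketch below; apart from PROJECT_SOURCE_FILE_FOLDER, the setting names are assumptions for illustration and may differ in the actual file.

# config.py - illustrative sketch, not the actual shipped file
# directory containing your spider projects (see Auto Discovery below)
PROJECT_SOURCE_FILE_FOLDER = '/path/to/spiders'
# MongoDB connection used to store spiders, tasks and results (assumed setting names)
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'crawlab'
# Redis connection used as the Celery broker (assumed setting name)
BROKER_URL = 'redis://localhost:6379/0'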

Quick Start

# run all services
python manage.py run_all
# run frontend client
cd frontend
npm run dev

Screenshots

  • Home Page
  • Spider List
  • Spider Detail - Overview
  • Task Detail - Results

Architecture

The architecture of Crawlab is shown below. It is very similar to the Celery architecture, but adds a few more modules, including Frontend, Spiders and Flower, to support the crawling management functionality.

(architecture diagram: crawlab-architecture)

Nodes

Nodes are the workers defined in Celery. A running node is connected to a task queue (Redis, for example) from which it receives and executes tasks. Because spiders need to be deployed to the nodes, users should specify node IP addresses and ports before deployment.
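
As a rough sketch of what a node does under the hood, the snippet below shows a minimal Celery worker connected to Redis as its broker; the module, broker URL and task name are illustrative, not Crawlab's actual code.

# tasks.py - minimal Celery worker sketch (illustrative)
from celery import Celery

# the node connects to Redis, which acts as the task queue (broker)
app = Celery('crawlab_node', broker='redis://localhost:6379/0')

@app.task
def run_spider(spider_name):
    # a real node would launch the deployed spider here
    print('running spider:', spider_name)

# start the worker on the node with:
#   celery -A tasks worker --loglevel=info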

Spiders

Auto Discovery

In the config.py file, set PROJECT_SOURCE_FILE_FOLDER to the directory where the spider projects are located. The web app will then discover spider projects automatically.
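
Conceptually, discovery just scans that folder for project directories, roughly as in the simplified sketch below (an illustration, not Crawlab's actual implementation).

import os
from config import PROJECT_SOURCE_FILE_FOLDER

def discover_spider_projects():
    # every sub-directory of the configured folder is treated as a spider project
    return [
        name for name in os.listdir(PROJECT_SOURCE_FILE_FOLDER)
        if os.path.isdir(os.path.join(PROJECT_SOURCE_FILE_FOLDER, name))
    ]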

Deploy Spiders

All spiders need to be deployed to a specific node before crawling. Simply click the "Deploy" button on the spider detail page and select the appropriate node for the deployment.

Run Spiders

After deploying the spider, you can click the "Run" button on the spider detail page and select a specific node to start crawling. This triggers a crawling task, which you can inspect in detail on the tasks page.
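
Under the hood, running a spider amounts to dispatching a Celery task to the queue consumed by the chosen node, along the lines of the sketch below; the task name and queue name are illustrative, not Crawlab's actual API.

from celery import Celery

app = Celery('crawlab', broker='redis://localhost:6379/0')

# send a crawl task to the queue of the selected node (names are illustrative)
result = app.send_task('tasks.run_spider', args=['my_spider'], queue='node-192.168.0.2')
print(result.id)  # the task id that the tasks page would track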

Tasks

Tasks are triggered and run by the workers. Users can check task status information and logs on the task detail page.
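
For example, given a task id, its state can be looked up from Celery directly, as in the simplified sketch below (a result backend must be configured; the names are illustrative).

from celery import Celery
from celery.result import AsyncResult

app = Celery('crawlab', broker='redis://localhost:6379/0', backend='redis://localhost:6379/0')

res = AsyncResult('the-task-id', app=app)
print(res.state)  # e.g. PENDING, STARTED, SUCCESS or FAILURE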

App

The app is the web application that exposes the API used by the frontend and manages spiders, nodes and tasks.

Broker

The broker is the same concept as the Celery broker: the task queue from which the worker nodes consume tasks. Redis serves this role.
