# Crawlab

Celery-based web crawler admin platform for managing distributed web spiders, regardless of language or framework.
## Prerequisites
- Python3
- MongoDB
- Redis
## Installation

```bash
# install the requirements for the backend
pip install -r ./crawlab/requirements.txt

# install the dependencies for the frontend
cd frontend
npm install
```
## Configure

Please edit the configuration file `config.py` to configure the API and database connections.
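As a rough illustration, `config.py` needs at least the API server address plus the MongoDB and Redis connections. The sketch below is only an assumption of what such a file might look like; the setting names are not guaranteed to match your copy, so check them against the actual file.

```python
# config.py -- illustrative sketch only; the setting names below are assumptions.

# Flask API server
FLASK_HOST = '0.0.0.0'
FLASK_PORT = 8000

# MongoDB, used to store spiders, tasks and crawled results
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

# Redis, used as the Celery broker / task queue
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Celery broker and result backend built from the settings above
BROKER_URL = 'redis://{}:{}/0'.format(REDIS_HOST, REDIS_PORT)
CELERY_RESULT_BACKEND = 'mongodb://{}:{}/'.format(MONGO_HOST, MONGO_PORT)
```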
## Quick Start

```bash
# run all services
python manage.py run_all

# run frontend client
cd frontend
npm run dev
```
## Architecture

Crawlab's architecture is very similar to that of Celery, with a few additional modules, including the Frontend, Spiders and Flower, added to provide the crawling management functionality.
### Nodes

Nodes are essentially the workers defined in Celery. A node runs connected to a task queue, Redis for example, from which it receives and executes tasks. Since spiders need to be deployed to the nodes, users should specify their IP addresses and ports before deployment.
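Conceptually, a node is just a Celery worker process pointed at the shared task queue. The following is a generic Celery sketch, not Crawlab's actual code; the module name, broker URL and task are made up for illustration.

```python
# node_app.py -- generic Celery sketch of what a "node" is; not Crawlab's real code.
from celery import Celery

# The worker connects to the shared task queue (Redis here; the URL is an assumption).
app = Celery('crawlab_node', broker='redis://localhost:6379/0')

@app.task
def run_spider(spider_name):
    # In Crawlab, a task like this would launch the spider deployed on this node.
    print('running spider: {}'.format(spider_name))
```

Starting `celery -A node_app worker` on a machine then turns it into a worker that consumes tasks from the shared Redis queue, which is roughly what a Crawlab node does.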
### Spiders

#### Auto Discovery

In the `config.py` file, set `PROJECT_SOURCE_FILE_FOLDER` to the directory where the spider projects are located. The web app will discover spider projects automatically.
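For instance, assuming the spider projects sit side by side as subdirectories of a single folder (the path and project names below are purely illustrative):

```python
# In config.py: point Crawlab at the folder holding the spider projects.
PROJECT_SOURCE_FILE_FOLDER = '/home/user/spiders'

# Assumed layout -- each subdirectory would be discovered as one spider project:
#   /home/user/spiders/news_spider/
#   /home/user/spiders/product_spider/
```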
#### Deploy Spiders

All spiders need to be deployed to a specific node before crawling. Simply click the "Deploy" button on the spider detail page and select the right node for the deployment.
#### Run Spiders

After deploying the spider, you can click the "Run" button on the spider detail page and select a specific node to start crawling. This triggers a crawling task, which you can inspect in detail on the tasks page.
### Tasks

Tasks are triggered and executed by the workers. Users can check the task status and logs on the task detail page.
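Under the hood this corresponds to the usual Celery pattern of sending a task to the queue and inspecting its state afterwards. A generic sketch, not Crawlab's actual API, reusing the `node_app` example from the Nodes section:

```python
# Generic Celery usage, not Crawlab's API: send a task and check on it later.
from node_app import app  # the sketch Celery app from the Nodes section

result = app.send_task('node_app.run_spider', args=['news_spider'])
print(result.id)     # the task id, which a UI like Crawlab's would list on the tasks page
print(result.state)  # e.g. PENDING, SUCCESS or FAILURE (requires a result backend)
```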
