Crawlab

Celery-based web crawler admin platform for managing distributed web spiders, regardless of language or framework.

Prerequisites

  • Python3
  • MongoDB
  • Redis

Installation

# install the requirements for backend
pip install -r ./crawlab/requirements.txt
# install frontend node modules
cd frontend
npm install

Configure

Please edit the configuration file config.py to configure the API and database connections.
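
For reference, a minimal config.py might look like the sketch below; apart from PROJECT_SOURCE_FILE_FOLDER, the setting names are assumptions for illustration and may differ in the actual file.

# config.py - illustrative sketch, not the actual shipped file
# directory containing your spider projects (see Auto Discovery below)
PROJECT_SOURCE_FILE_FOLDER = '/path/to/spiders'
# MongoDB connection used to store spiders, tasks and results (assumed setting names)
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'crawlab'
# Redis connection used as the Celery broker (assumed setting name)
BROKER_URL = 'redis://localhost:6379/0'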

Quick Start

# run all services
python manage.py run_all
# run frontend client
cd frontend
npm run dev

Screenshots

  • Home Page
  • Spider List
  • Spider Detail - Overview
  • Task Detail - Results

Architecture

The architecture of Crawlab is shown below. It is very similar to the Celery architecture, but adds a few more modules, including Frontend, Spiders and Flower, to support the crawling management functionality.

(architecture diagram: crawlab-architecture)

Nodes

Nodes are the workers defined in Celery. A running node is connected to a task queue (Redis, for example) from which it receives and executes tasks. Because spiders need to be deployed to the nodes, users should specify node IP addresses and ports before deployment.
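
As a rough sketch of what a node does under the hood, the snippet below shows a minimal Celery worker connected to Redis as its broker; the module, broker URL and task name are illustrative, not Crawlab's actual code.

# tasks.py - minimal Celery worker sketch (illustrative)
from celery import Celery

# the node connects to Redis, which acts as the task queue (broker)
app = Celery('crawlab_node', broker='redis://localhost:6379/0')

@app.task
def run_spider(spider_name):
    # a real node would launch the deployed spider here
    print('running spider:', spider_name)

# start the worker on the node with:
#   celery -A tasks worker --loglevel=info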

Spiders

Auto Discovery

In the config.py file, set PROJECT_SOURCE_FILE_FOLDER to the directory where the spider projects are located. The web app will then discover spider projects automatically.
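
Conceptually, discovery just scans that folder for project directories, roughly as in the simplified sketch below (an illustration, not Crawlab's actual implementation).

import os
from config import PROJECT_SOURCE_FILE_FOLDER

def discover_spider_projects():
    # every sub-directory of the configured folder is treated as a spider project
    return [
        name for name in os.listdir(PROJECT_SOURCE_FILE_FOLDER)
        if os.path.isdir(os.path.join(PROJECT_SOURCE_FILE_FOLDER, name))
    ]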

Deploy Spiders

All spiders need to be deployed to a specific node before crawling. Simply click the "Deploy" button on the spider detail page and select the appropriate node for the deployment.

Run Spiders

After deploying the spider, you can click the "Run" button on the spider detail page and select a specific node to start crawling. This triggers a crawling task, which you can inspect in detail on the tasks page.
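
Under the hood, running a spider amounts to dispatching a Celery task to the queue consumed by the chosen node, along the lines of the sketch below; the task name and queue name are illustrative, not Crawlab's actual API.

from celery import Celery

app = Celery('crawlab', broker='redis://localhost:6379/0')

# send a crawl task to the queue of the selected node (names are illustrative)
result = app.send_task('tasks.run_spider', args=['my_spider'], queue='node-192.168.0.2')
print(result.id)  # the task id that the tasks page would track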

Tasks

Tasks are triggered and run by the workers. Users can check task status information and logs on the task detail page.
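
For example, given a task id, its state can be looked up from Celery directly, as in the simplified sketch below (a result backend must be configured; the names are illustrative).

from celery import Celery
from celery.result import AsyncResult

app = Celery('crawlab', broker='redis://localhost:6379/0', backend='redis://localhost:6379/0')

res = AsyncResult('the-task-id', app=app)
print(res.state)  # e.g. PENDING, STARTED, SUCCESS or FAILURE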

App

The app is the web application that exposes the API used by the frontend and manages spiders, nodes and tasks.

Broker

The broker is the same concept as the Celery broker: the task queue from which the worker nodes consume tasks. Redis serves this role.
