From d26f43e09eac4648ec50d9d284b4541bc6212c03 Mon Sep 17 00:00:00 2001
From: Marvin Zhang
Date: Sat, 2 Mar 2019 10:07:05 +0800
Subject: [PATCH] updated README.md

---
 README.md | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 1d60f103..20da8ba3 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,54 @@
-# crawlab
-Centralized admin platform for managing web crawlers including scrapy, pyspider, webmagic and much more
+# Crawlab
+Celery-based admin platform for managing distributed web spiders, regardless of language or framework.
+
+## Prerequisites
+- Python3
+- MongoDB
+- Redis
+
+## Installation
+
+```bash
+pip install -r requirements.txt
+```
+
+## Configure
+
+Edit the configuration file `config.py` to set up the API and database connections.
+
+## Quick Start
+```bash
+# run web app
+python app.py
+
+# run flower app
+python ./bin/run_flower.py
+
+# run worker
+python ./bin/run_worker.py
+```
+
+```bash
+# TODO: frontend
+```
+
+## Nodes
+
+Nodes are the workers defined in Celery. A running node connects to a task queue (Redis, for example) to receive and execute tasks. Because spiders are deployed to nodes, users should specify node IP addresses and ports before deployment.
+
+## Spiders
+
+#### Auto Discovery
+In `config.py`, set `PROJECT_SOURCE_FILE_FOLDER` to the directory where your spider projects are located. The web app will discover spider projects automatically.
+
+#### Deploy Spiders
+
+Every spider must be deployed to a specific node before crawling. Simply click the "Deploy" button on the spider detail page and select the target node.
+
+#### Run Spiders
+
+After deploying a spider, click the "Run" button on the spider detail page and select a node to start crawling. This triggers a crawling task, which you can inspect in detail on the tasks page.
+
+## Tasks
+
+Tasks are triggered and run by the workers. Users can check task status and logs on the task detail page.
\ No newline at end of file
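
The Configure step above edits `config.py`, but the patch does not show that file's contents. A minimal sketch of what such a configuration might hold, with the caveat that only `PROJECT_SOURCE_FILE_FOLDER` is named in the README; every other variable name here is a hypothetical placeholder:

```python
# Hypothetical sketch of config.py; only PROJECT_SOURCE_FILE_FOLDER is
# named in the README, all other names below are assumptions.

# Directory scanned by the web app for spider projects (Auto Discovery)
PROJECT_SOURCE_FILE_FOLDER = '/path/to/spider/projects'

# MongoDB connection, assumed to store spider and task records
MONGO_HOST = 'localhost'
MONGO_PORT = 27017

# Redis assumed to serve as the Celery broker (the task queue nodes listen on)
BROKER_URL = 'redis://localhost:6379/0'
```

The actual setting names should be taken from the `config.py` shipped with the repository; this only illustrates the kind of API and database connection values the Configure section refers to.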