Merge branch 'develop'

This commit is contained in:
hantmac
2020-02-03 19:59:05 +08:00
559 changed files with 64227 additions and 7812 deletions

9
.gitattributes vendored Normal file

@@ -0,0 +1,9 @@
*.md linguist-language=Go
*.yml linguist-language=Go
*.html linguist-language=Go
*.js linguist-language=Go
*.xml linguist-language=Go
*.css linguist-language=Go
*.sql linguist-language=Go
*.uml linguist-language=Go
*.cmd linguist-language=Go

24
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file

@@ -0,0 +1,24 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: 'bug'
assignees: ''
---
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
**Expected behavior**
A clear and concise description of what you expected to happen.
**Screenshots**
If applicable, add screenshots to help explain your problem.

23
.github/ISSUE_TEMPLATE/bug_report_zh.md vendored Normal file

@@ -0,0 +1,23 @@
---
name: Bug 报告
about: 创建一份 Bug 报告帮助我们优化产品
title: ''
labels: 'bug'
assignees: ''
---
**Bug 描述**
例如 xxx xxx 功能不工作
**复现步骤**
Bug 复现步骤如下
1.
2.
3.
**期望结果**
xxx 能工作
**截屏**
![截屏1](http://static-docs.crawlab.cn/login.png)


@@ -0,0 +1,17 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: 'enhancement'
assignees: ''
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.


@@ -0,0 +1,17 @@
---
name: 功能需求
about: 优化和功能需求建议
title: ''
labels: 'enhancement'
assignees: ''
---
**请描述该需求尝试解决的问题**
例如 xxx 我总是被当前 xxx 的设计所困扰
**请描述您认为可行的解决方案**
例如添加 xxx 功能能够解决问题
**考虑过的替代方案**
例如：如果用 xxx，也能解决该问题

3
.gitignore vendored

@@ -121,4 +121,5 @@ _book/
.idea
*.lock
backend/spiders
backend/spiders
spiders/*.zip

190
CHANGELOG-zh.md Normal file

@@ -0,0 +1,190 @@
# 0.4.5 (unknown)
### 功能 / 优化
- **交互式教程**. 引导用户了解 Crawlab 的主要功能.
- **加入全局环境变量**. 可以设置全局环境变量然后传入到所有爬虫程序中. [#177](https://github.com/crawlab-team/crawlab/issues/177)
- **项目**. 允许用户将爬虫关联到项目上. [#316](https://github.com/crawlab-team/crawlab/issues/316)
- **示例爬虫**. 当初始化时自动加入示例爬虫. [#379](https://github.com/crawlab-team/crawlab/issues/379)
- **用户管理优化**. 限制管理用户的权限. [#456](https://github.com/crawlab-team/crawlab/issues/456)
- **设置页面优化**.
- **任务结果页面优化**.
### Bug 修复
- **无法找到爬虫文件错误**. [#485](https://github.com/crawlab-team/crawlab/issues/485)
- **点击删除按钮导致跳转**. [#480](https://github.com/crawlab-team/crawlab/issues/480)
- **无法在空爬虫里创建文件**. [#479](https://github.com/crawlab-team/crawlab/issues/479)
- **下载结果错误**. [#465](https://github.com/crawlab-team/crawlab/issues/465)
- **crawlab-sdk CLI 错误**. [#458](https://github.com/crawlab-team/crawlab/issues/458)
- **页面刷新问题**. [#441](https://github.com/crawlab-team/crawlab/issues/441)
- **结果不支持 JSON**. [#202](https://github.com/crawlab-team/crawlab/issues/202)
- **修复删除爬虫后获取所有爬虫错误**.
- **修复 i18n 警告**.
# 0.4.4 (2020-01-17)
### 功能 / 优化
- **邮件通知**. 允许用户发送邮件消息通知.
- **钉钉机器人通知**. 允许用户发送钉钉机器人消息通知.
- **企业微信机器人通知**. 允许用户发送企业微信机器人消息通知.
- **API 地址优化**. 在前端加入相对路径，因此用户不需要特别注明 `CRAWLAB_API_ADDRESS`.
- **SDK 兼容**. 允许用户通过 Crawlab SDK 集成 Scrapy 或通用爬虫.
- **优化文件管理**. 加入树状文件侧边栏，让用户更方便地编辑文件.
- **高级定时任务 Cron**. 允许用户通过 Cron 可视化编辑器编辑定时任务.
### Bug 修复
- **`nil returned` 错误**.
- **使用 HTTPS 出现的报错**.
- **无法在爬虫列表页运行可配置爬虫**.
- **上传爬虫文件缺少表单验证**.
# 0.4.3 (2020-01-07)
### 功能 / 优化
- **依赖安装**. 允许用户在平台 Web 界面安装/卸载依赖，以及添加编程语言（暂时只有 Node.js）.
- **Docker 中预装编程语言**. 允许 Docker 用户通过设置 `CRAWLAB_SERVER_LANG_NODE` 为 `Y` 来预装 `Node.js` 环境.
- **在爬虫详情页添加定时任务列表**. 允许用户在爬虫详情页查看、添加、编辑定时任务. [#360](https://github.com/crawlab-team/crawlab/issues/360)
- **Cron 表达式与 Linux 一致**. 将表达式从 6 元素改为 5 元素，与 Linux 一致.
- **启用/禁用定时任务**. 允许用户启用/禁用定时任务. [#297](https://github.com/crawlab-team/crawlab/issues/297)
- **优化任务管理**. 允许用户批量删除任务. [#341](https://github.com/crawlab-team/crawlab/issues/341)
- **优化爬虫管理**. 允许用户在爬虫列表页对爬虫进行筛选和排序.
- **添加中文版 `CHANGELOG`**.
- **在顶部添加 Github 加星按钮**.
### Bug 修复
- **定时任务问题**. [#423](https://github.com/crawlab-team/crawlab/issues/423)
- **上传爬虫zip文件问题**. [#403](https://github.com/crawlab-team/crawlab/issues/403) [#407](https://github.com/crawlab-team/crawlab/issues/407)
- **因为网络原因导致崩溃**. [#340](https://github.com/crawlab-team/crawlab/issues/340)
- **定时任务无法正常运行**
- **定时任务列表列错位问题**
- **刷新按钮跳转错误问题**
# 0.4.2 (2019-12-26)
### 功能 / 优化
- **免责声明**. 加入免责声明.
- **通过 API 获取版本号**. [#371](https://github.com/crawlab-team/crawlab/issues/371)
- **通过配置来允许用户注册**. [#346](https://github.com/crawlab-team/crawlab/issues/346)
- **允许添加新用户**.
- **更高级的文件管理**. 允许用户添加、编辑、重命名、删除代码文件. [#286](https://github.com/crawlab-team/crawlab/issues/286)
- **优化爬虫创建流程**. 允许用户在上传 zip 文件前创建空的自定义爬虫.
- **优化任务管理**. 允许用户通过选择条件过滤任务. [#341](https://github.com/crawlab-team/crawlab/issues/341)
### Bug 修复
- **重复节点**. [#391](https://github.com/crawlab-team/crawlab/issues/391)
- **"mongodb no reachable" 错误**. [#373](https://github.com/crawlab-team/crawlab/issues/373)
# 0.4.1 (2019-12-13)
### 功能 / 优化
- **Spiderfile 优化**. 将阶段由字典更换为数组. [#358](https://github.com/crawlab-team/crawlab/issues/358)
- **百度统计更新**.
### Bug 修复
- **无法展示定时任务**. [#353](https://github.com/crawlab-team/crawlab/issues/353)
- **重复节点注册**. [#334](https://github.com/crawlab-team/crawlab/issues/334)
# 0.4.0 (2019-12-06)
### 功能 / 优化
- **可配置爬虫**. 允许用户添加 `Spiderfile` 来配置抓取规则.
- **执行模式**. 允许用户选择 3 种任务执行模式: *所有节点*、*指定节点* 和 *随机*.
### Bug 修复
- **任务意外被杀死**. [#306](https://github.com/crawlab-team/crawlab/issues/306)
- **文档更正**. [#301](https://github.com/crawlab-team/crawlab/issues/258)
- **直接部署与 Windows 不兼容**. [#288](https://github.com/crawlab-team/crawlab/issues/288)
- **日志文件丢失**. [#269](https://github.com/crawlab-team/crawlab/issues/269)
# 0.3.5 (2019-10-28)
### 功能 / 优化
- **优雅关闭**. [详情](https://github.com/crawlab-team/crawlab/commit/63fab3917b5a29fd9770f9f51f1572b9f0420385)
- **节点信息优化**. [详情](https://github.com/crawlab-team/crawlab/commit/973251a0fbe7a2184ac0da09e0404a17c736aee7)
- **将系统环境变量添加到任务**. [详情](https://github.com/crawlab-team/crawlab/commit/4ab4892471965d6342d30385578ca60dc51f8ad3)
- **自动刷新任务日志**. [详情](https://github.com/crawlab-team/crawlab/commit/4ab4892471965d6342d30385578ca60dc51f8ad3)
- **允许 HTTPS 部署**. [详情](https://github.com/crawlab-team/crawlab/commit/5d8f6f0c56768a6e58f5e46cbf5adff8c7819228)
### Bug 修复
- **定时任务中无法获取爬虫列表**. [详情](https://github.com/crawlab-team/crawlab/commit/311f72da19094e3fa05ab4af49812f58843d8d93)
- **无法获取工作节点信息**. [详情](https://github.com/crawlab-team/crawlab/commit/6af06efc17685a9e232e8c2b5fd819ec7d2d1674)
- **运行爬虫任务时无法选择节点**. [详情](https://github.com/crawlab-team/crawlab/commit/31f8e03234426e97aed9b0bce6a50562f957edad)
- **结果量很大时无法获取结果数量**. [#260](https://github.com/crawlab-team/crawlab/issues/260)
- **定时任务中的节点问题**. [#244](https://github.com/crawlab-team/crawlab/issues/244)
# 0.3.1 (2019-08-25)
### 功能 / 优化
- **Docker 镜像优化**. Docker 镜像进一步分割成 alpine 镜像版本的 master、worker、frontend.
- **单元测试**. 用单元测试覆盖部分后端代码.
- **前端优化**. 登录页按钮大小、上传 UI 提示.
- **更灵活的节点注册**. 允许用户传一个变量作为注册 key，而不是默认的 MAC 地址.
### Bug 修复
- **上传大爬虫文件错误**. 上传大爬虫文件时的内存崩溃问题. [#150](https://github.com/crawlab-team/crawlab/issues/150)
- **无法同步爬虫**. 通过提高写权限等级来修复同步爬虫文件时的问题. [#114](https://github.com/crawlab-team/crawlab/issues/114)
- **爬虫页问题**. 通过删除 `Site` 字段来修复. [#112](https://github.com/crawlab-team/crawlab/issues/112)
- **节点展示问题**. 当在多个机器上跑 Docker 容器时节点无法正确展示. [#99](https://github.com/crawlab-team/crawlab/issues/99)
# 0.3.0 (2019-07-31)
### 功能 / 优化
- **Golang 后端**: 将后端由 Python 重构为 Golang很大的提高了稳定性和性能.
- **节点网络图**: 节点拓扑图可视化.
- **节点系统信息**: 可以查看包括操作系统、CPU数量、可执行文件在内的系统信息.
- **节点监控改进**: 节点通过 Redis 来监控和注册.
- **文件管理**: 可以在线编辑爬虫文件，包括代码高亮.
- **登录页/注册页/用户管理**: 要求用户登录后才能使用 Crawlab，允许用户注册和用户管理，有一些基于角色的鉴权机制.
- **自动部署爬虫**: 爬虫将被自动部署或同步到所有在线节点.
- **更小的 Docker 镜像**: 瘦身版 Docker 镜像，通过多阶段构建将 Docker 镜像大小从 1.3G 减小到 700M 左右.
### Bug 修复
- **节点状态**. 节点状态不会随着节点下线而更新. [#87](https://github.com/tikazyq/crawlab/issues/87)
- **爬虫部署错误**. 通过自动爬虫部署来修复 [#83](https://github.com/tikazyq/crawlab/issues/83)
- **节点无法显示**. 节点无法显示在线 [#81](https://github.com/tikazyq/crawlab/issues/81)
- **定时任务无法工作**. 通过 Golang 后端修复 [#64](https://github.com/tikazyq/crawlab/issues/64)
- **Flower 错误**. 通过 Golang 后端修复 [#57](https://github.com/tikazyq/crawlab/issues/57)
# 0.2.4 (2019-07-07)
### 功能 / 优化
- **文档**: 更优和更详细的文档.
- **更好的 Crontab**: 通过 UI 界面生成 Cron 表达式.
- **更优的性能**: 从原生 flask 引擎 切换到 `gunicorn`. [#78](https://github.com/tikazyq/crawlab/issues/78)
### Bug 修复
- **删除爬虫**. 删除爬虫时不止在数据库中删除，还应该删除相关的文件夹、任务和定时任务. [#69](https://github.com/tikazyq/crawlab/issues/69)
- **MongoDB 授权**. 允许用户注明 `authenticationDatabase` 来连接 `mongodb`. [#68](https://github.com/tikazyq/crawlab/issues/68)
- **Windows 兼容性**. 加入 `eventlet` 到 `requirements.txt`. [#59](https://github.com/tikazyq/crawlab/issues/59)
# 0.2.3 (2019-06-12)
### 功能 / 优化
- **Docker**: 用户能够运行 Docker 镜像来加快部署.
- **CLI**: 允许用户通过命令行来执行 Crawlab 程序.
- **上传爬虫**: 允许用户上传自定义爬虫到 Crawlab.
- **预览时编辑字段**: 允许用户在可配置爬虫中预览数据时编辑字段.
### Bug 修复
- **爬虫分页**. 爬虫列表页中修复分页问题.
# 0.2.2 (2019-05-30)
### 功能 / 优化
- **自动抓取字段**: 在可配置爬虫列表页中自动抓取字段.
- **下载结果**: 允许下载结果为 CSV 文件.
- **百度统计**: 允许用户选择是否允许向百度统计发送统计数据.
### Bug 修复
- **结果页分页**. [#45](https://github.com/tikazyq/crawlab/issues/45)
- **定时任务重复触发**: Flask DEBUG 设置为 False 来保证定时任务无法重复触发. [#32](https://github.com/tikazyq/crawlab/issues/32)
- **前端环境**: 添加 `VUE_APP_BASE_URL` 作为生产环境模式变量，这样 API 不会永远都是 `localhost`. [#30](https://github.com/tikazyq/crawlab/issues/30)
# 0.2.1 (2019-05-27)
- **可配置爬虫**: 允许用户创建爬虫来抓取数据而不用编写代码.
# 0.2 (2019-05-10)
- **高级数据统计**: 爬虫详情页的高级数据统计.
- **网站数据**: 加入网站列表，允许用户查看 robots.txt、首页响应时间等信息.
# 0.1.1 (2019-04-23)
- **基础统计**: 用户可以查看基础统计数据，包括爬虫和任务页中的失败任务数、结果数.
- **近实时任务信息**: 周期性（5 秒）向服务器轮询数据，来实现近实时查看任务信息.
- **定时任务**: 利用 apscheduler 实现定时任务，允许用户设置类似 Cron 的定时任务.
# 0.1 (2019-04-17)
- **首次发布**


@@ -1,3 +1,95 @@
# 0.4.5 (2020-02-03)
### Features / Enhancement
- **Interactive Tutorial**. Guide users through the main functionalities of Crawlab.
- **Global Environment Variables**. Allow users to set global environment variables, which will be passed into all spider programs. [#177](https://github.com/crawlab-team/crawlab/issues/177)
- **Project**. Allow users to link spiders to projects. [#316](https://github.com/crawlab-team/crawlab/issues/316)
- **Demo Spiders**. Added demo spiders when Crawlab is initialized. [#379](https://github.com/crawlab-team/crawlab/issues/379)
- **User Admin Optimization**. Restrict privileges of admin users. [#456](https://github.com/crawlab-team/crawlab/issues/456)
- **Setting Page Optimization**.
- **Task Results Optimization**.
### Bug Fixes
- **Unable to find spider file error**. [#485](https://github.com/crawlab-team/crawlab/issues/485)
- **Click delete button results in redirect**. [#480](https://github.com/crawlab-team/crawlab/issues/480)
- **Unable to create files in an empty spider**. [#479](https://github.com/crawlab-team/crawlab/issues/479)
- **Download results error**. [#465](https://github.com/crawlab-team/crawlab/issues/465)
- **crawlab-sdk CLI error**. [#458](https://github.com/crawlab-team/crawlab/issues/458)
- **Page refresh issue**. [#441](https://github.com/crawlab-team/crawlab/issues/441)
- **Results not support JSON**. [#202](https://github.com/crawlab-team/crawlab/issues/202)
- **Error when getting all spiders after deleting a spider**.
- **i18n warning**.
# 0.4.4 (2020-01-17)
### Features / Enhancement
- **Email Notification**. Allow users to send email notifications.
- **DingTalk Robot Notification**. Allow users to send DingTalk Robot notifications.
- **Wechat Robot Notification**. Allow users to send Wechat Robot notifications.
- **API Address Optimization**. Added relative URL path in frontend so that users don't have to specify `CRAWLAB_API_ADDRESS` explicitly.
- **SDK Compatibility**. Allow users to integrate Scrapy or general spiders with Crawlab SDK.
- **Enhanced File Management**. Added tree-like file sidebar to allow users to edit files much more easily.
- **Advanced Schedule Cron**. Allow users to edit schedule cron with visualized cron editor.
### Bug Fixes
- **`nil returned` error**.
- **Error when using HTTPS**.
- **Unable to run Configurable Spiders on Spider List**.
- **Missing form validation before uploading spider files**.
# 0.4.3 (2020-01-07)
### Features / Enhancement
- **Dependency Installation**. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) on the platform web interface.
- **Pre-install Programming Languages in Docker**. Allow Docker users to set `CRAWLAB_SERVER_LANG_NODE` as `Y` to pre-install `Node.js` environments.
- **Add Schedule List in Spider Detail Page**. Allow users to view / add / edit schedule cron jobs in the spider detail page. [#360](https://github.com/crawlab-team/crawlab/issues/360)
- **Align Cron Expression with Linux**. Changed the expression from 6 elements to 5 elements to align with Linux.
- **Enable/Disable Schedule Cron**. Allow users to enable/disable the schedule jobs. [#297](https://github.com/crawlab-team/crawlab/issues/297)
- **Better Task Management**. Allow users to batch delete tasks. [#341](https://github.com/crawlab-team/crawlab/issues/341)
- **Better Spider Management**. Allow users to sort and filter spiders in the spider list page.
- **Added Chinese `CHANGELOG`**.
- **Added Github Star Button at Nav Bar**.
### Bug Fixes
- **Schedule Cron Task Issue**. [#423](https://github.com/crawlab-team/crawlab/issues/423)
- **Upload Spider Zip File Issue**. [#403](https://github.com/crawlab-team/crawlab/issues/403) [#407](https://github.com/crawlab-team/crawlab/issues/407)
- **Exit due to Network Failure**. [#340](https://github.com/crawlab-team/crawlab/issues/340)
- **Cron Jobs not Running Correctly**
- **Schedule List Columns Mis-positioned**
- **Clicking Refresh Button Redirected to 404 Page**
# 0.4.2 (2019-12-26)
### Features / Enhancement
- **Disclaimer**. Added page for Disclaimer.
- **Call API to fetch version**. [#371](https://github.com/crawlab-team/crawlab/issues/371)
- **Configure to allow user registration**. [#346](https://github.com/crawlab-team/crawlab/issues/346)
- **Allow adding new users**.
- **More Advanced File Management**. Allow users to add / edit / rename / delete files. [#286](https://github.com/crawlab-team/crawlab/issues/286)
- **Optimized Spider Creation Process**. Allow users to create an empty customized spider before uploading the zip file.
- **Better Task Management**. Allow users to filter tasks by selecting certain criteria. [#341](https://github.com/crawlab-team/crawlab/issues/341)
### Bug Fixes
- **Duplicated nodes**. [#391](https://github.com/crawlab-team/crawlab/issues/391)
- **"mongodb no reachable" error**. [#373](https://github.com/crawlab-team/crawlab/issues/373)
# 0.4.1 (2019-12-13)
### Features / Enhancement
- **Spiderfile Optimization**. Stages changed from dictionary to array. [#358](https://github.com/crawlab-team/crawlab/issues/358)
- **Baidu Tongji Update**.
### Bug Fixes
- **Unable to display schedule tasks**. [#353](https://github.com/crawlab-team/crawlab/issues/353)
- **Duplicate node registration**. [#334](https://github.com/crawlab-team/crawlab/issues/334)
# 0.4.0 (2019-12-06)
### Features / Enhancement
- **Configurable Spider**. Allow users to add spiders using *Spiderfile* to configure crawling rules.
- **Execution Mode**. Allow users to select 3 modes for task execution: *All Nodes*, *Selected Nodes* and *Random*.
### Bug Fixes
- **Task accidentally killed**. [#306](https://github.com/crawlab-team/crawlab/issues/306)
- **Documentation fix**. [#301](https://github.com/crawlab-team/crawlab/issues/258)
- **Direct deploy incompatible with Windows**. [#288](https://github.com/crawlab-team/crawlab/issues/288)
- **Log files lost**. [#269](https://github.com/crawlab-team/crawlab/issues/269)
# 0.3.5 (2019-10-28)
### Features / Enhancement
- **Graceful Shutdown**. [detail](https://github.com/crawlab-team/crawlab/commit/63fab3917b5a29fd9770f9f51f1572b9f0420385)

12
DISCLAIMER-zh.md Normal file

@@ -0,0 +1,12 @@
# 免责声明
本免责及隐私保护声明（以下简称"免责声明"或"本声明"）适用于 Crawlab 开发组（以下简称"开发组"）研发的系列软件（以下简称"Crawlab"）。在您阅读本声明后，若不同意此声明中的任何条款，或对本声明存在质疑，请立刻停止使用我们的软件；若您已经开始或正在使用 Crawlab，则表示您已阅读并同意本声明的所有条款之约定。
1. 总则：您通过安装 Crawlab 并使用 Crawlab 提供的服务与功能，即表示您已经同意与开发组订立本协议。开发组可随时全权决定更改条款，经修订的条款一经在 Github 免责声明页面上公布后，立即自动生效。
2. 本产品是基于 Golang 的分布式爬虫管理平台，支持 Python、NodeJS、Go、Java、PHP 等多种编程语言以及多种爬虫框架。
3. 一切因使用 Crawlab 而引致之任何意外、疏忽、合约毁坏、诽谤、版权或知识产权侵犯及其所造成的损失（包括在非官方站点下载 Crawlab 而感染电脑病毒），Crawlab 开发组概不负责，亦不承担任何法律责任。
4. 用户对使用 Crawlab 自行承担风险，我们不做任何形式的保证。因网络状况、通讯线路等任何技术原因而导致用户不能正常升级更新，我们也不承担任何法律责任。
5. 用户使用 Crawlab 对目标网站进行抓取时，需遵从网络安全法等与爬虫相关的法律法规，切勿使用擅自采集公民个人信息、以 DDoS 等方式造成目标网站瘫痪、不遵从目标网站的 robots.txt 协议等非法手段。
6. Crawlab 尊重并保护所有用户的个人隐私权，不会窃取任何用户计算机中的信息。
7. 系统的版权：Crawlab 开发组对所有开发的或合作开发的产品拥有知识产权、著作权、版权和使用权，这些产品受到适用的知识产权、版权、商标、服务商标、专利或其他法律的保护。
8. 传播：任何公司或个人在网络上发布、传播我们软件的行为都是允许的，但因公司或个人传播软件可能造成的任何法律和刑事事件，Crawlab 开发组不负任何责任。

12
DISCLAIMER.md Normal file

@@ -0,0 +1,12 @@
# Disclaimer
This Disclaimer and privacy protection statement (hereinafter referred to as the "disclaimer statement" or "this statement") applies to the series of software (hereinafter referred to as "Crawlab") developed by the Crawlab development group (hereinafter referred to as the "development group"). After you read this statement, if you do not agree with any of its terms or have doubts about it, please stop using our software immediately. If you have started or are using Crawlab, you have read and agreed to all terms of this statement.
1. General: by installing Crawlab and using the services and functions provided by Crawlab, you have agreed to enter into this agreement with the development group. The development group may change the terms at any time at its sole discretion. The amended terms take effect automatically as soon as they are published on the GitHub disclaimer page.
2. This product is a distributed crawler management platform based on Golang, supporting Python, NodeJS, Go, Java, PHP and other programming languages as well as a variety of crawler frameworks.
3. The Crawlab development group shall not be responsible, and shall not bear any legal liability, for any accident, negligence, breach of contract, defamation, copyright or intellectual property infringement caused by the use of Crawlab, or for any loss caused by it (including computer virus infection caused by downloading Crawlab from an unofficial site).
4. Users bear the risk of using Crawlab themselves. We do not make any form of guarantee, and we will not bear any legal responsibility if users cannot upgrade or update normally due to technical reasons such as network conditions or communication lines.
5. When using Crawlab to crawl target websites, users must comply with crawler-related laws and regulations such as the Cybersecurity Law. Do not collect citizens' personal information without authorization, paralyze target websites by means such as DDoS, ignore the target website's robots.txt protocol, or use other illegal means.
6. Crawlab respects and protects the personal privacy of all users and will not steal any information from users' computers.
7. Copyright of the system: the Crawlab development group owns the intellectual property rights, authorship rights, copyrights and usage rights for all developed or jointly developed products, which are protected by applicable intellectual property, copyright, trademark, service mark, patent or other laws.
8. Distribution: any company or individual is allowed to publish or disseminate our software on the Internet, but the Crawlab development group shall not be responsible for any legal or criminal consequences that may be caused by such distribution.


@@ -15,34 +15,34 @@ WORKDIR /app
# install frontend
RUN npm config set unsafe-perm true
RUN npm install -g yarn && yarn install --registry=https://registry.npm.taobao.org
RUN npm install -g yarn && yarn install
RUN npm run build:prod
# images
FROM ubuntu:latest
ADD . /app
# set as non-interactive
ENV DEBIAN_FRONTEND noninteractive
# set CRAWLAB_IS_DOCKER
ENV CRAWLAB_IS_DOCKER Y
# install packages
RUN apt-get update \
&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate python3 python3-pip \
&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate python3 python3-pip nginx \
&& ln -s /usr/bin/pip3 /usr/local/bin/pip \
&& ln -s /usr/bin/python3 /usr/local/bin/python
# install backend
RUN pip install scrapy pymongo bs4 requests -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install scrapy pymongo bs4 requests crawlab-sdk scrapy-splash
# add files
ADD . /app
# copy backend files
COPY --from=backend-build /go/src/app .
COPY --from=backend-build /go/bin/crawlab /usr/local/bin
# install nginx
RUN apt-get -y install nginx
# copy frontend files
COPY --from=frontend-build /app/dist /app/dist
COPY --from=frontend-build /app/conf/crawlab.conf /etc/nginx/conf.d
@@ -57,4 +57,4 @@ EXPOSE 8080
EXPOSE 8000
# start backend
CMD ["/bin/sh", "/app/docker_init.sh"]
CMD ["/bin/bash", "/app/docker_init.sh"]


@@ -4,44 +4,43 @@ WORKDIR /go/src/app
COPY ./backend .
ENV GO111MODULE on
ENV GOPROXY https://mirrors.aliyun.com/goproxy/
ENV GOPROXY https://goproxy.io
RUN go install -v ./...
FROM node:8.16.0 AS frontend-build
FROM node:8.16.0-alpine AS frontend-build
ADD ./frontend /app
WORKDIR /app
# install frontend
RUN npm install -g yarn && yarn install --registry=https://registry.npm.taobao.org
RUN npm config set unsafe-perm true
RUN npm install -g yarn && yarn install --registry=https://registry.npm.taobao.org # --sass_binary_site=https://npm.taobao.org/mirrors/node-sass/
RUN npm run build:prod
# images
FROM ubuntu:latest
ADD . /app
# set as non-interactive
ENV DEBIAN_FRONTEND noninteractive
# install packages
RUN apt-get update \
&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate python3 python3-pip \
RUN chmod 777 /tmp \
&& apt-get update \
&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate python3 python3-pip nginx \
&& ln -s /usr/bin/pip3 /usr/local/bin/pip \
&& ln -s /usr/bin/python3 /usr/local/bin/python
# install backend
RUN pip install scrapy pymongo bs4 requests -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install scrapy pymongo bs4 requests crawlab-sdk scrapy-splash -i https://pypi.tuna.tsinghua.edu.cn/simple
# add files
ADD . /app
# copy backend files
COPY --from=backend-build /go/src/app .
COPY --from=backend-build /go/bin/crawlab /usr/local/bin
# install nginx
RUN apt-get -y install nginx
# copy frontend files
COPY --from=frontend-build /app/dist /app/dist
COPY --from=frontend-build /app/conf/crawlab.conf /etc/nginx/conf.d
@@ -56,4 +55,4 @@ EXPOSE 8080
EXPOSE 8000
# start backend
CMD ["/bin/sh", "/app/docker_init.sh"]
CMD ["/bin/bash", "/app/docker_init.sh"]


@@ -1,39 +1,68 @@
# Crawlab
![](http://114.67.75.98:8082/buildStatus/icon?job=crawlab%2Fmaster)
![](https://img.shields.io/github/release/crawlab-team/crawlab.svg)
![](https://img.shields.io/github/last-commit/crawlab-team/crawlab.svg)
![](https://img.shields.io/github/issues/crawlab-team/crawlab.svg)
![](https://img.shields.io/github/contributors/crawlab-team/crawlab.svg)
![](https://img.shields.io/docker/pulls/tikazyq/crawlab)
![](https://img.shields.io/github/license/crawlab-team/crawlab.svg)
<p>
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<img src="https://img.shields.io/docker/cloud/build/tikazyq/crawlab.svg?label=build&logo=docker">
</a>
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
</a>
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
<img src="https://img.shields.io/github/release/crawlab-team/crawlab.svg?logo=github">
</a>
<a href="https://github.com/crawlab-team/crawlab/commits/master" target="_blank">
<img src="https://img.shields.io/github/last-commit/crawlab-team/crawlab.svg">
</a>
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Abug" target="_blank">
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/bug.svg?label=bugs&color=red">
</a>
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement" target="_blank">
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/enhancement.svg?label=enhancements&color=cyan">
</a>
<a href="https://github.com/crawlab-team/crawlab/blob/master/LICENSE" target="_blank">
<img src="https://img.shields.io/github/license/crawlab-team/crawlab.svg">
</a>
</p>
中文 | [English](https://github.com/crawlab-team/crawlab)
[安装](#安装) | [运行](#运行) | [截图](#截图) | [架构](#架构) | [集成](#与其他框架的集成) | [比较](#与其他框架比较) | [相关文章](#相关文章) | [社区&赞助](#社区--赞助)
[安装](#安装) | [运行](#运行) | [截图](#截图) | [架构](#架构) | [集成](#与其他框架的集成) | [比较](#与其他框架比较) | [相关文章](#相关文章) | [社区&赞助](#社区--赞助) | [更新日志](https://github.com/crawlab-team/crawlab/blob/master/CHANGELOG-zh.md) | [免责声明](https://github.com/crawlab-team/crawlab/blob/master/DISCLAIMER-zh.md)
基于Golang的分布式爬虫管理平台，支持Python、NodeJS、Go、Java、PHP等多种编程语言以及多种爬虫框架。
[查看演示 Demo](http://crawlab.cn/demo) | [文档](https://tikazyq.github.io/crawlab-docs)
[查看演示 Demo](http://crawlab.cn/demo) | [文档](http://docs.crawlab.cn)
## 安装
三种方式:
1. [Docker](https://tikazyq.github.io/crawlab-docs/Installation/Docker.html)(推荐)
2. [直接部署](https://tikazyq.github.io/crawlab-docs/Installation/Direct.html)(了解内核)
3. [Kubernetes](https://mp.weixin.qq.com/s/3Q1BQATUIEE_WXcHPqhYbA)
1. [Docker](http://docs.crawlab.cn/Installation/Docker.html)(推荐)
2. [直接部署](http://docs.crawlab.cn/Installation/Direct.html)(了解内核)
3. [Kubernetes](https://juejin.im/post/5e0a02d851882549884c27ad) (多节点部署)
### 要求（Docker）
- Docker 18.03+
- Redis
- Redis 5.x+
- MongoDB 3.6+
- Docker Compose 1.24+ (可选但推荐)
### 要求（直接部署）
- Go 1.12+
- Node 8.12+
- Redis
- Redis 5.x+
- MongoDB 3.6+
## 快速开始
请打开命令行并执行下列命令。请保证您已经提前安装了 `docker-compose`。
```bash
git clone https://github.com/crawlab-team/crawlab
cd crawlab
docker-compose up -d
```
接下来您可以看看 `docker-compose.yml` (包含详细配置参数)以及参考 [文档](http://docs.crawlab.cn) 来查看更多信息。
## 运行
### Docker
@@ -47,13 +76,11 @@ services:
image: tikazyq/crawlab:latest
container_name: master
environment:
CRAWLAB_API_ADDRESS: "http://localhost:8000"
CRAWLAB_SERVER_MASTER: "Y"
CRAWLAB_MONGO_HOST: "mongo"
CRAWLAB_REDIS_ADDRESS: "redis"
ports:
- "8080:8080" # frontend
- "8000:8000" # backend
- "8080:8080"
depends_on:
- mongo
- redis
@@ -111,9 +138,9 @@ Docker部署的详情请见[相关文档](https://tikazyq.github.io/crawlab-d
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-analytics.png)
#### 爬虫文件
#### 爬虫文件编辑
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-file.png)
![](http://static-docs.crawlab.cn/file-edit.png)
#### 任务详情 - 抓取结果
@@ -121,13 +148,21 @@ Docker部署的详情请见[相关文档](https://tikazyq.github.io/crawlab-d
#### 定时任务
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/schedule.png)
![](http://static-docs.crawlab.cn/schedule-v0.4.4.png)
#### 依赖安装
![](http://static-docs.crawlab.cn/node-install-dependencies.png)
#### 消息通知
<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">
## 架构
Crawlab的架构包括了一个主节点（Master Node）和多个工作节点（Worker Node），以及负责通信和数据储存的Redis和MongoDB数据库。
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/architecture.png)
![](http://static-docs.crawlab.cn/architecture.png)
前端应用向主节点请求数据，主节点通过MongoDB和Redis来执行任务派发调度以及部署。工作节点收到任务之后，开始执行爬虫任务，并将任务结果储存到MongoDB。架构相对于`v0.3.0`之前的Celery版本有所精简，去除了不必要的节点监控模块Flower，节点监控主要由Redis完成。
@@ -162,37 +197,43 @@ Redis是非常受欢迎的Key-Value数据库在Crawlab中主要实现节点
## 与其他框架的集成
[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) 提供了一些 `helper` 方法来让您的爬虫更好的集成到 Crawlab 中,例如保存结果数据到 Crawlab 中等等。
### 集成 Scrapy
`settings.py` 中找到 `ITEM_PIPELINES``dict` 类型的变量在其中添加如下内容
```python
ITEM_PIPELINES = {
'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
```
然后启动 Scrapy 爬虫。运行完成之后，您就应该能看到抓取结果出现在 **任务详情-结果** 里。
### 通用 Python 爬虫
将下列代码加入到您爬虫中的结果保存部分
```python
# 引入保存结果方法
from crawlab import save_item
# 这是一个结果,需要为 dict 类型
result = {'name': 'crawlab'}
# 调用保存结果方法
save_item(result)
```
然后启动爬虫。运行完成之后，您就应该能看到抓取结果出现在 **任务详情-结果** 里。
### 其他框架和语言
爬虫任务本质上是由一个shell命令来实现的。任务ID将以环境变量`CRAWLAB_TASK_ID`的形式存在于爬虫任务运行的进程中，并以此来关联抓取数据。另外，`CRAWLAB_COLLECTION`是Crawlab传过来的所存放结果的collection的名称。
在爬虫程序中，需要将`CRAWLAB_TASK_ID`的值以`task_id`为键，保存到数据库中`CRAWLAB_COLLECTION`对应的collection中。这样，Crawlab就知道如何将爬虫任务与抓取数据关联起来了。当前，Crawlab只支持MongoDB。
### 集成Scrapy
以下是Crawlab跟Scrapy集成的例子，利用了Crawlab传过来的task_id和collection_name。
```python
import os
from pymongo import MongoClient
MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'
# scrapy example in the pipeline
class JuejinPipeline(object):
mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
db = mongo[MONGO_DB]
col_name = os.environ.get('CRAWLAB_COLLECTION')
if not col_name:
col_name = 'test'
col = db[col_name]
def process_item(self, item, spider):
item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
self.col.save(item)
return item
```
## 与其他框架比较
现在已经有一些爬虫管理框架了，因此为啥还要用Crawlab？
@@ -201,13 +242,12 @@ class JuejinPipeline(object):
Crawlab使用起来很方便，也很通用，可以适用于几乎任何主流语言和框架。它还有一个精美的前端界面，让用户可以方便地管理和运行爬虫。
|框架 | 类型 | 分布式 | 前端 | 依赖于Scrapyd |
|:---:|:---:|:---:|:---:|:---:|
| [Crawlab](https://github.com/crawlab-team/crawlab) | 管理平台 | Y | Y | N
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | 管理平台 | Y | Y | Y
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | 管理平台 | Y | Y | Y
| [Gerapy](https://github.com/Gerapy/Gerapy) | 管理平台 | Y | Y | Y
| [Scrapyd](https://github.com/scrapy/scrapyd) | 网络服务 | Y | N | N/A
|框架 | 技术 | 优点 | 缺点 | Github 统计数据 |
|:---|:---|:---|-----| :---- |
| [Crawlab](https://github.com/crawlab-team/crawlab) | Golang + Vue|不局限于 scrapy可以运行任何语言和框架的爬虫精美的 UI 界面,天然支持分布式爬虫,支持节点管理、爬虫管理、任务管理、定时任务、结果导出、数据统计、消息通知、可配置爬虫、在线编辑代码等功能|暂时不支持爬虫版本管理| ![](https://img.shields.io/github/stars/crawlab-team/crawlab) ![](https://img.shields.io/github/forks/crawlab-team/crawlab) |
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | Python Flask + Vue|精美的 UI 界面,内置了 scrapy 日志解析器,有较多任务运行统计图表,支持节点管理、定时任务、邮件提醒、移动界面,算是 scrapy-based 中功能完善的爬虫管理平台|不支持 scrapy 以外的爬虫Python Flask 为后端,性能上有一定局限性| ![](https://img.shields.io/github/stars/my8100/scrapydweb) ![](https://img.shields.io/github/forks/my8100/scrapydweb) |
| [Gerapy](https://github.com/Gerapy/Gerapy) | Python Django + Vue|Gerapy 是崔庆才大神开发的爬虫管理平台,安装部署非常简单,同样基于 scrapyd有精美的 UI 界面,支持节点管理、代码编辑、可配置规则等功能|同样不支持 scrapy 以外的爬虫而且据使用者反馈1.0 版本有很多 bug期待 2.0 版本会有一定程度的改进| ![](https://img.shields.io/github/stars/Gerapy/Gerapy) ![](https://img.shields.io/github/forks/Gerapy/Gerapy) |
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | Python Flask|基于 scrapyd开源版 Scrapyhub非常简洁的 UI 界面,支持定时任务|可能有些过于简洁了,不支持分页,不支持节点管理,不支持 scrapy 以外的爬虫| ![](https://img.shields.io/github/stars/DormyMo/SpiderKeeper) ![](https://img.shields.io/github/forks/DormyMo/SpiderKeeper) |
## Q&A
@@ -254,6 +294,9 @@ Crawlab使用起来很方便也很通用可以适用于几乎任何主流
<a href="https://github.com/hantmac">
<img src="https://avatars2.githubusercontent.com/u/7600925?s=460&v=4" height="80">
</a>
<a href="https://github.com/duanbin0414">
<img src="https://avatars3.githubusercontent.com/u/50389867?s=460&v=4" height="80">
</a>
## 社区 & 赞助

145
README.md

@@ -1,39 +1,68 @@
# Crawlab
![](http://114.67.75.98:8082/buildStatus/icon?job=crawlab%2Fmaster)
![](https://img.shields.io/github/release/crawlab-team/crawlab.svg)
![](https://img.shields.io/github/last-commit/crawlab-team/crawlab.svg)
![](https://img.shields.io/github/issues/crawlab-team/crawlab.svg)
![](https://img.shields.io/github/contributors/crawlab-team/crawlab.svg)
![](https://img.shields.io/docker/pulls/tikazyq/crawlab)
![](https://img.shields.io/github/license/crawlab-team/crawlab.svg)
<p>
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
<img src="https://img.shields.io/docker/cloud/build/tikazyq/crawlab.svg?label=build&logo=docker">
</a>
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
</a>
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
<img src="https://img.shields.io/github/release/crawlab-team/crawlab.svg?logo=github">
</a>
<a href="https://github.com/crawlab-team/crawlab/commits/master" target="_blank">
<img src="https://img.shields.io/github/last-commit/crawlab-team/crawlab.svg">
</a>
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Abug" target="_blank">
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/bug.svg?label=bugs&color=red">
</a>
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement" target="_blank">
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/enhancement.svg?label=enhancements&color=cyan">
</a>
<a href="https://github.com/crawlab-team/crawlab/blob/master/LICENSE" target="_blank">
<img src="https://img.shields.io/github/license/crawlab-team/crawlab.svg">
</a>
</p>
[中文](https://github.com/crawlab-team/crawlab/blob/master/README-zh.md) | English
[Installation](#installation) | [Run](#run) | [Screenshot](#screenshot) | [Architecture](#architecture) | [Integration](#integration-with-other-frameworks) | [Compare](#comparison-with-other-frameworks) | [Community & Sponsorship](#community--sponsorship)
[Installation](#installation) | [Run](#run) | [Screenshot](#screenshot) | [Architecture](#architecture) | [Integration](#integration-with-other-frameworks) | [Compare](#comparison-with-other-frameworks) | [Community & Sponsorship](#community--sponsorship) | [CHANGELOG](https://github.com/crawlab-team/crawlab/blob/master/CHANGELOG.md) | [Disclaimer](https://github.com/crawlab-team/crawlab/blob/master/DISCLAIMER.md)
Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, PHP and various web crawler frameworks including Scrapy, Puppeteer, Selenium.
[Demo](http://crawlab.cn/demo) | [Documentation](https://tikazyq.github.io/crawlab-docs)
[Demo](http://crawlab.cn/demo) | [Documentation](http://docs.crawlab.cn)
## Installation
Three methods:
1. [Docker](https://tikazyq.github.io/crawlab-docs/Installation/Docker.html) (Recommended)
2. [Direct Deploy](https://tikazyq.github.io/crawlab-docs/Installation/Direct.html) (Check Internal Kernel)
3. [Kubernetes](https://mp.weixin.qq.com/s/3Q1BQATUIEE_WXcHPqhYbA)
1. [Docker](http://docs.crawlab.cn/Installation/Docker.html) (Recommended)
2. [Direct Deploy](http://docs.crawlab.cn/Installation/Direct.html) (Check Internal Kernel)
3. [Kubernetes](https://juejin.im/post/5e0a02d851882549884c27ad) (Multi-Node Deployment)
### Pre-requisite (Docker)
- Docker 18.03+
- Redis
- Redis 5.x+
- MongoDB 3.6+
- Docker Compose 1.24+ (optional but recommended)
### Pre-requisite (Direct Deploy)
- Go 1.12+
- Node 8.12+
- Redis
- Redis 5.x+
- MongoDB 3.6+
## Quick Start
Please open the command line prompt and execute the command below. Make sure you have installed `docker-compose` in advance.
```bash
git clone https://github.com/crawlab-team/crawlab
cd crawlab
docker-compose up -d
```
Next, you can look into the `docker-compose.yml` (with detailed config params) and the [Documentation (Chinese)](http://docs.crawlab.cn) for further information.
## Run
### Docker
@@ -48,13 +77,11 @@ services:
image: tikazyq/crawlab:latest
container_name: master
environment:
CRAWLAB_API_ADDRESS: "http://localhost:8000"
CRAWLAB_SERVER_MASTER: "Y"
CRAWLAB_MONGO_HOST: "mongo"
CRAWLAB_REDIS_ADDRESS: "redis"
ports:
- "8080:8080" # frontend
- "8000:8000" # backend
- "8080:8080"
depends_on:
- mongo
- redis
@@ -109,9 +136,9 @@ For Docker Deployment details, please refer to [relevant documentation](https://
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-analytics.png)
#### Spider Files
#### Spider File Edit
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/spider-file.png)
![](http://static-docs.crawlab.cn/file-edit.png)
#### Task Results
@@ -119,13 +146,21 @@ For Docker Deployment details, please refer to [relevant documentation](https://
#### Cron Job
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/schedule.png)
![](http://static-docs.crawlab.cn/schedule-v0.4.4.png)
#### Dependency Installation
![](http://static-docs.crawlab.cn/node-install-dependencies.png)
#### Notifications
<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">
## Architecture
The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and Redis and MongoDB databases, which are mainly used for node communication and data storage.
![](https://raw.githubusercontent.com/tikazyq/crawlab-docs/master/images/architecture.png)
![](http://static-docs.crawlab.cn/architecture.png)
The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it begins to execute the crawling task and stores the results in MongoDB. The architecture is much more concise compared with versions before `v0.3.0`: the unnecessary Flower module, which offered node monitoring services, has been removed, and node monitoring is now done by Redis.
@@ -161,35 +196,43 @@ Frontend is a SPA based on
## Integration with Other Frameworks
A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides some `helper` methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.
Note: make sure you have already installed `crawlab-sdk` using pip.
### Scrapy
Below is an example to integrate Crawlab with Scrapy in pipelines.
In `settings.py` in your Scrapy project, find the variable named `ITEM_PIPELINES` (a `dict` variable). Add content below.
```python
import os
from pymongo import MongoClient
MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'
# scrapy example in the pipeline
class JuejinPipeline(object):
mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
db = mongo[MONGO_DB]
col_name = os.environ.get('CRAWLAB_COLLECTION')
if not col_name:
col_name = 'test'
col = db[col_name]
def process_item(self, item, spider):
item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
self.col.save(item)
return item
ITEM_PIPELINES = {
'crawlab.pipelines.CrawlabMongoPipeline': 888,
}
```
Then, start the Scrapy spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**
### General Python Spider
Please add below content to your spider files to save results.
```python
# import result saving method
from crawlab import save_item
# this is a result record, must be dict type
result = {'name': 'crawlab'}
# call result saving method
save_item(result)
```
Then, start the spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**
### Other Frameworks / Languages
A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
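As an illustration, here is a minimal sketch (not an official Crawlab snippet) of how a plain Python script could use these two variables. It assumes a MongoDB instance reachable at `localhost:27017` and an arbitrary database name, mirroring the pipeline example shown in the Chinese README:

```python
import os

from pymongo import MongoClient

# Assumptions for this sketch: MongoDB reachable at localhost:27017
# and an arbitrary database name; adjust both to your deployment.
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

# Values injected by Crawlab into the task process.
task_id = os.environ.get('CRAWLAB_TASK_ID')
col_name = os.environ.get('CRAWLAB_COLLECTION') or 'results'

client = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
col = client[MONGO_DB][col_name]

# ... run your own crawling logic here, then tag each record with the
# task ID so Crawlab can relate the scraped data back to this task.
result = {'name': 'crawlab'}
result['task_id'] = task_id
col.insert_one(result)
```

The same idea applies to any other language: read the two environment variables, write results into the named collection, and include the task ID on every record.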
## Comparison with Other Frameworks
There are existing spider management frameworks. So why use Crawlab?
@@ -198,13 +241,12 @@ The reason is that most of the existing platforms are depending on Scrapyd, whic
Crawlab is easy to use and general enough to work with spiders in any language and any framework. It also has a beautiful frontend interface that lets users manage spiders much more easily.
|Framework | Type | Distributed | Frontend | Scrapyd-Dependent |
|:---:|:---:|:---:|:---:|:---:|
| [Crawlab](https://github.com/crawlab-team/crawlab) | Admin Platform | Y | Y | N
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | Admin Platform | Y | Y | Y
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | Admin Platform | Y | Y | Y
| [Gerapy](https://github.com/Gerapy/Gerapy) | Admin Platform | Y | Y | Y
| [Scrapyd](https://github.com/scrapy/scrapyd) | Web Service | Y | N | N/A
|Framework | Technology | Pros | Cons | Github Stats |
|:---|:---|:---|-----| :---- |
| [Crawlab](https://github.com/crawlab-team/crawlab) | Golang + Vue|Not limited to Scrapy, available for all programming languages and frameworks. Beautiful UI interface. Naturally support distributed spiders. Support spider management, task management, cron job, result export, analytics, notifications, configurable spiders, online code editor, etc.|Not yet support spider versioning| ![](https://img.shields.io/github/stars/crawlab-team/crawlab) ![](https://img.shields.io/github/forks/crawlab-team/crawlab) |
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | Python Flask + Vue|Beautiful UI interface, built-in Scrapy log parser, stats and graphs for task execution, support node management, cron job, mail notification, mobile. Full-feature spider management platform.|Not support spiders other than Scrapy. Limited performance because of Python Flask backend.| ![](https://img.shields.io/github/stars/my8100/scrapydweb) ![](https://img.shields.io/github/forks/my8100/scrapydweb) |
| [Gerapy](https://github.com/Gerapy/Gerapy) | Python Django + Vue|Gerapy is built by web crawler guru [Germey Cui](https://github.com/Germey). Simple installation and deployment. Beautiful UI interface. Support node management, code edit, configurable crawl rules, etc.|Again not support spiders other than Scrapy. A lot of bugs based on user feedback in v1.0. Look forward to improvement in v2.0| ![](https://img.shields.io/github/stars/Gerapy/Gerapy) ![](https://img.shields.io/github/forks/Gerapy/Gerapy) |
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | Python Flask|Open-source Scrapyhub. Concise and simple UI interface. Support cron job.|Perhaps too simplified, not support pagination, not support node management, not support spiders other than Scrapy.| ![](https://img.shields.io/github/stars/DormyMo/SpiderKeeper) ![](https://img.shields.io/github/forks/DormyMo/SpiderKeeper) |
## Contributors
<a href="https://github.com/tikazyq">
@@ -219,6 +261,9 @@ Crawlab is easy to use, general enough to adapt spiders in any language and any
<a href="https://github.com/hantmac">
<img src="https://avatars2.githubusercontent.com/u/7600925?s=460&v=4" height="80">
</a>
<a href="https://github.com/duanbin0414">
<img src="https://avatars3.githubusercontent.com/u/50389867?s=460&v=4" height="80">
</a>
## Community & Sponsorship


@@ -15,20 +15,35 @@ redis:
log:
level: info
path: "/var/logs/crawlab"
isDeletePeriodically: "Y"
isDeletePeriodically: "N"
deleteFrequency: "@hourly"
server:
host: 0.0.0.0
port: 8000
master: "N"
master: "Y"
secret: "crawlab"
register:
# mac地址 或者 ip地址如果是ip则需要手动指定IP
type: "mac"
ip: ""
lang: # 安装语言环境, Y 为安装，N 为不安装，只对 Docker 有效
python: "Y"
node: "N"
spider:
path: "/app/spiders"
task:
workers: 4
other:
tmppath: "/tmp"
version: 0.4.5
setting:
allowRegister: "N"
notification:
mail:
server: ''
port: ''
senderEmail: ''
senderIdentity: ''
smtp:
user: ''
password: ''


@@ -28,7 +28,7 @@ func (c *Config) Init() error {
}
viper.SetConfigType("yaml") // 设置配置文件格式为YAML
viper.AutomaticEnv() // 读取匹配的环境变量
viper.SetEnvPrefix("CRAWLAB") // 读取环境变量的前缀为APISERVER
viper.SetEnvPrefix("CRAWLAB") // 读取环境变量的前缀为CRAWLAB
replacer := strings.NewReplacer(".", "_")
viper.SetEnvKeyReplacer(replacer)
if err := viper.ReadInConfig(); err != nil { // viper解析配置文件


@@ -0,0 +1,8 @@
package constants
const (
AnchorStartStage = "START_STAGE"
AnchorStartUrl = "START_URL"
AnchorItems = "ITEMS"
AnchorParsers = "PARSERS"
)


@@ -0,0 +1,6 @@
package constants
const (
ASCENDING = "ascending"
DESCENDING = "descending"
)


@@ -0,0 +1,6 @@
package constants
const (
EngineScrapy = "scrapy"
EngineColly = "colly"
)


@@ -0,0 +1,13 @@
package constants
const (
NotificationTriggerOnTaskEnd = "notification_trigger_on_task_end"
NotificationTriggerOnTaskError = "notification_trigger_on_task_error"
NotificationTriggerNever = "notification_trigger_never"
)
const (
NotificationTypeMail = "notification_type_mail"
NotificationTypeDingTalk = "notification_type_ding_talk"
NotificationTypeWechat = "notification_type_wechat"
)

9
backend/constants/rpc.go Normal file

@@ -0,0 +1,9 @@
package constants
const (
RpcInstallLang = "install_lang"
RpcInstallDep = "install_dep"
RpcUninstallDep = "uninstall_dep"
RpcGetDepList = "get_dep_list"
RpcGetInstalledDepList = "get_installed_dep_list"
)


@@ -0,0 +1,10 @@
package constants
const (
ScheduleStatusStop = "stopped"
ScheduleStatusRunning = "running"
ScheduleStatusError = "error"
ScheduleStatusErrorNotFoundNode = "Not Found Node"
ScheduleStatusErrorNotFoundSpider = "Not Found Spider"
)


@@ -0,0 +1,5 @@
package constants
const ScrapyProtectedStageNames = ""
const ScrapyProtectedFieldNames = "_id,task_id,ts"


@@ -3,4 +3,5 @@ package constants
const (
Customized = "customized"
Configurable = "configurable"
Plugin = "plugin"
)


@@ -5,3 +5,9 @@ const (
Linux = "linux"
Darwin = "darwin"
)
const (
Python = "python"
Nodejs = "node"
Java = "java"
)


@@ -19,3 +19,9 @@ const (
TaskFinish string = "finish"
TaskCancel string = "cancel"
)
const (
RunTypeAllNodes string = "all-nodes"
RunTypeRandom string = "random"
RunTypeSelectedNodes string = "selected-nodes"
)


@@ -61,11 +61,46 @@ func InitMongo() error {
dialInfo.Password = mongoPassword
dialInfo.Source = mongoAuth
}
sess, err := mgo.DialWithInfo(&dialInfo)
if err != nil {
return err
// mongo session
var sess *mgo.Session
// 错误次数
errNum := 0
// 重复尝试连接mongo
for {
var err error
// 连接mongo
sess, err = mgo.DialWithInfo(&dialInfo)
if err != nil {
// 如果连接错误休息1秒错误次数+1
time.Sleep(1 * time.Second)
errNum++
// 如果错误次数超过30返回错误
if errNum >= 30 {
return err
}
} else {
// 如果没有错误,退出循环
break
}
}
// 赋值给全局mongo session
Session = sess
}
//Add Unique index for 'key'
keyIndex := mgo.Index{
Key: []string{"key"},
Unique: true,
}
s, c := GetCol("nodes")
defer s.Close()
c.EnsureIndex(keyIndex)
return nil
}


@@ -58,9 +58,9 @@ func (r *Redis) subscribe(ctx context.Context, consume ConsumeFunc, channel ...s
}
done <- nil
case <-tick.C:
//fmt.Printf("ping message \n")
if err := psc.Ping(""); err != nil {
done <- err
fmt.Printf("ping message error: %s \n", err)
//done <- err
}
case err := <-done:
close(done)


@@ -4,10 +4,12 @@ import (
"context"
"crawlab/entity"
"crawlab/utils"
"errors"
"github.com/apex/log"
"github.com/gomodule/redigo/redis"
"github.com/spf13/viper"
"runtime/debug"
"strings"
"time"
)
@@ -17,14 +19,36 @@ type Redis struct {
pool *redis.Pool
}
type Mutex struct {
Name string
expiry time.Duration
tries int
delay time.Duration
value string
}
func NewRedisClient() *Redis {
return &Redis{pool: NewRedisPool()}
}
func (r *Redis) RPush(collection string, value interface{}) error {
c := r.pool.Get()
defer utils.Close(c)
if _, err := c.Do("RPUSH", collection, value); err != nil {
log.Error(err.Error())
debug.PrintStack()
return err
}
return nil
}
func (r *Redis) LPush(collection string, value interface{}) error {
c := r.pool.Get()
defer utils.Close(c)
if _, err := c.Do("LPUSH", collection, value); err != nil { // LPush should issue LPUSH, not RPUSH
log.Error(err.Error())
debug.PrintStack()
return err
}
@@ -47,6 +71,7 @@ func (r *Redis) HSet(collection string, key string, value string) error {
defer utils.Close(c)
if _, err := c.Do("HSET", collection, key, value); err != nil {
log.Error(err.Error())
debug.PrintStack()
return err
}
@@ -58,7 +83,9 @@ func (r *Redis) HGet(collection string, key string) (string, error) {
defer utils.Close(c)
value, err2 := redis.String(c.Do("HGET", collection, key))
if err2 != nil {
if err2 != nil && err2 != redis.ErrNil {
log.Error(err2.Error())
debug.PrintStack()
return value, err2
}
return value, nil
@@ -69,6 +96,8 @@ func (r *Redis) HDel(collection string, key string) error {
defer utils.Close(c)
if _, err := c.Do("HDEL", collection, key); err != nil {
log.Error(err.Error())
debug.PrintStack()
return err
}
return nil
@@ -80,11 +109,27 @@ func (r *Redis) HKeys(collection string) ([]string, error) {
value, err2 := redis.Strings(c.Do("HKeys", collection))
if err2 != nil {
log.Error(err2.Error())
debug.PrintStack()
return []string{}, err2
}
return value, nil
}
func (r *Redis) BRPop(collection string, timeout int) (string, error) {
if timeout <= 0 {
timeout = 60
}
c := r.pool.Get()
defer utils.Close(c)
values, err := redis.Strings(c.Do("BRPOP", collection, timeout))
if err != nil {
return "", err
}
return values[1], nil
}
func NewRedisPool() *redis.Pool {
var address = viper.GetString("redis.address")
var port = viper.GetString("redis.port")
@@ -101,7 +146,7 @@ func NewRedisPool() *redis.Pool {
Dial: func() (conn redis.Conn, e error) {
return redis.DialURL(url,
redis.DialConnectTimeout(time.Second*10),
redis.DialReadTimeout(time.Second*10),
redis.DialReadTimeout(time.Second*600),
redis.DialWriteTimeout(time.Second*10),
)
},
@@ -143,3 +188,59 @@ func Sub(channel string, consume ConsumeFunc) error {
}
return nil
}
// 构建同步锁key
func (r *Redis) getLockKey(lockKey string) string {
lockKey = strings.ReplaceAll(lockKey, ":", "-")
return "nodes:lock:" + lockKey
}
// 获得锁
func (r *Redis) Lock(lockKey string) (int64, error) {
c := r.pool.Get()
defer utils.Close(c)
lockKey = r.getLockKey(lockKey)
ts := time.Now().Unix()
ok, err := c.Do("SET", lockKey, ts, "NX", "PX", 30000)
if err != nil {
log.Errorf("get lock fail with error: %s", err.Error())
debug.PrintStack()
return 0, err
}
if err == nil && ok == nil {
log.Errorf("the lockKey is locked: key=%s", lockKey)
return 0, errors.New("the lockKey is locked")
}
return ts, nil
}
func (r *Redis) UnLock(lockKey string, value int64) {
c := r.pool.Get()
defer utils.Close(c)
lockKey = r.getLockKey(lockKey)
getValue, err := redis.Int64(c.Do("GET", lockKey))
if err != nil {
log.Errorf("get lockKey error: %s", err.Error())
debug.PrintStack()
return
}
if getValue != value {
log.Errorf("the lockKey value diff: %d, %d", value, getValue)
return
}
v, err := redis.Int64(c.Do("DEL", lockKey))
if err != nil {
log.Errorf("unlock failed, error: %s", err.Error())
debug.PrintStack()
return
}
if v == 0 {
log.Errorf("unlock failed: key=%s", lockKey)
return
}
}


@@ -3,15 +3,15 @@ package entity
import "strconv"
type Page struct {
Skip int
Limit int
PageNum int
Skip int
Limit int
PageNum int
PageSize int
}
func (p *Page)GetPage(pageNum string, pageSize string) {
func (p *Page) GetPage(pageNum string, pageSize string) {
p.PageNum, _ = strconv.Atoi(pageNum)
p.PageSize, _ = strconv.Atoi(pageSize)
p.Skip = p.PageSize * (p.PageNum - 1)
p.Limit = p.PageSize
}
}


@@ -0,0 +1,40 @@
package entity
type ConfigSpiderData struct {
// 通用
Name string `yaml:"name" json:"name"`
DisplayName string `yaml:"display_name" json:"display_name"`
Col string `yaml:"col" json:"col"`
Remark string `yaml:"remark" json:"remark"`
Type string `yaml:"type" bson:"type"`
// 可配置爬虫
Engine string `yaml:"engine" json:"engine"`
StartUrl string `yaml:"start_url" json:"start_url"`
StartStage string `yaml:"start_stage" json:"start_stage"`
Stages []Stage `yaml:"stages" json:"stages"`
Settings map[string]string `yaml:"settings" json:"settings"`
// 自定义爬虫
Cmd string `yaml:"cmd" json:"cmd"`
}
type Stage struct {
Name string `yaml:"name" json:"name"`
IsList bool `yaml:"is_list" json:"is_list"`
ListCss string `yaml:"list_css" json:"list_css"`
ListXpath string `yaml:"list_xpath" json:"list_xpath"`
PageCss string `yaml:"page_css" json:"page_css"`
PageXpath string `yaml:"page_xpath" json:"page_xpath"`
PageAttr string `yaml:"page_attr" json:"page_attr"`
Fields []Field `yaml:"fields" json:"fields"`
}
type Field struct {
Name string `yaml:"name" json:"name"`
Css string `yaml:"css" json:"css"`
Xpath string `yaml:"xpath" json:"xpath"`
Attr string `yaml:"attr" json:"attr"`
NextStage string `yaml:"next_stage" json:"next_stage"`
Remark string `yaml:"remark" json:"remark"`
}


@@ -13,3 +13,18 @@ type Executable struct {
FileName string `json:"file_name"`
DisplayName string `json:"display_name"`
}
type Lang struct {
Name string `json:"name"`
ExecutableName string `json:"executable_name"`
ExecutablePath string `json:"executable_path"`
DepExecutablePath string `json:"dep_executable_path"`
Installed bool `json:"installed"`
}
type Dependency struct {
Name string `json:"name"`
Version string `json:"version"`
Description string `json:"description"`
Installed bool `json:"installed"`
}


@@ -11,10 +11,18 @@ require (
github.com/go-playground/locales v0.12.1 // indirect
github.com/go-playground/universal-translator v0.16.0 // indirect
github.com/gomodule/redigo v2.0.0+incompatible
github.com/imroc/req v0.2.4
github.com/leodido/go-urn v1.1.0 // indirect
github.com/matcornic/hermes v1.2.0
github.com/matcornic/hermes/v2 v2.0.2 // indirect
github.com/pkg/errors v0.8.1
github.com/royeo/dingrobot v1.0.0
github.com/satori/go.uuid v1.2.0
github.com/smartystreets/goconvey v0.0.0-20190731233626-505e41936337
github.com/spf13/viper v1.4.0
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc // indirect
gopkg.in/go-playground/validator.v9 v9.29.1
gopkg.in/gomail.v2 v2.0.0-20150902115704-41f357289737
gopkg.in/russross/blackfriday.v2 v2.0.0 // indirect
gopkg.in/yaml.v2 v2.2.2
)


@@ -1,9 +1,15 @@
cloud.google.com/go v0.26.0/go.mod h1:aQUYkXzVsufM+DwF1aE+0xfcU+56JwCaLick0ClmMTw=
github.com/BurntSushi/toml v0.3.1 h1:WXkYYl6Yr3qBf1K79EBnL4mak0OimBfB0XUf9Vl28OQ=
github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU=
github.com/Masterminds/semver v1.4.2 h1:WBLTQ37jOCzSLtXNdoo8bNM8876KhNqOKvrlGITgsTc=
github.com/Masterminds/semver v1.4.2/go.mod h1:MB6lktGJrhw8PrUyiEoblNEGEQ+RzHPF078ddwwvV3Y=
github.com/Masterminds/sprig v2.16.0+incompatible h1:QZbMUPxRQ50EKAq3LFMnxddMu88/EUUG3qmxwtDmPsY=
github.com/Masterminds/sprig v2.16.0+incompatible/go.mod h1:y6hNFY5UBTIWBxnzTeuNhlNS5hqE0NB0E6fgfo2Br3o=
github.com/OneOfOne/xxhash v1.2.2/go.mod h1:HSdplMjZKSmBqAxg5vPj2TmRDmfkzw+cTzAElWljhcU=
github.com/alecthomas/template v0.0.0-20160405071501-a0175ee3bccc/go.mod h1:LOuyumcjzFXgccqObfd/Ljyb9UuFJ6TxHnclSeseNhc=
github.com/alecthomas/units v0.0.0-20151022065526-2efee857e7cf/go.mod h1:ybxpYRFXyAe+OPACYpWeL0wqObRcbAqCMya13uyzqw0=
github.com/aokoli/goutils v1.0.1 h1:7fpzNGoJ3VA8qcrm++XEE1QUe0mIwNeLa02Nwq7RDkg=
github.com/aokoli/goutils v1.0.1/go.mod h1:SijmP0QR8LtwsmDs8Yii5Z/S4trXFGFC2oO5g9DP+DQ=
github.com/apex/log v1.1.1 h1:BwhRZ0qbjYtTob0I+2M+smavV0kOC8XgcnGZcyL9liA=
github.com/apex/log v1.1.1/go.mod h1:Ls949n1HFtXfbDcjiTTFQqkVUrte0puoIBfO3SVgwOA=
github.com/aphistic/golf v0.0.0-20180712155816-02c07f170c5a/go.mod h1:3NqKYiepwy8kCu4PNA+aP7WUV72eXWJeP9/r3/K9aLE=
@@ -56,6 +62,8 @@ github.com/gomodule/redigo v2.0.0+incompatible h1:K/R+8tc58AaqLkqG2Ol3Qk+DR/TlNu
github.com/gomodule/redigo v2.0.0+incompatible/go.mod h1:B4C85qUVwatsJoIUNIfCRsp7qO0iAmpGFZ4EELWSbC4=
github.com/google/btree v1.0.0/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
github.com/google/go-cmp v0.2.0/go.mod h1:oXzfMopK8JAjlY9xF4vHSVASa0yLyX7SntLO5aqRK0M=
github.com/google/uuid v1.0.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/google/uuid v1.1.1 h1:Gkbcsh/GbpXz7lPftLA3P6TYMwjCLYm83jiFQZF/3gY=
github.com/google/uuid v1.1.1/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/gopherjs/gopherjs v0.0.0-20181017120253-0766667cb4d1 h1:EGx4pi6eqNxGaHF6qqu48+N2wcFQ5qg5FXgOdqsJ5d8=
github.com/gopherjs/gopherjs v0.0.0-20181017120253-0766667cb4d1/go.mod h1:wJfORRmW1u3UXTncJ5qlYoELFm8eSnnEO6hX4iZ3EWY=
@@ -66,6 +74,14 @@ github.com/grpc-ecosystem/grpc-gateway v1.9.0/go.mod h1:vNeuVxBJEsws4ogUvrchl83t
github.com/hashicorp/hcl v1.0.0 h1:0Anlzjpi4vEasTeNFn2mLJgTSwt0+6sfsiTG8qcWGx4=
github.com/hashicorp/hcl v1.0.0/go.mod h1:E5yfLk+7swimpb2L/Alb/PJmXilQ/rhwaUYs4T20WEQ=
github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU=
github.com/huandu/xstrings v1.2.0 h1:yPeWdRnmynF7p+lLYz0H2tthW9lqhMJrQV/U7yy4wX0=
github.com/huandu/xstrings v1.2.0/go.mod h1:DvyZB1rfVYsBIigL8HwpZgxHwXozlTgGqn63UyNX5k4=
github.com/imdario/mergo v0.3.6 h1:xTNEAn+kxVO7dTZGu0CegyqKZmoWFI0rF8UxjlB2d28=
github.com/imdario/mergo v0.3.6/go.mod h1:2EnlNZ0deacrJVfApfmtdGgDfMuh/nq6Ok1EcJh5FfA=
github.com/imroc/req v0.2.4 h1:8XbvaQpERLAJV6as/cB186DtH5f0m5zAOtHEaTQ4ac0=
github.com/imroc/req v0.2.4/go.mod h1:J9FsaNHDTIVyW/b5r6/Df5qKEEEq2WzZKIgKSajd1AE=
github.com/jaytaylor/html2text v0.0.0-20180606194806-57d518f124b0 h1:xqgexXAGQgY3HAjNPSaCqn5Aahbo5TKsmhp8VRfr1iQ=
github.com/jaytaylor/html2text v0.0.0-20180606194806-57d518f124b0/go.mod h1:CVKlgaMiht+LXvHG173ujK6JUhZXKb2u/BQtjPDIvyk=
github.com/jmespath/go-jmespath v0.0.0-20180206201540-c2b33e8439af/go.mod h1:Nht3zPeWKUH0NzdCt2Blrr5ys8VGpn0CEB0cQHVjt7k=
github.com/jonboulle/clockwork v0.1.0/go.mod h1:Ii8DK3G1RaLaWxj9trq07+26W01tbo22gdxWY5EU2bo=
github.com/jpillora/backoff v0.0.0-20180909062703-3050d21c67d7/go.mod h1:2iMrUgbbvHEiQClaW2NsSzMyGHqN+rDFqY705q49KG0=
@@ -87,12 +103,17 @@ github.com/leodido/go-urn v1.1.0 h1:Sm1gr51B1kKyfD2BlRcLSiEkffoG96g6TPv6eRoEiB8=
github.com/leodido/go-urn v1.1.0/go.mod h1:+cyI34gQWZcE1eQU7NVgKkkzdXDQHr1dBMtdAPozLkw=
github.com/magiconair/properties v1.8.0 h1:LLgXmsheXeRoUOBOjtwPQCWIYqM/LU1ayDtDePerRcY=
github.com/magiconair/properties v1.8.0/go.mod h1:PppfXfuXeibc/6YijjN8zIbojt8czPbwD3XqdrwzmxQ=
github.com/matcornic/hermes v1.2.0 h1:AuqZpYcTOtTB7cahdevLfnhIpfzmpqw5Czv8vpdnFDU=
github.com/matcornic/hermes v1.2.0/go.mod h1:lujJomb016Xjv8wBnWlNvUdtmvowjjfkqri5J/+1hYc=
github.com/matcornic/hermes/v2 v2.0.2/go.mod h1:iVsJWSIS4NtMNtgan22sy6lt7pImok7bATGPWCoaKNY=
github.com/mattn/go-colorable v0.1.1/go.mod h1:FuOcm+DKB9mbwrcAfNl7/TZVBZ6rcnceauSikq3lYCQ=
github.com/mattn/go-colorable v0.1.2/go.mod h1:U0ppj6V5qS13XJ6of8GYAs25YV2eR4EVcfRqFIhoBtE=
github.com/mattn/go-isatty v0.0.5/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
github.com/mattn/go-isatty v0.0.7/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
github.com/mattn/go-isatty v0.0.8 h1:HLtExJ+uU2HOZ+wI0Tt5DtUDrx8yhUqDcp7fYERX4CE=
github.com/mattn/go-isatty v0.0.8/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
github.com/mattn/go-runewidth v0.0.3 h1:a+kO+98RDGEfo6asOGMmpodZq4FNtnGP54yps8BzLR4=
github.com/mattn/go-runewidth v0.0.3/go.mod h1:LwmH8dsx7+W8Uxz3IHJYH5QSwggIsqBzpuz5H//U1FU=
github.com/matttproud/golang_protobuf_extensions v1.0.1/go.mod h1:D8He9yQNgCq6Z5Ld7szi9bcBfOoFv/3dc6xSMkL2PC0=
github.com/mgutz/ansi v0.0.0-20170206155736-9520e82c474b/go.mod h1:01TrycV0kFyexm33Z7vhZRXopbI8J3TDReVlkTgMUxE=
github.com/mitchellh/mapstructure v1.1.2 h1:fmNYVwqnSfB9mZU6OS2O6GsXM+wcskZDuKQzvN1EDeE=
@@ -103,6 +124,8 @@ github.com/modern-go/reflect2 v1.0.1 h1:9f412s+6RmYXLWZSEzVVgPGK7C2PphHj5RJrvfx9
github.com/modern-go/reflect2 v1.0.1/go.mod h1:bx2lNnkwVCuqBIxFjflWJWanXIb3RllmbCylyMrvgv0=
github.com/mwitkow/go-conntrack v0.0.0-20161129095857-cc309e4a2223/go.mod h1:qRWi+5nqEBWmkhHvq77mSJWrCKwh8bxhgT7d/eI7P4U=
github.com/oklog/ulid v1.3.1/go.mod h1:CirwcVhetQ6Lv90oh/F+FBtV6XMibvdAFo93nm5qn4U=
github.com/olekukonko/tablewriter v0.0.1 h1:b3iUnf1v+ppJiOfNX4yxxqfWKMQPZR5yoh8urCTFX88=
github.com/olekukonko/tablewriter v0.0.1/go.mod h1:vsDQFd/mU46D+Z4whnwzcISnGGzXWMclvtLoiIKAKIo=
github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
github.com/onsi/gomega v1.5.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY=
github.com/pelletier/go-toml v1.2.0 h1:T5zMGML61Wp+FlcbWjRDT7yAxhJNAiPPLOFECq181zc=
@@ -123,9 +146,14 @@ github.com/prometheus/procfs v0.0.0-20190507164030-5867b95ac084/go.mod h1:TjEm7z
github.com/prometheus/tsdb v0.7.1/go.mod h1:qhTCs0VvXwvX/y3TZrWD7rabWM+ijKTux40TwIPHuXU=
github.com/rogpeppe/fastuuid v0.0.0-20150106093220-6724a57986af/go.mod h1:XWv6SoW27p1b0cqNHllgS5HIMJraePCO15w5zCzIWYg=
github.com/rogpeppe/fastuuid v1.1.0/go.mod h1:jVj6XXZzXRy/MSR5jhDC/2q6DgLz+nrA6LYCDYWNEvQ=
github.com/royeo/dingrobot v1.0.0 h1:K4GrF+fOecNX0yi+oBKpfh7z0XP/8TzaIIHu1B2kKUQ=
github.com/royeo/dingrobot v1.0.0/go.mod h1:RqDM8E/hySCVwI2aUFRJAUGDcHHRnIhzNmbNG3bamQs=
github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
github.com/satori/go.uuid v1.2.0 h1:0uYX9dsZ2yD7q2RtLRtPSdGDWzjeM3TbMJP9utgA0ww=
github.com/satori/go.uuid v1.2.0/go.mod h1:dA0hQrYB0VpLJoorglMZABFdXlWrHn1NEOzdhQKdks0=
github.com/sergi/go-diff v1.0.0/go.mod h1:0CfEIISq7TuYL3j771MWULgwwjU+GofnZX9QAmXWZgo=
github.com/shurcooL/sanitized_anchor_name v1.0.0 h1:PdmoCO6wvbs+7yrJyMORt4/BmY5IYyJwS/kOiWx8mHo=
github.com/shurcooL/sanitized_anchor_name v1.0.0/go.mod h1:1NzhyTcUVG4SuEtjjoZeVRXNmyL/1OwPU0+IJeTBvfc=
github.com/sirupsen/logrus v1.2.0/go.mod h1:LxeOpSwHxABJmUn/MG1IvRgCAasNZTLOkJPxbbu5VWo=
github.com/smartystreets/assertions v0.0.0-20180927180507-b2de0cb4f26d/go.mod h1:OnSkiWE9lh6wB0YB77sQom3nweQdgAjqCqsofrRNTgc=
github.com/smartystreets/assertions v1.0.0 h1:UVQPSSmc3qtTi+zPPkCXvZX9VvW/xT/NsRvKfwY81a8=
@@ -146,6 +174,8 @@ github.com/spf13/pflag v1.0.3 h1:zPAT6CGy6wXeQ7NtTnaTerfKOsV6V6F8agHXFiazDkg=
github.com/spf13/pflag v1.0.3/go.mod h1:DYY7MBk1bdzusC3SYhjObp+wFpr4gzcvqqNjLnInEg4=
github.com/spf13/viper v1.4.0 h1:yXHLWeravcrgGyFSyCgdYpXQ9dR9c/WED3pg1RhxqEU=
github.com/spf13/viper v1.4.0/go.mod h1:PTJ7Z/lr49W6bUbkmS1V3by4uWynFiR9p7+dSq/yZzE=
github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf h1:pvbZ0lM0XWPBqUKqFU8cmavspvIl9nulOYwdy6IFRRo=
github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf/go.mod h1:RJID2RhlZKId02nZ62WenDCkgHFerpIOmW0iT7GKmXM=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.1.1/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
@@ -165,12 +195,15 @@ go.uber.org/atomic v1.4.0/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
go.uber.org/multierr v1.1.0/go.mod h1:wR5kodmAFQ0UK8QlbwjlSNy0Z68gJhDJUG5sjR94q/0=
go.uber.org/zap v1.10.0/go.mod h1:vwi/ZaCAaUcBkycHslxD9B2zi4UTXhF60s6SWpuDF0Q=
golang.org/x/crypto v0.0.0-20180904163835-0709b304e793/go.mod h1:6SG95UA2DQfeDnfUPMdvaQW0Q7yPrPDi9nlGo2tz2b4=
golang.org/x/crypto v0.0.0-20181029175232-7e6ffbd03851/go.mod h1:6SG95UA2DQfeDnfUPMdvaQW0Q7yPrPDi9nlGo2tz2b4=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/crypto v0.0.0-20190426145343-a29dc8fdc734 h1:p/H982KKEjUnLJkM3tt/LemDnOc1GiZL5FCVlORJ5zo=
golang.org/x/crypto v0.0.0-20190426145343-a29dc8fdc734/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/lint v0.0.0-20181026193005-c67002cb31c3/go.mod h1:UVdnD1Gm6xHRNCYTkRU2/jEulfH38KcIWyp/GAMgvoE=
golang.org/x/lint v0.0.0-20190313153728-d0100b6bd8b3/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc=
golang.org/x/net v0.0.0-20180826012351-8a410e7b638d/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20180906233101-161cd47e91fd/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20181029044818-c44066c5c816/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20181114220301-adae6a3d119a/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20181220203305-927f97764cc3/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
@@ -204,6 +237,8 @@ google.golang.org/genproto v0.0.0-20180817151627-c66870c02cf8/go.mod h1:JiN7NxoA
google.golang.org/grpc v1.19.0/go.mod h1:mqu4LbDTu4XGKhr4mRzUsmM4RtVoemTSY81AxZiDr8c=
google.golang.org/grpc v1.21.0/go.mod h1:oYelfM1adQP15Ek0mdvEgi9Df8B9CZIaU1084ijfRaM=
gopkg.in/alecthomas/kingpin.v2 v2.2.6/go.mod h1:FMv+mEhP44yOT+4EoQTLFTRgOQ1FBLkstjWtayDeSgw=
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc h1:2gGKlE2+asNV9m7xrywl36YYNnBG5ZQ0r/BOOxqPpmk=
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc/go.mod h1:m7x9LTH6d71AHyAX77c9yqWCCa3UKHcVEj9y7hAtKDk=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127 h1:qIbj1fsPNlZgppZ+VLlY7N33q108Sa+fhmuc+sWQYwY=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
@@ -214,7 +249,11 @@ gopkg.in/go-playground/validator.v8 v8.18.2 h1:lFB4DoMU6B626w8ny76MV7VX6W2VHct2G
gopkg.in/go-playground/validator.v8 v8.18.2/go.mod h1:RX2a/7Ha8BgOhfk7j780h4/u/RRjR0eouCJSH80/M2Y=
gopkg.in/go-playground/validator.v9 v9.29.1 h1:SvGtYmN60a5CVKTOzMSyfzWDeZRxRuGvRQyEAKbw1xc=
gopkg.in/go-playground/validator.v9 v9.29.1/go.mod h1:+c9/zcJMFNgbLvly1L1V+PpxWdVbfP1avr/N00E2vyQ=
gopkg.in/gomail.v2 v2.0.0-20150902115704-41f357289737 h1:NvePS/smRcFQ4bMtTddFtknbGCtoBkJxGmpSpVRafCc=
gopkg.in/gomail.v2 v2.0.0-20150902115704-41f357289737/go.mod h1:LRQQ+SO6ZHR7tOkpBDuZnXENFzX8qRjMDMyPD6BRkCw=
gopkg.in/resty.v1 v1.12.0/go.mod h1:mDo4pnntr5jdWRML875a/NmxYqAlA73dVijT2AXvQQo=
gopkg.in/russross/blackfriday.v2 v2.0.0 h1:+FlnIV8DSQnT7NZ43hcVKcdJdzZoeCmJj4Ql8gq5keA=
gopkg.in/russross/blackfriday.v2 v2.0.0/go.mod h1:6sSBNz/GtOm/pJTuh5UmBK2ZHfmnxGbl2NZg1UliSOI=
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw=
gopkg.in/yaml.v2 v2.0.0-20170812160011-eb3733d160e7/go.mod h1:JAlM8MvJe8wmxCU4Bli9HhUf9+ttbYbLASfIpnQbh74=
gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=

View File

@@ -31,22 +31,23 @@ func main() {
log.Error("init config error:" + err.Error())
panic(err)
}
log.Info("初始化配置成功")
log.Info("initialized config successfully")
// initialize log settings
logLevel := viper.GetString("log.level")
if logLevel != "" {
log.SetLevelFromString(logLevel)
}
log.Info("初始化日志设置成功")
log.Info("initialized log config successfully")
if viper.GetString("log.isDeletePeriodically") == "Y" {
err := services.InitDeleteLogPeriodically()
if err != nil {
log.Error("Init DeletePeriodically Failed")
log.Error("init DeletePeriodically failed")
panic(err)
}
log.Info("初始化定期清理日志配置成功")
log.Info("initialized periodically cleaning log successfully")
} else {
log.Info("periodically cleaning log is switched off")
}
// initialize MongoDB
@@ -55,7 +56,7 @@ func main() {
debug.PrintStack()
panic(err)
}
log.Info("初始化Mongodb数据库成功")
log.Info("initialized MongoDB successfully")
// initialize Redis
if err := database.InitRedis(); err != nil {
@@ -63,7 +64,7 @@ func main() {
debug.PrintStack()
panic(err)
}
log.Info("初始化Redis数据库成功")
log.Info("initialized Redis successfully")
if model.IsMaster() {
// initialize schedules (cron)
@@ -72,7 +73,23 @@ func main() {
debug.PrintStack()
panic(err)
}
log.Info("初始化定时任务成功")
log.Info("initialized schedule successfully")
// initialize user service
if err := services.InitUserService(); err != nil {
log.Error("init user service error:" + err.Error())
debug.PrintStack()
panic(err)
}
log.Info("initialized user service successfully")
// initialize dependency fetcher
if err := services.InitDepsFetcher(); err != nil {
log.Error("init dependency fetcher error:" + err.Error())
debug.PrintStack()
panic(err)
}
log.Info("initialized dependency fetcher successfully")
}
// initialize task executor
@@ -81,14 +98,14 @@ func main() {
debug.PrintStack()
panic(err)
}
log.Info("初始化任务执行器成功")
log.Info("initialized task executor successfully")
// initialize node service
if err := services.InitNodeService(); err != nil {
log.Error("init node service error:" + err.Error())
panic(err)
}
log.Info("初始化节点配置成功")
log.Info("initialized node service successfully")
// initialize spider service
if err := services.InitSpiderService(); err != nil {
@@ -96,73 +113,133 @@ func main() {
debug.PrintStack()
panic(err)
}
log.Info("初始化爬虫服务成功")
log.Info("initialized spider service successfully")
// 初始化用户服务
if err := services.InitUserService(); err != nil {
log.Error("init user service error:" + err.Error())
// initialize RPC service
if err := services.InitRpcService(); err != nil {
log.Error("init rpc service error:" + err.Error())
debug.PrintStack()
panic(err)
}
log.Info("初始化用户服务成功")
log.Info("initialized rpc service successfully")
// the services below run on the master node only
if model.IsMaster() {
// middleware
app.Use(middlewares.CORSMiddleware())
//app.Use(middlewares.AuthorizationMiddleware())
anonymousGroup := app.Group("/")
{
anonymousGroup.POST("/login", routes.Login) // 用户登录
anonymousGroup.PUT("/users", routes.PutUser) // 添加用户
anonymousGroup.POST("/login", routes.Login) // 用户登录
anonymousGroup.PUT("/users", routes.PutUser) // 添加用户
anonymousGroup.GET("/setting", routes.GetSetting) // 获取配置信息
// release version
anonymousGroup.GET("/version", routes.GetVersion) // 获取发布的版本
}
authGroup := app.Group("/", middlewares.AuthorizationMiddleware())
{
// routes
// nodes
authGroup.GET("/nodes", routes.GetNodeList) // 节点列表
authGroup.GET("/nodes/:id", routes.GetNode) // 节点详情
authGroup.POST("/nodes/:id", routes.PostNode) // 修改节点
authGroup.GET("/nodes/:id/tasks", routes.GetNodeTaskList) // 节点任务列表
authGroup.GET("/nodes/:id/system", routes.GetSystemInfo) // 节点任务列表
authGroup.DELETE("/nodes/:id", routes.DeleteNode) // 删除节点
{
authGroup.GET("/nodes", routes.GetNodeList) // 节点列表
authGroup.GET("/nodes/:id", routes.GetNode) // 节点详情
authGroup.POST("/nodes/:id", routes.PostNode) // 修改节点
authGroup.GET("/nodes/:id/tasks", routes.GetNodeTaskList) // 节点任务列表
authGroup.GET("/nodes/:id/system", routes.GetSystemInfo) // 节点任务列表
authGroup.DELETE("/nodes/:id", routes.DeleteNode) // 删除节点
authGroup.GET("/nodes/:id/langs", routes.GetLangList) // 节点语言环境列表
authGroup.GET("/nodes/:id/deps", routes.GetDepList) // 节点第三方依赖列表
authGroup.GET("/nodes/:id/deps/installed", routes.GetInstalledDepList) // 节点已安装第三方依赖列表
authGroup.POST("/nodes/:id/deps/install", routes.InstallDep) // 节点安装依赖
authGroup.POST("/nodes/:id/deps/uninstall", routes.UninstallDep) // 节点卸载依赖
authGroup.POST("/nodes/:id/langs/install", routes.InstallLang) // 节点安装语言
}
// spiders
authGroup.GET("/spiders", routes.GetSpiderList) // 爬虫列表
authGroup.GET("/spiders/:id", routes.GetSpider) // 爬虫详情
authGroup.POST("/spiders", routes.PutSpider) // 上传爬虫
authGroup.POST("/spiders/:id", routes.PostSpider) // 修改爬虫
authGroup.POST("/spiders/:id/publish", routes.PublishSpider) // 发布爬虫
authGroup.DELETE("/spiders/:id", routes.DeleteSpider) // 删除爬虫
authGroup.GET("/spiders/:id/tasks", routes.GetSpiderTasks) // 爬虫任务列表
authGroup.GET("/spiders/:id/file", routes.GetSpiderFile) // 爬虫文件读取
authGroup.POST("/spiders/:id/file", routes.PostSpiderFile) // 爬虫目录写入
authGroup.GET("/spiders/:id/dir", routes.GetSpiderDir) // 爬虫目录
authGroup.GET("/spiders/:id/stats", routes.GetSpiderStats) // 爬虫统计数据
authGroup.GET("/spider/types", routes.GetSpiderTypes) // 爬虫类型
{
authGroup.GET("/spiders", routes.GetSpiderList) // 爬虫列表
authGroup.GET("/spiders/:id", routes.GetSpider) // 爬虫详情
authGroup.PUT("/spiders", routes.PutSpider) // 添加爬虫
authGroup.POST("/spiders", routes.UploadSpider) // 上传爬虫
authGroup.POST("/spiders/:id", routes.PostSpider) // 修改爬虫
authGroup.POST("/spiders/:id/publish", routes.PublishSpider) // 发布爬虫
authGroup.POST("/spiders/:id/upload", routes.UploadSpiderFromId) // 上传爬虫ID
authGroup.DELETE("/spiders/:id", routes.DeleteSpider) // 删除爬虫
authGroup.GET("/spiders/:id/tasks", routes.GetSpiderTasks) // 爬虫任务列表
authGroup.GET("/spiders/:id/file/tree", routes.GetSpiderFileTree) // 爬虫文件目录树读取
authGroup.GET("/spiders/:id/file", routes.GetSpiderFile) // 爬虫文件读取
authGroup.POST("/spiders/:id/file", routes.PostSpiderFile) // 爬虫文件更改
authGroup.PUT("/spiders/:id/file", routes.PutSpiderFile) // 爬虫文件创建
authGroup.PUT("/spiders/:id/dir", routes.PutSpiderDir) // 爬虫目录创建
authGroup.DELETE("/spiders/:id/file", routes.DeleteSpiderFile) // 爬虫文件删除
authGroup.POST("/spiders/:id/file/rename", routes.RenameSpiderFile) // 爬虫文件重命名
authGroup.GET("/spiders/:id/dir", routes.GetSpiderDir) // 爬虫目录
authGroup.GET("/spiders/:id/stats", routes.GetSpiderStats) // 爬虫统计数据
authGroup.GET("/spiders/:id/schedules", routes.GetSpiderSchedules) // 爬虫定时任务
}
// configurable spiders
{
authGroup.GET("/config_spiders/:id/config", routes.GetConfigSpiderConfig) // 获取可配置爬虫配置
authGroup.POST("/config_spiders/:id/config", routes.PostConfigSpiderConfig) // 更改可配置爬虫配置
authGroup.PUT("/config_spiders", routes.PutConfigSpider) // 添加可配置爬虫
authGroup.POST("/config_spiders/:id", routes.PostConfigSpider) // 修改可配置爬虫
authGroup.POST("/config_spiders/:id/upload", routes.UploadConfigSpider) // 上传可配置爬虫
authGroup.POST("/config_spiders/:id/spiderfile", routes.PostConfigSpiderSpiderfile) // 上传可配置爬虫
authGroup.GET("/config_spiders_templates", routes.GetConfigSpiderTemplateList) // 获取可配置爬虫模版列表
}
// tasks
authGroup.GET("/tasks", routes.GetTaskList) // 任务列表
authGroup.GET("/tasks/:id", routes.GetTask) // 任务详情
authGroup.PUT("/tasks", routes.PutTask) // 派发任务
authGroup.DELETE("/tasks/:id", routes.DeleteTask) // 删除任务
authGroup.POST("/tasks/:id/cancel", routes.CancelTask) // 取消任务
authGroup.GET("/tasks/:id/log", routes.GetTaskLog) // 任务日志
authGroup.GET("/tasks/:id/results", routes.GetTaskResults) // 任务结果
authGroup.GET("/tasks/:id/results/download", routes.DownloadTaskResultsCsv) // 下载任务结果
{
authGroup.GET("/tasks", routes.GetTaskList) // 任务列表
authGroup.GET("/tasks/:id", routes.GetTask) // 任务详情
authGroup.PUT("/tasks", routes.PutTask) // 派发任务
authGroup.DELETE("/tasks/:id", routes.DeleteTask) // 删除任务
authGroup.DELETE("/tasks_multiple", routes.DeleteMultipleTask) // 删除多个任务
authGroup.DELETE("/tasks_by_status", routes.DeleteTaskByStatus) //删除指定状态的任务
authGroup.POST("/tasks/:id/cancel", routes.CancelTask) // 取消任务
authGroup.GET("/tasks/:id/log", routes.GetTaskLog) // 任务日志
authGroup.GET("/tasks/:id/results", routes.GetTaskResults) // 任务结果
authGroup.GET("/tasks/:id/results/download", routes.DownloadTaskResultsCsv) // 下载任务结果
}
// schedules
authGroup.GET("/schedules", routes.GetScheduleList) // 定时任务列表
authGroup.GET("/schedules/:id", routes.GetSchedule) // 定时任务详情
authGroup.PUT("/schedules", routes.PutSchedule) // 创建定时任务
authGroup.POST("/schedules/:id", routes.PostSchedule) // 修改定时任务
authGroup.DELETE("/schedules/:id", routes.DeleteSchedule) // 删除定时任务
{
authGroup.GET("/schedules", routes.GetScheduleList) // 定时任务列表
authGroup.GET("/schedules/:id", routes.GetSchedule) // 定时任务详情
authGroup.PUT("/schedules", routes.PutSchedule) // 创建定时任务
authGroup.POST("/schedules/:id", routes.PostSchedule) // 修改定时任务
authGroup.DELETE("/schedules/:id", routes.DeleteSchedule) // 删除定时任务
authGroup.POST("/schedules/:id/disable", routes.DisableSchedule) // 禁用定时任务
authGroup.POST("/schedules/:id/enable", routes.EnableSchedule) // 启用定时任务
}
// users
{
authGroup.GET("/users", routes.GetUserList) // 用户列表
authGroup.GET("/users/:id", routes.GetUser) // 用户详情
authGroup.POST("/users/:id", routes.PostUser) // 更改用户
authGroup.DELETE("/users/:id", routes.DeleteUser) // 删除用户
authGroup.GET("/me", routes.GetMe) // 获取自己账户
authGroup.POST("/me", routes.PostMe) // 修改自己账户
}
// system
{
authGroup.GET("/system/deps/:lang", routes.GetAllDepList) // 节点所有第三方依赖列表
authGroup.GET("/system/deps/:lang/:dep_name/json", routes.GetDepJson) // 节点第三方依赖JSON
}
// global variables
{
authGroup.GET("/variables", routes.GetVariableList) // 列表
authGroup.PUT("/variable", routes.PutVariable) // 新增
authGroup.POST("/variable/:id", routes.PostVariable) //修改
authGroup.DELETE("/variable/:id", routes.DeleteVariable) //删除
}
// projects
{
authGroup.GET("/projects", routes.GetProjectList) // 列表
authGroup.GET("/projects/tags", routes.GetProjectTags) // 项目标签
authGroup.PUT("/projects", routes.PutProject) //修改
authGroup.POST("/projects/:id", routes.PostProject) // 新增
authGroup.DELETE("/projects/:id", routes.DeleteProject) //删除
}
// statistics
authGroup.GET("/stats/home", routes.GetHomeStats) // 首页统计数据
// 用户
authGroup.GET("/users", routes.GetUserList) // 用户列表
authGroup.GET("/users/:id", routes.GetUser) // 用户详情
authGroup.POST("/users/:id", routes.PostUser) // 更改用户
authGroup.DELETE("/users/:id", routes.DeleteUser) // 删除用户
authGroup.GET("/me", routes.GetMe) // 获取自己账户
// files
authGroup.GET("/file", routes.GetFile) // 获取文件
}
}
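For orientation, here is a minimal, standalone sketch of the routing pattern used above: an anonymous group for open endpoints such as login, and an authorized group whose routes all pass through a token-checking middleware. The handlers, header check, and port below are placeholder assumptions, not Crawlab's actual implementation.

package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// authRequired is a stand-in for the authorization middleware: it aborts the
// request unless an Authorization header is present.
func authRequired() gin.HandlerFunc {
	return func(c *gin.Context) {
		if c.GetHeader("Authorization") == "" {
			c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"status": "error"})
			return
		}
		c.Next()
	}
}

func main() {
	app := gin.Default()

	// anonymous group: login and a few read-only endpoints stay open
	anon := app.Group("/")
	anon.POST("/login", func(c *gin.Context) { c.JSON(http.StatusOK, gin.H{"status": "ok"}) })

	// authorized group: everything else sits behind the token middleware
	auth := app.Group("/", authRequired())
	auth.GET("/spiders", func(c *gin.Context) { c.JSON(http.StatusOK, gin.H{"data": []string{}}) })

	_ = app.Run(":8000")
}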

View File

@@ -42,12 +42,12 @@ func init() {
app.DELETE("/tasks/:id", DeleteTask) // 删除任务
app.GET("/tasks/:id/results", GetTaskResults) // 任务结果
app.GET("/tasks/:id/results/download", DownloadTaskResultsCsv) // 下载任务结果
app.GET("/spiders", GetSpiderList) // 爬虫列表
app.GET("/spiders/:id", GetSpider) // 爬虫详情
app.POST("/spiders/:id", PostSpider) // 修改爬虫
app.DELETE("/spiders/:id",DeleteSpider) // 删除爬虫
app.GET("/spiders/:id/tasks",GetSpiderTasks) // 爬虫任务列表
app.GET("/spiders/:id/dir",GetSpiderDir) // 爬虫目录
app.GET("/spiders", GetSpiderList) // 爬虫列表
app.GET("/spiders/:id", GetSpider) // 爬虫详情
app.POST("/spiders/:id", PostSpider) // 修改爬虫
app.DELETE("/spiders/:id", DeleteSpider) // 删除爬虫
app.GET("/spiders/:id/tasks", GetSpiderTasks) // 爬虫任务列表
app.GET("/spiders/:id/dir", GetSpiderDir) // 爬虫目录
}
//mock test, test data in ./mock

View File

@@ -10,17 +10,19 @@ import (
"time"
)
var NodeIdss = []bson.ObjectId{bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
bson.ObjectIdHex("5d429e6c19f7abede924fee1")}
var scheduleList = []model.Schedule{
{
Id: bson.ObjectId("5d429e6c19f7abede924fee2"),
Name: "test schedule",
SpiderId: "123",
NodeId: bson.ObjectId("5d429e6c19f7abede924fee2"),
NodeIds: NodeIdss,
Cron: "***1*",
EntryId: 10,
// 前端展示
SpiderName: "test schedule",
NodeName: "测试节点",
CreateTs: time.Now(),
UpdateTs: time.Now(),
@@ -29,12 +31,11 @@ var scheduleList = []model.Schedule{
Id: bson.ObjectId("xx429e6c19f7abede924fee2"),
Name: "test schedule2",
SpiderId: "234",
NodeId: bson.ObjectId("5d429e6c19f7abede924fee2"),
NodeIds: NodeIdss,
Cron: "***1*",
EntryId: 10,
// 前端展示
SpiderName: "test schedule2",
NodeName: "测试节点",
CreateTs: time.Now(),
UpdateTs: time.Now(),
@@ -100,8 +101,10 @@ func PutSchedule(c *gin.Context) {
}
// 如果node_id为空则置为空ObjectId
if item.NodeId == "" {
item.NodeId = bson.ObjectIdHex(constants.ObjectIdNull)
for _, NodeId := range item.NodeIds {
if NodeId == "" {
NodeId = bson.ObjectIdHex(constants.ObjectIdNull)
}
}
c.JSON(http.StatusOK, Response{

View File

@@ -75,12 +75,11 @@ func TestPostSchedule(t *testing.T) {
Id: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
Name: "test schedule",
SpiderId: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
NodeId: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
NodeIds: NodeIdss,
Cron: "***1*",
EntryId: 10,
// 前端展示
SpiderName: "test schedule",
NodeName: "测试节点",
CreateTs: time.Now(),
UpdateTs: time.Now(),
@@ -112,12 +111,11 @@ func TestPutSchedule(t *testing.T) {
Id: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
Name: "test schedule",
SpiderId: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
NodeId: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
NodeIds: NodeIdss,
Cron: "***1*",
EntryId: 10,
// 前端展示
SpiderName: "test schedule",
NodeName: "测试节点",
CreateTs: time.Now(),
UpdateTs: time.Now(),

View File

@@ -6,8 +6,6 @@ import (
"net/http"
)
var taskDailyItems = []model.TaskDailyItem{
{
Date: "2019/08/19",

View File

@@ -1 +1 @@
package mock
package mock

View File

@@ -1 +1 @@
package mock
package mock

View File

@@ -0,0 +1,26 @@
package config_spider
import "crawlab/entity"
func GetAllFields(data entity.ConfigSpiderData) []entity.Field {
var fields []entity.Field
for _, stage := range data.Stages {
for _, field := range stage.Fields {
fields = append(fields, field)
}
}
return fields
}
func GetStartStageName(data entity.ConfigSpiderData) string {
// if start_stage is explicitly set, return it
if data.StartStage != "" {
return data.StartStage
}
// otherwise return the first stage
for _, stage := range data.Stages {
return stage.Name
}
return ""
}
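A small self-contained sketch of the start-stage resolution rule above, using local stand-in types rather than the project's entity package: an explicit start_stage wins, otherwise the first declared stage is used.

package main

import "fmt"

type stage struct{ Name string }

type configData struct {
	StartStage string
	Stages     []stage
}

// startStageName mirrors GetStartStageName: prefer the explicit start stage,
// otherwise fall back to the first stage in declaration order.
func startStageName(d configData) string {
	if d.StartStage != "" {
		return d.StartStage
	}
	for _, s := range d.Stages {
		return s.Name
	}
	return ""
}

func main() {
	d := configData{Stages: []stage{{Name: "list"}, {Name: "detail"}}}
	fmt.Println(startStageName(d)) // list
	d.StartStage = "detail"
	fmt.Println(startStageName(d)) // detail
}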

View File

@@ -0,0 +1,259 @@
package config_spider
import (
"crawlab/constants"
"crawlab/entity"
"crawlab/model"
"crawlab/utils"
"errors"
"fmt"
"path/filepath"
)
type ScrapyGenerator struct {
Spider model.Spider
ConfigData entity.ConfigSpiderData
}
// generate the spider files
func (g ScrapyGenerator) Generate() error {
// generate items.py
if err := g.ProcessItems(); err != nil {
return err
}
// generate spider.py
if err := g.ProcessSpider(); err != nil {
return err
}
return nil
}
// generate items.py
func (g ScrapyGenerator) ProcessItems() error {
// 待处理文件名
src := g.Spider.Src
filePath := filepath.Join(src, "config_spider", "items.py")
// 获取所有字段
fields := g.GetAllFields()
// 字段名列表(包含默认字段名)
fieldNames := []string{
"_id",
"task_id",
"ts",
}
// 加入字段
for _, field := range fields {
fieldNames = append(fieldNames, field.Name)
}
// 将字段名转化为python代码
str := ""
for _, fieldName := range fieldNames {
line := g.PadCode(fmt.Sprintf("%s = scrapy.Field()", fieldName), 1)
str += line
}
// 将占位符替换为代码
if err := utils.SetFileVariable(filePath, constants.AnchorItems, str); err != nil {
return err
}
return nil
}
// generate spider.py
func (g ScrapyGenerator) ProcessSpider() error {
// 待处理文件名
src := g.Spider.Src
filePath := filepath.Join(src, "config_spider", "spiders", "spider.py")
// 替换 start_stage
if err := utils.SetFileVariable(filePath, constants.AnchorStartStage, "parse_"+GetStartStageName(g.ConfigData)); err != nil {
return err
}
// 替换 start_url
if err := utils.SetFileVariable(filePath, constants.AnchorStartUrl, g.ConfigData.StartUrl); err != nil {
return err
}
// 替换 parsers
strParser := ""
for _, stage := range g.ConfigData.Stages {
stageName := stage.Name
stageStr := g.GetParserString(stageName, stage)
strParser += stageStr
}
if err := utils.SetFileVariable(filePath, constants.AnchorParsers, strParser); err != nil {
return err
}
return nil
}
func (g ScrapyGenerator) GetParserString(stageName string, stage entity.Stage) string {
// 构造函数定义行
strDef := g.PadCode(fmt.Sprintf("def parse_%s(self, response):", stageName), 1)
strParse := ""
if stage.IsList {
// 列表逻辑
strParse = g.GetListParserString(stageName, stage)
} else {
// 非列表逻辑
strParse = g.GetNonListParserString(stageName, stage)
}
// 构造
str := fmt.Sprintf(`%s%s`, strDef, strParse)
return str
}
func (g ScrapyGenerator) PadCode(str string, num int) string {
res := ""
for i := 0; i < num; i++ {
res += " "
}
res += str
res += "\n"
return res
}
func (g ScrapyGenerator) GetNonListParserString(stageName string, stage entity.Stage) string {
str := ""
// 获取或构造item
str += g.PadCode("item = Item() if response.meta.get('item') is None else response.meta.get('item')", 2)
// 遍历字段列表
for _, f := range stage.Fields {
line := fmt.Sprintf(`item['%s'] = response.%s.extract_first()`, f.Name, g.GetExtractStringFromField(f))
line = g.PadCode(line, 2)
str += line
}
// next stage 字段
if f, err := g.GetNextStageField(stage); err == nil {
// 如果找到 next stage 字段,进行下一个回调
str += g.PadCode(fmt.Sprintf(`yield scrapy.Request(url="get_real_url(response, item['%s'])", callback=self.parse_%s, meta={'item': item})`, f.Name, f.NextStage), 2)
} else {
// 如果没找到 next stage 字段,返回 item
str += g.PadCode(fmt.Sprintf(`yield item`), 2)
}
// 加入末尾换行
str += g.PadCode("", 0)
return str
}
func (g ScrapyGenerator) GetListParserString(stageName string, stage entity.Stage) string {
str := ""
// 获取前一个 stage 的 item
str += g.PadCode(`prev_item = response.meta.get('item')`, 2)
// for 循环遍历列表
str += g.PadCode(fmt.Sprintf(`for elem in response.%s:`, g.GetListString(stage)), 2)
// 构造item
str += g.PadCode(`item = Item()`, 3)
// 遍历字段列表
for _, f := range stage.Fields {
line := fmt.Sprintf(`item['%s'] = elem.%s.extract_first()`, f.Name, g.GetExtractStringFromField(f))
line = g.PadCode(line, 3)
str += line
}
// 把前一个 stage 的 item 值赋给当前 item
str += g.PadCode(`if prev_item is not None:`, 3)
str += g.PadCode(`for key, value in prev_item.items():`, 4)
str += g.PadCode(`item[key] = value`, 5)
// next stage 字段
if f, err := g.GetNextStageField(stage); err == nil {
// 如果找到 next stage 字段,进行下一个回调
str += g.PadCode(fmt.Sprintf(`yield scrapy.Request(url=get_real_url(response, item['%s']), callback=self.parse_%s, meta={'item': item})`, f.Name, f.NextStage), 3)
} else {
// 如果没找到 next stage 字段,返回 item
str += g.PadCode(fmt.Sprintf(`yield item`), 3)
}
// 分页
if stage.PageCss != "" || stage.PageXpath != "" {
str += g.PadCode(fmt.Sprintf(`next_url = response.%s.extract_first()`, g.GetExtractStringFromStage(stage)), 2)
str += g.PadCode(fmt.Sprintf(`yield scrapy.Request(url=get_real_url(response, next_url), callback=self.parse_%s, meta={'item': prev_item})`, stageName), 2)
}
// 加入末尾换行
str += g.PadCode("", 0)
return str
}
// get all fields
func (g ScrapyGenerator) GetAllFields() []entity.Field {
return GetAllFields(g.ConfigData)
}
// get the field that points to a next stage
func (g ScrapyGenerator) GetNextStageField(stage entity.Stage) (entity.Field, error) {
for _, field := range stage.Fields {
if field.NextStage != "" {
return field, nil
}
}
return entity.Field{}, errors.New("cannot find next stage field")
}
func (g ScrapyGenerator) GetExtractStringFromField(f entity.Field) string {
if f.Css != "" {
// 如果为CSS
if f.Attr == "" {
// 文本
return fmt.Sprintf(`css('%s::text')`, f.Css)
} else {
// 属性
return fmt.Sprintf(`css('%s::attr("%s")')`, f.Css, f.Attr)
}
} else {
// 如果为XPath
if f.Attr == "" {
// 文本
return fmt.Sprintf(`xpath('string(%s)')`, f.Xpath)
} else {
// 属性
return fmt.Sprintf(`xpath('%s/@%s')`, f.Xpath, f.Attr)
}
}
}
func (g ScrapyGenerator) GetExtractStringFromStage(stage entity.Stage) string {
// 分页元素属性,默认为 href
pageAttr := "href"
if stage.PageAttr != "" {
pageAttr = stage.PageAttr
}
if stage.PageCss != "" {
// 如果为CSS
return fmt.Sprintf(`css('%s::attr("%s")')`, stage.PageCss, pageAttr)
} else {
// 如果为XPath
return fmt.Sprintf(`xpath('%s/@%s')`, stage.PageXpath, pageAttr)
}
}
func (g ScrapyGenerator) GetListString(stage entity.Stage) string {
if stage.ListCss != "" {
return fmt.Sprintf(`css('%s')`, stage.ListCss)
} else {
return fmt.Sprintf(`xpath('%s')`, stage.ListXpath)
}
}
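To make the selector mapping concrete, the standalone sketch below mirrors GetExtractStringFromField with a simplified local field type (not the project's entity.Field) and prints the Scrapy extraction expressions the generator emits for CSS and XPath selectors.

package main

import "fmt"

// field is a simplified stand-in for the generator's field definition.
type field struct {
	Name, Css, Xpath, Attr string
}

// extractExpr reproduces the mapping above: CSS selectors become ::text or
// ::attr(...) expressions, XPath selectors become string(...) or /@attr lookups.
func extractExpr(f field) string {
	if f.Css != "" {
		if f.Attr == "" {
			return fmt.Sprintf(`css('%s::text')`, f.Css)
		}
		return fmt.Sprintf(`css('%s::attr("%s")')`, f.Css, f.Attr)
	}
	if f.Attr == "" {
		return fmt.Sprintf(`xpath('string(%s)')`, f.Xpath)
	}
	return fmt.Sprintf(`xpath('%s/@%s')`, f.Xpath, f.Attr)
}

func main() {
	fmt.Println(extractExpr(field{Name: "title", Css: "h1.title"}))           // css('h1.title::text')
	fmt.Println(extractExpr(field{Name: "url", Css: "a.next", Attr: "href"})) // css('a.next::attr("href")')
	fmt.Println(extractExpr(field{Name: "body", Xpath: "//div[1]"}))          // xpath('string(//div[1])')
}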

View File

@@ -20,10 +20,13 @@ type GridFs struct {
}
type File struct {
Name string `json:"name"`
Path string `json:"path"`
IsDir bool `json:"is_dir"`
Size int64 `json:"size"`
Name string `json:"name"`
Path string `json:"path"`
RelativePath string `json:"relative_path"`
IsDir bool `json:"is_dir"`
Size int64 `json:"size"`
Children []File `json:"children"`
Label string `json:"label"`
}
func (f *GridFs) Remove() {

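The Children and Label fields above back the nested file-tree view. Below is a hedged, standalone sketch of how such a tree can be assembled from disk; the types and recursion strategy are illustrative assumptions, not the project's actual implementation.

package main

import (
	"fmt"
	"io/ioutil"
	"path/filepath"
)

// file mirrors the tree-shaped struct above in simplified form.
type file struct {
	Name     string
	Path     string
	IsDir    bool
	Size     int64
	Children []file
}

// buildTree recursively lists a directory into a nested file structure.
func buildTree(path, name string, isDir bool, size int64) (file, error) {
	node := file{Name: name, Path: path, IsDir: isDir, Size: size}
	if !isDir {
		return node, nil
	}
	entries, err := ioutil.ReadDir(path)
	if err != nil {
		return node, err
	}
	for _, e := range entries {
		child, err := buildTree(filepath.Join(path, e.Name()), e.Name(), e.IsDir(), e.Size())
		if err != nil {
			return node, err
		}
		node.Children = append(node.Children, child)
	}
	return node, nil
}

func main() {
	root, err := buildTree(".", ".", true, 0)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d entries under %s\n", len(root.Children), root.Path)
}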
View File

@@ -55,7 +55,7 @@ func GetCurrentNode() (Node, error) {
for {
// 如果错误次数超过10次
if errNum >= 10 {
panic("cannot get current node")
return node, errors.New("cannot get current node")
}
// 尝试获取节点
@@ -63,7 +63,9 @@ func GetCurrentNode() (Node, error) {
// 如果获取失败
if err != nil {
// if this is the master node, this is its first registration, so insert the node info
if IsMaster() {
// update: filter on the specific error to avoid registering multiple master nodes; responsibilities may be split out later
// only check whether the master node info exists when running as the master node
if IsMaster() && err == mgo.ErrNotFound {
// 获取本机信息
ip, mac, key, err := GetNodeBaseInfo()
if err != nil {
@@ -143,6 +145,7 @@ func (n *Node) GetTasks() ([]Task, error) {
return tasks, nil
}
// node list
func GetNodeList(filter interface{}) ([]Node, error) {
s, c := database.GetCol("nodes")
defer s.Close()
@@ -156,6 +159,7 @@ func GetNodeList(filter interface{}) ([]Node, error) {
return results, nil
}
// node info
func GetNode(id bson.ObjectId) (Node, error) {
var node Node
@@ -169,13 +173,14 @@ func GetNode(id bson.ObjectId) (Node, error) {
defer s.Close()
if err := c.FindId(id).One(&node); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
//log.Errorf("get node error: %s, id: %s", err.Error(), id.Hex())
//debug.PrintStack()
return node, err
}
return node, nil
}
// node info
func GetNodeByKey(key string) (Node, error) {
s, c := database.GetCol("nodes")
defer s.Close()
@@ -191,6 +196,7 @@ func GetNodeByKey(key string) (Node, error) {
return node, nil
}
// update node
func UpdateNode(id bson.ObjectId, item Node) error {
s, c := database.GetCol("nodes")
defer s.Close()
@@ -206,6 +212,7 @@ func UpdateNode(id bson.ObjectId, item Node) error {
return nil
}
// task list
func GetNodeTaskList(id bson.ObjectId) ([]Task, error) {
node, err := GetNode(id)
if err != nil {
@@ -218,6 +225,7 @@ func GetNodeTaskList(id bson.ObjectId) ([]Task, error) {
return tasks, nil
}
// node count
func GetNodeCount(query interface{}) (int, error) {
s, c := database.GetCol("nodes")
defer s.Close()

146
backend/model/project.go Normal file
View File

@@ -0,0 +1,146 @@
package model
import (
"crawlab/constants"
"crawlab/database"
"github.com/apex/log"
"github.com/globalsign/mgo/bson"
"runtime/debug"
"time"
)
type Project struct {
Id bson.ObjectId `json:"_id" bson:"_id"`
Name string `json:"name" bson:"name"`
Description string `json:"description" bson:"description"`
Tags []string `json:"tags" bson:"tags"`
CreateTs time.Time `json:"create_ts" bson:"create_ts"`
UpdateTs time.Time `json:"update_ts" bson:"update_ts"`
// for frontend display
Spiders []Spider `json:"spiders" bson:"spiders"`
}
func (p *Project) Save() error {
s, c := database.GetCol("projects")
defer s.Close()
p.UpdateTs = time.Now()
if err := c.UpdateId(p.Id, p); err != nil {
debug.PrintStack()
return err
}
return nil
}
func (p *Project) Add() error {
s, c := database.GetCol("projects")
defer s.Close()
p.Id = bson.NewObjectId()
p.UpdateTs = time.Now()
p.CreateTs = time.Now()
if err := c.Insert(p); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return err
}
return nil
}
func (p *Project) GetSpiders() ([]Spider, error) {
s, c := database.GetCol("spiders")
defer s.Close()
var query interface{}
if p.Id.Hex() == constants.ObjectIdNull {
query = bson.M{
"$or": []bson.M{
{"project_id": p.Id},
{"project_id": bson.M{"$exists": false}},
},
}
} else {
query = bson.M{"project_id": p.Id}
}
var spiders []Spider
if err := c.Find(query).All(&spiders); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return spiders, err
}
return spiders, nil
}
func GetProject(id bson.ObjectId) (Project, error) {
s, c := database.GetCol("projects")
defer s.Close()
var p Project
if err := c.Find(bson.M{"_id": id}).One(&p); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return p, err
}
return p, nil
}
func GetProjectList(filter interface{}, skip int, sortKey string) ([]Project, error) {
s, c := database.GetCol("projects")
defer s.Close()
var projects []Project
if err := c.Find(filter).Skip(skip).Limit(constants.Infinite).Sort(sortKey).All(&projects); err != nil {
debug.PrintStack()
return projects, err
}
return projects, nil
}
func GetProjectListTotal(filter interface{}) (int, error) {
s, c := database.GetCol("projects")
defer s.Close()
var result int
result, err := c.Find(filter).Count()
if err != nil {
return result, err
}
return result, nil
}
func UpdateProject(id bson.ObjectId, item Project) error {
s, c := database.GetCol("projects")
defer s.Close()
var result Project
if err := c.FindId(id).One(&result); err != nil {
debug.PrintStack()
return err
}
if err := item.Save(); err != nil {
return err
}
return nil
}
func RemoveProject(id bson.ObjectId) error {
s, c := database.GetCol("projects")
defer s.Close()
var result User
if err := c.FindId(id).One(&result); err != nil {
return err
}
if err := c.RemoveId(id); err != nil {
return err
}
return nil
}

View File

@@ -12,19 +12,23 @@ import (
)
type Schedule struct {
Id bson.ObjectId `json:"_id" bson:"_id"`
Name string `json:"name" bson:"name"`
Description string `json:"description" bson:"description"`
SpiderId bson.ObjectId `json:"spider_id" bson:"spider_id"`
NodeId bson.ObjectId `json:"node_id" bson:"node_id"`
NodeKey string `json:"node_key" bson:"node_key"`
Cron string `json:"cron" bson:"cron"`
EntryId cron.EntryID `json:"entry_id" bson:"entry_id"`
Param string `json:"param" bson:"param"`
Id bson.ObjectId `json:"_id" bson:"_id"`
Name string `json:"name" bson:"name"`
Description string `json:"description" bson:"description"`
SpiderId bson.ObjectId `json:"spider_id" bson:"spider_id"`
Cron string `json:"cron" bson:"cron"`
EntryId cron.EntryID `json:"entry_id" bson:"entry_id"`
Param string `json:"param" bson:"param"`
RunType string `json:"run_type" bson:"run_type"`
NodeIds []bson.ObjectId `json:"node_ids" bson:"node_ids"`
Status string `json:"status" bson:"status"`
Enabled bool `json:"enabled" bson:"enabled"`
UserId bson.ObjectId `json:"user_id" bson:"user_id"`
// for frontend display
SpiderName string `json:"spider_name" bson:"spider_name"`
NodeName string `json:"node_name" bson:"node_name"`
Nodes []Node `json:"nodes" bson:"nodes"`
Message string `json:"message" bson:"message"`
CreateTs time.Time `json:"create_ts" bson:"create_ts"`
UpdateTs time.Time `json:"update_ts" bson:"update_ts"`
@@ -46,27 +50,6 @@ func (sch *Schedule) Delete() error {
return c.RemoveId(sch.Id)
}
func (sch *Schedule) SyncNodeIdAndSpiderId(node Node, spider Spider) {
sch.syncNodeId(node)
sch.syncSpiderId(spider)
}
func (sch *Schedule) syncNodeId(node Node) {
if node.Id.Hex() == sch.NodeId.Hex() {
return
}
sch.NodeId = node.Id
_ = sch.Save()
}
func (sch *Schedule) syncSpiderId(spider Spider) {
if spider.Id.Hex() == sch.SpiderId.Hex() {
return
}
sch.SpiderId = spider.Id
_ = sch.Save()
}
func GetScheduleList(filter interface{}) ([]Schedule, error) {
s, c := database.GetCol("schedules")
defer s.Close()
@@ -79,28 +62,25 @@ func GetScheduleList(filter interface{}) ([]Schedule, error) {
var schs []Schedule
for _, schedule := range schedules {
// 获取节点名称
if schedule.NodeId == bson.ObjectIdHex(constants.ObjectIdNull) {
// 选择所有节点
schedule.NodeName = "All Nodes"
} else {
// 选择单一节点
node, err := GetNode(schedule.NodeId)
if err != nil {
log.Errorf(err.Error())
continue
schedule.Nodes = []Node{}
if schedule.RunType == constants.RunTypeSelectedNodes {
for _, nodeId := range schedule.NodeIds {
// 选择单一节点
node, _ := GetNode(nodeId)
schedule.Nodes = append(schedule.Nodes, node)
}
schedule.NodeName = node.Name
}
// 获取爬虫名称
spider, err := GetSpider(schedule.SpiderId)
if err != nil && err == mgo.ErrNotFound {
log.Errorf("get spider by id: %s, error: %s", schedule.SpiderId.Hex(), err.Error())
debug.PrintStack()
_ = schedule.Delete()
continue
schedule.Status = constants.ScheduleStatusError
schedule.Message = constants.ScheduleStatusErrorNotFoundSpider
} else {
schedule.SpiderName = spider.Name
}
schedule.SpiderName = spider.Name
schs = append(schs, schedule)
}
return schs, nil
@@ -125,12 +105,8 @@ func UpdateSchedule(id bson.ObjectId, item Schedule) error {
if err := c.FindId(id).One(&result); err != nil {
return err
}
node, err := GetNode(item.NodeId)
if err != nil {
return err
}
item.NodeKey = node.Key
item.UpdateTs = time.Now()
if err := item.Save(); err != nil {
return err
}
@@ -141,15 +117,9 @@ func AddSchedule(item Schedule) error {
s, c := database.GetCol("schedules")
defer s.Close()
node, err := GetNode(item.NodeId)
if err != nil {
return err
}
item.Id = bson.NewObjectId()
item.CreateTs = time.Now()
item.UpdateTs = time.Now()
item.NodeKey = node.Key
if err := c.Insert(&item); err != nil {
debug.PrintStack()

View File

@@ -1,11 +1,17 @@
package model
import (
"crawlab/constants"
"crawlab/database"
"crawlab/entity"
"crawlab/utils"
"errors"
"github.com/apex/log"
"github.com/globalsign/mgo"
"github.com/globalsign/mgo/bson"
"gopkg.in/yaml.v2"
"io/ioutil"
"path/filepath"
"runtime/debug"
"time"
)
@@ -25,25 +31,21 @@ type Spider struct {
Site string `json:"site" bson:"site"` // 爬虫网站
Envs []Env `json:"envs" bson:"envs"` // 环境变量
Remark string `json:"remark" bson:"remark"` // 备注
Src string `json:"src" bson:"src"` // 源码位置
ProjectId bson.ObjectId `json:"project_id" bson:"project_id"` // 项目ID
// 自定义爬虫
Src string `json:"src" bson:"src"` // 源码位置
Cmd string `json:"cmd" bson:"cmd"` // 执行命令
// 可配置爬虫
Template string `json:"template" bson:"template"` // Spiderfile模版
// 前端展示
LastRunTs time.Time `json:"last_run_ts"` // 最后一次执行时间
LastStatus string `json:"last_status"` // 最后执行状态
// TODO: 可配置爬虫
//Fields []interface{} `json:"fields"`
//DetailFields []interface{} `json:"detail_fields"`
//CrawlType string `json:"crawl_type"`
//StartUrl string `json:"start_url"`
//UrlPattern string `json:"url_pattern"`
//ItemSelector string `json:"item_selector"`
//ItemSelectorType string `json:"item_selector_type"`
//PaginationSelector string `json:"pagination_selector"`
//PaginationSelectorType string `json:"pagination_selector_type"`
LastRunTs time.Time `json:"last_run_ts"` // 最后一次执行时间
LastStatus string `json:"last_status"` // 最后执行状态
Config entity.ConfigSpiderData `json:"config"` // 可配置爬虫配置
// 时间
CreateTs time.Time `json:"create_ts" bson:"create_ts"`
UpdateTs time.Time `json:"update_ts" bson:"update_ts"`
}
@@ -55,6 +57,11 @@ func (spider *Spider) Save() error {
spider.UpdateTs = time.Now()
// 兼容没有项目ID的爬虫
if spider.ProjectId.Hex() == "" {
spider.ProjectId = bson.ObjectIdHex(constants.ObjectIdNull)
}
if err := c.UpdateId(spider.Id, spider); err != nil {
debug.PrintStack()
return err
@@ -98,24 +105,29 @@ func (spider *Spider) GetLastTask() (Task, error) {
return tasks[0], nil
}
// 删除爬虫
func (spider *Spider) Delete() error {
s, c := database.GetCol("spiders")
defer s.Close()
return c.RemoveId(spider.Id)
}
// 爬虫列表
func GetSpiderList(filter interface{}, skip int, limit int) ([]Spider, int, error) {
// 获取爬虫列表
func GetSpiderList(filter interface{}, skip int, limit int, sortStr string) ([]Spider, int, error) {
s, c := database.GetCol("spiders")
defer s.Close()
// 获取爬虫列表
var spiders []Spider
if err := c.Find(filter).Skip(skip).Limit(limit).Sort("+name").All(&spiders); err != nil {
if err := c.Find(filter).Skip(skip).Limit(limit).Sort(sortStr).All(&spiders); err != nil {
debug.PrintStack()
return spiders, 0, err
}
if spiders == nil {
spiders = []Spider{}
}
// 遍历爬虫列表
for i, spider := range spiders {
// 获取最后一次任务
@@ -136,7 +148,7 @@ func GetSpiderList(filter interface{}, skip int, limit int) ([]Spider, int, erro
return spiders, count, nil
}
// 获取爬虫
// 获取爬虫(根据FileId)
func GetSpiderByFileId(fileId bson.ObjectId) *Spider {
s, c := database.GetCol("spiders")
defer s.Close()
@@ -150,34 +162,44 @@ func GetSpiderByFileId(fileId bson.ObjectId) *Spider {
return result
}
// 获取爬虫
func GetSpiderByName(name string) *Spider {
s, c := database.GetCol("spiders")
defer s.Close()
var result *Spider
if err := c.Find(bson.M{"name": name}).One(&result); err != nil {
log.Errorf("get spider error: %s, spider_name: %s", err.Error(), name)
debug.PrintStack()
return nil
}
return result
}
// 获取爬虫
func GetSpider(id bson.ObjectId) (Spider, error) {
// 获取爬虫(根据名称)
func GetSpiderByName(name string) Spider {
s, c := database.GetCol("spiders")
defer s.Close()
var result Spider
if err := c.FindId(id).One(&result); err != nil {
if err := c.Find(bson.M{"name": name}).One(&result); err != nil && err != mgo.ErrNotFound {
log.Errorf("get spider error: %s, spider_name: %s", err.Error(), name)
//debug.PrintStack()
return result
}
return result
}
// 获取爬虫(根据ID)
func GetSpider(id bson.ObjectId) (Spider, error) {
s, c := database.GetCol("spiders")
defer s.Close()
// 获取爬虫
var spider Spider
if err := c.FindId(id).One(&spider); err != nil {
if err != mgo.ErrNotFound {
log.Errorf("get spider error: %s, id: %s", err.Error(), id.Hex())
debug.PrintStack()
}
return result, err
return spider, err
}
return result, nil
// 如果为可配置爬虫,获取爬虫配置
if spider.Type == constants.Configurable && utils.Exists(filepath.Join(spider.Src, "Spiderfile")) {
config, err := GetConfigSpiderData(spider)
if err != nil {
return spider, err
}
spider.Config = config
}
return spider, nil
}
// 更新爬虫
@@ -217,10 +239,12 @@ func RemoveSpider(id bson.ObjectId) error {
s, gf := database.GetGridFs("files")
defer s.Close()
if err := gf.RemoveId(result.FileId); err != nil {
log.Error("remove file error, id:" + result.FileId.Hex())
debug.PrintStack()
return err
if result.FileId.Hex() != constants.ObjectIdNull {
if err := gf.RemoveId(result.FileId); err != nil {
log.Error("remove file error, id:" + result.FileId.Hex())
debug.PrintStack()
return err
}
}
return nil
@@ -245,7 +269,7 @@ func RemoveAllSpider() error {
return nil
}
// 爬虫总数
// 获取爬虫总数
func GetSpiderCount() (int, error) {
s, c := database.GetCol("spiders")
defer s.Close()
@@ -257,23 +281,29 @@ func GetSpiderCount() (int, error) {
return count, nil
}
// 爬虫类型
func GetSpiderTypes() ([]*entity.SpiderType, error) {
s, c := database.GetCol("spiders")
defer s.Close()
// get the config data of a configurable spider (parsed from its Spiderfile)
func GetConfigSpiderData(spider Spider) (entity.ConfigSpiderData, error) {
// 构造配置数据
configData := entity.ConfigSpiderData{}
group := bson.M{
"$group": bson.M{
"_id": "$type",
"count": bson.M{"$sum": 1},
},
}
var types []*entity.SpiderType
if err := c.Pipe([]bson.M{group}).All(&types); err != nil {
log.Errorf("get spider types error: %s", err.Error())
debug.PrintStack()
return nil, err
// 校验爬虫类别
if spider.Type != constants.Configurable {
return configData, errors.New("not a configurable spider")
}
return types, nil
// Spiderfile 目录
sfPath := filepath.Join(spider.Src, "Spiderfile")
// 读取YAML文件
yamlFile, err := ioutil.ReadFile(sfPath)
if err != nil {
return configData, err
}
// 反序列化
if err := yaml.Unmarshal(yamlFile, &configData); err != nil {
return configData, err
}
return configData, nil
}
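As a rough illustration of the Spiderfile parsing above, this standalone snippet decodes a YAML document into a simplified local struct with gopkg.in/yaml.v2. The field names and YAML keys are assumptions for the sketch, not the project's actual ConfigSpiderData schema.

package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

// spiderfile is a minimal stand-in for the configurable-spider config data.
type spiderfile struct {
	Name       string `yaml:"name"`
	StartUrl   string `yaml:"start_url"`
	StartStage string `yaml:"start_stage"`
	Stages     []struct {
		Name   string `yaml:"name"`
		IsList bool   `yaml:"is_list"`
	} `yaml:"stages"`
}

func main() {
	src := []byte(`
name: example
start_url: http://example.com
stages:
  - name: list
    is_list: true
  - name: detail
`)
	var cfg spiderfile
	if err := yaml.Unmarshal(src, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}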

View File

@@ -25,6 +25,7 @@ type Task struct {
RuntimeDuration float64 `json:"runtime_duration" bson:"runtime_duration"`
TotalDuration float64 `json:"total_duration" bson:"total_duration"`
Pid int `json:"pid" bson:"pid"`
UserId bson.ObjectId `json:"user_id" bson:"user_id"`
// 前端数据
SpiderName string `json:"spider_name"`
@@ -61,6 +62,7 @@ func (t *Task) Save() error {
defer s.Close()
t.UpdateTs = time.Now()
if err := c.UpdateId(t.Id, t); err != nil {
log.Errorf("update task error: %s", err.Error())
debug.PrintStack()
return err
}
@@ -93,7 +95,7 @@ func (t *Task) GetResults(pageNum int, pageSize int) (results []interface{}, tot
query := bson.M{
"task_id": t.Id,
}
if err = c.Find(query).Skip((pageNum - 1) * pageSize).Limit(pageSize).Sort("-create_ts").All(&results); err != nil {
if err = c.Find(query).Skip((pageNum - 1) * pageSize).Limit(pageSize).All(&results); err != nil {
return
}
@@ -116,18 +118,12 @@ func GetTaskList(filter interface{}, skip int, limit int, sortKey string) ([]Tas
for i, task := range tasks {
// 获取爬虫名称
spider, err := task.GetSpider()
if err != nil || spider.Id.Hex() == "" {
_ = spider.Delete()
} else {
if spider, err := task.GetSpider(); err == nil {
tasks[i].SpiderName = spider.DisplayName
}
// 获取节点名称
node, err := task.GetNode()
if node.Id.Hex() == "" || err != nil {
_ = task.Delete()
} else {
if node, err := task.GetNode(); err == nil {
tasks[i].NodeName = node.Name
}
}
@@ -141,6 +137,8 @@ func GetTaskListTotal(filter interface{}) (int, error) {
var result int
result, err := c.Find(filter).Count()
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return result, err
}
return result, nil
@@ -152,6 +150,7 @@ func GetTask(id string) (Task, error) {
var task Task
if err := c.FindId(id).One(&task); err != nil {
log.Infof("get task error: %s, id: %s", err.Error(), id)
debug.PrintStack()
return task, err
}
@@ -166,6 +165,8 @@ func AddTask(item Task) error {
item.UpdateTs = time.Now()
if err := c.Insert(&item); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return err
}
return nil
@@ -177,6 +178,8 @@ func RemoveTask(id string) error {
var result Task
if err := c.FindId(id).One(&result); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return err
}
@@ -187,6 +190,20 @@ func RemoveTask(id string) error {
return nil
}
func RemoveTaskByStatus(status string) error {
tasks, err := GetTaskList(bson.M{"status": status}, 0, constants.Infinite, "-create_ts")
if err != nil {
log.Error("get tasks error:" + err.Error())
}
for _, task := range tasks {
if err := RemoveTask(task.Id); err != nil {
log.Error("remove task error:" + err.Error())
continue
}
}
return nil
}
// 删除task by spider_id
func RemoveTaskBySpiderId(id bson.ObjectId) error {
tasks, err := GetTaskList(bson.M{"spider_id": id}, 0, constants.Infinite, "-create_ts")

View File

@@ -16,11 +16,20 @@ type User struct {
Username string `json:"username" bson:"username"`
Password string `json:"password" bson:"password"`
Role string `json:"role" bson:"role"`
Email string `json:"email" bson:"email"`
Setting UserSetting `json:"setting" bson:"setting"`
CreateTs time.Time `json:"create_ts" bson:"create_ts"`
UpdateTs time.Time `json:"update_ts" bson:"update_ts"`
}
type UserSetting struct {
NotificationTrigger string `json:"notification_trigger" bson:"notification_trigger"`
DingTalkRobotWebhook string `json:"ding_talk_robot_webhook" bson:"ding_talk_robot_webhook"`
WechatRobotWebhook string `json:"wechat_robot_webhook" bson:"wechat_robot_webhook"`
EnabledNotifications []string `json:"enabled_notifications" bson:"enabled_notifications"`
}
func (user *User) Save() error {
s, c := database.GetCol("users")
defer s.Close()

97
backend/model/variable.go Normal file
View File

@@ -0,0 +1,97 @@
package model
import (
"crawlab/database"
"errors"
"github.com/apex/log"
"github.com/globalsign/mgo/bson"
"runtime/debug"
)
/**
Global variable
*/
type Variable struct {
Id bson.ObjectId `json:"_id" bson:"_id"`
Key string `json:"key" bson:"key"`
Value string `json:"value" bson:"value"`
Remark string `json:"remark" bson:"remark"`
}
func (model *Variable) Save() error {
s, c := database.GetCol("variable")
defer s.Close()
if err := c.UpdateId(model.Id, model); err != nil {
log.Errorf("update variable error: %s", err.Error())
return err
}
return nil
}
func (model *Variable) Add() error {
s, c := database.GetCol("variable")
defer s.Close()
// key 去重
_, err := GetByKey(model.Key)
if err == nil {
return errors.New("key already exists")
}
model.Id = bson.NewObjectId()
if err := c.Insert(model); err != nil {
log.Errorf("add variable error: %s", err.Error())
debug.PrintStack()
return err
}
return nil
}
func (model *Variable) Delete() error {
s, c := database.GetCol("variable")
defer s.Close()
if err := c.RemoveId(model.Id); err != nil {
log.Errorf("remove variable error: %s", err.Error())
debug.PrintStack()
return err
}
return nil
}
func GetByKey(key string) (Variable, error) {
s, c := database.GetCol("variable")
defer s.Close()
var model Variable
if err := c.Find(bson.M{"key": key}).One(&model); err != nil {
log.Errorf("variable found error: %s, key: %s", err.Error(), key)
return model, err
}
return model, nil
}
func GetVariable(id bson.ObjectId) (Variable, error) {
s, c := database.GetCol("variable")
defer s.Close()
var model Variable
if err := c.FindId(id).One(&model); err != nil {
log.Errorf("variable found error: %s", err.Error())
return model, err
}
return model, nil
}
func GetVariableList() []Variable {
s, c := database.GetCol("variable")
defer s.Close()
var list []Variable
if err := c.Find(nil).All(&list); err != nil {
log.Errorf("get variable list error: %s", err.Error())
}
return list
}

View File

@@ -0,0 +1,316 @@
package routes
import (
"crawlab/constants"
"crawlab/entity"
"crawlab/model"
"crawlab/services"
"crawlab/utils"
"fmt"
"github.com/gin-gonic/gin"
"github.com/globalsign/mgo/bson"
"github.com/spf13/viper"
"gopkg.in/yaml.v2"
"io"
"io/ioutil"
"net/http"
"os"
"path/filepath"
"strings"
)
// add a configurable spider
func PutConfigSpider(c *gin.Context) {
var spider model.Spider
if err := c.ShouldBindJSON(&spider); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
// 爬虫名称不能为空
if spider.Name == "" {
HandleErrorF(http.StatusBadRequest, c, "spider name should not be empty")
return
}
// 模版名不能为空
if spider.Template == "" {
HandleErrorF(http.StatusBadRequest, c, "spider template should not be empty")
return
}
// 判断爬虫是否存在
if spider := model.GetSpiderByName(spider.Name); spider.Name != "" {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("spider for '%s' already exists", spider.Name))
return
}
// 设置爬虫类别
spider.Type = constants.Configurable
// 将FileId置空
spider.FileId = bson.ObjectIdHex(constants.ObjectIdNull)
// 创建爬虫目录
spiderDir := filepath.Join(viper.GetString("spider.path"), spider.Name)
if utils.Exists(spiderDir) {
if err := os.RemoveAll(spiderDir); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
if err := os.MkdirAll(spiderDir, 0777); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
spider.Src = spiderDir
// 复制Spiderfile模版
contentByte, err := ioutil.ReadFile("./template/spiderfile/Spiderfile." + spider.Template)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
f, err := os.Create(filepath.Join(spider.Src, "Spiderfile"))
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
defer f.Close()
if _, err := f.Write(contentByte); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 添加爬虫到数据库
if err := spider.Add(); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: spider,
})
}
// update a configurable spider
func PostConfigSpider(c *gin.Context) {
PostSpider(c)
}
// upload a configurable spider's Spiderfile
func UploadConfigSpider(c *gin.Context) {
id := c.Param("id")
// 获取爬虫
var spider model.Spider
spider, err := model.GetSpider(bson.ObjectIdHex(id))
if err != nil {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("cannot find spider (id: %s)", id))
return
}
// get the uploaded file
file, header, err := c.Request.FormFile("file")
if err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
// 文件名称必须为Spiderfile
filename := header.Filename
if filename != "Spiderfile" && filename != "Spiderfile.yaml" && filename != "Spiderfile.yml" {
HandleErrorF(http.StatusBadRequest, c, "filename must be 'Spiderfile(.yaml|.yml)'")
return
}
// 爬虫目录
spiderDir := filepath.Join(viper.GetString("spider.path"), spider.Name)
// 爬虫Spiderfile文件路径
sfPath := filepath.Join(spiderDir, filename)
// open the Spiderfile if it exists, otherwise create it
var f *os.File
if utils.Exists(sfPath) {
f, err = os.OpenFile(sfPath, os.O_WRONLY, 0777)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
} else {
f, err = os.Create(sfPath)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
// 将上传的文件拷贝到爬虫Spiderfile文件
_, err = io.Copy(f, file)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 关闭Spiderfile文件
_ = f.Close()
// 构造配置数据
configData := entity.ConfigSpiderData{}
// 读取YAML文件
yamlFile, err := ioutil.ReadFile(sfPath)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 反序列化
if err := yaml.Unmarshal(yamlFile, &configData); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// process the spider files according to the parsed config data
if err := services.ProcessSpiderFilesFromConfigData(spider, configData); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func PostConfigSpiderSpiderfile(c *gin.Context) {
type Body struct {
Content string `json:"content"`
}
id := c.Param("id")
// 文件内容
var reqBody Body
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
content := reqBody.Content
// 获取爬虫
var spider model.Spider
spider, err := model.GetSpider(bson.ObjectIdHex(id))
if err != nil {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("cannot find spider (id: %s)", id))
return
}
// 反序列化
var configData entity.ConfigSpiderData
if err := yaml.Unmarshal([]byte(content), &configData); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
// 校验configData
if err := services.ValidateSpiderfile(configData); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 写文件
if err := ioutil.WriteFile(filepath.Join(spider.Src, "Spiderfile"), []byte(content), os.ModePerm); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 根据序列化后的数据处理爬虫文件
if err := services.ProcessSpiderFilesFromConfigData(spider, configData); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func PostConfigSpiderConfig(c *gin.Context) {
id := c.Param("id")
// 获取爬虫
var spider model.Spider
spider, err := model.GetSpider(bson.ObjectIdHex(id))
if err != nil {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("cannot find spider (id: %s)", id))
return
}
// 反序列化配置数据
var configData entity.ConfigSpiderData
if err := c.ShouldBindJSON(&configData); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
// 校验configData
if err := services.ValidateSpiderfile(configData); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 替换Spiderfile文件
if err := services.GenerateSpiderfileFromConfigData(spider, configData); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 根据序列化后的数据处理爬虫文件
if err := services.ProcessSpiderFilesFromConfigData(spider, configData); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func GetConfigSpiderConfig(c *gin.Context) {
id := c.Param("id")
// validate the ID
if !bson.IsObjectIdHex(id) {
HandleErrorF(http.StatusBadRequest, c, "invalid id")
return
}
// 获取爬虫
spider, err := model.GetSpider(bson.ObjectIdHex(id))
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: spider.Config,
})
}
// get the list of template names
func GetConfigSpiderTemplateList(c *gin.Context) {
var data []string
for _, fInfo := range utils.ListDir("./template/spiderfile") {
templateName := strings.Replace(fInfo.Name(), "Spiderfile.", "", -1)
data = append(data, templateName)
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: data,
})
}
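A tiny standalone illustration of how the template list above derives display names from the files under ./template/spiderfile; the file names used here are hypothetical examples.

package main

import (
	"fmt"
	"strings"
)

func main() {
	// "Spiderfile.<template>" becomes just "<template>" in the API response.
	for _, name := range []string{"Spiderfile.scrapy", "Spiderfile.general"} {
		fmt.Println(strings.Replace(name, "Spiderfile.", "", -1))
	}
	// Output:
	// scrapy
	// general
}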

190
backend/routes/projects.go Normal file
View File

@@ -0,0 +1,190 @@
package routes
import (
"crawlab/constants"
"crawlab/database"
"crawlab/model"
"github.com/gin-gonic/gin"
"github.com/globalsign/mgo/bson"
"net/http"
)
func GetProjectList(c *gin.Context) {
tag := c.Query("tag")
// 筛选条件
query := bson.M{}
if tag != "" {
query["tags"] = tag
}
// 获取列表
projects, err := model.GetProjectList(query, 0, "+_id")
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 获取总数
total, err := model.GetProjectListTotal(query)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 获取每个项目的爬虫列表
for i, p := range projects {
spiders, err := p.GetSpiders()
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
projects[i].Spiders = spiders
}
// get the spiders that are not assigned to any project
if tag == "" {
noProject := model.Project{
Id: bson.ObjectIdHex(constants.ObjectIdNull),
Name: "No Project",
Description: "Not assigned to any project",
}
spiders, err := noProject.GetSpiders()
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
noProject.Spiders = spiders
projects = append(projects, noProject)
}
c.JSON(http.StatusOK, ListResponse{
Status: "ok",
Message: "success",
Data: projects,
Total: total,
})
}
func PutProject(c *gin.Context) {
// 绑定请求数据
var p model.Project
if err := c.ShouldBindJSON(&p); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
if err := p.Add(); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func PostProject(c *gin.Context) {
id := c.Param("id")
if !bson.IsObjectIdHex(id) {
HandleErrorF(http.StatusBadRequest, c, "invalid id")
return
}
var item model.Project
if err := c.ShouldBindJSON(&item); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
if err := model.UpdateProject(bson.ObjectIdHex(id), item); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func DeleteProject(c *gin.Context) {
id := c.Param("id")
if !bson.IsObjectIdHex(id) {
HandleErrorF(http.StatusBadRequest, c, "invalid id")
return
}
// remove the project from the database
if err := model.RemoveProject(bson.ObjectIdHex(id)); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 获取相关的爬虫
var spiders []model.Spider
s, col := database.GetCol("spiders")
defer s.Close()
if err := col.Find(bson.M{"project_id": bson.ObjectIdHex(id)}).All(&spiders); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 将爬虫的项目ID置空
for _, spider := range spiders {
spider.ProjectId = bson.ObjectIdHex(constants.ObjectIdNull)
if err := spider.Save(); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func GetProjectTags(c *gin.Context) {
type Result struct {
Tag string `json:"tag" bson:"tag"`
}
s, col := database.GetCol("projects")
defer s.Close()
pipeline := []bson.M{
{
"$unwind": "$tags",
},
{
"$group": bson.M{
"_id": "$tags",
},
},
{
"$sort": bson.M{
"_id": 1,
},
},
{
"$addFields": bson.M{
"tag": "$_id",
},
},
}
var items []Result
if err := col.Pipe(pipeline).All(&items); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: items,
})
}


@@ -14,11 +14,7 @@ func GetScheduleList(c *gin.Context) {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: results,
})
HandleSuccessData(c, results)
}
func GetSchedule(c *gin.Context) {
@@ -29,11 +25,8 @@ func GetSchedule(c *gin.Context) {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: result,
})
HandleSuccessData(c, result)
}
func PostSchedule(c *gin.Context) {
@@ -48,7 +41,7 @@ func PostSchedule(c *gin.Context) {
// 验证cron表达式
if err := services.ParserCron(newItem.Cron); err != nil {
HandleError(http.StatusOK, c, err)
HandleError(http.StatusInternalServerError, c, err)
return
}
@@ -65,10 +58,7 @@ func PostSchedule(c *gin.Context) {
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
HandleSuccess(c)
}
func PutSchedule(c *gin.Context) {
@@ -82,10 +72,13 @@ func PutSchedule(c *gin.Context) {
// 验证cron表达式
if err := services.ParserCron(item.Cron); err != nil {
HandleError(http.StatusOK, c, err)
HandleError(http.StatusInternalServerError, c, err)
return
}
// 加入用户ID
item.UserId = services.GetCurrentUser(c).Id
// 更新数据库
if err := model.AddSchedule(item); err != nil {
HandleError(http.StatusInternalServerError, c, err)
@@ -98,10 +91,7 @@ func PutSchedule(c *gin.Context) {
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
HandleSuccess(c)
}
func DeleteSchedule(c *gin.Context) {
@@ -119,8 +109,25 @@ func DeleteSchedule(c *gin.Context) {
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
HandleSuccess(c)
}
// disable a schedule
func DisableSchedule(c *gin.Context) {
id := c.Param("id")
if err := services.Sched.Disable(bson.ObjectIdHex(id)); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
HandleSuccess(c)
}
// enable a schedule
func EnableSchedule(c *gin.Context) {
id := c.Param("id")
if err := services.Sched.Enable(bson.ObjectIdHex(id)); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
HandleSuccess(c)
}

backend/routes/setting.go (new file, 33 lines)

@@ -0,0 +1,33 @@
package routes
import (
"github.com/gin-gonic/gin"
"github.com/spf13/viper"
"net/http"
)
type SettingBody struct {
AllowRegister string `json:"allow_register"`
}
func GetVersion(c *gin.Context) {
version := viper.GetString("version")
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: version,
})
}
func GetSetting(c *gin.Context) {
allowRegister := viper.GetString("setting.allowRegister")
body := SettingBody{AllowRegister: allowRegister}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: body,
})
}


@@ -7,6 +7,7 @@ import (
"crawlab/model"
"crawlab/services"
"crawlab/utils"
"fmt"
"github.com/apex/log"
"github.com/gin-gonic/gin"
"github.com/globalsign/mgo"
@@ -17,6 +18,7 @@ import (
"io/ioutil"
"net/http"
"os"
"path"
"path/filepath"
"runtime/debug"
"strconv"
@@ -25,22 +27,49 @@ import (
)
func GetSpiderList(c *gin.Context) {
pageNum, _ := c.GetQuery("pageNum")
pageSize, _ := c.GetQuery("pageSize")
pageNum, _ := c.GetQuery("page_num")
pageSize, _ := c.GetQuery("page_size")
keyword, _ := c.GetQuery("keyword")
pid, _ := c.GetQuery("project_id")
t, _ := c.GetQuery("type")
sortKey, _ := c.GetQuery("sort_key")
sortDirection, _ := c.GetQuery("sort_direction")
// 筛选
filter := bson.M{
"name": bson.M{"$regex": bson.RegEx{Pattern: keyword, Options: "im"}},
}
if t != "" {
if t != "" && t != "all" {
filter["type"] = t
}
if pid == "" {
// do nothing
} else if pid == constants.ObjectIdNull {
filter["$or"] = []bson.M{
{"project_id": bson.ObjectIdHex(pid)},
{"project_id": bson.M{"$exists": false}},
}
} else {
filter["project_id"] = bson.ObjectIdHex(pid)
}
// 排序
sortStr := "-_id"
if sortKey != "" && sortDirection != "" {
if sortDirection == constants.DESCENDING {
sortStr = "-" + sortKey
} else if sortDirection == constants.ASCENDING {
sortStr = "+" + sortKey
} else {
HandleErrorF(http.StatusBadRequest, c, "invalid sort_direction")
}
}
// 分页
page := &entity.Page{}
page.GetPage(pageNum, pageSize)
results, count, err := model.GetSpiderList(filter, page.Skip, page.Limit)
results, count, err := model.GetSpiderList(filter, page.Skip, page.Limit, sortStr)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
@@ -117,6 +146,64 @@ func PublishSpider(c *gin.Context) {
}
func PutSpider(c *gin.Context) {
var spider model.Spider
if err := c.ShouldBindJSON(&spider); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
// 爬虫名称不能为空
if spider.Name == "" {
HandleErrorF(http.StatusBadRequest, c, "spider name should not be empty")
return
}
// 判断爬虫是否存在
if spider := model.GetSpiderByName(spider.Name); spider.Name != "" {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("spider for '%s' already exists", spider.Name))
return
}
// set the spider type
spider.Type = constants.Customized
// reset FileId to null
spider.FileId = bson.ObjectIdHex(constants.ObjectIdNull)
// create the spider directory
spiderDir := filepath.Join(viper.GetString("spider.path"), spider.Name)
if utils.Exists(spiderDir) {
if err := os.RemoveAll(spiderDir); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
if err := os.MkdirAll(spiderDir, 0777); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
spider.Src = spiderDir
// 添加爬虫到数据库
if err := spider.Add(); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 同步到GridFS
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: spider,
})
}
func UploadSpider(c *gin.Context) {
// 从body中获取文件
uploadFile, err := c.FormFile("file")
if err != nil {
@@ -125,6 +212,144 @@ func PutSpider(c *gin.Context) {
return
}
// 获取参数
name := c.PostForm("name")
displayName := c.PostForm("display_name")
col := c.PostForm("col")
cmd := c.PostForm("cmd")
// 如果不为zip文件返回错误
if !strings.HasSuffix(uploadFile.Filename, ".zip") {
HandleError(http.StatusBadRequest, c, errors.New("not a valid zip file"))
return
}
// create the tmp directory in case it does not exist
tmpPath := viper.GetString("other.tmppath")
if !utils.Exists(tmpPath) {
if err := os.MkdirAll(tmpPath, os.ModePerm); err != nil {
log.Error("mkdir other.tmppath dir error:" + err.Error())
debug.PrintStack()
HandleError(http.StatusBadRequest, c, errors.New("mkdir other.tmppath dir error"))
return
}
}
// 保存到本地临时文件
randomId := uuid.NewV4()
tmpFilePath := filepath.Join(tmpPath, randomId.String()+".zip")
if err := c.SaveUploadedFile(uploadFile, tmpFilePath); err != nil {
log.Error("save upload file error: " + err.Error())
debug.PrintStack()
HandleError(http.StatusInternalServerError, c, err)
return
}
// 获取 GridFS 实例
s, gf := database.GetGridFs("files")
defer s.Close()
// 判断文件是否已经存在
var gfFile model.GridFs
if err := gf.Find(bson.M{"filename": uploadFile.Filename}).One(&gfFile); err == nil {
// 已经存在文件,则删除
_ = gf.RemoveId(gfFile.Id)
}
// 上传到GridFs
fid, err := services.UploadToGridFs(uploadFile.Filename, tmpFilePath)
if err != nil {
log.Errorf("upload to grid fs error: %s", err.Error())
debug.PrintStack()
return
}
idx := strings.LastIndex(uploadFile.Filename, "/")
targetFilename := uploadFile.Filename[idx+1:]
// 判断爬虫是否存在
spiderName := strings.Replace(targetFilename, ".zip", "", 1)
if name != "" {
spiderName = name
}
spider := model.GetSpiderByName(spiderName)
if spider.Name == "" {
// 保存爬虫信息
srcPath := viper.GetString("spider.path")
spider := model.Spider{
Name: spiderName,
DisplayName: spiderName,
Type: constants.Customized,
Src: filepath.Join(srcPath, spiderName),
FileId: fid,
}
if name != "" {
spider.Name = name
}
if displayName != "" {
spider.DisplayName = displayName
}
if col != "" {
spider.Col = col
}
if cmd != "" {
spider.Cmd = cmd
}
_ = spider.Add()
} else {
if name != "" {
spider.Name = name
}
if displayName != "" {
spider.DisplayName = displayName
}
if col != "" {
spider.Col = col
}
if cmd != "" {
spider.Cmd = cmd
}
// 更新file_id
spider.FileId = fid
_ = spider.Save()
}
// 发起同步
services.PublishAllSpiders()
// 获取爬虫
spider = model.GetSpiderByName(spiderName)
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: spider,
})
}
func UploadSpiderFromId(c *gin.Context) {
// TODO: duplicates part of UploadSpider's logic and needs refactoring
// spider id
spiderId := c.Param("id")
// 获取爬虫
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
if err != nil {
if err == mgo.ErrNotFound {
HandleErrorF(http.StatusNotFound, c, "cannot find spider")
} else {
HandleError(http.StatusInternalServerError, c, err)
}
return
}
// 从body中获取文件
uploadFile, err := c.FormFile("file")
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 如果不为zip文件返回错误
if !strings.HasSuffix(uploadFile.Filename, ".zip") {
debug.PrintStack()
@@ -153,6 +378,7 @@ func PutSpider(c *gin.Context) {
return
}
// 获取 GridFS 实例
s, gf := database.GetGridFs("files")
defer s.Close()
@@ -171,28 +397,12 @@ func PutSpider(c *gin.Context) {
return
}
idx := strings.LastIndex(uploadFile.Filename, "/")
targetFilename := uploadFile.Filename[idx+1:]
// 更新file_id
spider.FileId = fid
_ = spider.Save()
// 判断爬虫是否存在
spiderName := strings.Replace(targetFilename, ".zip", "", 1)
spider := model.GetSpiderByName(spiderName)
if spider == nil {
// 保存爬虫信息
srcPath := viper.GetString("spider.path")
spider := model.Spider{
Name: spiderName,
DisplayName: spiderName,
Type: constants.Customized,
Src: filepath.Join(srcPath, spiderName),
FileId: fid,
}
_ = spider.Add()
} else {
// 更新file_id
spider.FileId = fid
_ = spider.Save()
}
// 发起同步
services.PublishSpider(spider)
c.JSON(http.StatusOK, Response{
Status: "ok",
@@ -241,6 +451,8 @@ func GetSpiderTasks(c *gin.Context) {
})
}
// spider file management
func GetSpiderDir(c *gin.Context) {
// 爬虫ID
id := c.Param("id")
@@ -282,6 +494,12 @@ func GetSpiderDir(c *gin.Context) {
})
}
type SpiderFileReqBody struct {
Path string `json:"path"`
Content string `json:"content"`
NewPath string `json:"new_path"`
}
func GetSpiderFile(c *gin.Context) {
// 爬虫ID
id := c.Param("id")
@@ -310,9 +528,34 @@ func GetSpiderFile(c *gin.Context) {
})
}
type SpiderFileReqBody struct {
Path string `json:"path"`
Content string `json:"content"`
func GetSpiderFileTree(c *gin.Context) {
// 爬虫ID
id := c.Param("id")
// 获取爬虫
spider, err := model.GetSpider(bson.ObjectIdHex(id))
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 获取目录下文件列表
spiderPath := viper.GetString("spider.path")
spiderFilePath := filepath.Join(spiderPath, spider.Name)
// 获取文件目录树
fileNodeTree, err := services.GetFileNodeTree(spiderFilePath, 0)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 返回结果
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: fileNodeTree,
})
}
func PostSpiderFile(c *gin.Context) {
@@ -339,6 +582,12 @@ func PostSpiderFile(c *gin.Context) {
return
}
// 同步到GridFS
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 返回结果
c.JSON(http.StatusOK, Response{
Status: "ok",
@@ -346,17 +595,158 @@ func PostSpiderFile(c *gin.Context) {
})
}
// 爬虫类型
func GetSpiderTypes(c *gin.Context) {
types, err := model.GetSpiderTypes()
func PutSpiderFile(c *gin.Context) {
spiderId := c.Param("id")
var reqBody SpiderFileReqBody
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 文件路径
filePath := path.Join(spider.Src, reqBody.Path)
// 如果文件已存在,则报错
if utils.Exists(filePath) {
HandleErrorF(http.StatusInternalServerError, c, fmt.Sprintf(`%s already exists`, filePath))
return
}
// 写入文件
if err := ioutil.WriteFile(filePath, []byte(reqBody.Content), 0777); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 同步到GridFS
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func PutSpiderDir(c *gin.Context) {
spiderId := c.Param("id")
var reqBody SpiderFileReqBody
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 文件路径
filePath := path.Join(spider.Src, reqBody.Path)
// 如果文件已存在,则报错
if utils.Exists(filePath) {
HandleErrorF(http.StatusInternalServerError, c, fmt.Sprintf(`%s already exists`, filePath))
return
}
// 创建文件夹
if err := os.MkdirAll(filePath, 0777); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 同步到GridFS
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func DeleteSpiderFile(c *gin.Context) {
spiderId := c.Param("id")
var reqBody SpiderFileReqBody
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
filePath := path.Join(spider.Src, reqBody.Path)
if err := os.RemoveAll(filePath); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 同步到GridFS
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func RenameSpiderFile(c *gin.Context) {
spiderId := c.Param("id")
var reqBody SpiderFileReqBody
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 原文件路径
filePath := path.Join(spider.Src, reqBody.Path)
newFilePath := path.Join(path.Join(path.Dir(filePath), reqBody.NewPath))
// 如果新文件已存在,则报错
if utils.Exists(newFilePath) {
HandleErrorF(http.StatusInternalServerError, c, fmt.Sprintf(`%s already exists`, newFilePath))
return
}
// 重命名
if err := os.Rename(filePath, newFilePath); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// remove the original file
if err := os.RemoveAll(filePath); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 同步到GridFS
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
@@ -479,3 +869,25 @@ func GetSpiderStats(c *gin.Context) {
},
})
}
func GetSpiderSchedules(c *gin.Context) {
id := c.Param("id")
if !bson.IsObjectIdHex(id) {
HandleErrorF(http.StatusBadRequest, c, "spider_id is invalid")
return
}
// get the schedules associated with this spider
list, err := model.GetScheduleList(bson.M{"spider_id": bson.ObjectIdHex(id)})
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: list,
})
}

backend/routes/system.go (new file, 316 lines)

@@ -0,0 +1,316 @@
package routes
import (
"crawlab/constants"
"crawlab/entity"
"crawlab/services"
"fmt"
"github.com/gin-gonic/gin"
"net/http"
"strings"
)
func GetLangList(c *gin.Context) {
nodeId := c.Param("id")
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: services.GetLangList(nodeId),
})
}
func GetDepList(c *gin.Context) {
nodeId := c.Param("id")
lang := c.Query("lang")
depName := c.Query("dep_name")
var depList []entity.Dependency
if lang == constants.Python {
list, err := services.GetPythonDepList(nodeId, depName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
depList = list
} else if lang == constants.Nodejs {
list, err := services.GetNodejsDepList(nodeId, depName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
depList = list
} else {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", lang))
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: depList,
})
}
func GetInstalledDepList(c *gin.Context) {
nodeId := c.Param("id")
lang := c.Query("lang")
var depList []entity.Dependency
if lang == constants.Python {
if services.IsMasterNode(nodeId) {
list, err := services.GetPythonLocalInstalledDepList(nodeId)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
depList = list
} else {
list, err := services.GetPythonRemoteInstalledDepList(nodeId)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
depList = list
}
} else if lang == constants.Nodejs {
if services.IsMasterNode(nodeId) {
list, err := services.GetNodejsLocalInstalledDepList(nodeId)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
depList = list
} else {
list, err := services.GetNodejsRemoteInstalledDepList(nodeId)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
depList = list
}
} else {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", lang))
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: depList,
})
}
func GetAllDepList(c *gin.Context) {
lang := c.Param("lang")
depName := c.Query("dep_name")
// 获取所有依赖列表
var list []string
if lang == constants.Python {
_list, err := services.GetPythonDepListFromRedis()
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
list = _list
} else {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", lang))
return
}
// 过滤依赖列表
var depList []string
for _, name := range list {
if strings.HasPrefix(strings.ToLower(name), strings.ToLower(depName)) {
depList = append(depList, name)
}
}
// only keep the first 10 matches
var returnList []string
for i, name := range depList {
if i >= 10 {
break
}
returnList = append(returnList, name)
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: returnList,
})
}
func InstallDep(c *gin.Context) {
type ReqBody struct {
Lang string `json:"lang"`
DepName string `json:"dep_name"`
}
nodeId := c.Param("id")
var reqBody ReqBody
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
if reqBody.Lang == constants.Python {
if services.IsMasterNode(nodeId) {
_, err := services.InstallPythonLocalDep(reqBody.DepName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
} else {
_, err := services.InstallPythonRemoteDep(nodeId, reqBody.DepName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
} else if reqBody.Lang == constants.Nodejs {
if services.IsMasterNode(nodeId) {
_, err := services.InstallNodejsLocalDep(reqBody.DepName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
} else {
_, err := services.InstallNodejsRemoteDep(nodeId, reqBody.DepName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
} else {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", reqBody.Lang))
return
}
// TODO: check if install is successful
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func UninstallDep(c *gin.Context) {
type ReqBody struct {
Lang string `json:"lang"`
DepName string `json:"dep_name"`
}
nodeId := c.Param("id")
var reqBody ReqBody
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
if reqBody.Lang == constants.Python {
if services.IsMasterNode(nodeId) {
_, err := services.UninstallPythonLocalDep(reqBody.DepName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
} else {
_, err := services.UninstallPythonRemoteDep(nodeId, reqBody.DepName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
} else if reqBody.Lang == constants.Nodejs {
if services.IsMasterNode(nodeId) {
_, err := services.UninstallNodejsLocalDep(reqBody.DepName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
} else {
_, err := services.UninstallNodejsRemoteDep(nodeId, reqBody.DepName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
} else {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", reqBody.Lang))
return
}
// TODO: check if uninstall is successful
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func GetDepJson(c *gin.Context) {
depName := c.Param("dep_name")
lang := c.Param("lang")
var dep entity.Dependency
if lang == constants.Python {
_dep, err := services.FetchPythonDepInfo(depName)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
dep = _dep
} else {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", lang))
return
}
c.Header("Cache-Control", "max-age=86400")
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: dep,
})
}
func InstallLang(c *gin.Context) {
type ReqBody struct {
Lang string `json:"lang"`
}
nodeId := c.Param("id")
var reqBody ReqBody
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
if reqBody.Lang == constants.Nodejs {
if services.IsMasterNode(nodeId) {
_, err := services.InstallNodejsLocalLang()
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
} else {
_, err := services.InstallNodejsRemoteLang(nodeId)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
} else {
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", reqBody.Lang))
return
}
// TODO: check if install is successful
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}


@@ -9,7 +9,6 @@ import (
"encoding/csv"
"github.com/gin-gonic/gin"
"github.com/globalsign/mgo/bson"
uuid "github.com/satori/go.uuid"
"net/http"
)
@@ -18,6 +17,7 @@ type TaskListRequestData struct {
PageSize int `form:"page_size"`
NodeId string `form:"node_id"`
SpiderId string `form:"spider_id"`
Status string `form:"status"`
}
type TaskResultsRequestData struct {
@@ -29,14 +29,14 @@ func GetTaskList(c *gin.Context) {
// 绑定数据
data := TaskListRequestData{}
if err := c.ShouldBindQuery(&data); err != nil {
HandleError(http.StatusBadRequest, c, err)
HandleError(http.StatusInternalServerError, c, err)
return
}
if data.PageNum == 0 {
data.PageNum = 1
}
if data.PageSize == 0 {
data.PageNum = 10
data.PageSize = 10
}
// 过滤条件
@@ -47,6 +47,10 @@ func GetTaskList(c *gin.Context) {
if data.SpiderId != "" {
query["spider_id"] = bson.ObjectIdHex(data.SpiderId)
}
// filter by task status
if data.Status != "" {
query["status"] = data.Status
}
// get the task list
tasks, err := model.GetTaskList(query, (data.PageNum-1)*data.PageSize, data.PageSize, "-create_ts")
@@ -78,49 +82,117 @@ func GetTask(c *gin.Context) {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: result,
})
HandleSuccessData(c, result)
}
func PutTask(c *gin.Context) {
// 生成任务ID
id := uuid.NewV4()
type TaskRequestBody struct {
SpiderId bson.ObjectId `json:"spider_id"`
RunType string `json:"run_type"`
NodeIds []bson.ObjectId `json:"node_ids"`
Param string `json:"param"`
}
// 绑定数据
var t model.Task
if err := c.ShouldBindJSON(&t); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
t.Id = id.String()
t.Status = constants.StatusPending
// 如果没有传入node_id则置为null
if t.NodeId.Hex() == "" {
t.NodeId = bson.ObjectIdHex(constants.ObjectIdNull)
}
// 将任务存入数据库
if err := model.AddTask(t); err != nil {
var reqBody TaskRequestBody
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 加入任务队列
if err := services.AssignTask(t); err != nil {
HandleError(http.StatusInternalServerError, c, err)
if reqBody.RunType == constants.RunTypeAllNodes {
// 所有节点
nodes, err := model.GetNodeList(nil)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
for _, node := range nodes {
t := model.Task{
SpiderId: reqBody.SpiderId,
NodeId: node.Id,
Param: reqBody.Param,
UserId: services.GetCurrentUser(c).Id,
}
if err := services.AddTask(t); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
} else if reqBody.RunType == constants.RunTypeRandom {
// 随机
t := model.Task{
SpiderId: reqBody.SpiderId,
Param: reqBody.Param,
UserId: services.GetCurrentUser(c).Id,
}
if err := services.AddTask(t); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
} else if reqBody.RunType == constants.RunTypeSelectedNodes {
// 指定节点
for _, nodeId := range reqBody.NodeIds {
t := model.Task{
SpiderId: reqBody.SpiderId,
NodeId: nodeId,
Param: reqBody.Param,
UserId: services.GetCurrentUser(c).Id,
}
if err := services.AddTask(t); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
} else {
HandleErrorF(http.StatusInternalServerError, c, "invalid run_type")
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
HandleSuccess(c)
}
func DeleteTaskByStatus(c *gin.Context) {
status := c.Query("status")
// remove the corresponding log files
if err := services.RemoveLogByTaskStatus(status); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
// remove the tasks with this status
if err := model.RemoveTaskByStatus(status); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
HandleSuccess(c)
}
// delete multiple tasks
func DeleteMultipleTask(c *gin.Context) {
ids := make(map[string][]string)
if err := c.ShouldBindJSON(&ids); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
list := ids["ids"]
for _, id := range list {
if err := services.RemoveLogByTaskId(id); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
if err := model.RemoveTask(id); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
}
HandleSuccess(c)
}
// delete a single task
func DeleteTask(c *gin.Context) {
id := c.Param("id")
@@ -129,33 +201,22 @@ func DeleteTask(c *gin.Context) {
HandleError(http.StatusInternalServerError, c, err)
return
}
// 删除task
if err := model.RemoveTask(id); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
HandleSuccess(c)
}
func GetTaskLog(c *gin.Context) {
id := c.Param("id")
logStr, err := services.GetTaskLog(id)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: logStr,
})
HandleSuccessData(c, logStr)
}
func GetTaskResults(c *gin.Context) {
@@ -164,7 +225,7 @@ func GetTaskResults(c *gin.Context) {
// 绑定数据
data := TaskResultsRequestData{}
if err := c.ShouldBindQuery(&data); err != nil {
HandleError(http.StatusBadRequest, c, err)
HandleError(http.StatusInternalServerError, c, err)
return
}
@@ -266,9 +327,5 @@ func CancelTask(c *gin.Context) {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
HandleSuccess(c)
}


@@ -21,6 +21,8 @@ type UserListRequestData struct {
type UserRequestData struct {
Username string `json:"username"`
Password string `json:"password"`
Role string `json:"role"`
Email string `json:"email"`
}
func GetUser(c *gin.Context) {
@@ -88,13 +90,13 @@ func PutUser(c *gin.Context) {
return
}
// 添加用户
user := model.User{
Username: strings.ToLower(reqData.Username),
Password: utils.EncryptPassword(reqData.Password),
Role: constants.RoleNormal,
// default to the normal user role
if reqData.Role == "" {
reqData.Role = constants.RoleNormal
}
if err := user.Add(); err != nil {
// create the user
if err := services.CreateNewUser(reqData.Username, reqData.Password, reqData.Role, reqData.Email); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
@@ -199,3 +201,41 @@ func GetMe(c *gin.Context) {
User: user,
}, nil)
}
func PostMe(c *gin.Context) {
ctx := context.WithGinContext(c)
user := ctx.User()
if user == nil {
ctx.FailedWithError(constants.ErrorUserNotFound, http.StatusUnauthorized)
return
}
var reqBody model.User
if err := c.ShouldBindJSON(&reqBody); err != nil {
HandleErrorF(http.StatusBadRequest, c, "invalid request")
return
}
if reqBody.Email != "" {
user.Email = reqBody.Email
}
if reqBody.Password != "" {
user.Password = utils.EncryptPassword(reqBody.Password)
}
if reqBody.Setting.NotificationTrigger != "" {
user.Setting.NotificationTrigger = reqBody.Setting.NotificationTrigger
}
if reqBody.Setting.DingTalkRobotWebhook != "" {
user.Setting.DingTalkRobotWebhook = reqBody.Setting.DingTalkRobotWebhook
}
if reqBody.Setting.WechatRobotWebhook != "" {
user.Setting.WechatRobotWebhook = reqBody.Setting.WechatRobotWebhook
}
user.Setting.EnabledNotifications = reqBody.Setting.EnabledNotifications
if err := user.Save(); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
c.JSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}


@@ -1,17 +1,15 @@
package routes
import (
"github.com/apex/log"
"github.com/gin-gonic/gin"
"net/http"
"runtime/debug"
)
func HandleError(statusCode int, c *gin.Context, err error) {
log.Errorf("handle error:" + err.Error())
debug.PrintStack()
c.AbortWithStatusJSON(statusCode, Response{
Status: "ok",
Message: "error",
Status: "error",
Message: "failure",
Error: err.Error(),
})
}
@@ -24,3 +22,18 @@ func HandleErrorF(statusCode int, c *gin.Context, err string) {
Error: err,
})
}
func HandleSuccess(c *gin.Context) {
c.AbortWithStatusJSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
})
}
func HandleSuccessData(c *gin.Context, data interface{}) {
c.AbortWithStatusJSON(http.StatusOK, Response{
Status: "ok",
Message: "success",
Data: data,
})
}
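// Usage sketch (matching the handler refactor above): handlers now end with one of
// these helpers instead of building the Response literal inline, e.g.
//   HandleSuccessData(c, results) // list/detail endpoints
//   HandleSuccess(c)              // create/update/delete endpoints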


@@ -0,0 +1,62 @@
package routes
import (
"crawlab/model"
"github.com/gin-gonic/gin"
"github.com/globalsign/mgo/bson"
"net/http"
)
// create a variable
func PutVariable(c *gin.Context) {
var variable model.Variable
if err := c.ShouldBindJSON(&variable); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
if err := variable.Add(); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
HandleSuccess(c)
}
// update a variable
func PostVariable(c *gin.Context) {
var id = c.Param("id")
var variable model.Variable
if err := c.ShouldBindJSON(&variable); err != nil {
HandleError(http.StatusBadRequest, c, err)
return
}
variable.Id = bson.ObjectIdHex(id)
if err := variable.Save(); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
HandleSuccess(c)
}
// delete a variable
func DeleteVariable(c *gin.Context) {
var idStr = c.Param("id")
var id = bson.ObjectIdHex(idStr)
variable, err := model.GetVariable(id)
if err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
variable.Id = id
if err := variable.Delete(); err != nil {
HandleError(http.StatusInternalServerError, c, err)
return
}
HandleSuccess(c)
}
// list variables
func GetVariableList(c *gin.Context) {
list := model.GetVariableList()
HandleSuccessData(c, list)
}


@@ -0,0 +1,17 @@
#!/usr/bin/env bash
# install nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.35.2/install.sh | bash
export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
# install Node.js v8.12
nvm install 8.12
# create soft links
ln -s $HOME/.nvm/versions/node/v8.12.0/bin/npm /usr/local/bin/npm
ln -s $HOME/.nvm/versions/node/v8.12.0/bin/node /usr/local/bin/node
# environments manipulation
export NODE_PATH=$HOME/.nvm/versions/node/v8.12.0/lib/node_modules
export PATH=$NODE_PATH:$PATH


@@ -0,0 +1,273 @@
package services
import (
"crawlab/constants"
"crawlab/database"
"crawlab/entity"
"crawlab/model"
"crawlab/model/config_spider"
"crawlab/services/spider_handler"
"crawlab/utils"
"errors"
"fmt"
"github.com/apex/log"
"github.com/globalsign/mgo/bson"
uuid "github.com/satori/go.uuid"
"github.com/spf13/viper"
"gopkg.in/yaml.v2"
"os"
"path/filepath"
"strings"
)
func GenerateConfigSpiderFiles(spider model.Spider, configData entity.ConfigSpiderData) error {
// validate the Spiderfile
if err := ValidateSpiderfile(configData); err != nil {
return err
}
// construct the code generator
generator := config_spider.ScrapyGenerator{
Spider: spider,
ConfigData: configData,
}
// generate the code
if err := generator.Generate(); err != nil {
return err
}
return nil
}
// validate a Spiderfile
func ValidateSpiderfile(configData entity.ConfigSpiderData) error {
// get all fields
fields := config_spider.GetAllFields(configData)
// start_url must be set
if configData.StartUrl == "" {
return errors.New("spiderfile invalid: start_url is empty")
}
// start_stage must be set
if configData.StartStage == "" {
return errors.New("spiderfile invalid: start_stage is empty")
}
// stages must not be empty
if len(configData.Stages) == 0 {
return errors.New("spiderfile invalid: stages is empty")
}
// validate stages
dict := map[string]int{}
for _, stage := range configData.Stages {
stageName := stage.Name
// stage name must not be empty
if stageName == "" {
return errors.New("spiderfile invalid: stage name is empty")
}
// stage name must not be a reserved word
// NOTE: other engines could be added later; Scrapy is the default
if configData.Engine == "" || configData.Engine == constants.EngineScrapy {
if strings.Contains(constants.ScrapyProtectedStageNames, stageName) {
return errors.New(fmt.Sprintf("spiderfile invalid: stage name '%s' is protected", stageName))
}
} else {
return errors.New(fmt.Sprintf("spiderfile invalid: engine '%s' is not implemented", configData.Engine))
}
// stage names must be unique
if dict[stageName] == 1 {
return errors.New(fmt.Sprintf("spiderfile invalid: stage name '%s' is duplicated", stageName))
}
dict[stageName] = 1
// a stage must have at least one field
if len(stage.Fields) == 0 {
return errors.New(fmt.Sprintf("spiderfile invalid: stage '%s' has no fields", stageName))
}
// whether the stage already has a next_stage
hasNextStage := false
// iterate over the fields
for _, field := range stage.Fields {
// a stage may have at most one next_stage
if field.NextStage != "" {
if hasNextStage {
return errors.New(fmt.Sprintf("spiderfile invalid: stage '%s' has more than 1 next_stage", stageName))
}
hasNextStage = true
}
// a field may set either css or xpath, not both
if field.Css != "" && field.Xpath != "" {
return errors.New(fmt.Sprintf("spiderfile invalid: field '%s' in stage '%s' has both css and xpath set which is prohibited", field.Name, stageName))
}
}
// a stage may set either page_css or page_xpath, not both
if stage.PageCss != "" && stage.PageXpath != "" {
return errors.New(fmt.Sprintf("spiderfile invalid: stage '%s' has both page_css and page_xpath set which is prohibited", stageName))
}
// a stage may set either list_css or list_xpath, not both
if stage.ListCss != "" && stage.ListXpath != "" {
return errors.New(fmt.Sprintf("spiderfile invalid: stage '%s' has both list_css and list_xpath set which is prohibited", stageName))
}
// if is_list is true, the stage must set either list_css or list_xpath
if stage.IsList && (stage.ListCss == "" && stage.ListXpath == "") {
return errors.New("spiderfile invalid: stage with is_list = true should have either list_css or list_xpath being set")
}
}
// validate field uniqueness
if !IsUniqueConfigSpiderFields(fields) {
return errors.New("spiderfile invalid: fields not unique")
}
// field names must not be reserved words
for _, field := range fields {
if strings.Contains(constants.ScrapyProtectedFieldNames, field.Name) {
return errors.New(fmt.Sprintf("spiderfile invalid: field name '%s' is protected", field.Name))
}
}
return nil
}
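// A minimal sketch of config data that would pass the validation above. The field
// names are taken from the checks in ValidateSpiderfile; the exact stage/field struct
// type names in the entity package are assumed here for illustration only.
//
//   configData := entity.ConfigSpiderData{
//       StartUrl:   "https://example.com/list",
//       StartStage: "list",
//       Stages: []entity.Stage{
//           {
//               Name:    "list",
//               IsList:  true,
//               ListCss: ".item",
//               Fields: []entity.Field{
//                   {Name: "title", Css: ".title"},
//                   {Name: "link", Css: "a", NextStage: "detail"},
//               },
//           },
//           {
//               Name:   "detail",
//               Fields: []entity.Field{{Name: "content", Css: ".content"}},
//           },
//       },
//   }
//   if err := ValidateSpiderfile(configData); err != nil {
//       log.Errorf("invalid spiderfile: %s", err.Error())
//   }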
func IsUniqueConfigSpiderFields(fields []entity.Field) bool {
dict := map[string]int{}
for _, field := range fields {
if dict[field.Name] == 1 {
return false
}
dict[field.Name] = 1
}
return true
}
func ProcessSpiderFilesFromConfigData(spider model.Spider, configData entity.ConfigSpiderData) error {
spiderDir := spider.Src
// remove existing spider files
for _, fInfo := range utils.ListDir(spiderDir) {
// keep the Spiderfile
if fInfo.Name() == "Spiderfile" {
continue
}
// remove all other files
if err := os.RemoveAll(filepath.Join(spiderDir, fInfo.Name())); err != nil {
return err
}
}
// copy the spider template files
tplDir := "./template/scrapy"
for _, fInfo := range utils.ListDir(tplDir) {
// skip the Spiderfile
if fInfo.Name() == "Spiderfile" {
continue
}
srcPath := filepath.Join(tplDir, fInfo.Name())
if fInfo.IsDir() {
dirPath := filepath.Join(spiderDir, fInfo.Name())
if err := utils.CopyDir(srcPath, dirPath); err != nil {
return err
}
} else {
if err := utils.CopyFile(srcPath, filepath.Join(spiderDir, fInfo.Name())); err != nil {
return err
}
}
}
// 更改爬虫文件
if err := GenerateConfigSpiderFiles(spider, configData); err != nil {
return err
}
// package the spider directory into a zip file
files, err := utils.GetFilesFromDir(spiderDir)
if err != nil {
return err
}
randomId := uuid.NewV4()
tmpFilePath := filepath.Join(viper.GetString("other.tmppath"), spider.Name+"."+randomId.String()+".zip")
spiderZipFileName := spider.Name + ".zip"
if err := utils.Compress(files, tmpFilePath); err != nil {
return err
}
// 获取 GridFS 实例
s, gf := database.GetGridFs("files")
defer s.Close()
// 判断文件是否已经存在
var gfFile model.GridFs
if err := gf.Find(bson.M{"filename": spiderZipFileName}).One(&gfFile); err == nil {
// 已经存在文件,则删除
_ = gf.RemoveId(gfFile.Id)
}
// 上传到GridFs
fid, err := UploadToGridFs(spiderZipFileName, tmpFilePath)
if err != nil {
log.Errorf("upload to grid fs error: %s", err.Error())
return err
}
// 保存爬虫 FileId
spider.FileId = fid
_ = spider.Save()
// 获取爬虫同步实例
spiderSync := spider_handler.SpiderSync{
Spider: spider,
}
// 获取gfFile
gfFile2 := model.GetGridFs(spider.FileId)
// 生成MD5
spiderSync.CreateMd5File(gfFile2.Md5)
return nil
}
func GenerateSpiderfileFromConfigData(spider model.Spider, configData entity.ConfigSpiderData) error {
// Spiderfile 路径
sfPath := filepath.Join(spider.Src, "Spiderfile")
// 生成Yaml内容
sfContentByte, err := yaml.Marshal(configData)
if err != nil {
return err
}
// 打开文件
var f *os.File
if utils.Exists(sfPath) {
f, err = os.OpenFile(sfPath, os.O_WRONLY|os.O_TRUNC, 0777)
} else {
f, err = os.OpenFile(sfPath, os.O_CREATE, 0777)
}
if err != nil {
return err
}
defer f.Close()
// 写入内容
if _, err := f.Write(sfContentByte); err != nil {
return err
}
return nil
}

backend/services/file.go (new file, 65 lines)

@@ -0,0 +1,65 @@
package services
import (
"crawlab/model"
"github.com/apex/log"
"os"
"path"
"runtime/debug"
"strings"
)
func GetFileNodeTree(dstPath string, level int) (f model.File, err error) {
return getFileNodeTree(dstPath, level, dstPath)
}
func getFileNodeTree(dstPath string, level int, rootPath string) (f model.File, err error) {
dstF, err := os.Open(dstPath)
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return f, err
}
defer dstF.Close()
fileInfo, err := dstF.Stat()
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return f, nil
}
if !fileInfo.IsDir() { // dstPath is a regular file
return model.File{
Label: fileInfo.Name(),
Name: fileInfo.Name(),
Path: strings.Replace(dstPath, rootPath, "", -1),
IsDir: false,
Size: fileInfo.Size(),
Children: nil,
}, nil
} else { // dstPath is a directory
dir, err := dstF.Readdir(0) // read the fileInfo of every entry under the directory
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return f, nil
}
f = model.File{
Label: path.Base(dstPath),
Name: path.Base(dstPath),
Path: strings.Replace(dstPath, rootPath, "", -1),
IsDir: true,
Size: 0,
Children: nil,
}
for _, subFileInfo := range dir {
subFileNode, err := getFileNodeTree(path.Join(dstPath, subFileInfo.Name()), level+1, rootPath)
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return f, err
}
f.Children = append(f.Children, subFileNode)
}
return f, nil
}
}
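// Usage sketch (hypothetical path): build the file tree for a spider's source directory.
//
//   tree, err := GetFileNodeTree("/app/spiders/my_spider", 0)
//   if err != nil {
//       log.Errorf("build file tree error: %s", err.Error())
//   }
//   // tree.Children now mirrors the directory structure, with Path values relative
//   // to the root path that was passed in.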


@@ -49,10 +49,8 @@ func GetRemoteLog(task model.Task) (logStr string, err error) {
select {
case logStr = <-ch:
log.Infof("get remote log")
break
case <-time.After(30 * time.Second):
logStr = "get remote log timeout"
break
}
return logStr, nil
@@ -119,6 +117,18 @@ func RemoveLogByTaskId(id string) error {
return nil
}
func RemoveLogByTaskStatus(status string) error {
tasks, err := model.GetTaskList(bson.M{"status": status}, 0, constants.Infinite, "-create_ts")
if err != nil {
log.Error("get tasks error:" + err.Error())
return err
}
for _, task := range tasks {
RemoveLogByTaskId(task.Id)
}
return nil
}
func removeLog(t model.Task) {
if err := RemoveLocalLog(t.LogPath); err != nil {
log.Errorf("remove local log error: %s", err.Error())


@@ -12,6 +12,7 @@ import (
"encoding/json"
"fmt"
"github.com/apex/log"
"github.com/globalsign/mgo"
"github.com/globalsign/mgo/bson"
"github.com/gomodule/redigo/redis"
"runtime/debug"
@@ -50,36 +51,44 @@ func GetNodeData() (Data, error) {
return data, err
}
func GetRedisNode(key string) (*Data, error) {
// 获取节点数据
value, err := database.RedisClient.HGet("nodes", key)
if err != nil {
log.Errorf(err.Error())
return nil, err
}
// 解析节点列表数据
var data Data
if err := json.Unmarshal([]byte(value), &data); err != nil {
log.Errorf(err.Error())
return nil, err
}
return &data, nil
}
// update the status of all nodes
func UpdateNodeStatus() {
// 从Redis获取节点keys
list, err := database.RedisClient.HKeys("nodes")
if err != nil {
log.Errorf(err.Error())
log.Errorf("get redis node keys error: %s", err.Error())
return
}
// 遍历节点keys
for _, key := range list {
// 获取节点数据
value, err := database.RedisClient.HGet("nodes", key)
data, err := GetRedisNode(key)
if err != nil {
log.Errorf(err.Error())
return
continue
}
// 解析节点列表数据
var data Data
if err := json.Unmarshal([]byte(value), &data); err != nil {
log.Errorf(err.Error())
return
}
// if a node has not updated for more than 60 seconds, it is considered offline
if time.Now().Unix()-data.UpdateTsUnix > 60 {
// 在Redis中删除该节点
if err := database.RedisClient.HDel("nodes", data.Key); err != nil {
log.Errorf(err.Error())
log.Errorf("delete redis node key error:%s, key:%s", err.Error(), data.Key)
}
continue
}
@@ -94,22 +103,21 @@ func UpdateNodeStatus() {
model.ResetNodeStatusToOffline(list)
}
func handleNodeInfo(key string, data Data) {
// 处理节点信息
func handleNodeInfo(key string, data *Data) {
// 添加同步锁
v, err := database.RedisClient.Lock(key)
if err != nil {
return
}
defer database.RedisClient.UnLock(key, v)
// 更新节点信息到数据库
s, c := database.GetCol("nodes")
defer s.Close()
// the same key may be registered more than once due to concurrency
var nodes []model.Node
_ = c.Find(bson.M{"key": key}).All(&nodes)
if len(nodes) > 1 {
for _, node := range nodes {
_ = c.RemoveId(node.Id)
}
}
var node model.Node
if err := c.Find(bson.M{"key": key}).One(&node); err != nil {
if err := c.Find(bson.M{"key": key}).One(&node); err != nil && err == mgo.ErrNotFound {
// 数据库不存在该节点
node = model.Node{
Key: key,
@@ -126,7 +134,7 @@ func handleNodeInfo(key string, data Data) {
log.Errorf(err.Error())
return
}
} else {
} else if node.Key != "" {
// 数据库存在该节点
node.Status = constants.StatusOnline
node.UpdateTs = time.Now()
@@ -160,6 +168,7 @@ func UpdateNodeData() {
debug.PrintStack()
return
}
// 构造节点数据
data := Data{
Key: key,
@@ -177,10 +186,12 @@ func UpdateNodeData() {
debug.PrintStack()
return
}
if err := database.RedisClient.HSet("nodes", key, utils.BytesToString(dataBytes)); err != nil {
log.Errorf(err.Error())
return
}
}
func MasterNodeCallback(message redis.Message) (err error) {
@@ -258,7 +269,7 @@ func InitNodeService() error {
return err
}
// if this is the master node, refresh all node info every 30 seconds
// if this is the master node, refresh all node info every 10 seconds
if model.IsMaster() {
spec := "*/10 * * * * *"
if _, err := c.AddFunc(spec, UpdateNodeStatus); err != nil {


@@ -0,0 +1,138 @@
package notification
import (
"errors"
"github.com/apex/log"
"github.com/matcornic/hermes"
"gopkg.in/gomail.v2"
"net/mail"
"os"
"runtime/debug"
"strconv"
)
func SendMail(toEmail string, toName string, subject string, content string) error {
// hermes instance
h := hermes.Hermes{
Theme: new(hermes.Default),
Product: hermes.Product{
Name: "Crawlab Team",
Copyright: "© 2019 Crawlab, Made by Crawlab-Team",
},
}
// config
port, _ := strconv.Atoi(os.Getenv("CRAWLAB_NOTIFICATION_MAIL_PORT"))
password := os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SMTP_PASSWORD")
SMTPUser := os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SMTP_USER")
smtpConfig := smtpAuthentication{
Server: os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SERVER"),
Port: port,
SenderEmail: os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SENDEREMAIL"),
SenderIdentity: os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SENDERIDENTITY"),
SMTPPassword: password,
SMTPUser: SMTPUser,
}
options := sendOptions{
To: toEmail,
Subject: subject,
}
// email instance
email := hermes.Email{
Body: hermes.Body{
Name: toName,
FreeMarkdown: hermes.Markdown(content + GetFooter()),
},
}
// generate html
html, err := h.GenerateHTML(email)
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return err
}
// generate text
text, err := h.GeneratePlainText(email)
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return err
}
// send the email
if err := send(smtpConfig, options, html, text); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return err
}
return nil
}
type smtpAuthentication struct {
Server string
Port int
SenderEmail string
SenderIdentity string
SMTPUser string
SMTPPassword string
}
// sendOptions are options for sending an email
type sendOptions struct {
To string
Subject string
}
// send sends the email
func send(smtpConfig smtpAuthentication, options sendOptions, htmlBody string, txtBody string) error {
if smtpConfig.Server == "" {
return errors.New("SMTP server config is empty")
}
if smtpConfig.Port == 0 {
return errors.New("SMTP port config is empty")
}
if smtpConfig.SMTPUser == "" {
return errors.New("SMTP user is empty")
}
if smtpConfig.SenderIdentity == "" {
return errors.New("SMTP sender identity is empty")
}
if smtpConfig.SenderEmail == "" {
return errors.New("SMTP sender email is empty")
}
if options.To == "" {
return errors.New("no receiver emails configured")
}
from := mail.Address{
Name: smtpConfig.SenderIdentity,
Address: smtpConfig.SenderEmail,
}
m := gomail.NewMessage()
m.SetHeader("From", from.String())
m.SetHeader("To", options.To)
m.SetHeader("Subject", options.Subject)
m.SetBody("text/plain", txtBody)
m.AddAlternative("text/html", htmlBody)
d := gomail.NewPlainDialer(smtpConfig.Server, smtpConfig.Port, smtpConfig.SMTPUser, smtpConfig.SMTPPassword)
return d.DialAndSend(m)
}
func GetFooter() string {
return `
[Github](https://github.com/crawlab-team/crawlab) | [Documentation](http://docs.crawlab.cn) | [Docker](https://hub.docker.com/r/tikazyq/crawlab)
`
}
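// Usage sketch (hypothetical values): send a task notification, assuming the
// CRAWLAB_NOTIFICATION_MAIL_* environment variables are configured.
//
//   err := SendMail("user@example.com", "User", "Task finished",
//       "Your spider **my_spider** has finished running.")
//   if err != nil {
//       log.Errorf("send mail error: %s", err.Error())
//   }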


@@ -0,0 +1,59 @@
package notification
import (
"errors"
"github.com/apex/log"
"github.com/imroc/req"
"runtime/debug"
)
func SendMobileNotification(webhook string, title string, content string) error {
type ResBody struct {
ErrCode int `json:"errcode"`
ErrMsg string `json:"errmsg"`
}
// 请求头
header := req.Header{
"Content-Type": "application/json; charset=utf-8",
}
// 请求数据
data := req.Param{
"msgtype": "markdown",
"markdown": req.Param{
"title": title,
"text": content,
"content": content,
},
"at": req.Param{
"atMobiles": []string{},
"isAtAll": false,
},
}
// 发起请求
res, err := req.Post(webhook, header, req.BodyJSON(&data))
if err != nil {
log.Errorf("dingtalk notification error: " + err.Error())
debug.PrintStack()
return err
}
// 解析响应
var resBody ResBody
if err := res.ToJSON(&resBody); err != nil {
log.Errorf("dingtalk notification error: " + err.Error())
debug.PrintStack()
return err
}
// 判断响应是否报错
if resBody.ErrCode != 0 {
log.Errorf("dingtalk notification error: " + resBody.ErrMsg)
debug.PrintStack()
return errors.New(resBody.ErrMsg)
}
return nil
}
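// Usage sketch (hypothetical webhook): push a markdown message to a DingTalk or
// WeChat Work robot webhook configured in the user's settings.
//
//   webhook := "https://oapi.dingtalk.com/robot/send?access_token=..."
//   if err := SendMobileNotification(webhook, "Task finished",
//       "**my_spider** has finished running."); err != nil {
//       log.Errorf("mobile notification error: %s", err.Error())
//   }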


@@ -6,6 +6,7 @@ import (
"net"
"reflect"
"runtime/debug"
"sync"
)
type Register interface {
@@ -97,25 +98,31 @@ func getMac() (string, error) {
var register Register
// 获得注册器
func GetRegister() Register {
if register != nil {
return register
}
var once sync.Once
registerType := viper.GetString("server.register.type")
if registerType == "mac" {
register = &MacRegister{}
} else {
ip := viper.GetString("server.register.ip")
if ip == "" {
log.Error("server.register.ip is empty")
debug.PrintStack()
return nil
func GetRegister() Register {
once.Do(func() {
if register != nil {
register = register
}
register = &IpRegister{
Ip: ip,
registerType := viper.GetString("server.register.type")
if registerType == "mac" {
register = &MacRegister{}
} else {
ip := viper.GetString("server.register.ip")
if ip == "" {
log.Error("server.register.ip is empty")
debug.PrintStack()
register = nil
}
register = &IpRegister{
Ip: ip,
}
}
}
log.Info("register type is :" + reflect.TypeOf(register).String())
log.Info("register type is :" + reflect.TypeOf(register).String())
})
return register
}

backend/services/rpc.go (new file, 234 lines)

@@ -0,0 +1,234 @@
package services
import (
"crawlab/constants"
"crawlab/database"
"crawlab/entity"
"crawlab/model"
"crawlab/utils"
"encoding/json"
"fmt"
"github.com/apex/log"
"github.com/gomodule/redigo/redis"
uuid "github.com/satori/go.uuid"
"runtime/debug"
)
type RpcMessage struct {
Id string `json:"id"`
Method string `json:"method"`
Params map[string]string `json:"params"`
Result string `json:"result"`
}
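// The RPC layer below works over Redis lists: the client LPushes an RpcMessage onto
// the target node's "rpc:<nodeId>" queue and blocks on BRPop for the reply, while
// InitRpcService on the worker BRPops requests, dispatches them by Method and LPushes
// the reply back onto the same queue. Usage sketch (hypothetical node id):
//
//   deps, err := RpcClientGetInstalledDepList(nodeId, constants.Python)
//   if err != nil {
//       log.Errorf("rpc error: %s", err.Error())
//   }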
func RpcServerInstallLang(msg RpcMessage) RpcMessage {
lang := GetRpcParam("lang", msg.Params)
if lang == constants.Nodejs {
output, _ := InstallNodejsLocalLang()
msg.Result = output
}
return msg
}
func RpcClientInstallLang(nodeId string, lang string) (output string, err error) {
params := map[string]string{}
params["lang"] = lang
data, err := RpcClientFunc(nodeId, constants.RpcInstallLang, params, 600)()
if err != nil {
return
}
output = data
return
}
func RpcServerInstallDep(msg RpcMessage) RpcMessage {
lang := GetRpcParam("lang", msg.Params)
depName := GetRpcParam("dep_name", msg.Params)
if lang == constants.Python {
output, _ := InstallPythonLocalDep(depName)
msg.Result = output
}
return msg
}
func RpcClientInstallDep(nodeId string, lang string, depName string) (output string, err error) {
params := map[string]string{}
params["lang"] = lang
params["dep_name"] = depName
data, err := RpcClientFunc(nodeId, constants.RpcInstallDep, params, 10)()
if err != nil {
return
}
output = data
return
}
func RpcServerUninstallDep(msg RpcMessage) RpcMessage {
lang := GetRpcParam("lang", msg.Params)
depName := GetRpcParam("dep_name", msg.Params)
if lang == constants.Python {
output, _ := UninstallPythonLocalDep(depName)
msg.Result = output
}
return msg
}
func RpcClientUninstallDep(nodeId string, lang string, depName string) (output string, err error) {
params := map[string]string{}
params["lang"] = lang
params["dep_name"] = depName
data, err := RpcClientFunc(nodeId, constants.RpcUninstallDep, params, 60)()
if err != nil {
return
}
output = data
return
}
func RpcServerGetInstalledDepList(nodeId string, msg RpcMessage) RpcMessage {
lang := GetRpcParam("lang", msg.Params)
if lang == constants.Python {
depList, _ := GetPythonLocalInstalledDepList(nodeId)
resultStr, _ := json.Marshal(depList)
msg.Result = string(resultStr)
} else if lang == constants.Nodejs {
depList, _ := GetNodejsLocalInstalledDepList(nodeId)
resultStr, _ := json.Marshal(depList)
msg.Result = string(resultStr)
}
return msg
}
func RpcClientGetInstalledDepList(nodeId string, lang string) (list []entity.Dependency, err error) {
params := map[string]string{}
params["lang"] = lang
data, err := RpcClientFunc(nodeId, constants.RpcGetInstalledDepList, params, 10)()
if err != nil {
return
}
// 反序列化结果
if err := json.Unmarshal([]byte(data), &list); err != nil {
return list, err
}
return
}
func RpcClientFunc(nodeId string, method string, params map[string]string, timeout int) func() (string, error) {
return func() (result string, err error) {
// request id
id := uuid.NewV4().String()
// construct the RPC message
msg := RpcMessage{
Id: id,
Method: method,
Params: params,
Result: "",
}
// 发送RPC消息
msgStr := ObjectToString(msg)
if err := database.RedisClient.LPush(fmt.Sprintf("rpc:%s", nodeId), msgStr); err != nil {
return result, err
}
// wait for the RPC reply message
dataStr, err := database.RedisClient.BRPop(fmt.Sprintf("rpc:%s", nodeId), timeout)
if err != nil {
return result, err
}
// 反序列化消息
if err := json.Unmarshal([]byte(dataStr), &msg); err != nil {
return result, err
}
return msg.Result, err
}
}
func GetRpcParam(key string, params map[string]string) string {
return params[key]
}
func ObjectToString(params interface{}) string {
bytes, _ := json.Marshal(params)
return utils.BytesToString(bytes)
}
var IsRpcStopped = false
func StopRpcService() {
IsRpcStopped = true
}
func InitRpcService() error {
go func() {
for {
// 获取当前节点
node, err := model.GetCurrentNode()
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
continue
}
// pop a message from this node's RPC queue
dataStr, err := database.RedisClient.BRPop(fmt.Sprintf("rpc:%s", node.Id.Hex()), 0)
if err != nil {
if err != redis.ErrNil {
log.Errorf(err.Error())
debug.PrintStack()
}
continue
}
// 反序列化消息
var msg RpcMessage
if err := json.Unmarshal([]byte(dataStr), &msg); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
continue
}
// dispatch to the local handler based on Method
var replyMsg RpcMessage
if msg.Method == constants.RpcInstallDep {
replyMsg = RpcServerInstallDep(msg)
} else if msg.Method == constants.RpcUninstallDep {
replyMsg = RpcServerUninstallDep(msg)
} else if msg.Method == constants.RpcInstallLang {
replyMsg = RpcServerInstallLang(msg)
} else if msg.Method == constants.RpcGetInstalledDepList {
replyMsg = RpcServerGetInstalledDepList(node.Id.Hex(), msg)
} else {
continue
}
// 发送返回消息
if err := database.RedisClient.LPush(fmt.Sprintf("rpc:%s", node.Id.Hex()), ObjectToString(replyMsg)); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
continue
}
// 如果停止RPC服务则返回
if IsRpcStopped {
return
}
}
}()
return nil
}


@@ -4,8 +4,10 @@ import (
"crawlab/constants"
"crawlab/lib/cron"
"crawlab/model"
"errors"
"github.com/apex/log"
"github.com/satori/go.uuid"
"github.com/globalsign/mgo/bson"
uuid "github.com/satori/go.uuid"
"runtime/debug"
)
@@ -15,48 +17,59 @@ type Scheduler struct {
cron *cron.Cron
}
func AddTask(s model.Schedule) func() {
func AddScheduleTask(s model.Schedule) func() {
return func() {
node, err := model.GetNodeByKey(s.NodeKey)
if err != nil || node.Id.Hex() == "" {
log.Errorf("get node by key error: %s", err.Error())
debug.PrintStack()
return
}
spider := model.GetSpiderByName(s.SpiderName)
if spider == nil || spider.Id.Hex() == "" {
log.Errorf("get spider by name error: %s", err.Error())
debug.PrintStack()
return
}
// 同步ID到定时任务
s.SyncNodeIdAndSpiderId(node, *spider)
// 生成任务ID
id := uuid.NewV4()
// 生成任务模型
t := model.Task{
Id: id.String(),
SpiderId: spider.Id,
NodeId: node.Id,
Status: constants.StatusPending,
Param: s.Param,
}
if s.RunType == constants.RunTypeAllNodes {
// 所有节点
nodes, err := model.GetNodeList(nil)
if err != nil {
return
}
for _, node := range nodes {
t := model.Task{
Id: id.String(),
SpiderId: s.SpiderId,
NodeId: node.Id,
Param: s.Param,
UserId: s.UserId,
}
// 将任务存入数据库
if err := model.AddTask(t); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return
}
if err := AddTask(t); err != nil {
return
}
}
} else if s.RunType == constants.RunTypeRandom {
// 随机
t := model.Task{
Id: id.String(),
SpiderId: s.SpiderId,
Param: s.Param,
UserId: s.UserId,
}
if err := AddTask(t); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return
}
} else if s.RunType == constants.RunTypeSelectedNodes {
// 指定节点
for _, nodeId := range s.NodeIds {
t := model.Task{
Id: id.String(),
SpiderId: s.SpiderId,
NodeId: nodeId,
Param: s.Param,
UserId: s.UserId,
}
// 加入任务队列
if err := AssignTask(t); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
if err := AddTask(t); err != nil {
return
}
}
} else {
return
}
}
@@ -96,8 +109,8 @@ func (s *Scheduler) Start() error {
func (s *Scheduler) AddJob(job model.Schedule) error {
spec := job.Cron
// 添加任务
eid, err := s.cron.AddFunc(spec, AddTask(job))
// 添加定时任务
eid, err := s.cron.AddFunc(spec, AddScheduleTask(job))
if err != nil {
log.Errorf("add func task error: %s", err.Error())
debug.PrintStack()
@@ -106,6 +119,12 @@ func (s *Scheduler) AddJob(job model.Schedule) error {
// 更新EntryID
job.EntryId = eid
// 更新状态
job.Status = constants.ScheduleStatusRunning
job.Enabled = true
// 保存定时任务
if err := job.Save(); err != nil {
log.Errorf("job save error: %s", err.Error())
debug.PrintStack()
@@ -134,6 +153,41 @@ func ParserCron(spec string) error {
return nil
}
// 禁用定时任务
func (s *Scheduler) Disable(id bson.ObjectId) error {
schedule, err := model.GetSchedule(id)
if err != nil {
return err
}
if schedule.EntryId == 0 {
return errors.New("entry id not found")
}
// 从cron服务中删除该任务
s.cron.Remove(schedule.EntryId)
// 更新状态
schedule.Status = constants.ScheduleStatusStop
schedule.Enabled = false
if err = schedule.Save(); err != nil {
return err
}
return nil
}
// 启用定时任务
func (s *Scheduler) Enable(id bson.ObjectId) error {
schedule, err := model.GetSchedule(id)
if err != nil {
return err
}
if err := s.AddJob(schedule); err != nil {
return err
}
return nil
}
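For orientation, the Enable/Disable pair above boils down to storing and removing cron entry ids. A minimal sketch, assuming the robfig/cron v3 API that crawlab/lib/cron appears to mirror:

```go
package main

import (
	"log"

	"github.com/robfig/cron/v3"
)

func main() {
	// WithSeconds enables six-field specs such as "0 */5 * * * *",
	// the format used elsewhere in this changeset.
	c := cron.New(cron.WithSeconds())

	entryId, err := c.AddFunc("0 */5 * * * *", func() {
		log.Println("schedule fired")
	})
	if err != nil {
		log.Fatal(err)
	}
	c.Start()

	// "Disable" is just removing the stored entry id, as in Scheduler.Disable.
	c.Remove(entryId)
	// "Enable" re-adds the job and stores the new entry id, as in Scheduler.Enable.
	if _, err := c.AddFunc("0 */5 * * * *", func() { log.Println("re-enabled") }); err != nil {
		log.Fatal(err)
	}
}
```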
func (s *Scheduler) Update() error {
// 删除所有定时任务
s.RemoveAll()
@@ -146,11 +200,26 @@ func (s *Scheduler) Update() error {
return err
}
user, err := model.GetUserByUsername("admin")
if err != nil {
log.Errorf("get admin user error: %s", err.Error())
return err
}
// 遍历任务列表
for i := 0; i < len(sList); i++ {
// 单个任务
job := sList[i]
if job.Status == constants.ScheduleStatusStop {
continue
}
// 兼容以前版本
if job.UserId.Hex() == "" {
job.UserId = user.Id
}
// 添加到定时任务
if err := s.AddJob(job); err != nil {
log.Errorf("add job error: %s, job: %s, cron: %s", err.Error(), job.Name, job.Cron)

View File

@@ -12,11 +12,14 @@ import (
"github.com/apex/log"
"github.com/globalsign/mgo"
"github.com/globalsign/mgo/bson"
"github.com/satori/go.uuid"
"github.com/spf13/viper"
"gopkg.in/yaml.v2"
"io/ioutil"
"os"
"path"
"path/filepath"
"runtime/debug"
"strings"
)
type SpiderFileData struct {
@@ -30,6 +33,59 @@ type SpiderUploadMessage struct {
SpiderId string
}
// 从主节点上传爬虫到GridFS
func UploadSpiderToGridFsFromMaster(spider model.Spider) error {
// 爬虫所在目录
spiderDir := spider.Src
// 打包为 zip 文件
files, err := utils.GetFilesFromDir(spiderDir)
if err != nil {
return err
}
randomId := uuid.NewV4()
tmpFilePath := filepath.Join(viper.GetString("other.tmppath"), spider.Name+"."+randomId.String()+".zip")
spiderZipFileName := spider.Name + ".zip"
if err := utils.Compress(files, tmpFilePath); err != nil {
return err
}
// 获取 GridFS 实例
s, gf := database.GetGridFs("files")
defer s.Close()
// 判断文件是否已经存在
var gfFile model.GridFs
if err := gf.Find(bson.M{"filename": spiderZipFileName}).One(&gfFile); err == nil {
// 已经存在文件,则删除
_ = gf.RemoveId(gfFile.Id)
}
// 上传到GridFs
fid, err := UploadToGridFs(spiderZipFileName, tmpFilePath)
if err != nil {
log.Errorf("upload to grid fs error: %s", err.Error())
return err
}
// 保存爬虫 FileId
spider.FileId = fid
_ = spider.Save()
// 获取爬虫同步实例
spiderSync := spider_handler.SpiderSync{
Spider: spider,
}
// 获取gfFile
gfFile2 := model.GetGridFs(spider.FileId)
// 生成MD5
spiderSync.CreateMd5File(gfFile2.Md5)
return nil
}
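UploadToGridFs is only partially visible in this hunk, so here is a rough, self-contained sketch of the equivalent GridFS write with the globalsign/mgo driver imported above. The bucket name "files" matches database.GetGridFs("files"); the connection string, database name and paths are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"os"

	"github.com/globalsign/mgo"
	"github.com/globalsign/mgo/bson"
)

// uploadZipToGridFs streams a local zip file into the "files" GridFS bucket
// and returns the generated file id, roughly what UploadToGridFs does.
func uploadZipToGridFs(session *mgo.Session, dbName, fileName, filePath string) (bson.ObjectId, error) {
	gfs := session.DB(dbName).GridFS("files")

	src, err := os.Open(filePath)
	if err != nil {
		return "", err
	}
	defer src.Close()

	dst, err := gfs.Create(fileName)
	if err != nil {
		return "", err
	}
	if _, err := io.Copy(dst, src); err != nil {
		return "", err
	}
	// Close commits the upload (mgo records the file document, including its MD5).
	if err := dst.Close(); err != nil {
		return "", err
	}
	return dst.Id().(bson.ObjectId), nil
}

func main() {
	session, err := mgo.Dial("localhost:27017")
	if err != nil {
		panic(err)
	}
	defer session.Close()
	fid, err := uploadZipToGridFs(session, "crawlab_test", "my_spider.zip", "/tmp/my_spider.zip")
	if err != nil {
		panic(err)
	}
	fmt.Println(fid.Hex())
}
```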
// 上传zip文件到GridFS
func UploadToGridFs(fileName string, filePath string) (fid bson.ObjectId, err error) {
fid = ""
@@ -59,6 +115,7 @@ func UploadToGridFs(fileName string, filePath string) (fid bson.ObjectId, err er
}
// 关闭文件,提交写入
if err = f.Close(); err != nil {
debug.PrintStack()
return "", err
}
// 文件ID
@@ -100,7 +157,7 @@ func ReadFileByStep(filePath string, handle func([]byte, *mgo.GridFile), fileCre
// 发布所有爬虫
func PublishAllSpiders() {
// 获取爬虫列表
spiders, _, _ := model.GetSpiderList(nil, 0, constants.Infinite)
spiders, _, _ := model.GetSpiderList(nil, 0, constants.Infinite, "-_id")
if len(spiders) == 0 {
return
}
@@ -116,12 +173,23 @@ func PublishAllSpiders() {
// 发布爬虫
func PublishSpider(spider model.Spider) {
// 查询gf file不存在则删除
gfFile := model.GetGridFs(spider.FileId)
if gfFile == nil {
_ = model.RemoveSpider(spider.Id)
var gfFile *model.GridFs
if spider.FileId.Hex() != constants.ObjectIdNull {
// 查询gf file不存在则标记为爬虫文件不存在
gfFile = model.GetGridFs(spider.FileId)
if gfFile == nil {
spider.FileId = constants.ObjectIdNull
_ = spider.Save()
return
}
}
// 如果FileId为空表示还没有上传爬虫到GridFS则跳过
if spider.FileId == bson.ObjectIdHex(constants.ObjectIdNull) {
return
}
// 获取爬虫同步实例
spiderSync := spider_handler.SpiderSync{
Spider: spider,
}
@@ -138,21 +206,14 @@ func PublishSpider(spider model.Spider) {
md5 := filepath.Join(path, spider_handler.Md5File)
if !utils.Exists(md5) {
log.Infof("md5 file not found: %s", md5)
spiderSync.RemoveSpiderFile()
spiderSync.Download()
spiderSync.CreateMd5File(gfFile.Md5)
spiderSync.RemoveDownCreate(gfFile.Md5)
return
}
// md5值不一样则下载
md5Str := utils.ReadFileOneLine(md5)
// 去掉空格以及换行符
md5Str = strings.Replace(md5Str, " ", "", -1)
md5Str = strings.Replace(md5Str, "\n", "", -1)
md5Str := utils.GetSpiderMd5Str(md5)
if gfFile.Md5 != md5Str {
log.Infof("md5 is different, gf-md5:%s, file-md5:%s", gfFile.Md5, md5Str)
spiderSync.RemoveSpiderFile()
spiderSync.Download()
spiderSync.CreateMd5File(gfFile.Md5)
spiderSync.RemoveDownCreate(gfFile.Md5)
return
}
}
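The md5 bookkeeping above is the whole synchronization contract: each worker keeps a small md5 file next to the unpacked spider and re-downloads whenever it differs from the MD5 stored on the GridFS file. A minimal sketch of that check, with the file location assumed (the real path is built from spider.path and spider_handler.Md5File):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readLocalMd5 mirrors utils.GetSpiderMd5Str: read the md5 file and strip
// spaces and newlines so it can be compared with GridFS's stored MD5.
func readLocalMd5(md5FilePath string) string {
	data, err := os.ReadFile(md5FilePath)
	if err != nil {
		return "" // a missing file forces a re-download
	}
	s := strings.ReplaceAll(string(data), " ", "")
	return strings.ReplaceAll(s, "\n", "")
}

// needsResync reports whether the worker should RemoveDownCreate (remove the
// local copy, download the zip from GridFS, and rewrite the md5 file).
func needsResync(gridFsMd5, md5FilePath string) bool {
	local := readLocalMd5(md5FilePath)
	return local == "" || local != gridFsMd5
}

func main() {
	// Illustrative path; the real location depends on the configured spider.path.
	fmt.Println(needsResync("d41d8cd98f00b204e9800998ecf8427e", "/opt/crawlab/spiders/demo/md5.txt"))
}
```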
@@ -206,5 +267,110 @@ func InitSpiderService() error {
// 启动定时任务
c.Start()
if model.IsMaster() {
// 添加Demo爬虫
templateSpidersDir := "./template/spiders"
for _, info := range utils.ListDir(templateSpidersDir) {
if !info.IsDir() {
continue
}
spiderName := info.Name()
// 如果爬虫在数据库中不存在,则添加
spider := model.GetSpiderByName(spiderName)
if spider.Name != "" {
// 存在同名爬虫,跳过
continue
}
// 拷贝爬虫
templateSpiderPath := path.Join(templateSpidersDir, spiderName)
spiderPath := path.Join(viper.GetString("spider.path"), spiderName)
if utils.Exists(spiderPath) {
utils.RemoveFiles(spiderPath)
}
if err := utils.CopyDir(templateSpiderPath, spiderPath); err != nil {
log.Errorf("copy error: " + err.Error())
debug.PrintStack()
continue
}
// 构造配置数据
configData := entity.ConfigSpiderData{}
// 读取YAML文件
yamlFile, err := ioutil.ReadFile(path.Join(spiderPath, "Spiderfile"))
if err != nil {
log.Errorf("read yaml error: " + err.Error())
//debug.PrintStack()
continue
}
// 反序列化
if err := yaml.Unmarshal(yamlFile, &configData); err != nil {
log.Errorf("unmarshal error: " + err.Error())
debug.PrintStack()
continue
}
if configData.Type == constants.Customized {
// 添加该爬虫到数据库
spider = model.Spider{
Id: bson.NewObjectId(),
Name: spiderName,
DisplayName: configData.DisplayName,
Type: constants.Customized,
Col: configData.Col,
Src: spiderPath,
Remark: configData.Remark,
ProjectId: bson.ObjectIdHex(constants.ObjectIdNull),
FileId: bson.ObjectIdHex(constants.ObjectIdNull),
Cmd: configData.Cmd,
}
if err := spider.Add(); err != nil {
log.Errorf("add spider error: " + err.Error())
debug.PrintStack()
continue
}
// 上传爬虫到GridFS
if err := UploadSpiderToGridFsFromMaster(spider); err != nil {
log.Errorf("upload spider error: " + err.Error())
debug.PrintStack()
continue
}
} else if configData.Type == constants.Configurable || configData.Type == "config" {
// 添加该爬虫到数据库
spider = model.Spider{
Id: bson.NewObjectId(),
Name: configData.Name,
DisplayName: configData.DisplayName,
Type: constants.Configurable,
Col: configData.Col,
Src: spiderPath,
Remark: configData.Remark,
ProjectId: bson.ObjectIdHex(constants.ObjectIdNull),
FileId: bson.ObjectIdHex(constants.ObjectIdNull),
Config: configData,
}
if err := spider.Add(); err != nil {
log.Errorf("add spider error: " + err.Error())
debug.PrintStack()
continue
}
// 根据序列化后的数据处理爬虫文件
if err := ProcessSpiderFilesFromConfigData(spider, configData); err != nil {
log.Errorf("add spider error: " + err.Error())
debug.PrintStack()
continue
}
}
}
// 发布所有爬虫
PublishAllSpiders()
}
return nil
}
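The demo-spider bootstrap above hinges on each template directory carrying a Spiderfile that yaml.Unmarshal maps onto entity.ConfigSpiderData. A trimmed-down sketch of that read, keeping only the fields visible in this changeset (the real struct is assumed to carry more, e.g. stages and settings):

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"path"

	"gopkg.in/yaml.v2"
)

// configSpiderData is a reduced stand-in for entity.ConfigSpiderData.
type configSpiderData struct {
	Name        string `yaml:"name"`
	DisplayName string `yaml:"display_name"`
	Type        string `yaml:"type"`
	Col         string `yaml:"col"`
	Remark      string `yaml:"remark"`
	Cmd         string `yaml:"cmd"`
}

func main() {
	spiderPath := "./template/spiders/demo" // illustrative template directory
	raw, err := ioutil.ReadFile(path.Join(spiderPath, "Spiderfile"))
	if err != nil {
		log.Fatal(err)
	}
	var data configSpiderData
	if err := yaml.Unmarshal(raw, &data); err != nil {
		log.Fatal(err)
	}
	// "customized" spiders are registered with their Cmd and uploaded to GridFS;
	// "configurable" spiders go through ProcessSpiderFilesFromConfigData instead.
	fmt.Printf("%s [%s] -> %s\n", data.Name, data.Type, data.Cmd)
}
```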

View File

@@ -4,12 +4,14 @@ import (
"crawlab/database"
"crawlab/model"
"crawlab/utils"
"fmt"
"github.com/apex/log"
"github.com/globalsign/mgo/bson"
"github.com/satori/go.uuid"
"github.com/spf13/viper"
"io"
"os"
"os/exec"
"path/filepath"
"runtime/debug"
)
@@ -24,7 +26,7 @@ type SpiderSync struct {
func (s *SpiderSync) CreateMd5File(md5 string) {
path := filepath.Join(viper.GetString("spider.path"), s.Spider.Name)
utils.CreateFilePath(path)
utils.CreateDirPath(path)
fileName := filepath.Join(path, Md5File)
file := utils.OpenFile(fileName)
@@ -37,6 +39,12 @@ func (s *SpiderSync) CreateMd5File(md5 string) {
}
}
func (s *SpiderSync) RemoveDownCreate(md5 string) {
s.RemoveSpiderFile()
s.Download()
s.CreateMd5File(md5)
}
// 获得下载锁的key
func (s *SpiderSync) GetLockDownloadKey(spiderId string) string {
node, _ := model.GetCurrentNode()
@@ -59,10 +67,14 @@ func (s *SpiderSync) RemoveSpiderFile() {
// 检测是否已经下载中
func (s *SpiderSync) CheckDownLoading(spiderId string, fileId string) (bool, string) {
key := s.GetLockDownloadKey(spiderId)
if _, err := database.RedisClient.HGet("spider", key); err == nil {
return true, key
key2, err := database.RedisClient.HGet("spider", key)
if err != nil {
return false, key2
}
return false, key
if key2 == "" {
return false, key2
}
return true, key2
}
// 下载爬虫
@@ -71,6 +83,7 @@ func (s *SpiderSync) Download() {
fileId := s.Spider.FileId.Hex()
isDownloading, key := s.CheckDownLoading(spiderId, fileId)
if isDownloading {
log.Infof(fmt.Sprintf("spider is already being downloaded, spider id: %s", s.Spider.Id.Hex()))
return
} else {
_ = database.RedisClient.HSet("spider", key, key)
@@ -99,7 +112,6 @@ func (s *SpiderSync) Download() {
// 创建临时文件
tmpFilePath := filepath.Join(tmpPath, randomId.String()+".zip")
tmpFile := utils.OpenFile(tmpFilePath)
defer utils.Close(tmpFile)
// 将该文件写入临时文件
if _, err := io.Copy(tmpFile, f); err != nil {
@@ -119,6 +131,15 @@ func (s *SpiderSync) Download() {
return
}
// Recursively fix permissions on the target directory.
// Works around the log file not being creatable when LOG_ENABLED and LOG_FILE are set in scrapy settings.
cmd := exec.Command("chmod", "-R", "777", dstPath)
if err := cmd.Run(); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return
}
// 关闭临时文件
if err := tmpFile.Close(); err != nil {
log.Errorf(err.Error())

View File

@@ -4,28 +4,42 @@ import (
"crawlab/constants"
"crawlab/database"
"crawlab/entity"
"crawlab/lib/cron"
"crawlab/model"
"crawlab/utils"
"encoding/json"
"errors"
"fmt"
"github.com/apex/log"
"github.com/imroc/req"
"os/exec"
"path"
"regexp"
"runtime/debug"
"sort"
"strings"
"sync"
)
// 系统信息 chan 映射
var SystemInfoChanMap = utils.NewChanMap()
func GetRemoteSystemInfo(id string) (sysInfo entity.SystemInfo, err error) {
// 从远端获取系统信息
func GetRemoteSystemInfo(nodeId string) (sysInfo entity.SystemInfo, err error) {
// 发送消息
msg := entity.NodeMessage{
Type: constants.MsgTypeGetSystemInfo,
NodeId: id,
NodeId: nodeId,
}
// 序列化
msgBytes, _ := json.Marshal(&msg)
if _, err := database.RedisClient.Publish("nodes:"+id, utils.BytesToString(msgBytes)); err != nil {
if _, err := database.RedisClient.Publish("nodes:"+nodeId, utils.BytesToString(msgBytes)); err != nil {
return entity.SystemInfo{}, err
}
// 通道
ch := SystemInfoChanMap.ChanBlocked(id)
ch := SystemInfoChanMap.ChanBlocked(nodeId)
// 等待响应,阻塞
sysInfoStr := <-ch
@@ -38,11 +52,534 @@ func GetRemoteSystemInfo(id string) (sysInfo entity.SystemInfo, err error) {
return sysInfo, nil
}
func GetSystemInfo(id string) (sysInfo entity.SystemInfo, err error) {
if IsMasterNode(id) {
// 获取系统信息
func GetSystemInfo(nodeId string) (sysInfo entity.SystemInfo, err error) {
if IsMasterNode(nodeId) {
sysInfo, err = model.GetLocalSystemInfo()
} else {
sysInfo, err = GetRemoteSystemInfo(id)
sysInfo, err = GetRemoteSystemInfo(nodeId)
}
return
}
// 获取语言列表
func GetLangList(nodeId string) []entity.Lang {
list := []entity.Lang{
{Name: "Python", ExecutableName: "python", ExecutablePath: "/usr/local/bin/python", DepExecutablePath: "/usr/local/bin/pip"},
{Name: "Node.js", ExecutableName: "node", ExecutablePath: "/usr/local/bin/node", DepExecutablePath: "/usr/local/bin/npm"},
//{Name: "Java", ExecutableName: "java", ExecutablePath: "/usr/local/bin/java"},
}
for i, lang := range list {
list[i].Installed = IsInstalledLang(nodeId, lang)
}
return list
}
// 根据语言名获取语言实例
func GetLangFromLangName(nodeId string, name string) entity.Lang {
langList := GetLangList(nodeId)
for _, lang := range langList {
if lang.ExecutableName == name {
return lang
}
}
return entity.Lang{}
}
// 是否已安装该依赖
func IsInstalledLang(nodeId string, lang entity.Lang) bool {
sysInfo, err := GetSystemInfo(nodeId)
if err != nil {
return false
}
for _, exec := range sysInfo.Executables {
if exec.Path == lang.ExecutablePath {
return true
}
}
return false
}
// 是否已安装该依赖
func IsInstalledDep(installedDepList []entity.Dependency, dep entity.Dependency) bool {
for _, _dep := range installedDepList {
if strings.ToLower(_dep.Name) == strings.ToLower(dep.Name) {
return true
}
}
return false
}
// 初始化函数
func InitDepsFetcher() error {
c := cron.New(cron.WithSeconds())
c.Start()
if _, err := c.AddFunc("0 */5 * * * *", UpdatePythonDepList); err != nil {
return err
}
go func() {
UpdatePythonDepList()
}()
return nil
}
// =========
// Python
// =========
type PythonDepJsonData struct {
Info PythonDepJsonDataInfo `json:"info"`
}
type PythonDepJsonDataInfo struct {
Name string `json:"name"`
Summary string `json:"summary"`
Version string `json:"version"`
}
type PythonDepNameDict struct {
Name string `json:"name"`
Weight int `json:"weight"`
}
type PythonDepNameDictSlice []PythonDepNameDict
func (s PythonDepNameDictSlice) Len() int { return len(s) }
func (s PythonDepNameDictSlice) Swap(i, j int) { s[i], s[j] = s[j], s[i] }
func (s PythonDepNameDictSlice) Less(i, j int) bool { return s[i].Weight > s[j].Weight }
// 获取Python本地依赖列表
func GetPythonDepList(nodeId string, searchDepName string) ([]entity.Dependency, error) {
var list []entity.Dependency
// 先从 Redis 获取
depList, err := GetPythonDepListFromRedis()
if err != nil {
return list, err
}
// 过滤相似的依赖
var depNameList PythonDepNameDictSlice
for _, depName := range depList {
if strings.HasPrefix(strings.ToLower(depName), strings.ToLower(searchDepName)) {
var weight int
if strings.ToLower(depName) == strings.ToLower(searchDepName) {
weight = 3
} else if strings.HasPrefix(strings.ToLower(depName), strings.ToLower(searchDepName)) {
weight = 2
} else {
weight = 1
}
depNameList = append(depNameList, PythonDepNameDict{
Name: depName,
Weight: weight,
})
}
}
// 获取已安装依赖列表
var installedDepList []entity.Dependency
if IsMasterNode(nodeId) {
installedDepList, err = GetPythonLocalInstalledDepList(nodeId)
if err != nil {
return list, err
}
} else {
installedDepList, err = GetPythonRemoteInstalledDepList(nodeId)
if err != nil {
return list, err
}
}
// 根据依赖名排序
sort.Stable(depNameList)
// 遍历依赖名列表取前20个
for i, depNameDict := range depNameList {
if i > 20 {
break
}
dep := entity.Dependency{
Name: depNameDict.Name,
}
dep.Installed = IsInstalledDep(installedDepList, dep)
list = append(list, dep)
}
// 从依赖源获取信息
//list, err = GetPythonDepListWithInfo(list)
return list, nil
}
// 获取Python依赖的源数据信息
func GetPythonDepListWithInfo(depList []entity.Dependency) ([]entity.Dependency, error) {
var goSync sync.WaitGroup
for i, dep := range depList {
if i > 10 {
break
}
goSync.Add(1)
go func(i int, dep entity.Dependency, depList []entity.Dependency, n *sync.WaitGroup) {
url := fmt.Sprintf("https://pypi.org/pypi/%s/json", dep.Name)
res, err := req.Get(url)
if err != nil {
n.Done()
return
}
var data PythonDepJsonData
if err := res.ToJSON(&data); err != nil {
n.Done()
return
}
depList[i].Version = data.Info.Version
depList[i].Description = data.Info.Summary
n.Done()
}(i, dep, depList, &goSync)
}
goSync.Wait()
return depList, nil
}
func FetchPythonDepInfo(depName string) (entity.Dependency, error) {
url := fmt.Sprintf("https://pypi.org/pypi/%s/json", depName)
res, err := req.Get(url)
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return entity.Dependency{}, err
}
var data PythonDepJsonData
if res.Response().StatusCode == 404 {
return entity.Dependency{}, errors.New("get depName from [https://pypi.org] error: 404")
}
if err := res.ToJSON(&data); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return entity.Dependency{}, err
}
dep := entity.Dependency{
Name: depName,
Version: data.Info.Version,
Description: data.Info.Summary,
}
return dep, nil
}
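FetchPythonDepInfo relies on PyPI's per-package JSON endpoint, of which it reads info.name, info.summary and info.version. A stdlib-only sketch of the same call (the production code above uses imroc/req instead of net/http):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type pypiInfo struct {
	Name    string `json:"name"`
	Summary string `json:"summary"`
	Version string `json:"version"`
}

type pypiResponse struct {
	Info pypiInfo `json:"info"`
}

func fetchPythonDepInfo(depName string) (pypiInfo, error) {
	resp, err := http.Get(fmt.Sprintf("https://pypi.org/pypi/%s/json", depName))
	if err != nil {
		return pypiInfo{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotFound {
		return pypiInfo{}, fmt.Errorf("package %s not found on pypi.org", depName)
	}
	var data pypiResponse
	if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
		return pypiInfo{}, err
	}
	return data.Info, nil
}

func main() {
	info, err := fetchPythonDepInfo("scrapy")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %s: %s\n", info.Name, info.Version, info.Summary)
}
```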
// 从Redis获取Python依赖列表
func GetPythonDepListFromRedis() ([]string, error) {
var list []string
// 从 Redis 获取字符串
rawData, err := database.RedisClient.HGet("system", "deps:python")
if err != nil {
return list, err
}
// 反序列化
if err := json.Unmarshal([]byte(rawData), &list); err != nil {
return list, err
}
// 如果为空,则从依赖源获取列表
if len(list) == 0 {
UpdatePythonDepList()
}
return list, nil
}
// 从Python依赖源获取依赖列表并返回
func FetchPythonDepList() ([]string, error) {
// 依赖URL
url := "https://pypi.tuna.tsinghua.edu.cn/simple"
// 输出列表
var list []string
// 请求URL
res, err := req.Get(url)
if err != nil {
log.Error(err.Error())
debug.PrintStack()
return list, err
}
// 获取响应数据
text, err := res.ToString()
if err != nil {
log.Error(err.Error())
debug.PrintStack()
return list, err
}
// 从响应数据中提取依赖名
regex := regexp.MustCompile("<a href=\".*/\">(.*)</a>")
for _, line := range strings.Split(text, "\n") {
arr := regex.FindStringSubmatch(line)
if len(arr) < 2 {
continue
}
list = append(list, arr[1])
}
// 赋值给列表
return list, nil
}
// 更新Python依赖列表到Redis
func UpdatePythonDepList() {
// 从依赖源获取列表
list, _ := FetchPythonDepList()
// 序列化
listBytes, err := json.Marshal(list)
if err != nil {
log.Error(err.Error())
debug.PrintStack()
return
}
// 设置Redis
if err := database.RedisClient.HSet("system", "deps:python", string(listBytes)); err != nil {
log.Error(err.Error())
debug.PrintStack()
return
}
}
// 获取Python本地已安装的依赖列表
func GetPythonLocalInstalledDepList(nodeId string) ([]entity.Dependency, error) {
var list []entity.Dependency
lang := GetLangFromLangName(nodeId, constants.Python)
if !IsInstalledLang(nodeId, lang) {
return list, errors.New("python is not installed")
}
cmd := exec.Command("pip", "freeze")
outputBytes, err := cmd.Output()
if err != nil {
debug.PrintStack()
return list, err
}
for _, line := range strings.Split(string(outputBytes), "\n") {
arr := strings.Split(line, "==")
if len(arr) < 2 {
continue
}
dep := entity.Dependency{
Name: strings.ToLower(arr[0]),
Version: arr[1],
Installed: true,
}
list = append(list, dep)
}
return list, nil
}
// 获取Python远端依赖列表
func GetPythonRemoteInstalledDepList(nodeId string) ([]entity.Dependency, error) {
depList, err := RpcClientGetInstalledDepList(nodeId, constants.Python)
if err != nil {
return depList, err
}
return depList, nil
}
// 安装Python本地依赖
func InstallPythonLocalDep(depName string) (string, error) {
// 依赖镜像URL
url := "https://pypi.tuna.tsinghua.edu.cn/simple"
cmd := exec.Command("pip", "install", depName, "-i", url)
outputBytes, err := cmd.Output()
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return fmt.Sprintf("error: %s", err.Error()), err
}
return string(outputBytes), nil
}
// 获取Python远端依赖列表
func InstallPythonRemoteDep(nodeId string, depName string) (string, error) {
output, err := RpcClientInstallDep(nodeId, constants.Python, depName)
if err != nil {
return output, err
}
return output, nil
}
// 安装Python本地依赖
func UninstallPythonLocalDep(depName string) (string, error) {
cmd := exec.Command("pip", "uninstall", "-y", depName)
outputBytes, err := cmd.Output()
if err != nil {
log.Errorf(string(outputBytes))
log.Errorf(err.Error())
debug.PrintStack()
return fmt.Sprintf("error: %s", err.Error()), err
}
return string(outputBytes), nil
}
// 获取Python远端依赖列表
func UninstallPythonRemoteDep(nodeId string, depName string) (string, error) {
output, err := RpcClientUninstallDep(nodeId, constants.Python, depName)
if err != nil {
return output, err
}
return output, nil
}
// ==============
// Node.js
// ==============
func InstallNodejsLocalLang() (string, error) {
cmd := exec.Command("/bin/sh", path.Join("scripts", "install-nodejs.sh"))
output, err := cmd.Output()
if err != nil {
log.Error(err.Error())
debug.PrintStack()
return string(output), err
}
// TODO: check if Node.js is installed successfully
return string(output), nil
}
// 获取Node.js远端依赖列表
func InstallNodejsRemoteLang(nodeId string) (string, error) {
output, err := RpcClientInstallLang(nodeId, constants.Nodejs)
if err != nil {
return output, err
}
return output, nil
}
// 获取Nodejs本地已安装的依赖列表
func GetNodejsLocalInstalledDepList(nodeId string) ([]entity.Dependency, error) {
var list []entity.Dependency
lang := GetLangFromLangName(nodeId, constants.Nodejs)
if !IsInstalledLang(nodeId, lang) {
return list, errors.New("nodejs is not installed")
}
cmd := exec.Command("npm", "ls", "-g", "--depth", "0")
outputBytes, _ := cmd.Output()
//if err != nil {
// log.Error("error: " + string(outputBytes))
// debug.PrintStack()
// return list, err
//}
regex := regexp.MustCompile("\\s(.*)@(.*)")
for _, line := range strings.Split(string(outputBytes), "\n") {
arr := regex.FindStringSubmatch(line)
if len(arr) < 3 {
continue
}
dep := entity.Dependency{
Name: strings.ToLower(arr[1]),
Version: arr[2],
Installed: true,
}
list = append(list, dep)
}
return list, nil
}
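The regular expression above depends on the shape of npm ls -g --depth 0 output, which prints one tree line per top-level package. A tiny illustration of the capture groups on made-up sample lines:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Typical "npm ls -g --depth 0" lines look like "├── <name>@<version>";
	// the versions below are invented for the example.
	sample := []string{
		"/usr/local/lib",
		"├── crawlab-sdk@0.1.0",
		"└── npm@6.13.4",
	}
	re := regexp.MustCompile(`\s(.*)@(.*)`)
	for _, line := range sample {
		m := re.FindStringSubmatch(line)
		if len(m) < 3 {
			continue // header lines without name@version are skipped
		}
		fmt.Printf("name=%s version=%s\n", m[1], m[2])
	}
}
```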
// 获取Nodejs远端依赖列表
func GetNodejsRemoteInstalledDepList(nodeId string) ([]entity.Dependency, error) {
depList, err := RpcClientGetInstalledDepList(nodeId, constants.Nodejs)
if err != nil {
return depList, err
}
return depList, nil
}
// 安装Nodejs本地依赖
func InstallNodejsLocalDep(depName string) (string, error) {
// 依赖镜像URL
url := "https://registry.npm.taobao.org"
cmd := exec.Command("npm", "install", depName, "-g", "--registry", url)
outputBytes, err := cmd.Output()
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return fmt.Sprintf("error: %s", err.Error()), err
}
return string(outputBytes), nil
}
// 获取Nodejs远端依赖列表
func InstallNodejsRemoteDep(nodeId string, depName string) (string, error) {
output, err := RpcClientInstallDep(nodeId, constants.Nodejs, depName)
if err != nil {
return output, err
}
return output, nil
}
// 安装Nodejs本地依赖
func UninstallNodejsLocalDep(depName string) (string, error) {
cmd := exec.Command("npm", "uninstall", depName, "-g")
outputBytes, err := cmd.Output()
if err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return fmt.Sprintf("error: %s", err.Error()), err
}
return string(outputBytes), nil
}
// 获取Nodejs远端依赖列表
func UninstallNodejsRemoteDep(nodeId string, depName string) (string, error) {
output, err := RpcClientUninstallDep(nodeId, constants.Nodejs, depName)
if err != nil {
return output, err
}
return output, nil
}
// 获取Nodejs本地依赖列表
func GetNodejsDepList(nodeId string, searchDepName string) (depList []entity.Dependency, err error) {
// 执行shell命令
cmd := exec.Command("npm", "search", "--json", searchDepName)
outputBytes, _ := cmd.Output()
// 获取已安装依赖列表
var installedDepList []entity.Dependency
if IsMasterNode(nodeId) {
installedDepList, err = GetNodejsLocalInstalledDepList(nodeId)
if err != nil {
return depList, err
}
} else {
installedDepList, err = GetNodejsRemoteInstalledDepList(nodeId)
if err != nil {
return depList, err
}
}
// 反序列化
if err := json.Unmarshal(outputBytes, &depList); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return depList, err
}
// 遍历安装列表
for i, dep := range depList {
depList[i].Installed = IsInstalledDep(installedDepList, dep)
}
return depList, nil
}

View File

@@ -6,10 +6,15 @@ import (
"crawlab/entity"
"crawlab/lib/cron"
"crawlab/model"
"crawlab/services/notification"
"crawlab/services/spider_handler"
"crawlab/utils"
"encoding/json"
"errors"
"fmt"
"github.com/apex/log"
"github.com/globalsign/mgo/bson"
uuid "github.com/satori/go.uuid"
"github.com/spf13/viper"
"os"
"os/exec"
@@ -17,6 +22,7 @@ import (
"runtime"
"runtime/debug"
"strconv"
"strings"
"sync"
"syscall"
"time"
@@ -102,9 +108,34 @@ func AssignTask(task model.Task) error {
// 设置环境变量
func SetEnv(cmd *exec.Cmd, envs []model.Env, taskId string, dataCol string) *exec.Cmd {
// 默认把Node.js的全局node_modules加入环境变量
envPath := os.Getenv("PATH")
for _, _path := range strings.Split(envPath, ":") {
if strings.Contains(_path, "/.nvm/versions/node/") {
pathNodeModules := strings.Replace(_path, "/bin", "/lib/node_modules", -1)
_ = os.Setenv("PATH", pathNodeModules+":"+envPath)
_ = os.Setenv("NODE_PATH", pathNodeModules)
break
}
}
// 默认环境变量
cmd.Env = append(os.Environ(), "CRAWLAB_TASK_ID="+taskId)
cmd.Env = append(cmd.Env, "CRAWLAB_COLLECTION="+dataCol)
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_HOST="+viper.GetString("mongo.host"))
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_PORT="+viper.GetString("mongo.port"))
if viper.GetString("mongo.db") != "" {
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_DB="+viper.GetString("mongo.db"))
}
if viper.GetString("mongo.username") != "" {
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_USERNAME="+viper.GetString("mongo.username"))
}
if viper.GetString("mongo.password") != "" {
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_PASSWORD="+viper.GetString("mongo.password"))
}
if viper.GetString("mongo.authSource") != "" {
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_AUTHSOURCE="+viper.GetString("mongo.authSource"))
}
cmd.Env = append(cmd.Env, "PYTHONUNBUFFERED=0")
cmd.Env = append(cmd.Env, "PYTHONIOENCODING=utf-8")
cmd.Env = append(cmd.Env, "TZ=Asia/Shanghai")
@@ -114,7 +145,11 @@ func SetEnv(cmd *exec.Cmd, envs []model.Env, taskId string, dataCol string) *exe
cmd.Env = append(cmd.Env, env.Name+"="+env.Value)
}
// TODO 全局环境变量
// 全局环境变量
variables := model.GetVariableList()
for _, variable := range variables {
cmd.Env = append(cmd.Env, variable.Key+"="+variable.Value)
}
return cmd
}
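Everything SetEnv exports is what a task process can rely on at runtime: CRAWLAB_TASK_ID, CRAWLAB_COLLECTION, the CRAWLAB_MONGO_* connection variables, plus per-spider and global variables. A rough sketch of a spider-side consumer written in Go (any language works the same way; credentials are omitted and the mgo usage is illustrative, not Crawlab's SDK):

```go
package main

import (
	"os"

	"github.com/globalsign/mgo"
	"github.com/globalsign/mgo/bson"
)

func main() {
	host := os.Getenv("CRAWLAB_MONGO_HOST")
	port := os.Getenv("CRAWLAB_MONGO_PORT")
	db := os.Getenv("CRAWLAB_MONGO_DB")
	col := os.Getenv("CRAWLAB_COLLECTION")
	taskId := os.Getenv("CRAWLAB_TASK_ID")

	// Username/password/authSource are omitted here; a real spider should also
	// read CRAWLAB_MONGO_USERNAME etc. when they are set.
	session, err := mgo.Dial(host + ":" + port)
	if err != nil {
		panic(err)
	}
	defer session.Close()

	// Tagging each result with the task id is what lets Crawlab count and
	// display results per task (the Scrapy pipeline later in this diff does
	// the same thing in Python).
	item := bson.M{"title": "example result", "task_id": taskId}
	if err := session.DB(db).C(col).Insert(item); err != nil {
		panic(err)
	}
}
```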
@@ -136,8 +171,15 @@ func FinishOrCancelTask(ch chan string, cmd *exec.Cmd, t model.Task) {
log.Infof("process received signal: %s", signal)
if signal == constants.TaskCancel && cmd.Process != nil {
var err error
// 兼容windows
if runtime.GOOS == constants.Windows {
err = cmd.Process.Kill()
} else {
err = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
}
// 取消进程
if err := syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL); err != nil {
if err != nil {
log.Errorf("process kill error: %s", err.Error())
debug.PrintStack()
@@ -217,7 +259,22 @@ func ExecuteShellCmd(cmdStr string, cwd string, t model.Task, s model.Spider) (e
}
// 环境变量配置
cmd = SetEnv(cmd, s.Envs, t.Id, s.Col)
envs := s.Envs
if s.Type == constants.Configurable {
// 数据库配置
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_HOST", Value: viper.GetString("mongo.host")})
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_PORT", Value: viper.GetString("mongo.port")})
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_DB", Value: viper.GetString("mongo.db")})
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_USERNAME", Value: viper.GetString("mongo.username")})
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_PASSWORD", Value: viper.GetString("mongo.password")})
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_AUTHSOURCE", Value: viper.GetString("mongo.authSource")})
// 设置配置
for envName, envValue := range s.Config.Settings {
envs = append(envs, model.Env{Name: "CRAWLAB_SETTING_" + envName, Value: envValue})
}
}
cmd = SetEnv(cmd, envs, t.Id, s.Col)
// 起一个goroutine来监控进程
ch := utils.TaskExecChanMap.ChanBlocked(t.Id)
@@ -225,7 +282,9 @@ func ExecuteShellCmd(cmdStr string, cwd string, t model.Task, s model.Spider) (e
go FinishOrCancelTask(ch, cmd, t)
// kill的时候可以kill所有的子进程
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
if runtime.GOOS != constants.Windows {
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
}
// 启动进程
if err := StartTaskProcess(cmd, t); err != nil {
@@ -293,9 +352,12 @@ func SaveTaskResultCount(id string) func() {
// 执行任务
func ExecuteTask(id int) {
if flag, _ := LockList.Load(id); flag.(bool) {
log.Debugf(GetWorkerPrefix(id) + "正在执行任务...")
return
if flag, ok := LockList.Load(id); ok {
if flag.(bool) {
log.Debugf(GetWorkerPrefix(id) + "正在执行任务...")
return
}
}
// 上锁
@@ -369,7 +431,14 @@ func ExecuteTask(id int) {
)
// 执行命令
cmd := spider.Cmd
var cmd string
if spider.Type == constants.Configurable {
// 可配置爬虫命令
cmd = "scrapy crawl config_spider"
} else {
// 自定义爬虫命令
cmd = spider.Cmd
}
// 加入参数
if t.Param != "" {
@@ -382,15 +451,17 @@ func ExecuteTask(id int) {
t.Status = constants.StatusRunning // 任务状态
t.WaitDuration = t.StartTs.Sub(t.CreateTs).Seconds() // 等待时长
// 文件检查
if err := SpiderFileCheck(t, spider); err != nil {
log.Errorf("spider file check error: %s", err.Error())
return
}
// 开始执行任务
log.Infof(GetWorkerPrefix(id) + "开始执行任务(ID:" + t.Id + ")")
// 储存任务
if err := t.Save(); err != nil {
log.Errorf(err.Error())
HandleTaskError(t, err)
return
}
_ = t.Save()
// 起一个cron执行器来统计任务结果数
if spider.Col != "" {
@@ -404,9 +475,22 @@ func ExecuteTask(id int) {
defer cronExec.Stop()
}
// 获得触发任务用户
user, err := model.GetUser(t.UserId)
if err != nil {
log.Errorf(GetWorkerPrefix(id) + err.Error())
return
}
// 执行Shell命令
if err := ExecuteShellCmd(cmd, cwd, t, spider); err != nil {
log.Errorf(GetWorkerPrefix(id) + err.Error())
// 如果发生错误,则发送通知
t, _ = model.GetTask(t.Id)
if user.Setting.NotificationTrigger == constants.NotificationTriggerOnTaskEnd || user.Setting.NotificationTrigger == constants.NotificationTriggerOnTaskError {
SendNotifications(user, t, spider)
}
return
}
@@ -429,6 +513,11 @@ func ExecuteTask(id int) {
t.RuntimeDuration = t.FinishTs.Sub(t.StartTs).Seconds() // 运行时长
t.TotalDuration = t.FinishTs.Sub(t.CreateTs).Seconds() // 总时长
// 如果是任务结束时发送通知,则发送通知
if user.Setting.NotificationTrigger == constants.NotificationTriggerOnTaskEnd {
SendNotifications(user, t, spider)
}
// 保存任务
if err := t.Save(); err != nil {
log.Errorf(GetWorkerPrefix(id) + err.Error())
@@ -444,6 +533,30 @@ func ExecuteTask(id int) {
log.Infof(GetWorkerPrefix(id) + "任务(ID:" + t.Id + ")" + "执行完毕. 消耗时间:" + durationStr + "秒")
}
func SpiderFileCheck(t model.Task, spider model.Spider) error {
// 判断爬虫文件是否存在
gfFile := model.GetGridFs(spider.FileId)
if gfFile == nil {
t.Error = "找不到爬虫文件,请重新上传"
t.Status = constants.StatusError
t.FinishTs = time.Now() // 结束时间
t.RuntimeDuration = t.FinishTs.Sub(t.StartTs).Seconds() // 运行时长
t.TotalDuration = t.FinishTs.Sub(t.CreateTs).Seconds() // 总时长
_ = t.Save()
return errors.New(t.Error)
}
// 判断md5值是否一致
path := filepath.Join(viper.GetString("spider.path"), spider.Name)
md5File := filepath.Join(path, spider_handler.Md5File)
md5 := utils.GetSpiderMd5Str(md5File)
if gfFile.Md5 != md5 {
spiderSync := spider_handler.SpiderSync{Spider: spider}
spiderSync.RemoveDownCreate(gfFile.Md5)
}
return nil
}
func GetTaskLog(id string) (logStr string, err error) {
task, err := model.GetTask(id)
@@ -452,6 +565,29 @@ func GetTaskLog(id string) (logStr string, err error) {
}
if IsMasterNode(task.NodeId.Hex()) {
if !utils.Exists(task.LogPath) {
fileDir, err := MakeLogDir(task)
if err != nil {
log.Errorf(err.Error())
}
fileP := GetLogFilePaths(fileDir)
// 获取日志文件路径
fLog, err := os.Create(fileP)
defer fLog.Close()
if err != nil {
log.Errorf("create task log file error: %s", fileP)
debug.PrintStack()
}
task.LogPath = fileP
if err := task.Save(); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
}
}
// 若为主节点,获取本机日志
logBytes, err := model.GetLocalLog(task.LogPath)
if err != nil {
@@ -533,17 +669,188 @@ func CancelTask(id string) (err error) {
return nil
}
func HandleTaskError(t model.Task, err error) {
log.Error("handle task error:" + err.Error())
t.Status = constants.StatusError
t.Error = err.Error()
t.FinishTs = time.Now()
if err := t.Save(); err != nil {
func AddTask(t model.Task) error {
// 生成任务ID
id := uuid.NewV4()
t.Id = id.String()
// 设置任务状态
t.Status = constants.StatusPending
// 如果没有传入node_id则置为null
if t.NodeId.Hex() == "" {
t.NodeId = bson.ObjectIdHex(constants.ObjectIdNull)
}
// 将任务存入数据库
if err := model.AddTask(t); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return
return err
}
// 加入任务队列
if err := AssignTask(t); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
return err
}
return nil
}
func GetTaskEmailMarkdownContent(t model.Task, s model.Spider) string {
n, _ := model.GetNode(t.NodeId)
errMsg := ""
statusMsg := fmt.Sprintf(`<span style="color:green">%s</span>`, t.Status)
if t.Status == constants.StatusError {
errMsg = " with errors"
statusMsg = fmt.Sprintf(`<span style="color:red">%s</span>`, t.Status)
}
return fmt.Sprintf(`
Your task has finished%s. Please find the task info below.
|
--: | :--
**Task ID:** | %s
**Task Status:** | %s
**Task Param:** | %s
**Spider ID:** | %s
**Spider Name:** | %s
**Node:** | %s
**Create Time:** | %s
**Start Time:** | %s
**Finish Time:** | %s
**Wait Duration:** | %.0f sec
**Runtime Duration:** | %.0f sec
**Total Duration:** | %.0f sec
**Number of Results:** | %d
**Error:** | <span style="color:red">%s</span>
Please login to Crawlab to view the details.
`,
errMsg,
t.Id,
statusMsg,
t.Param,
s.Id.Hex(),
s.Name,
n.Name,
utils.GetLocalTimeString(t.CreateTs),
utils.GetLocalTimeString(t.StartTs),
utils.GetLocalTimeString(t.FinishTs),
t.WaitDuration,
t.RuntimeDuration,
t.TotalDuration,
t.ResultCount,
t.Error,
)
}
func GetTaskMarkdownContent(t model.Task, s model.Spider) string {
n, _ := model.GetNode(t.NodeId)
errMsg := ""
errLog := "-"
statusMsg := fmt.Sprintf(`<font color="#00FF00">%s</font>`, t.Status)
if t.Status == constants.StatusError {
errMsg = `(有错误)`
errLog = fmt.Sprintf(`<font color="#FF0000">%s</font>`, t.Error)
statusMsg = fmt.Sprintf(`<font color="#FF0000">%s</font>`, t.Status)
}
return fmt.Sprintf(`
您的任务已完成%s请查看任务信息如下。
> **任务ID:** %s
> **任务状态:** %s
> **任务参数:** %s
> **爬虫ID:** %s
> **爬虫名称:** %s
> **节点:** %s
> **创建时间:** %s
> **开始时间:** %s
> **完成时间:** %s
> **等待时间:** %.0f秒
> **运行时间:** %.0f秒
> **总时间:** %.0f秒
> **结果数:** %d
> **错误:** %s
请登录Crawlab查看详情。
`,
errMsg,
t.Id,
statusMsg,
t.Param,
s.Id.Hex(),
s.Name,
n.Name,
utils.GetLocalTimeString(t.CreateTs),
utils.GetLocalTimeString(t.StartTs),
utils.GetLocalTimeString(t.FinishTs),
t.WaitDuration,
t.RuntimeDuration,
t.TotalDuration,
t.ResultCount,
errLog,
)
}
func SendTaskEmail(u model.User, t model.Task, s model.Spider) {
statusMsg := "has finished"
if t.Status == constants.StatusError {
statusMsg = "has an error"
}
title := fmt.Sprintf("[Crawlab] Task for \"%s\" %s", s.Name, statusMsg)
if err := notification.SendMail(
u.Email,
u.Username,
title,
GetTaskEmailMarkdownContent(t, s),
); err != nil {
log.Errorf("mail error: " + err.Error())
debug.PrintStack()
}
}
func SendTaskDingTalk(u model.User, t model.Task, s model.Spider) {
statusMsg := "已完成"
if t.Status == constants.StatusError {
statusMsg = "发生错误"
}
title := fmt.Sprintf("[Crawlab] \"%s\" 任务%s", s.Name, statusMsg)
content := GetTaskMarkdownContent(t, s)
if err := notification.SendMobileNotification(u.Setting.DingTalkRobotWebhook, title, content); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
}
}
func SendTaskWechat(u model.User, t model.Task, s model.Spider) {
content := GetTaskMarkdownContent(t, s)
if err := notification.SendMobileNotification(u.Setting.WechatRobotWebhook, "", content); err != nil {
log.Errorf(err.Error())
debug.PrintStack()
}
}
func SendNotifications(u model.User, t model.Task, s model.Spider) {
if u.Email != "" && utils.StringArrayContains(u.Setting.EnabledNotifications, constants.NotificationTypeMail) {
go func() {
SendTaskEmail(u, t, s)
}()
}
if u.Setting.DingTalkRobotWebhook != "" && utils.StringArrayContains(u.Setting.EnabledNotifications, constants.NotificationTypeDingTalk) {
go func() {
SendTaskDingTalk(u, t, s)
}()
}
if u.Setting.WechatRobotWebhook != "" && utils.StringArrayContains(u.Setting.EnabledNotifications, constants.NotificationTypeWechat) {
go func() {
SendTaskWechat(u, t, s)
}()
}
}
func InitTaskExecutor() error {

View File

@@ -6,20 +6,18 @@ import (
"crawlab/utils"
"errors"
"github.com/dgrijalva/jwt-go"
"github.com/gin-gonic/gin"
"github.com/globalsign/mgo/bson"
"github.com/spf13/viper"
"strings"
"time"
)
func InitUserService() error {
adminUser := model.User{
Username: "admin",
Password: utils.EncryptPassword("admin"),
Role: constants.RoleAdmin,
}
_ = adminUser.Add()
_ = CreateNewUser("admin", "admin", constants.RoleAdmin, "")
return nil
}
func MakeToken(user *model.User) (tokenStr string, err error) {
token := jwt.NewWithClaims(jwt.SigningMethodHS256, jwt.MapClaims{
"id": user.Id,
@@ -91,3 +89,29 @@ func CheckToken(tokenStr string) (user model.User, err error) {
return
}
func CreateNewUser(username string, password string, role string, email string) error {
user := model.User{
Username: strings.ToLower(username),
Password: utils.EncryptPassword(password),
Role: role,
Email: email,
Setting: model.UserSetting{
NotificationTrigger: constants.NotificationTriggerNever,
EnabledNotifications: []string{
constants.NotificationTypeMail,
constants.NotificationTypeDingTalk,
constants.NotificationTypeWechat,
},
},
}
if err := user.Add(); err != nil {
return err
}
return nil
}
func GetCurrentUser(c *gin.Context) *model.User {
data, _ := c.Get("currentUser")
return data.(*model.User)
}
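MakeToken and CheckToken are only partially visible in these hunks. For orientation, here is a minimal sketch of an HS256 issue-and-verify round trip with the dgrijalva/jwt-go package imported above; the claim names and secret handling are simplified assumptions, not Crawlab's exact implementation:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/dgrijalva/jwt-go"
)

var secret = []byte("replace-me") // a real server reads its secret from configuration

func makeToken(userId, username string) (string, error) {
	token := jwt.NewWithClaims(jwt.SigningMethodHS256, jwt.MapClaims{
		"id":       userId,
		"username": username,
	})
	return token.SignedString(secret)
}

func checkToken(tokenStr string) (jwt.MapClaims, error) {
	token, err := jwt.Parse(tokenStr, func(t *jwt.Token) (interface{}, error) {
		if _, ok := t.Method.(*jwt.SigningMethodHMAC); !ok {
			return nil, errors.New("unexpected signing method")
		}
		return secret, nil
	})
	if err != nil || !token.Valid {
		return nil, errors.New("invalid token")
	}
	return token.Claims.(jwt.MapClaims), nil
}

func main() {
	tokenStr, err := makeToken("507f1f77bcf86cd799439011", "admin")
	if err != nil {
		panic(err)
	}
	claims, err := checkToken(tokenStr)
	fmt.Println(claims["username"], err)
}
```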

View File

@@ -0,0 +1,12 @@
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class Item(scrapy.Item):
###ITEMS###

View File

@@ -0,0 +1,103 @@
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
class ConfigSpiderSpiderMiddleware(object):
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, dict or Item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request, dict
# or Item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn't have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class ConfigSpiderDownloaderMiddleware(object):
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)

View File

@@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
from pymongo import MongoClient
mongo = MongoClient(
host=os.environ.get('CRAWLAB_MONGO_HOST') or 'localhost',
port=int(os.environ.get('CRAWLAB_MONGO_PORT') or 27017),
username=os.environ.get('CRAWLAB_MONGO_USERNAME'),
password=os.environ.get('CRAWLAB_MONGO_PASSWORD'),
authSource=os.environ.get('CRAWLAB_MONGO_AUTHSOURCE') or 'admin'
)
db = mongo[os.environ.get('CRAWLAB_MONGO_DB') or 'test']
col = db[os.environ.get('CRAWLAB_COLLECTION') or 'test']
task_id = os.environ.get('CRAWLAB_TASK_ID')
class ConfigSpiderPipeline(object):
def process_item(self, item, spider):
item['task_id'] = task_id
if col is not None:
col.save(item)
return item

View File

@@ -0,0 +1,111 @@
# -*- coding: utf-8 -*-
import os
import re
import json
# Scrapy settings for config_spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Crawlab Configurable Spider'
SPIDER_MODULES = ['config_spider.spiders']
NEWSPIDER_MODULE = 'config_spider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Crawlab Spider'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'config_spider.middlewares.ConfigSpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'config_spider.middlewares.ConfigSpiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'config_spider.pipelines.ConfigSpiderPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
for setting_env_name in [x for x in os.environ.keys() if x.startswith('CRAWLAB_SETTING_')]:
setting_name = setting_env_name.replace('CRAWLAB_SETTING_', '')
setting_value = os.environ.get(setting_env_name)
if setting_value.lower() == 'true':
setting_value = True
elif setting_value.lower() == 'false':
setting_value = False
elif re.search(r'^\d+$', setting_value) is not None:
setting_value = int(setting_value)
elif re.search(r'^\{.*\}$', setting_value.strip()) is not None:
setting_value = json.loads(setting_value)
elif re.search(r'^\[.*\]$', setting_value.strip()) is not None:
setting_value = json.loads(setting_value)
else:
pass
locals()[setting_name] = setting_value

View File

@@ -0,0 +1,21 @@
# -*- coding: utf-8 -*-
import scrapy
import re
from config_spider.items import Item
from urllib.parse import urljoin, urlparse
def get_real_url(response, url):
if re.search(r'^https?', url):
return url
elif re.search(r'^\/\/', url):
u = urlparse(response.url)
return u.scheme + ':' + url  # protocol-relative URL: prepend "<scheme>:"
return urljoin(response.url, url)
class ConfigSpider(scrapy.Spider):
name = 'config_spider'
def start_requests(self):
yield scrapy.Request(url='###START_URL###', callback=self.###START_STAGE###)
###PARSERS###

View File

@@ -0,0 +1,11 @@
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html
[settings]
default = config_spider.settings
[deploy]
#url = http://localhost:6800/
project = config_spider

View File

@@ -0,0 +1,19 @@
name: "toscrapy_books"
start_url: "http://news.163.com/special/0001386F/rank_news.html"
start_stage: "list"
engine: "scrapy"
stages:
- name: list
is_list: true
list_css: "table tr:not(:first-child)"
fields:
- name: "title"
css: "td:nth-child(1) > a"
- name: "url"
css: "td:nth-child(1) > a"
attr: "href"
- name: "clicks"
css: "td.cBlue"
settings:
ROBOTSTXT_OBEY: false
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

View File

@@ -0,0 +1,21 @@
name: toscrapy_books
start_url: http://www.baidu.com/s?wd=crawlab
start_stage: list
engine: scrapy
stages:
- name: list
is_list: true
list_xpath: //*[contains(@class, "c-container")]
page_xpath: //*[@id="page"]//a[@class="n"][last()]
page_attr: href
fields:
- name: title
xpath: .//h3/a
- name: url
xpath: .//h3/a
attr: href
- name: abstract
xpath: .//*[@class="c-abstract"]
settings:
ROBOTSTXT_OBEY: false
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

View File

@@ -0,0 +1,27 @@
name: "toscrapy_books"
start_url: "http://books.toscrape.com"
start_stage: "list"
engine: "scrapy"
stages:
- name: list
is_list: true
list_css: "section article.product_pod"
page_css: "ul.pager li.next a"
page_attr: "href"
fields:
- name: "title"
css: "h3 > a"
- name: "url"
css: "h3 > a"
attr: "href"
next_stage: "detail"
- name: "price"
css: ".product_price > .price_color"
- name: detail
is_list: false
fields:
- name: "description"
css: "#product_description + p"
settings:
ROBOTSTXT_OBEY: true
AUTOTHROTTLE_ENABLED: true

View File

@@ -0,0 +1,51 @@
name: "amazon_config"
display_name: "亚马逊中国(可配置)"
remark: "亚马逊中国搜索手机,列表+分页"
type: "configurable"
col: "results_amazon_config"
engine: scrapy
start_url: https://www.amazon.cn/s?k=%E6%89%8B%E6%9C%BA&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss_2
start_stage: list
stages:
- name: list
is_list: true
list_css: .s-result-item
list_xpath: ""
page_css: .a-last > a
page_xpath: ""
page_attr: href
fields:
- name: title
css: span.a-text-normal
xpath: ""
attr: ""
next_stage: ""
remark: ""
- name: url
css: .a-link-normal
xpath: ""
attr: href
next_stage: ""
remark: ""
- name: price
css: ""
xpath: .//*[@class="a-price-whole"]
attr: ""
next_stage: ""
remark: ""
- name: price_fraction
css: ""
xpath: .//*[@class="a-price-fraction"]
attr: ""
next_stage: ""
remark: ""
- name: img
css: .s-image-square-aspect > img
xpath: ""
attr: src
next_stage: ""
remark: ""
settings:
ROBOTSTXT_OBEY: "false"
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/78.0.3904.108 Safari/537.36

View File

@@ -0,0 +1,57 @@
name: "autohome_config"
display_name: "汽车之家(可配置)"
remark: "汽车之家文章,列表+详情+分页"
type: "configurable"
col: "results_autohome_config"
engine: scrapy
start_url: https://www.autohome.com.cn/all/
start_stage: list
stages:
- name: list
is_list: true
list_css: ul.article > li
list_xpath: ""
page_css: a.page-item-next
page_xpath: ""
page_attr: href
fields:
- name: title
css: li > a > h3
xpath: ""
attr: ""
next_stage: ""
remark: ""
- name: url
css: li > a
xpath: ""
attr: href
next_stage: ""
remark: ""
- name: abstract
css: li > a > p
xpath: ""
attr: ""
next_stage: ""
remark: ""
- name: time
css: li > a .fn-left
xpath: ""
attr: ""
next_stage: ""
remark: ""
- name: views
css: li > a .fn-right > em:first-child
xpath: ""
attr: ""
next_stage: ""
remark: ""
- name: comments
css: li > a .fn-right > em:last-child
xpath: ""
attr: ""
next_stage: ""
remark: ""
settings:
ROBOTSTXT_OBEY: "false"
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/78.0.3904.108 Safari/537.36

View File

@@ -0,0 +1,39 @@
name: "baidu_config"
display_name: "百度搜索(可配置)"
remark: "百度搜索Crawlab列表+分页"
type: "configurable"
col: "results_baidu_config"
engine: scrapy
start_url: http://www.baidu.com/s?wd=crawlab
start_stage: list
stages:
- name: list
is_list: true
list_css: ".result.c-container"
list_xpath: ""
page_css: "a.n"
page_xpath: ""
page_attr: href
fields:
- name: title
css: ""
xpath: .//h3/a
attr: ""
next_stage: ""
remark: ""
- name: url
css: ""
xpath: .//h3/a
attr: href
next_stage: ""
remark: ""
- name: abstract
css: ""
xpath: .//*[@class="c-abstract"]
attr: ""
next_stage: ""
remark: ""
settings:
ROBOTSTXT_OBEY: "false"
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/78.0.3904.108 Safari/537.36

View File

@@ -0,0 +1,6 @@
name: "bing_general"
display_name: "必应搜索 (通用)"
remark: "必应搜索 Crawlab列表+分页"
col: "results_bing_general"
type: "customized"
cmd: "python bing_spider.py"

View File

@@ -0,0 +1,41 @@
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse
import re
from crawlab import save_item
s = requests.Session()
def get_real_url(response, url):
if re.search(r'^https?', url):
return url
elif re.search(r'^\/\/', url):
u = urlparse(response.url)
return u.scheme + ':' + url  # protocol-relative URL: prepend "<scheme>:"
return urljoin(response.url, url)
def start_requests():
for i in range(0, 9):
fr = 'PERE' if not i else 'MORE'
url = f'https://cn.bing.com/search?q=crawlab&first={10 * i + 1}&FROM={fr}'
request_page(url)
def request_page(url):
print(f'requesting {url}')
r = s.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'})
parse_list(r)
def parse_list(response):
soup = bs(response.content.decode('utf-8'))
for el in list(soup.select('#b_results > li')):
try:
save_item({
'title': el.select_one('h2').text,
'url': el.select_one('h2 a').attrs.get('href'),
'abstract': el.select_one('.b_caption p').text,
})
except:
pass
if __name__ == '__main__':
start_requests()

View File

@@ -0,0 +1,5 @@
name: "chinaz"
display_name: "站长之家 (Scrapy)"
col: "results_chinaz"
type: "customized"
cmd: "scrapy crawl chinaz_spider"

View File

@@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

View File

@@ -65,7 +65,7 @@ ROBOTSTXT_OBEY = True
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'chinaz.pipelines.MongoPipeline': 300,
'crawlab.pipelines.CrawlabMongoPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)

Some files were not shown because too many files have changed in this diff.