Mirror of https://github.com/crawlab-team/crawlab.git (synced 2026-01-23 17:31:11 +01:00)

Commit: Merge branch 'develop'
.gitattributes (vendored, new file, 9 lines)
@@ -0,0 +1,9 @@
*.md linguist-language=Go
*.yml linguist-language=Go
*.html linguist-language=Go
*.js linguist-language=Go
*.xml linguist-language=Go
*.css linguist-language=Go
*.sql linguist-language=Go
*.uml linguist-language=Go
*.cmd linguist-language=Go
.github/ISSUE_TEMPLATE/bug_report.md (vendored, new file, 24 lines)
@@ -0,0 +1,24 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: 'bug'
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.
.github/ISSUE_TEMPLATE/bug_report_zh.md (vendored, new file, 23 lines)
@@ -0,0 +1,23 @@
---
name: Bug report
about: Create a bug report to help us improve the product
title: ''
labels: 'bug'
assignees: ''

---

**Describe the bug**
For example, feature xxx does not work when xxx.

**Steps to reproduce**
Steps to reproduce this bug:
1.
2.
3.

**Expected behavior**
xxx should work.

**Screenshots**

.github/ISSUE_TEMPLATE/feature_request.md (vendored, new file, 17 lines)
@@ -0,0 +1,17 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: 'enhancement'
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.
.github/ISSUE_TEMPLATE/feature_request_zh.md (vendored, new file, 17 lines)
@@ -0,0 +1,17 @@
---
name: Feature request
about: Suggestions for enhancements and new features
title: ''
labels: 'enhancement'
assignees: ''

---

**Describe the problem this feature request tries to solve**
For example, I am always frustrated by the current design of xxx when xxx.

**Describe the solution you consider feasible**
For example, adding feature xxx would solve the problem.

**Alternatives you have considered**
For example, using xxx would also solve the problem.
.gitignore (vendored, 3 lines changed)
@@ -121,4 +121,5 @@ _book/
.idea
*.lock

backend/spiders
backend/spiders
spiders/*.zip
CHANGELOG-zh.md (new file, 190 lines)
@@ -0,0 +1,190 @@
# 0.4.5 (unknown)
### Features / Enhancement
- **Interactive Tutorial**. Guide users through the main functionalities of Crawlab.
- **Global Environment Variables**. Allow users to set global environment variables, which are passed into all spider programs. [#177](https://github.com/crawlab-team/crawlab/issues/177)
- **Projects**. Allow users to link spiders to projects. [#316](https://github.com/crawlab-team/crawlab/issues/316)
- **Demo Spiders**. Automatically add demo spiders when Crawlab is initialized. [#379](https://github.com/crawlab-team/crawlab/issues/379)
- **User Admin Optimization**. Restrict privileges of admin users. [#456](https://github.com/crawlab-team/crawlab/issues/456)
- **Setting Page Optimization**.
- **Task Results Page Optimization**.

### Bug Fixes
- **Unable to find spider file error**. [#485](https://github.com/crawlab-team/crawlab/issues/485)
- **Clicking the delete button causes a redirect**. [#480](https://github.com/crawlab-team/crawlab/issues/480)
- **Unable to create files in an empty spider**. [#479](https://github.com/crawlab-team/crawlab/issues/479)
- **Download results error**. [#465](https://github.com/crawlab-team/crawlab/issues/465)
- **crawlab-sdk CLI error**. [#458](https://github.com/crawlab-team/crawlab/issues/458)
- **Page refresh issue**. [#441](https://github.com/crawlab-team/crawlab/issues/441)
- **Results do not support JSON**. [#202](https://github.com/crawlab-team/crawlab/issues/202)
- **Fixed the "get all spiders after deleting a spider" error**.
- **Fixed i18n warning**.

# 0.4.4 (2020-01-17)

### Features / Enhancement
- **Email Notification**. Allow users to send email notifications.
- **DingTalk Robot Notification**. Allow users to send DingTalk Robot notifications.
- **WeChat Work Robot Notification**. Allow users to send WeChat Work Robot notifications.
- **API Address Optimization**. Added relative URL paths in the frontend so that users do not need to specify `CRAWLAB_API_ADDRESS` explicitly.
- **SDK Compatibility**. Allow users to integrate Scrapy or general spiders with the Crawlab SDK.
- **Enhanced File Management**. Added a tree-like file sidebar so that users can edit files more easily.
- **Advanced Schedule Cron**. Allow users to edit schedule cron jobs with a visual cron editor.

### Bug Fixes
- **`nil retuened` error**.
- **Errors when using HTTPS**.
- **Unable to run configurable spiders from the spider list page**.
- **Missing form validation when uploading spider files**.

# 0.4.3 (2020-01-07)

### Features / Enhancement
- **Dependency Installation**. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) from the platform web interface.
- **Pre-install Programming Languages in Docker**. Allow Docker users to set `CRAWLAB_SERVER_LANG_NODE` to `Y` to pre-install the `Node.js` environment.
- **Schedule List in Spider Detail Page**. Allow users to view, add and edit schedule cron jobs on the spider detail page. [#360](https://github.com/crawlab-team/crawlab/issues/360)
- **Cron Expression Aligned with Linux**. Changed the expression from 6 elements to 5 elements, as in Linux.
- **Enable/Disable Schedule Cron**. Allow users to enable/disable schedule jobs. [#297](https://github.com/crawlab-team/crawlab/issues/297)
- **Better Task Management**. Allow users to batch-delete tasks. [#341](https://github.com/crawlab-team/crawlab/issues/341)
- **Better Spider Management**. Allow users to filter and sort spiders on the spider list page.
- **Added Chinese `CHANGELOG`**.
- **Added a GitHub star button at the top**.

### Bug Fixes
- **Schedule cron task issue**. [#423](https://github.com/crawlab-team/crawlab/issues/423)
- **Upload spider zip file issue**. [#403](https://github.com/crawlab-team/crawlab/issues/403) [#407](https://github.com/crawlab-team/crawlab/issues/407)
- **Crash caused by network failure**. [#340](https://github.com/crawlab-team/crawlab/issues/340)
- **Schedule cron jobs not running correctly**
- **Schedule list columns mis-positioned**
- **Refresh button redirecting to the wrong page**

# 0.4.2 (2019-12-26)
### Features / Enhancement
- **Disclaimer**. Added a disclaimer.
- **Fetch the version number via API**. [#371](https://github.com/crawlab-team/crawlab/issues/371)
- **Allow user registration via configuration**. [#346](https://github.com/crawlab-team/crawlab/issues/346)
- **Allow adding new users**.
- **More Advanced File Management**. Allow users to add, edit, rename and delete code files. [#286](https://github.com/crawlab-team/crawlab/issues/286)
- **Optimized Spider Creation Process**. Allow users to create an empty customized spider before uploading the zip file.
- **Better Task Management**. Allow users to filter tasks by selected criteria. [#341](https://github.com/crawlab-team/crawlab/issues/341)

### Bug Fixes
- **Duplicated nodes**. [#391](https://github.com/crawlab-team/crawlab/issues/391)
- **"mongodb no reachable" error**. [#373](https://github.com/crawlab-team/crawlab/issues/373)

# 0.4.1 (2019-12-13)
### Features / Enhancement
- **Spiderfile Optimization**. Changed stages from an array to a dictionary. [#358](https://github.com/crawlab-team/crawlab/issues/358)
- **Baidu Tongji update**.

### Bug Fixes
- **Unable to display schedule tasks**. [#353](https://github.com/crawlab-team/crawlab/issues/353)
- **Duplicate node registration**. [#334](https://github.com/crawlab-team/crawlab/issues/334)

# 0.4.0 (2019-12-06)
### Features / Enhancement
- **Configurable Spider**. Allow users to add a `Spiderfile` to configure crawling rules.
- **Execution Mode**. Allow users to select 3 task execution modes: *All Nodes*, *Selected Nodes* and *Random*.

### Bug Fixes
- **Task accidentally killed**. [#306](https://github.com/crawlab-team/crawlab/issues/306)
- **Documentation fixes**. [#301](https://github.com/crawlab-team/crawlab/issues/258)
- **Direct deploy incompatible with Windows**. [#288](https://github.com/crawlab-team/crawlab/issues/288)
- **Log files lost**. [#269](https://github.com/crawlab-team/crawlab/issues/269)

# 0.3.5 (2019-10-28)
### Features / Enhancement
- **Graceful Shutdown**. [detail](https://github.com/crawlab-team/crawlab/commit/63fab3917b5a29fd9770f9f51f1572b9f0420385)
- **Node Info Optimization**. [detail](https://github.com/crawlab-team/crawlab/commit/973251a0fbe7a2184ac0da09e0404a17c736aee7)
- **Append System Environment Variables to Tasks**. [detail](https://github.com/crawlab-team/crawlab/commit/4ab4892471965d6342d30385578ca60dc51f8ad3)
- **Auto-refresh Task Logs**. [detail](https://github.com/crawlab-team/crawlab/commit/4ab4892471965d6342d30385578ca60dc51f8ad3)
- **Allow HTTPS Deployment**. [detail](https://github.com/crawlab-team/crawlab/commit/5d8f6f0c56768a6e58f5e46cbf5adff8c7819228)

### Bug Fixes
- **Unable to fetch the spider list in schedules**. [detail](https://github.com/crawlab-team/crawlab/commit/311f72da19094e3fa05ab4af49812f58843d8d93)
- **Unable to fetch node info from worker nodes**. [detail](https://github.com/crawlab-team/crawlab/commit/6af06efc17685a9e232e8c2b5fd819ec7d2d1674)
- **Unable to select nodes when running spider tasks**. [detail](https://github.com/crawlab-team/crawlab/commit/31f8e03234426e97aed9b0bce6a50562f957edad)
- **Unable to fetch result counts when results are large**. [#260](https://github.com/crawlab-team/crawlab/issues/260)
- **Node issue in schedule tasks**. [#244](https://github.com/crawlab-team/crawlab/issues/244)


# 0.3.1 (2019-08-25)
### Features / Enhancement
- **Docker Image Optimization**. Split the Docker image further into alpine-based master, worker and frontend images.
- **Unit Tests**. Covered part of the backend code with unit tests.
- **Frontend Optimization**. Login page, button sizes, upload UI hints.
- **More Flexible Node Registration**. Allow users to pass a variable as the registration key instead of the default MAC address.

### Bug Fixes
- **Error when uploading large spider files**. Memory crash when uploading large spider files. [#150](https://github.com/crawlab-team/crawlab/issues/150)
- **Unable to sync spiders**. Fixed the spider file sync issue by raising the write permission level. [#114](https://github.com/crawlab-team/crawlab/issues/114)
- **Spider page issue**. Fixed by removing the `Site` field. [#112](https://github.com/crawlab-team/crawlab/issues/112)
- **Node display issue**. Nodes were not displayed correctly when running Docker containers on multiple machines. [#99](https://github.com/crawlab-team/crawlab/issues/99)

# 0.3.0 (2019-07-31)
### Features / Enhancement
- **Golang Backend**: Refactored the backend from Python to Golang, greatly improving stability and performance.
- **Node Network Graph**: Visualization of the node topology.
- **Node System Info**: View system info including OS, number of CPUs and executables.
- **Improved Node Monitoring**: Nodes are monitored and registered through Redis.
- **File Management**: Spider files can be edited online, with code highlighting.
- **Login / Registration / User Management**: Require users to log in before using Crawlab; allow user registration and user management, with some role-based authorization.
- **Automatic Spider Deployment**: Spiders are automatically deployed/synchronized to all online nodes.
- **Smaller Docker Image**: A slimmed-down Docker image; a multi-stage build reduced the image size from 1.3G to around 700M.

### Bug Fixes
- **Node Status**. Node status was not updated when a node went offline. [#87](https://github.com/tikazyq/crawlab/issues/87)
- **Spider Deployment Error**. Fixed through automatic spider deployment. [#83](https://github.com/tikazyq/crawlab/issues/83)
- **Nodes Not Shown as Online**. Nodes could not be shown as online. [#81](https://github.com/tikazyq/crawlab/issues/81)
- **Cron Jobs Not Working**. Fixed through the Golang backend. [#64](https://github.com/tikazyq/crawlab/issues/64)
- **Flower Error**. Fixed through the Golang backend. [#57](https://github.com/tikazyq/crawlab/issues/57)

# 0.2.4 (2019-07-07)
### Features / Enhancement
- **Documentation**: Better and more detailed documentation.
- **Better Crontab**: Generate cron expressions through the UI.
- **Better Performance**: Switched from the native Flask engine to `gunicorn`. [#78](https://github.com/tikazyq/crawlab/issues/78)

### Bug Fixes
- **Deleting Spiders**. Deleting a spider now removes not only the database record but also the related folder, tasks and schedules. [#69](https://github.com/tikazyq/crawlab/issues/69)
- **MongoDB Authentication**. Allow users to specify `authenticationDatabase` when connecting to `mongodb`. [#68](https://github.com/tikazyq/crawlab/issues/68)
- **Windows Compatibility**. Added `eventlet` to `requirements.txt`. [#59](https://github.com/tikazyq/crawlab/issues/59)


# 0.2.3 (2019-06-12)
### Features / Enhancement
- **Docker**: Users can run a Docker image to speed up deployment.
- **CLI**: Allow users to run Crawlab from the command line.
- **Upload Spiders**: Allow users to upload customized spiders to Crawlab.
- **Edit Fields on Preview**: Allow users to edit fields when previewing data in configurable spiders.

### Bug Fixes
- **Spider Pagination**. Fixed the pagination issue on the spider list page.

# 0.2.2 (2019-05-30)
### Features / Enhancement
- **Automatic Field Extraction**: Automatically extract fields on the configurable spider list page.
- **Download Results**: Allow downloading results as a CSV file.
- **Baidu Tongji**: Allow users to choose whether to send statistics to Baidu Tongji.

### Bug Fixes
- **Results page pagination**. [#45](https://github.com/tikazyq/crawlab/issues/45)
- **Duplicate schedule triggers**: Set Flask DEBUG to False so that schedules cannot be triggered twice. [#32](https://github.com/tikazyq/crawlab/issues/32)
- **Frontend environment**: Added `VUE_APP_BASE_URL` as a production-mode variable so that the API address is not always `localhost`. [#30](https://github.com/tikazyq/crawlab/issues/30)

# 0.2.1 (2019-05-27)
- **Configurable Spiders**: Allow users to create spiders to crawl data without writing code.

# 0.2 (2019-05-10)

- **Advanced Statistics**: Advanced statistics on the spider detail page.
- **Site Data**: Added a site list (China) that lets users view robots.txt, home page response time and other info.

# 0.1.1 (2019-04-23)

- **Basic Statistics**: Users can view basic statistics such as the number of failed tasks and the number of results on the spider and task pages.
- **Near-real-time Task Info**: Poll the server periodically (every 5 seconds) for near-real-time task info.
- **Scheduled Tasks**: Cron-like scheduled tasks implemented with apscheduler.

# 0.1 (2019-04-17)

- **Initial release**
CHANGELOG.md (92 lines changed)
@@ -1,3 +1,95 @@
# 0.4.5 (2020-02-03)
### Features / Enhancement
- **Interactive Tutorial**. Guide users through the main functionalities of Crawlab.
- **Global Environment Variables**. Allow users to set global environment variables, which will be passed into all spider programs. [#177](https://github.com/crawlab-team/crawlab/issues/177)
- **Project**. Allow users to link spiders to projects. [#316](https://github.com/crawlab-team/crawlab/issues/316)
- **Demo Spiders**. Added demo spiders when Crawlab is initialized. [#379](https://github.com/crawlab-team/crawlab/issues/379)
- **User Admin Optimization**. Restrict privileges of admin users. [#456](https://github.com/crawlab-team/crawlab/issues/456)
- **Setting Page Optimization**.
- **Task Results Optimization**.

### Bug Fixes
- **Unable to find spider file error**. [#485](https://github.com/crawlab-team/crawlab/issues/485)
- **Clicking the delete button results in a redirect**. [#480](https://github.com/crawlab-team/crawlab/issues/480)
- **Unable to create files in an empty spider**. [#479](https://github.com/crawlab-team/crawlab/issues/479)
- **Download results error**. [#465](https://github.com/crawlab-team/crawlab/issues/465)
- **crawlab-sdk CLI error**. [#458](https://github.com/crawlab-team/crawlab/issues/458)
- **Page refresh issue**. [#441](https://github.com/crawlab-team/crawlab/issues/441)
- **Results do not support JSON**. [#202](https://github.com/crawlab-team/crawlab/issues/202)
- **Getting all spiders after deleting a spider**.
- **i18n warning**.

# 0.4.4 (2020-01-17)
### Features / Enhancement
- **Email Notification**. Allow users to send email notifications.
- **DingTalk Robot Notification**. Allow users to send DingTalk Robot notifications.
- **WeChat Robot Notification**. Allow users to send WeChat Robot notifications.
- **API Address Optimization**. Added relative URL paths in the frontend so that users don't have to specify `CRAWLAB_API_ADDRESS` explicitly.
- **SDK Compatibility**. Allow users to integrate Scrapy or general spiders with the Crawlab SDK.
- **Enhanced File Management**. Added a tree-like file sidebar to allow users to edit files much more easily.
- **Advanced Schedule Cron**. Allow users to edit schedule cron jobs with a visual cron editor.

### Bug Fixes
- **`nil retuened` error**.
- **Error when using HTTPS**.
- **Unable to run Configurable Spiders from the Spider List**.
- **Missing form validation before uploading spider files**.

# 0.4.3 (2020-01-07)

### Features / Enhancement
- **Dependency Installation**. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) on the platform web interface.
- **Pre-install Programming Languages in Docker**. Allow Docker users to set `CRAWLAB_SERVER_LANG_NODE` to `Y` to pre-install the `Node.js` environment.
- **Add Schedule List in Spider Detail Page**. Allow users to view / add / edit schedule cron jobs on the spider detail page. [#360](https://github.com/crawlab-team/crawlab/issues/360)
- **Align Cron Expression with Linux**. Changed the expression from 6 elements to 5 elements, as in Linux.
- **Enable/Disable Schedule Cron**. Allow users to enable/disable schedule jobs. [#297](https://github.com/crawlab-team/crawlab/issues/297)
- **Better Task Management**. Allow users to batch-delete tasks. [#341](https://github.com/crawlab-team/crawlab/issues/341)
- **Better Spider Management**. Allow users to sort and filter spiders on the spider list page.
- **Added Chinese `CHANGELOG`**.
- **Added GitHub Star Button in the Nav Bar**.

### Bug Fixes
- **Schedule Cron Task Issue**. [#423](https://github.com/crawlab-team/crawlab/issues/423)
- **Upload Spider Zip File Issue**. [#403](https://github.com/crawlab-team/crawlab/issues/403) [#407](https://github.com/crawlab-team/crawlab/issues/407)
- **Exit due to Network Failure**. [#340](https://github.com/crawlab-team/crawlab/issues/340)
- **Cron Jobs not Running Correctly**
- **Schedule List Columns Mis-positioned**
- **Clicking Refresh Button Redirected to 404 Page**

# 0.4.2 (2019-12-26)
### Features / Enhancement
- **Disclaimer**. Added a page for the Disclaimer.
- **Call API to fetch version**. [#371](https://github.com/crawlab-team/crawlab/issues/371)
- **Configure to allow user registration**. [#346](https://github.com/crawlab-team/crawlab/issues/346)
- **Allow adding new users**.
- **More Advanced File Management**. Allow users to add / edit / rename / delete files. [#286](https://github.com/crawlab-team/crawlab/issues/286)
- **Optimized Spider Creation Process**. Allow users to create an empty customized spider before uploading the zip file.
- **Better Task Management**. Allow users to filter tasks by selected criteria. [#341](https://github.com/crawlab-team/crawlab/issues/341)

### Bug Fixes
- **Duplicated nodes**. [#391](https://github.com/crawlab-team/crawlab/issues/391)
- **"mongodb no reachable" error**. [#373](https://github.com/crawlab-team/crawlab/issues/373)

# 0.4.1 (2019-12-13)
### Features / Enhancement
- **Spiderfile Optimization**. Stages changed from dictionary to array. [#358](https://github.com/crawlab-team/crawlab/issues/358)
- **Baidu Tongji Update**.

### Bug Fixes
- **Unable to display schedule tasks**. [#353](https://github.com/crawlab-team/crawlab/issues/353)
- **Duplicate node registration**. [#334](https://github.com/crawlab-team/crawlab/issues/334)

# 0.4.0 (2019-12-06)
### Features / Enhancement
- **Configurable Spider**. Allow users to add spiders using a *Spiderfile* to configure crawling rules.
- **Execution Mode**. Allow users to select 3 modes for task execution: *All Nodes*, *Selected Nodes* and *Random*.

### Bug Fixes
- **Task accidentally killed**. [#306](https://github.com/crawlab-team/crawlab/issues/306)
- **Documentation fixes**. [#301](https://github.com/crawlab-team/crawlab/issues/258)
- **Direct deploy incompatible with Windows**. [#288](https://github.com/crawlab-team/crawlab/issues/288)
- **Log files lost**. [#269](https://github.com/crawlab-team/crawlab/issues/269)

# 0.3.5 (2019-10-28)
### Features / Enhancement
- **Graceful Shutdown**. [detail](https://github.com/crawlab-team/crawlab/commit/63fab3917b5a29fd9770f9f51f1572b9f0420385)
DISCLAIMER-zh.md (new file, 12 lines)
@@ -0,0 +1,12 @@
# Disclaimer

This disclaimer and privacy protection statement (hereinafter the "disclaimer" or "this statement") applies to the series of software (hereinafter "Crawlab") developed by the Crawlab development team (hereinafter the "development team"). If, after reading this statement, you do not agree with any of its terms or have doubts about it, please stop using our software immediately. If you have started or are currently using Crawlab, you are deemed to have read and agreed to all terms of this statement.

1. General: by installing Crawlab and using the services and functions it provides, you agree to enter into this agreement with the development team. The development team may change these terms at any time at its sole discretion. Amended terms take effect automatically once published on the GitHub disclaimer page.
2. This product is a Golang-based distributed crawler management platform that supports multiple programming languages, including Python, NodeJS, Go, Java and PHP, as well as a variety of crawler frameworks.
3. The Crawlab development team accepts no responsibility and bears no legal liability for any accident, negligence, breach of contract, defamation, copyright or intellectual property infringement arising from the use of Crawlab, or for any resulting loss (including computer virus infection caused by downloading Crawlab from unofficial sites).
4. Users use Crawlab at their own risk. We make no guarantees of any kind, and we bear no legal liability if users cannot upgrade or update normally due to technical reasons such as network conditions or communication lines.
5. When using Crawlab to crawl target websites, users must comply with the Cybersecurity Law and other laws and regulations related to crawlers. Users must not collect citizens' personal information without authorization, paralyze target websites through DDoS or similar means, ignore the target website's robots.txt protocol, or use other illegal means.
6. Crawlab respects and protects the personal privacy of all users and does not steal any information from users' computers.
7. Copyright of the system: the Crawlab development team owns the intellectual property rights, copyright and usage rights of all products it develops or co-develops, which are protected by applicable intellectual property, copyright, trademark, service mark, patent or other laws.
8. Distribution: any company or individual is allowed to publish and distribute our software on the Internet, but the Crawlab development team bears no responsibility for any legal or criminal events that may result from such distribution.
DISCLAIMER.md (new file, 12 lines)
@@ -0,0 +1,12 @@
# Disclaimer

This disclaimer and privacy protection statement (hereinafter referred to as the "disclaimer statement" or "this statement") applies to the series of software (hereinafter referred to as "Crawlab") developed by the Crawlab development group (hereinafter referred to as the "development group"). After you read this statement, if you do not agree with any terms in this statement or have doubts about this statement, please stop using our software immediately. If you have started or are using Crawlab, you have read and agreed to all terms of this statement.

1. General: by installing Crawlab and using the services and functions provided by Crawlab, you have agreed to establish this agreement with the development team. The development group may at any time change the terms at its sole discretion. The amended terms take effect automatically as soon as they are published on the GitHub disclaimer page.
2. This product is a distributed crawler management platform based on Golang, supporting Python, NodeJS, Go, Java, PHP and other programming languages as well as a variety of crawler frameworks.
3. The development team of Crawlab shall not be responsible for any accident, negligence, contract damage, defamation, copyright or intellectual property infringement caused by the use of Crawlab, or for any loss caused by it (including computer virus infection caused by downloading Crawlab from an unofficial site), and shall not bear any legal responsibility.
4. Users shall bear the risk of using Crawlab themselves. We do not make any form of guarantee, and we will not bear any legal responsibility for a user's failure to upgrade and update normally due to any technical reason such as network conditions or communication lines.
5. When users use Crawlab to crawl a target website, they need to comply with the laws and regulations related to crawlers, such as the Cybersecurity Law. Do not collect personal information of citizens without authorization, paralyze the target website by DDoS, ignore the target website's robots.txt protocol, or use other illegal means.
6. Crawlab respects and protects the personal privacy of all users and will not steal any information from users' computers.
7. Copyright of the system: the Crawlab development team owns the intellectual property rights, copyright and usage rights for all developed or jointly developed products, which are protected by applicable intellectual property, copyright, trademark, service mark, patent or other laws.
8. Distribution: any company or individual is allowed to publish or disseminate our software on the Internet, but the Crawlab development team shall not be responsible for any legal and criminal events that may be caused by the company or individual disseminating the software.
Dockerfile (20 lines changed)
@@ -15,34 +15,34 @@ WORKDIR /app

# install frontend
RUN npm config set unsafe-perm true
RUN npm install -g yarn && yarn install --registry=https://registry.npm.taobao.org
RUN npm install -g yarn && yarn install

RUN npm run build:prod

# images
FROM ubuntu:latest

ADD . /app

# set as non-interactive
ENV DEBIAN_FRONTEND noninteractive

# set CRAWLAB_IS_DOCKER
ENV CRAWLAB_IS_DOCKER Y

# install packages
RUN apt-get update \
	&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate python3 python3-pip \
	&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate python3 python3-pip nginx \
	&& ln -s /usr/bin/pip3 /usr/local/bin/pip \
	&& ln -s /usr/bin/python3 /usr/local/bin/python

# install backend
RUN pip install scrapy pymongo bs4 requests -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install scrapy pymongo bs4 requests crawlab-sdk scrapy-splash

# add files
ADD . /app

# copy backend files
COPY --from=backend-build /go/src/app .
COPY --from=backend-build /go/bin/crawlab /usr/local/bin

# install nginx
RUN apt-get -y install nginx

# copy frontend files
COPY --from=frontend-build /app/dist /app/dist
COPY --from=frontend-build /app/conf/crawlab.conf /etc/nginx/conf.d
@@ -57,4 +57,4 @@ EXPOSE 8080
EXPOSE 8000

# start backend
CMD ["/bin/sh", "/app/docker_init.sh"]
CMD ["/bin/bash", "/app/docker_init.sh"]
@@ -4,44 +4,43 @@ WORKDIR /go/src/app
COPY ./backend .

ENV GO111MODULE on
ENV GOPROXY https://mirrors.aliyun.com/goproxy/
ENV GOPROXY https://goproxy.io

RUN go install -v ./...

FROM node:8.16.0 AS frontend-build
FROM node:8.16.0-alpine AS frontend-build

ADD ./frontend /app
WORKDIR /app

# install frontend
RUN npm install -g yarn && yarn install --registry=https://registry.npm.taobao.org
RUN npm config set unsafe-perm true
RUN npm install -g yarn && yarn install --registry=https://registry.npm.taobao.org # --sass_binary_site=https://npm.taobao.org/mirrors/node-sass/

RUN npm run build:prod

# images
FROM ubuntu:latest

ADD . /app

# set as non-interactive
ENV DEBIAN_FRONTEND noninteractive

# install packages
RUN apt-get update \
	&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate python3 python3-pip \
RUN chmod 777 /tmp \
	&& apt-get update \
	&& apt-get install -y curl git net-tools iputils-ping ntp ntpdate python3 python3-pip nginx \
	&& ln -s /usr/bin/pip3 /usr/local/bin/pip \
	&& ln -s /usr/bin/python3 /usr/local/bin/python

# install backend
RUN pip install scrapy pymongo bs4 requests -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install scrapy pymongo bs4 requests crawlab-sdk scrapy-splash -i https://pypi.tuna.tsinghua.edu.cn/simple

# add files
ADD . /app

# copy backend files
COPY --from=backend-build /go/src/app .
COPY --from=backend-build /go/bin/crawlab /usr/local/bin

# install nginx
RUN apt-get -y install nginx

# copy frontend files
COPY --from=frontend-build /app/dist /app/dist
COPY --from=frontend-build /app/conf/crawlab.conf /etc/nginx/conf.d
@@ -56,4 +55,4 @@ EXPOSE 8080
EXPOSE 8000

# start backend
CMD ["/bin/sh", "/app/docker_init.sh"]
CMD ["/bin/bash", "/app/docker_init.sh"]
README-zh.md (153 lines changed)
@@ -1,39 +1,68 @@
|
||||
# Crawlab
|
||||
|
||||

|
||||

|
||||

|
||||

|
||||

|
||||

|
||||

|
||||
<p>
|
||||
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
|
||||
<img src="https://img.shields.io/docker/cloud/build/tikazyq/crawlab.svg?label=build&logo=docker">
|
||||
</a>
|
||||
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
|
||||
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
|
||||
<img src="https://img.shields.io/github/release/crawlab-team/crawlab.svg?logo=github">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/commits/master" target="_blank">
|
||||
<img src="https://img.shields.io/github/last-commit/crawlab-team/crawlab.svg">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Abug" target="_blank">
|
||||
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/bug.svg?label=bugs&color=red">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement" target="_blank">
|
||||
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/enhancement.svg?label=enhancements&color=cyan">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/blob/master/LICENSE" target="_blank">
|
||||
<img src="https://img.shields.io/github/license/crawlab-team/crawlab.svg">
|
||||
</a>
|
||||
</p>
|
||||
|
||||
中文 | [English](https://github.com/crawlab-team/crawlab)
|
||||
|
||||
[安装](#安装) | [运行](#运行) | [截图](#截图) | [架构](#架构) | [集成](#与其他框架的集成) | [比较](#与其他框架比较) | [相关文章](#相关文章) | [社区&赞助](#社区--赞助)
|
||||
[安装](#安装) | [运行](#运行) | [截图](#截图) | [架构](#架构) | [集成](#与其他框架的集成) | [比较](#与其他框架比较) | [相关文章](#相关文章) | [社区&赞助](#社区--赞助) | [更新日志](https://github.com/crawlab-team/crawlab/blob/master/CHANGELOG-zh.md) | [免责声明](https://github.com/crawlab-team/crawlab/blob/master/DISCLAIMER-zh.md)
|
||||
|
||||
基于Golang的分布式爬虫管理平台,支持Python、NodeJS、Go、Java、PHP等多种编程语言以及多种爬虫框架。
|
||||
|
||||
[查看演示 Demo](http://crawlab.cn/demo) | [文档](https://tikazyq.github.io/crawlab-docs)
|
||||
[查看演示 Demo](http://crawlab.cn/demo) | [文档](http://docs.crawlab.cn)
|
||||
|
||||
## 安装
|
||||
|
||||
三种方式:
|
||||
1. [Docker](https://tikazyq.github.io/crawlab-docs/Installation/Docker.html)(推荐)
|
||||
2. [直接部署](https://tikazyq.github.io/crawlab-docs/Installation/Direct.html)(了解内核)
|
||||
3. [Kubernetes](https://mp.weixin.qq.com/s/3Q1BQATUIEE_WXcHPqhYbA)
|
||||
1. [Docker](http://docs.crawlab.cn/Installation/Docker.html)(推荐)
|
||||
2. [直接部署](http://docs.crawlab.cn/Installation/Direct.html)(了解内核)
|
||||
3. [Kubernetes](https://juejin.im/post/5e0a02d851882549884c27ad) (多节点部署)
|
||||
|
||||
### 要求(Docker)
|
||||
- Docker 18.03+
|
||||
- Redis
|
||||
- Redis 5.x+
|
||||
- MongoDB 3.6+
|
||||
- Docker Compose 1.24+ (可选,但推荐)
|
||||
|
||||
### 要求(直接部署)
|
||||
- Go 1.12+
|
||||
- Node 8.12+
|
||||
- Redis
|
||||
- Redis 5.x+
|
||||
- MongoDB 3.6+
|
||||
|
||||
## 快速开始
|
||||
|
||||
请打开命令行并执行下列命令。请保证您已经提前安装了 `docker-compose`。
|
||||
|
||||
```bash
|
||||
git clone https://github.com/crawlab-team/crawlab
|
||||
cd crawlab
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
接下来,您可以看看 `docker-compose.yml` (包含详细配置参数),以及参考 [文档](http://docs.crawlab.cn) 来查看更多信息。
|
||||
|
||||
## 运行
|
||||
|
||||
### Docker
|
||||
@@ -47,13 +76,11 @@ services:
|
||||
image: tikazyq/crawlab:latest
|
||||
container_name: master
|
||||
environment:
|
||||
CRAWLAB_API_ADDRESS: "http://localhost:8000"
|
||||
CRAWLAB_SERVER_MASTER: "Y"
|
||||
CRAWLAB_MONGO_HOST: "mongo"
|
||||
CRAWLAB_REDIS_ADDRESS: "redis"
|
||||
ports:
|
||||
- "8080:8080" # frontend
|
||||
- "8000:8000" # backend
|
||||
- "8080:8080"
|
||||
depends_on:
|
||||
- mongo
|
||||
- redis
|
||||
@@ -111,9 +138,9 @@ Docker部署的详情,请见[相关文档](https://tikazyq.github.io/crawlab-d
|
||||
|
||||

|
||||
|
||||
#### 爬虫文件
|
||||
#### 爬虫文件编辑
|
||||
|
||||

|
||||

|
||||
|
||||
#### 任务详情 - 抓取结果
|
||||
|
||||
@@ -121,13 +148,21 @@ Docker部署的详情,请见[相关文档](https://tikazyq.github.io/crawlab-d
|
||||
|
||||
#### 定时任务
|
||||
|
||||

|
||||

|
||||
|
||||
#### 依赖安装
|
||||
|
||||

|
||||
|
||||
#### 消息通知
|
||||
|
||||
<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">
|
||||
|
||||
## 架构
|
||||
|
||||
Crawlab的架构包括了一个主节点(Master Node)和多个工作节点(Worker Node),以及负责通信和数据储存的Redis和MongoDB数据库。
|
||||
|
||||

|
||||

|
||||
|
||||
前端应用向主节点请求数据,主节点通过MongoDB和Redis来执行任务派发调度以及部署,工作节点收到任务之后,开始执行爬虫任务,并将任务结果储存到MongoDB。架构相对于`v0.3.0`之前的Celery版本有所精简,去除了不必要的节点监控模块Flower,节点监控主要由Redis完成。
|
||||
|
||||
@@ -162,37 +197,43 @@ Redis是非常受欢迎的Key-Value数据库,在Crawlab中主要实现节点
|
||||
|
||||
## 与其他框架的集成
|
||||
|
||||
[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) 提供了一些 `helper` 方法来让您的爬虫更好的集成到 Crawlab 中,例如保存结果数据到 Crawlab 中等等。
|
||||
|
||||
### 集成 Scrapy
|
||||
|
||||
在 `settings.py` 中找到 `ITEM_PIPELINES`(`dict` 类型的变量),在其中添加如下内容。
|
||||
|
||||
```python
|
||||
ITEM_PIPELINES = {
|
||||
'crawlab.pipelines.CrawlabMongoPipeline': 888,
|
||||
}
|
||||
```
|
||||
|
||||
然后,启动 Scrapy 爬虫,运行完成之后,您就应该能看到抓取结果出现在 **任务详情-结果** 里。
|
||||
|
||||
### 通用 Python 爬虫
|
||||
|
||||
将下列代码加入到您爬虫中的结果保存部分。
|
||||
|
||||
```python
|
||||
# 引入保存结果方法
|
||||
from crawlab import save_item
|
||||
|
||||
# 这是一个结果,需要为 dict 类型
|
||||
result = {'name': 'crawlab'}
|
||||
|
||||
# 调用保存结果方法
|
||||
save_item(result)
|
||||
```
|
||||
|
||||
然后,启动爬虫,运行完成之后,您就应该能看到抓取结果出现在 **任务详情-结果** 里。
|
||||
|
||||
### 其他框架和语言
|
||||
|
||||
爬虫任务本质上是由一个shell命令来实现的。任务ID将以环境变量`CRAWLAB_TASK_ID`的形式存在于爬虫任务运行的进程中,并以此来关联抓取数据。另外,`CRAWLAB_COLLECTION`是Crawlab传过来的所存放collection的名称。
|
||||
|
||||
在爬虫程序中,需要将`CRAWLAB_TASK_ID`的值以`task_id`作为可以存入数据库中`CRAWLAB_COLLECTION`的collection中。这样Crawlab就知道如何将爬虫任务与抓取数据关联起来了。当前,Crawlab只支持MongoDB。
|
||||
|
||||
### 集成Scrapy
|
||||
|
||||
以下是Crawlab跟Scrapy集成的例子,利用了Crawlab传过来的task_id和collection_name。
|
||||
|
||||
```python
|
||||
import os
|
||||
from pymongo import MongoClient
|
||||
|
||||
MONGO_HOST = '192.168.99.100'
|
||||
MONGO_PORT = 27017
|
||||
MONGO_DB = 'crawlab_test'
|
||||
|
||||
# scrapy example in the pipeline
|
||||
class JuejinPipeline(object):
|
||||
mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
|
||||
db = mongo[MONGO_DB]
|
||||
col_name = os.environ.get('CRAWLAB_COLLECTION')
|
||||
if not col_name:
|
||||
col_name = 'test'
|
||||
col = db[col_name]
|
||||
|
||||
def process_item(self, item, spider):
|
||||
item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
|
||||
self.col.save(item)
|
||||
return item
|
||||
```
|
||||
|
||||
## 与其他框架比较
|
||||
|
||||
现在已经有一些爬虫管理框架了,因此为啥还要用Crawlab?
|
||||
@@ -201,13 +242,12 @@ class JuejinPipeline(object):
|
||||
|
||||
Crawlab使用起来很方便,也很通用,可以适用于几乎任何主流语言和框架。它还有一个精美的前端界面,让用户可以方便的管理和运行爬虫。
|
||||
|
||||
|框架 | 类型 | 分布式 | 前端 | 依赖于Scrapyd |
|
||||
|:---:|:---:|:---:|:---:|:---:|
|
||||
| [Crawlab](https://github.com/crawlab-team/crawlab) | 管理平台 | Y | Y | N
|
||||
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | 管理平台 | Y | Y | Y
|
||||
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | 管理平台 | Y | Y | Y
|
||||
| [Gerapy](https://github.com/Gerapy/Gerapy) | 管理平台 | Y | Y | Y
|
||||
| [Scrapyd](https://github.com/scrapy/scrapyd) | 网络服务 | Y | N | N/A
|
||||
|框架 | 技术 | 优点 | 缺点 | Github 统计数据 |
|
||||
|:---|:---|:---|-----| :---- |
|
||||
| [Crawlab](https://github.com/crawlab-team/crawlab) | Golang + Vue|不局限于 scrapy,可以运行任何语言和框架的爬虫,精美的 UI 界面,天然支持分布式爬虫,支持节点管理、爬虫管理、任务管理、定时任务、结果导出、数据统计、消息通知、可配置爬虫、在线编辑代码等功能|暂时不支持爬虫版本管理|   |
|
||||
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | Python Flask + Vue|精美的 UI 界面,内置了 scrapy 日志解析器,有较多任务运行统计图表,支持节点管理、定时任务、邮件提醒、移动界面,算是 scrapy-based 中功能完善的爬虫管理平台|不支持 scrapy 以外的爬虫,Python Flask 为后端,性能上有一定局限性|   |
|
||||
| [Gerapy](https://github.com/Gerapy/Gerapy) | Python Django + Vue|Gerapy 是崔庆才大神开发的爬虫管理平台,安装部署非常简单,同样基于 scrapyd,有精美的 UI 界面,支持节点管理、代码编辑、可配置规则等功能|同样不支持 scrapy 以外的爬虫,而且据使用者反馈,1.0 版本有很多 bug,期待 2.0 版本会有一定程度的改进|   |
|
||||
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | Python Flask|基于 scrapyd,开源版 Scrapyhub,非常简洁的 UI 界面,支持定时任务|可能有些过于简洁了,不支持分页,不支持节点管理,不支持 scrapy 以外的爬虫|   |
|
||||
|
||||
## Q&A
|
||||
|
||||
@@ -254,6 +294,9 @@ Crawlab使用起来很方便,也很通用,可以适用于几乎任何主流
|
||||
<a href="https://github.com/hantmac">
|
||||
<img src="https://avatars2.githubusercontent.com/u/7600925?s=460&v=4" height="80">
|
||||
</a>
|
||||
<a href="https://github.com/duanbin0414">
|
||||
<img src="https://avatars3.githubusercontent.com/u/50389867?s=460&v=4" height="80">
|
||||
</a>
|
||||
|
||||
## 社区 & 赞助
|
||||
|
||||
|
||||
README.md (145 lines changed)
@@ -1,39 +1,68 @@
|
||||
# Crawlab
|
||||
|
||||

|
||||

|
||||

|
||||

|
||||

|
||||

|
||||

|
||||
<p>
|
||||
<a href="https://hub.docker.com/r/tikazyq/crawlab/builds" target="_blank">
|
||||
<img src="https://img.shields.io/docker/cloud/build/tikazyq/crawlab.svg?label=build&logo=docker">
|
||||
</a>
|
||||
<a href="https://hub.docker.com/r/tikazyq/crawlab" target="_blank">
|
||||
<img src="https://img.shields.io/docker/pulls/tikazyq/crawlab?label=pulls&logo=docker">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/releases" target="_blank">
|
||||
<img src="https://img.shields.io/github/release/crawlab-team/crawlab.svg?logo=github">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/commits/master" target="_blank">
|
||||
<img src="https://img.shields.io/github/last-commit/crawlab-team/crawlab.svg">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Abug" target="_blank">
|
||||
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/bug.svg?label=bugs&color=red">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/issues?q=is%3Aissue+is%3Aopen+label%3Aenhancement" target="_blank">
|
||||
<img src="https://img.shields.io/github/issues/crawlab-team/crawlab/enhancement.svg?label=enhancements&color=cyan">
|
||||
</a>
|
||||
<a href="https://github.com/crawlab-team/crawlab/blob/master/LICENSE" target="_blank">
|
||||
<img src="https://img.shields.io/github/license/crawlab-team/crawlab.svg">
|
||||
</a>
|
||||
</p>
|
||||
|
||||
[中文](https://github.com/crawlab-team/crawlab/blob/master/README-zh.md) | English
|
||||
|
||||
[Installation](#installation) | [Run](#run) | [Screenshot](#screenshot) | [Architecture](#architecture) | [Integration](#integration-with-other-frameworks) | [Compare](#comparison-with-other-frameworks) | [Community & Sponsorship](#community--sponsorship)
|
||||
[Installation](#installation) | [Run](#run) | [Screenshot](#screenshot) | [Architecture](#architecture) | [Integration](#integration-with-other-frameworks) | [Compare](#comparison-with-other-frameworks) | [Community & Sponsorship](#community--sponsorship) | [CHANGELOG](https://github.com/crawlab-team/crawlab/blob/master/CHANGELOG.md) | [Disclaimer](https://github.com/crawlab-team/crawlab/blob/master/DISCLAIMER.md)
|
||||
|
||||
Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, PHP and various web crawler frameworks including Scrapy, Puppeteer, Selenium.
|
||||
|
||||
[Demo](http://crawlab.cn/demo) | [Documentation](https://tikazyq.github.io/crawlab-docs)
|
||||
[Demo](http://crawlab.cn/demo) | [Documentation](http://docs.crawlab.cn)
|
||||
|
||||
## Installation
|
||||
|
||||
Three methods:
|
||||
1. [Docker](https://tikazyq.github.io/crawlab-docs/Installation/Docker.html) (Recommended)
|
||||
2. [Direct Deploy](https://tikazyq.github.io/crawlab-docs/Installation/Direct.html) (Check Internal Kernel)
|
||||
3. [Kubernetes](https://mp.weixin.qq.com/s/3Q1BQATUIEE_WXcHPqhYbA)
|
||||
1. [Docker](http://docs.crawlab.cn/Installation/Docker.html) (Recommended)
|
||||
2. [Direct Deploy](http://docs.crawlab.cn/Installation/Direct.html) (Check Internal Kernel)
|
||||
3. [Kubernetes](https://juejin.im/post/5e0a02d851882549884c27ad) (Multi-Node Deployment)
|
||||
|
||||
### Pre-requisite (Docker)
|
||||
- Docker 18.03+
|
||||
- Redis
|
||||
- Redis 5.x+
|
||||
- MongoDB 3.6+
|
||||
- Docker Compose 1.24+ (optional but recommended)
|
||||
|
||||
### Pre-requisite (Direct Deploy)
|
||||
- Go 1.12+
|
||||
- Node 8.12+
|
||||
- Redis
|
||||
- Redis 5.x+
|
||||
- MongoDB 3.6+
|
||||
|
||||
## Quick Start
|
||||
|
||||
Please open the command line prompt and execute the command below. Make sure you have installed `docker-compose` in advance.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/crawlab-team/crawlab
|
||||
cd crawlab
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
Next, you can look into the `docker-compose.yml` (with detailed config params) and the [Documentation (Chinese)](http://docs.crawlab.cn) for further information.
|
||||
|
||||
## Run
|
||||
|
||||
### Docker
|
||||
@@ -48,13 +77,11 @@ services:
|
||||
image: tikazyq/crawlab:latest
|
||||
container_name: master
|
||||
environment:
|
||||
CRAWLAB_API_ADDRESS: "http://localhost:8000"
|
||||
CRAWLAB_SERVER_MASTER: "Y"
|
||||
CRAWLAB_MONGO_HOST: "mongo"
|
||||
CRAWLAB_REDIS_ADDRESS: "redis"
|
||||
ports:
|
||||
- "8080:8080" # frontend
|
||||
- "8000:8000" # backend
|
||||
- "8080:8080"
|
||||
depends_on:
|
||||
- mongo
|
||||
- redis
|
||||
@@ -109,9 +136,9 @@ For Docker Deployment details, please refer to [relevant documentation](https://
|
||||
|
||||

|
||||
|
||||
#### Spider Files
|
||||
#### Spider File Edit
|
||||
|
||||

|
||||

|
||||
|
||||
#### Task Results
|
||||
|
||||
@@ -119,13 +146,21 @@ For Docker Deployment details, please refer to [relevant documentation](https://
|
||||
|
||||
#### Cron Job
|
||||
|
||||

|
||||

|
||||
|
||||
#### Dependency Installation
|
||||
|
||||

|
||||
|
||||
#### Notifications
|
||||
|
||||
<img src="http://static-docs.crawlab.cn/notification-mobile.jpeg" height="480px">
|
||||
|
||||
## Architecture
|
||||
|
||||
The architecture of Crawlab consists of a Master Node, multiple Worker Nodes, and Redis and MongoDB databases, which are mainly used for node communication and data storage.
|
||||
|
||||

|
||||

|
||||
|
||||
The frontend app makes requests to the Master Node, which assigns tasks and deploys spiders through MongoDB and Redis. When a Worker Node receives a task, it begins to execute the crawling task and stores the results in MongoDB. The architecture is much more concise compared with versions before `v0.3.0`: it removes the Flower module that used to provide node monitoring, which is now handled through Redis.
|
||||
|
||||
@@ -161,35 +196,43 @@ Frontend is a SPA based on
|
||||
|
||||
## Integration with Other Frameworks
|
||||
|
||||
A crawling task is actually executed through a shell command. The Task ID will be passed to the crawling task process in the form of environment variable named `CRAWLAB_TASK_ID`. By doing so, the data can be related to a task. Also, another environment variable `CRAWLAB_COLLECTION` is passed by Crawlab as the name of the collection to store results data.
|
||||
[Crawlab SDK](https://github.com/crawlab-team/crawlab-sdk) provides some `helper` methods to make it easier for you to integrate your spiders into Crawlab, e.g. saving results.
|
||||
|
||||
⚠️Note: make sure you have already installed `crawlab-sdk` using pip.
|
||||
|
||||
### Scrapy
|
||||
|
||||
Below is an example to integrate Crawlab with Scrapy in pipelines.
|
||||
In `settings.py` in your Scrapy project, find the variable named `ITEM_PIPELINES` (a `dict` variable). Add content below.
|
||||
|
||||
```python
|
||||
import os
|
||||
from pymongo import MongoClient
|
||||
|
||||
MONGO_HOST = '192.168.99.100'
|
||||
MONGO_PORT = 27017
|
||||
MONGO_DB = 'crawlab_test'
|
||||
|
||||
# scrapy example in the pipeline
|
||||
class JuejinPipeline(object):
|
||||
mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
|
||||
db = mongo[MONGO_DB]
|
||||
col_name = os.environ.get('CRAWLAB_COLLECTION')
|
||||
if not col_name:
|
||||
col_name = 'test'
|
||||
col = db[col_name]
|
||||
|
||||
def process_item(self, item, spider):
|
||||
item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
|
||||
self.col.save(item)
|
||||
return item
|
||||
ITEM_PIPELINES = {
|
||||
'crawlab.pipelines.CrawlabMongoPipeline': 888,
|
||||
}
|
||||
```
|
||||
|
||||
Then, start the Scrapy spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**
|
||||
|
||||
### General Python Spider
|
||||
|
||||
Please add below content to your spider files to save results.
|
||||
|
||||
```python
|
||||
# import result saving method
|
||||
from crawlab import save_item
|
||||
|
||||
# this is a result record, must be dict type
|
||||
result = {'name': 'crawlab'}
|
||||
|
||||
# call result saving method
|
||||
save_item(result)
|
||||
```
|
||||
|
||||
Then, start the spider. After it's done, you should be able to see scraped results in **Task Detail -> Result**
|
||||
|
||||
### Other Frameworks / Languages
|
||||
|
||||
A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process as an environment variable named `CRAWLAB_TASK_ID`; by saving it with each record, the scraped data can be related to the task. Another environment variable, `CRAWLAB_COLLECTION`, is passed by Crawlab as the name of the collection in which to store the results data. A minimal sketch of this pattern follows.
|
||||
|
||||
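For illustration, here is a minimal sketch of how a spider written in another language (Go in this case) might use these two variables. The MongoDB URI, database name and fallback collection name below are assumptions for the sketch, not values provided by Crawlab; point them at the same MongoDB instance your Crawlab deployment uses.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Both variables are injected by Crawlab into the task process.
	taskID := os.Getenv("CRAWLAB_TASK_ID")
	colName := os.Getenv("CRAWLAB_COLLECTION")
	if colName == "" {
		colName = "results" // fallback for local runs outside Crawlab
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Tagging each record with task_id is what links the data to the task.
	item := bson.M{"name": "crawlab", "task_id": taskID}
	col := client.Database("crawlab_test").Collection(colName)
	if _, err := col.InsertOne(ctx, item); err != nil {
		log.Fatal(err)
	}
}
```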
## Comparison with Other Frameworks
|
||||
|
||||
There are existing spider management frameworks. So why use Crawlab?
|
||||
@@ -198,13 +241,12 @@ The reason is that most of the existing platforms are depending on Scrapyd, whic
|
||||
|
||||
Crawlab is easy to use and general enough to adapt to spiders in any language and any framework. It also has a beautiful frontend interface that lets users manage spiders much more easily.
|
||||
|
||||
|Framework | Type | Distributed | Frontend | Scrapyd-Dependent |
|
||||
|:---:|:---:|:---:|:---:|:---:|
|
||||
| [Crawlab](https://github.com/crawlab-team/crawlab) | Admin Platform | Y | Y | N
|
||||
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | Admin Platform | Y | Y | Y
|
||||
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | Admin Platform | Y | Y | Y
|
||||
| [Gerapy](https://github.com/Gerapy/Gerapy) | Admin Platform | Y | Y | Y
|
||||
| [Scrapyd](https://github.com/scrapy/scrapyd) | Web Service | Y | N | N/A
|
||||
|Framework | Technology | Pros | Cons | Github Stats |
|
||||
|:---|:---|:---|-----| :---- |
|
||||
| [Crawlab](https://github.com/crawlab-team/crawlab) | Golang + Vue|Not limited to Scrapy, available for all programming languages and frameworks. Beautiful UI interface. Naturally supports distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, online code editor, etc.|Does not yet support spider versioning|   |
|
||||
| [ScrapydWeb](https://github.com/my8100/scrapydweb) | Python Flask + Vue|Beautiful UI interface, built-in Scrapy log parser, stats and graphs for task execution, support node management, cron job, mail notification, mobile. Full-feature spider management platform.|Not support spiders other than Scrapy. Limited performance because of Python Flask backend.|   |
|
||||
| [Gerapy](https://github.com/Gerapy/Gerapy) | Python Django + Vue|Gerapy is built by web crawler guru [Germey Cui](https://github.com/Germey). Simple installation and deployment. Beautiful UI interface. Support node management, code edit, configurable crawl rules, etc.|Again not support spiders other than Scrapy. A lot of bugs based on user feedback in v1.0. Look forward to improvement in v2.0|   |
|
||||
| [SpiderKeeper](https://github.com/DormyMo/SpiderKeeper) | Python Flask|Open-source Scrapyhub. Concise and simple UI interface. Support cron job.|Perhaps too simplified, not support pagination, not support node management, not support spiders other than Scrapy.|   |
|
||||
|
||||
## Contributors
|
||||
<a href="https://github.com/tikazyq">
|
||||
@@ -219,6 +261,9 @@ Crawlab is easy to use, general enough to adapt spiders in any language and any
|
||||
<a href="https://github.com/hantmac">
|
||||
<img src="https://avatars2.githubusercontent.com/u/7600925?s=460&v=4" height="80">
|
||||
</a>
|
||||
<a href="https://github.com/duanbin0414">
|
||||
<img src="https://avatars3.githubusercontent.com/u/50389867?s=460&v=4" height="80">
|
||||
</a>
|
||||
|
||||
## Community & Sponsorship
|
||||
|
||||
|
||||
@@ -15,20 +15,35 @@ redis:
|
||||
log:
|
||||
level: info
|
||||
path: "/var/logs/crawlab"
|
||||
isDeletePeriodically: "Y"
|
||||
isDeletePeriodically: "N"
|
||||
deleteFrequency: "@hourly"
|
||||
server:
|
||||
host: 0.0.0.0
|
||||
port: 8000
|
||||
master: "N"
|
||||
master: "Y"
|
||||
secret: "crawlab"
|
||||
register:
|
||||
# MAC address or IP address; if IP is used, the IP must be specified manually below
|
||||
type: "mac"
|
||||
ip: ""
|
||||
lang: # language environments to install: Y = install, N = do not install; only effective in Docker
|
||||
python: "Y"
|
||||
node: "N"
|
||||
spider:
|
||||
path: "/app/spiders"
|
||||
task:
|
||||
workers: 4
|
||||
other:
|
||||
tmppath: "/tmp"
|
||||
version: 0.4.5
|
||||
setting:
|
||||
allowRegister: "N"
|
||||
notification:
|
||||
mail:
|
||||
server: ''
|
||||
port: ''
|
||||
senderEmail: ''
|
||||
senderIdentity: ''
|
||||
smtp:
|
||||
user: ''
|
||||
password: ''
|
||||
@@ -28,7 +28,7 @@ func (c *Config) Init() error {
|
||||
}
|
||||
viper.SetConfigType("yaml") // 设置配置文件格式为YAML
|
||||
viper.AutomaticEnv() // 读取匹配的环境变量
|
||||
viper.SetEnvPrefix("CRAWLAB") // 读取环境变量的前缀为APISERVER
|
||||
viper.SetEnvPrefix("CRAWLAB") // 读取环境变量的前缀为CRAWLAB
|
||||
replacer := strings.NewReplacer(".", "_")
|
||||
viper.SetEnvKeyReplacer(replacer)
|
||||
if err := viper.ReadInConfig(); err != nil { // viper解析配置文件
|
||||
|
||||
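To make the prefix and key replacer above concrete: with `AutomaticEnv`, the `CRAWLAB` prefix and the `.` → `_` replacer, any configuration key can be overridden by a matching environment variable, which is how the `CRAWLAB_*` variables set in `docker-compose.yml` end up in the configuration. A two-line sketch with hypothetical values, using the same viper instance configured above:

```go
os.Setenv("CRAWLAB_MONGO_HOST", "mongo") // what docker-compose does via its `environment:` section

// the lookup key "mongo.host" is prefixed and rewritten to CRAWLAB_MONGO_HOST,
// so the environment value overrides whatever config.yml contains
host := viper.GetString("mongo.host") // "mongo"
_ = host
```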
backend/constants/anchor.go (new file, 8 lines)
@@ -0,0 +1,8 @@
|
||||
package constants
|
||||
|
||||
const (
|
||||
AnchorStartStage = "START_STAGE"
|
||||
AnchorStartUrl = "START_URL"
|
||||
AnchorItems = "ITEMS"
|
||||
AnchorParsers = "PARSERS"
|
||||
)
|
||||
backend/constants/common.go (new file, 6 lines)
@@ -0,0 +1,6 @@
|
||||
package constants
|
||||
|
||||
const (
|
||||
ASCENDING = "ascending"
|
||||
DESCENDING = "descending"
|
||||
)
|
||||
backend/constants/config_spider.go (new file, 6 lines)
@@ -0,0 +1,6 @@
|
||||
package constants
|
||||
|
||||
const (
|
||||
EngineScrapy = "scrapy"
|
||||
EngineColly = "colly"
|
||||
)
|
||||
backend/constants/notification.go (new file, 13 lines)
@@ -0,0 +1,13 @@
|
||||
package constants
|
||||
|
||||
const (
|
||||
NotificationTriggerOnTaskEnd = "notification_trigger_on_task_end"
|
||||
NotificationTriggerOnTaskError = "notification_trigger_on_task_error"
|
||||
NotificationTriggerNever = "notification_trigger_never"
|
||||
)
|
||||
|
||||
const (
|
||||
NotificationTypeMail = "notification_type_mail"
|
||||
NotificationTypeDingTalk = "notification_type_ding_talk"
|
||||
NotificationTypeWechat = "notification_type_wechat"
|
||||
)
|
||||
backend/constants/rpc.go (new file, 9 lines)
@@ -0,0 +1,9 @@
|
||||
package constants
|
||||
|
||||
const (
|
||||
RpcInstallLang = "install_lang"
|
||||
RpcInstallDep = "install_dep"
|
||||
RpcUninstallDep = "uninstall_dep"
|
||||
RpcGetDepList = "get_dep_list"
|
||||
RpcGetInstalledDepList = "get_installed_dep_list"
|
||||
)
|
||||
backend/constants/schedule.go (new file, 10 lines)
@@ -0,0 +1,10 @@
|
||||
package constants
|
||||
|
||||
const (
|
||||
ScheduleStatusStop = "stopped"
|
||||
ScheduleStatusRunning = "running"
|
||||
ScheduleStatusError = "error"
|
||||
|
||||
ScheduleStatusErrorNotFoundNode = "Not Found Node"
|
||||
ScheduleStatusErrorNotFoundSpider = "Not Found Spider"
|
||||
)
|
||||
backend/constants/scrapy.go (new file, 5 lines)
@@ -0,0 +1,5 @@
|
||||
package constants
|
||||
|
||||
const ScrapyProtectedStageNames = ""
|
||||
|
||||
const ScrapyProtectedFieldNames = "_id,task_id,ts"
|
||||
@@ -3,4 +3,5 @@ package constants
|
||||
const (
|
||||
Customized = "customized"
|
||||
Configurable = "configurable"
|
||||
Plugin = "plugin"
|
||||
)
|
||||
|
||||
@@ -5,3 +5,9 @@ const (
|
||||
Linux = "linux"
|
||||
Darwin = "darwin"
|
||||
)
|
||||
|
||||
const (
|
||||
Python = "python"
|
||||
Nodejs = "node"
|
||||
Java = "java"
|
||||
)
|
||||
|
||||
@@ -19,3 +19,9 @@ const (
|
||||
TaskFinish string = "finish"
|
||||
TaskCancel string = "cancel"
|
||||
)
|
||||
|
||||
const (
|
||||
RunTypeAllNodes string = "all-nodes"
|
||||
RunTypeRandom string = "random"
|
||||
RunTypeSelectedNodes string = "selected-nodes"
|
||||
)
|
||||
|
||||
@@ -61,11 +61,46 @@ func InitMongo() error {
|
||||
dialInfo.Password = mongoPassword
|
||||
dialInfo.Source = mongoAuth
|
||||
}
|
||||
sess, err := mgo.DialWithInfo(&dialInfo)
|
||||
if err != nil {
|
||||
return err
|
||||
|
||||
	// mongo session
	var sess *mgo.Session

	// error counter
	errNum := 0

	// retry connecting to mongo
	for {
		var err error

		// connect to mongo
		sess, err = mgo.DialWithInfo(&dialInfo)

		if err != nil {
			// on a connection error, sleep for 1 second and increment the error counter
			time.Sleep(1 * time.Second)
			errNum++

			// if more than 30 errors have occurred, return the error
			if errNum >= 30 {
				return err
			}
		} else {
			// no error, exit the loop
			break
		}
	}

	// assign to the global mongo session
	Session = sess
}
|
||||
//Add Unique index for 'key'
|
||||
keyIndex := mgo.Index{
|
||||
Key: []string{"key"},
|
||||
Unique: true,
|
||||
}
|
||||
s, c := GetCol("nodes")
|
||||
defer s.Close()
|
||||
c.EnsureIndex(keyIndex)
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
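The index setup above relies on the package's `GetCol` helper, which hands back a copied session together with the collection and leaves it to the caller to close the session. As a hedged illustration of that pattern, here is a hypothetical read helper (not part of the diff), written against the same mgo-style API and assuming the `bson` package already imported by this file:

```go
// getNodeByKey is a hypothetical helper showing the GetCol usage pattern:
// copy a session, query the collection, and close the session via defer.
func getNodeByKey(key string) (bson.M, error) {
	s, c := GetCol("nodes")
	defer s.Close()

	var node bson.M
	if err := c.Find(bson.M{"key": key}).One(&node); err != nil {
		return nil, err // mgo.ErrNotFound when no node matches
	}
	return node, nil
}
```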
@@ -58,9 +58,9 @@ func (r *Redis) subscribe(ctx context.Context, consume ConsumeFunc, channel ...s
|
||||
}
|
||||
done <- nil
|
||||
case <-tick.C:
|
||||
//fmt.Printf("ping message \n")
|
||||
if err := psc.Ping(""); err != nil {
|
||||
done <- err
|
||||
fmt.Printf("ping message error: %s \n", err)
|
||||
//done <- err
|
||||
}
|
||||
case err := <-done:
|
||||
close(done)
|
||||
|
||||
@@ -4,10 +4,12 @@ import (
|
||||
"context"
|
||||
"crawlab/entity"
|
||||
"crawlab/utils"
|
||||
"errors"
|
||||
"github.com/apex/log"
|
||||
"github.com/gomodule/redigo/redis"
|
||||
"github.com/spf13/viper"
|
||||
"runtime/debug"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
@@ -17,14 +19,36 @@ type Redis struct {
|
||||
pool *redis.Pool
|
||||
}
|
||||
|
||||
type Mutex struct {
|
||||
Name string
|
||||
expiry time.Duration
|
||||
tries int
|
||||
delay time.Duration
|
||||
value string
|
||||
}
|
||||
|
||||
func NewRedisClient() *Redis {
|
||||
return &Redis{pool: NewRedisPool()}
|
||||
}
|
||||
|
||||
func (r *Redis) RPush(collection string, value interface{}) error {
|
||||
c := r.pool.Get()
|
||||
defer utils.Close(c)
|
||||
|
||||
if _, err := c.Do("RPUSH", collection, value); err != nil {
|
||||
log.Error(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (r *Redis) LPush(collection string, value interface{}) error {
	c := r.pool.Get()
	defer utils.Close(c)

	// LPUSH pushes the value onto the head of the list
	if _, err := c.Do("LPUSH", collection, value); err != nil {
		log.Error(err.Error())
		debug.PrintStack()
		return err
	}
|
||||
@@ -47,6 +71,7 @@ func (r *Redis) HSet(collection string, key string, value string) error {
|
||||
defer utils.Close(c)
|
||||
|
||||
if _, err := c.Do("HSET", collection, key, value); err != nil {
|
||||
log.Error(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
@@ -58,7 +83,9 @@ func (r *Redis) HGet(collection string, key string) (string, error) {
|
||||
defer utils.Close(c)
|
||||
|
||||
value, err2 := redis.String(c.Do("HGET", collection, key))
|
||||
if err2 != nil {
|
||||
if err2 != nil && err2 != redis.ErrNil {
|
||||
log.Error(err2.Error())
|
||||
debug.PrintStack()
|
||||
return value, err2
|
||||
}
|
||||
return value, nil
|
||||
@@ -69,6 +96,8 @@ func (r *Redis) HDel(collection string, key string) error {
|
||||
defer utils.Close(c)
|
||||
|
||||
if _, err := c.Do("HDEL", collection, key); err != nil {
|
||||
log.Error(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
@@ -80,11 +109,27 @@ func (r *Redis) HKeys(collection string) ([]string, error) {
|
||||
|
||||
value, err2 := redis.Strings(c.Do("HKeys", collection))
|
||||
if err2 != nil {
|
||||
log.Error(err2.Error())
|
||||
debug.PrintStack()
|
||||
return []string{}, err2
|
||||
}
|
||||
return value, nil
|
||||
}
|
||||
|
||||
func (r *Redis) BRPop(collection string, timeout int) (string, error) {
|
||||
if timeout <= 0 {
|
||||
timeout = 60
|
||||
}
|
||||
c := r.pool.Get()
|
||||
defer utils.Close(c)
|
||||
|
||||
values, err := redis.Strings(c.Do("BRPOP", collection, timeout))
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
return values[1], nil
|
||||
}
|
||||
|
||||
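Taken together, `RPush` and `BRPop` form a simple Redis list queue: a producer pushes serialized messages onto the list, and a consumer blocks until a message arrives or the timeout expires. The worker loop below is a hypothetical sketch of that usage, not code from the repository; with redigo, a timed-out BRPOP surfaces as `redis.ErrNil`.

```go
// consumeQueue blocks on the given Redis list and hands each message to handle.
func consumeQueue(r *Redis, queue string, handle func(msg string)) {
	for {
		msg, err := r.BRPop(queue, 60) // wait up to 60 seconds for a message
		if err != nil {
			// typically redis.ErrNil: the timeout elapsed with an empty queue
			continue
		}
		handle(msg)
	}
}
```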
func NewRedisPool() *redis.Pool {
|
||||
var address = viper.GetString("redis.address")
|
||||
var port = viper.GetString("redis.port")
|
||||
@@ -101,7 +146,7 @@ func NewRedisPool() *redis.Pool {
|
||||
Dial: func() (conn redis.Conn, e error) {
|
||||
return redis.DialURL(url,
|
||||
redis.DialConnectTimeout(time.Second*10),
|
||||
redis.DialReadTimeout(time.Second*10),
|
||||
redis.DialReadTimeout(time.Second*600),
|
||||
redis.DialWriteTimeout(time.Second*10),
|
||||
)
|
||||
},
|
||||
@@ -143,3 +188,59 @@ func Sub(channel string, consume ConsumeFunc) error {
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// 构建同步锁key
|
||||
func (r *Redis) getLockKey(lockKey string) string {
|
||||
lockKey = strings.ReplaceAll(lockKey, ":", "-")
|
||||
return "nodes:lock:" + lockKey
|
||||
}
|
||||
|
||||
// 获得锁
|
||||
func (r *Redis) Lock(lockKey string) (int64, error) {
|
||||
c := r.pool.Get()
|
||||
defer utils.Close(c)
|
||||
lockKey = r.getLockKey(lockKey)
|
||||
|
||||
ts := time.Now().Unix()
|
||||
ok, err := c.Do("SET", lockKey, ts, "NX", "PX", 30000)
|
||||
if err != nil {
|
||||
log.Errorf("get lock fail with error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
return 0, err
|
||||
}
|
||||
if err == nil && ok == nil {
|
||||
log.Errorf("the lockKey is locked: key=%s", lockKey)
|
||||
return 0, errors.New("the lockKey is locked")
|
||||
}
|
||||
return ts, nil
|
||||
}
|
||||
|
||||
func (r *Redis) UnLock(lockKey string, value int64) {
|
||||
c := r.pool.Get()
|
||||
defer utils.Close(c)
|
||||
lockKey = r.getLockKey(lockKey)
|
||||
|
||||
getValue, err := redis.Int64(c.Do("GET", lockKey))
|
||||
if err != nil {
|
||||
log.Errorf("get lockKey error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
if getValue != value {
|
||||
log.Errorf("the lockKey value diff: %d, %d", value, getValue)
|
||||
return
|
||||
}
|
||||
|
||||
v, err := redis.Int64(c.Do("DEL", lockKey))
|
||||
if err != nil {
|
||||
log.Errorf("unlock failed, error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
if v == 0 {
|
||||
log.Errorf("unlock failed: key=%s", lockKey)
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
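UnLock above releases the lock with three separate commands (GET, compare, DEL), so another client could in principle acquire the key between the GET and the DEL. A common alternative from the standard SET-NX locking pattern is to do the compare-and-delete inside one Lua script. The following is a sketch of that variant, not Crawlab's code; the lock key used in main is made up.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

// unlockAtomically deletes the lock only if it still holds the value we set,
// doing the compare and the delete inside one Lua script so no other client
// can grab the lock between a GET and a DEL.
func unlockAtomically(c redis.Conn, lockKey string, value int64) (bool, error) {
	const script = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end`
	n, err := redis.Int(c.Do("EVAL", script, 1, lockKey, value))
	if err != nil {
		return false, err
	}
	return n == 1, nil
}

func main() {
	c, err := redis.Dial("tcp", "localhost:6379")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	ts := time.Now().Unix()
	if _, err := c.Do("SET", "nodes:lock:demo", ts, "NX", "PX", 30000); err != nil {
		log.Fatal(err)
	}
	ok, err := unlockAtomically(c, "nodes:lock:demo", ts)
	fmt.Println(ok, err) // true <nil> while we still own the lock
}
```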
@@ -3,15 +3,15 @@ package entity
|
||||
import "strconv"
|
||||
|
||||
type Page struct {
|
||||
Skip int
|
||||
Limit int
|
||||
PageNum int
|
||||
Skip int
|
||||
Limit int
|
||||
PageNum int
|
||||
PageSize int
|
||||
}
|
||||
|
||||
func (p *Page)GetPage(pageNum string, pageSize string) {
|
||||
func (p *Page) GetPage(pageNum string, pageSize string) {
|
||||
p.PageNum, _ = strconv.Atoi(pageNum)
|
||||
p.PageSize, _ = strconv.Atoi(pageSize)
|
||||
p.Skip = p.PageSize * (p.PageNum - 1)
|
||||
p.Limit = p.PageSize
|
||||
}
|
||||
}
|
||||
|
||||
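GetPage converts the query-string page number and page size into Skip/Limit values for MongoDB, with pages numbered from 1 (an unparsable value silently becomes 0 because the strconv errors are discarded). A small usage sketch, assuming it is built inside the crawlab module:

```go
package main

import (
	"fmt"

	"crawlab/entity"
)

func main() {
	var p entity.Page
	p.GetPage("3", "20")          // page 3, 20 items per page
	fmt.Println(p.Skip, p.Limit)  // 40 20: skip the first two pages, return 20 docs
}
```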
40
backend/entity/config_spider.go
Normal file
@@ -0,0 +1,40 @@
|
||||
package entity
|
||||
|
||||
type ConfigSpiderData struct {
|
||||
// 通用
|
||||
Name string `yaml:"name" json:"name"`
|
||||
DisplayName string `yaml:"display_name" json:"display_name"`
|
||||
Col string `yaml:"col" json:"col"`
|
||||
Remark string `yaml:"remark" json:"remark"`
|
||||
Type string `yaml:"type" bson:"type"`
|
||||
|
||||
// 可配置爬虫
|
||||
Engine string `yaml:"engine" json:"engine"`
|
||||
StartUrl string `yaml:"start_url" json:"start_url"`
|
||||
StartStage string `yaml:"start_stage" json:"start_stage"`
|
||||
Stages []Stage `yaml:"stages" json:"stages"`
|
||||
Settings map[string]string `yaml:"settings" json:"settings"`
|
||||
|
||||
// 自定义爬虫
|
||||
Cmd string `yaml:"cmd" json:"cmd"`
|
||||
}
|
||||
|
||||
type Stage struct {
|
||||
Name string `yaml:"name" json:"name"`
|
||||
IsList bool `yaml:"is_list" json:"is_list"`
|
||||
ListCss string `yaml:"list_css" json:"list_css"`
|
||||
ListXpath string `yaml:"list_xpath" json:"list_xpath"`
|
||||
PageCss string `yaml:"page_css" json:"page_css"`
|
||||
PageXpath string `yaml:"page_xpath" json:"page_xpath"`
|
||||
PageAttr string `yaml:"page_attr" json:"page_attr"`
|
||||
Fields []Field `yaml:"fields" json:"fields"`
|
||||
}
|
||||
|
||||
type Field struct {
|
||||
Name string `yaml:"name" json:"name"`
|
||||
Css string `yaml:"css" json:"css"`
|
||||
Xpath string `yaml:"xpath" json:"xpath"`
|
||||
Attr string `yaml:"attr" json:"attr"`
|
||||
NextStage string `yaml:"next_stage" json:"next_stage"`
|
||||
Remark string `yaml:"remark" json:"remark"`
|
||||
}
|
||||
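Because every field carries a yaml tag, a Spiderfile can be unmarshalled directly into ConfigSpiderData with gopkg.in/yaml.v2. A hedged sketch with an invented two-stage configuration; the site, selectors and field names are made up, only the keys follow the struct tags above.

```go
package main

import (
	"fmt"

	"crawlab/entity"
	"gopkg.in/yaml.v2"
)

const spiderfile = `
name: quotes
display_name: Quotes
col: results_quotes
engine: scrapy
start_url: http://quotes.example.com
start_stage: list
stages:
  - name: list
    is_list: true
    list_css: .quote
    page_css: li.next > a
    fields:
      - name: text
        css: span.text
      - name: detail_url
        css: a.detail
        attr: href
        next_stage: detail
  - name: detail
    fields:
      - name: author
        css: .author-name
`

func main() {
	var data entity.ConfigSpiderData
	if err := yaml.Unmarshal([]byte(spiderfile), &data); err != nil {
		panic(err)
	}
	fmt.Println(data.Name, data.StartStage, len(data.Stages)) // quotes list 2
}
```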
@@ -13,3 +13,18 @@ type Executable struct {
|
||||
FileName string `json:"file_name"`
|
||||
DisplayName string `json:"display_name"`
|
||||
}
|
||||
|
||||
type Lang struct {
|
||||
Name string `json:"name"`
|
||||
ExecutableName string `json:"executable_name"`
|
||||
ExecutablePath string `json:"executable_path"`
|
||||
DepExecutablePath string `json:"dep_executable_path"`
|
||||
Installed bool `json:"installed"`
|
||||
}
|
||||
|
||||
type Dependency struct {
|
||||
Name string `json:"name"`
|
||||
Version string `json:"version"`
|
||||
Description string `json:"description"`
|
||||
Installed bool `json:"installed"`
|
||||
}
|
||||
|
||||
@@ -11,10 +11,18 @@ require (
|
||||
github.com/go-playground/locales v0.12.1 // indirect
|
||||
github.com/go-playground/universal-translator v0.16.0 // indirect
|
||||
github.com/gomodule/redigo v2.0.0+incompatible
|
||||
github.com/imroc/req v0.2.4
|
||||
github.com/leodido/go-urn v1.1.0 // indirect
|
||||
github.com/matcornic/hermes v1.2.0
|
||||
github.com/matcornic/hermes/v2 v2.0.2 // indirect
|
||||
github.com/pkg/errors v0.8.1
|
||||
github.com/royeo/dingrobot v1.0.0
|
||||
github.com/satori/go.uuid v1.2.0
|
||||
github.com/smartystreets/goconvey v0.0.0-20190731233626-505e41936337
|
||||
github.com/spf13/viper v1.4.0
|
||||
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc // indirect
|
||||
gopkg.in/go-playground/validator.v9 v9.29.1
|
||||
gopkg.in/gomail.v2 v2.0.0-20150902115704-41f357289737
|
||||
gopkg.in/russross/blackfriday.v2 v2.0.0 // indirect
|
||||
gopkg.in/yaml.v2 v2.2.2
|
||||
)
|
||||
|
||||
@@ -1,9 +1,15 @@
|
||||
cloud.google.com/go v0.26.0/go.mod h1:aQUYkXzVsufM+DwF1aE+0xfcU+56JwCaLick0ClmMTw=
|
||||
github.com/BurntSushi/toml v0.3.1 h1:WXkYYl6Yr3qBf1K79EBnL4mak0OimBfB0XUf9Vl28OQ=
|
||||
github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU=
|
||||
github.com/Masterminds/semver v1.4.2 h1:WBLTQ37jOCzSLtXNdoo8bNM8876KhNqOKvrlGITgsTc=
|
||||
github.com/Masterminds/semver v1.4.2/go.mod h1:MB6lktGJrhw8PrUyiEoblNEGEQ+RzHPF078ddwwvV3Y=
|
||||
github.com/Masterminds/sprig v2.16.0+incompatible h1:QZbMUPxRQ50EKAq3LFMnxddMu88/EUUG3qmxwtDmPsY=
|
||||
github.com/Masterminds/sprig v2.16.0+incompatible/go.mod h1:y6hNFY5UBTIWBxnzTeuNhlNS5hqE0NB0E6fgfo2Br3o=
|
||||
github.com/OneOfOne/xxhash v1.2.2/go.mod h1:HSdplMjZKSmBqAxg5vPj2TmRDmfkzw+cTzAElWljhcU=
|
||||
github.com/alecthomas/template v0.0.0-20160405071501-a0175ee3bccc/go.mod h1:LOuyumcjzFXgccqObfd/Ljyb9UuFJ6TxHnclSeseNhc=
|
||||
github.com/alecthomas/units v0.0.0-20151022065526-2efee857e7cf/go.mod h1:ybxpYRFXyAe+OPACYpWeL0wqObRcbAqCMya13uyzqw0=
|
||||
github.com/aokoli/goutils v1.0.1 h1:7fpzNGoJ3VA8qcrm++XEE1QUe0mIwNeLa02Nwq7RDkg=
|
||||
github.com/aokoli/goutils v1.0.1/go.mod h1:SijmP0QR8LtwsmDs8Yii5Z/S4trXFGFC2oO5g9DP+DQ=
|
||||
github.com/apex/log v1.1.1 h1:BwhRZ0qbjYtTob0I+2M+smavV0kOC8XgcnGZcyL9liA=
|
||||
github.com/apex/log v1.1.1/go.mod h1:Ls949n1HFtXfbDcjiTTFQqkVUrte0puoIBfO3SVgwOA=
|
||||
github.com/aphistic/golf v0.0.0-20180712155816-02c07f170c5a/go.mod h1:3NqKYiepwy8kCu4PNA+aP7WUV72eXWJeP9/r3/K9aLE=
|
||||
@@ -56,6 +62,8 @@ github.com/gomodule/redigo v2.0.0+incompatible h1:K/R+8tc58AaqLkqG2Ol3Qk+DR/TlNu
|
||||
github.com/gomodule/redigo v2.0.0+incompatible/go.mod h1:B4C85qUVwatsJoIUNIfCRsp7qO0iAmpGFZ4EELWSbC4=
|
||||
github.com/google/btree v1.0.0/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
|
||||
github.com/google/go-cmp v0.2.0/go.mod h1:oXzfMopK8JAjlY9xF4vHSVASa0yLyX7SntLO5aqRK0M=
|
||||
github.com/google/uuid v1.0.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
|
||||
github.com/google/uuid v1.1.1 h1:Gkbcsh/GbpXz7lPftLA3P6TYMwjCLYm83jiFQZF/3gY=
|
||||
github.com/google/uuid v1.1.1/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
|
||||
github.com/gopherjs/gopherjs v0.0.0-20181017120253-0766667cb4d1 h1:EGx4pi6eqNxGaHF6qqu48+N2wcFQ5qg5FXgOdqsJ5d8=
|
||||
github.com/gopherjs/gopherjs v0.0.0-20181017120253-0766667cb4d1/go.mod h1:wJfORRmW1u3UXTncJ5qlYoELFm8eSnnEO6hX4iZ3EWY=
|
||||
@@ -66,6 +74,14 @@ github.com/grpc-ecosystem/grpc-gateway v1.9.0/go.mod h1:vNeuVxBJEsws4ogUvrchl83t
|
||||
github.com/hashicorp/hcl v1.0.0 h1:0Anlzjpi4vEasTeNFn2mLJgTSwt0+6sfsiTG8qcWGx4=
|
||||
github.com/hashicorp/hcl v1.0.0/go.mod h1:E5yfLk+7swimpb2L/Alb/PJmXilQ/rhwaUYs4T20WEQ=
|
||||
github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU=
|
||||
github.com/huandu/xstrings v1.2.0 h1:yPeWdRnmynF7p+lLYz0H2tthW9lqhMJrQV/U7yy4wX0=
|
||||
github.com/huandu/xstrings v1.2.0/go.mod h1:DvyZB1rfVYsBIigL8HwpZgxHwXozlTgGqn63UyNX5k4=
|
||||
github.com/imdario/mergo v0.3.6 h1:xTNEAn+kxVO7dTZGu0CegyqKZmoWFI0rF8UxjlB2d28=
|
||||
github.com/imdario/mergo v0.3.6/go.mod h1:2EnlNZ0deacrJVfApfmtdGgDfMuh/nq6Ok1EcJh5FfA=
|
||||
github.com/imroc/req v0.2.4 h1:8XbvaQpERLAJV6as/cB186DtH5f0m5zAOtHEaTQ4ac0=
|
||||
github.com/imroc/req v0.2.4/go.mod h1:J9FsaNHDTIVyW/b5r6/Df5qKEEEq2WzZKIgKSajd1AE=
|
||||
github.com/jaytaylor/html2text v0.0.0-20180606194806-57d518f124b0 h1:xqgexXAGQgY3HAjNPSaCqn5Aahbo5TKsmhp8VRfr1iQ=
|
||||
github.com/jaytaylor/html2text v0.0.0-20180606194806-57d518f124b0/go.mod h1:CVKlgaMiht+LXvHG173ujK6JUhZXKb2u/BQtjPDIvyk=
|
||||
github.com/jmespath/go-jmespath v0.0.0-20180206201540-c2b33e8439af/go.mod h1:Nht3zPeWKUH0NzdCt2Blrr5ys8VGpn0CEB0cQHVjt7k=
|
||||
github.com/jonboulle/clockwork v0.1.0/go.mod h1:Ii8DK3G1RaLaWxj9trq07+26W01tbo22gdxWY5EU2bo=
|
||||
github.com/jpillora/backoff v0.0.0-20180909062703-3050d21c67d7/go.mod h1:2iMrUgbbvHEiQClaW2NsSzMyGHqN+rDFqY705q49KG0=
|
||||
@@ -87,12 +103,17 @@ github.com/leodido/go-urn v1.1.0 h1:Sm1gr51B1kKyfD2BlRcLSiEkffoG96g6TPv6eRoEiB8=
|
||||
github.com/leodido/go-urn v1.1.0/go.mod h1:+cyI34gQWZcE1eQU7NVgKkkzdXDQHr1dBMtdAPozLkw=
|
||||
github.com/magiconair/properties v1.8.0 h1:LLgXmsheXeRoUOBOjtwPQCWIYqM/LU1ayDtDePerRcY=
|
||||
github.com/magiconair/properties v1.8.0/go.mod h1:PppfXfuXeibc/6YijjN8zIbojt8czPbwD3XqdrwzmxQ=
|
||||
github.com/matcornic/hermes v1.2.0 h1:AuqZpYcTOtTB7cahdevLfnhIpfzmpqw5Czv8vpdnFDU=
|
||||
github.com/matcornic/hermes v1.2.0/go.mod h1:lujJomb016Xjv8wBnWlNvUdtmvowjjfkqri5J/+1hYc=
|
||||
github.com/matcornic/hermes/v2 v2.0.2/go.mod h1:iVsJWSIS4NtMNtgan22sy6lt7pImok7bATGPWCoaKNY=
|
||||
github.com/mattn/go-colorable v0.1.1/go.mod h1:FuOcm+DKB9mbwrcAfNl7/TZVBZ6rcnceauSikq3lYCQ=
|
||||
github.com/mattn/go-colorable v0.1.2/go.mod h1:U0ppj6V5qS13XJ6of8GYAs25YV2eR4EVcfRqFIhoBtE=
|
||||
github.com/mattn/go-isatty v0.0.5/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
|
||||
github.com/mattn/go-isatty v0.0.7/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
|
||||
github.com/mattn/go-isatty v0.0.8 h1:HLtExJ+uU2HOZ+wI0Tt5DtUDrx8yhUqDcp7fYERX4CE=
|
||||
github.com/mattn/go-isatty v0.0.8/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
|
||||
github.com/mattn/go-runewidth v0.0.3 h1:a+kO+98RDGEfo6asOGMmpodZq4FNtnGP54yps8BzLR4=
|
||||
github.com/mattn/go-runewidth v0.0.3/go.mod h1:LwmH8dsx7+W8Uxz3IHJYH5QSwggIsqBzpuz5H//U1FU=
|
||||
github.com/matttproud/golang_protobuf_extensions v1.0.1/go.mod h1:D8He9yQNgCq6Z5Ld7szi9bcBfOoFv/3dc6xSMkL2PC0=
|
||||
github.com/mgutz/ansi v0.0.0-20170206155736-9520e82c474b/go.mod h1:01TrycV0kFyexm33Z7vhZRXopbI8J3TDReVlkTgMUxE=
|
||||
github.com/mitchellh/mapstructure v1.1.2 h1:fmNYVwqnSfB9mZU6OS2O6GsXM+wcskZDuKQzvN1EDeE=
|
||||
@@ -103,6 +124,8 @@ github.com/modern-go/reflect2 v1.0.1 h1:9f412s+6RmYXLWZSEzVVgPGK7C2PphHj5RJrvfx9
|
||||
github.com/modern-go/reflect2 v1.0.1/go.mod h1:bx2lNnkwVCuqBIxFjflWJWanXIb3RllmbCylyMrvgv0=
|
||||
github.com/mwitkow/go-conntrack v0.0.0-20161129095857-cc309e4a2223/go.mod h1:qRWi+5nqEBWmkhHvq77mSJWrCKwh8bxhgT7d/eI7P4U=
|
||||
github.com/oklog/ulid v1.3.1/go.mod h1:CirwcVhetQ6Lv90oh/F+FBtV6XMibvdAFo93nm5qn4U=
|
||||
github.com/olekukonko/tablewriter v0.0.1 h1:b3iUnf1v+ppJiOfNX4yxxqfWKMQPZR5yoh8urCTFX88=
|
||||
github.com/olekukonko/tablewriter v0.0.1/go.mod h1:vsDQFd/mU46D+Z4whnwzcISnGGzXWMclvtLoiIKAKIo=
|
||||
github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE=
|
||||
github.com/onsi/gomega v1.5.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY=
|
||||
github.com/pelletier/go-toml v1.2.0 h1:T5zMGML61Wp+FlcbWjRDT7yAxhJNAiPPLOFECq181zc=
|
||||
@@ -123,9 +146,14 @@ github.com/prometheus/procfs v0.0.0-20190507164030-5867b95ac084/go.mod h1:TjEm7z
|
||||
github.com/prometheus/tsdb v0.7.1/go.mod h1:qhTCs0VvXwvX/y3TZrWD7rabWM+ijKTux40TwIPHuXU=
|
||||
github.com/rogpeppe/fastuuid v0.0.0-20150106093220-6724a57986af/go.mod h1:XWv6SoW27p1b0cqNHllgS5HIMJraePCO15w5zCzIWYg=
|
||||
github.com/rogpeppe/fastuuid v1.1.0/go.mod h1:jVj6XXZzXRy/MSR5jhDC/2q6DgLz+nrA6LYCDYWNEvQ=
|
||||
github.com/royeo/dingrobot v1.0.0 h1:K4GrF+fOecNX0yi+oBKpfh7z0XP/8TzaIIHu1B2kKUQ=
|
||||
github.com/royeo/dingrobot v1.0.0/go.mod h1:RqDM8E/hySCVwI2aUFRJAUGDcHHRnIhzNmbNG3bamQs=
|
||||
github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
|
||||
github.com/satori/go.uuid v1.2.0 h1:0uYX9dsZ2yD7q2RtLRtPSdGDWzjeM3TbMJP9utgA0ww=
|
||||
github.com/satori/go.uuid v1.2.0/go.mod h1:dA0hQrYB0VpLJoorglMZABFdXlWrHn1NEOzdhQKdks0=
|
||||
github.com/sergi/go-diff v1.0.0/go.mod h1:0CfEIISq7TuYL3j771MWULgwwjU+GofnZX9QAmXWZgo=
|
||||
github.com/shurcooL/sanitized_anchor_name v1.0.0 h1:PdmoCO6wvbs+7yrJyMORt4/BmY5IYyJwS/kOiWx8mHo=
|
||||
github.com/shurcooL/sanitized_anchor_name v1.0.0/go.mod h1:1NzhyTcUVG4SuEtjjoZeVRXNmyL/1OwPU0+IJeTBvfc=
|
||||
github.com/sirupsen/logrus v1.2.0/go.mod h1:LxeOpSwHxABJmUn/MG1IvRgCAasNZTLOkJPxbbu5VWo=
|
||||
github.com/smartystreets/assertions v0.0.0-20180927180507-b2de0cb4f26d/go.mod h1:OnSkiWE9lh6wB0YB77sQom3nweQdgAjqCqsofrRNTgc=
|
||||
github.com/smartystreets/assertions v1.0.0 h1:UVQPSSmc3qtTi+zPPkCXvZX9VvW/xT/NsRvKfwY81a8=
|
||||
@@ -146,6 +174,8 @@ github.com/spf13/pflag v1.0.3 h1:zPAT6CGy6wXeQ7NtTnaTerfKOsV6V6F8agHXFiazDkg=
|
||||
github.com/spf13/pflag v1.0.3/go.mod h1:DYY7MBk1bdzusC3SYhjObp+wFpr4gzcvqqNjLnInEg4=
|
||||
github.com/spf13/viper v1.4.0 h1:yXHLWeravcrgGyFSyCgdYpXQ9dR9c/WED3pg1RhxqEU=
|
||||
github.com/spf13/viper v1.4.0/go.mod h1:PTJ7Z/lr49W6bUbkmS1V3by4uWynFiR9p7+dSq/yZzE=
|
||||
github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf h1:pvbZ0lM0XWPBqUKqFU8cmavspvIl9nulOYwdy6IFRRo=
|
||||
github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf/go.mod h1:RJID2RhlZKId02nZ62WenDCkgHFerpIOmW0iT7GKmXM=
|
||||
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
|
||||
github.com/stretchr/objx v0.1.1/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
|
||||
github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
|
||||
@@ -165,12 +195,15 @@ go.uber.org/atomic v1.4.0/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
|
||||
go.uber.org/multierr v1.1.0/go.mod h1:wR5kodmAFQ0UK8QlbwjlSNy0Z68gJhDJUG5sjR94q/0=
|
||||
go.uber.org/zap v1.10.0/go.mod h1:vwi/ZaCAaUcBkycHslxD9B2zi4UTXhF60s6SWpuDF0Q=
|
||||
golang.org/x/crypto v0.0.0-20180904163835-0709b304e793/go.mod h1:6SG95UA2DQfeDnfUPMdvaQW0Q7yPrPDi9nlGo2tz2b4=
|
||||
golang.org/x/crypto v0.0.0-20181029175232-7e6ffbd03851/go.mod h1:6SG95UA2DQfeDnfUPMdvaQW0Q7yPrPDi9nlGo2tz2b4=
|
||||
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
|
||||
golang.org/x/crypto v0.0.0-20190426145343-a29dc8fdc734 h1:p/H982KKEjUnLJkM3tt/LemDnOc1GiZL5FCVlORJ5zo=
|
||||
golang.org/x/crypto v0.0.0-20190426145343-a29dc8fdc734/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
|
||||
golang.org/x/lint v0.0.0-20181026193005-c67002cb31c3/go.mod h1:UVdnD1Gm6xHRNCYTkRU2/jEulfH38KcIWyp/GAMgvoE=
|
||||
golang.org/x/lint v0.0.0-20190313153728-d0100b6bd8b3/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc=
|
||||
golang.org/x/net v0.0.0-20180826012351-8a410e7b638d/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
|
||||
golang.org/x/net v0.0.0-20180906233101-161cd47e91fd/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
|
||||
golang.org/x/net v0.0.0-20181029044818-c44066c5c816/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
|
||||
golang.org/x/net v0.0.0-20181114220301-adae6a3d119a/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
|
||||
golang.org/x/net v0.0.0-20181220203305-927f97764cc3/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
|
||||
golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
|
||||
@@ -204,6 +237,8 @@ google.golang.org/genproto v0.0.0-20180817151627-c66870c02cf8/go.mod h1:JiN7NxoA
|
||||
google.golang.org/grpc v1.19.0/go.mod h1:mqu4LbDTu4XGKhr4mRzUsmM4RtVoemTSY81AxZiDr8c=
|
||||
google.golang.org/grpc v1.21.0/go.mod h1:oYelfM1adQP15Ek0mdvEgi9Df8B9CZIaU1084ijfRaM=
|
||||
gopkg.in/alecthomas/kingpin.v2 v2.2.6/go.mod h1:FMv+mEhP44yOT+4EoQTLFTRgOQ1FBLkstjWtayDeSgw=
|
||||
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc h1:2gGKlE2+asNV9m7xrywl36YYNnBG5ZQ0r/BOOxqPpmk=
|
||||
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc/go.mod h1:m7x9LTH6d71AHyAX77c9yqWCCa3UKHcVEj9y7hAtKDk=
|
||||
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127 h1:qIbj1fsPNlZgppZ+VLlY7N33q108Sa+fhmuc+sWQYwY=
|
||||
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
@@ -214,7 +249,11 @@ gopkg.in/go-playground/validator.v8 v8.18.2 h1:lFB4DoMU6B626w8ny76MV7VX6W2VHct2G
|
||||
gopkg.in/go-playground/validator.v8 v8.18.2/go.mod h1:RX2a/7Ha8BgOhfk7j780h4/u/RRjR0eouCJSH80/M2Y=
|
||||
gopkg.in/go-playground/validator.v9 v9.29.1 h1:SvGtYmN60a5CVKTOzMSyfzWDeZRxRuGvRQyEAKbw1xc=
|
||||
gopkg.in/go-playground/validator.v9 v9.29.1/go.mod h1:+c9/zcJMFNgbLvly1L1V+PpxWdVbfP1avr/N00E2vyQ=
|
||||
gopkg.in/gomail.v2 v2.0.0-20150902115704-41f357289737 h1:NvePS/smRcFQ4bMtTddFtknbGCtoBkJxGmpSpVRafCc=
|
||||
gopkg.in/gomail.v2 v2.0.0-20150902115704-41f357289737/go.mod h1:LRQQ+SO6ZHR7tOkpBDuZnXENFzX8qRjMDMyPD6BRkCw=
|
||||
gopkg.in/resty.v1 v1.12.0/go.mod h1:mDo4pnntr5jdWRML875a/NmxYqAlA73dVijT2AXvQQo=
|
||||
gopkg.in/russross/blackfriday.v2 v2.0.0 h1:+FlnIV8DSQnT7NZ43hcVKcdJdzZoeCmJj4Ql8gq5keA=
|
||||
gopkg.in/russross/blackfriday.v2 v2.0.0/go.mod h1:6sSBNz/GtOm/pJTuh5UmBK2ZHfmnxGbl2NZg1UliSOI=
|
||||
gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw=
|
||||
gopkg.in/yaml.v2 v2.0.0-20170812160011-eb3733d160e7/go.mod h1:JAlM8MvJe8wmxCU4Bli9HhUf9+ttbYbLASfIpnQbh74=
|
||||
gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
|
||||
|
||||
191
backend/main.go
@@ -31,22 +31,23 @@ func main() {
|
||||
log.Error("init config error:" + err.Error())
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化配置成功")
|
||||
log.Info("initialized config successfully")
|
||||
|
||||
// 初始化日志设置
|
||||
logLevel := viper.GetString("log.level")
|
||||
if logLevel != "" {
|
||||
log.SetLevelFromString(logLevel)
|
||||
}
|
||||
log.Info("初始化日志设置成功")
|
||||
|
||||
log.Info("initialized log config successfully")
|
||||
if viper.GetString("log.isDeletePeriodically") == "Y" {
|
||||
err := services.InitDeleteLogPeriodically()
|
||||
if err != nil {
|
||||
log.Error("Init DeletePeriodically Failed")
|
||||
log.Error("init DeletePeriodically failed")
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化定期清理日志配置成功")
|
||||
log.Info("initialized periodically cleaning log successfully")
|
||||
} else {
|
||||
log.Info("periodically cleaning log is switched off")
|
||||
}
|
||||
|
||||
// 初始化Mongodb数据库
|
||||
@@ -55,7 +56,7 @@ func main() {
|
||||
debug.PrintStack()
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化Mongodb数据库成功")
|
||||
log.Info("initialized MongoDB successfully")
|
||||
|
||||
// 初始化Redis数据库
|
||||
if err := database.InitRedis(); err != nil {
|
||||
@@ -63,7 +64,7 @@ func main() {
|
||||
debug.PrintStack()
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化Redis数据库成功")
|
||||
log.Info("initialized Redis successfully")
|
||||
|
||||
if model.IsMaster() {
|
||||
// 初始化定时任务
|
||||
@@ -72,7 +73,23 @@ func main() {
|
||||
debug.PrintStack()
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化定时任务成功")
|
||||
log.Info("initialized schedule successfully")
|
||||
|
||||
// 初始化用户服务
|
||||
if err := services.InitUserService(); err != nil {
|
||||
log.Error("init user service error:" + err.Error())
|
||||
debug.PrintStack()
|
||||
panic(err)
|
||||
}
|
||||
log.Info("initialized user service successfully")
|
||||
|
||||
// 初始化依赖服务
|
||||
if err := services.InitDepsFetcher(); err != nil {
|
||||
log.Error("init dependency fetcher error:" + err.Error())
|
||||
debug.PrintStack()
|
||||
panic(err)
|
||||
}
|
||||
log.Info("initialized dependency fetcher successfully")
|
||||
}
|
||||
|
||||
// 初始化任务执行器
|
||||
@@ -81,14 +98,14 @@ func main() {
|
||||
debug.PrintStack()
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化任务执行器成功")
|
||||
log.Info("initialized task executor successfully")
|
||||
|
||||
// 初始化节点服务
|
||||
if err := services.InitNodeService(); err != nil {
|
||||
log.Error("init node service error:" + err.Error())
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化节点配置成功")
|
||||
log.Info("initialized node service successfully")
|
||||
|
||||
// 初始化爬虫服务
|
||||
if err := services.InitSpiderService(); err != nil {
|
||||
@@ -96,73 +113,133 @@ func main() {
|
||||
debug.PrintStack()
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化爬虫服务成功")
|
||||
log.Info("initialized spider service successfully")
|
||||
|
||||
// 初始化用户服务
|
||||
if err := services.InitUserService(); err != nil {
|
||||
log.Error("init user service error:" + err.Error())
|
||||
// 初始化RPC服务
|
||||
if err := services.InitRpcService(); err != nil {
|
||||
log.Error("init rpc service error:" + err.Error())
|
||||
debug.PrintStack()
|
||||
panic(err)
|
||||
}
|
||||
log.Info("初始化用户服务成功")
|
||||
log.Info("initialized rpc service successfully")
|
||||
|
||||
// 以下为主节点服务
|
||||
if model.IsMaster() {
|
||||
// 中间件
|
||||
app.Use(middlewares.CORSMiddleware())
|
||||
//app.Use(middlewares.AuthorizationMiddleware())
|
||||
anonymousGroup := app.Group("/")
|
||||
{
|
||||
anonymousGroup.POST("/login", routes.Login) // 用户登录
|
||||
anonymousGroup.PUT("/users", routes.PutUser) // 添加用户
|
||||
|
||||
anonymousGroup.POST("/login", routes.Login) // 用户登录
|
||||
anonymousGroup.PUT("/users", routes.PutUser) // 添加用户
|
||||
anonymousGroup.GET("/setting", routes.GetSetting) // 获取配置信息
|
||||
// release版本
|
||||
anonymousGroup.GET("/version", routes.GetVersion) // 获取发布的版本
|
||||
}
|
||||
authGroup := app.Group("/", middlewares.AuthorizationMiddleware())
|
||||
{
|
||||
// 路由
|
||||
// 节点
|
||||
authGroup.GET("/nodes", routes.GetNodeList) // 节点列表
|
||||
authGroup.GET("/nodes/:id", routes.GetNode) // 节点详情
|
||||
authGroup.POST("/nodes/:id", routes.PostNode) // 修改节点
|
||||
authGroup.GET("/nodes/:id/tasks", routes.GetNodeTaskList) // 节点任务列表
|
||||
authGroup.GET("/nodes/:id/system", routes.GetSystemInfo) // 节点任务列表
|
||||
authGroup.DELETE("/nodes/:id", routes.DeleteNode) // 删除节点
|
||||
{
|
||||
authGroup.GET("/nodes", routes.GetNodeList) // 节点列表
|
||||
authGroup.GET("/nodes/:id", routes.GetNode) // 节点详情
|
||||
authGroup.POST("/nodes/:id", routes.PostNode) // 修改节点
|
||||
authGroup.GET("/nodes/:id/tasks", routes.GetNodeTaskList) // 节点任务列表
|
||||
authGroup.GET("/nodes/:id/system", routes.GetSystemInfo) // 节点任务列表
|
||||
authGroup.DELETE("/nodes/:id", routes.DeleteNode) // 删除节点
|
||||
authGroup.GET("/nodes/:id/langs", routes.GetLangList) // 节点语言环境列表
|
||||
authGroup.GET("/nodes/:id/deps", routes.GetDepList) // 节点第三方依赖列表
|
||||
authGroup.GET("/nodes/:id/deps/installed", routes.GetInstalledDepList) // 节点已安装第三方依赖列表
|
||||
authGroup.POST("/nodes/:id/deps/install", routes.InstallDep) // 节点安装依赖
|
||||
authGroup.POST("/nodes/:id/deps/uninstall", routes.UninstallDep) // 节点卸载依赖
|
||||
authGroup.POST("/nodes/:id/langs/install", routes.InstallLang) // 节点安装语言
|
||||
}
|
||||
// 爬虫
|
||||
authGroup.GET("/spiders", routes.GetSpiderList) // 爬虫列表
|
||||
authGroup.GET("/spiders/:id", routes.GetSpider) // 爬虫详情
|
||||
authGroup.POST("/spiders", routes.PutSpider) // 上传爬虫
|
||||
authGroup.POST("/spiders/:id", routes.PostSpider) // 修改爬虫
|
||||
authGroup.POST("/spiders/:id/publish", routes.PublishSpider) // 发布爬虫
|
||||
authGroup.DELETE("/spiders/:id", routes.DeleteSpider) // 删除爬虫
|
||||
authGroup.GET("/spiders/:id/tasks", routes.GetSpiderTasks) // 爬虫任务列表
|
||||
authGroup.GET("/spiders/:id/file", routes.GetSpiderFile) // 爬虫文件读取
|
||||
authGroup.POST("/spiders/:id/file", routes.PostSpiderFile) // 爬虫目录写入
|
||||
authGroup.GET("/spiders/:id/dir", routes.GetSpiderDir) // 爬虫目录
|
||||
authGroup.GET("/spiders/:id/stats", routes.GetSpiderStats) // 爬虫统计数据
|
||||
authGroup.GET("/spider/types", routes.GetSpiderTypes) // 爬虫类型
|
||||
{
|
||||
authGroup.GET("/spiders", routes.GetSpiderList) // 爬虫列表
|
||||
authGroup.GET("/spiders/:id", routes.GetSpider) // 爬虫详情
|
||||
authGroup.PUT("/spiders", routes.PutSpider) // 添加爬虫
|
||||
authGroup.POST("/spiders", routes.UploadSpider) // 上传爬虫
|
||||
authGroup.POST("/spiders/:id", routes.PostSpider) // 修改爬虫
|
||||
authGroup.POST("/spiders/:id/publish", routes.PublishSpider) // 发布爬虫
|
||||
authGroup.POST("/spiders/:id/upload", routes.UploadSpiderFromId) // 上传爬虫(ID)
|
||||
authGroup.DELETE("/spiders/:id", routes.DeleteSpider) // 删除爬虫
|
||||
authGroup.GET("/spiders/:id/tasks", routes.GetSpiderTasks) // 爬虫任务列表
|
||||
authGroup.GET("/spiders/:id/file/tree", routes.GetSpiderFileTree) // 爬虫文件目录树读取
|
||||
authGroup.GET("/spiders/:id/file", routes.GetSpiderFile) // 爬虫文件读取
|
||||
authGroup.POST("/spiders/:id/file", routes.PostSpiderFile) // 爬虫文件更改
|
||||
authGroup.PUT("/spiders/:id/file", routes.PutSpiderFile) // 爬虫文件创建
|
||||
authGroup.PUT("/spiders/:id/dir", routes.PutSpiderDir) // 爬虫目录创建
|
||||
authGroup.DELETE("/spiders/:id/file", routes.DeleteSpiderFile) // 爬虫文件删除
|
||||
authGroup.POST("/spiders/:id/file/rename", routes.RenameSpiderFile) // 爬虫文件重命名
|
||||
authGroup.GET("/spiders/:id/dir", routes.GetSpiderDir) // 爬虫目录
|
||||
authGroup.GET("/spiders/:id/stats", routes.GetSpiderStats) // 爬虫统计数据
|
||||
authGroup.GET("/spiders/:id/schedules", routes.GetSpiderSchedules) // 爬虫定时任务
|
||||
}
|
||||
// 可配置爬虫
|
||||
{
|
||||
authGroup.GET("/config_spiders/:id/config", routes.GetConfigSpiderConfig) // 获取可配置爬虫配置
|
||||
authGroup.POST("/config_spiders/:id/config", routes.PostConfigSpiderConfig) // 更改可配置爬虫配置
|
||||
authGroup.PUT("/config_spiders", routes.PutConfigSpider) // 添加可配置爬虫
|
||||
authGroup.POST("/config_spiders/:id", routes.PostConfigSpider) // 修改可配置爬虫
|
||||
authGroup.POST("/config_spiders/:id/upload", routes.UploadConfigSpider) // 上传可配置爬虫
|
||||
authGroup.POST("/config_spiders/:id/spiderfile", routes.PostConfigSpiderSpiderfile) // 上传可配置爬虫
|
||||
authGroup.GET("/config_spiders_templates", routes.GetConfigSpiderTemplateList) // 获取可配置爬虫模版列表
|
||||
}
|
||||
// 任务
|
||||
authGroup.GET("/tasks", routes.GetTaskList) // 任务列表
|
||||
authGroup.GET("/tasks/:id", routes.GetTask) // 任务详情
|
||||
authGroup.PUT("/tasks", routes.PutTask) // 派发任务
|
||||
authGroup.DELETE("/tasks/:id", routes.DeleteTask) // 删除任务
|
||||
authGroup.POST("/tasks/:id/cancel", routes.CancelTask) // 取消任务
|
||||
authGroup.GET("/tasks/:id/log", routes.GetTaskLog) // 任务日志
|
||||
authGroup.GET("/tasks/:id/results", routes.GetTaskResults) // 任务结果
|
||||
authGroup.GET("/tasks/:id/results/download", routes.DownloadTaskResultsCsv) // 下载任务结果
|
||||
{
|
||||
authGroup.GET("/tasks", routes.GetTaskList) // 任务列表
|
||||
authGroup.GET("/tasks/:id", routes.GetTask) // 任务详情
|
||||
authGroup.PUT("/tasks", routes.PutTask) // 派发任务
|
||||
authGroup.DELETE("/tasks/:id", routes.DeleteTask) // 删除任务
|
||||
authGroup.DELETE("/tasks_multiple", routes.DeleteMultipleTask) // 删除多个任务
|
||||
authGroup.DELETE("/tasks_by_status", routes.DeleteTaskByStatus) //删除指定状态的任务
|
||||
authGroup.POST("/tasks/:id/cancel", routes.CancelTask) // 取消任务
|
||||
authGroup.GET("/tasks/:id/log", routes.GetTaskLog) // 任务日志
|
||||
authGroup.GET("/tasks/:id/results", routes.GetTaskResults) // 任务结果
|
||||
authGroup.GET("/tasks/:id/results/download", routes.DownloadTaskResultsCsv) // 下载任务结果
|
||||
}
|
||||
// 定时任务
|
||||
authGroup.GET("/schedules", routes.GetScheduleList) // 定时任务列表
|
||||
authGroup.GET("/schedules/:id", routes.GetSchedule) // 定时任务详情
|
||||
authGroup.PUT("/schedules", routes.PutSchedule) // 创建定时任务
|
||||
authGroup.POST("/schedules/:id", routes.PostSchedule) // 修改定时任务
|
||||
authGroup.DELETE("/schedules/:id", routes.DeleteSchedule) // 删除定时任务
|
||||
{
|
||||
authGroup.GET("/schedules", routes.GetScheduleList) // 定时任务列表
|
||||
authGroup.GET("/schedules/:id", routes.GetSchedule) // 定时任务详情
|
||||
authGroup.PUT("/schedules", routes.PutSchedule) // 创建定时任务
|
||||
authGroup.POST("/schedules/:id", routes.PostSchedule) // 修改定时任务
|
||||
authGroup.DELETE("/schedules/:id", routes.DeleteSchedule) // 删除定时任务
|
||||
authGroup.POST("/schedules/:id/disable", routes.DisableSchedule) // 禁用定时任务
|
||||
authGroup.POST("/schedules/:id/enable", routes.EnableSchedule) // 启用定时任务
|
||||
}
|
||||
// 用户
|
||||
{
|
||||
authGroup.GET("/users", routes.GetUserList) // 用户列表
|
||||
authGroup.GET("/users/:id", routes.GetUser) // 用户详情
|
||||
authGroup.POST("/users/:id", routes.PostUser) // 更改用户
|
||||
authGroup.DELETE("/users/:id", routes.DeleteUser) // 删除用户
|
||||
authGroup.GET("/me", routes.GetMe) // 获取自己账户
|
||||
authGroup.POST("/me", routes.PostMe) // 修改自己账户
|
||||
}
|
||||
// 系统
|
||||
{
|
||||
authGroup.GET("/system/deps/:lang", routes.GetAllDepList) // 节点所有第三方依赖列表
|
||||
authGroup.GET("/system/deps/:lang/:dep_name/json", routes.GetDepJson) // 节点第三方依赖JSON
|
||||
}
|
||||
// 全局变量
|
||||
{
|
||||
authGroup.GET("/variables", routes.GetVariableList) // 列表
|
||||
authGroup.PUT("/variable", routes.PutVariable) // 新增
|
||||
authGroup.POST("/variable/:id", routes.PostVariable) //修改
|
||||
authGroup.DELETE("/variable/:id", routes.DeleteVariable) //删除
|
||||
}
|
||||
// 项目
|
||||
{
|
||||
authGroup.GET("/projects", routes.GetProjectList) // 列表
|
||||
authGroup.GET("/projects/tags", routes.GetProjectTags) // 项目标签
|
||||
authGroup.PUT("/projects", routes.PutProject) //修改
|
||||
authGroup.POST("/projects/:id", routes.PostProject) // 新增
|
||||
authGroup.DELETE("/projects/:id", routes.DeleteProject) //删除
|
||||
}
|
||||
// 统计数据
|
||||
authGroup.GET("/stats/home", routes.GetHomeStats) // 首页统计数据
|
||||
// 用户
|
||||
authGroup.GET("/users", routes.GetUserList) // 用户列表
|
||||
authGroup.GET("/users/:id", routes.GetUser) // 用户详情
|
||||
authGroup.POST("/users/:id", routes.PostUser) // 更改用户
|
||||
authGroup.DELETE("/users/:id", routes.DeleteUser) // 删除用户
|
||||
authGroup.GET("/me", routes.GetMe) // 获取自己账户
|
||||
// 文件
|
||||
authGroup.GET("/file", routes.GetFile) // 获取文件
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
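The routing above splits endpoints into an anonymous group (login, user creation, setting, version) and a group wrapped in AuthorizationMiddleware. The sketch below shows the same shape in isolation; the token check is a stand-in, not Crawlab's actual middleware, and the handlers and port are invented.

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// authRequired is a placeholder for AuthorizationMiddleware: it only checks
// that some Authorization header is present and aborts otherwise.
func authRequired() gin.HandlerFunc {
	return func(c *gin.Context) {
		if c.GetHeader("Authorization") == "" {
			c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "unauthorized"})
			return
		}
		c.Next()
	}
}

func main() {
	app := gin.Default()

	anonymous := app.Group("/")
	{
		anonymous.POST("/login", func(c *gin.Context) {
			c.JSON(http.StatusOK, gin.H{"token": "..."})
		})
	}

	authed := app.Group("/", authRequired())
	{
		authed.GET("/spiders", func(c *gin.Context) {
			c.JSON(http.StatusOK, []string{})
		})
	}

	_ = app.Run(":8000")
}
```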
@@ -42,12 +42,12 @@ func init() {
|
||||
app.DELETE("/tasks/:id", DeleteTask) // 删除任务
|
||||
app.GET("/tasks/:id/results", GetTaskResults) // 任务结果
|
||||
app.GET("/tasks/:id/results/download", DownloadTaskResultsCsv) // 下载任务结果
|
||||
app.GET("/spiders", GetSpiderList) // 爬虫列表
|
||||
app.GET("/spiders/:id", GetSpider) // 爬虫详情
|
||||
app.POST("/spiders/:id", PostSpider) // 修改爬虫
|
||||
app.DELETE("/spiders/:id",DeleteSpider) // 删除爬虫
|
||||
app.GET("/spiders/:id/tasks",GetSpiderTasks) // 爬虫任务列表
|
||||
app.GET("/spiders/:id/dir",GetSpiderDir) // 爬虫目录
|
||||
app.GET("/spiders", GetSpiderList) // 爬虫列表
|
||||
app.GET("/spiders/:id", GetSpider) // 爬虫详情
|
||||
app.POST("/spiders/:id", PostSpider) // 修改爬虫
|
||||
app.DELETE("/spiders/:id", DeleteSpider) // 删除爬虫
|
||||
app.GET("/spiders/:id/tasks", GetSpiderTasks) // 爬虫任务列表
|
||||
app.GET("/spiders/:id/dir", GetSpiderDir) // 爬虫目录
|
||||
}
|
||||
|
||||
//mock test, test data in ./mock
|
||||
|
||||
@@ -10,17 +10,19 @@ import (
|
||||
"time"
|
||||
)
|
||||
|
||||
var NodeIdss = []bson.ObjectId{bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
|
||||
bson.ObjectIdHex("5d429e6c19f7abede924fee1")}
|
||||
|
||||
var scheduleList = []model.Schedule{
|
||||
{
|
||||
Id: bson.ObjectId("5d429e6c19f7abede924fee2"),
|
||||
Name: "test schedule",
|
||||
SpiderId: "123",
|
||||
NodeId: bson.ObjectId("5d429e6c19f7abede924fee2"),
|
||||
NodeIds: NodeIdss,
|
||||
Cron: "***1*",
|
||||
EntryId: 10,
|
||||
// 前端展示
|
||||
SpiderName: "test scedule",
|
||||
NodeName: "测试节点",
|
||||
|
||||
CreateTs: time.Now(),
|
||||
UpdateTs: time.Now(),
|
||||
@@ -29,12 +31,11 @@ var scheduleList = []model.Schedule{
|
||||
Id: bson.ObjectId("xx429e6c19f7abede924fee2"),
|
||||
Name: "test schedule2",
|
||||
SpiderId: "234",
|
||||
NodeId: bson.ObjectId("5d429e6c19f7abede924fee2"),
|
||||
NodeIds: NodeIdss,
|
||||
Cron: "***1*",
|
||||
EntryId: 10,
|
||||
// 前端展示
|
||||
SpiderName: "test scedule2",
|
||||
NodeName: "测试节点",
|
||||
|
||||
CreateTs: time.Now(),
|
||||
UpdateTs: time.Now(),
|
||||
@@ -100,8 +101,10 @@ func PutSchedule(c *gin.Context) {
|
||||
}
|
||||
|
||||
// 如果node_id为空,则置为空ObjectId
|
||||
if item.NodeId == "" {
|
||||
item.NodeId = bson.ObjectIdHex(constants.ObjectIdNull)
|
||||
for _, NodeId := range item.NodeIds {
|
||||
if NodeId == "" {
|
||||
NodeId = bson.ObjectIdHex(constants.ObjectIdNull)
|
||||
}
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
|
||||
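One Go detail in the NodeIds normalization above: the range variable is a copy of each element, so assigning to it does not change the slice; the write has to go through the index. A sketch of the index-based form, assuming constants.ObjectIdNull is the all-zero id used elsewhere in this diff:

```go
package main

import (
	"fmt"

	"crawlab/constants"
	"github.com/globalsign/mgo/bson"
)

// normalizeNodeIds replaces empty IDs with the null ObjectId in place.
// The write goes through the index because the range variable is only a copy.
func normalizeNodeIds(nodeIds []bson.ObjectId) {
	for i, nodeId := range nodeIds {
		if nodeId == "" {
			nodeIds[i] = bson.ObjectIdHex(constants.ObjectIdNull)
		}
	}
}

func main() {
	ids := []bson.ObjectId{"", bson.NewObjectId()}
	normalizeNodeIds(ids)
	fmt.Println(ids[0].Hex()) // the all-zero null ObjectId
}
```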
@@ -75,12 +75,11 @@ func TestPostSchedule(t *testing.T) {
|
||||
Id: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
|
||||
Name: "test schedule",
|
||||
SpiderId: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
|
||||
NodeId: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
|
||||
NodeIds: NodeIdss,
|
||||
Cron: "***1*",
|
||||
EntryId: 10,
|
||||
// 前端展示
|
||||
SpiderName: "test scedule",
|
||||
NodeName: "测试节点",
|
||||
|
||||
CreateTs: time.Now(),
|
||||
UpdateTs: time.Now(),
|
||||
@@ -112,12 +111,11 @@ func TestPutSchedule(t *testing.T) {
|
||||
Id: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
|
||||
Name: "test schedule",
|
||||
SpiderId: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
|
||||
NodeId: bson.ObjectIdHex("5d429e6c19f7abede924fee2"),
|
||||
NodeIds: NodeIdss,
|
||||
Cron: "***1*",
|
||||
EntryId: 10,
|
||||
// 前端展示
|
||||
SpiderName: "test scedule",
|
||||
NodeName: "测试节点",
|
||||
|
||||
CreateTs: time.Now(),
|
||||
UpdateTs: time.Now(),
|
||||
|
||||
@@ -6,8 +6,6 @@ import (
|
||||
"net/http"
|
||||
)
|
||||
|
||||
|
||||
|
||||
var taskDailyItems = []model.TaskDailyItem{
|
||||
{
|
||||
Date: "2019/08/19",
|
||||
|
||||
@@ -1 +1 @@
|
||||
package mock
|
||||
package mock
|
||||
|
||||
@@ -1 +1 @@
|
||||
package mock
|
||||
package mock
|
||||
|
||||
26
backend/model/config_spider/common.go
Normal file
@@ -0,0 +1,26 @@
|
||||
package config_spider
|
||||
|
||||
import "crawlab/entity"
|
||||
|
||||
func GetAllFields(data entity.ConfigSpiderData) []entity.Field {
|
||||
var fields []entity.Field
|
||||
for _, stage := range data.Stages {
|
||||
for _, field := range stage.Fields {
|
||||
fields = append(fields, field)
|
||||
}
|
||||
}
|
||||
return fields
|
||||
}
|
||||
|
||||
func GetStartStageName(data entity.ConfigSpiderData) string {
|
||||
// 如果 start_stage 设置了且在 stages 里,则返回
|
||||
if data.StartStage != "" {
|
||||
return data.StartStage
|
||||
}
|
||||
|
||||
// 否则返回第一个 stage
|
||||
for _, stage := range data.Stages {
|
||||
return stage.Name
|
||||
}
|
||||
return ""
|
||||
}
|
||||
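A short usage sketch for the two helpers: with no explicit start_stage the first stage wins, and GetAllFields flattens every stage's fields into a single slice. The stage and field names here are invented.

```go
package main

import (
	"fmt"

	"crawlab/entity"
	"crawlab/model/config_spider"
)

func main() {
	data := entity.ConfigSpiderData{
		Stages: []entity.Stage{
			{Name: "list", Fields: []entity.Field{{Name: "title"}, {Name: "url"}}},
			{Name: "detail", Fields: []entity.Field{{Name: "content"}}},
		},
	}
	fmt.Println(config_spider.GetStartStageName(data)) // "list" (no start_stage set)
	fmt.Println(len(config_spider.GetAllFields(data))) // 3
}
```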
259
backend/model/config_spider/scrapy.go
Normal file
@@ -0,0 +1,259 @@
|
||||
package config_spider
|
||||
|
||||
import (
|
||||
"crawlab/constants"
|
||||
"crawlab/entity"
|
||||
"crawlab/model"
|
||||
"crawlab/utils"
|
||||
"errors"
|
||||
"fmt"
|
||||
"path/filepath"
|
||||
)
|
||||
|
||||
type ScrapyGenerator struct {
|
||||
Spider model.Spider
|
||||
ConfigData entity.ConfigSpiderData
|
||||
}
|
||||
|
||||
// 生成爬虫文件
|
||||
func (g ScrapyGenerator) Generate() error {
|
||||
// 生成 items.py
|
||||
if err := g.ProcessItems(); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// 生成 spider.py
|
||||
if err := g.ProcessSpider(); err != nil {
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// 生成 items.py
|
||||
func (g ScrapyGenerator) ProcessItems() error {
|
||||
// 待处理文件名
|
||||
src := g.Spider.Src
|
||||
filePath := filepath.Join(src, "config_spider", "items.py")
|
||||
|
||||
// 获取所有字段
|
||||
fields := g.GetAllFields()
|
||||
|
||||
// 字段名列表(包含默认字段名)
|
||||
fieldNames := []string{
|
||||
"_id",
|
||||
"task_id",
|
||||
"ts",
|
||||
}
|
||||
|
||||
// 加入字段
|
||||
for _, field := range fields {
|
||||
fieldNames = append(fieldNames, field.Name)
|
||||
}
|
||||
|
||||
// 将字段名转化为python代码
|
||||
str := ""
|
||||
for _, fieldName := range fieldNames {
|
||||
line := g.PadCode(fmt.Sprintf("%s = scrapy.Field()", fieldName), 1)
|
||||
str += line
|
||||
}
|
||||
|
||||
// 将占位符替换为代码
|
||||
if err := utils.SetFileVariable(filePath, constants.AnchorItems, str); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
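ProcessItems relies on utils.SetFileVariable to swap a named anchor in the template file for generated code; that helper is not shown in this diff. A minimal stand-in could look like the following, where the anchor token and file are invented and only illustrate the substitution idea.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"strings"
)

// setFileVariable rewrites path, replacing every occurrence of anchor with
// value. Only a sketch of what a helper like utils.SetFileVariable might do;
// the real implementation is not part of this diff.
func setFileVariable(path string, anchor string, value string) error {
	content, err := ioutil.ReadFile(path)
	if err != nil {
		return err
	}
	updated := strings.ReplaceAll(string(content), anchor, value)
	return ioutil.WriteFile(path, []byte(updated), 0644)
}

func main() {
	tmp := filepath.Join(os.TempDir(), "items.py")
	_ = ioutil.WriteFile(tmp, []byte("###ANCHOR_ITEMS###\n"), 0644)
	if err := setFileVariable(tmp, "###ANCHOR_ITEMS###", "    title = scrapy.Field()"); err != nil {
		panic(err)
	}
	out, _ := ioutil.ReadFile(tmp)
	fmt.Println(string(out))
}
```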
// 生成 spider.py
|
||||
func (g ScrapyGenerator) ProcessSpider() error {
|
||||
// 待处理文件名
|
||||
src := g.Spider.Src
|
||||
filePath := filepath.Join(src, "config_spider", "spiders", "spider.py")
|
||||
|
||||
// 替换 start_stage
|
||||
if err := utils.SetFileVariable(filePath, constants.AnchorStartStage, "parse_"+GetStartStageName(g.ConfigData)); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// 替换 start_url
|
||||
if err := utils.SetFileVariable(filePath, constants.AnchorStartUrl, g.ConfigData.StartUrl); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// 替换 parsers
|
||||
strParser := ""
|
||||
for _, stage := range g.ConfigData.Stages {
|
||||
stageName := stage.Name
|
||||
stageStr := g.GetParserString(stageName, stage)
|
||||
strParser += stageStr
|
||||
}
|
||||
if err := utils.SetFileVariable(filePath, constants.AnchorParsers, strParser); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func (g ScrapyGenerator) GetParserString(stageName string, stage entity.Stage) string {
|
||||
// 构造函数定义行
|
||||
strDef := g.PadCode(fmt.Sprintf("def parse_%s(self, response):", stageName), 1)
|
||||
|
||||
strParse := ""
|
||||
if stage.IsList {
|
||||
// 列表逻辑
|
||||
strParse = g.GetListParserString(stageName, stage)
|
||||
} else {
|
||||
// 非列表逻辑
|
||||
strParse = g.GetNonListParserString(stageName, stage)
|
||||
}
|
||||
|
||||
// 构造
|
||||
str := fmt.Sprintf(`%s%s`, strDef, strParse)
|
||||
|
||||
return str
|
||||
}
|
||||
|
||||
func (g ScrapyGenerator) PadCode(str string, num int) string {
|
||||
res := ""
|
||||
for i := 0; i < num; i++ {
|
||||
res += " "
|
||||
}
|
||||
res += str
|
||||
res += "\n"
|
||||
return res
|
||||
}
|
||||
|
||||
func (g ScrapyGenerator) GetNonListParserString(stageName string, stage entity.Stage) string {
|
||||
str := ""
|
||||
|
||||
// 获取或构造item
|
||||
str += g.PadCode("item = Item() if response.meta.get('item') is None else response.meta.get('item')", 2)
|
||||
|
||||
// 遍历字段列表
|
||||
for _, f := range stage.Fields {
|
||||
line := fmt.Sprintf(`item['%s'] = response.%s.extract_first()`, f.Name, g.GetExtractStringFromField(f))
|
||||
line = g.PadCode(line, 2)
|
||||
str += line
|
||||
}
|
||||
|
||||
// next stage 字段
|
||||
if f, err := g.GetNextStageField(stage); err == nil {
|
||||
// 如果找到 next stage 字段,进行下一个回调
|
||||
str += g.PadCode(fmt.Sprintf(`yield scrapy.Request(url="get_real_url(response, item['%s'])", callback=self.parse_%s, meta={'item': item})`, f.Name, f.NextStage), 2)
|
||||
} else {
|
||||
// 如果没找到 next stage 字段,返回 item
|
||||
str += g.PadCode(fmt.Sprintf(`yield item`), 2)
|
||||
}
|
||||
|
||||
// 加入末尾换行
|
||||
str += g.PadCode("", 0)
|
||||
|
||||
return str
|
||||
}
|
||||
|
||||
func (g ScrapyGenerator) GetListParserString(stageName string, stage entity.Stage) string {
|
||||
str := ""
|
||||
|
||||
// 获取前一个 stage 的 item
|
||||
str += g.PadCode(`prev_item = response.meta.get('item')`, 2)
|
||||
|
||||
// for 循环遍历列表
|
||||
str += g.PadCode(fmt.Sprintf(`for elem in response.%s:`, g.GetListString(stage)), 2)
|
||||
|
||||
// 构造item
|
||||
str += g.PadCode(`item = Item()`, 3)
|
||||
|
||||
// 遍历字段列表
|
||||
for _, f := range stage.Fields {
|
||||
line := fmt.Sprintf(`item['%s'] = elem.%s.extract_first()`, f.Name, g.GetExtractStringFromField(f))
|
||||
line = g.PadCode(line, 3)
|
||||
str += line
|
||||
}
|
||||
|
||||
// 把前一个 stage 的 item 值赋给当前 item
|
||||
str += g.PadCode(`if prev_item is not None:`, 3)
|
||||
str += g.PadCode(`for key, value in prev_item.items():`, 4)
|
||||
str += g.PadCode(`item[key] = value`, 5)
|
||||
|
||||
// next stage 字段
|
||||
if f, err := g.GetNextStageField(stage); err == nil {
|
||||
// 如果找到 next stage 字段,进行下一个回调
|
||||
str += g.PadCode(fmt.Sprintf(`yield scrapy.Request(url=get_real_url(response, item['%s']), callback=self.parse_%s, meta={'item': item})`, f.Name, f.NextStage), 3)
|
||||
} else {
|
||||
// 如果没找到 next stage 字段,返回 item
|
||||
str += g.PadCode(fmt.Sprintf(`yield item`), 3)
|
||||
}
|
||||
|
||||
// 分页
|
||||
if stage.PageCss != "" || stage.PageXpath != "" {
|
||||
str += g.PadCode(fmt.Sprintf(`next_url = response.%s.extract_first()`, g.GetExtractStringFromStage(stage)), 2)
|
||||
str += g.PadCode(fmt.Sprintf(`yield scrapy.Request(url=get_real_url(response, next_url), callback=self.parse_%s, meta={'item': prev_item})`, stageName), 2)
|
||||
}
|
||||
|
||||
// 加入末尾换行
|
||||
str += g.PadCode("", 0)
|
||||
|
||||
return str
|
||||
}
|
||||
|
||||
// 获取所有字段
|
||||
func (g ScrapyGenerator) GetAllFields() []entity.Field {
|
||||
return GetAllFields(g.ConfigData)
|
||||
}
|
||||
|
||||
// 获取包含 next stage 的字段
|
||||
func (g ScrapyGenerator) GetNextStageField(stage entity.Stage) (entity.Field, error) {
|
||||
for _, field := range stage.Fields {
|
||||
if field.NextStage != "" {
|
||||
return field, nil
|
||||
}
|
||||
}
|
||||
return entity.Field{}, errors.New("cannot find next stage field")
|
||||
}
|
||||
|
||||
func (g ScrapyGenerator) GetExtractStringFromField(f entity.Field) string {
|
||||
if f.Css != "" {
|
||||
// 如果为CSS
|
||||
if f.Attr == "" {
|
||||
// 文本
|
||||
return fmt.Sprintf(`css('%s::text')`, f.Css)
|
||||
} else {
|
||||
// 属性
|
||||
return fmt.Sprintf(`css('%s::attr("%s")')`, f.Css, f.Attr)
|
||||
}
|
||||
} else {
|
||||
// 如果为XPath
|
||||
if f.Attr == "" {
|
||||
// 文本
|
||||
return fmt.Sprintf(`xpath('string(%s)')`, f.Xpath)
|
||||
} else {
|
||||
// 属性
|
||||
return fmt.Sprintf(`xpath('%s/@%s')`, f.Xpath, f.Attr)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
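GetExtractStringFromField maps a configured field onto a Scrapy selector expression: CSS fields become css('sel::text') or css('sel::attr("x")'), XPath fields become xpath('string(...)') or xpath('.../@x'). A quick check of that mapping, assuming it runs inside the crawlab module (the zero-value generator is enough because the method only reads the field):

```go
package main

import (
	"fmt"

	"crawlab/entity"
	"crawlab/model/config_spider"
)

func main() {
	g := config_spider.ScrapyGenerator{}

	fmt.Println(g.GetExtractStringFromField(entity.Field{Css: "span.title"}))
	// css('span.title::text')

	fmt.Println(g.GetExtractStringFromField(entity.Field{Css: "a.link", Attr: "href"}))
	// css('a.link::attr("href")')

	fmt.Println(g.GetExtractStringFromField(entity.Field{Xpath: "//h1"}))
	// xpath('string(//h1)')
}
```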
func (g ScrapyGenerator) GetExtractStringFromStage(stage entity.Stage) string {
|
||||
// 分页元素属性,默认为 href
|
||||
pageAttr := "href"
|
||||
if stage.PageAttr != "" {
|
||||
pageAttr = stage.PageAttr
|
||||
}
|
||||
|
||||
if stage.PageCss != "" {
|
||||
// 如果为CSS
|
||||
return fmt.Sprintf(`css('%s::attr("%s")')`, stage.PageCss, pageAttr)
|
||||
} else {
|
||||
// 如果为XPath
|
||||
return fmt.Sprintf(`xpath('%s/@%s')`, stage.PageXpath, pageAttr)
|
||||
}
|
||||
}
|
||||
|
||||
func (g ScrapyGenerator) GetListString(stage entity.Stage) string {
|
||||
if stage.ListCss != "" {
|
||||
return fmt.Sprintf(`css('%s')`, stage.ListCss)
|
||||
} else {
|
||||
return fmt.Sprintf(`xpath('%s')`, stage.ListXpath)
|
||||
}
|
||||
}
|
||||
@@ -20,10 +20,13 @@ type GridFs struct {
|
||||
}
|
||||
|
||||
type File struct {
|
||||
Name string `json:"name"`
|
||||
Path string `json:"path"`
|
||||
IsDir bool `json:"is_dir"`
|
||||
Size int64 `json:"size"`
|
||||
Name string `json:"name"`
|
||||
Path string `json:"path"`
|
||||
RelativePath string `json:"relative_path"`
|
||||
IsDir bool `json:"is_dir"`
|
||||
Size int64 `json:"size"`
|
||||
Children []File `json:"children"`
|
||||
Label string `json:"label"`
|
||||
}
|
||||
|
||||
func (f *GridFs) Remove() {
|
||||
|
||||
@@ -55,7 +55,7 @@ func GetCurrentNode() (Node, error) {
|
||||
for {
|
||||
// 如果错误次数超过10次
|
||||
if errNum >= 10 {
|
||||
panic("cannot get current node")
|
||||
return node, errors.New("cannot get current node")
|
||||
}
|
||||
|
||||
// 尝试获取节点
|
||||
@@ -63,7 +63,9 @@ func GetCurrentNode() (Node, error) {
|
||||
// 如果获取失败
|
||||
if err != nil {
|
||||
// 如果为主节点,表示为第一次注册,插入节点信息
|
||||
if IsMaster() {
|
||||
// update: 增加具体错误过滤。防止加入多个master节点,后续需要职责拆分,
|
||||
//只在master节点运行的时候才检测master节点的信息是否存在
|
||||
if IsMaster() && err == mgo.ErrNotFound {
|
||||
// 获取本机信息
|
||||
ip, mac, key, err := GetNodeBaseInfo()
|
||||
if err != nil {
|
||||
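GetCurrentNode retries the lookup and gives up after ten failures, returning an error instead of panicking. The same bounded-retry shape pulled out into a helper; the attempt count, delay and fake error below are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retry calls fn up to attempts times, sleeping between tries, and returns
// the last error once the budget is exhausted.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up after %d attempts: %v", attempts, err)
}

func main() {
	calls := 0
	err := retry(10, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("node not registered yet")
		}
		return nil
	})
	fmt.Println(err, calls) // <nil> 3
}
```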
@@ -143,6 +145,7 @@ func (n *Node) GetTasks() ([]Task, error) {
|
||||
return tasks, nil
|
||||
}
|
||||
|
||||
// 节点列表
|
||||
func GetNodeList(filter interface{}) ([]Node, error) {
|
||||
s, c := database.GetCol("nodes")
|
||||
defer s.Close()
|
||||
@@ -156,6 +159,7 @@ func GetNodeList(filter interface{}) ([]Node, error) {
|
||||
return results, nil
|
||||
}
|
||||
|
||||
// 节点信息
|
||||
func GetNode(id bson.ObjectId) (Node, error) {
|
||||
var node Node
|
||||
|
||||
@@ -169,13 +173,14 @@ func GetNode(id bson.ObjectId) (Node, error) {
|
||||
defer s.Close()
|
||||
|
||||
if err := c.FindId(id).One(&node); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
//log.Errorf("get node error: %s, id: %s", err.Error(), id.Hex())
|
||||
//debug.PrintStack()
|
||||
return node, err
|
||||
}
|
||||
return node, nil
|
||||
}
|
||||
|
||||
// 节点信息
|
||||
func GetNodeByKey(key string) (Node, error) {
|
||||
s, c := database.GetCol("nodes")
|
||||
defer s.Close()
|
||||
@@ -191,6 +196,7 @@ func GetNodeByKey(key string) (Node, error) {
|
||||
return node, nil
|
||||
}
|
||||
|
||||
// 更新节点
|
||||
func UpdateNode(id bson.ObjectId, item Node) error {
|
||||
s, c := database.GetCol("nodes")
|
||||
defer s.Close()
|
||||
@@ -206,6 +212,7 @@ func UpdateNode(id bson.ObjectId, item Node) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
// 任务列表
|
||||
func GetNodeTaskList(id bson.ObjectId) ([]Task, error) {
|
||||
node, err := GetNode(id)
|
||||
if err != nil {
|
||||
@@ -218,6 +225,7 @@ func GetNodeTaskList(id bson.ObjectId) ([]Task, error) {
|
||||
return tasks, nil
|
||||
}
|
||||
|
||||
// 节点数
|
||||
func GetNodeCount(query interface{}) (int, error) {
|
||||
s, c := database.GetCol("nodes")
|
||||
defer s.Close()
|
||||
|
||||
146
backend/model/project.go
Normal file
@@ -0,0 +1,146 @@
|
||||
package model
|
||||
|
||||
import (
|
||||
"crawlab/constants"
|
||||
"crawlab/database"
|
||||
"github.com/apex/log"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"runtime/debug"
|
||||
"time"
|
||||
)
|
||||
|
||||
type Project struct {
|
||||
Id bson.ObjectId `json:"_id" bson:"_id"`
|
||||
Name string `json:"name" bson:"name"`
|
||||
Description string `json:"description" bson:"description"`
|
||||
Tags []string `json:"tags" bson:"tags"`
|
||||
|
||||
CreateTs time.Time `json:"create_ts" bson:"create_ts"`
|
||||
UpdateTs time.Time `json:"update_ts" bson:"update_ts"`
|
||||
|
||||
// 前端展示
|
||||
Spiders []Spider `json:"spiders" bson:"spiders"`
|
||||
}
|
||||
|
||||
func (p *Project) Save() error {
|
||||
s, c := database.GetCol("projects")
|
||||
defer s.Close()
|
||||
|
||||
p.UpdateTs = time.Now()
|
||||
|
||||
if err := c.UpdateId(p.Id, p); err != nil {
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (p *Project) Add() error {
|
||||
s, c := database.GetCol("projects")
|
||||
defer s.Close()
|
||||
|
||||
p.Id = bson.NewObjectId()
|
||||
p.UpdateTs = time.Now()
|
||||
p.CreateTs = time.Now()
|
||||
if err := c.Insert(p); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func (p *Project) GetSpiders() ([]Spider, error) {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
|
||||
var query interface{}
|
||||
if p.Id.Hex() == constants.ObjectIdNull {
|
||||
query = bson.M{
|
||||
"$or": []bson.M{
|
||||
{"project_id": p.Id},
|
||||
{"project_id": bson.M{"$exists": false}},
|
||||
},
|
||||
}
|
||||
} else {
|
||||
query = bson.M{"project_id": p.Id}
|
||||
}
|
||||
|
||||
var spiders []Spider
|
||||
if err := c.Find(query).All(&spiders); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return spiders, err
|
||||
}
|
||||
|
||||
return spiders, nil
|
||||
}
|
||||
|
||||
func GetProject(id bson.ObjectId) (Project, error) {
|
||||
s, c := database.GetCol("projects")
|
||||
defer s.Close()
|
||||
var p Project
|
||||
if err := c.Find(bson.M{"_id": id}).One(&p); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return p, err
|
||||
}
|
||||
return p, nil
|
||||
}
|
||||
|
||||
func GetProjectList(filter interface{}, skip int, sortKey string) ([]Project, error) {
|
||||
s, c := database.GetCol("projects")
|
||||
defer s.Close()
|
||||
|
||||
var projects []Project
|
||||
if err := c.Find(filter).Skip(skip).Limit(constants.Infinite).Sort(sortKey).All(&projects); err != nil {
|
||||
debug.PrintStack()
|
||||
return projects, err
|
||||
}
|
||||
return projects, nil
|
||||
}
|
||||
|
||||
func GetProjectListTotal(filter interface{}) (int, error) {
|
||||
s, c := database.GetCol("projects")
|
||||
defer s.Close()
|
||||
|
||||
var result int
|
||||
result, err := c.Find(filter).Count()
|
||||
if err != nil {
|
||||
return result, err
|
||||
}
|
||||
return result, nil
|
||||
}
|
||||
|
||||
func UpdateProject(id bson.ObjectId, item Project) error {
|
||||
s, c := database.GetCol("projects")
|
||||
defer s.Close()
|
||||
|
||||
var result Project
|
||||
if err := c.FindId(id).One(&result); err != nil {
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
if err := item.Save(); err != nil {
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func RemoveProject(id bson.ObjectId) error {
|
||||
s, c := database.GetCol("projects")
|
||||
defer s.Close()
|
||||
|
||||
var result User
|
||||
if err := c.FindId(id).One(&result); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
if err := c.RemoveId(id); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
@@ -12,19 +12,23 @@ import (
|
||||
)
|
||||
|
||||
type Schedule struct {
|
||||
Id bson.ObjectId `json:"_id" bson:"_id"`
|
||||
Name string `json:"name" bson:"name"`
|
||||
Description string `json:"description" bson:"description"`
|
||||
SpiderId bson.ObjectId `json:"spider_id" bson:"spider_id"`
|
||||
NodeId bson.ObjectId `json:"node_id" bson:"node_id"`
|
||||
NodeKey string `json:"node_key" bson:"node_key"`
|
||||
Cron string `json:"cron" bson:"cron"`
|
||||
EntryId cron.EntryID `json:"entry_id" bson:"entry_id"`
|
||||
Param string `json:"param" bson:"param"`
|
||||
Id bson.ObjectId `json:"_id" bson:"_id"`
|
||||
Name string `json:"name" bson:"name"`
|
||||
Description string `json:"description" bson:"description"`
|
||||
SpiderId bson.ObjectId `json:"spider_id" bson:"spider_id"`
|
||||
Cron string `json:"cron" bson:"cron"`
|
||||
EntryId cron.EntryID `json:"entry_id" bson:"entry_id"`
|
||||
Param string `json:"param" bson:"param"`
|
||||
RunType string `json:"run_type" bson:"run_type"`
|
||||
NodeIds []bson.ObjectId `json:"node_ids" bson:"node_ids"`
|
||||
Status string `json:"status" bson:"status"`
|
||||
Enabled bool `json:"enabled" bson:"enabled"`
|
||||
UserId bson.ObjectId `json:"user_id" bson:"user_id"`
|
||||
|
||||
// 前端展示
|
||||
SpiderName string `json:"spider_name" bson:"spider_name"`
|
||||
NodeName string `json:"node_name" bson:"node_name"`
|
||||
Nodes []Node `json:"nodes" bson:"nodes"`
|
||||
Message string `json:"message" bson:"message"`
|
||||
|
||||
CreateTs time.Time `json:"create_ts" bson:"create_ts"`
|
||||
UpdateTs time.Time `json:"update_ts" bson:"update_ts"`
|
||||
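With the node_id/node_key pair replaced by RunType plus a NodeIds slice, a schedule can now target several nodes at once. A construction sketch; constants.RunTypeSelectedNodes comes from this diff, while the name, cron expression and ids are invented.

```go
package main

import (
	"fmt"

	"crawlab/constants"
	"crawlab/model"
	"github.com/globalsign/mgo/bson"
)

func main() {
	// A schedule that targets two specific nodes; other run types leave the
	// node selection to the scheduler.
	sch := model.Schedule{
		Name:     "nightly-crawl",
		SpiderId: bson.NewObjectId(),
		Cron:     "0 2 * * *", // illustrative cron expression
		RunType:  constants.RunTypeSelectedNodes,
		NodeIds:  []bson.ObjectId{bson.NewObjectId(), bson.NewObjectId()},
		Enabled:  true,
	}
	fmt.Println(sch.Name, len(sch.NodeIds)) // nightly-crawl 2
}
```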
@@ -46,27 +50,6 @@ func (sch *Schedule) Delete() error {
|
||||
return c.RemoveId(sch.Id)
|
||||
}
|
||||
|
||||
func (sch *Schedule) SyncNodeIdAndSpiderId(node Node, spider Spider) {
|
||||
sch.syncNodeId(node)
|
||||
sch.syncSpiderId(spider)
|
||||
}
|
||||
|
||||
func (sch *Schedule) syncNodeId(node Node) {
|
||||
if node.Id.Hex() == sch.NodeId.Hex() {
|
||||
return
|
||||
}
|
||||
sch.NodeId = node.Id
|
||||
_ = sch.Save()
|
||||
}
|
||||
|
||||
func (sch *Schedule) syncSpiderId(spider Spider) {
|
||||
if spider.Id.Hex() == sch.SpiderId.Hex() {
|
||||
return
|
||||
}
|
||||
sch.SpiderId = spider.Id
|
||||
_ = sch.Save()
|
||||
}
|
||||
|
||||
func GetScheduleList(filter interface{}) ([]Schedule, error) {
|
||||
s, c := database.GetCol("schedules")
|
||||
defer s.Close()
|
||||
@@ -79,28 +62,25 @@ func GetScheduleList(filter interface{}) ([]Schedule, error) {
|
||||
var schs []Schedule
|
||||
for _, schedule := range schedules {
|
||||
// 获取节点名称
|
||||
if schedule.NodeId == bson.ObjectIdHex(constants.ObjectIdNull) {
|
||||
// 选择所有节点
|
||||
schedule.NodeName = "All Nodes"
|
||||
} else {
|
||||
// 选择单一节点
|
||||
node, err := GetNode(schedule.NodeId)
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
continue
|
||||
schedule.Nodes = []Node{}
|
||||
if schedule.RunType == constants.RunTypeSelectedNodes {
|
||||
for _, nodeId := range schedule.NodeIds {
|
||||
// 选择单一节点
|
||||
node, _ := GetNode(nodeId)
|
||||
schedule.Nodes = append(schedule.Nodes, node)
|
||||
}
|
||||
schedule.NodeName = node.Name
|
||||
}
|
||||
|
||||
// 获取爬虫名称
|
||||
spider, err := GetSpider(schedule.SpiderId)
|
||||
if err != nil && err == mgo.ErrNotFound {
|
||||
log.Errorf("get spider by id: %s, error: %s", schedule.SpiderId.Hex(), err.Error())
|
||||
debug.PrintStack()
|
||||
_ = schedule.Delete()
|
||||
continue
|
||||
schedule.Status = constants.ScheduleStatusError
|
||||
schedule.Message = constants.ScheduleStatusErrorNotFoundSpider
|
||||
} else {
|
||||
schedule.SpiderName = spider.Name
|
||||
}
|
||||
schedule.SpiderName = spider.Name
|
||||
|
||||
schs = append(schs, schedule)
|
||||
}
|
||||
return schs, nil
|
||||
@@ -125,12 +105,8 @@ func UpdateSchedule(id bson.ObjectId, item Schedule) error {
|
||||
if err := c.FindId(id).One(&result); err != nil {
|
||||
return err
|
||||
}
|
||||
node, err := GetNode(item.NodeId)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
item.NodeKey = node.Key
|
||||
item.UpdateTs = time.Now()
|
||||
if err := item.Save(); err != nil {
|
||||
return err
|
||||
}
|
||||
@@ -141,15 +117,9 @@ func AddSchedule(item Schedule) error {
|
||||
s, c := database.GetCol("schedules")
|
||||
defer s.Close()
|
||||
|
||||
node, err := GetNode(item.NodeId)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
item.Id = bson.NewObjectId()
|
||||
item.CreateTs = time.Now()
|
||||
item.UpdateTs = time.Now()
|
||||
item.NodeKey = node.Key
|
||||
|
||||
if err := c.Insert(&item); err != nil {
|
||||
debug.PrintStack()
|
||||
|
||||
@@ -1,11 +1,17 @@
|
||||
package model
|
||||
|
||||
import (
|
||||
"crawlab/constants"
|
||||
"crawlab/database"
|
||||
"crawlab/entity"
|
||||
"crawlab/utils"
|
||||
"errors"
|
||||
"github.com/apex/log"
|
||||
"github.com/globalsign/mgo"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"gopkg.in/yaml.v2"
|
||||
"io/ioutil"
|
||||
"path/filepath"
|
||||
"runtime/debug"
|
||||
"time"
|
||||
)
|
||||
@@ -25,25 +31,21 @@ type Spider struct {
|
||||
Site string `json:"site" bson:"site"` // 爬虫网站
|
||||
Envs []Env `json:"envs" bson:"envs"` // 环境变量
|
||||
Remark string `json:"remark" bson:"remark"` // 备注
|
||||
Src string `json:"src" bson:"src"` // 源码位置
|
||||
ProjectId bson.ObjectId `json:"project_id" bson:"project_id"` // 项目ID
|
||||
|
||||
// 自定义爬虫
|
||||
Src string `json:"src" bson:"src"` // 源码位置
|
||||
Cmd string `json:"cmd" bson:"cmd"` // 执行命令
|
||||
|
||||
// 可配置爬虫
|
||||
Template string `json:"template" bson:"template"` // Spiderfile模版
|
||||
|
||||
// 前端展示
|
||||
LastRunTs time.Time `json:"last_run_ts"` // 最后一次执行时间
|
||||
LastStatus string `json:"last_status"` // 最后执行状态
|
||||
|
||||
// TODO: 可配置爬虫
|
||||
//Fields []interface{} `json:"fields"`
|
||||
//DetailFields []interface{} `json:"detail_fields"`
|
||||
//CrawlType string `json:"crawl_type"`
|
||||
//StartUrl string `json:"start_url"`
|
||||
//UrlPattern string `json:"url_pattern"`
|
||||
//ItemSelector string `json:"item_selector"`
|
||||
//ItemSelectorType string `json:"item_selector_type"`
|
||||
//PaginationSelector string `json:"pagination_selector"`
|
||||
//PaginationSelectorType string `json:"pagination_selector_type"`
|
||||
LastRunTs time.Time `json:"last_run_ts"` // 最后一次执行时间
|
||||
LastStatus string `json:"last_status"` // 最后执行状态
|
||||
Config entity.ConfigSpiderData `json:"config"` // 可配置爬虫配置
|
||||
|
||||
// 时间
|
||||
CreateTs time.Time `json:"create_ts" bson:"create_ts"`
|
||||
UpdateTs time.Time `json:"update_ts" bson:"update_ts"`
|
||||
}
|
||||
@@ -55,6 +57,11 @@ func (spider *Spider) Save() error {
|
||||
|
||||
spider.UpdateTs = time.Now()
|
||||
|
||||
// 兼容没有项目ID的爬虫
|
||||
if spider.ProjectId.Hex() == "" {
|
||||
spider.ProjectId = bson.ObjectIdHex(constants.ObjectIdNull)
|
||||
}
|
||||
|
||||
if err := c.UpdateId(spider.Id, spider); err != nil {
|
||||
debug.PrintStack()
|
||||
return err
|
||||
@@ -98,24 +105,29 @@ func (spider *Spider) GetLastTask() (Task, error) {
|
||||
return tasks[0], nil
|
||||
}
|
||||
|
||||
// 删除爬虫
|
||||
func (spider *Spider) Delete() error {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
return c.RemoveId(spider.Id)
|
||||
}
|
||||
|
||||
// 爬虫列表
|
||||
func GetSpiderList(filter interface{}, skip int, limit int) ([]Spider, int, error) {
|
||||
// 获取爬虫列表
|
||||
func GetSpiderList(filter interface{}, skip int, limit int, sortStr string) ([]Spider, int, error) {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
|
||||
// 获取爬虫列表
|
||||
var spiders []Spider
|
||||
if err := c.Find(filter).Skip(skip).Limit(limit).Sort("+name").All(&spiders); err != nil {
|
||||
if err := c.Find(filter).Skip(skip).Limit(limit).Sort(sortStr).All(&spiders); err != nil {
|
||||
debug.PrintStack()
|
||||
return spiders, 0, err
|
||||
}
|
||||
|
||||
if spiders == nil {
|
||||
spiders = []Spider{}
|
||||
}
|
||||
|
||||
// 遍历爬虫列表
|
||||
for i, spider := range spiders {
|
||||
// 获取最后一次任务
|
||||
@@ -136,7 +148,7 @@ func GetSpiderList(filter interface{}, skip int, limit int) ([]Spider, int, erro
|
||||
return spiders, count, nil
|
||||
}
|
||||
|
||||
// 获取爬虫
|
||||
// 获取爬虫(根据FileId)
|
||||
func GetSpiderByFileId(fileId bson.ObjectId) *Spider {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
@@ -150,34 +162,44 @@ func GetSpiderByFileId(fileId bson.ObjectId) *Spider {
|
||||
return result
|
||||
}
|
||||
|
||||
// 获取爬虫
|
||||
func GetSpiderByName(name string) *Spider {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
|
||||
var result *Spider
|
||||
if err := c.Find(bson.M{"name": name}).One(&result); err != nil {
|
||||
log.Errorf("get spider error: %s, spider_name: %s", err.Error(), name)
|
||||
debug.PrintStack()
|
||||
return nil
|
||||
}
|
||||
return result
|
||||
}
|
||||
|
||||
// 获取爬虫
|
||||
func GetSpider(id bson.ObjectId) (Spider, error) {
|
||||
// 获取爬虫(根据名称)
|
||||
func GetSpiderByName(name string) Spider {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
|
||||
var result Spider
|
||||
if err := c.FindId(id).One(&result); err != nil {
|
||||
if err := c.Find(bson.M{"name": name}).One(&result); err != nil && err != mgo.ErrNotFound {
|
||||
log.Errorf("get spider error: %s, spider_name: %s", err.Error(), name)
|
||||
//debug.PrintStack()
|
||||
return result
|
||||
}
|
||||
return result
|
||||
}
|
||||
|
||||
// 获取爬虫(根据ID)
|
||||
func GetSpider(id bson.ObjectId) (Spider, error) {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
|
||||
// 获取爬虫
|
||||
var spider Spider
|
||||
if err := c.FindId(id).One(&spider); err != nil {
|
||||
if err != mgo.ErrNotFound {
|
||||
log.Errorf("get spider error: %s, id: %id", err.Error(), id.Hex())
|
||||
debug.PrintStack()
|
||||
}
|
||||
return result, err
|
||||
return spider, err
|
||||
}
|
||||
return result, nil
|
||||
|
||||
// 如果为可配置爬虫,获取爬虫配置
|
||||
if spider.Type == constants.Configurable && utils.Exists(filepath.Join(spider.Src, "Spiderfile")) {
|
||||
config, err := GetConfigSpiderData(spider)
|
||||
if err != nil {
|
||||
return spider, err
|
||||
}
|
||||
spider.Config = config
|
||||
}
|
||||
return spider, nil
|
||||
}
|
||||
|
||||
// 更新爬虫
|
||||
@@ -217,10 +239,12 @@ func RemoveSpider(id bson.ObjectId) error {
|
||||
s, gf := database.GetGridFs("files")
|
||||
defer s.Close()
|
||||
|
||||
if err := gf.RemoveId(result.FileId); err != nil {
|
||||
log.Error("remove file error, id:" + result.FileId.Hex())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
if result.FileId.Hex() != constants.ObjectIdNull {
|
||||
if err := gf.RemoveId(result.FileId); err != nil {
|
||||
log.Error("remove file error, id:" + result.FileId.Hex())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
}
|
||||
|
||||
return nil
|
||||
@@ -245,7 +269,7 @@ func RemoveAllSpider() error {
|
||||
return nil
|
||||
}
|
||||
|
||||
// 爬虫总数
|
||||
// 获取爬虫总数
|
||||
func GetSpiderCount() (int, error) {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
@@ -257,23 +281,29 @@ func GetSpiderCount() (int, error) {
|
||||
return count, nil
|
||||
}
|
||||
|
||||
// 爬虫类型
|
||||
func GetSpiderTypes() ([]*entity.SpiderType, error) {
|
||||
s, c := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
// 获取爬虫定时任务
|
||||
func GetConfigSpiderData(spider Spider) (entity.ConfigSpiderData, error) {
|
||||
// 构造配置数据
|
||||
configData := entity.ConfigSpiderData{}
|
||||
|
||||
group := bson.M{
|
||||
"$group": bson.M{
|
||||
"_id": "$type",
|
||||
"count": bson.M{"$sum": 1},
|
||||
},
|
||||
}
|
||||
var types []*entity.SpiderType
|
||||
if err := c.Pipe([]bson.M{group}).All(&types); err != nil {
|
||||
log.Errorf("get spider types error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
return nil, err
|
||||
// 校验爬虫类别
|
||||
if spider.Type != constants.Configurable {
|
||||
return configData, errors.New("not a configurable spider")
|
||||
}
|
||||
|
||||
return types, nil
|
||||
// Spiderfile 目录
|
||||
sfPath := filepath.Join(spider.Src, "Spiderfile")
|
||||
|
||||
// 读取YAML文件
|
||||
yamlFile, err := ioutil.ReadFile(sfPath)
|
||||
if err != nil {
|
||||
return configData, err
|
||||
}
|
||||
|
||||
// 反序列化
|
||||
if err := yaml.Unmarshal(yamlFile, &configData); err != nil {
|
||||
return configData, err
|
||||
}
|
||||
|
||||
return configData, nil
|
||||
}
|
||||
|
||||
@@ -25,6 +25,7 @@ type Task struct {
|
||||
RuntimeDuration float64 `json:"runtime_duration" bson:"runtime_duration"`
|
||||
TotalDuration float64 `json:"total_duration" bson:"total_duration"`
|
||||
Pid int `json:"pid" bson:"pid"`
|
||||
UserId bson.ObjectId `json:"user_id" bson:"user_id"`
|
||||
|
||||
// 前端数据
|
||||
SpiderName string `json:"spider_name"`
|
||||
@@ -61,6 +62,7 @@ func (t *Task) Save() error {
|
||||
defer s.Close()
|
||||
t.UpdateTs = time.Now()
|
||||
if err := c.UpdateId(t.Id, t); err != nil {
|
||||
log.Errorf("update task error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
@@ -93,7 +95,7 @@ func (t *Task) GetResults(pageNum int, pageSize int) (results []interface{}, tot
|
||||
query := bson.M{
|
||||
"task_id": t.Id,
|
||||
}
|
||||
if err = c.Find(query).Skip((pageNum - 1) * pageSize).Limit(pageSize).Sort("-create_ts").All(&results); err != nil {
|
||||
if err = c.Find(query).Skip((pageNum - 1) * pageSize).Limit(pageSize).All(&results); err != nil {
|
||||
return
|
||||
}
|
||||
|
||||
@@ -116,18 +118,12 @@ func GetTaskList(filter interface{}, skip int, limit int, sortKey string) ([]Tas
|
||||
|
||||
for i, task := range tasks {
|
||||
// 获取爬虫名称
|
||||
spider, err := task.GetSpider()
|
||||
if err != nil || spider.Id.Hex() == "" {
|
||||
_ = spider.Delete()
|
||||
} else {
|
||||
if spider, err := task.GetSpider(); err == nil {
|
||||
tasks[i].SpiderName = spider.DisplayName
|
||||
}
|
||||
|
||||
// 获取节点名称
|
||||
node, err := task.GetNode()
|
||||
if node.Id.Hex() == "" || err != nil {
|
||||
_ = task.Delete()
|
||||
} else {
|
||||
if node, err := task.GetNode(); err == nil {
|
||||
tasks[i].NodeName = node.Name
|
||||
}
|
||||
}
|
||||
@@ -141,6 +137,8 @@ func GetTaskListTotal(filter interface{}) (int, error) {
|
||||
var result int
|
||||
result, err := c.Find(filter).Count()
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return result, err
|
||||
}
|
||||
return result, nil
|
||||
@@ -152,6 +150,7 @@ func GetTask(id string) (Task, error) {
|
||||
|
||||
var task Task
|
||||
if err := c.FindId(id).One(&task); err != nil {
|
||||
log.Infof("get task error: %s, id: %s", err.Error(), id)
|
||||
debug.PrintStack()
|
||||
return task, err
|
||||
}
|
||||
@@ -166,6 +165,8 @@ func AddTask(item Task) error {
|
||||
item.UpdateTs = time.Now()
|
||||
|
||||
if err := c.Insert(&item); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
@@ -177,6 +178,8 @@ func RemoveTask(id string) error {
|
||||
|
||||
var result Task
|
||||
if err := c.FindId(id).One(&result); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
@@ -187,6 +190,20 @@ func RemoveTask(id string) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func RemoveTaskByStatus(status string) error {
|
||||
tasks, err := GetTaskList(bson.M{"status": status}, 0, constants.Infinite, "-create_ts")
|
||||
if err != nil {
|
||||
log.Error("get tasks error:" + err.Error())
|
||||
}
|
||||
for _, task := range tasks {
|
||||
if err := RemoveTask(task.Id); err != nil {
|
||||
log.Error("remove task error:" + err.Error())
|
||||
continue
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// 删除task by spider_id
|
||||
func RemoveTaskBySpiderId(id bson.ObjectId) error {
|
||||
tasks, err := GetTaskList(bson.M{"spider_id": id}, 0, constants.Infinite, "-create_ts")
|
||||
|
||||
@@ -16,11 +16,20 @@ type User struct {
|
||||
Username string `json:"username" bson:"username"`
|
||||
Password string `json:"password" bson:"password"`
|
||||
Role string `json:"role" bson:"role"`
|
||||
Email string `json:"email" bson:"email"`
|
||||
Setting UserSetting `json:"setting" bson:"setting"`
|
||||
|
||||
CreateTs time.Time `json:"create_ts" bson:"create_ts"`
|
||||
UpdateTs time.Time `json:"update_ts" bson:"update_ts"`
|
||||
}
|
||||
|
||||
type UserSetting struct {
|
||||
NotificationTrigger string `json:"notification_trigger" bson:"notification_trigger"`
|
||||
DingTalkRobotWebhook string `json:"ding_talk_robot_webhook" bson:"ding_talk_robot_webhook"`
|
||||
WechatRobotWebhook string `json:"wechat_robot_webhook" bson:"wechat_robot_webhook"`
|
||||
EnabledNotifications []string `json:"enabled_notifications" bson:"enabled_notifications"`
|
||||
}
|
||||
|
||||
func (user *User) Save() error {
|
||||
s, c := database.GetCol("users")
|
||||
defer s.Close()
|
||||
|
||||
97
backend/model/variable.go
Normal file
@@ -0,0 +1,97 @@
package model

import (
	"crawlab/database"
	"errors"
	"github.com/apex/log"
	"github.com/globalsign/mgo/bson"
	"runtime/debug"
)

/**
全局变量
*/

type Variable struct {
	Id     bson.ObjectId `json:"_id" bson:"_id"`
	Key    string        `json:"key" bson:"key"`
	Value  string        `json:"value" bson:"value"`
	Remark string        `json:"remark" bson:"remark"`
}

func (model *Variable) Save() error {
	s, c := database.GetCol("variable")
	defer s.Close()

	if err := c.UpdateId(model.Id, model); err != nil {
		log.Errorf("update variable error: %s", err.Error())
		return err
	}
	return nil
}

func (model *Variable) Add() error {
	s, c := database.GetCol("variable")
	defer s.Close()

	// key 去重
	_, err := GetByKey(model.Key)
	if err == nil {
		return errors.New("key already exists")
	}

	model.Id = bson.NewObjectId()
	if err := c.Insert(model); err != nil {
		log.Errorf("add variable error: %s", err.Error())
		debug.PrintStack()
		return err
	}
	return nil
}

func (model *Variable) Delete() error {
	s, c := database.GetCol("variable")
	defer s.Close()

	if err := c.RemoveId(model.Id); err != nil {
		log.Errorf("remove variable error: %s", err.Error())
		debug.PrintStack()
		return err
	}
	return nil
}

func GetByKey(key string) (Variable, error) {
	s, c := database.GetCol("variable")
	defer s.Close()

	var model Variable
	if err := c.Find(bson.M{"key": key}).One(&model); err != nil {
		log.Errorf("variable found error: %s, key: %s", err.Error(), key)
		return model, err
	}
	return model, nil
}

func GetVariable(id bson.ObjectId) (Variable, error) {
	s, c := database.GetCol("variable")
	defer s.Close()

	var model Variable
	if err := c.FindId(id).One(&model); err != nil {
		log.Errorf("variable found error: %s", err.Error())
		return model, err
	}
	return model, nil
}

func GetVariableList() []Variable {
	s, c := database.GetCol("variable")
	defer s.Close()

	var list []Variable
	if err := c.Find(nil).All(&list); err != nil {

	}
	return list
}
316
backend/routes/config_spider.go
Normal file
@@ -0,0 +1,316 @@
|
||||
package routes
|
||||
|
||||
import (
|
||||
"crawlab/constants"
|
||||
"crawlab/entity"
|
||||
"crawlab/model"
|
||||
"crawlab/services"
|
||||
"crawlab/utils"
|
||||
"fmt"
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"github.com/spf13/viper"
|
||||
"gopkg.in/yaml.v2"
|
||||
"io"
|
||||
"io/ioutil"
|
||||
"net/http"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// 添加可配置爬虫
|
||||
func PutConfigSpider(c *gin.Context) {
|
||||
var spider model.Spider
|
||||
if err := c.ShouldBindJSON(&spider); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 爬虫名称不能为空
|
||||
if spider.Name == "" {
|
||||
HandleErrorF(http.StatusBadRequest, c, "spider name should not be empty")
|
||||
return
|
||||
}
|
||||
|
||||
// 模版名不能为空
|
||||
if spider.Template == "" {
|
||||
HandleErrorF(http.StatusBadRequest, c, "spider template should not be empty")
|
||||
return
|
||||
}
|
||||
|
||||
// 判断爬虫是否存在
|
||||
if spider := model.GetSpiderByName(spider.Name); spider.Name != "" {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("spider for '%s' already exists", spider.Name))
|
||||
return
|
||||
}
|
||||
|
||||
// 设置爬虫类别
|
||||
spider.Type = constants.Configurable
|
||||
|
||||
// 将FileId置空
|
||||
spider.FileId = bson.ObjectIdHex(constants.ObjectIdNull)
|
||||
|
||||
// 创建爬虫目录
|
||||
spiderDir := filepath.Join(viper.GetString("spider.path"), spider.Name)
|
||||
if utils.Exists(spiderDir) {
|
||||
if err := os.RemoveAll(spiderDir); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
if err := os.MkdirAll(spiderDir, 0777); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
spider.Src = spiderDir
|
||||
|
||||
// 复制Spiderfile模版
|
||||
contentByte, err := ioutil.ReadFile("./template/spiderfile/Spiderfile." + spider.Template)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
f, err := os.Create(filepath.Join(spider.Src, "Spiderfile"))
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
defer f.Close()
|
||||
if _, err := f.Write(contentByte); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 添加爬虫到数据库
|
||||
if err := spider.Add(); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: spider,
|
||||
})
|
||||
}
|
||||
|
||||
// 更改可配置爬虫
|
||||
func PostConfigSpider(c *gin.Context) {
|
||||
PostSpider(c)
|
||||
}
|
||||
|
||||
// 上传可配置爬虫Spiderfile
|
||||
func UploadConfigSpider(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
|
||||
// 获取爬虫
|
||||
var spider model.Spider
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(id))
|
||||
if err != nil {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("cannot find spider (id: %s)", id))
|
||||
}
|
||||
|
||||
// 获取上传文件
|
||||
file, header, err := c.Request.FormFile("file")
|
||||
if err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 文件名称必须为Spiderfile
|
||||
filename := header.Filename
|
||||
if filename != "Spiderfile" && filename != "Spiderfile.yaml" && filename != "Spiderfile.yml" {
|
||||
HandleErrorF(http.StatusBadRequest, c, "filename must be 'Spiderfile(.yaml|.yml)'")
|
||||
return
|
||||
}
|
||||
|
||||
// 爬虫目录
|
||||
spiderDir := filepath.Join(viper.GetString("spider.path"), spider.Name)
|
||||
|
||||
// 爬虫Spiderfile文件路径
|
||||
sfPath := filepath.Join(spiderDir, filename)
|
||||
|
||||
// 创建(如果不存在)或打开Spiderfile(如果存在)
|
||||
var f *os.File
|
||||
if utils.Exists(sfPath) {
|
||||
f, err = os.OpenFile(sfPath, os.O_WRONLY, 0777)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
}
|
||||
} else {
|
||||
f, err = os.Create(sfPath)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
}
|
||||
}
|
||||
|
||||
// 将上传的文件拷贝到爬虫Spiderfile文件
|
||||
_, err = io.Copy(f, file)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 关闭Spiderfile文件
|
||||
_ = f.Close()
|
||||
|
||||
// 构造配置数据
|
||||
configData := entity.ConfigSpiderData{}
|
||||
|
||||
// 读取YAML文件
|
||||
yamlFile, err := ioutil.ReadFile(sfPath)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 反序列化
|
||||
if err := yaml.Unmarshal(yamlFile, &configData); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 根据序列化后的数据处理爬虫文件
|
||||
if err := services.ProcessSpiderFilesFromConfigData(spider, configData); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func PostConfigSpiderSpiderfile(c *gin.Context) {
|
||||
type Body struct {
|
||||
Content string `json:"content"`
|
||||
}
|
||||
|
||||
id := c.Param("id")
|
||||
|
||||
// 文件内容
|
||||
var reqBody Body
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
content := reqBody.Content
|
||||
|
||||
// 获取爬虫
|
||||
var spider model.Spider
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(id))
|
||||
if err != nil {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("cannot find spider (id: %s)", id))
|
||||
return
|
||||
}
|
||||
|
||||
// 反序列化
|
||||
var configData entity.ConfigSpiderData
|
||||
if err := yaml.Unmarshal([]byte(content), &configData); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 校验configData
|
||||
if err := services.ValidateSpiderfile(configData); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 写文件
|
||||
if err := ioutil.WriteFile(filepath.Join(spider.Src, "Spiderfile"), []byte(content), os.ModePerm); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 根据序列化后的数据处理爬虫文件
|
||||
if err := services.ProcessSpiderFilesFromConfigData(spider, configData); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func PostConfigSpiderConfig(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
|
||||
// 获取爬虫
|
||||
var spider model.Spider
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(id))
|
||||
if err != nil {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("cannot find spider (id: %s)", id))
|
||||
return
|
||||
}
|
||||
|
||||
// 反序列化配置数据
|
||||
var configData entity.ConfigSpiderData
|
||||
if err := c.ShouldBindJSON(&configData); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 校验configData
|
||||
if err := services.ValidateSpiderfile(configData); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 替换Spiderfile文件
|
||||
if err := services.GenerateSpiderfileFromConfigData(spider, configData); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 根据序列化后的数据处理爬虫文件
|
||||
if err := services.ProcessSpiderFilesFromConfigData(spider, configData); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func GetConfigSpiderConfig(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
|
||||
// 校验ID
|
||||
if !bson.IsObjectIdHex(id) {
|
||||
HandleErrorF(http.StatusBadRequest, c, "invalid id")
|
||||
}
|
||||
|
||||
// 获取爬虫
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(id))
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: spider.Config,
|
||||
})
|
||||
}
|
||||
|
||||
// 获取模版名称列表
|
||||
func GetConfigSpiderTemplateList(c *gin.Context) {
|
||||
var data []string
|
||||
for _, fInfo := range utils.ListDir("./template/spiderfile") {
|
||||
templateName := strings.Replace(fInfo.Name(), "Spiderfile.", "", -1)
|
||||
data = append(data, templateName)
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: data,
|
||||
})
|
||||
}
|
||||
190
backend/routes/projects.go
Normal file
@@ -0,0 +1,190 @@
|
||||
package routes
|
||||
|
||||
import (
|
||||
"crawlab/constants"
|
||||
"crawlab/database"
|
||||
"crawlab/model"
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"net/http"
|
||||
)
|
||||
|
||||
func GetProjectList(c *gin.Context) {
|
||||
tag := c.Query("tag")
|
||||
|
||||
// 筛选条件
|
||||
query := bson.M{}
|
||||
if tag != "" {
|
||||
query["tags"] = tag
|
||||
}
|
||||
|
||||
// 获取列表
|
||||
projects, err := model.GetProjectList(query, 0, "+_id")
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 获取总数
|
||||
total, err := model.GetProjectListTotal(query)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 获取每个项目的爬虫列表
|
||||
for i, p := range projects {
|
||||
spiders, err := p.GetSpiders()
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
projects[i].Spiders = spiders
|
||||
}
|
||||
|
||||
// 获取未被分配的爬虫数量
|
||||
if tag == "" {
|
||||
noProject := model.Project{
|
||||
Id: bson.ObjectIdHex(constants.ObjectIdNull),
|
||||
Name: "No Project",
|
||||
Description: "Not assigned to any project",
|
||||
}
|
||||
spiders, err := noProject.GetSpiders()
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
noProject.Spiders = spiders
|
||||
projects = append(projects, noProject)
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, ListResponse{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: projects,
|
||||
Total: total,
|
||||
})
|
||||
}
|
||||
|
||||
func PutProject(c *gin.Context) {
|
||||
// 绑定请求数据
|
||||
var p model.Project
|
||||
if err := c.ShouldBindJSON(&p); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
if err := p.Add(); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func PostProject(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
|
||||
if !bson.IsObjectIdHex(id) {
|
||||
HandleErrorF(http.StatusBadRequest, c, "invalid id")
|
||||
}
|
||||
|
||||
var item model.Project
|
||||
if err := c.ShouldBindJSON(&item); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
if err := model.UpdateProject(bson.ObjectIdHex(id), item); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func DeleteProject(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
|
||||
if !bson.IsObjectIdHex(id) {
|
||||
HandleErrorF(http.StatusBadRequest, c, "invalid id")
|
||||
return
|
||||
}
|
||||
|
||||
// 从数据库中删除该爬虫
|
||||
if err := model.RemoveProject(bson.ObjectIdHex(id)); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 获取相关的爬虫
|
||||
var spiders []model.Spider
|
||||
s, col := database.GetCol("spiders")
|
||||
defer s.Close()
|
||||
if err := col.Find(bson.M{"project_id": bson.ObjectIdHex(id)}).All(&spiders); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 将爬虫的项目ID置空
|
||||
for _, spider := range spiders {
|
||||
spider.ProjectId = bson.ObjectIdHex(constants.ObjectIdNull)
|
||||
if err := spider.Save(); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func GetProjectTags(c *gin.Context) {
|
||||
type Result struct {
|
||||
Tag string `json:"tag" bson:"tag"`
|
||||
}
|
||||
|
||||
s, col := database.GetCol("projects")
|
||||
defer s.Close()
|
||||
|
||||
pipeline := []bson.M{
|
||||
{
|
||||
"$unwind": "$tags",
|
||||
},
|
||||
{
|
||||
"$group": bson.M{
|
||||
"_id": "$tags",
|
||||
},
|
||||
},
|
||||
{
|
||||
"$sort": bson.M{
|
||||
"_id": 1,
|
||||
},
|
||||
},
|
||||
{
|
||||
"$addFields": bson.M{
|
||||
"tag": "$_id",
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
var items []Result
|
||||
if err := col.Pipe(pipeline).All(&items); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: items,
|
||||
})
|
||||
}
|
||||
@@ -14,11 +14,7 @@ func GetScheduleList(c *gin.Context) {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: results,
|
||||
})
|
||||
HandleSuccessData(c, results)
|
||||
}
|
||||
|
||||
func GetSchedule(c *gin.Context) {
|
||||
@@ -29,11 +25,8 @@ func GetSchedule(c *gin.Context) {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: result,
|
||||
})
|
||||
|
||||
HandleSuccessData(c, result)
|
||||
}
|
||||
|
||||
func PostSchedule(c *gin.Context) {
|
||||
@@ -48,7 +41,7 @@ func PostSchedule(c *gin.Context) {
|
||||
|
||||
// 验证cron表达式
|
||||
if err := services.ParserCron(newItem.Cron); err != nil {
|
||||
HandleError(http.StatusOK, c, err)
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
@@ -65,10 +58,7 @@ func PostSchedule(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
func PutSchedule(c *gin.Context) {
|
||||
@@ -82,10 +72,13 @@ func PutSchedule(c *gin.Context) {
|
||||
|
||||
// 验证cron表达式
|
||||
if err := services.ParserCron(item.Cron); err != nil {
|
||||
HandleError(http.StatusOK, c, err)
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 加入用户ID
|
||||
item.UserId = services.GetCurrentUser(c).Id
|
||||
|
||||
// 更新数据库
|
||||
if err := model.AddSchedule(item); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
@@ -98,10 +91,7 @@ func PutSchedule(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
func DeleteSchedule(c *gin.Context) {
|
||||
@@ -119,8 +109,25 @@ func DeleteSchedule(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
// 停止定时任务
|
||||
func DisableSchedule(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
if err := services.Sched.Disable(bson.ObjectIdHex(id)); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
// 运行定时任务
|
||||
func EnableSchedule(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
if err := services.Sched.Enable(bson.ObjectIdHex(id)); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
33
backend/routes/setting.go
Normal file
@@ -0,0 +1,33 @@
package routes

import (
	"github.com/gin-gonic/gin"
	"github.com/spf13/viper"
	"net/http"
)

type SettingBody struct {
	AllowRegister string `json:"allow_register"`
}

func GetVersion(c *gin.Context) {
	version := viper.GetString("version")

	c.JSON(http.StatusOK, Response{
		Status:  "ok",
		Message: "success",
		Data:    version,
	})
}

func GetSetting(c *gin.Context) {
	allowRegister := viper.GetString("setting.allowRegister")

	body := SettingBody{AllowRegister: allowRegister}

	c.JSON(http.StatusOK, Response{
		Status:  "ok",
		Message: "success",
		Data:    body,
	})
}
@@ -7,6 +7,7 @@ import (
|
||||
"crawlab/model"
|
||||
"crawlab/services"
|
||||
"crawlab/utils"
|
||||
"fmt"
|
||||
"github.com/apex/log"
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/globalsign/mgo"
|
||||
@@ -17,6 +18,7 @@ import (
|
||||
"io/ioutil"
|
||||
"net/http"
|
||||
"os"
|
||||
"path"
|
||||
"path/filepath"
|
||||
"runtime/debug"
|
||||
"strconv"
|
||||
@@ -25,22 +27,49 @@ import (
|
||||
)
|
||||
|
||||
func GetSpiderList(c *gin.Context) {
|
||||
pageNum, _ := c.GetQuery("pageNum")
|
||||
pageSize, _ := c.GetQuery("pageSize")
|
||||
pageNum, _ := c.GetQuery("page_num")
|
||||
pageSize, _ := c.GetQuery("page_size")
|
||||
keyword, _ := c.GetQuery("keyword")
|
||||
pid, _ := c.GetQuery("project_id")
|
||||
t, _ := c.GetQuery("type")
|
||||
sortKey, _ := c.GetQuery("sort_key")
|
||||
sortDirection, _ := c.GetQuery("sort_direction")
|
||||
|
||||
// 筛选
|
||||
filter := bson.M{
|
||||
"name": bson.M{"$regex": bson.RegEx{Pattern: keyword, Options: "im"}},
|
||||
}
|
||||
|
||||
if t != "" {
|
||||
if t != "" && t != "all" {
|
||||
filter["type"] = t
|
||||
}
|
||||
if pid == "" {
|
||||
// do nothing
|
||||
} else if pid == constants.ObjectIdNull {
|
||||
filter["$or"] = []bson.M{
|
||||
{"project_id": bson.ObjectIdHex(pid)},
|
||||
{"project_id": bson.M{"$exists": false}},
|
||||
}
|
||||
} else {
|
||||
filter["project_id"] = bson.ObjectIdHex(pid)
|
||||
}
|
||||
|
||||
// 排序
|
||||
sortStr := "-_id"
|
||||
if sortKey != "" && sortDirection != "" {
|
||||
if sortDirection == constants.DESCENDING {
|
||||
sortStr = "-" + sortKey
|
||||
} else if sortDirection == constants.ASCENDING {
|
||||
sortStr = "+" + sortKey
|
||||
} else {
|
||||
HandleErrorF(http.StatusBadRequest, c, "invalid sort_direction")
|
||||
}
|
||||
}
|
||||
|
||||
// 分页
|
||||
page := &entity.Page{}
|
||||
page.GetPage(pageNum, pageSize)
|
||||
results, count, err := model.GetSpiderList(filter, page.Skip, page.Limit)
|
||||
|
||||
results, count, err := model.GetSpiderList(filter, page.Skip, page.Limit, sortStr)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
@@ -117,6 +146,64 @@ func PublishSpider(c *gin.Context) {
|
||||
}
|
||||
|
||||
func PutSpider(c *gin.Context) {
|
||||
var spider model.Spider
|
||||
if err := c.ShouldBindJSON(&spider); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 爬虫名称不能为空
|
||||
if spider.Name == "" {
|
||||
HandleErrorF(http.StatusBadRequest, c, "spider name should not be empty")
|
||||
return
|
||||
}
|
||||
|
||||
// 判断爬虫是否存在
|
||||
if spider := model.GetSpiderByName(spider.Name); spider.Name != "" {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("spider for '%s' already exists", spider.Name))
|
||||
return
|
||||
}
|
||||
|
||||
// 设置爬虫类别
|
||||
spider.Type = constants.Customized
|
||||
|
||||
// 将FileId置空
|
||||
spider.FileId = bson.ObjectIdHex(constants.ObjectIdNull)
|
||||
|
||||
// 创建爬虫目录
|
||||
spiderDir := filepath.Join(viper.GetString("spider.path"), spider.Name)
|
||||
if utils.Exists(spiderDir) {
|
||||
if err := os.RemoveAll(spiderDir); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
if err := os.MkdirAll(spiderDir, 0777); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
spider.Src = spiderDir
|
||||
|
||||
// 添加爬虫到数据库
|
||||
if err := spider.Add(); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 同步到GridFS
|
||||
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: spider,
|
||||
})
|
||||
}
|
||||
|
||||
func UploadSpider(c *gin.Context) {
|
||||
// 从body中获取文件
|
||||
uploadFile, err := c.FormFile("file")
|
||||
if err != nil {
|
||||
@@ -125,6 +212,144 @@ func PutSpider(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
// 获取参数
|
||||
name := c.PostForm("name")
|
||||
displayName := c.PostForm("display_name")
|
||||
col := c.PostForm("col")
|
||||
cmd := c.PostForm("cmd")
|
||||
|
||||
// 如果不为zip文件,返回错误
|
||||
if !strings.HasSuffix(uploadFile.Filename, ".zip") {
|
||||
HandleError(http.StatusBadRequest, c, errors.New("not a valid zip file"))
|
||||
return
|
||||
}
|
||||
|
||||
// 以防tmp目录不存在
|
||||
tmpPath := viper.GetString("other.tmppath")
|
||||
if !utils.Exists(tmpPath) {
|
||||
if err := os.MkdirAll(tmpPath, os.ModePerm); err != nil {
|
||||
log.Error("mkdir other.tmppath dir error:" + err.Error())
|
||||
debug.PrintStack()
|
||||
HandleError(http.StatusBadRequest, c, errors.New("mkdir other.tmppath dir error"))
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
// 保存到本地临时文件
|
||||
randomId := uuid.NewV4()
|
||||
tmpFilePath := filepath.Join(tmpPath, randomId.String()+".zip")
|
||||
if err := c.SaveUploadedFile(uploadFile, tmpFilePath); err != nil {
|
||||
log.Error("save upload file error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 获取 GridFS 实例
|
||||
s, gf := database.GetGridFs("files")
|
||||
defer s.Close()
|
||||
|
||||
// 判断文件是否已经存在
|
||||
var gfFile model.GridFs
|
||||
if err := gf.Find(bson.M{"filename": uploadFile.Filename}).One(&gfFile); err == nil {
|
||||
// 已经存在文件,则删除
|
||||
_ = gf.RemoveId(gfFile.Id)
|
||||
}
|
||||
|
||||
// 上传到GridFs
|
||||
fid, err := services.UploadToGridFs(uploadFile.Filename, tmpFilePath)
|
||||
if err != nil {
|
||||
log.Errorf("upload to grid fs error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
idx := strings.LastIndex(uploadFile.Filename, "/")
|
||||
targetFilename := uploadFile.Filename[idx+1:]
|
||||
|
||||
// 判断爬虫是否存在
|
||||
spiderName := strings.Replace(targetFilename, ".zip", "", 1)
|
||||
if name != "" {
|
||||
spiderName = name
|
||||
}
|
||||
spider := model.GetSpiderByName(spiderName)
|
||||
if spider.Name == "" {
|
||||
// 保存爬虫信息
|
||||
srcPath := viper.GetString("spider.path")
|
||||
spider := model.Spider{
|
||||
Name: spiderName,
|
||||
DisplayName: spiderName,
|
||||
Type: constants.Customized,
|
||||
Src: filepath.Join(srcPath, spiderName),
|
||||
FileId: fid,
|
||||
}
|
||||
if name != "" {
|
||||
spider.Name = name
|
||||
}
|
||||
if displayName != "" {
|
||||
spider.DisplayName = displayName
|
||||
}
|
||||
if col != "" {
|
||||
spider.Col = col
|
||||
}
|
||||
if cmd != "" {
|
||||
spider.Cmd = cmd
|
||||
}
|
||||
_ = spider.Add()
|
||||
} else {
|
||||
if name != "" {
|
||||
spider.Name = name
|
||||
}
|
||||
if displayName != "" {
|
||||
spider.DisplayName = displayName
|
||||
}
|
||||
if col != "" {
|
||||
spider.Col = col
|
||||
}
|
||||
if cmd != "" {
|
||||
spider.Cmd = cmd
|
||||
}
|
||||
// 更新file_id
|
||||
spider.FileId = fid
|
||||
_ = spider.Save()
|
||||
}
|
||||
|
||||
// 发起同步
|
||||
services.PublishAllSpiders()
|
||||
|
||||
// 获取爬虫
|
||||
spider = model.GetSpiderByName(spiderName)
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: spider,
|
||||
})
|
||||
}
|
||||
|
||||
func UploadSpiderFromId(c *gin.Context) {
|
||||
// TODO: 与 UploadSpider 部分逻辑重复,需要优化代码
|
||||
// 爬虫ID
|
||||
spiderId := c.Param("id")
|
||||
|
||||
// 获取爬虫
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
|
||||
if err != nil {
|
||||
if err == mgo.ErrNotFound {
|
||||
HandleErrorF(http.StatusNotFound, c, "cannot find spider")
|
||||
} else {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
}
|
||||
return
|
||||
}
|
||||
|
||||
// 从body中获取文件
|
||||
uploadFile, err := c.FormFile("file")
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 如果不为zip文件,返回错误
|
||||
if !strings.HasSuffix(uploadFile.Filename, ".zip") {
|
||||
debug.PrintStack()
|
||||
@@ -153,6 +378,7 @@ func PutSpider(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
// 获取 GridFS 实例
|
||||
s, gf := database.GetGridFs("files")
|
||||
defer s.Close()
|
||||
|
||||
@@ -171,28 +397,12 @@ func PutSpider(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
idx := strings.LastIndex(uploadFile.Filename, "/")
|
||||
targetFilename := uploadFile.Filename[idx+1:]
|
||||
// 更新file_id
|
||||
spider.FileId = fid
|
||||
_ = spider.Save()
|
||||
|
||||
// 判断爬虫是否存在
|
||||
spiderName := strings.Replace(targetFilename, ".zip", "", 1)
|
||||
spider := model.GetSpiderByName(spiderName)
|
||||
if spider == nil {
|
||||
// 保存爬虫信息
|
||||
srcPath := viper.GetString("spider.path")
|
||||
spider := model.Spider{
|
||||
Name: spiderName,
|
||||
DisplayName: spiderName,
|
||||
Type: constants.Customized,
|
||||
Src: filepath.Join(srcPath, spiderName),
|
||||
FileId: fid,
|
||||
}
|
||||
_ = spider.Add()
|
||||
} else {
|
||||
// 更新file_id
|
||||
spider.FileId = fid
|
||||
_ = spider.Save()
|
||||
}
|
||||
// 发起同步
|
||||
services.PublishSpider(spider)
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
@@ -241,6 +451,8 @@ func GetSpiderTasks(c *gin.Context) {
|
||||
})
|
||||
}
|
||||
|
||||
// 爬虫文件管理
|
||||
|
||||
func GetSpiderDir(c *gin.Context) {
|
||||
// 爬虫ID
|
||||
id := c.Param("id")
|
||||
@@ -282,6 +494,12 @@ func GetSpiderDir(c *gin.Context) {
|
||||
})
|
||||
}
|
||||
|
||||
type SpiderFileReqBody struct {
|
||||
Path string `json:"path"`
|
||||
Content string `json:"content"`
|
||||
NewPath string `json:"new_path"`
|
||||
}
|
||||
|
||||
func GetSpiderFile(c *gin.Context) {
|
||||
// 爬虫ID
|
||||
id := c.Param("id")
|
||||
@@ -310,9 +528,34 @@ func GetSpiderFile(c *gin.Context) {
|
||||
})
|
||||
}
|
||||
|
||||
type SpiderFileReqBody struct {
|
||||
Path string `json:"path"`
|
||||
Content string `json:"content"`
|
||||
func GetSpiderFileTree(c *gin.Context) {
|
||||
// 爬虫ID
|
||||
id := c.Param("id")
|
||||
|
||||
// 获取爬虫
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(id))
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 获取目录下文件列表
|
||||
spiderPath := viper.GetString("spider.path")
|
||||
spiderFilePath := filepath.Join(spiderPath, spider.Name)
|
||||
|
||||
// 获取文件目录树
|
||||
fileNodeTree, err := services.GetFileNodeTree(spiderFilePath, 0)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 返回结果
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: fileNodeTree,
|
||||
})
|
||||
}
|
||||
|
||||
func PostSpiderFile(c *gin.Context) {
|
||||
@@ -339,6 +582,12 @@ func PostSpiderFile(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
// 同步到GridFS
|
||||
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 返回结果
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
@@ -346,17 +595,158 @@ func PostSpiderFile(c *gin.Context) {
|
||||
})
|
||||
}
|
||||
|
||||
// 爬虫类型
|
||||
func GetSpiderTypes(c *gin.Context) {
|
||||
types, err := model.GetSpiderTypes()
|
||||
func PutSpiderFile(c *gin.Context) {
|
||||
spiderId := c.Param("id")
|
||||
var reqBody SpiderFileReqBody
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 文件路径
|
||||
filePath := path.Join(spider.Src, reqBody.Path)
|
||||
|
||||
// 如果文件已存在,则报错
|
||||
if utils.Exists(filePath) {
|
||||
HandleErrorF(http.StatusInternalServerError, c, fmt.Sprintf(`%s already exists`, filePath))
|
||||
return
|
||||
}
|
||||
|
||||
// 写入文件
|
||||
if err := ioutil.WriteFile(filePath, []byte(reqBody.Content), 0777); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 同步到GridFS
|
||||
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func PutSpiderDir(c *gin.Context) {
|
||||
spiderId := c.Param("id")
|
||||
var reqBody SpiderFileReqBody
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 文件路径
|
||||
filePath := path.Join(spider.Src, reqBody.Path)
|
||||
|
||||
// 如果文件已存在,则报错
|
||||
if utils.Exists(filePath) {
|
||||
HandleErrorF(http.StatusInternalServerError, c, fmt.Sprintf(`%s already exists`, filePath))
|
||||
return
|
||||
}
|
||||
|
||||
// 创建文件夹
|
||||
if err := os.MkdirAll(filePath, 0777); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 同步到GridFS
|
||||
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func DeleteSpiderFile(c *gin.Context) {
|
||||
spiderId := c.Param("id")
|
||||
var reqBody SpiderFileReqBody
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
filePath := path.Join(spider.Src, reqBody.Path)
|
||||
if err := os.RemoveAll(filePath); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 同步到GridFS
|
||||
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func RenameSpiderFile(c *gin.Context) {
|
||||
spiderId := c.Param("id")
|
||||
var reqBody SpiderFileReqBody
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
}
|
||||
spider, err := model.GetSpider(bson.ObjectIdHex(spiderId))
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 原文件路径
|
||||
filePath := path.Join(spider.Src, reqBody.Path)
|
||||
newFilePath := path.Join(path.Join(path.Dir(filePath), reqBody.NewPath))
|
||||
|
||||
// 如果新文件已存在,则报错
|
||||
if utils.Exists(newFilePath) {
|
||||
HandleErrorF(http.StatusInternalServerError, c, fmt.Sprintf(`%s already exists`, newFilePath))
|
||||
return
|
||||
}
|
||||
|
||||
// 重命名
|
||||
if err := os.Rename(filePath, newFilePath); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 删除原文件
|
||||
if err := os.RemoveAll(filePath); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
}
|
||||
|
||||
// 同步到GridFS
|
||||
if err := services.UploadSpiderToGridFsFromMaster(spider); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: types,
|
||||
})
|
||||
}
|
||||
|
||||
@@ -479,3 +869,25 @@ func GetSpiderStats(c *gin.Context) {
|
||||
},
|
||||
})
|
||||
}
|
||||
|
||||
func GetSpiderSchedules(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
|
||||
if !bson.IsObjectIdHex(id) {
|
||||
HandleErrorF(http.StatusBadRequest, c, "spider_id is invalid")
|
||||
return
|
||||
}
|
||||
|
||||
// 获取定时任务
|
||||
list, err := model.GetScheduleList(bson.M{"spider_id": bson.ObjectIdHex(id)})
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: list,
|
||||
})
|
||||
}
|
||||
|
||||
316
backend/routes/system.go
Normal file
@@ -0,0 +1,316 @@
|
||||
package routes
|
||||
|
||||
import (
|
||||
"crawlab/constants"
|
||||
"crawlab/entity"
|
||||
"crawlab/services"
|
||||
"fmt"
|
||||
"github.com/gin-gonic/gin"
|
||||
"net/http"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func GetLangList(c *gin.Context) {
|
||||
nodeId := c.Param("id")
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: services.GetLangList(nodeId),
|
||||
})
|
||||
}
|
||||
|
||||
func GetDepList(c *gin.Context) {
|
||||
nodeId := c.Param("id")
|
||||
lang := c.Query("lang")
|
||||
depName := c.Query("dep_name")
|
||||
|
||||
var depList []entity.Dependency
|
||||
if lang == constants.Python {
|
||||
list, err := services.GetPythonDepList(nodeId, depName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
depList = list
|
||||
} else if lang == constants.Nodejs {
|
||||
list, err := services.GetNodejsDepList(nodeId, depName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
depList = list
|
||||
} else {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", lang))
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: depList,
|
||||
})
|
||||
}
|
||||
|
||||
func GetInstalledDepList(c *gin.Context) {
|
||||
nodeId := c.Param("id")
|
||||
lang := c.Query("lang")
|
||||
var depList []entity.Dependency
|
||||
if lang == constants.Python {
|
||||
if services.IsMasterNode(nodeId) {
|
||||
list, err := services.GetPythonLocalInstalledDepList(nodeId)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
depList = list
|
||||
} else {
|
||||
list, err := services.GetPythonRemoteInstalledDepList(nodeId)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
depList = list
|
||||
}
|
||||
} else if lang == constants.Nodejs {
|
||||
if services.IsMasterNode(nodeId) {
|
||||
list, err := services.GetNodejsLocalInstalledDepList(nodeId)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
depList = list
|
||||
} else {
|
||||
list, err := services.GetNodejsRemoteInstalledDepList(nodeId)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
depList = list
|
||||
}
|
||||
} else {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", lang))
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: depList,
|
||||
})
|
||||
}
|
||||
|
||||
func GetAllDepList(c *gin.Context) {
|
||||
lang := c.Param("lang")
|
||||
depName := c.Query("dep_name")
|
||||
|
||||
// 获取所有依赖列表
|
||||
var list []string
|
||||
if lang == constants.Python {
|
||||
_list, err := services.GetPythonDepListFromRedis()
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
list = _list
|
||||
} else {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", lang))
|
||||
return
|
||||
}
|
||||
|
||||
// 过滤依赖列表
|
||||
var depList []string
|
||||
for _, name := range list {
|
||||
if strings.HasPrefix(strings.ToLower(name), strings.ToLower(depName)) {
|
||||
depList = append(depList, name)
|
||||
}
|
||||
}
|
||||
|
||||
// 只取前20
|
||||
var returnList []string
|
||||
for i, name := range depList {
|
||||
if i >= 10 {
|
||||
break
|
||||
}
|
||||
returnList = append(returnList, name)
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: returnList,
|
||||
})
|
||||
}
|
||||
|
||||
func InstallDep(c *gin.Context) {
|
||||
type ReqBody struct {
|
||||
Lang string `json:"lang"`
|
||||
DepName string `json:"dep_name"`
|
||||
}
|
||||
|
||||
nodeId := c.Param("id")
|
||||
|
||||
var reqBody ReqBody
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
if reqBody.Lang == constants.Python {
|
||||
if services.IsMasterNode(nodeId) {
|
||||
_, err := services.InstallPythonLocalDep(reqBody.DepName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
} else {
|
||||
_, err := services.InstallPythonRemoteDep(nodeId, reqBody.DepName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
} else if reqBody.Lang == constants.Nodejs {
|
||||
if services.IsMasterNode(nodeId) {
|
||||
_, err := services.InstallNodejsLocalDep(reqBody.DepName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
} else {
|
||||
_, err := services.InstallNodejsRemoteDep(nodeId, reqBody.DepName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
} else {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", reqBody.Lang))
|
||||
return
|
||||
}
|
||||
|
||||
// TODO: check if install is successful
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func UninstallDep(c *gin.Context) {
|
||||
type ReqBody struct {
|
||||
Lang string `json:"lang"`
|
||||
DepName string `json:"dep_name"`
|
||||
}
|
||||
|
||||
nodeId := c.Param("id")
|
||||
|
||||
var reqBody ReqBody
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
}
|
||||
|
||||
if reqBody.Lang == constants.Python {
|
||||
if services.IsMasterNode(nodeId) {
|
||||
_, err := services.UninstallPythonLocalDep(reqBody.DepName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
} else {
|
||||
_, err := services.UninstallPythonRemoteDep(nodeId, reqBody.DepName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
} else if reqBody.Lang == constants.Nodejs {
|
||||
if services.IsMasterNode(nodeId) {
|
||||
_, err := services.UninstallNodejsLocalDep(reqBody.DepName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
} else {
|
||||
_, err := services.UninstallNodejsRemoteDep(nodeId, reqBody.DepName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
} else {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", reqBody.Lang))
|
||||
return
|
||||
}
|
||||
|
||||
// TODO: check if uninstall is successful
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func GetDepJson(c *gin.Context) {
|
||||
depName := c.Param("dep_name")
|
||||
lang := c.Param("lang")
|
||||
|
||||
var dep entity.Dependency
|
||||
if lang == constants.Python {
|
||||
_dep, err := services.FetchPythonDepInfo(depName)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
dep = _dep
|
||||
} else {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", lang))
|
||||
return
|
||||
}
|
||||
|
||||
c.Header("Cache-Control", "max-age=86400")
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: dep,
|
||||
})
|
||||
}
|
||||
|
||||
func InstallLang(c *gin.Context) {
|
||||
type ReqBody struct {
|
||||
Lang string `json:"lang"`
|
||||
}
|
||||
|
||||
nodeId := c.Param("id")
|
||||
|
||||
var reqBody ReqBody
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
if reqBody.Lang == constants.Nodejs {
|
||||
if services.IsMasterNode(nodeId) {
|
||||
_, err := services.InstallNodejsLocalLang()
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
} else {
|
||||
_, err := services.InstallNodejsRemoteLang(nodeId)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
} else {
|
||||
HandleErrorF(http.StatusBadRequest, c, fmt.Sprintf("%s is not implemented", reqBody.Lang))
|
||||
return
|
||||
}
|
||||
|
||||
// TODO: check if install is successful
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
@@ -9,7 +9,6 @@ import (
|
||||
"encoding/csv"
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
uuid "github.com/satori/go.uuid"
|
||||
"net/http"
|
||||
)
|
||||
|
||||
@@ -18,6 +17,7 @@ type TaskListRequestData struct {
|
||||
PageSize int `form:"page_size"`
|
||||
NodeId string `form:"node_id"`
|
||||
SpiderId string `form:"spider_id"`
|
||||
Status string `form:"status"`
|
||||
}
|
||||
|
||||
type TaskResultsRequestData struct {
|
||||
@@ -29,14 +29,14 @@ func GetTaskList(c *gin.Context) {
|
||||
// 绑定数据
|
||||
data := TaskListRequestData{}
|
||||
if err := c.ShouldBindQuery(&data); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
if data.PageNum == 0 {
|
||||
data.PageNum = 1
|
||||
}
|
||||
if data.PageSize == 0 {
|
||||
data.PageNum = 10
|
||||
data.PageSize = 10
|
||||
}
|
||||
|
||||
// 过滤条件
|
||||
@@ -47,6 +47,10 @@ func GetTaskList(c *gin.Context) {
|
||||
if data.SpiderId != "" {
|
||||
query["spider_id"] = bson.ObjectIdHex(data.SpiderId)
|
||||
}
|
||||
//新增根据任务状态获取task列表
|
||||
if data.Status != "" {
|
||||
query["status"] = data.Status
|
||||
}
|
||||
|
||||
// 获取任务列表
|
||||
tasks, err := model.GetTaskList(query, (data.PageNum-1)*data.PageSize, data.PageSize, "-create_ts")
|
||||
@@ -78,49 +82,117 @@ func GetTask(c *gin.Context) {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: result,
|
||||
})
|
||||
HandleSuccessData(c, result)
|
||||
}
|
||||
|
||||
func PutTask(c *gin.Context) {
|
||||
// 生成任务ID
|
||||
id := uuid.NewV4()
|
||||
type TaskRequestBody struct {
|
||||
SpiderId bson.ObjectId `json:"spider_id"`
|
||||
RunType string `json:"run_type"`
|
||||
NodeIds []bson.ObjectId `json:"node_ids"`
|
||||
Param string `json:"param"`
|
||||
}
|
||||
|
||||
// 绑定数据
|
||||
var t model.Task
|
||||
if err := c.ShouldBindJSON(&t); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
t.Id = id.String()
|
||||
t.Status = constants.StatusPending
|
||||
|
||||
// 如果没有传入node_id,则置为null
|
||||
if t.NodeId.Hex() == "" {
|
||||
t.NodeId = bson.ObjectIdHex(constants.ObjectIdNull)
|
||||
}
|
||||
|
||||
// 将任务存入数据库
|
||||
if err := model.AddTask(t); err != nil {
|
||||
var reqBody TaskRequestBody
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 加入任务队列
|
||||
if err := services.AssignTask(t); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
if reqBody.RunType == constants.RunTypeAllNodes {
|
||||
// 所有节点
|
||||
nodes, err := model.GetNodeList(nil)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
for _, node := range nodes {
|
||||
t := model.Task{
|
||||
SpiderId: reqBody.SpiderId,
|
||||
NodeId: node.Id,
|
||||
Param: reqBody.Param,
|
||||
UserId: services.GetCurrentUser(c).Id,
|
||||
}
|
||||
|
||||
if err := services.AddTask(t); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
} else if reqBody.RunType == constants.RunTypeRandom {
|
||||
// 随机
|
||||
t := model.Task{
|
||||
SpiderId: reqBody.SpiderId,
|
||||
Param: reqBody.Param,
|
||||
UserId: services.GetCurrentUser(c).Id,
|
||||
}
|
||||
if err := services.AddTask(t); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
} else if reqBody.RunType == constants.RunTypeSelectedNodes {
|
||||
// 指定节点
|
||||
for _, nodeId := range reqBody.NodeIds {
|
||||
t := model.Task{
|
||||
SpiderId: reqBody.SpiderId,
|
||||
NodeId: nodeId,
|
||||
Param: reqBody.Param,
|
||||
UserId: services.GetCurrentUser(c).Id,
|
||||
}
|
||||
|
||||
if err := services.AddTask(t); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
} else {
|
||||
HandleErrorF(http.StatusInternalServerError, c, "invalid run_type")
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
func DeleteTaskByStatus(c *gin.Context) {
|
||||
status := c.Query("status")
|
||||
|
||||
//删除相应的日志文件
|
||||
if err := services.RemoveLogByTaskStatus(status); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
//删除该状态下的task
|
||||
if err := model.RemoveTaskByStatus(status); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
// 删除多个任务
|
||||
func DeleteMultipleTask(c *gin.Context) {
|
||||
ids := make(map[string][]string)
|
||||
if err := c.ShouldBindJSON(&ids); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
list := ids["ids"]
|
||||
for _, id := range list {
|
||||
if err := services.RemoveLogByTaskId(id); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
if err := model.RemoveTask(id); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
// 删除单个任务
|
||||
func DeleteTask(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
|
||||
@@ -129,33 +201,22 @@ func DeleteTask(c *gin.Context) {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
// 删除task
|
||||
if err := model.RemoveTask(id); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
func GetTaskLog(c *gin.Context) {
|
||||
id := c.Param("id")
|
||||
|
||||
logStr, err := services.GetTaskLog(id)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: logStr,
|
||||
})
|
||||
HandleSuccessData(c, logStr)
|
||||
}
|
||||
|
||||
func GetTaskResults(c *gin.Context) {
|
||||
@@ -164,7 +225,7 @@ func GetTaskResults(c *gin.Context) {
|
||||
// 绑定数据
|
||||
data := TaskResultsRequestData{}
|
||||
if err := c.ShouldBindQuery(&data); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
@@ -266,9 +327,5 @@ func CancelTask(c *gin.Context) {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
@@ -21,6 +21,8 @@ type UserListRequestData struct {
|
||||
type UserRequestData struct {
|
||||
Username string `json:"username"`
|
||||
Password string `json:"password"`
|
||||
Role string `json:"role"`
|
||||
Email string `json:"email"`
|
||||
}
|
||||
|
||||
func GetUser(c *gin.Context) {
|
||||
@@ -88,13 +90,13 @@ func PutUser(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
// 添加用户
|
||||
user := model.User{
|
||||
Username: strings.ToLower(reqData.Username),
|
||||
Password: utils.EncryptPassword(reqData.Password),
|
||||
Role: constants.RoleNormal,
|
||||
// 默认为正常用户
|
||||
if reqData.Role == "" {
|
||||
reqData.Role = constants.RoleNormal
|
||||
}
|
||||
if err := user.Add(); err != nil {
|
||||
|
||||
// 添加用户
|
||||
if err := services.CreateNewUser(reqData.Username, reqData.Password, reqData.Role, reqData.Email); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
@@ -199,3 +201,41 @@ func GetMe(c *gin.Context) {
|
||||
User: user,
|
||||
}, nil)
|
||||
}
|
||||
|
||||
func PostMe(c *gin.Context) {
|
||||
ctx := context.WithGinContext(c)
|
||||
user := ctx.User()
|
||||
if user == nil {
|
||||
ctx.FailedWithError(constants.ErrorUserNotFound, http.StatusUnauthorized)
|
||||
return
|
||||
}
|
||||
var reqBody model.User
|
||||
if err := c.ShouldBindJSON(&reqBody); err != nil {
|
||||
HandleErrorF(http.StatusBadRequest, c, "invalid request")
|
||||
return
|
||||
}
|
||||
if reqBody.Email != "" {
|
||||
user.Email = reqBody.Email
|
||||
}
|
||||
if reqBody.Password != "" {
|
||||
user.Password = utils.EncryptPassword(reqBody.Password)
|
||||
}
|
||||
if reqBody.Setting.NotificationTrigger != "" {
|
||||
user.Setting.NotificationTrigger = reqBody.Setting.NotificationTrigger
|
||||
}
|
||||
if reqBody.Setting.DingTalkRobotWebhook != "" {
|
||||
user.Setting.DingTalkRobotWebhook = reqBody.Setting.DingTalkRobotWebhook
|
||||
}
|
||||
if reqBody.Setting.WechatRobotWebhook != "" {
|
||||
user.Setting.WechatRobotWebhook = reqBody.Setting.WechatRobotWebhook
|
||||
}
|
||||
user.Setting.EnabledNotifications = reqBody.Setting.EnabledNotifications
|
||||
if err := user.Save(); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
c.JSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
@@ -1,17 +1,15 @@
|
||||
package routes
|
||||
|
||||
import (
|
||||
"github.com/apex/log"
|
||||
"github.com/gin-gonic/gin"
|
||||
"net/http"
|
||||
"runtime/debug"
|
||||
)
|
||||
|
||||
func HandleError(statusCode int, c *gin.Context, err error) {
|
||||
log.Errorf("handle error:" + err.Error())
|
||||
debug.PrintStack()
|
||||
c.AbortWithStatusJSON(statusCode, Response{
|
||||
Status: "ok",
|
||||
Message: "error",
|
||||
Status: "error",
|
||||
Message: "failure",
|
||||
Error: err.Error(),
|
||||
})
|
||||
}
|
||||
@@ -24,3 +22,18 @@ func HandleErrorF(statusCode int, c *gin.Context, err string) {
|
||||
Error: err,
|
||||
})
|
||||
}
|
||||
|
||||
func HandleSuccess(c *gin.Context) {
|
||||
c.AbortWithStatusJSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
})
|
||||
}
|
||||
|
||||
func HandleSuccessData(c *gin.Context, data interface{}) {
|
||||
c.AbortWithStatusJSON(http.StatusOK, Response{
|
||||
Status: "ok",
|
||||
Message: "success",
|
||||
Data: data,
|
||||
})
|
||||
}
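The two success helpers above are what the task, user and variable routes in this commit switch to. Below is a hypothetical handler, not part of this changeset, showing the intended pattern; it assumes it lives in the same routes package and reuses model.GetSchedule, which appears later in this commit.

```go
package routes

import (
	"crawlab/model"
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/globalsign/mgo/bson"
)

// GetScheduleExample is a hypothetical handler: bail out through HandleError,
// finish through HandleSuccessData so every success response shares one envelope.
func GetScheduleExample(c *gin.Context) {
	id := c.Param("id")
	schedule, err := model.GetSchedule(bson.ObjectIdHex(id))
	if err != nil {
		HandleError(http.StatusInternalServerError, c, err)
		return
	}
	HandleSuccessData(c, schedule)
}
```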
backend/routes/variable.go (new file, 62 lines)
@@ -0,0 +1,62 @@
|
||||
package routes
|
||||
|
||||
import (
|
||||
"crawlab/model"
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"net/http"
|
||||
)
|
||||
|
||||
// 新增
|
||||
func PutVariable(c *gin.Context) {
|
||||
var variable model.Variable
|
||||
if err := c.ShouldBindJSON(&variable); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
if err := variable.Add(); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
// 修改
|
||||
func PostVariable(c *gin.Context) {
|
||||
var id = c.Param("id")
|
||||
var variable model.Variable
|
||||
if err := c.ShouldBindJSON(&variable); err != nil {
|
||||
HandleError(http.StatusBadRequest, c, err)
|
||||
return
|
||||
}
|
||||
variable.Id = bson.ObjectIdHex(id)
|
||||
if err := variable.Save(); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
HandleSuccess(c)
|
||||
}
|
||||
|
||||
// 删除
|
||||
func DeleteVariable(c *gin.Context) {
|
||||
var idStr = c.Param("id")
|
||||
var id = bson.ObjectIdHex(idStr)
|
||||
variable, err := model.GetVariable(id)
|
||||
if err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
variable.Id = id
|
||||
if err := variable.Delete(); err != nil {
|
||||
HandleError(http.StatusInternalServerError, c, err)
|
||||
return
|
||||
}
|
||||
HandleSuccess(c)
|
||||
|
||||
}
|
||||
|
||||
// 列表
|
||||
func GetVariableList(c *gin.Context) {
|
||||
list := model.GetVariableList()
|
||||
HandleSuccessData(c, list)
|
||||
}
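The diff does not show where these four handlers get registered, so the group path and server setup below are assumptions; the sketch only illustrates how they would plug into a gin router.

```go
package main

import (
	"crawlab/routes"

	"github.com/gin-gonic/gin"
)

// registerVariableRoutes is a hypothetical wiring helper; the real paths and
// middleware live in the project's router setup, which this diff does not show.
func registerVariableRoutes(g *gin.RouterGroup) {
	g.GET("/variables", routes.GetVariableList)
	g.PUT("/variables", routes.PutVariable)
	g.POST("/variables/:id", routes.PostVariable)
	g.DELETE("/variables/:id", routes.DeleteVariable)
}

func main() {
	app := gin.Default()
	registerVariableRoutes(app.Group("/api"))
	_ = app.Run(":8000")
}
```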
backend/scripts/install-nodejs.sh (new file, 17 lines)
@@ -0,0 +1,17 @@
#!/usr/bin/env bash

# install nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.35.2/install.sh | bash
export NVM_DIR="$([ -z "${XDG_CONFIG_HOME-}" ] && printf %s "${HOME}/.nvm" || printf %s "${XDG_CONFIG_HOME}/nvm")"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm

# install Node.js v8.12
nvm install 8.12

# create soft links
ln -s $HOME/.nvm/versions/node/v8.12.0/bin/npm /usr/local/bin/npm
ln -s $HOME/.nvm/versions/node/v8.12.0/bin/node /usr/local/bin/node

# environment variables
export NODE_PATH=$HOME/.nvm/versions/node/v8.12.0/lib/node_modules
export PATH=$NODE_PATH:$PATH
backend/services/config_spider.go (new file, 273 lines)
@@ -0,0 +1,273 @@
|
||||
package services
|
||||
|
||||
import (
|
||||
"crawlab/constants"
|
||||
"crawlab/database"
|
||||
"crawlab/entity"
|
||||
"crawlab/model"
|
||||
"crawlab/model/config_spider"
|
||||
"crawlab/services/spider_handler"
|
||||
"crawlab/utils"
|
||||
"errors"
|
||||
"fmt"
|
||||
"github.com/apex/log"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
uuid "github.com/satori/go.uuid"
|
||||
"github.com/spf13/viper"
|
||||
"gopkg.in/yaml.v2"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func GenerateConfigSpiderFiles(spider model.Spider, configData entity.ConfigSpiderData) error {
|
||||
// 校验Spiderfile正确性
|
||||
if err := ValidateSpiderfile(configData); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// 构造代码生成器
|
||||
generator := config_spider.ScrapyGenerator{
|
||||
Spider: spider,
|
||||
ConfigData: configData,
|
||||
}
|
||||
|
||||
// 生成代码
|
||||
if err := generator.Generate(); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// 验证Spiderfile
|
||||
func ValidateSpiderfile(configData entity.ConfigSpiderData) error {
|
||||
// 获取所有字段
|
||||
fields := config_spider.GetAllFields(configData)
|
||||
|
||||
// 校验是否存在 start_url
|
||||
if configData.StartUrl == "" {
|
||||
return errors.New("spiderfile invalid: start_url is empty")
|
||||
}
|
||||
|
||||
// 校验是否存在 start_stage
|
||||
if configData.StartStage == "" {
|
||||
return errors.New("spiderfile invalid: start_stage is empty")
|
||||
}
|
||||
|
||||
// 校验是否存在 stages
|
||||
if len(configData.Stages) == 0 {
|
||||
return errors.New("spiderfile invalid: stages is empty")
|
||||
}
|
||||
|
||||
// 校验stages
|
||||
dict := map[string]int{}
|
||||
for _, stage := range configData.Stages {
|
||||
stageName := stage.Name
|
||||
|
||||
// stage 名称不能为空
|
||||
if stageName == "" {
|
||||
return errors.New("spiderfile invalid: stage name is empty")
|
||||
}
|
||||
|
||||
// stage 名称不能为保留字符串
|
||||
// NOTE: 如果有其他Engine,可以扩展,默认为Scrapy
|
||||
if configData.Engine == "" || configData.Engine == constants.EngineScrapy {
|
||||
if strings.Contains(constants.ScrapyProtectedStageNames, stageName) {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: stage name '%s' is protected", stageName))
|
||||
}
|
||||
} else {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: engine '%s' is not implemented", configData.Engine))
|
||||
}
|
||||
|
||||
// stage 名称不能重复
|
||||
if dict[stageName] == 1 {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: stage name '%s' is duplicated", stageName))
|
||||
}
|
||||
dict[stageName] = 1
|
||||
|
||||
// stage 字段不能为空
|
||||
if len(stage.Fields) == 0 {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: stage '%s' has no fields", stageName))
|
||||
}
|
||||
|
||||
// 是否包含 next_stage
|
||||
hasNextStage := false
|
||||
|
||||
// 遍历字段列表
|
||||
for _, field := range stage.Fields {
|
||||
// stage 的 next stage 只能有一个
|
||||
if field.NextStage != "" {
|
||||
if hasNextStage {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: stage '%s' has more than 1 next_stage", stageName))
|
||||
}
|
||||
hasNextStage = true
|
||||
}
|
||||
|
||||
// 字段里 css 和 xpath 只能包含一个
|
||||
if field.Css != "" && field.Xpath != "" {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: field '%s' in stage '%s' has both css and xpath set which is prohibited", field.Name, stageName))
|
||||
}
|
||||
}
|
||||
|
||||
// stage 里 page_css 和 page_xpath 只能包含一个
|
||||
if stage.PageCss != "" && stage.PageXpath != "" {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: stage '%s' has both page_css and page_xpath set which is prohibited", stageName))
|
||||
}
|
||||
|
||||
// stage 里 list_css 和 list_xpath 只能包含一个
|
||||
if stage.ListCss != "" && stage.ListXpath != "" {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: stage '%s' has both list_css and list_xpath set which is prohibited", stageName))
|
||||
}
|
||||
|
||||
// 如果 stage 的 is_list 为 true 但 list_css 为空,报错
|
||||
if stage.IsList && (stage.ListCss == "" && stage.ListXpath == "") {
|
||||
return errors.New("spiderfile invalid: stage with is_list = true should have either list_css or list_xpath being set")
|
||||
}
|
||||
}
|
||||
|
||||
// 校验字段唯一性
|
||||
if !IsUniqueConfigSpiderFields(fields) {
|
||||
return errors.New("spiderfile invalid: fields not unique")
|
||||
}
|
||||
|
||||
// 字段名称不能为保留字符串
|
||||
for _, field := range fields {
|
||||
if strings.Contains(constants.ScrapyProtectedFieldNames, field.Name) {
|
||||
return errors.New(fmt.Sprintf("spiderfile invalid: field name '%s' is protected", field.Name))
|
||||
}
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func IsUniqueConfigSpiderFields(fields []entity.Field) bool {
|
||||
dict := map[string]int{}
|
||||
for _, field := range fields {
|
||||
if dict[field.Name] == 1 {
|
||||
return false
|
||||
}
|
||||
dict[field.Name] = 1
|
||||
}
|
||||
return true
|
||||
}
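For reference, here is a minimal ConfigSpiderData that should pass every check above. The stage element type name (entity.Stage) and the concrete stage and field names are assumptions, since the protected-name constants are not visible in this diff; the snippet is meant to run inside the services package, where entity and the apex log package are already imported.

```go
// Minimal, hypothetical configurable-spider definition: one stage, one list
// selector, CSS only, no duplicate names, no next_stage.
cfg := entity.ConfigSpiderData{
	StartUrl:   "https://example.com/list",
	StartStage: "list",
	Stages: []entity.Stage{
		{
			Name:    "list",
			IsList:  true,
			ListCss: ".item",
			Fields: []entity.Field{
				{Name: "title", Css: ".title"},
				{Name: "price", Css: ".price"},
			},
		},
	},
}
if err := ValidateSpiderfile(cfg); err != nil {
	log.Errorf("invalid Spiderfile: %s", err.Error())
}
```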
|
||||
|
||||
func ProcessSpiderFilesFromConfigData(spider model.Spider, configData entity.ConfigSpiderData) error {
|
||||
spiderDir := spider.Src
|
||||
|
||||
// 删除已有的爬虫文件
|
||||
for _, fInfo := range utils.ListDir(spiderDir) {
|
||||
// 不删除Spiderfile
|
||||
if fInfo.Name() == "Spiderfile" {
|
||||
continue
|
||||
}
|
||||
|
||||
// 删除其他文件
|
||||
if err := os.RemoveAll(filepath.Join(spiderDir, fInfo.Name())); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
|
||||
// 拷贝爬虫文件
|
||||
tplDir := "./template/scrapy"
|
||||
for _, fInfo := range utils.ListDir(tplDir) {
|
||||
// 跳过Spiderfile
|
||||
if fInfo.Name() == "Spiderfile" {
|
||||
continue
|
||||
}
|
||||
|
||||
srcPath := filepath.Join(tplDir, fInfo.Name())
|
||||
if fInfo.IsDir() {
|
||||
dirPath := filepath.Join(spiderDir, fInfo.Name())
|
||||
if err := utils.CopyDir(srcPath, dirPath); err != nil {
|
||||
return err
|
||||
}
|
||||
} else {
|
||||
if err := utils.CopyFile(srcPath, filepath.Join(spiderDir, fInfo.Name())); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 更改爬虫文件
|
||||
if err := GenerateConfigSpiderFiles(spider, configData); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// 打包为 zip 文件
|
||||
files, err := utils.GetFilesFromDir(spiderDir)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
randomId := uuid.NewV4()
|
||||
tmpFilePath := filepath.Join(viper.GetString("other.tmppath"), spider.Name+"."+randomId.String()+".zip")
|
||||
spiderZipFileName := spider.Name + ".zip"
|
||||
if err := utils.Compress(files, tmpFilePath); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// 获取 GridFS 实例
|
||||
s, gf := database.GetGridFs("files")
|
||||
defer s.Close()
|
||||
|
||||
// 判断文件是否已经存在
|
||||
var gfFile model.GridFs
|
||||
if err := gf.Find(bson.M{"filename": spiderZipFileName}).One(&gfFile); err == nil {
|
||||
// 已经存在文件,则删除
|
||||
_ = gf.RemoveId(gfFile.Id)
|
||||
}
|
||||
|
||||
// 上传到GridFs
|
||||
fid, err := UploadToGridFs(spiderZipFileName, tmpFilePath)
|
||||
if err != nil {
|
||||
log.Errorf("upload to grid fs error: %s", err.Error())
|
||||
return err
|
||||
}
|
||||
|
||||
// 保存爬虫 FileId
|
||||
spider.FileId = fid
|
||||
_ = spider.Save()
|
||||
|
||||
// 获取爬虫同步实例
|
||||
spiderSync := spider_handler.SpiderSync{
|
||||
Spider: spider,
|
||||
}
|
||||
|
||||
// 获取gfFile
|
||||
gfFile2 := model.GetGridFs(spider.FileId)
|
||||
|
||||
// 生成MD5
|
||||
spiderSync.CreateMd5File(gfFile2.Md5)
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func GenerateSpiderfileFromConfigData(spider model.Spider, configData entity.ConfigSpiderData) error {
|
||||
// Spiderfile 路径
|
||||
sfPath := filepath.Join(spider.Src, "Spiderfile")
|
||||
|
||||
// 生成Yaml内容
|
||||
sfContentByte, err := yaml.Marshal(configData)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// 打开文件
|
||||
var f *os.File
|
||||
if utils.Exists(sfPath) {
|
||||
f, err = os.OpenFile(sfPath, os.O_WRONLY|os.O_TRUNC, 0777)
|
||||
} else {
|
||||
f, err = os.OpenFile(sfPath, os.O_CREATE, 0777)
|
||||
}
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer f.Close()
|
||||
|
||||
// 写入内容
|
||||
if _, err := f.Write(sfContentByte); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
backend/services/file.go (new file, 65 lines)
@@ -0,0 +1,65 @@
|
||||
package services
|
||||
|
||||
import (
|
||||
"crawlab/model"
|
||||
"github.com/apex/log"
|
||||
"os"
|
||||
"path"
|
||||
"runtime/debug"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func GetFileNodeTree(dstPath string, level int) (f model.File, err error) {
|
||||
return getFileNodeTree(dstPath, level, dstPath)
|
||||
}
|
||||
|
||||
func getFileNodeTree(dstPath string, level int, rootPath string) (f model.File, err error) {
|
||||
dstF, err := os.Open(dstPath)
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return f, err
|
||||
}
|
||||
defer dstF.Close()
|
||||
fileInfo, err := dstF.Stat()
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return f, nil
|
||||
}
|
||||
if !fileInfo.IsDir() { //如果dstF是文件
|
||||
return model.File{
|
||||
Label: fileInfo.Name(),
|
||||
Name: fileInfo.Name(),
|
||||
Path: strings.Replace(dstPath, rootPath, "", -1),
|
||||
IsDir: false,
|
||||
Size: fileInfo.Size(),
|
||||
Children: nil,
|
||||
}, nil
|
||||
} else { //如果dstF是文件夹
|
||||
dir, err := dstF.Readdir(0) //获取文件夹下各个文件或文件夹的fileInfo
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return f, nil
|
||||
}
|
||||
f = model.File{
|
||||
Label: path.Base(dstPath),
|
||||
Name: path.Base(dstPath),
|
||||
Path: strings.Replace(dstPath, rootPath, "", -1),
|
||||
IsDir: true,
|
||||
Size: 0,
|
||||
Children: nil,
|
||||
}
|
||||
for _, subFileInfo := range dir {
|
||||
subFileNode, err := getFileNodeTree(path.Join(dstPath, subFileInfo.Name()), level+1, rootPath)
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return f, err
|
||||
}
|
||||
f.Children = append(f.Children, subFileNode)
|
||||
}
|
||||
return f, nil
|
||||
}
|
||||
}
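GetFileNodeTree walks a directory recursively into a nested model.File. Below is a hedged sketch of exposing a spider's source tree through it; the handler name, route and reuse of viper's spider.path setting are assumptions for illustration, not part of this commit.

```go
// Hypothetical handler in the routes package; GetFileNodeTree comes from the
// services file above, HandleError/HandleSuccessData from handle.go.
func GetSpiderFileTree(c *gin.Context) {
	dir := filepath.Join(viper.GetString("spider.path"), c.Param("name"))
	tree, err := services.GetFileNodeTree(dir, 0)
	if err != nil {
		HandleError(http.StatusInternalServerError, c, err)
		return
	}
	HandleSuccessData(c, tree)
}
```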
|
||||
@@ -49,10 +49,8 @@ func GetRemoteLog(task model.Task) (logStr string, err error) {
|
||||
select {
|
||||
case logStr = <-ch:
|
||||
log.Infof("get remote log")
|
||||
break
|
||||
case <-time.After(30 * time.Second):
|
||||
logStr = "get remote log timeout"
|
||||
break
|
||||
}
|
||||
|
||||
return logStr, nil
|
||||
@@ -119,6 +117,18 @@ func RemoveLogByTaskId(id string) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func RemoveLogByTaskStatus(status string) error {
|
||||
tasks, err := model.GetTaskList(bson.M{"status": status}, 0, constants.Infinite, "-create_ts")
|
||||
if err != nil {
|
||||
log.Error("get tasks error:" + err.Error())
|
||||
return err
|
||||
}
|
||||
for _, task := range tasks {
|
||||
RemoveLogByTaskId(task.Id)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func removeLog(t model.Task) {
|
||||
if err := RemoveLocalLog(t.LogPath); err != nil {
|
||||
log.Errorf("remove local log error: %s", err.Error())
|
||||
|
||||
@@ -12,6 +12,7 @@ import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"github.com/apex/log"
|
||||
"github.com/globalsign/mgo"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"github.com/gomodule/redigo/redis"
|
||||
"runtime/debug"
|
||||
@@ -50,36 +51,44 @@ func GetNodeData() (Data, error) {
|
||||
return data, err
|
||||
}
|
||||
|
||||
func GetRedisNode(key string) (*Data, error) {
|
||||
// 获取节点数据
|
||||
value, err := database.RedisClient.HGet("nodes", key)
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
return nil, err
|
||||
}
|
||||
|
||||
// 解析节点列表数据
|
||||
var data Data
|
||||
if err := json.Unmarshal([]byte(value), &data); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
return nil, err
|
||||
}
|
||||
return &data, nil
|
||||
}
|
||||
|
||||
// 更新所有节点状态
|
||||
func UpdateNodeStatus() {
|
||||
// 从Redis获取节点keys
|
||||
list, err := database.RedisClient.HKeys("nodes")
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
log.Errorf("get redis node keys error: %s", err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
// 遍历节点keys
|
||||
for _, key := range list {
|
||||
// 获取节点数据
|
||||
value, err := database.RedisClient.HGet("nodes", key)
|
||||
|
||||
data, err := GetRedisNode(key)
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
return
|
||||
continue
|
||||
}
|
||||
|
||||
// 解析节点列表数据
|
||||
var data Data
|
||||
if err := json.Unmarshal([]byte(value), &data); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
// 如果记录的更新时间超过60秒,该节点被认为离线
|
||||
if time.Now().Unix()-data.UpdateTsUnix > 60 {
|
||||
// 在Redis中删除该节点
|
||||
if err := database.RedisClient.HDel("nodes", data.Key); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
log.Errorf("delete redis node key error:%s, key:%s", err.Error(), data.Key)
|
||||
}
|
||||
continue
|
||||
}
|
||||
@@ -94,22 +103,21 @@ func UpdateNodeStatus() {
|
||||
model.ResetNodeStatusToOffline(list)
|
||||
}
|
||||
|
||||
func handleNodeInfo(key string, data Data) {
|
||||
// 处理节点信息
|
||||
func handleNodeInfo(key string, data *Data) {
|
||||
// 添加同步锁
|
||||
v, err := database.RedisClient.Lock(key)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
defer database.RedisClient.UnLock(key, v)
|
||||
|
||||
// 更新节点信息到数据库
|
||||
s, c := database.GetCol("nodes")
|
||||
defer s.Close()
|
||||
|
||||
// 同个key可能因为并发,被注册多次
|
||||
var nodes []model.Node
|
||||
_ = c.Find(bson.M{"key": key}).All(&nodes)
|
||||
if len(nodes) > 1 {
|
||||
for _, node := range nodes {
|
||||
_ = c.RemoveId(node.Id)
|
||||
}
|
||||
}
|
||||
|
||||
var node model.Node
|
||||
if err := c.Find(bson.M{"key": key}).One(&node); err != nil {
|
||||
if err := c.Find(bson.M{"key": key}).One(&node); err != nil && err == mgo.ErrNotFound {
|
||||
// 数据库不存在该节点
|
||||
node = model.Node{
|
||||
Key: key,
|
||||
@@ -126,7 +134,7 @@ func handleNodeInfo(key string, data Data) {
|
||||
log.Errorf(err.Error())
|
||||
return
|
||||
}
|
||||
} else {
|
||||
} else if node.Key != "" {
|
||||
// 数据库存在该节点
|
||||
node.Status = constants.StatusOnline
|
||||
node.UpdateTs = time.Now()
|
||||
@@ -160,6 +168,7 @@ func UpdateNodeData() {
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
// 构造节点数据
|
||||
data := Data{
|
||||
Key: key,
|
||||
@@ -177,10 +186,12 @@ func UpdateNodeData() {
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
if err := database.RedisClient.HSet("nodes", key, utils.BytesToString(dataBytes)); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func MasterNodeCallback(message redis.Message) (err error) {
|
||||
@@ -258,7 +269,7 @@ func InitNodeService() error {
|
||||
return err
|
||||
}
|
||||
|
||||
// 如果为主节点,每30秒刷新所有节点信息
|
||||
// 如果为主节点,每10秒刷新所有节点信息
|
||||
if model.IsMaster() {
|
||||
spec := "*/10 * * * * *"
|
||||
if _, err := c.AddFunc(spec, UpdateNodeStatus); err != nil {
|
||||
backend/services/notification/mail.go (new file, 138 lines)
@@ -0,0 +1,138 @@
|
||||
package notification
|
||||
|
||||
import (
|
||||
"errors"
|
||||
"github.com/apex/log"
|
||||
"github.com/matcornic/hermes"
|
||||
"gopkg.in/gomail.v2"
|
||||
"net/mail"
|
||||
"os"
|
||||
"runtime/debug"
|
||||
"strconv"
|
||||
)
|
||||
|
||||
func SendMail(toEmail string, toName string, subject string, content string) error {
|
||||
// hermes instance
|
||||
h := hermes.Hermes{
|
||||
Theme: new(hermes.Default),
|
||||
Product: hermes.Product{
|
||||
Name: "Crawlab Team",
|
||||
Copyright: "© 2019 Crawlab, Made by Crawlab-Team",
|
||||
},
|
||||
}
|
||||
|
||||
// config
|
||||
port, _ := strconv.Atoi(os.Getenv("CRAWLAB_NOTIFICATION_MAIL_PORT"))
|
||||
password := os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SMTP_PASSWORD")
|
||||
SMTPUser := os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SMTP_USER")
|
||||
smtpConfig := smtpAuthentication{
|
||||
Server: os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SERVER"),
|
||||
Port: port,
|
||||
SenderEmail: os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SENDEREMAIL"),
|
||||
SenderIdentity: os.Getenv("CRAWLAB_NOTIFICATION_MAIL_SENDERIDENTITY"),
|
||||
SMTPPassword: password,
|
||||
SMTPUser: SMTPUser,
|
||||
}
|
||||
options := sendOptions{
|
||||
To: toEmail,
|
||||
Subject: subject,
|
||||
}
|
||||
|
||||
// email instance
|
||||
email := hermes.Email{
|
||||
Body: hermes.Body{
|
||||
Name: toName,
|
||||
FreeMarkdown: hermes.Markdown(content + GetFooter()),
|
||||
},
|
||||
}
|
||||
|
||||
// generate html
|
||||
html, err := h.GenerateHTML(email)
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
// generate text
|
||||
text, err := h.GeneratePlainText(email)
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
// send the email
|
||||
if err := send(smtpConfig, options, html, text); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
type smtpAuthentication struct {
|
||||
Server string
|
||||
Port int
|
||||
SenderEmail string
|
||||
SenderIdentity string
|
||||
SMTPUser string
|
||||
SMTPPassword string
|
||||
}
|
||||
|
||||
// sendOptions are options for sending an email
|
||||
type sendOptions struct {
|
||||
To string
|
||||
Subject string
|
||||
}
|
||||
|
||||
// send sends the email
|
||||
func send(smtpConfig smtpAuthentication, options sendOptions, htmlBody string, txtBody string) error {
|
||||
|
||||
if smtpConfig.Server == "" {
|
||||
return errors.New("SMTP server config is empty")
|
||||
}
|
||||
if smtpConfig.Port == 0 {
|
||||
return errors.New("SMTP port config is empty")
|
||||
}
|
||||
|
||||
if smtpConfig.SMTPUser == "" {
|
||||
return errors.New("SMTP user is empty")
|
||||
}
|
||||
|
||||
if smtpConfig.SenderIdentity == "" {
|
||||
return errors.New("SMTP sender identity is empty")
|
||||
}
|
||||
|
||||
if smtpConfig.SenderEmail == "" {
|
||||
return errors.New("SMTP sender email is empty")
|
||||
}
|
||||
|
||||
if options.To == "" {
|
||||
return errors.New("no receiver emails configured")
|
||||
}
|
||||
|
||||
from := mail.Address{
|
||||
Name: smtpConfig.SenderIdentity,
|
||||
Address: smtpConfig.SenderEmail,
|
||||
}
|
||||
|
||||
m := gomail.NewMessage()
|
||||
m.SetHeader("From", from.String())
|
||||
m.SetHeader("To", options.To)
|
||||
m.SetHeader("Subject", options.Subject)
|
||||
|
||||
m.SetBody("text/plain", txtBody)
|
||||
m.AddAlternative("text/html", htmlBody)
|
||||
|
||||
d := gomail.NewPlainDialer(smtpConfig.Server, smtpConfig.Port, smtpConfig.SMTPUser, smtpConfig.SMTPPassword)
|
||||
|
||||
return d.DialAndSend(m)
|
||||
}
|
||||
|
||||
func GetFooter() string {
|
||||
return `
|
||||
[Github](https://github.com/crawlab-team/crawlab) | [Documentation](http://docs.crawlab.cn) | [Docker](https://hub.docker.com/r/tikazyq/crawlab)
|
||||
`
|
||||
}
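SendMail reads all SMTP settings from CRAWLAB_NOTIFICATION_MAIL_* environment variables (the names match the os.Getenv calls above). A minimal sketch of driving it follows; every value is a placeholder, not a default shipped with Crawlab, and in a deployment these variables would be set on the container or host rather than in code.

```go
_ = os.Setenv("CRAWLAB_NOTIFICATION_MAIL_SERVER", "smtp.example.com")
_ = os.Setenv("CRAWLAB_NOTIFICATION_MAIL_PORT", "465")
_ = os.Setenv("CRAWLAB_NOTIFICATION_MAIL_SENDEREMAIL", "crawlab@example.com")
_ = os.Setenv("CRAWLAB_NOTIFICATION_MAIL_SENDERIDENTITY", "Crawlab Bot")
_ = os.Setenv("CRAWLAB_NOTIFICATION_MAIL_SMTP_USER", "crawlab@example.com")
_ = os.Setenv("CRAWLAB_NOTIFICATION_MAIL_SMTP_PASSWORD", "app-password")

if err := notification.SendMail("ops@example.com", "Ops",
	"Task finished", "Spider **demo** finished with 120 results."); err != nil {
	log.Errorf("send mail error: %s", err.Error())
}
```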
backend/services/notification/mobile.go (new file, 59 lines)
@@ -0,0 +1,59 @@
|
||||
package notification
|
||||
|
||||
import (
|
||||
"errors"
|
||||
"github.com/apex/log"
|
||||
"github.com/imroc/req"
|
||||
"runtime/debug"
|
||||
)
|
||||
|
||||
func SendMobileNotification(webhook string, title string, content string) error {
|
||||
type ResBody struct {
|
||||
ErrCode int `json:"errcode"`
|
||||
ErrMsg string `json:"errmsg"`
|
||||
}
|
||||
|
||||
// 请求头
|
||||
header := req.Header{
|
||||
"Content-Type": "application/json; charset=utf-8",
|
||||
}
|
||||
|
||||
// 请求数据
|
||||
data := req.Param{
|
||||
"msgtype": "markdown",
|
||||
"markdown": req.Param{
|
||||
"title": title,
|
||||
"text": content,
|
||||
"content": content,
|
||||
},
|
||||
"at": req.Param{
|
||||
"atMobiles": []string{},
|
||||
"isAtAll": false,
|
||||
},
|
||||
}
|
||||
|
||||
// 发起请求
|
||||
res, err := req.Post(webhook, header, req.BodyJSON(&data))
|
||||
if err != nil {
|
||||
log.Errorf("dingtalk notification error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
// 解析响应
|
||||
var resBody ResBody
|
||||
if err := res.ToJSON(&resBody); err != nil {
|
||||
log.Errorf("dingtalk notification error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
// 判断响应是否报错
|
||||
if resBody.ErrCode != 0 {
|
||||
log.Errorf("dingtalk notification error: " + resBody.ErrMsg)
|
||||
debug.PrintStack()
|
||||
return errors.New(resBody.ErrMsg)
|
||||
}
|
||||
|
||||
return nil
|
||||
}
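The same markdown body can be pushed to a DingTalk-compatible robot webhook. A hedged usage sketch; the webhook URL shape is the usual DingTalk form and the token is a placeholder.

```go
webhook := "https://oapi.dingtalk.com/robot/send?access_token=<your-token>"
if err := notification.SendMobileNotification(webhook,
	"Task finished", "Spider **demo** finished with 120 results."); err != nil {
	log.Errorf("mobile notification error: %s", err.Error())
}
```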
|
||||
@@ -6,6 +6,7 @@ import (
|
||||
"net"
|
||||
"reflect"
|
||||
"runtime/debug"
|
||||
"sync"
|
||||
)
|
||||
|
||||
type Register interface {
|
||||
@@ -97,25 +98,31 @@ func getMac() (string, error) {
|
||||
var register Register
|
||||
|
||||
// 获得注册器
|
||||
func GetRegister() Register {
|
||||
if register != nil {
|
||||
return register
|
||||
}
|
||||
var once sync.Once
|
||||
|
||||
registerType := viper.GetString("server.register.type")
|
||||
if registerType == "mac" {
|
||||
register = &MacRegister{}
|
||||
} else {
|
||||
ip := viper.GetString("server.register.ip")
|
||||
if ip == "" {
|
||||
log.Error("server.register.ip is empty")
|
||||
debug.PrintStack()
|
||||
return nil
|
||||
func GetRegister() Register {
|
||||
once.Do(func() {
|
||||
|
||||
if register != nil {
|
||||
register = register
|
||||
}
|
||||
register = &IpRegister{
|
||||
Ip: ip,
|
||||
|
||||
registerType := viper.GetString("server.register.type")
|
||||
if registerType == "mac" {
|
||||
register = &MacRegister{}
|
||||
} else {
|
||||
ip := viper.GetString("server.register.ip")
|
||||
if ip == "" {
|
||||
log.Error("server.register.ip is empty")
|
||||
debug.PrintStack()
|
||||
register = nil
|
||||
}
|
||||
register = &IpRegister{
|
||||
Ip: ip,
|
||||
}
|
||||
}
|
||||
}
|
||||
log.Info("register type is :" + reflect.TypeOf(register).String())
|
||||
log.Info("register type is :" + reflect.TypeOf(register).String())
|
||||
|
||||
})
|
||||
return register
|
||||
}
|
||||
backend/services/rpc.go (new file, 234 lines)
@@ -0,0 +1,234 @@
|
||||
package services
|
||||
|
||||
import (
|
||||
"crawlab/constants"
|
||||
"crawlab/database"
|
||||
"crawlab/entity"
|
||||
"crawlab/model"
|
||||
"crawlab/utils"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"github.com/apex/log"
|
||||
"github.com/gomodule/redigo/redis"
|
||||
uuid "github.com/satori/go.uuid"
|
||||
"runtime/debug"
|
||||
)
|
||||
|
||||
type RpcMessage struct {
|
||||
Id string `json:"id"`
|
||||
Method string `json:"method"`
|
||||
Params map[string]string `json:"params"`
|
||||
Result string `json:"result"`
|
||||
}
|
||||
|
||||
func RpcServerInstallLang(msg RpcMessage) RpcMessage {
|
||||
lang := GetRpcParam("lang", msg.Params)
|
||||
if lang == constants.Nodejs {
|
||||
output, _ := InstallNodejsLocalLang()
|
||||
msg.Result = output
|
||||
}
|
||||
return msg
|
||||
}
|
||||
|
||||
func RpcClientInstallLang(nodeId string, lang string) (output string, err error) {
|
||||
params := map[string]string{}
|
||||
params["lang"] = lang
|
||||
|
||||
data, err := RpcClientFunc(nodeId, constants.RpcInstallLang, params, 600)()
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
|
||||
output = data
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
func RpcServerInstallDep(msg RpcMessage) RpcMessage {
|
||||
lang := GetRpcParam("lang", msg.Params)
|
||||
depName := GetRpcParam("dep_name", msg.Params)
|
||||
if lang == constants.Python {
|
||||
output, _ := InstallPythonLocalDep(depName)
|
||||
msg.Result = output
|
||||
}
|
||||
return msg
|
||||
}
|
||||
|
||||
func RpcClientInstallDep(nodeId string, lang string, depName string) (output string, err error) {
|
||||
params := map[string]string{}
|
||||
params["lang"] = lang
|
||||
params["dep_name"] = depName
|
||||
|
||||
data, err := RpcClientFunc(nodeId, constants.RpcInstallDep, params, 10)()
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
|
||||
output = data
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
func RpcServerUninstallDep(msg RpcMessage) RpcMessage {
|
||||
lang := GetRpcParam("lang", msg.Params)
|
||||
depName := GetRpcParam("dep_name", msg.Params)
|
||||
if lang == constants.Python {
|
||||
output, _ := UninstallPythonLocalDep(depName)
|
||||
msg.Result = output
|
||||
}
|
||||
return msg
|
||||
}
|
||||
|
||||
func RpcClientUninstallDep(nodeId string, lang string, depName string) (output string, err error) {
|
||||
params := map[string]string{}
|
||||
params["lang"] = lang
|
||||
params["dep_name"] = depName
|
||||
|
||||
data, err := RpcClientFunc(nodeId, constants.RpcUninstallDep, params, 60)()
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
|
||||
output = data
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
func RpcServerGetInstalledDepList(nodeId string, msg RpcMessage) RpcMessage {
|
||||
lang := GetRpcParam("lang", msg.Params)
|
||||
if lang == constants.Python {
|
||||
depList, _ := GetPythonLocalInstalledDepList(nodeId)
|
||||
resultStr, _ := json.Marshal(depList)
|
||||
msg.Result = string(resultStr)
|
||||
} else if lang == constants.Nodejs {
|
||||
depList, _ := GetNodejsLocalInstalledDepList(nodeId)
|
||||
resultStr, _ := json.Marshal(depList)
|
||||
msg.Result = string(resultStr)
|
||||
}
|
||||
return msg
|
||||
}
|
||||
|
||||
func RpcClientGetInstalledDepList(nodeId string, lang string) (list []entity.Dependency, err error) {
|
||||
params := map[string]string{}
|
||||
params["lang"] = lang
|
||||
|
||||
data, err := RpcClientFunc(nodeId, constants.RpcGetInstalledDepList, params, 10)()
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
|
||||
// 反序列化结果
|
||||
if err := json.Unmarshal([]byte(data), &list); err != nil {
|
||||
return list, err
|
||||
}
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
func RpcClientFunc(nodeId string, method string, params map[string]string, timeout int) func() (string, error) {
|
||||
return func() (result string, err error) {
|
||||
// 请求ID
|
||||
id := uuid.NewV4().String()
|
||||
|
||||
// 构造RPC消息
|
||||
msg := RpcMessage{
|
||||
Id: id,
|
||||
Method: method,
|
||||
Params: params,
|
||||
Result: "",
|
||||
}
|
||||
|
||||
// 发送RPC消息
|
||||
msgStr := ObjectToString(msg)
|
||||
if err := database.RedisClient.LPush(fmt.Sprintf("rpc:%s", nodeId), msgStr); err != nil {
|
||||
return result, err
|
||||
}
|
||||
|
||||
// 获取RPC回复消息
|
||||
dataStr, err := database.RedisClient.BRPop(fmt.Sprintf("rpc:%s", nodeId), timeout)
|
||||
if err != nil {
|
||||
return result, err
|
||||
}
|
||||
|
||||
// 反序列化消息
|
||||
if err := json.Unmarshal([]byte(dataStr), &msg); err != nil {
|
||||
return result, err
|
||||
}
|
||||
|
||||
return msg.Result, err
|
||||
}
|
||||
}
|
||||
|
||||
func GetRpcParam(key string, params map[string]string) string {
|
||||
return params[key]
|
||||
}
|
||||
|
||||
func ObjectToString(params interface{}) string {
|
||||
bytes, _ := json.Marshal(params)
|
||||
return utils.BytesToString(bytes)
|
||||
}
|
||||
|
||||
var IsRpcStopped = false
|
||||
|
||||
func StopRpcService() {
|
||||
IsRpcStopped = true
|
||||
}
|
||||
|
||||
func InitRpcService() error {
|
||||
go func() {
|
||||
for {
|
||||
// 获取当前节点
|
||||
node, err := model.GetCurrentNode()
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
|
||||
// 获取获取消息队列信息
|
||||
dataStr, err := database.RedisClient.BRPop(fmt.Sprintf("rpc:%s", node.Id.Hex()), 0)
|
||||
if err != nil {
|
||||
if err != redis.ErrNil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
}
|
||||
continue
|
||||
}
|
||||
|
||||
// 反序列化消息
|
||||
var msg RpcMessage
|
||||
if err := json.Unmarshal([]byte(dataStr), &msg); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
|
||||
// 根据Method调用本地方法
|
||||
var replyMsg RpcMessage
|
||||
if msg.Method == constants.RpcInstallDep {
|
||||
replyMsg = RpcServerInstallDep(msg)
|
||||
} else if msg.Method == constants.RpcUninstallDep {
|
||||
replyMsg = RpcServerUninstallDep(msg)
|
||||
} else if msg.Method == constants.RpcInstallLang {
|
||||
replyMsg = RpcServerInstallLang(msg)
|
||||
} else if msg.Method == constants.RpcGetInstalledDepList {
|
||||
replyMsg = RpcServerGetInstalledDepList(node.Id.Hex(), msg)
|
||||
} else {
|
||||
continue
|
||||
}
|
||||
|
||||
// 发送返回消息
|
||||
if err := database.RedisClient.LPush(fmt.Sprintf("rpc:%s", node.Id.Hex()), ObjectToString(replyMsg)); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
|
||||
// 如果停止RPC服务,则返回
|
||||
if IsRpcStopped {
|
||||
return
|
||||
}
|
||||
}
|
||||
}()
|
||||
return nil
|
||||
}
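The RPC layer above is a plain request/reply over the per-node Redis list rpc:&lt;nodeId&gt;: the caller LPUSHes an RpcMessage, the worker loop in InitRpcService BRPOPs it, dispatches on Method, and LPUSHes the reply back to the same list. A hedged master-side sketch using one of the client helpers; nodeId is assumed to be the hex ObjectId of a registered node, and the snippet assumes a context where services, constants, fmt and the apex log package are imported.

```go
// Hypothetical call from the master: fetch the Python packages installed on a
// worker node through the Redis-list RPC defined above.
deps, err := services.RpcClientGetInstalledDepList(nodeId, constants.Python)
if err != nil {
	log.Errorf("rpc get dep list error: %s", err.Error())
	return
}
for _, dep := range deps {
	fmt.Println(dep.Name)
}
```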
|
||||
@@ -4,8 +4,10 @@ import (
|
||||
"crawlab/constants"
|
||||
"crawlab/lib/cron"
|
||||
"crawlab/model"
|
||||
"errors"
|
||||
"github.com/apex/log"
|
||||
"github.com/satori/go.uuid"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
uuid "github.com/satori/go.uuid"
|
||||
"runtime/debug"
|
||||
)
|
||||
|
||||
@@ -15,48 +17,59 @@ type Scheduler struct {
|
||||
cron *cron.Cron
|
||||
}
|
||||
|
||||
func AddTask(s model.Schedule) func() {
|
||||
func AddScheduleTask(s model.Schedule) func() {
|
||||
return func() {
|
||||
node, err := model.GetNodeByKey(s.NodeKey)
|
||||
if err != nil || node.Id.Hex() == "" {
|
||||
log.Errorf("get node by key error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
spider := model.GetSpiderByName(s.SpiderName)
|
||||
if spider == nil || spider.Id.Hex() == "" {
|
||||
log.Errorf("get spider by name error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
// 同步ID到定时任务
|
||||
s.SyncNodeIdAndSpiderId(node, *spider)
|
||||
|
||||
// 生成任务ID
|
||||
id := uuid.NewV4()
|
||||
|
||||
// 生成任务模型
|
||||
t := model.Task{
|
||||
Id: id.String(),
|
||||
SpiderId: spider.Id,
|
||||
NodeId: node.Id,
|
||||
Status: constants.StatusPending,
|
||||
Param: s.Param,
|
||||
}
|
||||
if s.RunType == constants.RunTypeAllNodes {
|
||||
// 所有节点
|
||||
nodes, err := model.GetNodeList(nil)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
for _, node := range nodes {
|
||||
t := model.Task{
|
||||
Id: id.String(),
|
||||
SpiderId: s.SpiderId,
|
||||
NodeId: node.Id,
|
||||
Param: s.Param,
|
||||
UserId: s.UserId,
|
||||
}
|
||||
|
||||
// 将任务存入数据库
|
||||
if err := model.AddTask(t); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
if err := AddTask(t); err != nil {
|
||||
return
|
||||
}
|
||||
}
|
||||
} else if s.RunType == constants.RunTypeRandom {
|
||||
// 随机
|
||||
t := model.Task{
|
||||
Id: id.String(),
|
||||
SpiderId: s.SpiderId,
|
||||
Param: s.Param,
|
||||
UserId: s.UserId,
|
||||
}
|
||||
if err := AddTask(t); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
} else if s.RunType == constants.RunTypeSelectedNodes {
|
||||
// 指定节点
|
||||
for _, nodeId := range s.NodeIds {
|
||||
t := model.Task{
|
||||
Id: id.String(),
|
||||
SpiderId: s.SpiderId,
|
||||
NodeId: nodeId,
|
||||
Param: s.Param,
|
||||
UserId: s.UserId,
|
||||
}
|
||||
|
||||
// 加入任务队列
|
||||
if err := AssignTask(t); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
if err := AddTask(t); err != nil {
|
||||
return
|
||||
}
|
||||
}
|
||||
} else {
|
||||
return
|
||||
}
|
||||
}
|
||||
@@ -96,8 +109,8 @@ func (s *Scheduler) Start() error {
|
||||
func (s *Scheduler) AddJob(job model.Schedule) error {
|
||||
spec := job.Cron
|
||||
|
||||
// 添加任务
|
||||
eid, err := s.cron.AddFunc(spec, AddTask(job))
|
||||
// 添加定时任务
|
||||
eid, err := s.cron.AddFunc(spec, AddScheduleTask(job))
|
||||
if err != nil {
|
||||
log.Errorf("add func task error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
@@ -106,6 +119,12 @@ func (s *Scheduler) AddJob(job model.Schedule) error {
|
||||
|
||||
// 更新EntryID
|
||||
job.EntryId = eid
|
||||
|
||||
// 更新状态
|
||||
job.Status = constants.ScheduleStatusRunning
|
||||
job.Enabled = true
|
||||
|
||||
// 保存定时任务
|
||||
if err := job.Save(); err != nil {
|
||||
log.Errorf("job save error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
@@ -134,6 +153,41 @@ func ParserCron(spec string) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
// 禁用定时任务
|
||||
func (s *Scheduler) Disable(id bson.ObjectId) error {
|
||||
schedule, err := model.GetSchedule(id)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if schedule.EntryId == 0 {
|
||||
return errors.New("entry id not found")
|
||||
}
|
||||
|
||||
// 从cron服务中删除该任务
|
||||
s.cron.Remove(schedule.EntryId)
|
||||
|
||||
// 更新状态
|
||||
schedule.Status = constants.ScheduleStatusStop
|
||||
schedule.Enabled = false
|
||||
|
||||
if err = schedule.Save(); err != nil {
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// 启用定时任务
|
||||
func (s *Scheduler) Enable(id bson.ObjectId) error {
|
||||
schedule, err := model.GetSchedule(id)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if err := s.AddJob(schedule); err != nil {
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (s *Scheduler) Update() error {
|
||||
// 删除所有定时任务
|
||||
s.RemoveAll()
|
||||
@@ -146,11 +200,26 @@ func (s *Scheduler) Update() error {
|
||||
return err
|
||||
}
|
||||
|
||||
user, err := model.GetUserByUsername("admin")
|
||||
if err != nil {
|
||||
log.Errorf("get admin user error: %s", err.Error())
|
||||
return err
|
||||
}
|
||||
|
||||
// 遍历任务列表
|
||||
for i := 0; i < len(sList); i++ {
|
||||
// 单个任务
|
||||
job := sList[i]
|
||||
|
||||
if job.Status == constants.ScheduleStatusStop {
|
||||
continue
|
||||
}
|
||||
|
||||
// 兼容以前版本
|
||||
if job.UserId.Hex() == "" {
|
||||
job.UserId = user.Id
|
||||
}
|
||||
|
||||
// 添加到定时任务
|
||||
if err := s.AddJob(job); err != nil {
|
||||
log.Errorf("add job error: %s, job: %s, cron: %s", err.Error(), job.Name, job.Cron)
|
||||
|
||||
@@ -12,11 +12,14 @@ import (
|
||||
"github.com/apex/log"
|
||||
"github.com/globalsign/mgo"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"github.com/satori/go.uuid"
|
||||
"github.com/spf13/viper"
|
||||
"gopkg.in/yaml.v2"
|
||||
"io/ioutil"
|
||||
"os"
|
||||
"path"
|
||||
"path/filepath"
|
||||
"runtime/debug"
|
||||
"strings"
|
||||
)
|
||||
|
||||
type SpiderFileData struct {
|
||||
@@ -30,6 +33,59 @@ type SpiderUploadMessage struct {
|
||||
SpiderId string
|
||||
}
|
||||
|
||||
// 从主节点上传爬虫到GridFS
|
||||
func UploadSpiderToGridFsFromMaster(spider model.Spider) error {
|
||||
// 爬虫所在目录
|
||||
spiderDir := spider.Src
|
||||
|
||||
// 打包为 zip 文件
|
||||
files, err := utils.GetFilesFromDir(spiderDir)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
randomId := uuid.NewV4()
|
||||
tmpFilePath := filepath.Join(viper.GetString("other.tmppath"), spider.Name+"."+randomId.String()+".zip")
|
||||
spiderZipFileName := spider.Name + ".zip"
|
||||
if err := utils.Compress(files, tmpFilePath); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
// 获取 GridFS 实例
|
||||
s, gf := database.GetGridFs("files")
|
||||
defer s.Close()
|
||||
|
||||
// 判断文件是否已经存在
|
||||
var gfFile model.GridFs
|
||||
if err := gf.Find(bson.M{"filename": spiderZipFileName}).One(&gfFile); err == nil {
|
||||
// 已经存在文件,则删除
|
||||
_ = gf.RemoveId(gfFile.Id)
|
||||
}
|
||||
|
||||
// 上传到GridFs
|
||||
fid, err := UploadToGridFs(spiderZipFileName, tmpFilePath)
|
||||
if err != nil {
|
||||
log.Errorf("upload to grid fs error: %s", err.Error())
|
||||
return err
|
||||
}
|
||||
|
||||
// 保存爬虫 FileId
|
||||
spider.FileId = fid
|
||||
_ = spider.Save()
|
||||
|
||||
// 获取爬虫同步实例
|
||||
spiderSync := spider_handler.SpiderSync{
|
||||
Spider: spider,
|
||||
}
|
||||
|
||||
// 获取gfFile
|
||||
gfFile2 := model.GetGridFs(spider.FileId)
|
||||
|
||||
// 生成MD5
|
||||
spiderSync.CreateMd5File(gfFile2.Md5)
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// 上传zip文件到GridFS
|
||||
func UploadToGridFs(fileName string, filePath string) (fid bson.ObjectId, err error) {
|
||||
fid = ""
|
||||
@@ -59,6 +115,7 @@ func UploadToGridFs(fileName string, filePath string) (fid bson.ObjectId, err er
|
||||
}
|
||||
// 关闭文件,提交写入
|
||||
if err = f.Close(); err != nil {
|
||||
debug.PrintStack()
|
||||
return "", err
|
||||
}
|
||||
// 文件ID
|
||||
@@ -100,7 +157,7 @@ func ReadFileByStep(filePath string, handle func([]byte, *mgo.GridFile), fileCre
|
||||
// 发布所有爬虫
|
||||
func PublishAllSpiders() {
|
||||
// 获取爬虫列表
|
||||
spiders, _, _ := model.GetSpiderList(nil, 0, constants.Infinite)
|
||||
spiders, _, _ := model.GetSpiderList(nil, 0, constants.Infinite, "-_id")
|
||||
if len(spiders) == 0 {
|
||||
return
|
||||
}
|
||||
@@ -116,12 +173,23 @@ func PublishAllSpiders() {
|
||||
|
||||
// 发布爬虫
|
||||
func PublishSpider(spider model.Spider) {
|
||||
// 查询gf file,不存在则删除
|
||||
gfFile := model.GetGridFs(spider.FileId)
|
||||
if gfFile == nil {
|
||||
_ = model.RemoveSpider(spider.Id)
|
||||
var gfFile *model.GridFs
|
||||
if spider.FileId.Hex() != constants.ObjectIdNull {
|
||||
// 查询gf file,不存在则标记为爬虫文件不存在
|
||||
gfFile = model.GetGridFs(spider.FileId)
|
||||
if gfFile == nil {
|
||||
spider.FileId = constants.ObjectIdNull
|
||||
_ = spider.Save()
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
// 如果FileId为空,表示还没有上传爬虫到GridFS,则跳过
|
||||
if spider.FileId == bson.ObjectIdHex(constants.ObjectIdNull) {
|
||||
return
|
||||
}
|
||||
|
||||
// 获取爬虫同步实例
|
||||
spiderSync := spider_handler.SpiderSync{
|
||||
Spider: spider,
|
||||
}
|
||||
@@ -138,21 +206,14 @@ func PublishSpider(spider model.Spider) {
|
||||
md5 := filepath.Join(path, spider_handler.Md5File)
|
||||
if !utils.Exists(md5) {
|
||||
log.Infof("md5 file not found: %s", md5)
|
||||
spiderSync.RemoveSpiderFile()
|
||||
spiderSync.Download()
|
||||
spiderSync.CreateMd5File(gfFile.Md5)
|
||||
spiderSync.RemoveDownCreate(gfFile.Md5)
|
||||
return
|
||||
}
|
||||
// md5值不一样,则下载
|
||||
md5Str := utils.ReadFileOneLine(md5)
|
||||
// 去掉空格以及换行符
|
||||
md5Str = strings.Replace(md5Str, " ", "", -1)
|
||||
md5Str = strings.Replace(md5Str, "\n", "", -1)
|
||||
md5Str := utils.GetSpiderMd5Str(md5)
|
||||
if gfFile.Md5 != md5Str {
|
||||
log.Infof("md5 is different, gf-md5:%s, file-md5:%s", gfFile.Md5, md5Str)
|
||||
spiderSync.RemoveSpiderFile()
|
||||
spiderSync.Download()
|
||||
spiderSync.CreateMd5File(gfFile.Md5)
|
||||
spiderSync.RemoveDownCreate(gfFile.Md5)
|
||||
return
|
||||
}
|
||||
}
|
||||
@@ -206,5 +267,110 @@ func InitSpiderService() error {
|
||||
// 启动定时任务
|
||||
c.Start()
|
||||
|
||||
if model.IsMaster() {
|
||||
// 添加Demo爬虫
|
||||
templateSpidersDir := "./template/spiders"
|
||||
for _, info := range utils.ListDir(templateSpidersDir) {
|
||||
if !info.IsDir() {
|
||||
continue
|
||||
}
|
||||
spiderName := info.Name()
|
||||
|
||||
// 如果爬虫在数据库中不存在,则添加
|
||||
spider := model.GetSpiderByName(spiderName)
|
||||
if spider.Name != "" {
|
||||
// 存在同名爬虫,跳过
|
||||
continue
|
||||
}
|
||||
|
||||
// 拷贝爬虫
|
||||
templateSpiderPath := path.Join(templateSpidersDir, spiderName)
|
||||
spiderPath := path.Join(viper.GetString("spider.path"), spiderName)
|
||||
if utils.Exists(spiderPath) {
|
||||
utils.RemoveFiles(spiderPath)
|
||||
}
|
||||
if err := utils.CopyDir(templateSpiderPath, spiderPath); err != nil {
|
||||
log.Errorf("copy error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
|
||||
// 构造配置数据
|
||||
configData := entity.ConfigSpiderData{}
|
||||
|
||||
// 读取YAML文件
|
||||
yamlFile, err := ioutil.ReadFile(path.Join(spiderPath, "Spiderfile"))
|
||||
if err != nil {
|
||||
log.Errorf("read yaml error: " + err.Error())
|
||||
//debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
|
||||
// 反序列化
|
||||
if err := yaml.Unmarshal(yamlFile, &configData); err != nil {
|
||||
log.Errorf("unmarshal error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
|
||||
if configData.Type == constants.Customized {
|
||||
// 添加该爬虫到数据库
|
||||
spider = model.Spider{
|
||||
Id: bson.NewObjectId(),
|
||||
Name: spiderName,
|
||||
DisplayName: configData.DisplayName,
|
||||
Type: constants.Customized,
|
||||
Col: configData.Col,
|
||||
Src: spiderPath,
|
||||
Remark: configData.Remark,
|
||||
ProjectId: bson.ObjectIdHex(constants.ObjectIdNull),
|
||||
FileId: bson.ObjectIdHex(constants.ObjectIdNull),
|
||||
Cmd: configData.Cmd,
|
||||
}
|
||||
if err := spider.Add(); err != nil {
|
||||
log.Errorf("add spider error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
|
||||
// 上传爬虫到GridFS
|
||||
if err := UploadSpiderToGridFsFromMaster(spider); err != nil {
|
||||
log.Errorf("upload spider error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
} else if configData.Type == constants.Configurable || configData.Type == "config" {
|
||||
// 添加该爬虫到数据库
|
||||
spider = model.Spider{
|
||||
Id: bson.NewObjectId(),
|
||||
Name: configData.Name,
|
||||
DisplayName: configData.DisplayName,
|
||||
Type: constants.Configurable,
|
||||
Col: configData.Col,
|
||||
Src: spiderPath,
|
||||
Remark: configData.Remark,
|
||||
ProjectId: bson.ObjectIdHex(constants.ObjectIdNull),
|
||||
FileId: bson.ObjectIdHex(constants.ObjectIdNull),
|
||||
Config: configData,
|
||||
}
|
||||
if err := spider.Add(); err != nil {
|
||||
log.Errorf("add spider error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
|
||||
// 根据序列化后的数据处理爬虫文件
|
||||
if err := ProcessSpiderFilesFromConfigData(spider, configData); err != nil {
|
||||
log.Errorf("add spider error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
continue
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 发布所有爬虫
|
||||
PublishAllSpiders()
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
@@ -4,12 +4,14 @@ import (
|
||||
"crawlab/database"
|
||||
"crawlab/model"
|
||||
"crawlab/utils"
|
||||
"fmt"
|
||||
"github.com/apex/log"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"github.com/satori/go.uuid"
|
||||
"github.com/spf13/viper"
|
||||
"io"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"runtime/debug"
|
||||
)
|
||||
@@ -24,7 +26,7 @@ type SpiderSync struct {
|
||||
|
||||
func (s *SpiderSync) CreateMd5File(md5 string) {
|
||||
path := filepath.Join(viper.GetString("spider.path"), s.Spider.Name)
|
||||
utils.CreateFilePath(path)
|
||||
utils.CreateDirPath(path)
|
||||
|
||||
fileName := filepath.Join(path, Md5File)
|
||||
file := utils.OpenFile(fileName)
|
||||
@@ -37,6 +39,12 @@ func (s *SpiderSync) CreateMd5File(md5 string) {
|
||||
}
|
||||
}
|
||||
|
||||
func (s *SpiderSync) RemoveDownCreate(md5 string) {
|
||||
s.RemoveSpiderFile()
|
||||
s.Download()
|
||||
s.CreateMd5File(md5)
|
||||
}
|
||||
|
||||
// 获得下载锁的key
|
||||
func (s *SpiderSync) GetLockDownloadKey(spiderId string) string {
|
||||
node, _ := model.GetCurrentNode()
|
||||
@@ -59,10 +67,14 @@ func (s *SpiderSync) RemoveSpiderFile() {
|
||||
// 检测是否已经下载中
|
||||
func (s *SpiderSync) CheckDownLoading(spiderId string, fileId string) (bool, string) {
|
||||
key := s.GetLockDownloadKey(spiderId)
|
||||
if _, err := database.RedisClient.HGet("spider", key); err == nil {
|
||||
return true, key
|
||||
key2, err := database.RedisClient.HGet("spider", key)
|
||||
if err != nil {
|
||||
return false, key2
|
||||
}
|
||||
return false, key
|
||||
if key2 == "" {
|
||||
return false, key2
|
||||
}
|
||||
return true, key2
|
||||
}
|
||||
|
||||
// 下载爬虫
|
||||
@@ -71,6 +83,7 @@ func (s *SpiderSync) Download() {
|
||||
fileId := s.Spider.FileId.Hex()
|
||||
isDownloading, key := s.CheckDownLoading(spiderId, fileId)
|
||||
if isDownloading {
|
||||
log.Infof(fmt.Sprintf("spider is already being downloaded, spider id: %s", s.Spider.Id.Hex()))
|
||||
return
|
||||
} else {
|
||||
_ = database.RedisClient.HSet("spider", key, key)
|
||||
@@ -99,7 +112,6 @@ func (s *SpiderSync) Download() {
|
||||
// 创建临时文件
|
||||
tmpFilePath := filepath.Join(tmpPath, randomId.String()+".zip")
|
||||
tmpFile := utils.OpenFile(tmpFilePath)
|
||||
defer utils.Close(tmpFile)
|
||||
|
||||
// 将该文件写入临时文件
|
||||
if _, err := io.Copy(tmpFile, f); err != nil {
|
||||
@@ -119,6 +131,15 @@ func (s *SpiderSync) Download() {
|
||||
return
|
||||
}
|
||||
|
||||
//递归修改目标文件夹权限
|
||||
// 解决scrapy.setting中开启LOG_ENABLED 和 LOG_FILE时不能创建log文件的问题
|
||||
cmd := exec.Command("chmod", "-R", "777", dstPath)
|
||||
if err := cmd.Run(); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
// 关闭临时文件
|
||||
if err := tmpFile.Close(); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
|
||||
@@ -4,28 +4,42 @@ import (
|
||||
"crawlab/constants"
|
||||
"crawlab/database"
|
||||
"crawlab/entity"
|
||||
"crawlab/lib/cron"
|
||||
"crawlab/model"
|
||||
"crawlab/utils"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"github.com/apex/log"
|
||||
"github.com/imroc/req"
|
||||
"os/exec"
|
||||
"path"
|
||||
"regexp"
|
||||
"runtime/debug"
|
||||
"sort"
|
||||
"strings"
|
||||
"sync"
|
||||
)
|
||||
|
||||
// 系统信息 chan 映射
|
||||
var SystemInfoChanMap = utils.NewChanMap()
|
||||
|
||||
- func GetRemoteSystemInfo(id string) (sysInfo entity.SystemInfo, err error) {
+ // 从远端获取系统信息
+ func GetRemoteSystemInfo(nodeId string) (sysInfo entity.SystemInfo, err error) {
|
||||
// 发送消息
|
||||
msg := entity.NodeMessage{
|
||||
Type: constants.MsgTypeGetSystemInfo,
|
||||
- NodeId: id,
+ NodeId: nodeId,
|
||||
}
|
||||
|
||||
// 序列化
|
||||
msgBytes, _ := json.Marshal(&msg)
|
||||
- if _, err := database.RedisClient.Publish("nodes:"+id, utils.BytesToString(msgBytes)); err != nil {
+ if _, err := database.RedisClient.Publish("nodes:"+nodeId, utils.BytesToString(msgBytes)); err != nil {
|
||||
return entity.SystemInfo{}, err
|
||||
}
|
||||
|
||||
// 通道
|
||||
- ch := SystemInfoChanMap.ChanBlocked(id)
+ ch := SystemInfoChanMap.ChanBlocked(nodeId)
|
||||
|
||||
// 等待响应,阻塞
|
||||
sysInfoStr := <-ch
|
||||
@@ -38,11 +52,534 @@ func GetRemoteSystemInfo(id string) (sysInfo entity.SystemInfo, err error) {
|
||||
return sysInfo, nil
|
||||
}
|
||||
|
||||
- func GetSystemInfo(id string) (sysInfo entity.SystemInfo, err error) {
- if IsMasterNode(id) {
+ // 获取系统信息
+ func GetSystemInfo(nodeId string) (sysInfo entity.SystemInfo, err error) {
+ if IsMasterNode(nodeId) {
|
||||
sysInfo, err = model.GetLocalSystemInfo()
|
||||
} else {
|
||||
- sysInfo, err = GetRemoteSystemInfo(id)
+ sysInfo, err = GetRemoteSystemInfo(nodeId)
|
||||
}
|
||||
return
|
||||
}
|
||||
|
||||
// 获取语言列表
|
||||
func GetLangList(nodeId string) []entity.Lang {
|
||||
list := []entity.Lang{
|
||||
{Name: "Python", ExecutableName: "python", ExecutablePath: "/usr/local/bin/python", DepExecutablePath: "/usr/local/bin/pip"},
|
||||
{Name: "Node.js", ExecutableName: "node", ExecutablePath: "/usr/local/bin/node", DepExecutablePath: "/usr/local/bin/npm"},
|
||||
//{Name: "Java", ExecutableName: "java", ExecutablePath: "/usr/local/bin/java"},
|
||||
}
|
||||
for i, lang := range list {
|
||||
list[i].Installed = IsInstalledLang(nodeId, lang)
|
||||
}
|
||||
return list
|
||||
}
|
||||
|
||||
// 根据语言名获取语言实例
|
||||
func GetLangFromLangName(nodeId string, name string) entity.Lang {
|
||||
langList := GetLangList(nodeId)
|
||||
for _, lang := range langList {
|
||||
if lang.ExecutableName == name {
|
||||
return lang
|
||||
}
|
||||
}
|
||||
return entity.Lang{}
|
||||
}
|
||||
|
||||
// 是否已安装该依赖
|
||||
func IsInstalledLang(nodeId string, lang entity.Lang) bool {
|
||||
sysInfo, err := GetSystemInfo(nodeId)
|
||||
if err != nil {
|
||||
return false
|
||||
}
|
||||
for _, exec := range sysInfo.Executables {
|
||||
if exec.Path == lang.ExecutablePath {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
// 是否已安装该依赖
|
||||
func IsInstalledDep(installedDepList []entity.Dependency, dep entity.Dependency) bool {
|
||||
for _, _dep := range installedDepList {
|
||||
if strings.ToLower(_dep.Name) == strings.ToLower(dep.Name) {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
// 初始化函数
|
||||
func InitDepsFetcher() error {
|
||||
c := cron.New(cron.WithSeconds())
|
||||
c.Start()
|
||||
if _, err := c.AddFunc("0 */5 * * * *", UpdatePythonDepList); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
go func() {
|
||||
UpdatePythonDepList()
|
||||
}()
|
||||
return nil
|
||||
}
|
||||
|
||||
// =========
|
||||
// Python
|
||||
// =========
|
||||
|
||||
type PythonDepJsonData struct {
|
||||
Info PythonDepJsonDataInfo `json:"info"`
|
||||
}
|
||||
|
||||
type PythonDepJsonDataInfo struct {
|
||||
Name string `json:"name"`
|
||||
Summary string `json:"summary"`
|
||||
Version string `json:"version"`
|
||||
}
|
||||
|
||||
type PythonDepNameDict struct {
|
||||
Name string `json:"name"`
|
||||
Weight int `json:"weight"`
|
||||
}
|
||||
|
||||
type PythonDepNameDictSlice []PythonDepNameDict
|
||||
|
||||
func (s PythonDepNameDictSlice) Len() int { return len(s) }
|
||||
func (s PythonDepNameDictSlice) Swap(i, j int) { s[i], s[j] = s[j], s[i] }
|
||||
func (s PythonDepNameDictSlice) Less(i, j int) bool { return s[i].Weight > s[j].Weight }
|
||||
|
||||
// 获取Python本地依赖列表
|
||||
func GetPythonDepList(nodeId string, searchDepName string) ([]entity.Dependency, error) {
|
||||
var list []entity.Dependency
|
||||
|
||||
// 先从 Redis 获取
|
||||
depList, err := GetPythonDepListFromRedis()
|
||||
if err != nil {
|
||||
return list, err
|
||||
}
|
||||
|
||||
// 过滤相似的依赖
|
||||
var depNameList PythonDepNameDictSlice
|
||||
for _, depName := range depList {
|
||||
if strings.HasPrefix(strings.ToLower(depName), strings.ToLower(searchDepName)) {
|
||||
var weight int
|
||||
if strings.ToLower(depName) == strings.ToLower(searchDepName) {
|
||||
weight = 3
|
||||
} else if strings.HasPrefix(strings.ToLower(depName), strings.ToLower(searchDepName)) {
|
||||
weight = 2
|
||||
} else {
|
||||
weight = 1
|
||||
}
|
||||
depNameList = append(depNameList, PythonDepNameDict{
|
||||
Name: depName,
|
||||
Weight: weight,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// 获取已安装依赖列表
|
||||
var installedDepList []entity.Dependency
|
||||
if IsMasterNode(nodeId) {
|
||||
installedDepList, err = GetPythonLocalInstalledDepList(nodeId)
|
||||
if err != nil {
|
||||
return list, err
|
||||
}
|
||||
} else {
|
||||
installedDepList, err = GetPythonRemoteInstalledDepList(nodeId)
|
||||
if err != nil {
|
||||
return list, err
|
||||
}
|
||||
}
|
||||
|
||||
// 根据依赖名排序
|
||||
sort.Stable(depNameList)
|
||||
|
||||
// 遍历依赖名列表,取前20个
|
||||
for i, depNameDict := range depNameList {
|
||||
if i > 20 {
|
||||
break
|
||||
}
|
||||
dep := entity.Dependency{
|
||||
Name: depNameDict.Name,
|
||||
}
|
||||
dep.Installed = IsInstalledDep(installedDepList, dep)
|
||||
list = append(list, dep)
|
||||
}
|
||||
|
||||
// 从依赖源获取信息
|
||||
//list, err = GetPythonDepListWithInfo(list)
|
||||
|
||||
return list, nil
|
||||
}
|
||||
|
||||
// 获取Python依赖的源数据信息
|
||||
func GetPythonDepListWithInfo(depList []entity.Dependency) ([]entity.Dependency, error) {
|
||||
var goSync sync.WaitGroup
|
||||
for i, dep := range depList {
|
||||
if i > 10 {
|
||||
break
|
||||
}
|
||||
goSync.Add(1)
|
||||
go func(i int, dep entity.Dependency, depList []entity.Dependency, n *sync.WaitGroup) {
|
||||
url := fmt.Sprintf("https://pypi.org/pypi/%s/json", dep.Name)
|
||||
res, err := req.Get(url)
|
||||
if err != nil {
|
||||
n.Done()
|
||||
return
|
||||
}
|
||||
var data PythonDepJsonData
|
||||
if err := res.ToJSON(&data); err != nil {
|
||||
n.Done()
|
||||
return
|
||||
}
|
||||
depList[i].Version = data.Info.Version
|
||||
depList[i].Description = data.Info.Summary
|
||||
n.Done()
|
||||
}(i, dep, depList, &goSync)
|
||||
}
|
||||
goSync.Wait()
|
||||
return depList, nil
|
||||
}
|
||||
|
||||
func FetchPythonDepInfo(depName string) (entity.Dependency, error) {
|
||||
url := fmt.Sprintf("https://pypi.org/pypi/%s/json", depName)
|
||||
res, err := req.Get(url)
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return entity.Dependency{}, err
|
||||
}
|
||||
var data PythonDepJsonData
|
||||
if res.Response().StatusCode == 404 {
|
||||
return entity.Dependency{}, errors.New("get depName from [https://pypi.org] error: 404")
|
||||
}
|
||||
if err := res.ToJSON(&data); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return entity.Dependency{}, err
|
||||
}
|
||||
dep := entity.Dependency{
|
||||
Name: depName,
|
||||
Version: data.Info.Version,
|
||||
Description: data.Info.Summary,
|
||||
}
|
||||
return dep, nil
|
||||
}
|
||||
|
||||
// 从Redis获取Python依赖列表
|
||||
func GetPythonDepListFromRedis() ([]string, error) {
|
||||
var list []string
|
||||
|
||||
// 从 Redis 获取字符串
|
||||
rawData, err := database.RedisClient.HGet("system", "deps:python")
|
||||
if err != nil {
|
||||
return list, err
|
||||
}
|
||||
|
||||
// 反序列化
|
||||
if err := json.Unmarshal([]byte(rawData), &list); err != nil {
|
||||
return list, err
|
||||
}
|
||||
|
||||
// 如果为空,则从依赖源获取列表
|
||||
if len(list) == 0 {
|
||||
UpdatePythonDepList()
|
||||
}
|
||||
|
||||
return list, nil
|
||||
}
|
||||
|
||||
// 从Python依赖源获取依赖列表并返回
|
||||
func FetchPythonDepList() ([]string, error) {
|
||||
// 依赖URL
|
||||
url := "https://pypi.tuna.tsinghua.edu.cn/simple"
|
||||
|
||||
// 输出列表
|
||||
var list []string
|
||||
|
||||
// 请求URL
|
||||
res, err := req.Get(url)
|
||||
if err != nil {
|
||||
log.Error(err.Error())
|
||||
debug.PrintStack()
|
||||
return list, err
|
||||
}
|
||||
|
||||
// 获取响应数据
|
||||
text, err := res.ToString()
|
||||
if err != nil {
|
||||
log.Error(err.Error())
|
||||
debug.PrintStack()
|
||||
return list, err
|
||||
}
|
||||
|
||||
// 从响应数据中提取依赖名
|
||||
regex := regexp.MustCompile("<a href=\".*/\">(.*)</a>")
|
||||
for _, line := range strings.Split(text, "\n") {
|
||||
arr := regex.FindStringSubmatch(line)
|
||||
if len(arr) < 2 {
|
||||
continue
|
||||
}
|
||||
list = append(list, arr[1])
|
||||
}
|
||||
|
||||
// 赋值给列表
|
||||
return list, nil
|
||||
}
|
||||
|
||||
// 更新Python依赖列表到Redis
|
||||
func UpdatePythonDepList() {
|
||||
// 从依赖源获取列表
|
||||
list, _ := FetchPythonDepList()
|
||||
|
||||
// 序列化
|
||||
listBytes, err := json.Marshal(list)
|
||||
if err != nil {
|
||||
log.Error(err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
|
||||
// 设置Redis
|
||||
if err := database.RedisClient.HSet("system", "deps:python", string(listBytes)); err != nil {
|
||||
log.Error(err.Error())
|
||||
debug.PrintStack()
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
// 获取Python本地已安装的依赖列表
|
||||
func GetPythonLocalInstalledDepList(nodeId string) ([]entity.Dependency, error) {
|
||||
var list []entity.Dependency
|
||||
|
||||
lang := GetLangFromLangName(nodeId, constants.Python)
|
||||
if !IsInstalledLang(nodeId, lang) {
|
||||
return list, errors.New("python is not installed")
|
||||
}
|
||||
cmd := exec.Command("pip", "freeze")
|
||||
outputBytes, err := cmd.Output()
|
||||
if err != nil {
|
||||
debug.PrintStack()
|
||||
return list, err
|
||||
}
|
||||
|
||||
for _, line := range strings.Split(string(outputBytes), "\n") {
|
||||
arr := strings.Split(line, "==")
|
||||
if len(arr) < 2 {
|
||||
continue
|
||||
}
|
||||
dep := entity.Dependency{
|
||||
Name: strings.ToLower(arr[0]),
|
||||
Version: arr[1],
|
||||
Installed: true,
|
||||
}
|
||||
list = append(list, dep)
|
||||
}
|
||||
|
||||
return list, nil
|
||||
}
|
||||
|
||||
// 获取Python远端依赖列表
|
||||
func GetPythonRemoteInstalledDepList(nodeId string) ([]entity.Dependency, error) {
|
||||
depList, err := RpcClientGetInstalledDepList(nodeId, constants.Python)
|
||||
if err != nil {
|
||||
return depList, err
|
||||
}
|
||||
return depList, nil
|
||||
}
|
||||
|
||||
// 安装Python本地依赖
|
||||
func InstallPythonLocalDep(depName string) (string, error) {
|
||||
// 依赖镜像URL
|
||||
url := "https://pypi.tuna.tsinghua.edu.cn/simple"
|
||||
|
||||
cmd := exec.Command("pip", "install", depName, "-i", url)
|
||||
outputBytes, err := cmd.Output()
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return fmt.Sprintf("error: %s", err.Error()), err
|
||||
}
|
||||
return string(outputBytes), nil
|
||||
}
|
||||
|
||||
// 获取Python远端依赖列表
|
||||
func InstallPythonRemoteDep(nodeId string, depName string) (string, error) {
|
||||
output, err := RpcClientInstallDep(nodeId, constants.Python, depName)
|
||||
if err != nil {
|
||||
return output, err
|
||||
}
|
||||
return output, nil
|
||||
}
|
||||
|
||||
// 安装Python本地依赖
|
||||
func UninstallPythonLocalDep(depName string) (string, error) {
|
||||
cmd := exec.Command("pip", "uninstall", "-y", depName)
|
||||
outputBytes, err := cmd.Output()
|
||||
if err != nil {
|
||||
log.Errorf(string(outputBytes))
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return fmt.Sprintf("error: %s", err.Error()), err
|
||||
}
|
||||
return string(outputBytes), nil
|
||||
}
|
||||
|
||||
// 获取Python远端依赖列表
|
||||
func UninstallPythonRemoteDep(nodeId string, depName string) (string, error) {
|
||||
output, err := RpcClientUninstallDep(nodeId, constants.Python, depName)
|
||||
if err != nil {
|
||||
return output, err
|
||||
}
|
||||
return output, nil
|
||||
}
|
||||
|
||||
// ==============
|
||||
// Node.js
|
||||
// ==============
|
||||
|
||||
func InstallNodejsLocalLang() (string, error) {
|
||||
cmd := exec.Command("/bin/sh", path.Join("scripts", "install-nodejs.sh"))
|
||||
output, err := cmd.Output()
|
||||
if err != nil {
|
||||
log.Error(err.Error())
|
||||
debug.PrintStack()
|
||||
return string(output), err
|
||||
}
|
||||
|
||||
// TODO: check if Node.js is installed successfully
|
||||
|
||||
return string(output), nil
|
||||
}
|
||||
|
||||
// 获取Node.js远端依赖列表
|
||||
func InstallNodejsRemoteLang(nodeId string) (string, error) {
|
||||
output, err := RpcClientInstallLang(nodeId, constants.Nodejs)
|
||||
if err != nil {
|
||||
return output, err
|
||||
}
|
||||
return output, nil
|
||||
}
|
||||
|
||||
// 获取Nodejs本地已安装的依赖列表
|
||||
func GetNodejsLocalInstalledDepList(nodeId string) ([]entity.Dependency, error) {
|
||||
var list []entity.Dependency
|
||||
|
||||
lang := GetLangFromLangName(nodeId, constants.Nodejs)
|
||||
if !IsInstalledLang(nodeId, lang) {
|
||||
return list, errors.New("nodejs is not installed")
|
||||
}
|
||||
cmd := exec.Command("npm", "ls", "-g", "--depth", "0")
|
||||
outputBytes, _ := cmd.Output()
|
||||
//if err != nil {
|
||||
// log.Error("error: " + string(outputBytes))
|
||||
// debug.PrintStack()
|
||||
// return list, err
|
||||
//}
|
||||
|
||||
regex := regexp.MustCompile("\\s(.*)@(.*)")
|
||||
for _, line := range strings.Split(string(outputBytes), "\n") {
|
||||
arr := regex.FindStringSubmatch(line)
|
||||
if len(arr) < 3 {
|
||||
continue
|
||||
}
|
||||
dep := entity.Dependency{
|
||||
Name: strings.ToLower(arr[1]),
|
||||
Version: arr[2],
|
||||
Installed: true,
|
||||
}
|
||||
list = append(list, dep)
|
||||
}
|
||||
|
||||
return list, nil
|
||||
}
|
||||
|
||||
// 获取Nodejs远端依赖列表
|
||||
func GetNodejsRemoteInstalledDepList(nodeId string) ([]entity.Dependency, error) {
|
||||
depList, err := RpcClientGetInstalledDepList(nodeId, constants.Nodejs)
|
||||
if err != nil {
|
||||
return depList, err
|
||||
}
|
||||
return depList, nil
|
||||
}
|
||||
|
||||
// 安装Nodejs本地依赖
|
||||
func InstallNodejsLocalDep(depName string) (string, error) {
|
||||
// 依赖镜像URL
|
||||
url := "https://registry.npm.taobao.org"
|
||||
|
||||
cmd := exec.Command("npm", "install", depName, "-g", "--registry", url)
|
||||
outputBytes, err := cmd.Output()
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return fmt.Sprintf("error: %s", err.Error()), err
|
||||
}
|
||||
return string(outputBytes), nil
|
||||
}
|
||||
|
||||
// 获取Nodejs远端依赖列表
|
||||
func InstallNodejsRemoteDep(nodeId string, depName string) (string, error) {
|
||||
output, err := RpcClientInstallDep(nodeId, constants.Nodejs, depName)
|
||||
if err != nil {
|
||||
return output, err
|
||||
}
|
||||
return output, nil
|
||||
}
|
||||
|
||||
// 安装Nodejs本地依赖
|
||||
func UninstallNodejsLocalDep(depName string) (string, error) {
|
||||
cmd := exec.Command("npm", "uninstall", depName, "-g")
|
||||
outputBytes, err := cmd.Output()
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return fmt.Sprintf("error: %s", err.Error()), err
|
||||
}
|
||||
return string(outputBytes), nil
|
||||
}
|
||||
|
||||
// 获取Nodejs远端依赖列表
|
||||
func UninstallNodejsRemoteDep(nodeId string, depName string) (string, error) {
|
||||
output, err := RpcClientUninstallDep(nodeId, constants.Nodejs, depName)
|
||||
if err != nil {
|
||||
return output, err
|
||||
}
|
||||
return output, nil
|
||||
}
|
||||
|
||||
// 获取Nodejs本地依赖列表
|
||||
func GetNodejsDepList(nodeId string, searchDepName string) (depList []entity.Dependency, err error) {
|
||||
// 执行shell命令
|
||||
cmd := exec.Command("npm", "search", "--json", searchDepName)
|
||||
outputBytes, _ := cmd.Output()
|
||||
|
||||
// 获取已安装依赖列表
|
||||
var installedDepList []entity.Dependency
|
||||
if IsMasterNode(nodeId) {
|
||||
installedDepList, err = GetNodejsLocalInstalledDepList(nodeId)
|
||||
if err != nil {
|
||||
return depList, err
|
||||
}
|
||||
} else {
|
||||
installedDepList, err = GetNodejsRemoteInstalledDepList(nodeId)
|
||||
if err != nil {
|
||||
return depList, err
|
||||
}
|
||||
}
|
||||
|
||||
// 反序列化
|
||||
if err := json.Unmarshal(outputBytes, &depList); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return depList, err
|
||||
}
|
||||
|
||||
// 遍历安装列表
|
||||
for i, dep := range depList {
|
||||
depList[i].Installed = IsInstalledDep(installedDepList, dep)
|
||||
}
|
||||
|
||||
return depList, nil
|
||||
}
|
||||
|
||||
@@ -6,10 +6,15 @@ import (
|
||||
"crawlab/entity"
|
||||
"crawlab/lib/cron"
|
||||
"crawlab/model"
|
||||
"crawlab/services/notification"
|
||||
"crawlab/services/spider_handler"
|
||||
"crawlab/utils"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"github.com/apex/log"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
uuid "github.com/satori/go.uuid"
|
||||
"github.com/spf13/viper"
|
||||
"os"
|
||||
"os/exec"
|
||||
@@ -17,6 +22,7 @@ import (
|
||||
"runtime"
|
||||
"runtime/debug"
|
||||
"strconv"
|
||||
"strings"
|
||||
"sync"
|
||||
"syscall"
|
||||
"time"
|
||||
@@ -102,9 +108,34 @@ func AssignTask(task model.Task) error {
|
||||
|
||||
// 设置环境变量
|
||||
func SetEnv(cmd *exec.Cmd, envs []model.Env, taskId string, dataCol string) *exec.Cmd {
|
||||
// 默认把Node.js的全局node_modules加入环境变量
|
||||
envPath := os.Getenv("PATH")
|
||||
for _, _path := range strings.Split(envPath, ":") {
|
||||
if strings.Contains(_path, "/.nvm/versions/node/") {
|
||||
pathNodeModules := strings.Replace(_path, "/bin", "/lib/node_modules", -1)
|
||||
_ = os.Setenv("PATH", pathNodeModules+":"+envPath)
|
||||
_ = os.Setenv("NODE_PATH", pathNodeModules)
|
||||
break
|
||||
}
|
||||
}
|
||||
|
||||
// 默认环境变量
|
||||
cmd.Env = append(os.Environ(), "CRAWLAB_TASK_ID="+taskId)
|
||||
cmd.Env = append(cmd.Env, "CRAWLAB_COLLECTION="+dataCol)
|
||||
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_HOST="+viper.GetString("mongo.host"))
|
||||
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_PORT="+viper.GetString("mongo.port"))
|
||||
if viper.GetString("mongo.db") != "" {
|
||||
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_DB="+viper.GetString("mongo.db"))
|
||||
}
|
||||
if viper.GetString("mongo.username") != "" {
|
||||
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_USERNAME="+viper.GetString("mongo.username"))
|
||||
}
|
||||
if viper.GetString("mongo.password") != "" {
|
||||
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_PASSWORD="+viper.GetString("mongo.password"))
|
||||
}
|
||||
if viper.GetString("mongo.authSource") != "" {
|
||||
cmd.Env = append(cmd.Env, "CRAWLAB_MONGO_AUTHSOURCE="+viper.GetString("mongo.authSource"))
|
||||
}
|
||||
cmd.Env = append(cmd.Env, "PYTHONUNBUFFERED=0")
|
||||
cmd.Env = append(cmd.Env, "PYTHONIOENCODING=utf-8")
|
||||
cmd.Env = append(cmd.Env, "TZ=Asia/Shanghai")
|
||||
@@ -114,7 +145,11 @@ func SetEnv(cmd *exec.Cmd, envs []model.Env, taskId string, dataCol string) *exe
|
||||
cmd.Env = append(cmd.Env, env.Name+"="+env.Value)
|
||||
}
|
||||
|
||||
- // TODO 全局环境变量
+ // 全局环境变量
|
||||
variables := model.GetVariableList()
|
||||
for _, variable := range variables {
|
||||
cmd.Env = append(cmd.Env, variable.Key+"="+variable.Value)
|
||||
}
|
||||
return cmd
|
||||
}
|
||||
|
||||
@@ -136,8 +171,15 @@ func FinishOrCancelTask(ch chan string, cmd *exec.Cmd, t model.Task) {
|
||||
log.Infof("process received signal: %s", signal)
|
||||
|
||||
if signal == constants.TaskCancel && cmd.Process != nil {
|
||||
+ var err error
+ // 兼容windows
+ if runtime.GOOS == constants.Windows {
+ err = cmd.Process.Kill()
+ } else {
+ err = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
+ }
// 取消进程
- if err := syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL); err != nil {
+ if err != nil {
|
||||
log.Errorf("process kill error: %s", err.Error())
|
||||
debug.PrintStack()
|
||||
|
||||
@@ -217,7 +259,22 @@ func ExecuteShellCmd(cmdStr string, cwd string, t model.Task, s model.Spider) (e
|
||||
}
|
||||
|
||||
// 环境变量配置
|
||||
- cmd = SetEnv(cmd, s.Envs, t.Id, s.Col)
+ envs := s.Envs
|
||||
if s.Type == constants.Configurable {
|
||||
// 数据库配置
|
||||
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_HOST", Value: viper.GetString("mongo.host")})
|
||||
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_PORT", Value: viper.GetString("mongo.port")})
|
||||
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_DB", Value: viper.GetString("mongo.db")})
|
||||
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_USERNAME", Value: viper.GetString("mongo.username")})
|
||||
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_PASSWORD", Value: viper.GetString("mongo.password")})
|
||||
envs = append(envs, model.Env{Name: "CRAWLAB_MONGO_AUTHSOURCE", Value: viper.GetString("mongo.authSource")})
|
||||
|
||||
// 设置配置
|
||||
for envName, envValue := range s.Config.Settings {
|
||||
envs = append(envs, model.Env{Name: "CRAWLAB_SETTING_" + envName, Value: envValue})
|
||||
}
|
||||
}
|
||||
cmd = SetEnv(cmd, envs, t.Id, s.Col)
|
||||
|
||||
// 起一个goroutine来监控进程
|
||||
ch := utils.TaskExecChanMap.ChanBlocked(t.Id)
|
||||
@@ -225,7 +282,9 @@ func ExecuteShellCmd(cmdStr string, cwd string, t model.Task, s model.Spider) (e
|
||||
go FinishOrCancelTask(ch, cmd, t)
|
||||
|
||||
// kill的时候,可以kill所有的子进程
|
||||
- cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
+ if runtime.GOOS != constants.Windows {
+ cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
+ }
|
||||
|
||||
// 启动进程
|
||||
if err := StartTaskProcess(cmd, t); err != nil {
|
||||
@@ -293,9 +352,12 @@ func SaveTaskResultCount(id string) func() {
|
||||
|
||||
// 执行任务
|
||||
func ExecuteTask(id int) {
|
||||
- if flag, _ := LockList.Load(id); flag.(bool) {
- log.Debugf(GetWorkerPrefix(id) + "正在执行任务...")
- return
+ if flag, ok := LockList.Load(id); ok {
+ if flag.(bool) {
+ log.Debugf(GetWorkerPrefix(id) + "正在执行任务...")
+ return
+ }
|
||||
|
||||
}
|
||||
|
||||
// 上锁
|
||||
@@ -369,7 +431,14 @@ func ExecuteTask(id int) {
|
||||
)
|
||||
|
||||
// 执行命令
|
||||
- cmd := spider.Cmd
+ var cmd string
+ if spider.Type == constants.Configurable {
+ // 可配置爬虫命令
+ cmd = "scrapy crawl config_spider"
+ } else {
+ // 自定义爬虫命令
+ cmd = spider.Cmd
+ }
|
||||
|
||||
// 加入参数
|
||||
if t.Param != "" {
|
||||
@@ -382,15 +451,17 @@ func ExecuteTask(id int) {
|
||||
t.Status = constants.StatusRunning // 任务状态
|
||||
t.WaitDuration = t.StartTs.Sub(t.CreateTs).Seconds() // 等待时长
|
||||
|
||||
// 文件检查
|
||||
if err := SpiderFileCheck(t, spider); err != nil {
|
||||
log.Errorf("spider file check error: %s", err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
// 开始执行任务
|
||||
log.Infof(GetWorkerPrefix(id) + "开始执行任务(ID:" + t.Id + ")")
|
||||
|
||||
// 储存任务
|
||||
- if err := t.Save(); err != nil {
- log.Errorf(err.Error())
- HandleTaskError(t, err)
- return
- }
+ _ = t.Save()
|
||||
|
||||
// 起一个cron执行器来统计任务结果数
|
||||
if spider.Col != "" {
|
||||
@@ -404,9 +475,22 @@ func ExecuteTask(id int) {
|
||||
defer cronExec.Stop()
|
||||
}
|
||||
|
||||
// 获得触发任务用户
|
||||
user, err := model.GetUser(t.UserId)
|
||||
if err != nil {
|
||||
log.Errorf(GetWorkerPrefix(id) + err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
// 执行Shell命令
|
||||
if err := ExecuteShellCmd(cmd, cwd, t, spider); err != nil {
|
||||
log.Errorf(GetWorkerPrefix(id) + err.Error())
|
||||
|
||||
// 如果发生错误,则发送通知
|
||||
t, _ = model.GetTask(t.Id)
|
||||
if user.Setting.NotificationTrigger == constants.NotificationTriggerOnTaskEnd || user.Setting.NotificationTrigger == constants.NotificationTriggerOnTaskError {
|
||||
SendNotifications(user, t, spider)
|
||||
}
|
||||
return
|
||||
}
|
||||
|
||||
@@ -429,6 +513,11 @@ func ExecuteTask(id int) {
|
||||
t.RuntimeDuration = t.FinishTs.Sub(t.StartTs).Seconds() // 运行时长
|
||||
t.TotalDuration = t.FinishTs.Sub(t.CreateTs).Seconds() // 总时长
|
||||
|
||||
// 如果是任务结束时发送通知,则发送通知
|
||||
if user.Setting.NotificationTrigger == constants.NotificationTriggerOnTaskEnd {
|
||||
SendNotifications(user, t, spider)
|
||||
}
|
||||
|
||||
// 保存任务
|
||||
if err := t.Save(); err != nil {
|
||||
log.Errorf(GetWorkerPrefix(id) + err.Error())
|
||||
@@ -444,6 +533,30 @@ func ExecuteTask(id int) {
|
||||
log.Infof(GetWorkerPrefix(id) + "任务(ID:" + t.Id + ")" + "执行完毕. 消耗时间:" + durationStr + "秒")
|
||||
}
|
||||
|
||||
func SpiderFileCheck(t model.Task, spider model.Spider) error {
|
||||
// 判断爬虫文件是否存在
|
||||
gfFile := model.GetGridFs(spider.FileId)
|
||||
if gfFile == nil {
|
||||
t.Error = "找不到爬虫文件,请重新上传"
|
||||
t.Status = constants.StatusError
|
||||
t.FinishTs = time.Now() // 结束时间
|
||||
t.RuntimeDuration = t.FinishTs.Sub(t.StartTs).Seconds() // 运行时长
|
||||
t.TotalDuration = t.FinishTs.Sub(t.CreateTs).Seconds() // 总时长
|
||||
_ = t.Save()
|
||||
return errors.New(t.Error)
|
||||
}
|
||||
|
||||
// 判断md5值是否一致
|
||||
path := filepath.Join(viper.GetString("spider.path"), spider.Name)
|
||||
md5File := filepath.Join(path, spider_handler.Md5File)
|
||||
md5 := utils.GetSpiderMd5Str(md5File)
|
||||
if gfFile.Md5 != md5 {
|
||||
spiderSync := spider_handler.SpiderSync{Spider: spider}
|
||||
spiderSync.RemoveDownCreate(gfFile.Md5)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func GetTaskLog(id string) (logStr string, err error) {
|
||||
task, err := model.GetTask(id)
|
||||
|
||||
@@ -452,6 +565,29 @@ func GetTaskLog(id string) (logStr string, err error) {
|
||||
}
|
||||
|
||||
if IsMasterNode(task.NodeId.Hex()) {
|
||||
if !utils.Exists(task.LogPath) {
|
||||
fileDir, err := MakeLogDir(task)
|
||||
|
||||
if err != nil {
|
||||
log.Errorf(err.Error())
|
||||
}
|
||||
|
||||
fileP := GetLogFilePaths(fileDir)
|
||||
|
||||
// 获取日志文件路径
|
||||
fLog, err := os.Create(fileP)
|
||||
defer fLog.Close()
|
||||
if err != nil {
|
||||
log.Errorf("create task log file error: %s", fileP)
|
||||
debug.PrintStack()
|
||||
}
|
||||
task.LogPath = fileP
|
||||
if err := task.Save(); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
}
|
||||
|
||||
}
|
||||
// 若为主节点,获取本机日志
|
||||
logBytes, err := model.GetLocalLog(task.LogPath)
|
||||
if err != nil {
|
||||
@@ -533,17 +669,188 @@ func CancelTask(id string) (err error) {
|
||||
return nil
|
||||
}
|
||||
|
||||
- func HandleTaskError(t model.Task, err error) {
- log.Error("handle task error:" + err.Error())
- t.Status = constants.StatusError
- t.Error = err.Error()
- t.FinishTs = time.Now()
- if err := t.Save(); err != nil {
+ func AddTask(t model.Task) error {
|
||||
// 生成任务ID
|
||||
id := uuid.NewV4()
|
||||
t.Id = id.String()
|
||||
|
||||
// 设置任务状态
|
||||
t.Status = constants.StatusPending
|
||||
|
||||
// 如果没有传入node_id,则置为null
|
||||
if t.NodeId.Hex() == "" {
|
||||
t.NodeId = bson.ObjectIdHex(constants.ObjectIdNull)
|
||||
}
|
||||
|
||||
// 将任务存入数据库
|
||||
if err := model.AddTask(t); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
- return
+ return err
|
||||
}
|
||||
|
||||
// 加入任务队列
|
||||
if err := AssignTask(t); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func GetTaskEmailMarkdownContent(t model.Task, s model.Spider) string {
|
||||
n, _ := model.GetNode(t.NodeId)
|
||||
errMsg := ""
|
||||
statusMsg := fmt.Sprintf(`<span style="color:green">%s</span>`, t.Status)
|
||||
if t.Status == constants.StatusError {
|
||||
errMsg = " with errors"
|
||||
statusMsg = fmt.Sprintf(`<span style="color:red">%s</span>`, t.Status)
|
||||
}
|
||||
return fmt.Sprintf(`
|
||||
Your task has finished%s. Please find the task info below.
|
||||
|
||||
|
|
||||
--: | :--
|
||||
**Task ID:** | %s
|
||||
**Task Status:** | %s
|
||||
**Task Param:** | %s
|
||||
**Spider ID:** | %s
|
||||
**Spider Name:** | %s
|
||||
**Node:** | %s
|
||||
**Create Time:** | %s
|
||||
**Start Time:** | %s
|
||||
**Finish Time:** | %s
|
||||
**Wait Duration:** | %.0f sec
|
||||
**Runtime Duration:** | %.0f sec
|
||||
**Total Duration:** | %.0f sec
|
||||
**Number of Results:** | %d
|
||||
**Error:** | <span style="color:red">%s</span>
|
||||
|
||||
Please login to Crawlab to view the details.
|
||||
`,
|
||||
errMsg,
|
||||
t.Id,
|
||||
statusMsg,
|
||||
t.Param,
|
||||
s.Id.Hex(),
|
||||
s.Name,
|
||||
n.Name,
|
||||
utils.GetLocalTimeString(t.CreateTs),
|
||||
utils.GetLocalTimeString(t.StartTs),
|
||||
utils.GetLocalTimeString(t.FinishTs),
|
||||
t.WaitDuration,
|
||||
t.RuntimeDuration,
|
||||
t.TotalDuration,
|
||||
t.ResultCount,
|
||||
t.Error,
|
||||
)
|
||||
}
|
||||
|
||||
func GetTaskMarkdownContent(t model.Task, s model.Spider) string {
|
||||
n, _ := model.GetNode(t.NodeId)
|
||||
errMsg := ""
|
||||
errLog := "-"
|
||||
statusMsg := fmt.Sprintf(`<font color="#00FF00">%s</font>`, t.Status)
|
||||
if t.Status == constants.StatusError {
|
||||
errMsg = `(有错误)`
|
||||
errLog = fmt.Sprintf(`<font color="#FF0000">%s</font>`, t.Error)
|
||||
statusMsg = fmt.Sprintf(`<font color="#FF0000">%s</font>`, t.Status)
|
||||
}
|
||||
return fmt.Sprintf(`
|
||||
您的任务已完成%s,请查看任务信息如下。
|
||||
|
||||
> **任务ID:** %s
|
||||
> **任务状态:** %s
|
||||
> **任务参数:** %s
|
||||
> **爬虫ID:** %s
|
||||
> **爬虫名称:** %s
|
||||
> **节点:** %s
|
||||
> **创建时间:** %s
|
||||
> **开始时间:** %s
|
||||
> **完成时间:** %s
|
||||
> **等待时间:** %.0f秒
|
||||
> **运行时间:** %.0f秒
|
||||
> **总时间:** %.0f秒
|
||||
> **结果数:** %d
|
||||
> **错误:** %s
|
||||
|
||||
请登录Crawlab查看详情。
|
||||
`,
|
||||
errMsg,
|
||||
t.Id,
|
||||
statusMsg,
|
||||
t.Param,
|
||||
s.Id.Hex(),
|
||||
s.Name,
|
||||
n.Name,
|
||||
utils.GetLocalTimeString(t.CreateTs),
|
||||
utils.GetLocalTimeString(t.StartTs),
|
||||
utils.GetLocalTimeString(t.FinishTs),
|
||||
t.WaitDuration,
|
||||
t.RuntimeDuration,
|
||||
t.TotalDuration,
|
||||
t.ResultCount,
|
||||
errLog,
|
||||
)
|
||||
}
|
||||
|
||||
func SendTaskEmail(u model.User, t model.Task, s model.Spider) {
|
||||
statusMsg := "has finished"
|
||||
if t.Status == constants.StatusError {
|
||||
statusMsg = "has an error"
|
||||
}
|
||||
title := fmt.Sprintf("[Crawlab] Task for \"%s\" %s", s.Name, statusMsg)
|
||||
if err := notification.SendMail(
|
||||
u.Email,
|
||||
u.Username,
|
||||
title,
|
||||
GetTaskEmailMarkdownContent(t, s),
|
||||
); err != nil {
|
||||
log.Errorf("mail error: " + err.Error())
|
||||
debug.PrintStack()
|
||||
}
|
||||
}
|
||||
|
||||
func SendTaskDingTalk(u model.User, t model.Task, s model.Spider) {
|
||||
statusMsg := "已完成"
|
||||
if t.Status == constants.StatusError {
|
||||
statusMsg = "发生错误"
|
||||
}
|
||||
title := fmt.Sprintf("[Crawlab] \"%s\" 任务%s", s.Name, statusMsg)
|
||||
content := GetTaskMarkdownContent(t, s)
|
||||
if err := notification.SendMobileNotification(u.Setting.DingTalkRobotWebhook, title, content); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
}
|
||||
}
|
||||
|
||||
func SendTaskWechat(u model.User, t model.Task, s model.Spider) {
|
||||
content := GetTaskMarkdownContent(t, s)
|
||||
if err := notification.SendMobileNotification(u.Setting.WechatRobotWebhook, "", content); err != nil {
|
||||
log.Errorf(err.Error())
|
||||
debug.PrintStack()
|
||||
}
|
||||
}
|
||||
|
||||
func SendNotifications(u model.User, t model.Task, s model.Spider) {
|
||||
if u.Email != "" && utils.StringArrayContains(u.Setting.EnabledNotifications, constants.NotificationTypeMail) {
|
||||
go func() {
|
||||
SendTaskEmail(u, t, s)
|
||||
}()
|
||||
}
|
||||
|
||||
if u.Setting.DingTalkRobotWebhook != "" && utils.StringArrayContains(u.Setting.EnabledNotifications, constants.NotificationTypeDingTalk) {
|
||||
go func() {
|
||||
SendTaskDingTalk(u, t, s)
|
||||
}()
|
||||
}
|
||||
|
||||
if u.Setting.WechatRobotWebhook != "" && utils.StringArrayContains(u.Setting.EnabledNotifications, constants.NotificationTypeWechat) {
|
||||
go func() {
|
||||
SendTaskWechat(u, t, s)
|
||||
}()
|
||||
}
|
||||
debug.PrintStack()
|
||||
}
|
||||
|
||||
func InitTaskExecutor() error {
|
||||
|
||||
@@ -6,20 +6,18 @@ import (
|
||||
"crawlab/utils"
|
||||
"errors"
|
||||
"github.com/dgrijalva/jwt-go"
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/globalsign/mgo/bson"
|
||||
"github.com/spf13/viper"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
func InitUserService() error {
|
||||
- adminUser := model.User{
- Username: "admin",
- Password: utils.EncryptPassword("admin"),
- Role: constants.RoleAdmin,
- }
- _ = adminUser.Add()
+ _ = CreateNewUser("admin", "admin", constants.RoleAdmin, "")
|
||||
return nil
|
||||
}
|
||||
|
||||
func MakeToken(user *model.User) (tokenStr string, err error) {
|
||||
token := jwt.NewWithClaims(jwt.SigningMethodHS256, jwt.MapClaims{
|
||||
"id": user.Id,
|
||||
@@ -91,3 +89,29 @@ func CheckToken(tokenStr string) (user model.User, err error) {
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
func CreateNewUser(username string, password string, role string, email string) error {
|
||||
user := model.User{
|
||||
Username: strings.ToLower(username),
|
||||
Password: utils.EncryptPassword(password),
|
||||
Role: role,
|
||||
Email: email,
|
||||
Setting: model.UserSetting{
|
||||
NotificationTrigger: constants.NotificationTriggerNever,
|
||||
EnabledNotifications: []string{
|
||||
constants.NotificationTypeMail,
|
||||
constants.NotificationTypeDingTalk,
|
||||
constants.NotificationTypeWechat,
|
||||
},
|
||||
},
|
||||
}
|
||||
if err := user.Add(); err != nil {
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func GetCurrentUser(c *gin.Context) *model.User {
|
||||
data, _ := c.Get("currentUser")
|
||||
return data.(*model.User)
|
||||
}
|
||||
|
||||
12
backend/template/scrapy/config_spider/items.py
Normal file
@@ -0,0 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
# Define here the models for your scraped items
|
||||
#
|
||||
# See documentation in:
|
||||
# https://docs.scrapy.org/en/latest/topics/items.html
|
||||
|
||||
import scrapy
|
||||
|
||||
|
||||
class Item(scrapy.Item):
|
||||
###ITEMS###
|
||||
103
backend/template/scrapy/config_spider/middlewares.py
Normal file
@@ -0,0 +1,103 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
# Define here the models for your spider middleware
|
||||
#
|
||||
# See documentation in:
|
||||
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
|
||||
|
||||
from scrapy import signals
|
||||
|
||||
|
||||
class ConfigSpiderSpiderMiddleware(object):
|
||||
# Not all methods need to be defined. If a method is not defined,
|
||||
# scrapy acts as if the spider middleware does not modify the
|
||||
# passed objects.
|
||||
|
||||
@classmethod
|
||||
def from_crawler(cls, crawler):
|
||||
# This method is used by Scrapy to create your spiders.
|
||||
s = cls()
|
||||
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
|
||||
return s
|
||||
|
||||
def process_spider_input(self, response, spider):
|
||||
# Called for each response that goes through the spider
|
||||
# middleware and into the spider.
|
||||
|
||||
# Should return None or raise an exception.
|
||||
return None
|
||||
|
||||
def process_spider_output(self, response, result, spider):
|
||||
# Called with the results returned from the Spider, after
|
||||
# it has processed the response.
|
||||
|
||||
# Must return an iterable of Request, dict or Item objects.
|
||||
for i in result:
|
||||
yield i
|
||||
|
||||
def process_spider_exception(self, response, exception, spider):
|
||||
# Called when a spider or process_spider_input() method
|
||||
# (from other spider middleware) raises an exception.
|
||||
|
||||
# Should return either None or an iterable of Request, dict
|
||||
# or Item objects.
|
||||
pass
|
||||
|
||||
def process_start_requests(self, start_requests, spider):
|
||||
# Called with the start requests of the spider, and works
|
||||
# similarly to the process_spider_output() method, except
|
||||
# that it doesn’t have a response associated.
|
||||
|
||||
# Must return only requests (not items).
|
||||
for r in start_requests:
|
||||
yield r
|
||||
|
||||
def spider_opened(self, spider):
|
||||
spider.logger.info('Spider opened: %s' % spider.name)
|
||||
|
||||
|
||||
class ConfigSpiderDownloaderMiddleware(object):
|
||||
# Not all methods need to be defined. If a method is not defined,
|
||||
# scrapy acts as if the downloader middleware does not modify the
|
||||
# passed objects.
|
||||
|
||||
@classmethod
|
||||
def from_crawler(cls, crawler):
|
||||
# This method is used by Scrapy to create your spiders.
|
||||
s = cls()
|
||||
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
|
||||
return s
|
||||
|
||||
def process_request(self, request, spider):
|
||||
# Called for each request that goes through the downloader
|
||||
# middleware.
|
||||
|
||||
# Must either:
|
||||
# - return None: continue processing this request
|
||||
# - or return a Response object
|
||||
# - or return a Request object
|
||||
# - or raise IgnoreRequest: process_exception() methods of
|
||||
# installed downloader middleware will be called
|
||||
return None
|
||||
|
||||
def process_response(self, request, response, spider):
|
||||
# Called with the response returned from the downloader.
|
||||
|
||||
# Must either;
|
||||
# - return a Response object
|
||||
# - return a Request object
|
||||
# - or raise IgnoreRequest
|
||||
return response
|
||||
|
||||
def process_exception(self, request, exception, spider):
|
||||
# Called when a download handler or a process_request()
|
||||
# (from other downloader middleware) raises an exception.
|
||||
|
||||
# Must either:
|
||||
# - return None: continue processing this exception
|
||||
# - return a Response object: stops process_exception() chain
|
||||
# - return a Request object: stops process_exception() chain
|
||||
pass
|
||||
|
||||
def spider_opened(self, spider):
|
||||
spider.logger.info('Spider opened: %s' % spider.name)
|
||||
27
backend/template/scrapy/config_spider/pipelines.py
Normal file
@@ -0,0 +1,27 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
# Define your item pipelines here
|
||||
#
|
||||
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
|
||||
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
|
||||
|
||||
import os
|
||||
from pymongo import MongoClient
|
||||
|
||||
mongo = MongoClient(
|
||||
host=os.environ.get('CRAWLAB_MONGO_HOST') or 'localhost',
|
||||
port=int(os.environ.get('CRAWLAB_MONGO_PORT') or 27017),
|
||||
username=os.environ.get('CRAWLAB_MONGO_USERNAME'),
|
||||
password=os.environ.get('CRAWLAB_MONGO_PASSWORD'),
|
||||
authSource=os.environ.get('CRAWLAB_MONGO_AUTHSOURCE') or 'admin'
|
||||
)
|
||||
db = mongo[os.environ.get('CRAWLAB_MONGO_DB') or 'test']
|
||||
col = db[os.environ.get('CRAWLAB_COLLECTION') or 'test']
|
||||
task_id = os.environ.get('CRAWLAB_TASK_ID')
|
||||
|
||||
class ConfigSpiderPipeline(object):
|
||||
def process_item(self, item, spider):
|
||||
item['task_id'] = task_id
|
||||
if col is not None:
|
||||
col.save(item)
|
||||
return item
|
||||
111
backend/template/scrapy/config_spider/settings.py
Normal file
111
backend/template/scrapy/config_spider/settings.py
Normal file
@@ -0,0 +1,111 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
import os
|
||||
import re
|
||||
import json
|
||||
|
||||
# Scrapy settings for config_spider project
|
||||
#
|
||||
# For simplicity, this file contains only settings considered important or
|
||||
# commonly used. You can find more settings consulting the documentation:
|
||||
#
|
||||
# https://docs.scrapy.org/en/latest/topics/settings.html
|
||||
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
|
||||
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
|
||||
|
||||
BOT_NAME = 'Crawlab Configurable Spider'
|
||||
|
||||
SPIDER_MODULES = ['config_spider.spiders']
|
||||
NEWSPIDER_MODULE = 'config_spider.spiders'
|
||||
|
||||
|
||||
# Crawl responsibly by identifying yourself (and your website) on the user-agent
|
||||
USER_AGENT = 'Crawlab Spider'
|
||||
|
||||
# Obey robots.txt rules
|
||||
ROBOTSTXT_OBEY = True
|
||||
|
||||
# Configure maximum concurrent requests performed by Scrapy (default: 16)
|
||||
#CONCURRENT_REQUESTS = 32
|
||||
|
||||
# Configure a delay for requests for the same website (default: 0)
|
||||
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
|
||||
# See also autothrottle settings and docs
|
||||
#DOWNLOAD_DELAY = 3
|
||||
# The download delay setting will honor only one of:
|
||||
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
|
||||
#CONCURRENT_REQUESTS_PER_IP = 16
|
||||
|
||||
# Disable cookies (enabled by default)
|
||||
#COOKIES_ENABLED = False
|
||||
|
||||
# Disable Telnet Console (enabled by default)
|
||||
#TELNETCONSOLE_ENABLED = False
|
||||
|
||||
# Override the default request headers:
|
||||
#DEFAULT_REQUEST_HEADERS = {
|
||||
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
|
||||
# 'Accept-Language': 'en',
|
||||
#}
|
||||
|
||||
# Enable or disable spider middlewares
|
||||
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
|
||||
#SPIDER_MIDDLEWARES = {
|
||||
# 'config_spider.middlewares.ConfigSpiderSpiderMiddleware': 543,
|
||||
#}
|
||||
|
||||
# Enable or disable downloader middlewares
|
||||
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
|
||||
#DOWNLOADER_MIDDLEWARES = {
|
||||
# 'config_spider.middlewares.ConfigSpiderDownloaderMiddleware': 543,
|
||||
#}
|
||||
|
||||
# Enable or disable extensions
|
||||
# See https://docs.scrapy.org/en/latest/topics/extensions.html
|
||||
#EXTENSIONS = {
|
||||
# 'scrapy.extensions.telnet.TelnetConsole': None,
|
||||
#}
|
||||
|
||||
# Configure item pipelines
|
||||
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
|
||||
ITEM_PIPELINES = {
|
||||
'config_spider.pipelines.ConfigSpiderPipeline': 300,
|
||||
}
|
||||
|
||||
# Enable and configure the AutoThrottle extension (disabled by default)
|
||||
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
|
||||
#AUTOTHROTTLE_ENABLED = True
|
||||
# The initial download delay
|
||||
#AUTOTHROTTLE_START_DELAY = 5
|
||||
# The maximum download delay to be set in case of high latencies
|
||||
#AUTOTHROTTLE_MAX_DELAY = 60
|
||||
# The average number of requests Scrapy should be sending in parallel to
|
||||
# each remote server
|
||||
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
|
||||
# Enable showing throttling stats for every response received:
|
||||
#AUTOTHROTTLE_DEBUG = False
|
||||
|
||||
# Enable and configure HTTP caching (disabled by default)
|
||||
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
|
||||
#HTTPCACHE_ENABLED = True
|
||||
#HTTPCACHE_EXPIRATION_SECS = 0
|
||||
#HTTPCACHE_DIR = 'httpcache'
|
||||
#HTTPCACHE_IGNORE_HTTP_CODES = []
|
||||
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
|
||||
|
||||
for setting_env_name in [x for x in os.environ.keys() if x.startswith('CRAWLAB_SETTING_')]:
|
||||
setting_name = setting_env_name.replace('CRAWLAB_SETTING_', '')
|
||||
setting_value = os.environ.get(setting_env_name)
|
||||
if setting_value.lower() == 'true':
|
||||
setting_value = True
|
||||
elif setting_value.lower() == 'false':
|
||||
setting_value = False
|
||||
elif re.search(r'^\d+$', setting_value) is not None:
|
||||
setting_value = int(setting_value)
|
||||
elif re.search(r'^\{.*\}$', setting_value.strip()) is not None:
|
||||
setting_value = json.loads(setting_value)
|
||||
elif re.search(r'^\[.*\]$', setting_value.strip()) is not None:
|
||||
setting_value = json.loads(setting_value)
|
||||
else:
|
||||
pass
|
||||
locals()[setting_name] = setting_value
|
||||
|
||||
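The loop at the end of settings.py is how the `settings` block of a Spiderfile (see the templates below) reaches Scrapy: every CRAWLAB_SETTING_<NAME> environment variable is parsed and bound as a module-level setting. The standalone sketch below mirrors that parsing logic; the function name and the sample values are illustrative only.

# Sketch of the CRAWLAB_SETTING_* value parsing used in settings.py above.
import json
import re

def parse_setting_value(raw):
    # booleans, plain integers and JSON objects/arrays are converted,
    # anything else stays a string - the same branches as in settings.py
    if raw.lower() == 'true':
        return True
    if raw.lower() == 'false':
        return False
    if re.search(r'^\d+$', raw) is not None:
        return int(raw)
    if re.search(r'^\{.*\}$', raw.strip()) or re.search(r'^\[.*\]$', raw.strip()):
        return json.loads(raw)
    return raw

# e.g. CRAWLAB_SETTING_ROBOTSTXT_OBEY=false  ->  ROBOTSTXT_OBEY = False
print(parse_setting_value('false'), parse_setting_value('16'))
print(parse_setting_value('{"Accept-Language": "en"}'))
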
21
backend/template/scrapy/config_spider/spiders/spider.py
Normal file
@@ -0,0 +1,21 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
import scrapy
|
||||
import re
|
||||
from config_spider.items import Item
|
||||
from urllib.parse import urljoin, urlparse
|
||||
|
||||
def get_real_url(response, url):
|
||||
if re.search(r'^https?', url):
|
||||
return url
|
||||
elif re.search(r'^\/\/', url):
|
||||
u = urlparse(response.url)
|
||||
return u.scheme + url
|
||||
return urljoin(response.url, url)
|
||||
|
||||
class ConfigSpider(scrapy.Spider):
|
||||
name = 'config_spider'
|
||||
|
||||
def start_requests(self):
|
||||
yield scrapy.Request(url='###START_URL###', callback=self.###START_STAGE###)
|
||||
|
||||
###PARSERS###
|
||||
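get_real_url in the spider template above normalises the href values extracted by the generated parsers: absolute URLs pass through, protocol-relative URLs get the scheme of the current response, and anything else is joined against the page URL. A small illustration follows; FakeResponse is a stand-in used only for this example, not part of the template.

# Illustration of get_real_url from the template above.
import re
from urllib.parse import urljoin, urlparse

class FakeResponse:
    # stand-in for a scrapy Response; only .url is needed here
    url = 'https://example.com/list/page-1.html'

def get_real_url(response, url):
    if re.search(r'^https?', url):
        return url
    elif re.search(r'^\/\/', url):
        u = urlparse(response.url)
        return u.scheme + url
    return urljoin(response.url, url)

r = FakeResponse()
print(get_real_url(r, 'https://other.com/a.html'))  # already absolute -> unchanged
print(get_real_url(r, 'detail-42.html'))            # -> https://example.com/list/detail-42.html
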
11
backend/template/scrapy/scrapy.cfg
Normal file
@@ -0,0 +1,11 @@
|
||||
# Automatically created by: scrapy startproject
|
||||
#
|
||||
# For more information about the [deploy] section see:
|
||||
# https://scrapyd.readthedocs.io/en/latest/deploy.html
|
||||
|
||||
[settings]
|
||||
default = config_spider.settings
|
||||
|
||||
[deploy]
|
||||
#url = http://localhost:6800/
|
||||
project = config_spider
|
||||
19
backend/template/spiderfile/Spiderfile.163_news
Normal file
@@ -0,0 +1,19 @@
|
||||
name: "toscrapy_books"
|
||||
start_url: "http://news.163.com/special/0001386F/rank_news.html"
|
||||
start_stage: "list"
|
||||
engine: "scrapy"
|
||||
stages:
|
||||
- name: list
|
||||
is_list: true
|
||||
list_css: "table tr:not(:first-child)"
|
||||
fields:
|
||||
- name: "title"
|
||||
css: "td:nth-child(1) > a"
|
||||
- name: "url"
|
||||
css: "td:nth-child(1) > a"
|
||||
attr: "href"
|
||||
- name: "clicks"
|
||||
css: "td.cBlue"
|
||||
settings:
|
||||
ROBOTSTXT_OBEY: false
|
||||
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
|
||||
21
backend/template/spiderfile/Spiderfile.baidu
Normal file
@@ -0,0 +1,21 @@
|
||||
name: toscrapy_books
|
||||
start_url: http://www.baidu.com/s?wd=crawlab
|
||||
start_stage: list
|
||||
engine: scrapy
|
||||
stages:
|
||||
- name: list
|
||||
is_list: true
|
||||
list_xpath: //*[contains(@class, "c-container")]
|
||||
page_xpath: //*[@id="page"]//a[@class="n"][last()]
|
||||
page_attr: href
|
||||
fields:
|
||||
- name: title
|
||||
xpath: .//h3/a
|
||||
- name: url
|
||||
xpath: .//h3/a
|
||||
attr: href
|
||||
- name: abstract
|
||||
xpath: .//*[@class="c-abstract"]
|
||||
settings:
|
||||
ROBOTSTXT_OBEY: false
|
||||
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
|
||||
27
backend/template/spiderfile/Spiderfile.toscrapy_books
Normal file
@@ -0,0 +1,27 @@
|
||||
name: "toscrapy_books"
|
||||
start_url: "http://books.toscrape.com"
|
||||
start_stage: "list"
|
||||
engine: "scrapy"
|
||||
stages:
|
||||
- name: list
|
||||
is_list: true
|
||||
list_css: "section article.product_pod"
|
||||
page_css: "ul.pager li.next a"
|
||||
page_attr: "href"
|
||||
fields:
|
||||
- name: "title"
|
||||
css: "h3 > a"
|
||||
- name: "url"
|
||||
css: "h3 > a"
|
||||
attr: "href"
|
||||
next_stage: "detail"
|
||||
- name: "price"
|
||||
css: ".product_price > .price_color"
|
||||
- name: detail
|
||||
is_list: false
|
||||
fields:
|
||||
- name: "description"
|
||||
css: "#product_description + p"
|
||||
settings:
|
||||
ROBOTSTXT_OBEY: true
|
||||
AUTOTHROTTLE_ENABLED: true
|
||||
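The Spiderfile templates above describe a configurable spider declaratively: a start URL, a start stage, and one or more stages whose fields carry CSS or XPath selectors, with optional next_stage links and a settings block that ends up as CRAWLAB_SETTING_* variables. A hedged sketch of reading such a file with PyYAML and walking its stages is shown below; the inline YAML is a trimmed example for illustration, not a file shipped with Crawlab.

# Sketch: load a Spiderfile-style document and list what each stage extracts.
import yaml

SPIDERFILE = """
name: toscrapy_books
start_url: http://books.toscrape.com
start_stage: list
engine: scrapy
stages:
  - name: list
    is_list: true
    list_css: "section article.product_pod"
    fields:
      - name: title
        css: "h3 > a"
      - name: url
        css: "h3 > a"
        attr: "href"
"""

config = yaml.safe_load(SPIDERFILE)
for stage in config['stages']:
    field_names = [f['name'] for f in stage['fields']]
    print(stage['name'], 'extracts', field_names)  # list extracts ['title', 'url']
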
51
backend/template/spiders/amazon_config/Spiderfile
Normal file
@@ -0,0 +1,51 @@
|
||||
name: "amazon_config"
|
||||
display_name: "亚马逊中国(可配置)"
|
||||
remark: "亚马逊中国搜索手机,列表+分页"
|
||||
type: "configurable"
|
||||
col: "results_amazon_config"
|
||||
engine: scrapy
|
||||
start_url: https://www.amazon.cn/s?k=%E6%89%8B%E6%9C%BA&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss_2
|
||||
start_stage: list
|
||||
stages:
|
||||
- name: list
|
||||
is_list: true
|
||||
list_css: .s-result-item
|
||||
list_xpath: ""
|
||||
page_css: .a-last > a
|
||||
page_xpath: ""
|
||||
page_attr: href
|
||||
fields:
|
||||
- name: title
|
||||
css: span.a-text-normal
|
||||
xpath: ""
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: url
|
||||
css: .a-link-normal
|
||||
xpath: ""
|
||||
attr: href
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: price
|
||||
css: ""
|
||||
xpath: .//*[@class="a-price-whole"]
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: price_fraction
|
||||
css: ""
|
||||
xpath: .//*[@class="a-price-fraction"]
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: img
|
||||
css: .s-image-square-aspect > img
|
||||
xpath: ""
|
||||
attr: src
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
settings:
|
||||
ROBOTSTXT_OBEY: "false"
|
||||
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML,
|
||||
like Gecko) Chrome/78.0.3904.108 Safari/537.36
|
||||
57
backend/template/spiders/autohome_config/Spiderfile
Normal file
@@ -0,0 +1,57 @@
|
||||
name: "autohome_config"
|
||||
display_name: "汽车之家(可配置)"
|
||||
remark: "汽车之家文章,列表+详情+分页"
|
||||
type: "configurable"
|
||||
col: "results_autohome_config"
|
||||
engine: scrapy
|
||||
start_url: https://www.autohome.com.cn/all/
|
||||
start_stage: list
|
||||
stages:
|
||||
- name: list
|
||||
is_list: true
|
||||
list_css: ul.article > li
|
||||
list_xpath: ""
|
||||
page_css: a.page-item-next
|
||||
page_xpath: ""
|
||||
page_attr: href
|
||||
fields:
|
||||
- name: title
|
||||
css: li > a > h3
|
||||
xpath: ""
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: url
|
||||
css: li > a
|
||||
xpath: ""
|
||||
attr: href
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: abstract
|
||||
css: li > a > p
|
||||
xpath: ""
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: time
|
||||
css: li > a .fn-left
|
||||
xpath: ""
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: views
|
||||
css: li > a .fn-right > em:first-child
|
||||
xpath: ""
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: comments
|
||||
css: li > a .fn-right > em:last-child
|
||||
xpath: ""
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
settings:
|
||||
ROBOTSTXT_OBEY: "false"
|
||||
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML,
|
||||
like Gecko) Chrome/78.0.3904.108 Safari/537.36
|
||||
39
backend/template/spiders/baidu_config/Spiderfile
Normal file
@@ -0,0 +1,39 @@
|
||||
name: "baidu_config"
|
||||
display_name: "百度搜索(可配置)"
|
||||
remark: "百度搜索Crawlab,列表+分页"
|
||||
type: "configurable"
|
||||
col: "results_baidu_config"
|
||||
engine: scrapy
|
||||
start_url: http://www.baidu.com/s?wd=crawlab
|
||||
start_stage: list
|
||||
stages:
|
||||
- name: list
|
||||
is_list: true
|
||||
list_css: ".result.c-container"
|
||||
list_xpath: ""
|
||||
page_css: "a.n"
|
||||
page_xpath: ""
|
||||
page_attr: href
|
||||
fields:
|
||||
- name: title
|
||||
css: ""
|
||||
xpath: .//h3/a
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: url
|
||||
css: ""
|
||||
xpath: .//h3/a
|
||||
attr: href
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
- name: abstract
|
||||
css: ""
|
||||
xpath: .//*[@class="c-abstract"]
|
||||
attr: ""
|
||||
next_stage: ""
|
||||
remark: ""
|
||||
settings:
|
||||
ROBOTSTXT_OBEY: "false"
|
||||
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML,
|
||||
like Gecko) Chrome/78.0.3904.108 Safari/537.36
|
||||
6
backend/template/spiders/bing_general/Spiderfile
Normal file
@@ -0,0 +1,6 @@
|
||||
name: "bing_general"
|
||||
display_name: "必应搜索 (通用)"
|
||||
remark: "必应搜索 Crawlab,列表+分页"
|
||||
col: "results_bing_general"
|
||||
type: "customized"
|
||||
cmd: "python bing_spider.py"
|
||||
41
backend/template/spiders/bing_general/bing_spider.py
Normal file
@@ -0,0 +1,41 @@
|
||||
import requests
|
||||
from bs4 import BeautifulSoup as bs
|
||||
from urllib.parse import urljoin, urlparse
|
||||
import re
|
||||
from crawlab import save_item
|
||||
|
||||
s = requests.Session()
|
||||
|
||||
def get_real_url(response, url):
|
||||
if re.search(r'^https?', url):
|
||||
return url
|
||||
elif re.search(r'^\/\/', url):
|
||||
u = urlparse(response.url)
|
||||
return u.scheme + url
|
||||
return urljoin(response.url, url)
|
||||
|
||||
def start_requests():
|
||||
for i in range(0, 9):
|
||||
fr = 'PERE' if not i else 'MORE'
|
||||
url = f'https://cn.bing.com/search?q=crawlab&first={10 * i + 1}&FROM={fr}'
|
||||
request_page(url)
|
||||
|
||||
def request_page(url):
|
||||
print(f'requesting {url}')
|
||||
r = s.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'})
|
||||
parse_list(r)
|
||||
|
||||
def parse_list(response):
|
||||
soup = bs(response.content.decode('utf-8'))
|
||||
for el in list(soup.select('#b_results > li')):
|
||||
try:
|
||||
save_item({
|
||||
'title': el.select_one('h2').text,
|
||||
'url': el.select_one('h2 a').attrs.get('href'),
|
||||
'abstract': el.select_one('.b_caption p').text,
|
||||
})
|
||||
except:
|
||||
pass
|
||||
|
||||
if __name__ == '__main__':
|
||||
start_requests()
|
||||
5
backend/template/spiders/chinaz/Spiderfile
Normal file
@@ -0,0 +1,5 @@
|
||||
name: "chinaz"
|
||||
display_name: "站长之家 (Scrapy)"
|
||||
col: "results_chinaz"
|
||||
type: "customized"
|
||||
cmd: "scrapy crawl chinaz_spider"
|
||||
7
backend/template/spiders/chinaz/chinaz/pipelines.py
Normal file
@@ -0,0 +1,7 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
# Define your item pipelines here
|
||||
#
|
||||
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
|
||||
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
|
||||
|
||||
@@ -65,7 +65,7 @@ ROBOTSTXT_OBEY = True
|
||||
# Configure item pipelines
|
||||
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
|
||||
ITEM_PIPELINES = {
|
||||
'chinaz.pipelines.MongoPipeline': 300,
|
||||
'crawlab.pipelines.CrawlabMongoPipeline': 300,
|
||||
}
|
||||
|
||||
# Enable and configure the AutoThrottle extension (disabled by default)
|
||||
Some files were not shown because too many files have changed in this diff