diff --git a/CHANGELOG-zh.md b/CHANGELOG-zh.md deleted file mode 100644 index eef72af1..00000000 --- a/CHANGELOG-zh.md +++ /dev/null @@ -1,316 +0,0 @@ -# 0.6.0 (TBC) - -(TBC) - -# 0.5.1 (2020-07-31) - -### 功能 / 优化 -- **加入错误详情信息**. -- **加入 Golang 编程语言支持**. -- **加入 Chrome Driver 和 Firefox 的 Web Driver 安装脚本**. -- **支持系统任务**. "系统任务"跟普通爬虫任务相似,允许用户查看诸如安装语言之类的任务日志. -- **将安装语言从 RPC 更改为系统任务**. - -### Bug 修复 -- **修复在爬虫市场中第一次下载爬虫时会报500错误**. [#808](https://github.com/crawlab-team/crawlab/issues/808) -- **修复一部分翻译问题**. -- **修复任务详情 500 错误**. [#810](https://github.com/crawlab-team/crawlab/issues/810) -- **修复密码重置问题**. [#811](https://github.com/crawlab-team/crawlab/issues/811) -- **修复无法下载 CSV 问题**. [#812](https://github.com/crawlab-team/crawlab/issues/812) -- **修复无法安装 Node.js 问题**. [#813](https://github.com/crawlab-team/crawlab/issues/813) -- **修复批量添加定时任务时默认为禁用问题**. [#814](https://github.com/crawlab-team/crawlab/issues/814) - -# 0.5.0 (2020-07-19) -### 功能 / 优化 -- **爬虫市场**. 允许用户下载开源爬虫到 Crawlab. -- **批量操作**. 允许用户与 Crawlab 批量交互,例如批量运行任务、批量删除爬虫等等. -- **迁移 MongoDB 驱动器至 `MongoDriver`**. -- **重构优化节点逻辑代码**. -- **更改默认 `task.workers` 至 16**. -- **更改默认 nginx `client_max_body_size` 为 200m**. -- **支持写日志到 ElasticSearch**. -- **在 Scrapy 页面展示错误详情**. -- **删除挑战页面**. -- **将反馈、免责声明页面移动到顶部**. - -### Bug 修复 -- **修复由于 TTL 索引未创建导致的日志不过期问题**. -- **设置默认日志过期时间为 1 天**. -- **`task_id` 索引没有创建**. -- **`docker-compose.yml` 修复**. -- **修复 404 页面**. -- **修复无法先创建工作节点问题**. - -# 0.4.10 (2020-04-21) -### 功能 / 优化 -- **优化日志管理**. 集中化管理日志,储存在 MongoDB,减少对 PubSub 的依赖,允许日志异常检测. -- **自动安装依赖**. 允许从 `requirements.txt` 和 `package.json` 自动安装依赖. -- **API Token**. 允许用户生成 API Token,并利用它们来集成到自己的系统中. -- **Web Hook**. 当任务开始或结束时,触发 Web Hook http 请求到预定义好的 URL. -- **自动生成结果集**. 如果没有设置,自动设置结果集为 `results_`. -- **优化项目列表**. 项目列表中不展示 "No Project". -- **升级 Node.js**. 将 Node.js 版本从 v8.12 升级到 v10.19. -- **定时任务增加运行按钮**. 允许用户在定时任务界面手动运行爬虫任务. - -### Bug 修复 -- **无法注册**. [#670](https://github.com/crawlab-team/crawlab/issues/670) -- **爬虫定时任务标签 Cron 表达式显示秒**. [#678](https://github.com/crawlab-team/crawlab/issues/678) -- **爬虫每日数据缺失**. [#684](https://github.com/crawlab-team/crawlab/issues/684) -- **结果数量未即时更新**. [#689](https://github.com/crawlab-team/crawlab/issues/689) - -# 0.4.9 (2020-03-31) -### 功能 / 优化 -- **挑战**. 用户可以完成不同的趣味挑战.. -- **更高级的权限控制**. 更细化的权限管理,例如普通用户只能查看或管理自己的爬虫或项目,而管理用户可以查看或管理所有爬虫或项目. -- **反馈**. 允许用户发送反馈和评分给 Crawlab 开发组. -- **更好的主页指标**. 优化主页上的指标展示. -- **可配置爬虫转化为自定义爬虫**. 用户可以将自己的可配置爬虫转化为 Scrapy 自定义爬虫. -- **查看定时任务触发的任务**. 允许用户查看定时任务触发的任务. [#648](https://github.com/crawlab-team/crawlab/issues/648) -- **支持结果去重**. 允许用户配置结果去重. [#579](https://github.com/crawlab-team/crawlab/issues/579) -- **支持任务重试**. 允许任务重新触发历史任务. - -### Bug 修复 -- **无法注册**. [#670](https://github.com/crawlab-team/crawlab/issues/670) -- **CLI 无法在 Windows 上使用**. [#580](https://github.com/crawlab-team/crawlab/issues/580) -- **重新上传错误**. [#643](https://github.com/crawlab-team/crawlab/issues/643) [#640](https://github.com/crawlab-team/crawlab/issues/640) -- **上传丢失文件目录**. [#646](https://github.com/crawlab-team/crawlab/issues/646) -- **无法在爬虫定时任务标签中添加定时任务**. - -# 0.4.8 (2020-03-11) -### 功能 / 优化 -- **支持更多编程语言安装**. 现在用户可以安装或预装更多的编程语言,包括 Java、.Net Core、PHP. -- **安装 UI 优化**. 用户能够更好的查看和管理节点列表页的安装. -- **更多 Git 支持**. 允许用户查看 Git Commits 记录,并 Checkout 到相应 Commit. -- **支持用 Hostname 作为节点注册类型**. 用户可以将 hostname 作为节点的唯一识别号. -- **RPC 支持**. 加入 RPC 支持来更好的管理节点通信. -- **是否在主节点运行开关**. 用户可以决定是否在主节点运行,如果为否,则所有任务将在工作节点上运行. -- **默认禁用教程**. -- **加入相关文档侧边栏**. -- **加载页面优化**. - -### Bug 修复 -- **重复节点**. [#391](https://github.com/crawlab-team/crawlab/issues/391) -- **重复上传爬虫**. [#603](https://github.com/crawlab-team/crawlab/issues/603) -- **节点第三方模块安装失败导致 节点安装第三方部分无法使用**. [#609](https://github.com/crawlab-team/crawlab/issues/609) -- **离线节点也会创建任务**. [#622](https://github.com/crawlab-team/crawlab/issues/622) - -# 0.4.7 (2020-02-24) -### 功能 / 优化 -- **更好的支持 Scrapy**. 爬虫识别,`settings.py` 配置,日志级别选择,爬虫选择. [#435](https://github.com/crawlab-team/crawlab/issues/435) -- **Git 同步**. 允许用户将 Git 项目同步到 Crawlab. -- **长任务支持**. 用户可以添加长任务爬虫,这些爬虫可以跑长期运行的任务. [425](https://github.com/crawlab-team/crawlab/issues/425) -- **爬虫列表优化**. 分状态任务列数统计,任务列表详情弹出框,图例. [425](https://github.com/crawlab-team/crawlab/issues/425) -- **版本升级检测**. 检测最新版本,通知用户升级. -- **批量操作爬虫**. 允许用户批量运行/停止爬虫任务,以及批量删除爬虫. -- **复制爬虫**. 允许用户复制已存在爬虫来创建新爬虫. -- **微信群二维码**. - -### Bug 修复 -- **定时任务爬虫选择问题**. 字段不会随着爬虫变化而响应. -- **定时任务冲突问题**. 两个不同的爬虫设置定时任务,时间设置成相同的话,可能会有bug. [#515](https://github.com/crawlab-team/crawlab/issues/515) [#565](https://github.com/crawlab-team/crawlab/issues/565) -- **任务日志问题**. 在同一时间触发的不同任务可能会写入同一个日志文件. [#577](https://github.com/crawlab-team/crawlab/issues/577) -- **任务列表筛选选项不全**. - -# 0.4.6 (2020-02-13) -### 功能 / 优化 -- **Node.js SDK**. 用户可以将 SDK 应用到他们的 Node.js 爬虫中. -- **日志管理优化**. 日志搜索,错误高亮,自动滚动. -- **任务执行流程优化**. 允许用户在触发任务后跳转到该任务详情页. -- **任务展示优化**. 在爬虫详情页的最近任务表格中加入了“参数”列. [#295](https://github.com/crawlab-team/crawlab/issues/295) -- **爬虫列表优化**. 在爬虫列表页加入"更新时间"和"创建时间". [#505](https://github.com/crawlab-team/crawlab/issues/505) -- **页面加载占位器**. - -### Bug 修复 -- **定时任务配置失去焦点**. [#519](https://github.com/crawlab-team/crawlab/issues/519) -- **无法用 CLI 工具上传爬虫**. [#524](https://github.com/crawlab-team/crawlab/issues/524) - -# 0.4.5 (2020-02-03) -### 功能 / 优化 -- **交互式教程**. 引导用户了解 Crawlab 的主要功能. -- **加入全局环境变量**. 可以设置全局环境变量,然后传入到所有爬虫程序中. [#177](https://github.com/crawlab-team/crawlab/issues/177) -- **项目**. 允许用户将爬虫关联到项目上. [#316](https://github.com/crawlab-team/crawlab/issues/316) -- **示例爬虫**. 当初始化时,自动加入示例爬虫. [#379](https://github.com/crawlab-team/crawlab/issues/379) -- **用户管理优化**. 限制管理用户的权限. [#456](https://github.com/crawlab-team/crawlab/issues/456) -- **设置页面优化**. -- **任务结果页面优化**. - -### Bug 修复 -- **无法找到爬虫文件错误**. [#485](https://github.com/crawlab-team/crawlab/issues/485) -- **点击删除按钮导致跳转**. [#480](https://github.com/crawlab-team/crawlab/issues/480) -- **无法在空爬虫里创建文件**. [#479](https://github.com/crawlab-team/crawlab/issues/479) -- **下载结果错误**. [#465](https://github.com/crawlab-team/crawlab/issues/465) -- **crawlab-sdk CLI 错误**. [#458](https://github.com/crawlab-team/crawlab/issues/458) -- **页面刷新问题**. [#441](https://github.com/crawlab-team/crawlab/issues/441) -- **结果不支持 JSON**. [#202](https://github.com/crawlab-team/crawlab/issues/202) -- **修复“删除爬虫后获取所有爬虫”错误**. -- **修复 i18n 警告**. - -# 0.4.4 (2020-01-17) - -### 功能 / 优化 -- **邮件通知**. 允许用户发送邮件消息通知. -- **钉钉机器人通知**. 允许用户发送钉钉机器人消息通知. -- **企业微信机器人通知**. 允许用户发送企业微信机器人消息通知. -- **API 地址优化**. 在前端加入相对路径,因此用户不需要特别注明 `CRAWLAB_API_ADDRESS`. -- **SDK 兼容**. 允许用户通过 Crawlab SDK 与 Scrapy 或通用爬虫集成. -- **优化文件管理**. 加入树状文件侧边栏,让用户更方便的编辑文件. -- **高级定时任务 Cron**. 允许用户通过 Cron 可视化编辑器编辑定时任务. - -### Bug 修复 -- **`nil retuened` 错误**. -- **使用 HTTPS 出现的报错**. -- **无法在爬虫列表页运行可配置爬虫**. -- **上传爬虫文件缺少表单验证**. - -# 0.4.3 (2020-01-07) - -### 功能 / 优化 -- **依赖安装**. 允许用户在平台 Web 界面安装/卸载依赖以及添加编程语言(暂时只有 Node.js)。 -- **Docker 中预装编程语言**. 允许 Docker 用户通过设置 `CRAWLAB_SERVER_LANG_NODE` 为 `Y` 来预装 `Node.js` 环境. -- **在爬虫详情页添加定时任务列表**. 允许用户在爬虫详情页查看、添加、编辑定时任务. [#360](https://github.com/crawlab-team/crawlab/issues/360) -- **Cron 表达式与 Linux 一致**. 将表达式从 6 元素改为 5 元素,与 Linux 一致. -- **启用/禁用定时任务**. 允许用户启用/禁用定时任务. [#297](https://github.com/crawlab-team/crawlab/issues/297) -- **优化任务管理**. 允许用户批量删除任务. [#341](https://github.com/crawlab-team/crawlab/issues/341) -- **优化爬虫管理**. 允许用户在爬虫列表页对爬虫进行筛选和排序. -- **添加中文版 `CHANGELOG`**. -- **在顶部添加 Github 加星按钮**. - -### Bug 修复 -- **定时任务问题**. [#423](https://github.com/crawlab-team/crawlab/issues/423) -- **上传爬虫zip文件问题**. [#403](https://github.com/crawlab-team/crawlab/issues/403) [#407](https://github.com/crawlab-team/crawlab/issues/407) -- **因为网络原因导致崩溃**. [#340](https://github.com/crawlab-team/crawlab/issues/340) -- **定时任务无法正常运行** -- **定时任务列表列表错位问题** -- **刷新按钮跳转错误问题** - -# 0.4.2 (2019-12-26) -### 功能 / 优化 -- **免责声明**. 加入免责声明. -- **通过 API 获取版本号**. [#371](https://github.com/crawlab-team/crawlab/issues/371) -- **通过配置来允许用户注册**. [#346](https://github.com/crawlab-team/crawlab/issues/346) -- **允许添加新用户**. -- **更高级的文件管理**. 允许用户添加、编辑、重命名、删除代码文件. [#286](https://github.com/crawlab-team/crawlab/issues/286) -- **优化爬虫创建流程**. 允许用户在上传 zip 文件前创建空的自定义爬虫. -- **优化任务管理**. 允许用户通过选择条件过滤任务. [#341](https://github.com/crawlab-team/crawlab/issues/341) - -### Bug 修复 -- **重复节点**. [#391](https://github.com/crawlab-team/crawlab/issues/391) -- **"mongodb no reachable" 错误**. [#373](https://github.com/crawlab-team/crawlab/issues/373) - -# 0.4.1 (2019-12-13) -### 功能 / 优化 -- **Spiderfile 优化**. 将阶段由数组更换为字典. [#358](https://github.com/crawlab-team/crawlab/issues/358) -- **百度统计更新**. - -### Bug 修复 -- **无法展示定时任务**. [#353](https://github.com/crawlab-team/crawlab/issues/353) -- **重复节点注册**. [#334](https://github.com/crawlab-team/crawlab/issues/334) - -# 0.4.0 (2019-12-06) -### 功能 / 优化 -- **可配置爬虫**. 允许用户添加 `Spiderfile` 来配置抓取规则. -- **执行模式**. 允许用户选择 3 种任务执行模式: *所有节点*, *指定节点* and *随机*. - -### Bug 修复 -- **任务意外被杀死**. [#306](https://github.com/crawlab-team/crawlab/issues/306) -- **文档更正**. [#301](https://github.com/crawlab-team/crawlab/issues/258) [#301](https://github.com/crawlab-team/crawlab/issues/258) -- **直接部署与 Windows 不兼容**. [#288](https://github.com/crawlab-team/crawlab/issues/288) -- **日志文件丢失**. [#269](https://github.com/crawlab-team/crawlab/issues/269) - -# 0.3.5 (2019-10-28) -### 功能 / 优化 -- **优雅关闭**. [详情](https://github.com/crawlab-team/crawlab/commit/63fab3917b5a29fd9770f9f51f1572b9f0420385) -- **节点信息优化**. [详情](https://github.com/crawlab-team/crawlab/commit/973251a0fbe7a2184ac0da09e0404a17c736aee7) -- **将系统环境变量添加到任务**. [详情](https://github.com/crawlab-team/crawlab/commit/4ab4892471965d6342d30385578ca60dc51f8ad3) -- **自动刷新任务日志**. [详情](https://github.com/crawlab-team/crawlab/commit/4ab4892471965d6342d30385578ca60dc51f8ad3) -- **允许 HTTPS 部署**. [详情](https://github.com/crawlab-team/crawlab/commit/5d8f6f0c56768a6e58f5e46cbf5adff8c7819228) - -### Bug 修复 -- **定时任务中无法获取爬虫列表**. [详情](https://github.com/crawlab-team/crawlab/commit/311f72da19094e3fa05ab4af49812f58843d8d93) -- **无法获取工作节点信息**. [详情](https://github.com/crawlab-team/crawlab/commit/6af06efc17685a9e232e8c2b5fd819ec7d2d1674) -- **运行爬虫任务时无法选择节点**. [详情](https://github.com/crawlab-team/crawlab/commit/31f8e03234426e97aed9b0bce6a50562f957edad) -- **结果量很大时无法获取结果数量**. [#260](https://github.com/crawlab-team/crawlab/issues/260) -- **定时任务中的节点问题**. [#244](https://github.com/crawlab-team/crawlab/issues/244) - - -# 0.3.1 (2019-08-25) -### 功能 / 优化 -- **Docker 镜像优化**. 将 Docker 镜像进一步分割成 alpine 镜像版本的 master、worker、frontendSplit docker further into master, worker, frontend. -- **单元测试**. 用单元测试覆盖部分后端代码. -- **前端优化**. 登录页、按钮大小、上传 UI 提示. -- **更灵活的节点注册**. 允许用户传一个变量作为注册 key,而不是默认的 MAC 地址. - -### Bug 修复 -- **上传大爬虫文件错误**. 上传大爬虫文件时的内存崩溃问题. [#150](https://github.com/crawlab-team/crawlab/issues/150) -- **无法同步爬虫**. 通过提高写权限等级来修复同步爬虫文件时的问题. [#114](https://github.com/crawlab-team/crawlab/issues/114) -- **爬虫页问题**. 通过删除 `Site` 字段来修复. [#112](https://github.com/crawlab-team/crawlab/issues/112) -- **节点展示问题**. 当在多个机器上跑 Docker 容器时,节点无法正确展示. [#99](https://github.com/crawlab-team/crawlab/issues/99) - -# 0.3.0 (2019-07-31) -### 功能 / 优化 -- **Golang 后端**: 将后端由 Python 重构为 Golang,很大的提高了稳定性和性能. -- **节点网络图**: 节点拓扑图可视化. -- **节点系统信息**: 可以查看包括操作系统、CPU数量、可执行文件在内的系统信息. -- **节点监控改进**: 节点通过 Redis 来监控和注册. -- **文件管理**: 可以在线编辑爬虫文件,包括代码高亮. -- **登录页/注册页/用户管理**: 要求用户登录后才能使用 Crawlab, 允许用户注册和用户管理,有一些基于角色的鉴权机制. -- **自动部署爬虫**: 爬虫将被自动部署或同步到所有在线节点. -- **更小的 Docker 镜像**: 瘦身版 Docker 镜像,通过多阶段构建将 Docker 镜像大小从 1.3G 减小到 700M 左右. - -### Bug 修复 -- **节点状态**. 节点状态不会随着节点下线而更新. [#87](https://github.com/tikazyq/crawlab/issues/87) -- **爬虫部署错误**. 通过自动爬虫部署来修复 [#83](https://github.com/tikazyq/crawlab/issues/83) -- **节点无法显示**. 节点无法显示在线 [#81](https://github.com/tikazyq/crawlab/issues/81) -- **定时任务无法工作**. 通过 Golang 后端修复 [#64](https://github.com/tikazyq/crawlab/issues/64) -- **Flower 错误**. 通过 Golang 后端修复 [#57](https://github.com/tikazyq/crawlab/issues/57) - -# 0.2.4 (2019-07-07) -### 功能 / 优化 -- **文档**: 更优和更详细的文档. -- **更好的 Crontab**: 通过 UI 界面生成 Cron 表达式. -- **更优的性能**: 从原生 flask 引擎 切换到 `gunicorn`. [#78](https://github.com/tikazyq/crawlab/issues/78) - -### Bug 修复 -- **删除爬虫**. 删除爬虫时不止在数据库中删除,还应该删除相关的文件夹、任务和定时任务. [#69](https://github.com/tikazyq/crawlab/issues/69) -- **MongoDB 授权**. 允许用户注明 `authenticationDatabase` 来连接 `mongodb`. [#68](https://github.com/tikazyq/crawlab/issues/68) -- **Windows 兼容性**. 加入 `eventlet` 到 `requirements.txt`. [#59](https://github.com/tikazyq/crawlab/issues/59) - - -# 0.2.3 (2019-06-12) -### 功能 / 优化 -- **Docker**: 用户能够运行 Docker 镜像来加快部署. -- **CLI**: 允许用户通过命令行来执行 Crawlab 程序. -- **上传爬虫**: 允许用户上传自定义爬虫到 Crawlab. -- **预览时编辑字段**: 允许用户在可配置爬虫中预览数据时编辑字段. - -### Bug 修复 -- **爬虫分页**. 爬虫列表页中修复分页问题. - -# 0.2.2 (2019-05-30) -### 功能 / 优化 -- **自动抓取字段**: 在可配置爬虫列表页种自动抓取字段. -- **下载结果**: 允许下载结果为 CSV 文件. -- **百度统计**: 允许用户选择是否允许向百度统计发送统计数据. - -### Bug 修复 -- **结果页分页**. [#45](https://github.com/tikazyq/crawlab/issues/45) -- **定时任务重复触发**: 将 Flask DEBUG 设置为 False 来保证定时任务无法重复触发. [#32](https://github.com/tikazyq/crawlab/issues/32) -- **前端环境**: 添加 `VUE_APP_BASE_URL` 作为生产环境模式变量,然后 API 不会永远都是 `localhost` [#30](https://github.com/tikazyq/crawlab/issues/30) - -# 0.2.1 (2019-05-27) -- **可配置爬虫**: 允许用户创建爬虫来抓取数据,而不用编写代码. - -# 0.2 (2019-05-10) - -- **高级数据统计**: 爬虫详情页的高级数据统计. -- **网站数据**: 加入网站列表(中国),允许用户查看 robots.txt、首页响应时间等信息. - -# 0.1.1 (2019-04-23) - -- **基础统计**: 用户可以查看基础统计数据,包括爬虫和任务页中的失败任务数、结果数. -- **近实时任务信息**: 周期性(5 秒)向服务器轮训数据来实现近实时查看任务信息. -- **定时任务**: 利用 apscheduler 实现定时任务,允许用户设置类似 Cron 的定时任务. - -# 0.1 (2019-04-17) - -- **首次发布** diff --git a/CHANGELOG.md b/CHANGELOG.md deleted file mode 100644 index 7aef62da..00000000 --- a/CHANGELOG.md +++ /dev/null @@ -1,314 +0,0 @@ -# 0.6.0 (TBC) - -(TBC) - -# 0.5.1 (2020-07-31) - -### Features / Enhancement -- **Added error message details**. -- **Added Golang programming language support**. -- **Added web driver installation scripts for Chrome Driver and Firefox**. -- **Support system tasks**. A "system task" is similar to normal spider task, it allows users to view logs of general tasks such as installing languages. -- **Changed methods of installing languages from RPC to system tasks**. - -### Bug Fixes -- **Fixed first download repo 500 error in Spider Market page**. [#808](https://github.com/crawlab-team/crawlab/issues/808) -- **Fixed some translation issues**. -- **Fixed 500 error in task detail page**. [#810](https://github.com/crawlab-team/crawlab/issues/810) -- **Fixed password reset issue**. [#811](https://github.com/crawlab-team/crawlab/issues/811) -- **Fixed unable to download CSV issue**. [#812](https://github.com/crawlab-team/crawlab/issues/812) -- **Fixed unable to install node.js issue**. [#813](https://github.com/crawlab-team/crawlab/issues/813) -- **Fixed disabled status for batch adding schedules**. [#814](https://github.com/crawlab-team/crawlab/issues/814) - -# 0.5.0 (2020-07-19) -### Features / Enhancement -- **Spider Market**. Allow users to download open-source spiders into Crawlab. -- **Batch actions**. Allow users to interact with Crawlab in batch fashions, e.g. batch run tasks, batch delete spiders, ect. -- **Migrate MongoDB driver to `MongoDriver`**. -- **Refactor and optmize node-related logics**. -- **Change default `task.workers` to 16**. -- **Change default nginx `client_max_body_size` to 200m**. -- **Support writing logs to ElasticSearch**. -- **Display error details in Scrapy page**. -- **Removed Challenge page**. -- **Moved Feedback and Dislaimer pages to navbar**. - -### Bug Fixes -- **Fixed log not expiring issue because of failure to create TTL index**. -- **Set default log expire duration to 1 day**. -- **`task_id` index not created**. -- **`docker-compose.yml` fix**. -- **Fixed 404 page**. -- **Fixed unable to create worker node before master node issue**. - -# 0.4.10 (2020-04-21) -### Features / Enhancement -- **Enhanced Log Management**. Centralizing log storage in MongoDB, reduced the dependency of PubSub, allowing log error detection. -- **API Token**. Allow users to generate API tokens and use them to integrate into their own systems. -- **Web Hook**. Trigger a Web Hook http request to pre-defined URL when a task starts or finishes. -- **Auto Install Dependencies**. Allow installing dependencies automatically from `requirements.txt` or `package.json`. -- **Auto Results Collection**. Set results collection to `results_` if it is not set. -- **Optimized Project List**. Not display "No Project" item in the project list. -- **Upgrade Node.js**. Upgrade Node.js version from v8.12 to v10.19. -- **Add Run Button in Schedule Page**. Allow users to manually run task in Schedule Page. - -### Bug Fixes -- **Cannot register**. [#670](https://github.com/crawlab-team/crawlab/issues/670) -- **Spider schedule tab cron expression shows second**. [#678](https://github.com/crawlab-team/crawlab/issues/678) -- **Missing daily stats in spider**. [#684](https://github.com/crawlab-team/crawlab/issues/684) -- **Results count not update in time**. [#689](https://github.com/crawlab-team/crawlab/issues/689) - -# 0.4.9 (2020-03-31) -### Features / Enhancement -- **Challenges**. Users can achieve different challenges based on their actions. -- **More Advanced Access Control**. More granular access control, e.g. normal users can only view/manage their own spiders/projects and admin users can view/manage all spiders/projects. -- **Feedback**. Allow users to send feedbacks and ratings to Crawlab team. -- **Better Home Page Metrics**. Optimized metrics display on home page. -- **Configurable Spiders Converted to Customized Spiders**. Allow users to convert their configurable spiders into customized spiders which are also Scrapy spiders. -- **View Tasks Triggered by Schedule**. Allow users to view tasks triggered by a schedule. [#648](https://github.com/crawlab-team/crawlab/issues/648) -- **Support Results De-Duplication**. Allow users to configure de-duplication of results. [#579](https://github.com/crawlab-team/crawlab/issues/579) -- **Support Task Restart**. Allow users to re-run historical tasks. - -### Bug Fixes -- **CLI unable to use on Windows**. [#580](https://github.com/crawlab-team/crawlab/issues/580) -- **Re-upload error**. [#643](https://github.com/crawlab-team/crawlab/issues/643) [#640](https://github.com/crawlab-team/crawlab/issues/640) -- **Upload missing folders**. [#646](https://github.com/crawlab-team/crawlab/issues/646) -- **Unable to add schedules in Spider Page**. - -# 0.4.8 (2020-03-11) -### Features / Enhancement -- **Support Installations of More Programming Languages**. Now users can install or pre-install more programming languages including Java, .Net Core and PHP. -- **Installation UI Optimization**. Users can better view and manage installations on Node List page. -- **More Git Support**. Allow users to view Git Commits record, and allow checkout to corresponding commit. -- **Support Hostname Node Registration Type**. Users can set hostname as the node key as the unique identifier. -- **RPC Support**. Added RPC support to better manage node communication. -- **Run On Master Switch**. Users can determine whether to run tasks on master. If not, all tasks will be run only on worker nodes. -- **Disabled Tutorial by Default**. -- **Added Related Documentation Sidebar**. -- **Loading Page Optimization**. - -### Bug Fixes -- **Duplicated Nodes**. [#391](https://github.com/crawlab-team/crawlab/issues/391) -- **Duplicated Spider Upload**. [#603](https://github.com/crawlab-team/crawlab/issues/603) -- **Failure in dependencies installation results in unusable dependency installation functionalities.**. [#609](https://github.com/crawlab-team/crawlab/issues/609) -- **Create Tasks for Offline Nodes**. [#622](https://github.com/crawlab-team/crawlab/issues/622) - -# 0.4.7 (2020-02-24) -### Features / Enhancement -- **Better Support for Scrapy**. Spiders identification, `settings.py` configuration, log level selection, spider selection. [#435](https://github.com/crawlab-team/crawlab/issues/435) -- **Git Sync**. Allow users to sync git projects to Crawlab. -- **Long Task Support**. Users can add long-task spiders which is supposed to run without finishing. [#425](https://github.com/crawlab-team/crawlab/issues/425) -- **Spider List Optimization**. Tasks count by status, tasks detail popup, legend. [#425](https://github.com/crawlab-team/crawlab/issues/425) -- **Upgrade Check**. Check latest version and notifiy users to upgrade. -- **Spiders Batch Operation**. Allow users to run/stop spider tasks and delete spiders in batches. -- **Copy Spiders**. Allow users to copy an existing spider to create a new one. -- **Wechat Group QR Code**. - -### Bug Fixes -- **Schedule Spider Selection Issue**. Fields not responding to spider change. -- **Cron Jobs Conflict**. Possible bug when two spiders set to the same time of their cron jobs. [#515](https://github.com/crawlab-team/crawlab/issues/515) [#565](https://github.com/crawlab-team/crawlab/issues/565) -- **Task Log Issue**. Different tasks write to the same log file if triggered at the same time. [#577](https://github.com/crawlab-team/crawlab/issues/577) -- **Task List Filter Options Incomplete**. - -# 0.4.6 (2020-02-13) -### Features / Enhancement -- **SDK for Node.js**. Users can apply SDK in their Node.js spiders. -- **Log Management Optimization**. Log search, error highlight, auto-scrolling. -- **Task Execution Process Optimization**. Allow users to be redirected to task detail page after triggering a task. -- **Task Display Optimization**. Added "Param" in the Latest Tasks table in the spider detail page. [#295](https://github.com/crawlab-team/crawlab/issues/295) -- **Spider List Optimization**. Added "Update Time" and "Create Time" in spider list page. -- **Page Loading Placeholder**. - -### Bug Fixes -- **Lost Focus in Schedule Configuration**. [#519](https://github.com/crawlab-team/crawlab/issues/519) -- **Unable to Upload Spider using CLI**. [#524](https://github.com/crawlab-team/crawlab/issues/524) - -# 0.4.5 (2020-02-03) -### Features / Enhancement -- **Interactive Tutorial**. Guide users through the main functionalities of Crawlab. -- **Global Environment Variables**. Allow users to set global environment variables, which will be passed into all spider programs. [#177](https://github.com/crawlab-team/crawlab/issues/177) -- **Project**. Allow users to link spiders to projects. [#316](https://github.com/crawlab-team/crawlab/issues/316) -- **Demo Spiders**. Added demo spiders when Crawlab is initialized. [#379](https://github.com/crawlab-team/crawlab/issues/379) -- **User Admin Optimization**. Restrict privilleges of admin users. [#456](https://github.com/crawlab-team/crawlab/issues/456) -- **Setting Page Optimization**. -- **Task Results Optimization**. - -### Bug Fixes -- **Unable to find spider file error**. [#485](https://github.com/crawlab-team/crawlab/issues/485) -- **Click delete button results in redirect**. [#480](https://github.com/crawlab-team/crawlab/issues/480) -- **Unable to create files in an empty spider**. [#479](https://github.com/crawlab-team/crawlab/issues/479) -- **Download results error**. [#465](https://github.com/crawlab-team/crawlab/issues/465) -- **crawlab-sdk CLI error**. [#458](https://github.com/crawlab-team/crawlab/issues/458) -- **Page refresh issue**. [#441](https://github.com/crawlab-team/crawlab/issues/441) -- **Results not support JSON**. [#202](https://github.com/crawlab-team/crawlab/issues/202) -- **Getting all spider after deleting a spider**. -- **i18n warning**. - -# 0.4.4 (2020-01-17) -### Features / Enhancement -- **Email Notification**. Allow users to send email notifications. -- **DingTalk Robot Notification**. Allow users to send DingTalk Robot notifications. -- **Wechat Robot Notification**. Allow users to send Wechat Robot notifications. -- **API Address Optimization**. Added relative URL path in frontend so that users don't have to specify `CRAWLAB_API_ADDRESS` explicitly. -- **SDK Compatiblity**. Allow users to integrate Scrapy or general spiders with Crawlab SDK. -- **Enhanced File Management**. Added tree-like file sidebar to allow users to edit files much more easier. -- **Advanced Schedule Cron**. Allow users to edit schedule cron with visualized cron editor. - -### Bug Fixes -- **`nil retuened` error**. -- **Error when using HTTPS**. -- **Unable to run Configurable Spiders on Spider List**. -- **Missing form validation before uploading spider files**. - -# 0.4.3 (2020-01-07) - -### Features / Enhancement -- **Dependency Installation**. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) on the platform web interface. -- **Pre-install Programming Languages in Docker**. Allow Docker users to set `CRAWLAB_SERVER_LANG_NODE` as `Y` to pre-install `Node.js` environments. -- **Add Schedule List in Spider Detail Page**. Allow users to view / add / edit schedule cron jobs in the spider detail page. [#360](https://github.com/crawlab-team/crawlab/issues/360) -- **Align Cron Expression with Linux**. Change the expression of 6 elements to 5 elements as aligned in Linux. -- **Enable/Disable Schedule Cron**. Allow users to enable/disable the schedule jobs. [#297](https://github.com/crawlab-team/crawlab/issues/297) -- **Better Task Management**. Allow users to batch delete tasks. [#341](https://github.com/crawlab-team/crawlab/issues/341) -- **Better Spider Management**. Allow users to sort and filter spiders in the spider list page. -- **Added Chinese `CHANGELOG`**. -- **Added Github Star Button at Nav Bar**. - -### Bug Fixes -- **Schedule Cron Task Issue**. [#423](https://github.com/crawlab-team/crawlab/issues/423) -- **Upload Spider Zip File Issue**. [#403](https://github.com/crawlab-team/crawlab/issues/403) [#407](https://github.com/crawlab-team/crawlab/issues/407) -- **Exit due to Network Failure**. [#340](https://github.com/crawlab-team/crawlab/issues/340) -- **Cron Jobs not Running Correctly** -- **Schedule List Columns Mis-positioned** -- **Clicking Refresh Button Redirected to 404 Page** - -# 0.4.2 (2019-12-26) -### Features / Enhancement -- **Disclaimer**. Added page for Disclaimer. -- **Call API to fetch version**. [#371](https://github.com/crawlab-team/crawlab/issues/371) -- **Configure to allow user registration**. [#346](https://github.com/crawlab-team/crawlab/issues/346) -- **Allow adding new users**. -- **More Advanced File Management**. Allow users to add / edit / rename / delete files. [#286](https://github.com/crawlab-team/crawlab/issues/286) -- **Optimized Spider Creation Process**. Allow users to create an empty customized spider before uploading the zip file. -- **Better Task Management**. Allow users to filter tasks by selecting through certian criterions. [#341](https://github.com/crawlab-team/crawlab/issues/341) - -### Bug Fixes -- **Duplicated nodes**. [#391](https://github.com/crawlab-team/crawlab/issues/391) -- **"mongodb no reachable" error**. [#373](https://github.com/crawlab-team/crawlab/issues/373) - -# 0.4.1 (2019-12-13) -### Features / Enhancement -- **Spiderfile Optimization**. Stages changed from dictionary to array. [#358](https://github.com/crawlab-team/crawlab/issues/358) -- **Baidu Tongji Update**. - -### Bug Fixes -- **Unable to display schedule tasks**. [#353](https://github.com/crawlab-team/crawlab/issues/353) -- **Duplicate node registration**. [#334](https://github.com/crawlab-team/crawlab/issues/334) - -# 0.4.0 (2019-12-06) -### Features / Enhancement -- **Configurable Spider**. Allow users to add spiders using *Spiderfile* to configure crawling rules. -- **Execution Mode**. Allow users to select 3 modes for task execution: *All Nodes*, *Selected Nodes* and *Random*. - -### Bug Fixes -- **Task accidentally killed**. [#306](https://github.com/crawlab-team/crawlab/issues/306) -- **Documentation fix**. [#301](https://github.com/crawlab-team/crawlab/issues/258) [#301](https://github.com/crawlab-team/crawlab/issues/258) -- **Direct deploy incompatible with Windows**. [#288](https://github.com/crawlab-team/crawlab/issues/288) -- **Log files lost**. [#269](https://github.com/crawlab-team/crawlab/issues/269) - -# 0.3.5 (2019-10-28) -### Features / Enhancement -- **Graceful Showdown**. [detail](https://github.com/crawlab-team/crawlab/commit/63fab3917b5a29fd9770f9f51f1572b9f0420385) -- **Node Info Optimization**. [detail](https://github.com/crawlab-team/crawlab/commit/973251a0fbe7a2184ac0da09e0404a17c736aee7) -- **Append System Environment Variables to Tasks**. [detail](https://github.com/crawlab-team/crawlab/commit/4ab4892471965d6342d30385578ca60dc51f8ad3) -- **Auto Refresh Task Log**. [detail](https://github.com/crawlab-team/crawlab/commit/4ab4892471965d6342d30385578ca60dc51f8ad3) -- **Enable HTTPS Deployment**. [detail](https://github.com/crawlab-team/crawlab/commit/5d8f6f0c56768a6e58f5e46cbf5adff8c7819228) - -### Bug Fixes -- **Unable to fetch spider list info in schedule jobs**. [detail](https://github.com/crawlab-team/crawlab/commit/311f72da19094e3fa05ab4af49812f58843d8d93) -- **Unable to fetch node info from worker nodes**. [detail](https://github.com/crawlab-team/crawlab/commit/6af06efc17685a9e232e8c2b5fd819ec7d2d1674) -- **Unable to select node when trying to run spider tasks**. [detail](https://github.com/crawlab-team/crawlab/commit/31f8e03234426e97aed9b0bce6a50562f957edad) -- **Unable to fetch result count when result volume is large**. [#260](https://github.com/crawlab-team/crawlab/issues/260) -- **Node issue in schedule tasks**. [#244](https://github.com/crawlab-team/crawlab/issues/244) - - -# 0.3.1 (2019-08-25) -### Features / Enhancement -- **Docker Image Optimization**. Split docker further into master, worker, frontend with alpine image. -- **Unit Tests**. Covered part of the backend code with unit tests. -- **Frontend Optimization**. Login page, button size, hints of upload UI optimization. -- **More Flexible Node Registration**. Allow users to pass a variable as key for node registration instead of MAC by default. - -### Bug Fixes -- **Uploading Large Spider Files Error**. Memory crash issue when uploading large spider files. [#150](https://github.com/crawlab-team/crawlab/issues/150) -- **Unable to Sync Spiders**. Fixes through increasing level of write permission when synchronizing spider files. [#114](https://github.com/crawlab-team/crawlab/issues/114) -- **Spider Page Issue**. Fixes through removing the field "Site". [#112](https://github.com/crawlab-team/crawlab/issues/112) -- **Node Display Issue**. Nodes do not display correctly when running docker containers on multiple machines. [#99](https://github.com/crawlab-team/crawlab/issues/99) - -# 0.3.0 (2019-07-31) -### Features / Enhancement -- **Golang Backend**: Refactored code from Python backend to Golang, much more stability and performance. -- **Node Network Graph**: Visualization of node typology. -- **Node System Info**: Available to see system info including OS, CPUs and executables. -- **Node Monitoring Enhancement**: Nodes are monitored and registered through Redis. -- **File Management**: Available to edit spider files online, including code highlight. -- **Login/Regiser/User Management**: Require users to login to use Crawlab, allow user registration and user management, some role-based authorization. -- **Automatic Spider Deployment**: Spiders are deployed/synchronized to all online nodes automatically. -- **Smaller Docker Image**: Slimmed Docker image and reduced Docker image size from 1.3G to \~700M by applying Multi-Stage Build. - -### Bug Fixes -- **Node Status**. Node status does not change even though it goes offline actually. [#87](https://github.com/tikazyq/crawlab/issues/87) -- **Spider Deployment Error**. Fixed through Automatic Spider Deployment [#83](https://github.com/tikazyq/crawlab/issues/83) -- **Node not showing**. Node not able to show online [#81](https://github.com/tikazyq/crawlab/issues/81) -- **Cron Job not working**. Fixed through new Golang backend [#64](https://github.com/tikazyq/crawlab/issues/64) -- **Flower Error**. Fixed through new Golang backend [#57](https://github.com/tikazyq/crawlab/issues/57) - -# 0.2.4 (2019-07-07) -### Features / Enhancement -- **Documentation**: Better and much more detailed documentation. -- **Better Crontab**: Make crontab expression through crontab UI. -- **Better Performance**: Switched from native flask engine to `gunicorn`. [#78](https://github.com/tikazyq/crawlab/issues/78) - -### Bugs Fixes -- **Deleting Spider**. Deleting a spider does not only remove record in db but also removing related folder, tasks and schedules. [#69](https://github.com/tikazyq/crawlab/issues/69) -- **MongoDB Auth**. Allow user to specify `authenticationDatabase` to connect to `mongodb`. [#68](https://github.com/tikazyq/crawlab/issues/68) -- **Windows Compatibility**. Added `eventlet` to `requirements.txt`. [#59](https://github.com/tikazyq/crawlab/issues/59) - - -# 0.2.3 (2019-06-12) -### Features / Enhancement -- **Docker**: User can run docker image to speed up deployment. -- **CLI**: Allow user to use command-line interface to execute Crawlab programs. -- **Upload Spider**: Allow user to upload Customized Spider to Crawlab. -- **Edit Fields on Preview**: Allow user to edit fields when previewing data in Configurable Spider. - -### Bugs Fixes -- **Spiders Pagination**. Fixed pagination problem in spider page. - -# 0.2.2 (2019-05-30) -### Features / Enhancement -- **Automatic Extract Fields**: Automatically extracting data fields in list pages for configurable spider. -- **Download Results**: Allow downloading results as csv file. -- **Baidu Tongji**: Allow users to choose to report usage info to Baidu Tongji. - -### Bug Fixes -- **Results Page Pagination**: Fixes so the pagination of results page is working correctly. [#45](https://github.com/tikazyq/crawlab/issues/45) -- **Schedule Tasks Duplicated Triggers**: Set Flask DEBUG as False so that schedule tasks won't trigger twice. [#32](https://github.com/tikazyq/crawlab/issues/32) -- **Frontend Environment**: Added `VUE_APP_BASE_URL` as production mode environment variable so the API call won't be always `localhost` in deployed env [#30](https://github.com/tikazyq/crawlab/issues/30) - -# 0.2.1 (2019-05-27) -- **Configurable Spider**: Allow users to create a spider to crawl data without coding. - -# 0.2 (2019-05-10) - -- **Advanced Stats**: Advanced analytics in spider detail view. -- **Sites Data**: Added sites list (China) for users to check info such as robots.txt and home page response time/code. - -# 0.1.1 (2019-04-23) - -- **Basic Stats**: User can view basic stats such as number of failed tasks and number of results in spiders and tasks pages. -- **Near Realtime Task Info**: Periodically (5 sec) polling data from server to allow view task info in a near-realtime fashion. -- **Scheduled Tasks**: Allow users to set up cron-like scheduled/periodical tasks using apscheduler. - -# 0.1 (2019-04-17) - -- **Initial Release** diff --git a/README.md b/README.md index bbd11ea8..7e5f48fa 100644 --- a/README.md +++ b/README.md @@ -25,17 +25,17 @@ [Installation](#installation) | [Run](#run) | [Screenshot](#screenshot) | [Architecture](#architecture) | [Integration](#integration-with-other-frameworks) | [Compare](#comparison-with-other-frameworks) | [Community & Sponsorship](#community--sponsorship) | [CHANGELOG](https://github.com/crawlab-team/crawlab/blob/main/CHANGELOG.md) | [Disclaimer](https://github.com/crawlab-team/crawlab/blob/main/DISCLAIMER.md) -Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, PHP and various web crawler frameworks including Scrapy, Puppeteer, Selenium. +Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, and various web crawler frameworks including Scrapy, Puppeteer, Selenium. -[Demo](https://demo.crawlab.cn) | [Documentation](https://docs.crawlab.cn/en/) +[Demo](https://demo.crawlab.cn) | [Documentation](https://docs.crawlab.cn) ## Installation -You can follow the [installation guide](https://docs.crawlab.cn/en/guide/installation/). +You can follow the [installation guide](https://docs.crawlab.cn/getting-started/installation). ## Quick Start -Please open the command line prompt and execute the command below. Make sure you have installed `docker-compose` in advance. +Please open the command line prompt and execute the command below. Make sure you have installed [Docker](https://www.docker.com) in advance. ```bash git clone https://github.com/crawlab-team/examples @@ -43,17 +43,15 @@ cd examples/docker/basic docker-compose up -d ``` -Next, you can look into the `docker-compose.yml` (with detailed config params) and the [Documentation](http://docs.crawlab.cn/en/) for further information. +Next, you can look into the `docker-compose.yml` (with detailed config params) and the [Documentation](http://docs.crawlab.cn) for further information. ## Run ### Docker -Please use `docker-compose` to one-click to start up. By doing so, you don't even have to configure MongoDB database. Create a file named `docker-compose.yml` and input the code below. - +Please use `docker compose` to one-click to start up. By doing so, you don't even have to configure MongoDB database. Create a file named `docker-compose.yml` and input the code below. ```yaml -version: '3.3' services: master: image: crawlabteam/crawlab:latest @@ -73,8 +71,7 @@ services: container_name: crawlab_example_worker01 environment: CRAWLAB_NODE_MASTER: "N" - CRAWLAB_GRPC_ADDRESS: "master" - CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer" + CRAWLAB_MASTER_HOST: "master" volumes: - "./.crawlab/worker01:/root/.crawlab" depends_on: diff --git a/SECURITY.md b/SECURITY.md index ff373513..2e0dd759 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -4,6 +4,7 @@ | Version | Supported | | ------- | ------------------ | +| 0.7.x | :white_check_mark: | | 0.6.x | :white_check_mark: | | 0.5.x | :white_check_mark: | | < 0.5 | :x: | diff --git a/changelog/v0.2-zh.md b/changelog/v0.2-zh.md new file mode 100644 index 00000000..81c5241b --- /dev/null +++ b/changelog/v0.2-zh.md @@ -0,0 +1,49 @@ +# 0.2.4 (2019-07-07) +### 功能 / 优化 +- **文档优化** - 更详细更有组织的文档 +- **Cron 表达式生成器** - 可视化 Cron 表达式生成界面 +- **性能优化** - 从 Flask 迁移至 Gunicorn 服务器 + +### Bug 修复 +- **删除爬虫** - 现在会正确删除相关文件和任务 [#69](https://github.com/tikazyq/crawlab/issues/69) +- **MongoDB 认证** - 支持 authenticationDatabase 参数 [#68](https://github.com/tikazyq/crawlab/issues/68) +- **Windows 兼容性** - 在 requirements.txt 中加入 eventlet [#59](https://github.com/tikazyq/crawlab/issues/59) + +# 0.2.3 (2019-06-12) +### 功能 / 优化 +- **Docker 支持** - 官方 Docker 镜像发布 +- **命令行工具** - Crawlab 命令行接口 +- **爬虫上传** - 直接上传 ZIP 文件功能 +- **字段编辑** - 数据预览时直接编辑字段 + +### Bug 修复 +- **爬虫分页** - 修复爬虫列表分页问题 + +# 0.2.2 (2019-05-30) +### 功能 / 优化 +- **自动字段检测** - 在可配置爬虫中自动提取字段 +- **CSV 导出** - 支持 CSV 格式结果下载 +- **统计开关** - 百度统计禁用选项 + +### Bug 修复 +- **结果分页** [#45](https://github.com/tikazyq/crawlab/issues/45) +- **定时任务重复** - 关闭 Flask 调试模式防止重复 [#32](https://github.com/tikazyq/crawlab/issues/32) +- **API 地址配置** - 添加 VUE_APP_BASE_URL 环境变量 [#30](https://github.com/tikazyq/crawlab/issues/30) + +# 0.2.1 (2019-05-27) +### 主要功能 +- **可配置爬虫** - 通过界面无需代码创建爬虫 + +# 0.2 (2019-05-10) +### 核心功能 +- **高级统计** - 爬虫详情页详细统计 +- **网站指标** - 中国网站列表及 robots.txt 和响应时间监控 + +# 0.1.1 (2019-04-23) +### 基础功能 +- **基础统计** - 失败计数和结果总数 +- **近实时更新** - 5 秒轮询任务更新 +- **定时任务** - 集成 APScheduler 实现定时任务 + +# 0.1 (2019-04-17) +- **首次发布** - 基础爬虫和任务管理 \ No newline at end of file diff --git a/changelog/v0.2.md b/changelog/v0.2.md new file mode 100644 index 00000000..015bf0aa --- /dev/null +++ b/changelog/v0.2.md @@ -0,0 +1,49 @@ +# 0.2.4 (2019-07-07) +### Features / Enhancement +- **Improved documentation** - More detailed and organized documentation +- **Cron expression generator** - Visual UI for generating cron expressions +- **Performance optimization** - Migrated from Flask to Gunicorn web server + +### Bug Fixes +- **Spider deletion** - Now properly deletes associated files and tasks [#69](https://github.com/tikazyq/crawlab/issues/69) +- **MongoDB authentication** - Added support for authenticationDatabase parameter [#68](https://github.com/tikazyq/crawlab/issues/68) +- **Windows compatibility** - Added eventlet to requirements.txt [#59](https://github.com/tikazyq/crawlab/issues/59) + +# 0.2.3 (2019-06-12) +### Features / Enhancement +- **Docker support** - Official Docker image release +- **CLI tool** - Command line interface for Crawlab operations +- **Spider upload** - Direct ZIP file upload capability +- **Field editing** - In-place field editing during data preview + +### Bug Fixes +- **Spider pagination** - Fixed pagination in spider list view + +# 0.2.2 (2019-05-30) +### Features / Enhancement +- **Auto field detection** - Automatic field extraction in configurable spiders +- **CSV export** - Results download in CSV format +- **Analytics opt-out** - Baidu analytics toggle + +### Bug Fixes +- **Result pagination** [#45](https://github.com/tikazyq/crawlab/issues/45) +- **Duplicate cron jobs** - Disabled Flask debug mode to prevent duplicates [#32](https://github.com/tikazyq/crawlab/issues/32) +- **API endpoint config** - Added VUE_APP_BASE_URL env variable [#30](https://github.com/tikazyq/crawlab/issues/30) + +# 0.2.1 (2019-05-27) +### Major Features +- **Configurable spiders** - No-code spider creation through UI + +# 0.2 (2019-05-10) +### Core Features +- **Advanced analytics** - Detailed statistics in spider detail view +- **Website metrics** - Chinese website list with robots.txt and response time monitoring + +# 0.1.1 (2019-04-23) +### Foundation +- **Basic statistics** - Failure counts and result totals +- **Near real-time updates** - 5-second polling for task updates +- **Cron scheduling** - APScheduler integration for timed tasks + +# 0.1 (2019-04-17) +- **Initial release** - Basic spider and task management \ No newline at end of file diff --git a/changelog/v0.3.md b/changelog/v0.3.md new file mode 100644 index 00000000..fa5d4cab --- /dev/null +++ b/changelog/v0.3.md @@ -0,0 +1,22 @@ +# 0.3.5 (2019-10-28) +### Features / Enhancement +- Graceful shutdown implementation +- Enhanced node monitoring +- System environment variables in tasks +- Automatic log refresh +- HTTPS support + +### Bug Fixes +- Fixed schedule task spider list +- Worker node info retrieval +- Node selection in task execution + +# 0.3.0 (2019-07-31) +### Major Changes +- **Golang backend** rewrite +- Node topology visualization +- System metrics monitoring +- In-browser code editor +- User authentication system +- Automatic spider deployment +- Optimized Docker images \ No newline at end of file diff --git a/changelog/v0.4.md b/changelog/v0.4.md new file mode 100644 index 00000000..d60067fb --- /dev/null +++ b/changelog/v0.4.md @@ -0,0 +1,154 @@ +# 0.4.10 (2020-04-21) +### Features / Enhancement +- Centralized MongoDB logging +- API token authentication +- Webhook notifications +- Auto-install dependencies from requirements.txt/package.json +- Automatic results collection naming +- Improved project list display +- Node.js v10.19 upgrade +- Manual run button in schedules + +### Bug Fixes +- Fixed registration issue [#670](https://github.com/crawlab-team/crawlab/issues/670) +- Corrected cron expression display [#678](https://github.com/crawlab-team/crawlab/issues/678) +- Fixed daily stats missing [#684](https://github.com/crawlab-team/crawlab/issues/684) +- Improved results count updates [#689](https://github.com/crawlab-team/crawlab/issues/689) + +# 0.4.9 (2020-03-31) +### Features / Enhancement +- Achievement challenges +- Granular access controls +- User feedback system +- Improved dashboard metrics +- Configurable to custom spider conversion +- Schedule-triggered task tracking +- Result deduplication +- Task restart capability + +### Bug Fixes +- Windows CLI compatibility [#580](https://github.com/crawlab-team/crawlab/issues/580) +- Re-upload errors [#643](https://github.com/crawlab-team/crawlab/issues/643) +- File directory sync issues [#646](https://github.com/crawlab-team/crawlab/issues/646) + +# 0.4.8 (2020-03-11) +### Features / Enhancement +- **Multi-language support** - Added Java, .NET Core, PHP installations +- **Installation UI overhaul** - Better visualization of node setups +- **Git integration** - View commit history and checkout capabilities +- **Hostname registration** - Use hostname as node identifier +- **RPC framework** - Improved node communication system +- **Master node toggle** - Control where tasks execute +- **Documentation sidebar** - Quick access to relevant docs + +### Bug Fixes +- Duplicate node registration [#391](https://github.com/crawlab-team/crawlab/issues/391) +- Spider upload conflicts [#603](https://github.com/crawlab-team/crawlab/issues/603) +- Dependency installation failures [#609](https://github.com/crawlab-team/crawlab/issues/609) +- Offline node task creation [#622](https://github.com/crawlab-team/crawlab/issues/622) + +# 0.4.7 (2020-02-24) +### Features / Enhancement +- **Enhanced Scrapy support** - Auto-detection, settings config, log levels +- **Git synchronization** - Sync Git repositories to Crawlab +- **Long-running tasks** - Support for persistent spider processes +- **Spider list optimization** - Status-based task counters and charts +- **Version checking** - New version detection and notifications +- **Bulk spider operations** - Run/stop/delete multiple spiders +- **Spider cloning** - Duplicate existing spider configurations + +### Bug Fixes +- Schedule task spider selection +- Conflicting cron expressions [#515](https://github.com/crawlab-team/crawlab/issues/515) +- Log file collisions [#577](https://github.com/crawlab-team/crawlab/issues/577) + +# 0.4.6 (2020-02-13) +### Features / Enhancement +- **Node.js SDK** - Official SDK for Node.js spiders +- **Log management** - Search, error highlighting, auto-scroll +- **Task execution flow** - Direct navigation to new tasks +- **Parameter tracking** - Added args column to task lists +- **Timestamps in lists** - Created/updated time in spider views +- **Loading states** - Skeleton screens during loading + +### Bug Fixes +- Schedule config focus loss [#519](https://github.com/crawlab-team/crawlab/issues/519) +- CLI upload failures [#524](https://github.com/crawlab-team/crawlab/issues/524) + +# 0.4.5 (2020-02-03) +### Features / Enhancement +- **Interactive tutorials** - Guided walkthrough of key features +- **Global environment variables** - Apply to all spider executions [#177](https://github.com/crawlab-team/crawlab/issues/177) +- **Project management** - Organize spiders into projects [#316](https://github.com/crawlab-team/crawlab/issues/316) +- **Sample spiders** - Auto-added during initialization [#379](https://github.com/crawlab-team/crawlab/issues/379) +- **User permissions** - Restricted admin privileges [#456](https://github.com/crawlab-team/crawlab/issues/456) +- **Settings UI refresh** - Improved configuration interface +- **Result view optimization** - Better data presentation + +### Bug Fixes +- Missing spider files [#485](https://github.com/crawlab-team/crawlab/issues/485) +- Delete button navigation [#480](https://github.com/crawlab-team/crawlab/issues/480) +- Empty spider file creation [#479](https://github.com/crawlab-team/crawlab/issues/479) +- CSV export errors [#465](https://github.com/crawlab-team/crawlab/issues/465) + +# 0.4.4 (2020-01-17) +### Features / Enhancement +- **Email notifications** - Task status alerts via SMTP +- **IM integrations** - DingTalk/WeCom bot support +- **API endpoint optimization** - Relative path support +- **SDK compatibility** - Improved Scrapy integration +- **File tree navigation** - Sidebar directory structure +- **Visual cron editor** - Graphical schedule builder + +### Bug Fixes +- Nil response errors +- HTTPS configuration issues +- Configurable spider execution +- File upload validation + +# 0.4.3 (2020-01-07) +### Features / Enhancement +- **Dependency management** - Web-based package installation +- **Docker language packs** - Pre-install Node.js via env var +- **Cron standardization** - 5-field Linux-compatible format +- **Schedule toggles** - Enable/disable cron jobs [#297](https://github.com/crawlab-team/crawlab/issues/297) +- **Bulk task deletion** - Remove multiple tasks [#341](https://github.com/crawlab-team/crawlab/issues/341) +- **Spider filtering** - Sort/search in spider lists +- **Chinese changelog** - Localized release notes + +### Bug Fixes +- Schedule task execution [#423](https://github.com/crawlab-team/crawlab/issues/423) +- ZIP upload issues [#403](https://github.com/crawlab-team/crawlab/issues/403) +- Network failure handling [#340](https://github.com/crawlab-team/crawlab/issues/340) + +# 0.4.2 (2019-12-26) +### Features / Enhancement +- **Legal disclaimer** - Usage terms documentation +- **Version API** - Programmatic version checking [#371](https://github.com/crawlab-team/crawlab/issues/371) +- **Registration control** - Configurable signups [#346](https://github.com/crawlab-team/crawlab/issues/346) +- **File operations** - Create/edit/rename/delete files [#286](https://github.com/crawlab-team/crawlab/issues/286) +- **Spider creation flow** - Empty spider initialization +- **Task filtering** - Advanced search capabilities [#341](https://github.com/crawlab-team/crawlab/issues/341) + +### Bug Fixes +- Duplicate nodes [#391](https://github.com/crawlab-team/crawlab/issues/391) +- MongoDB connection errors [#373](https://github.com/crawlab-team/crawlab/issues/373) + +# 0.4.1 (2019-12-13) +### Features / Enhancement +- **Spiderfile format** - Changed stages to dictionary [#358](https://github.com/crawlab-team/crawlab/issues/358) +- **Analytics update** - Improved Baidu tracking + +### Bug Fixes +- Schedule display issues [#353](https://github.com/crawlab-team/crawlab/issues/353) +- Node registration duplicates [#334](https://github.com/crawlab-team/crawlab/issues/334) + +# 0.4.0 (2019-12-06) +### Major Features +- **Configurable spiders** - YAML-based spider creation +- **Execution modes** - All nodes/Specific node/Random + +### Bug Fixes +- Task termination issues [#306](https://github.com/crawlab-team/crawlab/issues/306) +- Windows deployment problems [#288](https://github.com/crawlab-team/crawlab/issues/288) +- Log file management [#269](https://github.com/crawlab-team/crawlab/issues/269) \ No newline at end of file diff --git a/changelog/v0.5.md b/changelog/v0.5.md new file mode 100644 index 00000000..ece02daf --- /dev/null +++ b/changelog/v0.5.md @@ -0,0 +1,37 @@ +# 0.5.1 (2020-07-31) +### Features / Enhancement +- **Added error message details** +- **Added Golang programming language support** +- **Added web driver installation scripts for Chrome Driver and Firefox** +- **Support system tasks** - Similar to spider tasks, allows viewing logs for system operations like language installations +- **Changed methods of installing languages from RPC to system tasks** + +### Bug Fixes +- Fixed first download repo 500 error in Spider Market page [#808](https://github.com/crawlab-team/crawlab/issues/808) +- Fixed translation issues +- Fixed 500 error in task detail page [#810](https://github.com/crawlab-team/crawlab/issues/810) +- Fixed password reset issue [#811](https://github.com/crawlab-team/crawlab/issues/811) +- Fixed CSV download issue [#812](https://github.com/crawlab-team/crawlab/issues/812) +- Fixed Node.js installation issue [#813](https://github.com/crawlab-team/crawlab/issues/813) +- Fixed disabled status for batch schedule additions [#814](https://github.com/crawlab-team/crawlab/issues/814) + +# 0.5.0 (2020-07-19) +### Features / Enhancement +- **Spider Market** - Download open-source spiders +- **Batch operations** - Run/delete tasks and spiders in bulk +- **MongoDB driver migration** to `MongoDriver` +- Refactored node management logic +- Increased default `task.workers` to 16 +- Set nginx `client_max_body_size` to 200m +- **ElasticSearch logging** support +- Display detailed error messages in Scrapy UI +- Removed Challenge page +- Moved Feedback/Disclaimer to navbar + +### Bug Fixes +- Fixed log expiration TTL index issue +- Set default log retention to 1 day +- Added missing `task_id` index +- Fixed docker-compose configuration +- Fixed 404 page handling +- Fixed worker node creation sequence issue \ No newline at end of file diff --git a/changelog/v0.6.0-zh.md b/changelog/v0.6-zh.md similarity index 100% rename from changelog/v0.6.0-zh.md rename to changelog/v0.6-zh.md diff --git a/changelog/v0.6.0.md b/changelog/v0.6.md similarity index 100% rename from changelog/v0.6.0.md rename to changelog/v0.6.md diff --git a/docker-compose.yml b/docker-compose.yml index bc018590..4d109bd2 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -1,4 +1,3 @@ -version: '3.3' services: master: image: crawlabteam/crawlab @@ -11,4 +10,4 @@ services: depends_on: - mongo mongo: - image: mongo:4.2 + image: mongo:5