Releases: ModelEngine-Group/DataMate
v1.0.2
What's Changed
- feat: expose port 18000 in docker-compose for backend service by @hhhhsc701 in #474
- feat: add parallel file copy for ratio task execution by @JasonW404 in #475
- fix: add missing Popover import in DetailHeader by @JasonW404 in #476
Full Changelog: v1.0.1...v1.0.2
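As a rough illustration of what a parallel file copy like the one in #475 can look like, here is a minimal sketch built on a thread pool. The function name `copy_files_parallel` and its `max_workers` parameter are hypothetical and are not taken from the DataMate codebase; the actual implementation may differ.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_files_parallel(sources, dest_dir, max_workers=4):
    """Copy each source file into dest_dir using a thread pool.

    Hypothetical helper for illustration only. Returns the destination
    paths in the same order as `sources`.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    def copy_one(src):
        target = dest / Path(src).name
        shutil.copy2(src, target)  # copies file content and metadata
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first copy error
        return list(pool.map(copy_one, sources))
```

Threads suit this workload because file copies are I/O-bound, so the interpreter lock is released while the operating system does the work.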
v1.0.1
What's Changed
- feat: fix failure to run under Docker by @hhhhsc701 in #466
- fix: upload task display error by @MoeexT in #464
- fix: add file directory error by @MoeexT in #468
- Remove annotation operators and disable auto-annotation by @hhhhsc701 in #470
Full Changelog: v1.0.0...v1.0.1
v1.0.0
What's Changed
- fix: fix log mount path by @hhhhsc701 in #459
- fix: enable jwt by @MoeexT in #460
- Add custom home page by @hhhhsc701 in #462
- fix: login popup when enable jwt by @MoeexT in #463
- fix: Enhance synthesis task management and export functionality by @Dallas98 in #465
Full Changelog: v0.8.0...v1.0.0
v0.8.0
What's Changed
- Add LICENSE file by @Dallas98 in #444
- feat: add filtering functionality for knowledge base file details by @Dallas98 in #445
- feat: add AuthGuard component for enhanced route protection by @Dallas98 in #446
- Add AuthGuard component and update login redirect logic by @Dallas98 in #447
- fix: pin a specific tag by @hhhhsc701 in #448
- feat: add oms authentication by @MoeexT in #452
- Enrich documentation by @hhhhsc701 in #451
- Fix data cleansing issues by @hhhhsc701 in #449
- fix: fix log display by @hhhhsc701 in #454
- feat: add oms auth status by @MoeexT in #453
Full Changelog: v0.7.0...v0.8.0
v0.7.0
What's Changed
- Modify environment variables by @hhhhsc701 in #434
- feat: add dataset link by @MoeexT in #430
- feat: use hard file link instead of copying file by @MoeexT in #433
- fix: enhance knowledge base chunk management and unify delete confirmation dialogs by @Dallas98 in #435
- fix: increase batch delete size and optimize document loading with async by @Dallas98 in #440
- fix: show login pop-up by @MoeexT in #441
- fix: upload zip by @MoeexT in #442
- Fix typo in LICENSE for DataMate usage by @Dallas98 in #443
Full Changelog: v0.6.0...v0.7.0
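The hard-link change in #433 avoids duplicating file content on disk. A minimal sketch of the general technique follows, assuming a copy fallback for cases where linking is not possible. The helper name `link_or_copy` is hypothetical and the fallback is an illustrative choice, not a description of DataMate's actual behavior.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(src, dst):
    """Create dst as a hard link to src, copying instead if linking fails.

    Hypothetical helper for illustration only. Returns "link" or "copy"
    to indicate which path was taken.
    """
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(src, dst)  # both names now refer to the same file data
        return "link"
    except OSError:
        # Raised, for example, when src and dst are on different
        # filesystems, when links are unsupported, or when dst exists.
        shutil.copy2(src, dst)
        return "copy"
```

A hard link adds a second directory entry for the same data, so no extra storage is used and the data remains until every link is removed.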
v0.6.0
What's Changed
- Optimize data processing and installation/deployment by @hhhhsc701 in #419
- feat: file preview by @MoeexT in #415
- Feature/cleansing enhancements by @hhhhsc701 in #421
- ⚡ instant download file by @MoeexT in #420
- ✨ multi-select file by @MoeexT in #422
- 🎨 remove data quality by @MoeexT in #423
- feat: Enhance RAG service and unify knowledge retrieval functionality by @Dallas98 in #425
- Add missing internationalization by @hhhhsc701 in #426
- feat: Enhance Knowledge Base components and API with new features by @Dallas98 in #428
- feat(i18n): add missing translation fields by @hhhhsc701 in #429
- fix: dataset deletion bugs by @MoeexT in #424
- fix: data annotation task creation bug by @MoeexT in #427
- Adapt dj multimodal operators by @hhhhsc701 in #431
- feat: enhance knowledge base components and improve API handling by @Dallas98 in #432
Full Changelog: v0.5.0...v0.6.0
v0.5.0
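Core Feature Updates
| Module | Feature | Details | PR/Commit |
|---|---|---|---|
| Data Management | Dataset detail performance optimization | Improved page load performance; removed unused mock data | #407, #391 |
| Data Management | Dataset tags | Completed tag management; fixed errors when adding and updating tags | #405, #408, #399 |
| Data Management | Data management parameter optimization | Refined dataset creation and update parameters; added validators for file name, path, and color | #371 |
| Data Management | Dataset operation fixes | Fixed issues with dataset update, file addition, and paged queries | #372, #370 |
| Knowledge Base | RAG module enhancements | Added document processing and retrieval services; supports parallel processing across multiple worker pools | #395 |
| Knowledge Base | Unified architecture | Unified the architecture of the vector knowledge base and the knowledge graph | #406 |
| Knowledge Base | Milvus configuration optimization | Optimized vector database connection and performance | #404 |
| Collection Tasks | Task editing and retry | Supports editing and retrying collection tasks; improved status display | #394, #403 |
| Collection Tasks | Time zone and template optimization | Tasks are created in the local time zone; simplified template configuration | #375, #374 |
| Collection Tasks | Filter fix | Fixed data collection filter issues | #373 |
| Data Cleansing | Cleansing enhancements | Optimized the data processing flow; enhanced the log table and file table components | #410 |
| Data Annotation | LabelStudio sync fixes | Fixed sync issues; fixed multi-label selection | #399, #390, #401 |
| Operator Market | Data processing fixes | Fixed the data processing flow; supports file metadata passthrough | #410, #414 |
| System | UI/UX optimization | Fixed duplicated i18n terms; delete confirmation now uses a Modal; URL validation | #386, #382, #402 |
| System | Frontend fixes | Fixed model list loading, filter synchronization, and pagination response issues | #413, #398, #371 |
| Documentation | Documentation improvements | Updated skill docs and product docs; optimized build configuration | - |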
What's Changed
- fix: fix paging issue by @hefanli in #370
- Feat/data management param by @hefanli in #371
- fix: fix dataset update and dataset file addition issues by @hefanli in #372
- 🐛 Fix data collection filters by @MoeexT in #373
- 💄 change delete prompt Notification to Modal by @MoeexT in #382
- fix: fix duplicated i18n terms on the home page by @hefanli in #386
- 🔥 remove mysql-template & simplify template source/target by @MoeexT in #374
- Fix multiple issues by @hhhhsc701 in #389
- fix: Fixed the problem that multiple labels cannot be selected when creating annotation tasks in text multi-label templates. by @hefanli in #390
- Fix file format by @hhhhsc701 in #392
- fix: Delete useless mock data on the data quality tab by @hefanli in #391
- feat: Enhance RAG module with document processing and retrieval improvements by @Dallas98 in #395
- fix: Improve filter synchronization and data fetching in SearchControls by @Dallas98 in #398
- 🐛 create task with local time-zone by @MoeexT in #375
- support edit&retry task by @MoeexT in #394
- fix: Fix the sync issue between DataMate & LabelStudio; Fix the tag update issue in DataMate by @JasonW404 in #399
- fix: show only the label field in annotation info by @hefanli in #401
- feat: update Milvus configuration for improved service integration by @Dallas98 in #404
- feat: enhance RAG module with background processing for knowledge base files by @Dallas98 in #406
- 💄 add url validation by @MoeexT in #402
- Develop/dataset tags by @MoeexT in #405
- Develop/collection status by @MoeexT in #403
- Develop/dataset detail perf by @MoeexT in #407
- 🐛 fix add tag error by @MoeexT in #408
- Fix data processing and operator market issues by @hhhhsc701 in #410
- fix: fetch model list on component mount and update Java service paths by @Dallas98 in #413
- Support file metadata passthrough by @hhhhsc701 in #414
Full Changelog: v0.4.0...v0.5.0
v0.4.0
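Updates & Fixes
| Module | Feature | Details |
|---|---|---|
| Internationalization (i18n) | Full frontend internationalization | Supports switching between Chinese and English using the i18next framework; covers all core modules: data collection, data management, data ratio, data evaluation, data annotation, data processing, knowledge base, operator market, settings, synthesis tasks, and cleansing |
| Knowledge Base | Extended document loader support | Added support for Excel (.xlsx, .xls) and HTML (.html, .htm) file formats; generic document loader based on the unstructured library |
| Knowledge Base | Service architecture refactoring | Improved modularity and maintainability; optimized the document parsing and vectorization flow |
| Data Management | Dataset tag distribution chart | Visual statistics; shows the distribution of each tag in a dataset |
| Data Management | Tag logic fix | Fixed dataset tag write and query logic; resolved data consistency issues |
| Data Management | Dataset loading fix | Unified the "values" key in the tags object |
| Data Cleansing | Cleansing enhancements | Optimized the data processing flow; improved data quality rules |
| Data Annotation | Annotation support for pt/h5 files | PyTorch model files (.pt) and HDF5 files (.h5) can be annotated as image datasets |
| Data Annotation | Added PgBouncer connection pooling | Resolves excessive Label Studio connections; improves database connection management; optimizes performance under high concurrency |
| Data Annotation | Label Studio deployment docs | Added detailed installation and deployment instructions |
| Operator Market | Data lineage enhancements | Improved visualization of data lineage; strengthened tracing of the data processing flow |
| Collection Tasks | File linking | When adding files, only a link is kept and file content is not copied; saves storage space |
| Collection Tasks | Scheduled execution fix | Uses APScheduler to fix scheduled execution of collection tasks; supports more flexible scheduling strategies |
| Collection Tasks | Scheduled run logic refactoring | Extracted a common scheduler class; improved code reuse and maintainability |
| System | Standardized API response codes | API response codes are unified as strings; improved interface consistency |
| System | Self-signed certificate issuance and renewal | Supports automated certificate management and renewal; simplifies HTTPS configuration |
| System | Docker build fix | Fixed Docker image build issues; optimized the build flow |
| System | Template development | Improved the project template; supports quick initialization |
| ME Provider | Endpoint URL update | Synced the latest API endpoints; ensures service connectivity |
| Documentation Tools | Skill documentation improvements | Added a Python web backend architect skill doc; added a frontend design skill doc |
| Product Docs | Updated the DataMate product documentation | https://modelengine-group.github.io/datamate-docs/ |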
v0.3.0
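Release Focus
This DataMate release focuses on stronger data governance and multi-tenant isolation: it fills in data lineage and data collection capabilities, improves the annotation workflow and operator ecosystem, unifies model configuration and data permissions, and improves platform stability and operability.
Core Feature Updates
| Module | Updates |
|---|---|
| Data Collection | • New: API Reader collection plugin, which collects data returned by API endpoints into DataMate as CSV files. • New: generic relational database collection template (RDBMS Reader), adding collection support for relational databases such as postgres, opengauss, and sqlserver. |
| Data Processing | • Changed: updated how operators are displayed in data processing tasks • New: retry count display, with logs viewable per retry |
| Data Management | • New: data lineage page, tracing data lineage across data collection, data management, and data cleansing • New: folder upload; frontend preview of text and image file content; frontend view of labels on annotated files • Fixed: metadata of files not yet stored in the database could not be viewed, and their content could not be previewed |
| Data Annotation | • Improved: consolidated and completed the sync logic for assisted and manual annotation, enabling bidirectional sync between annotation-tool data and dataset data • Improved: kept primary actions (sync, edit) at the top level and moved low-frequency actions (delete task, edit task dataset, export annotation results) into a secondary menu. • Fixed: frontend display issue caused by incorrect pagination parameters |
| Operator Market | • New: operators categorized by functionality • New: additional operator docs and examples • Improved: frontend display (Card, OperatorServiceMonitor, Requirement, and other components) |
| Knowledge Generation | • Improved: knowledge graph now uses a 2D force-directed graph, with adjusted rendering logic |
| Models and Configuration | • Refactored: unified model configuration and standardized the LLM client factory • Improved: user context tracking during session and model creation |
| Deployment | • New: Docker images tagged by branch • New: MinerU adapted to 310P, with optimized build parameters • New: SECURITY.md security policy • Fixed: database log directory permissions; image-similarity tasks wrongly filtered on repeated retries • Fixed: missing three.js dependency at build time |
| User Management | • New: data collection, datasets, cleansing, synthesis, ratio, evaluation, knowledge bases, and operators are isolated by creator; system preset data is not isolated • New: DataSetScope data permissions and UserContext propagation |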
What's Changed
- feat(data-management): add preview functionality for text and image items by @o0Shark0o in #259
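- fix: add three.js dependency at build stage by @Dallas98 in #261
- fix: add three dependency by @hhhhsc701 in #262
- fix: change database log directory permissions; fix wrongful filtering when image-similarity tasks are retried repeatedly by @hhhhsc701 in #264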
- Fix: Annotation template paginate can not show when template count > 12 by @q792602257 in #260
- feat: refactor KnowledgeGraphView to use 2D force graph and improve rendering logic by @Dallas98 in #266
- fix: pagination parameter issue by @q792602257 in #265
- feat(data-management): fix bulk upload issues and enhance UI & upload experience by @o0Shark0o in #269
- feat(auto-annotation): sync tags and timestamps to datasets and optimize visibility by @o0Shark0o in #271
- feat: optimize mineru build and deployment parameters, adapt to 310P by @hhhhsc701 in #270
- Update operator frontend display by @hhhhsc701 in #273
- realize that data sets, data cleaning, and knowledge bases are isolated according to the creator, and operators are not isolated. by @hefanli in #268
- feat: add deepwiki link by @hhhhsc701 in #278
- feat: add retry count recording and log display for cleansing tasks by @hhhhsc701 in #280
- feat: switch database to singleton pattern by @hhhhsc701 in #281
- feat: add operator docs and examples by @hhhhsc701 in #282
- feat(annotation): add bidirectional sync and flexible export for annotation tasks by @o0Shark0o in #284
- feat: enhance dataset file download functionality to zip all files and improve path validation by @Dallas98 in #285
- add data lineage page and data quality page by @hefanli in #287
- Enhance README with new fields and unit updates by @hhhhsc701 in #288
- feat: update Docker image tagging to use branch names for better identification by @Dallas98 in #289
- feat: rename data cleansing references to data processing for consistency by @Dallas98 in #292
- Create SECURITY.md for security policy by @yafengzhang2025 in #293
- feat(annotation): simplify task creation and streamline sync workflow by @o0Shark0o in #294
- Add database insert by @hhhhsc701 in #295
- Add null value handling by @hhhhsc701 in #298
- feat(annotation): refine sync behavior and add annotation export option by @o0Shark0o in #299
- fixed the problem that the metadata of files that have not yet been stored in the database cannot be previewed. by @hefanli in #300
- add apireader collection plug-in by @hefanli in #301
- feature: added universal relational database collection template by @hefanli in #302
- fixed issues where files inside folders could not be previewed and deleted by @hefanli in #303
- refactor(models): unify model configuration and standardize LLM client factory by @Dallas98 in #283
- feat(operator-market): reorganize operator categories by functionality by @o0Shark0o in #304
- feat: enhance user tracking in session management and model creation by using effective user context by @Dallas98 in #305
- fix: adapt to structural changes of COT data by @hefanli in #306
Full Changelog: v0.2.0...v0.3.0
v0.2.0
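1. Core Module Iterations
| Module | Updates | Value and Impact |
|---|---|---|
| Data Collection | Changed the former OBS collection template into an S3 collection template, strengthening support for S3 storage; added MYSQL, StarRocks, and GlusterFS collection templates | Adds more external systems supported by data collection and broadens its capabilities. |
| Data Processing | Supports cleansing into an existing dataset; optimized the filtering logic when creating cleansing tasks, with multi-select support; updated the paddlepaddle image text orientation recognition model; supports cleansing table tags | Improves the flexibility and efficiency of data processing. |
| Data Management | Added folder creation in data management, so folders can be used inside datasets; added renaming of data and folders; improved dataset icons by data type and their micro-interaction effects | Strengthens organization and maintainability in data management and makes complex datasets more flexible to use. Improves interface readability and intuitiveness. |
| Data Annotation | Added automatic annotation for image object detection, with 80 selectable classes; auto-annotation results can now be connected to Label-Studio; auto and manual annotation tasks can be created with any chosen dataset and data | Lowers manual annotation cost and speeds up annotation of large datasets. Improves the flexibility of the annotation workflow and the efficiency of putting it into practice. |
| Knowledge Generation | Added knowledge graph generation: supports generating knowledge graphs from general text files such as .txt, .md, .doc, .docx, and .pdf; the frontend displays knowledge graph relations and supports querying node and edge details | Turns unstructured text into a visual, searchable network of knowledge relations, which can substantially improve RAG data quality and efficiency |
| Deployment | Refactored docker-compose.yml, which is now released with each version; docker-compose.yml supports configuring certificates to enable https, including encrypted keys; unified the database password configuration entry point, with passwords stored in secrets; supports integration with ModelEngine, and the frontend menu can jump to the DataMate home page | Simplifies docker installation, supporting one-line, one-click installation; improves system security |
| User Module | Added user login and registration; the default deployment does not require users; once the user feature is enabled, only the home page is viewable and clicking any other page opens a login dialog | Adds user functionality and improves system security |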
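2. Experience and Compatibility Improvements and Bug Fixes
- Exposed MCP capabilities for the dataset, data cleansing, and operator market APIs, so cleansing tasks can be created through an Agent
- Operator market: redesigned the icons and colors for operators of each data type, making the frontend more attractive and easier to scan
- Switched the database from mysql to pgsql, removing the GPL dependency so the project can be adopted as open source inside companies, and merged it with LabelStudio's database to reduce resource consumption at deployment
- Fixed configured parameters being lost when switching the selected operator while creating a data cleansing task
- Fixed operator market pagination state not being reset when switching between "Select all" and a sub-category, which caused broken pagination and empty results
- Fixed the settings page stretching without limit when the frontend is resized.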
What's Changed
- fix: ray deployment changes by @hhhhsc701 in #221
- feature: add mysql collection and starrocks collection by @hefanli in #222
- feat(auto-annotation): integrate YOLO auto-labeling and enhance data management by @o0Shark0o in #223
- fix the ratio task config by @hefanli in #224
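- Improve dataset selection interaction for synthesis tasks by @o0Shark0o in #225
- fix: fix possible duplicate database inserts; optimize filtering logic by @hhhhsc701 in #226
- feat: mcp supports creating cleansing tasks by @hhhhsc701 in #227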
- feature: could create dataset while creating collection task by @hefanli in #228
- feat(auto-annotation): integrate auto-label results with Label Studio and improve task creation by @o0Shark0o in #230
- feat: enhance Docker configuration with additional services and profiles by @Dallas98 in #231
- feature: support data intelligence orchestration front by @yafengzhang2025 in #232
- feat: update knowledge base file detail view with pagination and metadata display by @Dallas98 in #233
- feat(text-annotation): improve manual text annotation and enable visibility in Label Studio by @o0Shark0o in #235
- refactor: replace the database from mysql to pgsql by @hefanli in #236
- feat: enhance Makefile and README for improved image download process and user prompts by @Dallas98 in #237
- fix: fix the LabelStudio jumps without login by @hefanli in #238
- fix: fix paddleocr model error by @hhhhsc701 in #239
- feat: enhance Makefile and update logging to mask database URL in main.py by @Dallas98 in #240
- feat(operator): add YOLOv8 image object detection operator by @o0Shark0o in #241
- feat: Add Glusterfs LocalFs S3-CompatibleFs Support by @q792602257 in #229
- feat: nginx supports https; passwords stored via secret by @hhhhsc701 in #242
- fix: fix parameter passing issue by @hhhhsc701 in #243
- feature: add user login and signup functions by @hefanli in #244
- fix(auto-annotation): resolve task bugs and add dataset and folder renaming support by @o0Shark0o in #245
- fix: pg database environment variables by @hhhhsc701 in #247
- feat: support mounting certificates via configmap and decrypting them by @hhhhsc701 in #248
- feat: support cleansing into an existing dataset by @hhhhsc701 in #249
- add user-related interactions on the frontend by @hefanli in #250
- fix: fix https certificate configuration by @hhhhsc701 in #252
- feat: add knowledge graph feature by @Dallas98 in #251
- no login required by default by @hefanli in #253
- feat: support cleansing table tags by @hhhhsc701 in #254
- feat: add graph RAG service routing and storage configuration by @Dallas98 in #255
- feat: add lightrag-hku dependency to pyproject.toml by @Dallas98 in #256
- fix: fix log printing by @hhhhsc701 in #257
New Contributors
- @yafengzhang2025 made their first contribution in #232
- @q792602257 made their first contribution in #229
Full Changelog: v0.1.0...v0.2.0