refactor(subagent): Refactor the SubAgent orchestration system; enhance the task queue, retry mechanism, and concurrency control #5722
whatevertogo wants to merge 16 commits into AstrBotDevs:master from
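Refactor the subagent orchestration system into a modular architecture:
- Add the astrbot/core/subagent/ module:
  - models.py: core data models (SubagentConfig, SubagentAgentSpec, SubagentTaskData)
  - codec.py: configuration encoding/decoding, with compatibility handling and extension fields
  - planner.py: mount-plan generator that resolves persona/instructions/tools
  - runtime.py: background task queue with concurrency control, retry classification, and crash recovery
  - worker.py: polling worker
- Add a database persistence layer:
  - SubagentTask table for task state management
  - Supports idempotency, exponential backoff, and error classification (transient/fatal)
  - Automatically recovers tasks left in an interrupted running state after a crash
- Refactor SubagentOrchestrator:
  - Uses the Facade pattern to coordinate planner/runtime/worker
  - Decouples task execution logic
- Update the HandoffTool execution flow:
  - Supports asynchronous background execution and result notification
  - Nested depth control
- Add full unit test coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>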
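The crash-recovery step named in this commit message (tasks left in an interrupted running state are recovered) could look roughly like the sketch below. It is not the PR's implementation: the `subagent_task` table layout, the `recover_interrupted` helper, and the choice to reset tasks to `pending` are assumptions made for illustration, and the five-minute threshold is only the value quoted in the related documentation.

```python
import sqlite3
import time

# Assumed staleness threshold: five minutes without an update.
STALE_AFTER_SECONDS = 300


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subagent_task ("
        "task_id TEXT PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "updated_at REAL NOT NULL)"
    )
    return conn


def recover_interrupted(conn, now=None, stale_after=STALE_AFTER_SECONDS) -> int:
    """Reset stale 'running' tasks to 'pending' and return how many were reset."""
    now = time.time() if now is None else now
    cutoff = now - stale_after
    cur = conn.execute(
        "UPDATE subagent_task SET status = 'pending', updated_at = ? "
        "WHERE status = 'running' AND updated_at < ?",
        (now, cutoff),
    )
    conn.commit()
    return cur.rowcount
```

Calling `recover_interrupted(conn)` once at worker startup would put abandoned tasks back in the queue while leaving recently updated running tasks untouched.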
…dation
- Updated SubAgentPage.vue to include new fields: tools_scope, tools, max_steps, and instructions.
- Improved validation rules for agent names to enforce length constraints.
- Added inferToolsScope function to determine tools scope based on provided configurations.
- Implemented normalization of sub-agent configurations to handle new fields.
- Enhanced agent addition and removal logic to accommodate new properties.
- Introduced tests for subagent codec to validate encoding and decoding of configurations.
- Added persistence tests for subagent tasks to ensure correct status transitions.
- Developed runtime tests for subagent task management, including retries and cancellations.
- Created unit tests for subagent planner to validate tool management and persona integration.
- Implemented route tests for subagent API to ensure correct configuration handling and task actions.
…/subagent-orchestration
…mprove error classification constants
… subagent orchestration
Replace hardcoded defaults with constants for exception names, nested depth limits, and default max steps and apply them in codec and handoff_executor. Downgrade SubagentRuntime event logs from info to debug. Add regex-based task_id validation in dashboard routes to prevent injection attacks. Signed-off-by: whatevertogo <whatevertogo@users.noreply.github.com>
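A regex-based task_id check of the kind this commit describes might be sketched as follows. The allowed character set, the 64-character limit, and the names `TASK_ID_PATTERN` and `is_valid_task_id` are assumptions for illustration; the actual pattern used in the dashboard routes is not shown on this page.

```python
import re

# Assumed format: 1 to 64 characters drawn from letters, digits, "_" and "-".
TASK_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_task_id(task_id: object) -> bool:
    """Return True only if task_id is a string that matches the whole pattern."""
    return (
        isinstance(task_id, str)
        and TASK_ID_PATTERN.fullmatch(task_id) is not None
    )
```

Using `fullmatch` rather than `match` rejects values that merely start with a valid prefix, such as an identifier followed by a newline or by SQL fragments.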
…iguration handling
… and concurrency control for robust subagent execution.
Sorry @whatevertogo, your pull request is larger than the review limit of 150000 diff characters
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant overhaul of the SubAgent orchestration system, aiming to enhance its stability, scalability, and ease of use. The core changes involve a new modular architecture for SubAgent logic, the integration of SQLite for persistent task management, and the implementation of advanced task queuing with retry and concurrency controls. These improvements ensure that SubAgent operations are more resilient to failures and can be managed more effectively, both programmatically and through the updated user interface.
Related Documentation
1 document(s) may need updating based on files changed in this PR:
AstrBotTeam's Space: Changes from pr4697
Suggested Changes
@@ -15,8 +15,8 @@
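> **Reference**: For a full description of the delegation tool parameters, see the official SubAgent documentation ([Chinese](https://docs.astrbot.app/zh/use/subagent.html#委派工具参数) / [English](https://docs.astrbot.app/en/use/subagent.html#delegation-tool-parameters)), which was updated in [PR #138](https://github.com/AstrBotDevs/AstrBot-docs/pull/138).
Task flow supports two modes, synchronous and asynchronous:
-- Synchronous mode (default): the main agent waits for the subagent to finish the task and then returns the result.
-- Asynchronous mode (background_task=true): the main agent immediately returns a task ID; the subagent runs in the background and notifies the user automatically on completion.
+- Synchronous mode (default): the main agent waits for the subagent to finish the task and then returns the result; executed via `HandoffExecutor.execute_foreground()`.
+- Asynchronous mode (background_task=true): the main agent immediately returns a task ID; the task is submitted to the persistent task queue via `HandoffExecutor.submit_background()`, the subagent runs in the background, and on completion the user is notified automatically via `wake_main_agent_for_background_result()`.
This lets the main agent either handle a task itself or delegate it flexibly, with support for asynchronous task flow and multimodal task references.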
@@ -35,6 +35,20 @@
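#### Configuration
A SubAgent is defined the same way as a Persona configuration: the configuration file must specify tools, skills, name, description, and so on.
+**Architecture refactor (PR #5722)**
+
+[PR #5722](https://github.com/AstrBotDevs/AstrBot/pull/5722) comprehensively refactors the SubAgent orchestration system, moving task execution logic from `astr_agent_tool_exec.py` into modular components under `astrbot/core/subagent/` and strengthening the task queue, retry mechanism, and concurrency control.
+
+**Core modules:**
+- `handoff_executor.py`: handles handoff execution and provides the `HandoffExecutor.execute_foreground()` and `HandoffExecutor.submit_background()` methods
+- `background_notifier.py`: contains the `wake_main_agent_for_background_result()` function, which wakes the main agent after a background task completes
+- `runtime.py`: the `SubagentRuntime` class manages the task queue, retry logic, and concurrency control
+- `planner.py`: the `SubagentPlanner` class is responsible for deterministic handoff plan generation
+- `models.py`: defines data models such as `SubagentTaskData` and `SubagentConfig`
+- `hooks.py`: provides the lifecycle hook interface (`SubagentHooks`)
+- `error_classifier.py`: error classifier that decides whether a failed task can be retried
+- `worker.py`: the `SubagentWorker` class is responsible for background task scheduling
+
**Persona reference (PR #5672)**
Subagents can reference an existing Persona configuration through the `persona_id` field. [PR #5672](https://github.com/AstrBotDevs/AstrBot/pull/5672) fixed a lookup failure that occurred when a subagent used `persona_id: "default"`: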
@@ -56,7 +70,7 @@
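- `input`: the task input handed to the subagent (required)
- `image_urls` (optional): an array of image URLs or local paths for multimodal tasks. Images the user sends to the main agent are passed to the subagent automatically; public HTTP/HTTPS URLs or local file paths can also be specified manually. Supported image formats: png, jpg, jpeg, gif, webp, bmp, tif, tiff, svg, heic.
- `background_task`: sets the task execution mode
- - If `true`, the task runs asynchronously in background mode; the main agent immediately returns a task ID and the user is notified through the background task wake-up mechanism on completion
+ - If `true`, the task runs asynchronously in background mode and is submitted to the persistent task queue via `SubagentRuntime.enqueue()`; the main agent immediately returns a task ID and the user is notified through the background task wake-up mechanism on completion
- If unset or `false`, the task runs in synchronous mode and the main agent waits for the subagent to finish
For more detailed parameter descriptions and usage examples, see the [Delegation tool parameters](https://docs.astrbot.app/zh/use/subagent.html#委派工具参数) and [Multimodal image passing](https://docs.astrbot.app/zh/use/subagent.html#多模态图片传递) sections of the official SubAgent documentation.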
@@ -151,7 +165,7 @@
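**Toolset construction logic (_build_handoff_toolset)**
-A new `_build_handoff_toolset()` method unifies handoff toolset construction so that the SubAgent and the main agent share the same tool semantics:
+A new `HandoffExecutor.build_handoff_toolset()` method unifies handoff toolset construction so that the SubAgent and the main agent share the same tool semantics:
1. **`tools=None` semantics (all tools)**:
- When `SubAgent.tools` is set to `None`, it means "all available tools", matching the main agent's behavior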
@@ -175,7 +189,7 @@
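**Technical implementation**
- Reads the `provider_settings.computer_use_runtime` setting to determine which runtime tool types to mount
-- In the `_execute_handoff()` method, the previous ad hoc toolset construction has been replaced by a call to the unified `_build_handoff_toolset()` method
+- The `HandoffExecutor.build_handoff_toolset()` method replaces the former `_build_handoff_toolset()` method and is implemented in the `handoff_executor.py` module
- Tool resolution priority: registered tools → runtime tools, which keeps resolution both flexible and correct
**Permission control (PR #5402)**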
@@ -344,12 +358,57 @@
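### 4. Tool registration and configuration loading
#### Logic improvements
-Tool registration and configuration loading have been optimized to keep subagent configuration correct and tool registration dynamic. FunctionTool gains an `is_background_task` attribute that supports asynchronous background tasks: a task ID is returned as soon as the task is created, and the main agent is notified automatically on completion.
+Tool registration and configuration loading have been optimized to keep subagent configuration correct and tool registration dynamic. FunctionTool gains an `is_background_task` attribute that supports asynchronous background tasks.
+
+#### Background task execution flow (PR #5722)
+
+[PR #5722](https://github.com/AstrBotDevs/AstrBot/pull/5722) refactors how background tasks are submitted and executed, introducing a persistent task queue, a retry mechanism, and concurrency control:
+
+**Task submission:**
+- Background tasks are submitted via `SubagentRuntime.enqueue()`, which returns a unique `task_id`
+- The `tool_call_id` parameter is used to generate an idempotency key that prevents the same task from being submitted twice
+- Task metadata (payload, handoff snapshot, execution context) is persisted to a SQLite database (the `SubagentTask` table)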
+
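+**Task queue system:**
+- The `SubagentRuntime` class manages a SQLite-backed persistent task queue
+- Task states: pending → running → succeeded/failed/retrying/canceled
+- Supports querying by task status (`list_tasks()`), manual retry (`retry_task()`), and cancellation (`cancel_task()`)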
+
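+**Retry mechanism:**
+- The error classifier (`error_classifier.py`) sorts errors into three classes:
+ - `fatal`: fatal errors that are not retried (e.g. ValueError, PermissionError)
+ - `transient`: temporary errors that can be retried (e.g. TimeoutError, ConnectionError)
+ - `retryable`: retryable errors; any otherwise unclassified error falls into this class by default
+- Retry parameters are configurable:
+ - `max_attempts`: maximum number of retries (defaults to the `DEFAULT_MAX_ATTEMPTS` constant)
+ - `base_delay_ms` and `max_delay_ms`: exponential backoff delays (milliseconds)
+ - `jitter_ratio`: random jitter ratio, to prevent the thundering herd effect
+- After a failure, whether to retry is decided automatically from the error class and the current attempt count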
+
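+**Concurrency control:**
+- Lane-based concurrency control; lane key format: `session:{umo}:{subagent_name}`
+- The `max_concurrent` parameter limits how many tasks run at the same time (defaults to the `DEFAULT_MAX_CONCURRENT_TASKS` constant and can be changed in configuration)
+- Ensures that tasks for the same user-subagent pair run serially, avoiding race conditions
+
+**Recovery:**
+- On startup, the worker automatically recovers interrupted "running" tasks (a task not updated for more than 5 minutes is treated as interrupted)
+- Interrupted tasks are rescheduled automatically so that tasks are not lost across restarts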
+
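+**Lifecycle hooks:**
+- The `SubagentHooks` interface supports the following lifecycle events:
+ - `on_task_enqueued`: fired when a task is enqueued
+ - `on_task_started`: fired when a task starts executing
+ - `on_task_retrying`: fired when a task is retried (includes the delay, error class, and related details)
+ - `on_task_succeeded`: fired when a task completes successfully
+ - `on_task_failed`: fired when a task fails (includes the error class and the exception)
+ - `on_task_canceled`: fired when a task is canceled
+ - `on_task_result_ignored`: fired when a task result is ignored (e.g. because the state has already changed)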
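#### Notes
- Tools must be declared in the corresponding SubAgent/Persona configuration
- When registering tools dynamically, make sure the configuration is updated in sync
-- Background tasks must set `is_background_task: true` correctly
+- Background tasks must set the `background_task: true` parameter correctly
+- Background tasks are executed by `HandoffExecutor.execute_queued_task()`, which restores the task context from the database and runs the task
#### Neo skill lifecycle tools (PR #5028)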
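The retry rules in the suggested documentation (three error classes, exponential backoff bounded by `base_delay_ms` and `max_delay_ms`, and a `jitter_ratio`) can be illustrated with a small sketch. This is not the code in `error_classifier.py` or `runtime.py`: the function names, the default values, the symmetric jitter formula, and the reading of `max_attempts` as a cap on total attempts are all assumptions. Only the example exception types and the parameter names come from the documentation text.

```python
import random

# Example exception types quoted in the documentation; the real lists may differ.
FATAL_EXCEPTIONS = (ValueError, PermissionError)
TRANSIENT_EXCEPTIONS = (TimeoutError, ConnectionError)


def classify_error(exc: BaseException) -> str:
    """Map an exception to 'fatal', 'transient', or the default 'retryable'."""
    if isinstance(exc, FATAL_EXCEPTIONS):
        return "fatal"
    if isinstance(exc, TRANSIENT_EXCEPTIONS):
        return "transient"
    return "retryable"


def backoff_delay_ms(attempt, base_delay_ms=1000, max_delay_ms=60_000,
                     jitter_ratio=0.1, rng=random.random):
    """Exponential backoff for a 1-based attempt, capped and jittered."""
    capped = min(base_delay_ms * 2 ** (attempt - 1), max_delay_ms)
    # rng() is in [0, 1); map it to a symmetric offset of +/- jitter_ratio.
    jitter = capped * jitter_ratio * (2 * rng() - 1)
    return max(0, round(capped + jitter))


def next_retry_delay_ms(exc, attempt, max_attempts=3, **backoff):
    """Return the delay before the next attempt, or None if no retry is due."""
    if classify_error(exc) == "fatal" or attempt >= max_attempts:
        return None
    return backoff_delay_ms(attempt, **backoff)
```

Spreading retries with jitter keeps many tasks that failed at the same moment from all retrying at the same moment, which is the thundering herd problem the documentation mentions.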
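The lane-based concurrency control described in the suggested documentation (one lane per `session:{umo}:{subagent_name}` key, plus a global `max_concurrent` limit) can be approximated with standard asyncio primitives. The `LaneScheduler` class and its methods are hypothetical names for this sketch, and the default limit of 4 is an assumption; only the lane key format and the `max_concurrent` parameter name come from the documentation.

```python
import asyncio
from collections import defaultdict


class LaneScheduler:
    """Serialize jobs within a lane and cap total concurrency across lanes."""

    def __init__(self, max_concurrent: int = 4):
        self._global = asyncio.Semaphore(max_concurrent)
        self._lanes = defaultdict(asyncio.Lock)  # one lock per lane key

    @staticmethod
    def lane_key(umo: str, subagent_name: str) -> str:
        return f"session:{umo}:{subagent_name}"

    async def run(self, umo, subagent_name, job):
        # The lane lock makes jobs of the same user/subagent pair run serially;
        # the semaphore bounds how many jobs run at once overall.
        async with self._lanes[self.lane_key(umo, subagent_name)]:
            async with self._global:
                return await job()


async def demo():
    scheduler = LaneScheduler(max_concurrent=2)
    order = []

    async def job(name, delay):
        order.append(f"start:{name}")
        await asyncio.sleep(delay)
        order.append(f"end:{name}")
        return name

    results = await asyncio.gather(
        scheduler.run("user1", "writer", lambda: job("a", 0.02)),
        scheduler.run("user1", "writer", lambda: job("b", 0.0)),
        scheduler.run("user2", "writer", lambda: job("c", 0.0)),
    )
    return results, order
```

In `demo()`, jobs `a` and `b` share a lane, so `b` cannot start until `a` has finished, while `c` runs in a different lane and is free to overlap with `a`.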
Code Review
This PR introduces a large-scale and well-designed refactoring of the SubAgent orchestration system, enhancing its reliability and maintainability through persistent task queues, retry mechanisms, and concurrency control. The code structure is now more modular, with logic clearly separated into new modules under astrbot/core/subagent/. However, a critical security vulnerability related to prompt injection was identified in the SkillManager. Specifically, untrusted metadata from the sandbox environment is injected into the main agent's system prompt without proper sanitization, which could allow a compromised sandbox to subvert the main agent's behavior. Additionally, a minor issue regarding duplicated boolean parsing logic was found, and suggestions have been provided for improved code consistency.
…togo/AstrBot into refactor/subagent-orchestration
…anagement system.
To use Codex here, create a Codex account and connect to github.
@codex review it
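Motivation
Refactor the SubAgent orchestration system to address the previous system's shortcomings in task scheduling, error handling, and concurrency control. Modular design and persistence support improve the system's reliability and maintainability.
Modifications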
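Core module refactor:
- astrbot/core/subagent/ module, which modularizes the SubAgent orchestration logic
- runtime.py: task runtime implementing the task queue, retry mechanism, and concurrency control
- codec.py: configuration codec
- error_classifier.py: error classifier that decides whether a failed task can be retried
- hooks.py: lifecycle hook interface
- models.py: data model definitions
- planner.py: task planner
- handoff_executor.py: handoff executor
- background_notifier.py: background notifier
Persistence support:
- astrbot/core/db/ module, supporting SQLite persistence
Configuration enhancements:
Frontend improvements:
- SubAgentPage.vue, supporting the new task management UI
Test coverage:
- Added 12 unit test files covering the codec, error classification, hooks, persistence, planner, runtime, and other modules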
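The SQLite persistence listed above, combined with the idempotency key that the related documentation says is derived from `tool_call_id`, might work along the lines of this sketch. The table columns, the `umo:tool_call_id` key format, and the `open_queue` and `enqueue` helpers are assumptions for illustration, not the schema or API added in `astrbot/core/db/`.

```python
import sqlite3
import uuid

# Hypothetical, simplified schema; the real SubagentTask table has more fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS subagent_task (
    task_id TEXT PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'pending',
    payload TEXT NOT NULL
)
"""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, umo: str, tool_call_id: str, payload: str) -> str:
    """Insert a task once per (umo, tool_call_id) and return its task_id."""
    key = f"{umo}:{tool_call_id}"  # assumed idempotency key format
    conn.execute(
        "INSERT OR IGNORE INTO subagent_task (task_id, idempotency_key, payload) "
        "VALUES (?, ?, ?)",
        (uuid.uuid4().hex, key, payload),
    )
    conn.commit()
    # Whether the row is new or already existed, return the stored task_id.
    row = conn.execute(
        "SELECT task_id FROM subagent_task WHERE idempotency_key = ?", (key,)
    ).fetchone()
    return row[0]
```

Because the uniqueness constraint lives in the database, a repeated submission of the same tool call returns the original `task_id` instead of creating a second task, even after a process restart.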
This is NOT a breaking change.
Test Results
Verified with local unit tests:
- test_subagent_codec.py: codec tests pass
- test_subagent_error_classifier.py: error classification tests pass
- test_subagent_hooks.py: hook interface tests pass
- test_subagent_persistence.py: persistence tests pass
- test_subagent_planner.py: planner tests pass
- test_subagent_runtime.py: runtime tests pass
Checklist
- I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.
🤖 Generated with Claude Code