Skip to content

Phase 3: Success-rate monitoring and self-healing procedures #695

@AlexMikhalev

Description

@AlexMikhalev

Parent Epic

Part of #692 (Operational Skill Store)

Summary

Track procedure confidence scores over time. When a procedure degrades (success rate drops below threshold), auto-pull it from replay and flag for re-learning. This is the "self-healing" layer that prevents agents from replaying broken workflows.

What Changes

Health tracking in terraphim_agent_evolution

CLI

terraphim-agent learn health                    # all procedures
terraphim-agent learn health --degraded-only    # only degraded/broken
terraphim-agent learn health --task-type "deploy"

Auto-pull logic

When replay is requested for a Broken procedure:

  • Refuse replay, display health report
  • Suggest re-capture: "This procedure has failed 6 of last 10 times. Re-capture with: learn capture-success ..."

EIDOS integration

Each health state transition generates a prediction:

Affected Crates

  • terraphim_agent_evolution (health monitoring)
  • terraphim_agent (health CLI, auto-pull in replay)

Dependencies

Acceptance Criteria

  • Health report computed from rolling window of last N executions
  • Procedures auto-transition between health states
  • Broken procedures refused for replay with clear message
  • Health transitions emit EIDOS-compatible predictions
  • learn health CLI displays formatted report
  • Unit tests for health state transitions

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions