Context
No existing ADF mechanism measures whether agents genuinely understand the codebases they operate on or merely execute surface-level patterns. Theory of Code Space (ToCS, arXiv:2603.00601) provides a four-dimension evaluation framework for exactly this question.
Proposal
Implement ToCS-inspired evaluation to measure ADF agent effectiveness across four dimensions:
Evaluation Dimensions
| Dimension | What It Measures | Metric |
|---|---|---|
| Construct | Does the agent build an accurate dependency map? | Edge F1 by type (IMPORTS, CALLS_API, REGISTRY_WIRES, DATA_FLOWS_TO) |
| Revise | Does the agent update beliefs when code changes? | Belief revision score (delta accuracy after code change) |
| Exploit | Can the agent predict impact of changes? | Counterfactual probe accuracy |
| Constraints | Does the agent discover architectural rules? | Invariant discovery F1 vs CLAUDE.md/domain model rules |
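As a concrete sketch of the Construct metric, Edge F1 can be computed per edge type from (source, target, type) triples. This is an illustrative helper, not an existing ADF API; the edge names in the example are hypothetical.

```python
def edge_f1_by_type(predicted, ground_truth):
    """Per-type F1 between predicted and ground-truth dependency edges.

    Edges are (source, target, type) triples, e.g.
    ("ingest", "parser", "IMPORTS"). Module names are illustrative.
    """
    scores = {}
    types = {t for _, _, t in predicted | ground_truth}
    for t in types:
        pred = {(s, d) for s, d, et in predicted if et == t}
        gold = {(s, d) for s, d, et in ground_truth if et == t}
        tp = len(pred & gold)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        scores[t] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    return scores

predicted = {("a", "b", "IMPORTS"), ("a", "c", "IMPORTS")}
gold = {("a", "b", "IMPORTS"), ("b", "c", "CALLS_API")}
scores = edge_f1_by_type(predicted, gold)
```

Scoring each type separately matters here because, per the ToCS findings below, IMPORTS edges dominate the graph and would mask weak CALLS_API and DATA_FLOWS_TO recall in a single aggregate F1.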
Implementation Phases
- Phase 0: Run ToCS benchmark against terraphim-ai workspace with current agents (baseline)
- Phase 1: Add periodic cognitive map probing -- every N tool calls, externalise understanding as structured JSON
- Phase 2: Compare probes against ground truth (KG-derived dependency graph) to compute scores
- Phase 3: Feed scores to NightwatchMonitor as new signal type (alert on degradation)
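Phase 3's degradation alert could be as simple as comparing each dimension's latest score against its running best. The signal shape and threshold below are assumptions for illustration, not the NightwatchMonitor API.

```python
def degradation_alerts(score_history, threshold=0.15):
    """Flag dimensions whose latest score dropped more than `threshold`
    below the best earlier score. Returns (dimension, best, latest)
    tuples; names and threshold are illustrative assumptions.
    """
    alerts = []
    for dim, scores in score_history.items():
        if len(scores) < 2:
            continue
        best = max(scores[:-1])
        if best - scores[-1] > threshold:
            alerts.append((dim, best, scores[-1]))
    return alerts

history = {"construct": [0.70, 0.72, 0.50], "revise": [0.60, 0.62]}
alerts = degradation_alerts(history)
```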
Cognitive Map Probing
- Injected via PreToolUse hooks (Agent SDK) or system messages (subprocess)
- Agent outputs structured JSON: nodes (modules), edges (dependencies, typed), confidence scores
- Compared against ground truth from terraphim KG + tree-sitter analysis
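The probe payload might look like the sketch below; the field names and module paths are assumptions for illustration, not a fixed schema, and the validator only checks minimal structural invariants before scoring.

```python
import json

# Illustrative probe payload; field names are not a fixed schema.
probe = {
    "nodes": ["crates/service", "crates/automata"],
    "edges": [
        {"src": "crates/service", "dst": "crates/automata",
         "type": "IMPORTS", "confidence": 0.9},
    ],
}

def validate_probe(payload):
    """Reject probes with dangling edges or out-of-range confidences."""
    nodes = set(payload["nodes"])
    for e in payload["edges"]:
        assert e["src"] in nodes and e["dst"] in nodes, "dangling edge"
        assert 0.0 <= e["confidence"] <= 1.0, "confidence out of range"
    return True
```

Keeping the payload as plain JSON makes it equally easy to emit from a hook (Agent SDK) or parse out of a subprocess transcript.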
Key Insight from ToCS Research
- Aho-Corasick automata cover ~67% of edges (IMPORTS level)
- CALLS_API (~17%) and DATA_FLOWS_TO (~7%) require semantic understanding
- Some models show "catastrophic belief collapse" -- losing knowledge between probes
- Evaluation framework should be built BEFORE KG enrichment (measure first, improve later)
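Belief collapse between successive probes can be surfaced with a simple set-overlap measure over the reported edges. Using Jaccard similarity here is my assumption, not a metric taken from the paper.

```python
def belief_stability(prev_edges, curr_edges):
    """Jaccard similarity of successive probe edge sets (0.0-1.0).

    A sharp drop between probes is a candidate indicator of the
    "catastrophic belief collapse" described in the ToCS research.
    Edges are (source, target, type) triples.
    """
    if not prev_edges and not curr_edges:
        return 1.0
    return len(prev_edges & curr_edges) / len(prev_edges | curr_edges)

prev = {("a", "b", "IMPORTS"), ("a", "c", "IMPORTS")}
curr = {("a", "b", "IMPORTS")}
stability = belief_stability(prev, curr)
```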
Sub-issues (to be created during design phase)
- Run ToCS baseline against terraphim-ai workspace
- Implement cognitive map probe injection and collection
- Implement belief stability monitoring (successive probe comparison)
- Integrate evaluation scores with NightwatchMonitor
- KG enrichment with tree-sitter call graph (after baseline confirms gap)
References
- ToCS paper: https://arxiv.org/abs/2603.00601
- ToCS repo: https://github.com/che-shr-cat/tocs
- KB article: cto-executive-system/knowledge/external/context-engineering/tocs-theory-of-code-space-benchmark.md
- Expansion plan: cto-executive-system/plans/tocs-terraphim-ai-evaluation-plan.md
- ADF plan: cto-executive-system/plans/adf-architecture-improvements.md (item 3.1)
- Depends on: Epic: Migrate ADF Claude agents to Agent SDK #689 (Agent SDK migration for hook-based probe injection)
- Related: Epic: Evaluate Pi (badlogic/pi-mono) architectural patterns for terraphim-ai #682 (Pi eval epic), Evaluate: Steering and follow-up message queues for ADF agent interaction #687 (steering queues)