Epic: ToCS-based agent evaluation framework for ADF #691

@AlexMikhalev

Description

Context

No existing ADF mechanism measures whether agents genuinely understand the codebases they operate on or merely execute surface-level patterns. Theory of Code Space (ToCS, arXiv:2603.00601) provides a four-dimension evaluation framework for exactly this question.

Proposal

Implement ToCS-inspired evaluation to measure ADF agent effectiveness across four dimensions:

Evaluation Dimensions

| Dimension | What it measures | Metric |
| --- | --- | --- |
| Construct | Does the agent build an accurate dependency map? | Edge F1 by type (`IMPORTS`, `CALLS_API`, `REGISTRY_WIRES`, `DATA_FLOWS_TO`) |
| Revise | Does the agent update beliefs when code changes? | Belief revision score (delta accuracy after a code change) |
| Exploit | Can the agent predict the impact of changes? | Counterfactual probe accuracy |
| Constraints | Does the agent discover architectural rules? | Invariant discovery F1 vs CLAUDE.md/domain model rules |
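The Construct metric could be computed along these lines: a minimal sketch that assumes both the agent's probe and the ground-truth KG expose edges as `(src, dst, type)` triples (the representation is an assumption, not an ADF API).

```python
# Sketch: Edge F1 per edge type (IMPORTS, CALLS_API, ...), assuming edges
# are hashable (src, dst, type) triples. Illustrative only, not ADF code.
from collections import defaultdict

def edge_f1_by_type(predicted, ground_truth):
    """Return {edge_type: (precision, recall, f1)} over (src, dst, type) triples."""
    by_type = defaultdict(lambda: {"pred": set(), "gold": set()})
    for e in predicted:
        by_type[e[2]]["pred"].add(e)
    for e in ground_truth:
        by_type[e[2]]["gold"].add(e)
    scores = {}
    for t, sets in by_type.items():
        tp = len(sets["pred"] & sets["gold"])
        p = tp / len(sets["pred"]) if sets["pred"] else 0.0
        r = tp / len(sets["gold"]) if sets["gold"] else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        scores[t] = (p, r, f1)
    return scores
```

Keeping scores separate per type matters because the ToCS results above suggest very different difficulty per edge type, so a single pooled F1 would hide the gap.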

Implementation Phases

  1. Phase 0: Run ToCS benchmark against terraphim-ai workspace with current agents (baseline)
  2. Phase 1: Add periodic cognitive map probing -- every N tool calls, externalise understanding as structured JSON
  3. Phase 2: Compare probes against ground truth (KG-derived dependency graph) to compute scores
  4. Phase 3: Feed scores to NightwatchMonitor as new signal type (alert on degradation)
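For Phase 3, the degradation signal could start as a simple comparison of the latest score against a rolling baseline; the window size and 0.1 absolute-drop threshold below are illustrative placeholders, and the NightwatchMonitor wiring is assumed rather than shown.

```python
# Sketch: detect score degradation for a monitor signal. Threshold and
# window are illustrative defaults, not ADF configuration values.
from statistics import mean

def degraded(history, latest, window=5, drop=0.1):
    """True if `latest` falls more than `drop` below the mean of the last `window` scores."""
    if len(history) < window:
        return False  # not enough baseline data to judge yet
    baseline = mean(history[-window:])
    return (baseline - latest) > drop
```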

Cognitive Map Probing

  • Injected via PreToolUse hooks (Agent SDK) or system messages (subprocess)
  • Agent outputs structured JSON: nodes (modules), edges (dependencies, typed), confidence scores
  • Compared against ground truth from terraphim KG + tree-sitter analysis
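A probe payload along the lines described above might look like the following; the node names, edge fields, and schema are hypothetical illustrations, not a finalised ADF format.

```python
# Hypothetical cognitive-map probe output: nodes (modules), typed edges,
# per-edge confidence. All identifiers here are illustrative.
import json

probe = {
    "nodes": ["crates/terraphim_agent", "crates/terraphim_kg"],
    "edges": [
        {
            "src": "crates/terraphim_agent",
            "dst": "crates/terraphim_kg",
            "type": "IMPORTS",
            "confidence": 0.95,
        },
    ],
}
print(json.dumps(probe, indent=2))
```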

Key Insight from ToCS Research

  • Aho-Corasick automata cover ~67% of edges (IMPORTS level)
  • CALLS_API (~17%) and DATA_FLOWS_TO (~7%) require semantic understanding
  • Some models show "catastrophic belief collapse" -- losing knowledge between probes
  • Evaluation framework should be built BEFORE KG enrichment (measure first, improve later)
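Belief collapse between successive probes could be flagged by measuring how many previously asserted edges the agent still asserts; this retention ratio is an illustrative metric for the sub-issue below, not the ToCS paper's exact score.

```python
# Sketch: belief stability between two successive probes, as the fraction
# of edges from the previous probe retained in the current one.
def edge_retention(prev_edges, curr_edges):
    """Return retained fraction in [0, 1]; 1.0 when there was nothing to forget."""
    prev, curr = set(prev_edges), set(curr_edges)
    if not prev:
        return 1.0
    return len(prev & curr) / len(prev)
```

A sharp drop in retention between consecutive probes (while the underlying code is unchanged) would be the "catastrophic belief collapse" signature.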

Sub-issues (to be created during design phase)

  • Run ToCS baseline against terraphim-ai workspace
  • Implement cognitive map probe injection and collection
  • Implement belief stability monitoring (successive probe comparison)
  • Integrate evaluation scores with NightwatchMonitor
  • KG enrichment with tree-sitter call graph (after baseline confirms gap)
