Epic: ToCS-based agent evaluation framework for ADF #691

@AlexMikhalev

Description

Context

No existing ADF mechanism measures whether agents genuinely understand the codebases they operate on or merely execute surface-level patterns. Theory of Code Space (ToCS, arXiv:2603.00601) provides a four-dimension evaluation framework for exactly this question.

Proposal

Implement ToCS-inspired evaluation to measure ADF agent effectiveness across four dimensions:

Evaluation Dimensions

| Dimension | What it measures | Metric |
| --- | --- | --- |
| Construct | Does the agent build an accurate dependency map? | Edge F1 by type (`IMPORTS`, `CALLS_API`, `REGISTRY_WIRES`, `DATA_FLOWS_TO`) |
| Revise | Does the agent update beliefs when code changes? | Belief revision score (delta accuracy after a code change) |
| Exploit | Can the agent predict the impact of changes? | Counterfactual probe accuracy |
| Constraints | Does the agent discover architectural rules? | Invariant discovery F1 vs CLAUDE.md/domain model rules |
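The Construct metric could be computed along these lines: a minimal sketch that assumes both the agent's probe and the ground-truth KG expose edges as `(src, dst, type)` triples (the representation is an assumption, not an ADF API).

```python
# Sketch: Edge F1 per edge type (IMPORTS, CALLS_API, ...), assuming edges
# are hashable (src, dst, type) triples. Illustrative only, not ADF code.
from collections import defaultdict

def edge_f1_by_type(predicted, ground_truth):
    """Return {edge_type: (precision, recall, f1)} over (src, dst, type) triples."""
    by_type = defaultdict(lambda: {"pred": set(), "gold": set()})
    for e in predicted:
        by_type[e[2]]["pred"].add(e)
    for e in ground_truth:
        by_type[e[2]]["gold"].add(e)
    scores = {}
    for t, sets in by_type.items():
        tp = len(sets["pred"] & sets["gold"])
        p = tp / len(sets["pred"]) if sets["pred"] else 0.0
        r = tp / len(sets["gold"]) if sets["gold"] else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        scores[t] = (p, r, f1)
    return scores
```

Keeping scores separate per type matters because the ToCS results above suggest very different difficulty per edge type, so a single pooled F1 would hide the gap.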

Implementation Phases

  1. Phase 0: Run ToCS benchmark against terraphim-ai workspace with current agents (baseline)
  2. Phase 1: Add periodic cognitive map probing -- every N tool calls, externalise understanding as structured JSON
  3. Phase 2: Compare probes against ground truth (KG-derived dependency graph) to compute scores
  4. Phase 3: Feed scores to NightwatchMonitor as new signal type (alert on degradation)
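For Phase 3, the degradation signal could start as a simple comparison of the latest score against a rolling baseline; the window size and 0.1 absolute-drop threshold below are illustrative placeholders, and the NightwatchMonitor wiring is assumed rather than shown.

```python
# Sketch: detect score degradation for a monitor signal. Threshold and
# window are illustrative defaults, not ADF configuration values.
from statistics import mean

def degraded(history, latest, window=5, drop=0.1):
    """True if `latest` falls more than `drop` below the mean of the last `window` scores."""
    if len(history) < window:
        return False  # not enough baseline data to judge yet
    baseline = mean(history[-window:])
    return (baseline - latest) > drop
```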

Cognitive Map Probing

  • Injected via PreToolUse hooks (Agent SDK) or system messages (subprocess)
  • Agent outputs structured JSON: nodes (modules), edges (dependencies, typed), confidence scores
  • Compared against ground truth from terraphim KG + tree-sitter analysis
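A probe payload along the lines described above might look like the following; the node names, edge fields, and schema are hypothetical illustrations, not a finalised ADF format.

```python
# Hypothetical cognitive-map probe output: nodes (modules), typed edges,
# per-edge confidence. All identifiers here are illustrative.
import json

probe = {
    "nodes": ["crates/terraphim_agent", "crates/terraphim_kg"],
    "edges": [
        {
            "src": "crates/terraphim_agent",
            "dst": "crates/terraphim_kg",
            "type": "IMPORTS",
            "confidence": 0.95,
        },
    ],
}
print(json.dumps(probe, indent=2))
```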

Key Insight from ToCS Research

  • Aho-Corasick automata cover ~67% of edges (IMPORTS level)
  • CALLS_API (~17%) and DATA_FLOWS_TO (~7%) require semantic understanding
  • Some models show "catastrophic belief collapse" -- losing knowledge between probes
  • Evaluation framework should be built BEFORE KG enrichment (measure first, improve later)
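Belief collapse between successive probes could be flagged by measuring how many previously asserted edges the agent still asserts; this retention ratio is an illustrative metric for the sub-issue below, not the ToCS paper's exact score.

```python
# Sketch: belief stability between two successive probes, as the fraction
# of edges from the previous probe retained in the current one.
def edge_retention(prev_edges, curr_edges):
    """Return retained fraction in [0, 1]; 1.0 when there was nothing to forget."""
    prev, curr = set(prev_edges), set(curr_edges)
    if not prev:
        return 1.0
    return len(prev & curr) / len(prev)
```

A sharp drop in retention between consecutive probes (while the underlying code is unchanged) would be the "catastrophic belief collapse" signature.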

Sub-issues (to be created during design phase)

  • Run ToCS baseline against terraphim-ai workspace
  • Implement cognitive map probe injection and collection
  • Implement belief stability monitoring (successive probe comparison)
  • Integrate evaluation scores with NightwatchMonitor
  • KG enrichment with tree-sitter call graph (after baseline confirms gap)
