A CLI-driven benchmark execution engine that quantitatively compares multi-agent coordination strategies. It answers the question: does Twining actually help AI agents work together, and by how much?
The harness runs controlled experiments where multiple Claude agents collaborate on a shared codebase under different coordination conditions — from no coordination through static docs, shared files, structured frameworks, and full Twining MCP — then scores the results using dual-rubric LLM-as-judge evaluation, automated analysis, and statistical comparison.
- Node.js >= 20.0.0
- Claude Code installed (`npm install -g @anthropic-ai/claude-code`)
- Authentication (one of):
  - Anthropic API key (`ANTHROPIC_API_KEY` environment variable), or
  - Claude Max/Pro subscription (`claude auth login` — no API key needed, flat monthly cost)
```bash
git clone https://github.com/daveangulo/twining-benchmark.git
cd twining-benchmark-harness
npm install
npm run build
```

```bash
# Run a single scenario/condition pair
npx twining-bench run --scenario refactoring-handoff --condition baseline --runs 1

# Run all scenarios against all conditions (5 runs each, ~$470, ~37 hours)
npx twining-bench run --scenario all --condition all --budget 500

# Use a seed for reproducible execution order
npx twining-bench run --scenario all --condition all --seed benchmark-v1

# Dry run — validate config and estimate cost without executing
npx twining-bench run --scenario all --condition all --dry-run

# Smoke test — quick end-to-end validation (2 conditions, ~10 min)
npx twining-bench smoke-test
```

Results are written to `benchmark-results/<run-id>/` with structured subdirectories for metadata, scores, transcripts, and artifacts.
```bash
# Show full KPI summary for the latest run
npx twining-bench results show latest

# Compare two runs side-by-side with significance testing
npx twining-bench results compare <run-id-1> <run-id-2>

# Export results as markdown or CSV
npx twining-bench export <run-id> --format markdown
npx twining-bench export <run-id> --format csv
```

The results display includes a VERDICT (whether Twining helps), a CONFIDENCE level, a condition ranking table with significance indicators, pairwise comparisons, and auto-generated key findings.
A standalone Python analysis package provides 16-dimension statistical analysis with interactive reports:
```bash
cd analysis
uv venv && uv pip install -e .

# Full analysis of a benchmark run (JSON, Markdown, HTML reports)
python -m benchmark_analysis analyze ../benchmark-results/<run-id>

# Compare two runs for regressions/improvements
python -m benchmark_analysis compare ../benchmark-results/<run-id-1> ../benchmark-results/<run-id-2>
```

See `analysis/README.md` for the full list of analyses performed.
The harness can run on Fly.io for long-running benchmark suites.
```bash
# Deploy to Fly.io (requires the fly CLI installed)
npx twining-bench cloud deploy

# Set your API key as a secret (only needed for API mode, not subscription plans)
fly secrets set ANTHROPIC_API_KEY=sk-ant-...

# Quick smoke test to verify config
fly ssh console -a twining-benchmark -C "node dist/cli/index.js smoke-test --timeout 10 --budget 10"

# Full benchmark run (detached via tmux)
fly ssh console -a twining-benchmark -C "apt-get update -qq && apt-get install -y -qq tmux"
fly ssh console -a twining-benchmark -C "tmux new-session -d -s bench 'node dist/cli/index.js run --scenario all --condition all --budget 500 --seed benchmark-v1 --output /data/benchmark-results 2>&1 | tee /data/benchmark-results/full-run.log'"

# Check progress
fly ssh console -a twining-benchmark -C "tail -20 /data/benchmark-results/full-run.log"

# Reattach to the session
fly ssh console -a twining-benchmark -C "tmux attach -t bench"

# Pull results to your local machine
npx twining-bench cloud pull
```

The deployed app serves a web dashboard at https://twining-benchmark.fly.dev/ for viewing results, comparing conditions, and exploring metrics.
Each benchmark run executes this sequence:
- Target Setup — A synthetic TypeScript project ("TaskFlow Pro") is copied to an isolated temp directory with a fresh git repo
- Condition Setup — Coordination artifacts are injected per condition (e.g., CLAUDE.md files, Twining MCP server, structured framework files)
- Agent Execution — Claude agents execute tasks via the Claude Agent SDK, with per-condition tool/MCP configuration
- Data Collection — Git diffs, token usage, timing, and tool call transcripts are captured per session
- Scoring — Dual-rubric LLM-as-judge (coordination quality + standalone quality) and automated analysis produce scores
- Teardown — Temp directories and MCP servers are cleaned up
When --seed is provided, execution order is randomized using a seeded Fisher-Yates shuffle to control for order effects.
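That seeded shuffle can be sketched in isolation. The version below is illustrative only; the hash and PRNG choices (FNV-1a feeding a mulberry32 generator) are assumptions, not necessarily what `src/runner/shuffle.ts` uses:

```typescript
// Derive a deterministic PRNG from a 32-bit state (mulberry32).
function mulberry32(seed: number): () => number {
  return () => {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hash a string seed (e.g. "benchmark-v1") to a 32-bit integer (FNV-1a).
function hashSeed(s: string): number {
  let h = 2166136261;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Fisher-Yates shuffle driven by the seeded PRNG; does not mutate the input.
function seededShuffle<T>(items: T[], seed: string): T[] {
  const rand = mulberry32(hashSeed(seed));
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Any string seed maps to one deterministic order, so a rerun with the same `--seed` executes scenario/condition pairs in the same sequence.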
The harness includes 8 scenarios testing different multi-agent coordination challenges:
| Scenario | Agents | What It Tests |
|---|---|---|
| `refactoring-handoff` | 2 | Agent A refactors, Agent B extends. Does B respect A's architecture? |
| `architecture-cascade` | 3 | A chain of 3 agents propagating architectural decisions downstream. |
| `bug-investigation` | 2 | Agent A investigates planted bugs (with a hard timeout), Agent B fixes from A's findings. |
| `multi-session-build` | 5 | Five sequential agents building a feature end-to-end. |
| `concurrent-agents` | 3+1 | Three agents work in parallel (caching, audit, validation), then a merge agent integrates. |
| `conflict-resolution` | 3 | Two agents are given contradictory architectural preferences; a third resolves the conflict. |
| `context-recovery` | 2 | Agent A is interrupted mid-task; Agent B recovers context and completes the work. |
| `scale-stress-test` | 2-10 | Parameterized stress test with a configurable scale factor (1-5). Excluded from `--scenario all`. |
All 8 coordination conditions form a progression from no coordination to full Twining:
| Condition | Available to Agents |
|---|---|
| `baseline` | Codebase only. No coordination files, no shared state. |
| `claude-md-only` | Codebase + CLAUDE.md with project conventions and instructions. |
| `shared-markdown` | CLAUDE.md + shared COORDINATION.md for freeform agent notes. |
| `file-reload-generic` | Simulates /clear + CONTEXT.md reload. Zero conversation history per agent. |
| `file-reload-structured` | GSD/BMAD-style framework: role files, STATE.md, PLAN.md, decisions.md, handoff.md. |
| `full-twining` | Twining plugin installed (same as a real user). The plugin provides the MCP server (32 tools), hooks, skills, and behavioral instructions. No extra harness guidance. |
| `twining-lite` | Twining plugin installed with allowedTools restricted to 8 core tools: blackboard, decisions, and handoff. Tests whether the full tool suite is necessary. |
| `persistent-history` | Agents share accumulated conversation context instead of starting fresh. Tests whether the /clear pattern helps or hurts. |
Results display surfaces these metrics prominently, before any composite score:
| Metric | What It Shows |
|---|---|
| Success Rate | % of iterations where all agent sessions completed |
| Test Pass Rate | Tests passing / total tests |
| Cost | Mean API cost per run (USD) |
| Time | Mean wall time per run |
| Compilation | Whether the final codebase compiles |
Each run also produces two independent LLM-as-judge scores:
Coordination Score (CES) — Evaluates inter-agent coordination quality using 4 dimensions:
| Dimension | Weight | What It Measures | Method |
|---|---|---|---|
| Consistency | 0.25 | Do agents align with each other's architectural choices? | LLM-judge |
| Integration | 0.30 | Does the combined output compile, pass tests, and integrate? | Automated |
| Redundancy | 0.20 | How much redundant or duplicated work occurred? (inverse) | LLM-judge |
| Coherence | 0.15 | Is the final codebase architecturally coherent? | LLM-judge |
| Overhead | -0.10 | Penalty for coordination overhead (smooth linear: ratio × 100) | Automated |
Standalone Quality Score — Evaluates output quality independent of coordination (no mention of agents or shared state):
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 0.25 | Does the code work? Edge cases handled? |
| Architectural Soundness | 0.25 | Clean separation of concerns, consistent patterns? |
| Maintainability | 0.25 | Readable, well-named, testable code? |
| Completeness | 0.25 | Were all requirements implemented? |
Coordination Lift = CES - Standalone Score. Positive means coordination helped; negative means overhead hurt net quality.
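The weighted combination and the lift can be sketched as follows. This is illustrative only; the harness's actual scorer lives in `src/analyzer/composite-scorer.ts`, and the field names below are assumptions:

```typescript
// Dimension scores on a 0-100 scale, per the CES rubric above.
interface CesDimensions {
  consistency: number;   // LLM-judge
  integration: number;   // automated
  redundancy: number;    // already inverted: higher = less duplicated work
  coherence: number;     // LLM-judge
  overheadRatio: number; // coordination overhead ratio in [0, 1], automated
}

// Weighted sum matching the table: 0.25/0.30/0.20/0.15 plus a -0.10
// penalty on the smooth linear overhead term (ratio × 100).
function coordinationScore(d: CesDimensions): number {
  return (
    0.25 * d.consistency +
    0.30 * d.integration +
    0.20 * d.redundancy +
    0.15 * d.coherence -
    0.10 * (d.overheadRatio * 100)
  );
}

// Positive lift: coordination helped; negative: overhead hurt net quality.
function coordinationLift(ces: number, standalone: number): number {
  return ces - standalone;
}
```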
LLM-as-judge evaluation uses blind mode to prevent bias:
- Condition identity (name, tool names) is stripped from the context
- Coordination artifacts (`.twining/`, `COORDINATION.md`, etc.) are removed
- Standalone quality evaluation always runs fully blinded
- The judge evaluates code quality without knowing which coordination system produced it
The harness is designed for realistic sample sizes (n=21-35 per condition at 3-5 iterations across 7 scenarios). Statistical methods are calibrated accordingly:
- Hedges' g (primary): Bias-corrected effect size — leads all comparison tables. Small-sample correction prevents the ~19% overestimate of raw Cohen's d at n<10.
- Minimum Detectable Effect Size (MDES): Reports what effects are detectable at your actual sample size, replacing misleading "need N runs" guidance. At 5 iterations × 7 scenarios: MDES = d≥0.62.
- ROPE analysis (primary decision framework): Region of Practical Equivalence testing — classifies differences as "equivalent", "different", or "undecided" based on practical significance (default ±5 composite points), better suited to small samples than p-values alone.
- Holm-Bonferroni correction: Adjusted p-values control family-wise error rate across all pairwise comparisons.
- Mann-Whitney U: Non-parametric significance test. Note: at n<10, exact p-value resolution is coarse.
- Bootstrap 95% CIs: For condition means and mean differences (delta). Fixed seed for reproducibility.
- Spearman rank correlation: Used for behavior-outcome analysis (robust to non-normality of count data).
- Variance flagging: Scenario×condition cells with CV > 30% are flagged as high-variance.
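As a sketch of the primary effect-size metric: a minimal Hedges' g, assuming the standard pooled-SD Cohen's d and the common correction factor J = 1 - 3/(4(n1 + n2) - 9). The harness's own implementation is in `src/analyzer/statistics.ts` and may differ in detail:

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVar(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

function hedgesG(a: number[], b: number[]): number {
  const n1 = a.length, n2 = b.length;
  const pooled = Math.sqrt(
    ((n1 - 1) * sampleVar(a) + (n2 - 1) * sampleVar(b)) / (n1 + n2 - 2)
  );
  const d = (mean(a) - mean(b)) / pooled; // raw Cohen's d
  const j = 1 - 3 / (4 * (n1 + n2) - 9);  // small-sample bias correction
  return d * j;
}
```

At the small n used here the correction shrinks d noticeably, which is why g rather than raw d leads the comparison tables.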
- Synthetic (default) — TaskFlow Pro: a 28-file TypeScript project with repository pattern, event-driven notifications, 2 seeded bugs, and 70 passing tests.
- Generated (`--target-type generated`) — deterministic repo generator controlled by config (file count, modules, dependency depth, test coverage). Same seed = byte-identical output.
- External (`--target-type external`) — adapter for real-world repositories via git clone, with a user-supplied ground-truth manifest.
```bash
npm test              # Run all tests
npm run test:watch    # Watch mode
npm run lint          # Type-check
npm run build         # Compile to dist/

# End-to-end smoke test (~10 min, ~$5 on API or free on subscription)
npx twining-bench smoke-test

# CI-gated e2e test
RUN_E2E=true npx vitest run tests/e2e/
```

```
twining-benchmark-harness/
├── src/
│   ├── runner/
│   │   ├── orchestrator.ts       # Run orchestration with seeded order
│   │   ├── agent-session.ts      # Claude Agent SDK wrapper
│   │   ├── test-runner.ts        # Post-iteration tsc + vitest execution
│   │   ├── smoke-test.ts         # E2E smoke test runner
│   │   ├── shuffle.ts            # Seeded Fisher-Yates shuffle
│   │   ├── data-collector.ts     # Git diff, transcript, artifact capture
│   │   └── error-handler.ts      # Failure classification
│   ├── conditions/               # 8 coordination conditions
│   ├── scenarios/                # 8 benchmark scenarios
│   ├── analyzer/
│   │   ├── statistics.ts         # Mann-Whitney U, Cohen's d, paired tests
│   │   ├── code-analysis.ts      # Git churn, AST pattern detection
│   │   ├── llm-judge.ts          # Dual-rubric evaluation (8 templates)
│   │   └── composite-scorer.ts   # CES calculation, ranking
│   ├── targets/                  # Synthetic, generated, external targets
│   ├── results/                  # Store, index manager, exporter
│   ├── cli/                      # Commander.js CLI (11 commands)
│   ├── dashboard/                # React + Vite web dashboard
│   └── types/                    # TypeScript interfaces
├── tests/
│   ├── unit/                     # 38 test files
│   ├── integration/              # 2 integration test files
│   └── e2e/                      # CI-gated smoke test
├── analysis/                     # Python analysis package (16 dimensions, 3 report formats)
├── scripts/                      # Smoke test and analysis scripts
├── Dockerfile                    # Multi-stage build for Fly.io
├── fly.toml                      # Fly.io config (4 CPU, 4GB RAM)
└── PRD.md                        # Full product requirements
```
`twining-bench.config.ts` at the project root:

```typescript
const config: BenchmarkConfig = {
  targetPath: './targets/synthetic',
  defaultRuns: 5,                   // 5 iterations per pair (detects d≥0.62 at full matrix)
  agentTimeoutMs: 15 * 60 * 1000,   // 15 min per agent session
  tokenBudgetPerRun: 500_000,
  budgetDollars: 100,               // Hard cost ceiling
  outputDirectory: './benchmark-results',
  maxTurns: 50,
  retryCount: 0,
  dashboardPort: 3838,
  evaluatorModel: 'claude-sonnet-4-5-20250929',
};
```

CLI flags override config values. Use `--budget` to set the cost ceiling for full runs.
| Variable | Purpose |
|---|---|
| `ANTHROPIC_API_KEY` | API key for agent sessions and LLM-as-judge. Not required if authenticated via `claude auth login`. |
| `TWINING_PLUGIN_PATH` | Override path to the Twining plugin directory. Set automatically in Docker (`/opt/twining-plugin/plugin`). |
| `RUN_E2E` | Set to `true` to enable CI-gated end-to-end tests. |
Implement `BaseCondition` and register it in `src/conditions/registry.ts`. See existing conditions for patterns — from simple (`baseline.ts`) to complex (`full-twining.ts` with MCP server).
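A new condition might look like the sketch below. The `BaseCondition` shape shown here (a `name` plus `setup`/`teardown` hooks) is a guess for illustration; check the actual interface in `src/conditions/` before implementing:

```typescript
import { writeFile, rm } from 'node:fs/promises';
import { join } from 'node:path';

// Stand-in for the harness's BaseCondition; the real interface lives in
// src/conditions/ and these method names are assumptions.
abstract class BaseCondition {
  abstract readonly name: string;
  abstract setup(targetDir: string): Promise<void>;    // inject coordination artifacts
  abstract teardown(targetDir: string): Promise<void>; // clean up afterwards
}

// Hypothetical condition that gives agents one freeform shared notes file.
class SharedNotesCondition extends BaseCondition {
  readonly name = 'shared-notes';

  async setup(targetDir: string): Promise<void> {
    await writeFile(join(targetDir, 'NOTES.md'), '# Agent notes\n');
  }

  async teardown(targetDir: string): Promise<void> {
    await rm(join(targetDir, 'NOTES.md'), { force: true });
  }
}
```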
Extend `BaseScenario` and register it in `src/scenarios/registry.ts`. Set `executionMode: 'parallel'` for concurrent agent scenarios. See `concurrent-agents.ts` for the parallel pattern.
Implement `ITestTarget` from `src/targets/target.interface.ts`.
See `docs/benchmark-limitations.md` for a full list of known limitations that should accompany published results, including: hand-designed CES weights, same-family judge model, synthetic TypeScript target, and small sample sizes.
MIT