A CLI-driven benchmark execution engine that quantitatively compares multi-agent coordination strategies. It answers the question: does Twining actually help AI agents work together, and by how much?
The harness runs controlled experiments where multiple Claude agents collaborate on a shared codebase under different coordination conditions — from no coordination through static docs, shared files, structured frameworks, and full Twining MCP — then scores the results using dual-rubric LLM-as-judge evaluation, automated analysis, and statistical comparison.
- Node.js >= 20.0.0
- Claude Code installed (`npm install -g @anthropic-ai/claude-code`)
- Authentication (one of):
  - Anthropic API key (`ANTHROPIC_API_KEY` environment variable), or
  - Claude Max/Pro subscription (`claude auth login` — no API key needed, flat monthly cost)
```bash
git clone https://github.com/daveangulo/twining-benchmark.git
cd twining-benchmark-harness
npm install
npm run build
```

```bash
# Run a single scenario/condition pair
npx twining-bench run --scenario refactoring-handoff --condition baseline --runs 1

# Run all scenarios against all conditions (5 runs each, ~$470, ~37 hours)
npx twining-bench run --scenario all --condition all --budget 500

# Use a seed for reproducible execution order
npx twining-bench run --scenario all --condition all --seed benchmark-v1

# Dry run — validate config and estimate cost without executing
npx twining-bench run --scenario all --condition all --dry-run

# Smoke test — quick end-to-end validation (2 conditions, ~10 min)
npx twining-bench smoke-test
```

Results are written to `benchmark-results/<run-id>/` with structured subdirectories for metadata, scores, transcripts, and artifacts.
```bash
# Show full KPI summary for the latest run
npx twining-bench results show latest

# Compare two runs side-by-side with significance testing
npx twining-bench results compare <run-id-1> <run-id-2>

# Export results as markdown or CSV
npx twining-bench export <run-id> --format markdown
npx twining-bench export <run-id> --format csv
```

The results display includes a VERDICT (whether Twining helps), a CONFIDENCE level, a condition ranking table with significance indicators, pairwise comparisons, and auto-generated key findings.
A standalone Python analysis package provides 16-dimension statistical analysis with interactive reports:
```bash
cd analysis
uv venv && uv pip install -e .

# Full analysis of a benchmark run (JSON, Markdown, HTML reports)
python -m benchmark_analysis analyze ../benchmark-results/<run-id>

# Compare two runs for regressions/improvements
python -m benchmark_analysis compare ../benchmark-results/<run-id-1> ../benchmark-results/<run-id-2>
```

See `analysis/README.md` for the full list of analyses performed.
The harness can run on Fly.io for long-running benchmark suites.
```bash
# Deploy to Fly.io (requires the fly CLI installed)
npx twining-bench cloud deploy

# Set your API key as a secret (only needed for API mode, not subscription plans)
fly secrets set ANTHROPIC_API_KEY=sk-ant-...

# Quick smoke test to verify config
fly ssh console -a twining-benchmark -C "node dist/cli/index.js smoke-test --timeout 10 --budget 10"

# Full benchmark run (detached via tmux)
fly ssh console -a twining-benchmark -C "apt-get update -qq && apt-get install -y -qq tmux"
fly ssh console -a twining-benchmark -C "tmux new-session -d -s bench 'node dist/cli/index.js run --scenario all --condition all --budget 500 --seed benchmark-v1 --output /data/benchmark-results 2>&1 | tee /data/benchmark-results/full-run.log'"

# Check progress
fly ssh console -a twining-benchmark -C "tail -20 /data/benchmark-results/full-run.log"

# Reattach to the session
fly ssh console -a twining-benchmark -C "tmux attach -t bench"

# Pull results to your local machine
npx twining-bench cloud pull
```

The deployed app serves a web dashboard at https://twining-benchmark.fly.dev/ for viewing results, comparing conditions, and exploring metrics.
Each benchmark run executes this sequence:
- Target Setup — A synthetic TypeScript project ("TaskFlow Pro") is copied to an isolated temp directory with a fresh git repo
- Condition Setup — Coordination artifacts are injected per condition (e.g., CLAUDE.md files, Twining MCP server, structured framework files)
- Agent Execution — Claude agents execute tasks via the Claude Agent SDK, with per-condition tool/MCP configuration
- Data Collection — Git diffs, token usage, timing, and tool call transcripts are captured per session
- Scoring — Dual-rubric LLM-as-judge (coordination quality + standalone quality) and automated analysis produce scores
- Teardown — Temp directories and MCP servers are cleaned up
When --seed is provided, execution order is randomized using a seeded Fisher-Yates shuffle to control for order effects.
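That seeded shuffle can be sketched in isolation. The version below is illustrative only; the hash and PRNG choices (FNV-1a feeding a mulberry32 generator) are assumptions, not necessarily what `src/runner/shuffle.ts` uses:

```typescript
// Derive a deterministic PRNG from a 32-bit state (mulberry32).
function mulberry32(seed: number): () => number {
  return () => {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hash a string seed (e.g. "benchmark-v1") to a 32-bit integer (FNV-1a).
function hashSeed(s: string): number {
  let h = 2166136261;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Fisher-Yates shuffle driven by the seeded PRNG; does not mutate the input.
function seededShuffle<T>(items: T[], seed: string): T[] {
  const rand = mulberry32(hashSeed(seed));
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Any string seed maps to one deterministic order, so a rerun with the same `--seed` executes scenario/condition pairs in the same sequence.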
The harness includes 8 scenarios testing different multi-agent coordination challenges:
| Scenario | Agents | What It Tests |
|---|---|---|
| `refactoring-handoff` | 2 | Agent A refactors, Agent B extends. Does B respect A's architecture? |
| `architecture-cascade` | 3 | A chain of 3 agents propagating architectural decisions downstream. |
| `bug-investigation` | 2 | Agent A investigates planted bugs (with a hard timeout), Agent B fixes from A's findings. |
| `multi-session-build` | 5 | Five sequential agents building a feature end-to-end. |
| `concurrent-agents` | 3+1 | Three agents work in parallel (caching, audit, validation), then a merge agent integrates. |
| `conflict-resolution` | 3 | Two agents are given contradictory architectural preferences; a third resolves the conflict. |
| `context-recovery` | 2 | Agent A is interrupted mid-task; Agent B recovers context and completes the work. |
| `scale-stress-test` | 2-10 | Parameterized stress test with a configurable scale factor (1-5). Excluded from `--scenario all`. |
All 8 coordination conditions form a progression from no coordination to full Twining:
| Condition | Available to Agents |
|---|---|
| `baseline` | Codebase only. No coordination files, no shared state. |
| `claude-md-only` | Codebase + CLAUDE.md with project conventions and instructions. |
| `shared-markdown` | CLAUDE.md + shared COORDINATION.md for freeform agent notes. |
| `file-reload-generic` | Simulates /clear + CONTEXT.md reload. Zero conversation history per agent. |
| `file-reload-structured` | GSD/BMAD-style framework: role files, STATE.md, PLAN.md, decisions.md, handoff.md. |
| `full-twining` | Twining plugin installed (same as a real user). The plugin provides the MCP server (32 tools), hooks, skills, and behavioral instructions. No extra harness guidance. |
| `twining-lite` | Twining plugin installed with allowedTools restricted to 8 core tools: blackboard, decisions, and handoff. Tests whether the full tool suite is necessary. |
| `persistent-history` | Agents share accumulated conversation context instead of starting fresh. Tests whether the /clear pattern helps or hurts. |
Results display surfaces these metrics prominently, before any composite score:
| Metric | What It Shows |
|---|---|
| Success Rate | % of iterations where all agent sessions completed |
| Test Pass Rate | Tests passing / total tests |
| Cost | Mean API cost per run (USD) |
| Time | Mean wall time per run |
| Compilation | Whether the final codebase compiles |
Each run also produces two independent LLM-as-judge scores:
Coordination Score (CES) — Evaluates inter-agent coordination quality using 4 dimensions:
| Dimension | Weight | What It Measures | Method |
|---|---|---|---|
| Consistency | 0.25 | Do agents align with each other's architectural choices? | LLM-judge |
| Integration | 0.30 | Does the combined output compile, pass tests, and integrate? | Automated |
| Redundancy | 0.20 | How much redundant or duplicated work occurred? (inverse) | LLM-judge |
| Coherence | 0.15 | Is the final codebase architecturally coherent? | LLM-judge |
| Overhead | -0.10 | Penalty for coordination overhead (smooth linear: ratio × 100) | Automated |
Standalone Quality Score — Evaluates output quality independent of coordination (no mention of agents or shared state):
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 0.25 | Does the code work? Edge cases handled? |
| Architectural Soundness | 0.25 | Clean separation of concerns, consistent patterns? |
| Maintainability | 0.25 | Readable, well-named, testable code? |
| Completeness | 0.25 | Were all requirements implemented? |
Coordination Lift = CES - Standalone Score. Positive means coordination helped; negative means overhead hurt net quality.
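The weighted combination and the lift can be sketched as follows. This is illustrative only; the harness's actual scorer lives in `src/analyzer/composite-scorer.ts`, and the field names below are assumptions:

```typescript
// Dimension scores on a 0-100 scale, per the CES rubric above.
interface CesDimensions {
  consistency: number;   // LLM-judge
  integration: number;   // automated
  redundancy: number;    // already inverted: higher = less duplicated work
  coherence: number;     // LLM-judge
  overheadRatio: number; // coordination overhead ratio in [0, 1], automated
}

// Weighted sum matching the table: 0.25/0.30/0.20/0.15 plus a -0.10
// penalty on the smooth linear overhead term (ratio × 100).
function coordinationScore(d: CesDimensions): number {
  return (
    0.25 * d.consistency +
    0.30 * d.integration +
    0.20 * d.redundancy +
    0.15 * d.coherence -
    0.10 * (d.overheadRatio * 100)
  );
}

// Positive lift: coordination helped; negative: overhead hurt net quality.
function coordinationLift(ces: number, standalone: number): number {
  return ces - standalone;
}
```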
LLM-as-judge evaluation uses blind mode to prevent bias:
- Condition identity (name, tool names) is stripped from the context
- Coordination artifacts (`.twining/`, `COORDINATION.md`, etc.) are removed
- Standalone quality evaluation always runs fully blinded
- The judge evaluates code quality without knowing which coordination system produced it
The harness is designed for realistic sample sizes (n=21-35 per condition at 3-5 iterations across 7 scenarios). Statistical methods are calibrated accordingly:
- Hedges' g (primary): Bias-corrected effect size — leads all comparison tables. Small-sample correction prevents the ~19% overestimate of raw Cohen's d at n<10.
- Minimum Detectable Effect Size (MDES): Reports what effects are detectable at your actual sample size, replacing misleading "need N runs" guidance. At 5 iterations × 7 scenarios: MDES = d≥0.62.
- ROPE analysis (primary decision framework): Region of Practical Equivalence testing — classifies differences as "equivalent", "different", or "undecided" based on practical significance (default ±5 composite points), better suited to small samples than p-values alone.
- Holm-Bonferroni correction: Adjusted p-values control family-wise error rate across all pairwise comparisons.
- Mann-Whitney U: Non-parametric significance test. Note: at n<10, exact p-value resolution is coarse.
- Bootstrap 95% CIs: For condition means and mean differences (delta). Fixed seed for reproducibility.
- Spearman rank correlation: Used for behavior-outcome analysis (robust to non-normality of count data).
- Variance flagging: Scenario×condition cells with CV > 30% are flagged as high-variance.
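As a sketch of the primary effect-size metric: a minimal Hedges' g, assuming the standard pooled-SD Cohen's d and the common correction factor J = 1 - 3/(4(n1 + n2) - 9). The harness's own implementation is in `src/analyzer/statistics.ts` and may differ in detail:

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVar(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

function hedgesG(a: number[], b: number[]): number {
  const n1 = a.length, n2 = b.length;
  const pooled = Math.sqrt(
    ((n1 - 1) * sampleVar(a) + (n2 - 1) * sampleVar(b)) / (n1 + n2 - 2)
  );
  const d = (mean(a) - mean(b)) / pooled; // raw Cohen's d
  const j = 1 - 3 / (4 * (n1 + n2) - 9);  // small-sample bias correction
  return d * j;
}
```

At the small n used here the correction shrinks d noticeably, which is why g rather than raw d leads the comparison tables.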
- Synthetic (default) — TaskFlow Pro: a 28-file TypeScript project with repository pattern, event-driven notifications, 2 seeded bugs, and 70 passing tests.
- Generated (`--target-type generated`) — deterministic repo generator controlled by config (file count, modules, dependency depth, test coverage). Same seed = byte-identical output.
- External (`--target-type external`) — adapter for real-world repositories via git clone, with a user-supplied ground-truth manifest.
```bash
npm test              # Run all tests
npm run test:watch    # Watch mode
npm run lint          # Type-check
npm run build         # Compile to dist/

# End-to-end smoke test (~10 min, ~$5 on API or free on subscription)
npx twining-bench smoke-test

# CI-gated e2e test
RUN_E2E=true npx vitest run tests/e2e/
```

```
twining-benchmark-harness/
├── src/
│   ├── runner/
│   │   ├── orchestrator.ts       # Run orchestration with seeded order
│   │   ├── agent-session.ts      # Claude Agent SDK wrapper
│   │   ├── test-runner.ts        # Post-iteration tsc + vitest execution
│   │   ├── smoke-test.ts         # E2E smoke test runner
│   │   ├── shuffle.ts            # Seeded Fisher-Yates shuffle
│   │   ├── data-collector.ts     # Git diff, transcript, artifact capture
│   │   └── error-handler.ts      # Failure classification
│   ├── conditions/               # 8 coordination conditions
│   ├── scenarios/                # 8 benchmark scenarios
│   ├── analyzer/
│   │   ├── statistics.ts         # Mann-Whitney U, Cohen's d, paired tests
│   │   ├── code-analysis.ts      # Git churn, AST pattern detection
│   │   ├── llm-judge.ts          # Dual-rubric evaluation (8 templates)
│   │   └── composite-scorer.ts   # CES calculation, ranking
│   ├── targets/                  # Synthetic, generated, external targets
│   ├── results/                  # Store, index manager, exporter
│   ├── cli/                      # Commander.js CLI (11 commands)
│   ├── dashboard/                # React + Vite web dashboard
│   └── types/                    # TypeScript interfaces
├── tests/
│   ├── unit/                     # 38 test files
│   ├── integration/              # 2 integration test files
│   └── e2e/                      # CI-gated smoke test
├── analysis/                     # Python analysis package (16 dimensions, 3 report formats)
├── scripts/                      # Smoke test and analysis scripts
├── Dockerfile                    # Multi-stage build for Fly.io
├── fly.toml                      # Fly.io config (4 CPU, 4GB RAM)
└── PRD.md                        # Full product requirements
```
`twining-bench.config.ts` at the project root:

```typescript
const config: BenchmarkConfig = {
  targetPath: './targets/synthetic',
  defaultRuns: 5,                   // 5 iterations per pair (detects d≥0.62 at full matrix)
  agentTimeoutMs: 15 * 60 * 1000,   // 15 min per agent session
  tokenBudgetPerRun: 500_000,
  budgetDollars: 100,               // Hard cost ceiling
  outputDirectory: './benchmark-results',
  maxTurns: 50,
  retryCount: 0,
  dashboardPort: 3838,
  evaluatorModel: 'claude-sonnet-4-5-20250929',
};
```

CLI flags override config values. Use `--budget` to set the cost ceiling for full runs.
| Variable | Purpose |
|---|---|
| `ANTHROPIC_API_KEY` | API key for agent sessions and LLM-as-judge. Not required if authenticated via `claude auth login`. |
| `TWINING_PLUGIN_PATH` | Override path to the Twining plugin directory. Set automatically in Docker (`/opt/twining-plugin/plugin`). |
| `RUN_E2E` | Set to `true` to enable CI-gated end-to-end tests. |
Implement `BaseCondition` and register it in `src/conditions/registry.ts`. See existing conditions for patterns — from simple (`baseline.ts`) to complex (`full-twining.ts` with MCP server).
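A new condition might look like the sketch below. The `BaseCondition` shape shown here (a `name` plus `setup`/`teardown` hooks) is a guess for illustration; check the actual interface in `src/conditions/` before implementing:

```typescript
import { writeFile, rm } from 'node:fs/promises';
import { join } from 'node:path';

// Stand-in for the harness's BaseCondition; the real interface lives in
// src/conditions/ and these method names are assumptions.
abstract class BaseCondition {
  abstract readonly name: string;
  abstract setup(targetDir: string): Promise<void>;    // inject coordination artifacts
  abstract teardown(targetDir: string): Promise<void>; // clean up afterwards
}

// Hypothetical condition that gives agents one freeform shared notes file.
class SharedNotesCondition extends BaseCondition {
  readonly name = 'shared-notes';

  async setup(targetDir: string): Promise<void> {
    await writeFile(join(targetDir, 'NOTES.md'), '# Agent notes\n');
  }

  async teardown(targetDir: string): Promise<void> {
    await rm(join(targetDir, 'NOTES.md'), { force: true });
  }
}
```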
Extend `BaseScenario` and register it in `src/scenarios/registry.ts`. Set `executionMode: 'parallel'` for concurrent agent scenarios. See `concurrent-agents.ts` for the parallel pattern.
Implement `ITestTarget` from `src/targets/target.interface.ts`.
See `docs/benchmark-limitations.md` for a full list of known limitations that should accompany published results, including: hand-designed CES weights, same-family judge model, synthetic TypeScript target, and small sample sizes.
MIT