Evalbuff

Evalbuff improves a coding agent's performance by iteratively optimizing project documentation. It watches an agent fail, writes docs to fix the pattern, and keeps only the changes that measurably help.

Two Modes

1. Commit Learning Mode (default)

Walks through your repo's git history commit-by-commit, using each commit as a learning opportunity:

Start at HEAD~500 (configurable) and process commits one at a time, oldest first
For each commit, craft a human-like prompt that vaguely describes the change (via LLM)
Run N agents in parallel (default 5) on that prompt against the parent commit
Judge all runs — using the actual commit diff as ground truth
Always analyze failures and propose doc changes (ensuring they're generic enough to help future tasks, not just this one)
Re-run N agents with the proposed docs
If scores improve, keep the docs and try to propose more improvements
If scores don't improve, reject the docs and move to the next commit
State is saved after each commit — resume at any time

The result: a docs/ directory that encodes patterns the agent needs to know, learned from real historical changes.

2. Prompt Mode

Run a specific coding prompt and improve docs for it — no git history needed:

Given a prompt describing a coding task
Run N agents in parallel on the prompt against the current HEAD
Judge all runs — no ground truth, relies entirely on e2e testing by the judge
Analyze and propose doc changes
Re-run and keep/reject as with learn mode

Useful for targeted doc improvement around known pain points.

How It Works

for each task (commit or prompt):
  ┌─────────────────────────────────────────────────────┐
  │  1. Run N agents in parallel (baseline)             │
  │  2. Judge all N runs → average score                │
  │  3. Analyze worst run → propose generic doc         │
  │  4. Apply doc to repo                               │
  │  5. Re-run N agents with new doc                    │
  │  6. Score improved? Keep doc, try more improvements │
  │     Score same/worse? Reject doc, next task         │
  └─────────────────────────────────────────────────────┘

Key design decisions:

Low-cost agent (codebuff --agent base2-free by default) — runs many times cheaply
N parallel runs for statistical significance — one run is noisy, five gives a decent signal
Always analyze — no score threshold; every task is a learning opportunity
Generic docs only — the doc writer is instructed to skip task-specific advice and focus on patterns
Iterative improvement — keeps proposing docs until one is rejected, then moves on

Usage

Commit Learning Mode

bun run evalbuff/src/run-evalbuff.ts \
  --repo /path/to/target-repo \
  --agent "codebuff --agent base2-free" \
  --commits 500 \
  --parallelism 5 \
  --max-cost 100

Prompt Mode

bun run evalbuff/src/run-evalbuff.ts \
  --repo /path/to/target-repo \
  --agent "codebuff --agent base2-free" \
  --prompt "Add a dark mode toggle to the settings page" \
  --parallelism 5

Arguments

Argument	Default	Description
`--repo`	required	Path to the target repo where docs/ will be written
`--agent`	`codebuff --agent base2-free`	Agent CLI command (prompt appended as last arg)
`--prompt`	—	If set, runs in prompt mode instead of learn mode
`--commits`	500	How many commits back to start from (learn mode)
`--parallelism`	5	Number of agents to run in parallel per task
`--max-cost`	100	Stop after spending this many USD (estimated)
`--agent-timeout`	300000	Per-agent timeout in ms (5 min default)
`--init-command`	—	Command to run in each test repo (e.g., `npm install`)
`--criteria`	auto	Path to criteria JSON (auto-created if omitted)
`--reviewers`	`claude,codex`	Comma-separated reviewer agent types

Resuming

State is saved to evalbuff-state.json in the target repo after each commit. Re-running with the same --repo automatically resumes from where it left off — it knows which commit was last processed and continues from there.

Overnight Run

nohup bun run evalbuff/src/run-evalbuff.ts \
  --repo /path/to/repo \
  --commits 500 \
  --parallelism 5 \
  --max-cost 200 \
  > evalbuff-overnight.log 2>&1 &

What Gets Produced

target-repo/
├── docs/                              # Generated documentation
│   ├── patterns/
│   │   └── error-handling.md
│   ├── conventions/
│   │   └── naming.md
│   └── architecture/
│       └── data-flow.md
├── AGENTS.md                          # Table of contents
├── evalbuff-state.json               # Resumable state (last commit SHA)
├── evalbuff-log.jsonl                # Per-task log
├── evalbuff-criteria.json            # Current criteria level
└── evalbuff-report-2026-03-26.md     # Report

Living Quality Criteria

Judges use a leveling system to avoid over-optimizing prematurely:

Level	Criteria Added	Promotion
L1	Builds, tests pass, basic completeness	Start
L2	+ Feature works E2E, logs clean	After L1 avg >= 8.0 over 10 tasks
L3	+ Edge cases, UI verification	After L2 avg >= 8.0
L4	+ Cross-component integration, performance	After L3 avg >= 8.0
L5	+ Production readiness	After L4 avg >= 8.0

Architecture

File	Role
`run-evalbuff.ts`	Main orchestrator — learn mode + prompt mode
`commit-task-generator.ts`	Extract tasks from git history, generate prompts from commits
`cli-runner.ts`	Agent-agnostic CLI runner — spawns any agent, captures diff
`judge.ts`	AI judging with/without ground truth, multi-reviewer aggregation
`docs-optimizer.ts`	Failure analysis, generic doc writing, doc application/revert
`criteria.ts`	Living quality criteria with L1-L5 promotion
`morning-report.ts`	Report generation from JSONL log
`test-repo-utils.ts`	Isolated git repo lifecycle management

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evalbuff

Two Modes

1. Commit Learning Mode (default)

2. Prompt Mode

How It Works

Usage

Commit Learning Mode

Prompt Mode

Arguments

Resuming

Overnight Run

What Gets Produced

Living Quality Criteria

Architecture

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Evalbuff

Two Modes

1. Commit Learning Mode (default)

2. Prompt Mode

How It Works

Usage

Commit Learning Mode

Prompt Mode

Arguments

Resuming

Overnight Run

What Gets Produced

Living Quality Criteria

Architecture