Evalbuff improves a coding agent's performance by iteratively optimizing project documentation. It watches an agent fail, writes docs to fix the pattern, and keeps only the changes that measurably help.
Walks through your repo's git history commit-by-commit, using each commit as a learning opportunity:
- Start at HEAD~500 (configurable) and process commits one at a time, oldest first
- For each commit, craft a human-like prompt that vaguely describes the change (via LLM)
- Run N agents in parallel (default 5) on that prompt against the parent commit
- Judge all runs — using the actual commit diff as ground truth
- Always analyze failures and propose doc changes (ensuring they're generic enough to help future tasks, not just this one)
- Re-run N agents with the proposed docs
- If scores improve, keep the docs and try to propose more improvements
- If scores don't improve, reject the docs and move to the next commit
- State is saved after each commit — resume at any time
The result: a docs/ directory that encodes patterns the agent needs to know, learned from real historical changes.
Run a specific coding prompt and improve docs for it — no git history needed:
- Given a prompt describing a coding task
- Run N agents in parallel on the prompt against the current HEAD
- Judge all runs — no ground truth, relies entirely on e2e testing by the judge
- Analyze and propose doc changes
- Re-run and keep/reject as with learn mode
Useful for targeted doc improvement around known pain points.
for each task (commit or prompt):
┌─────────────────────────────────────────────────────┐
│ 1. Run N agents in parallel (baseline) │
│ 2. Judge all N runs → average score │
│ 3. Analyze worst run → propose generic doc │
│ 4. Apply doc to repo │
│ 5. Re-run N agents with new doc │
│ 6. Score improved? Keep doc, try more improvements │
│ Score same/worse? Reject doc, next task │
└─────────────────────────────────────────────────────┘
Key design decisions:
- Low-cost agent (
codebuff --agent base2-freeby default) — runs many times cheaply - N parallel runs for statistical significance — one run is noisy, five gives a decent signal
- Always analyze — no score threshold; every task is a learning opportunity
- Generic docs only — the doc writer is instructed to skip task-specific advice and focus on patterns
- Iterative improvement — keeps proposing docs until one is rejected, then moves on
bun run evalbuff/src/run-evalbuff.ts \
--repo /path/to/target-repo \
--agent "codebuff --agent base2-free" \
--commits 500 \
--parallelism 5 \
--max-cost 100bun run evalbuff/src/run-evalbuff.ts \
--repo /path/to/target-repo \
--agent "codebuff --agent base2-free" \
--prompt "Add a dark mode toggle to the settings page" \
--parallelism 5| Argument | Default | Description |
|---|---|---|
--repo |
required | Path to the target repo where docs/ will be written |
--agent |
codebuff --agent base2-free |
Agent CLI command (prompt appended as last arg) |
--prompt |
— | If set, runs in prompt mode instead of learn mode |
--commits |
500 | How many commits back to start from (learn mode) |
--parallelism |
5 | Number of agents to run in parallel per task |
--max-cost |
100 | Stop after spending this many USD (estimated) |
--agent-timeout |
300000 | Per-agent timeout in ms (5 min default) |
--init-command |
— | Command to run in each test repo (e.g., npm install) |
--criteria |
auto | Path to criteria JSON (auto-created if omitted) |
--reviewers |
claude,codex |
Comma-separated reviewer agent types |
State is saved to evalbuff-state.json in the target repo after each commit. Re-running with the same --repo automatically resumes from where it left off — it knows which commit was last processed and continues from there.
nohup bun run evalbuff/src/run-evalbuff.ts \
--repo /path/to/repo \
--commits 500 \
--parallelism 5 \
--max-cost 200 \
> evalbuff-overnight.log 2>&1 &target-repo/
├── docs/ # Generated documentation
│ ├── patterns/
│ │ └── error-handling.md
│ ├── conventions/
│ │ └── naming.md
│ └── architecture/
│ └── data-flow.md
├── AGENTS.md # Table of contents
├── evalbuff-state.json # Resumable state (last commit SHA)
├── evalbuff-log.jsonl # Per-task log
├── evalbuff-criteria.json # Current criteria level
└── evalbuff-report-2026-03-26.md # Report
Judges use a leveling system to avoid over-optimizing prematurely:
| Level | Criteria Added | Promotion |
|---|---|---|
| L1 | Builds, tests pass, basic completeness | Start |
| L2 | + Feature works E2E, logs clean | After L1 avg >= 8.0 over 10 tasks |
| L3 | + Edge cases, UI verification | After L2 avg >= 8.0 |
| L4 | + Cross-component integration, performance | After L3 avg >= 8.0 |
| L5 | + Production readiness | After L4 avg >= 8.0 |
| File | Role |
|---|---|
run-evalbuff.ts |
Main orchestrator — learn mode + prompt mode |
commit-task-generator.ts |
Extract tasks from git history, generate prompts from commits |
cli-runner.ts |
Agent-agnostic CLI runner — spawns any agent, captures diff |
judge.ts |
AI judging with/without ground truth, multi-reviewer aggregation |
docs-optimizer.ts |
Failure analysis, generic doc writing, doc application/revert |
criteria.ts |
Living quality criteria with L1-L5 promotion |
morning-report.ts |
Report generation from JSONL log |
test-repo-utils.ts |
Isolated git repo lifecycle management |