feat(user-test): self-eval loop with binary evals and mutation proposals (v2.52.0) by Drewx-Design · Pull Request #308 · EveryInc/compound-engineering-plugin

Drewx-Design · 2026-03-18T14:15:11Z

Summary

Adds the /user-test-eval command and skill, which grade user-test output against 3 binary evals (probe execution order, Proven regression distinction, P1 surfacing)
Records scores in skill-evals.json and proposes targeted mutations in skill-mutations.md
Adds execution_index tracking for artifact-only probe order verification
Adds .user-test-last-report.md persistence for presentation-layer eval grading
Schema v10 with v9 backward compatibility

Key Design Decisions

Eval runs as separate invocation — commit mode prompts "Run /user-test-eval" rather than auto-chaining, to preserve grading integrity (separate context = harder to game)
All evals are mechanical — Eval 1: index comparison, Eval 2: regex match for → Proven regression marker, Eval 3: slug+priority matching. No subjective judgment.
One mutation per failing eval: every failing eval gets a proposal in the same run, and a human reviews each proposal and accepts or rejects it
Artifacts are project-scoped (tests/user-flows/), not plugin-scoped
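
To illustrate what "mechanical" means for Evals 2 and 3, here is a minimal Python sketch. It is not the skill's implementation (the skill is defined in prompt files, not code). The report layout, the `bugs` record shape, and both function names are assumptions made for illustration. The only detail taken from the description is the `→ Proven regression` marker and the slug+priority matching.

```python
import re

# Marker named in the PR description; the surrounding line format is assumed.
PROVEN_REGRESSION = re.compile(r"→\s*Proven regression")


def eval_proven_regression(report_text: str, regressed_proven_areas: list[str]) -> bool:
    """Pass when every regressed Proven area has a report line carrying the marker."""
    lines = report_text.splitlines()
    return all(
        any(area in line and PROVEN_REGRESSION.search(line) is not None for line in lines)
        for area in regressed_proven_areas
    )


def eval_p1_surfacing(report_text: str, bugs: list[dict]) -> bool:
    """Pass when every P1 bug slug appears on a report line that also says P1."""
    lines = report_text.splitlines()
    p1_slugs = [bug["slug"] for bug in bugs if bug["priority"] == "P1"]
    return all(
        any(slug in line and "P1" in line for line in lines)
        for slug in p1_slugs
    )
```

Both checks return a plain boolean and involve no scoring rubric, which is what keeps the grading binary.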

Commits

Commit	Description
v2.49.0	Cross-area probes, probe isolation, proactive browser restart (schema v7)
v2.50.0	Compounding quality: writebacks, weakness synthesis, fingerprints, CLI adversarial (schema v8)
v2.51.0	Multi-area journey testing (schema v9)
v2.52.0	Self-eval loop with binary evals and mutation proposals (schema v10)

Testing

Run /user-test on a real scenario, verify report persists to .user-test-last-report.md
Run /user-test-eval after a completed run, verify evals grade correctly
Verify skill-evals.json and skill-mutations.md are created/updated
Test eval with pre-v10 artifacts (should SKIP Eval 1 gracefully)
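
The graceful SKIP for pre-v10 artifacts could look roughly like the sketch below. It assumes each probe record is a dict, that v10 artifacts carry an `execution_index` on every probe, and that the expected order is the order in which probes are listed. The PR does not spell out any of these, so treat them as hypothetical.

```python
def eval_probe_order(probes: list[dict]) -> str:
    """Return "PASS", "FAIL", or "SKIP" for the probe execution order eval."""
    # Pre-v10 artifacts have no execution_index, so there is nothing to compare.
    if not probes or any("execution_index" not in probe for probe in probes):
        return "SKIP"
    indices = [probe["execution_index"] for probe in probes]
    # Index comparison: listed order must match recorded execution order.
    return "PASS" if indices == sorted(indices) else "FAIL"
```

Returning a third state instead of FAIL keeps older artifacts from being penalized for data they could never have recorded.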

Post-Deploy Monitoring & Validation

No additional operational monitoring required: plugin changes are client-side skill definitions with no server component.

… system
Complete implementation of the user-test skill across three plan iterations:
- v1: Core skill with 5-phase browser testing, maturity model, thin wrapper commands
- v2: Schema migration, timing, CLI mode, auto-commit, quality scoring, performance thresholds
- v3: Bug registry, per-area score history, structured skip reasons, pass thresholds, queryable qualitative data, discovery-to-regression graduation, UX opportunities + good patterns
New files: SKILL.md (364 lines), 5 reference files, 3 thin wrapper commands, 3 plans, 2 learnings.
Version: 2.37.0
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

…dence, and novelty Schema v6: adds probe confidence field (high/medium/low) with execution ordering, stable query rotation lifecycle (active → [stable] → [retired]), mandatory novelty budget per area, and first-run code-reading orientation for structural hypothesis probes. New reference files: orientation.md, probes.md, queries-and-multiturn.md, run-targeting.md, verification-patterns.md.

…dedup
P1: Fix generated→generated_from field name, broaden orientation bash commands beyond src/ (language-agnostic discovery), fix CLI discovery step numbering (step 2→3 after orientation insertion).
P2: Fix "Phase 5"→"commit mode" reference, remove "(Phase 4)" label conflation, clarify score-history.json stores 1 entry per commit (not per iterate run), define novelty run 1 state boundary, add v5 migration bullet to SKILL.md Phase 1.
P3: Document low confidence as reserved, specify Probes Run History column format (P/F entries), remove YAGNI orientation design note, deduplicate scoring boundaries across files, note [retired] as long-term maturity state.
Merge v3/v4 migration bullets to maintain SKILL.md at 420-line ceiling.

…aring, protected artifacts
1. Interaction Method preamble: AskUserQuestion fallback for non-Claude LLMs (numbered list + wait pattern). Prevents silent auto-configuration.
2. CLI-only graceful degradation: when Chrome MCP unavailable but cli_test_command covers all scored_output areas, offer CLI-only mode.
3. Optional Proof sharing: after Phase 4 report, offer to POST session summary to proofeditor.ai for team review. Best-effort, skip on failure.
4. Protected artifacts: declare tests/user-flows/ as pipeline output that review agents must not flag for deletion.
Consolidated 7 individual reference sections into compact list to stay within 420-line SKILL.md ceiling.

…ive browser restart (schema v7) Cross-area probe table tests state carry-over between areas. Probe isolation guidance separates multi-cause symptoms. Proactive restart prevents browser connection degradation after ~15 MCP calls. Connection resilience extracted to reference file. SKILL.md stays at 420 lines.

…e, lifecycle compression
- Fix stale back-reference in browser-input-patterns.md (now points to connection-resilience.md directly)
- Add related_bug to probe generation output format with storage guidance (inline in Generated From column)
- Fix UX010 example → B002 (bug ID, not UX opportunity ID)
- Add cross-area probe mention to orientation.md for multi-area seams
- Consolidate browser-input-patterns.md Proactive Restart (removed redundant timing/cross-area paragraphs, kept clears/preserves list)
- Compress cross-area Lifecycle subsection (6 lines → 2)
- Add deferred restart exception note to connection-resilience.md
- Replace "randomly" with round-robin for spot-check rotation
- Add cross-area probe table mention to SKILL.md commit mode step 4

…synthesis, fingerprints, CLI adversarial (schema v8)
Four changes that make each run smarter automatically:
1. Richer commit writebacks (C1): tactical notes in Notes column, verified selectors auto-appended to verify: blocks, weakness_class field written when 2+ probes share a failure pattern
2. Weakness-class synthesis (C2): cross-area synthesis pass generates [cross-area] Explore Next Run entries with adversarial instructions when a weakness_class appears in 2+ areas
3. Novelty fingerprint persistence (C3): compact fingerprints persisted in .user-test-last-run.json across runs with read-merge-write sequence, 20-per-area cap, iterate mode exemption
4. CLI adversarial browser mode (C4): CLI score 3 triggers adversarial browser testing (skip happy path, front-load competing constraints, pre-emptive P1 probe, SKIP→PROBES-ONLY override)
JSON schema extracted from SKILL.md to references/last-run-schema.md (52→2 lines inline). SKILL.md reduced from 421→369 lines.
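
The read-merge-write sequence with a 20-per-area cap (C3) can be sketched as follows. This is an illustration under assumptions, not the skill's actual logic: the `novelty_fingerprints` key, the shape of the fingerprints (plain strings keyed by area), and the choice to evict the oldest entries at the cap are all hypothetical. The file name and the cap of 20 come from the commit message.

```python
import json
from pathlib import Path

CAP_PER_AREA = 20  # cap stated in the commit message


def merge_fingerprints(path: Path, new: dict[str, list[str]]) -> dict[str, list[str]]:
    """Read the last-run file, merge new fingerprints per area, write it back."""
    # Read: start from whatever the previous run persisted, if anything.
    data = json.loads(path.read_text()) if path.exists() else {}
    stored = data.get("novelty_fingerprints", {})  # hypothetical key name

    # Merge: append unseen fingerprints, then keep only the newest CAP_PER_AREA.
    for area, fingerprints in new.items():
        merged = list(stored.get(area, []))
        for fingerprint in fingerprints:
            if fingerprint not in merged:
                merged.append(fingerprint)
        stored[area] = merged[-CAP_PER_AREA:]  # assumed policy: drop the oldest

    # Write: other keys in the file are preserved untouched.
    data["novelty_fingerprints"] = stored
    path.write_text(json.dumps(data, indent=2))
    return stored
```

Reading before writing is what lets fingerprints accumulate across runs instead of being overwritten by each run's output.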

…ring, JSON dedup, synthesis timing
7 review findings fixed:
1. Adversarial trigger scope: compressed SKILL.md to brief-then-defer pattern, removing contradiction with reference file's broader trigger condition
2. Step numbering collision: SKILL.md commit step 9 → 8b to avoid clash with queries doc's internal step 9; updated "steps 8-10" → "steps 8-12"
3. JSON field duplication: replaced duplicate JSON blocks in probes.md and queries-and-multiturn.md with cross-references to last-run-schema.md
4. Duplicate fingerprint override section removed from CLI Adversarial Mode
5. Synthesis timing: added "as present at run start — ignore this run's commit" to SKILL.md Phase 4 Step 6 to prevent premature synthesis
6. Added section anchor hints to tactical notes and selector writeback refs
7. Added tactical_note/confirmed_selectors defaults to v7 migration rule

Add journey testing layer for accumulated state across 3+ areas without resets. Journeys execute after cross-area probes, before per-area testing, with checkpoints at each step. failing-at-N pinpoints which step broke.
New: references/journeys.md (lifecycle, budget, execution, interactions)
Schema: v8→v9 additive (missing Journeys = empty, forward compatible)
JSON: journeys_run array with per-step checkpoint data
SKILL.md: 376 lines (under 420 ceiling)
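
As a rough illustration of how a failing-at-N status can pinpoint the broken step, the sketch below runs a journey's checkpoints in order and stops at the first failure. The `run_journey` name, the callable-per-step representation, and 1-based step numbering are assumptions. The commit message only states that checkpoints exist at each step and that failing-at-N identifies the step that broke.

```python
from typing import Callable


def run_journey(checkpoints: list[Callable[[], bool]]) -> str:
    """Run checkpoints in order without resets; report the first failing step."""
    for step_number, checkpoint in enumerate(checkpoints, start=1):
        if not checkpoint():
            # Later steps depend on accumulated state, so stop here.
            return f"failing-at-{step_number}"
    return "passing"
```

For example, a three-step journey whose second checkpoint fails yields `failing-at-2`, and the third step is never executed.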

…ring, JSON dedup, synthesis timing
- P1: Add escalated_to field for journey escalation dedup (prevents duplicate bugs)
- P1: Clarify continue-mode tracks per-step escalation independently
- P2: Add 3-journey spot-check cap for passing/stable journeys
- P2: Fix schema status enum (was missing untested/flaky/stable)
- P2: Add explicit count-change baseline capture instruction
- P3: Consolidate template comment, add schema hint
- P3: Add explicit "no graduation" statement for journeys
- P3: Define N-run summary terms (stabilized, persistent issues)
- P3: Clarify orientation synthesizes probes into journey suggestions

…ion proposals (schema v10) Add /user-test-eval command and skill that grades skill output against 3 binary evals (probe execution order, Proven regression distinction, P1 surfacing), records scores in skill-evals.json, and proposes targeted mutations in skill-mutations.md. Auto-triggers after commit mode. Adds execution_index tracking for artifact-only probe order verification, and .user-test-last-report.md persistence for presentation-layer eval grading. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…, frontmatter, dynamic version
- P1: Change auto-eval trigger to a prompt ("Run /user-test-eval") instead of inline invocation (preserves grading integrity and respects allowed-tools constraints on parent commands)
- P2: Add disable-model-invocation: true to eval skill frontmatter
- P2: Replace hardcoded skill_version with dynamic plugin.json read
- P3: Add AskUserQuestion to staleness and already-evaluated prompts
- P3: Add references/ extraction as v2.53.0 out-of-scope consideration

…cked)

# Conflicts: # .claude-plugin/marketplace.json # plugins/compound-engineering/.claude-plugin/plugin.json # plugins/compound-engineering/CHANGELOG.md # plugins/compound-engineering/README.md

…39.0 pattern Commands were migrated to skills in upstream v2.39.0. Move user-test-commit and user-test-iterate from commands/ to skills/ as thin dispatcher skills. Remove commands/ directory — user-test and user-test-eval were already skills.

tmchow · 2026-03-18T16:07:15Z

@Drewx-Design have you considered how this might work with agent-browser (and possibly the Lightpanda engine with agent-browser)? I'm wondering whether this testing can be headless.

…rmation Scale browser MCP budget by consecutive pass count (3/2/1 calls for 2-5/6-9/10+ passes) instead of flat 3 for all Proven areas. Add non-deterministic probe confirmation requiring 2 consecutive passes before treating LLM-dependent probes as genuinely passing. Updates all 12 cross-file "3 MCP" references across 4 files to point to the tiered system. Adds worked example for 1-call tier budget. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
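
The tiered budget and the two-pass confirmation rule can be written as a small lookup. This is a sketch: the function names are hypothetical, and returning `None` below 2 consecutive passes is an assumption, since the commit message only defines tiers starting at 2 passes.

```python
from typing import Optional


def mcp_budget(consecutive_passes: int) -> Optional[int]:
    """Browser MCP call budget for a Proven area: 3/2/1 for 2-5/6-9/10+ passes."""
    if consecutive_passes >= 10:
        return 1
    if consecutive_passes >= 6:
        return 2
    if consecutive_passes >= 2:
        return 3
    return None  # assumption: below 2 passes the tiered budget does not apply


def is_confirmed_pass(recent_results: list[bool], required: int = 2) -> bool:
    """Non-deterministic probes count as passing only after `required` passes in a row."""
    return len(recent_results) >= required and all(recent_results[-required:])
```

For instance, `mcp_budget(7)` gives 2 calls, and a probe with results `[False, True]` is not yet confirmed because only its latest run passed.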

- Fix stale "1 for Proven" in progressive narrowing table (P1)
- Fix worked example wording implying probes consume budget (P2)
- Add passing* inter-run persistence mechanism: keep status as failing/flaky in test file, track unconfirmed pass in last-run JSON (P2)
- Add novelty waiver cross-reference at mandatory probe rule (P2)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Tighten run-targeting.md prose: remove redundant sentences, collapse 6 paragraphs to 3 without losing meaning
- Add 2-call tier worked example to queries-and-multiturn.md
- Define freed calls formula: N = sum of (3 - tier_budget) across all Proven areas tested this run
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
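
The freed calls formula is simple enough to check by hand. The sketch below takes the tier budgets of the Proven areas tested in a run as input; the `freed_calls` name and the list-of-ints input are assumptions for illustration, while the baseline of 3 and the formula itself come from the commit message.

```python
FLAT_BUDGET = 3  # the former flat per-area budget that the tiers replace


def freed_calls(tier_budgets: list[int]) -> int:
    """N = sum of (3 - tier_budget) across all Proven areas tested this run."""
    return sum(FLAT_BUDGET - budget for budget in tier_budgets)
```

With four Proven areas at budgets 3, 2, 1, and 1, the run frees 0 + 1 + 2 + 2 = 5 calls.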

Drewx-Design · 2026-03-18T20:04:43Z

Headless testing would be the eventual goal. I've been working on this for a week or two testing on my project, but really trying to nail the quality of the results first, with the compounding factor.

tmchow · 2026-03-18T21:51:51Z

> Headless testing would be the eventual goal. I've been working on this for a week or two testing on my project, but really trying to nail the quality of the results first, with the compounding factor.

Yeah, I'd love to see this in a follow-up because it'll end up being WAY faster and would also work for Codex and other non-Claude Code coding environments, which aligns with our cross-platform goal.

The Lightpanda browser in particular, with the agent-browser CLI, is ridiculously fast too, so there's that benefit. It also has lots of little features that are helpful for agent usage.

Drewx-Design and others added 15 commits February 28, 2026 14:42

fix(user-test): correct skill count — 21 not 22 (deepen-plan is untra…

5eed97d

Merge remote-tracking branch 'origin/main' into feat/user-test-skill

49e0d85

Drewx-Design and others added 3 commits March 18, 2026 15:35

Drewx-Design wants to merge 18 commits into EveryInc:main from Drewx-Design:feat/user-test-skill
