
feat(user-test): self-eval loop with binary evals and mutation proposals (v2.52.0) #308

Draft
Drewx-Design wants to merge 18 commits into EveryInc:main from Drewx-Design:feat/user-test-skill

Conversation

@Drewx-Design

Summary

  • Add /user-test-eval command and skill — grades user-test output against 3 binary evals (probe execution order, Proven regression distinction, P1 surfacing)
  • Records scores in skill-evals.json, proposes targeted mutations in skill-mutations.md
  • Adds execution_index tracking for artifact-only probe order verification
  • Adds .user-test-last-report.md persistence for presentation-layer eval grading
  • Schema v10 with v9 backward compatibility

Key Design Decisions

  • Eval runs as separate invocation — commit mode prompts "Run /user-test-eval" rather than auto-chaining, to preserve grading integrity (separate context = harder to game)
  • All evals are mechanical — Eval 1: index comparison, Eval 2: regex match for → Proven regression marker, Eval 3: slug+priority matching. No subjective judgment.
  • One mutation per failing eval — all failures get proposals in a single run, human reviews and accepts/rejects
  • Artifacts are project-scoped (tests/user-flows/) not plugin-scoped
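
A minimal sketch of how three mechanical, binary evals like these could be written. The artifact shapes are assumptions made for illustration (a list of probes carrying `execution_index`, a plain-text report, findings with `slug` and `priority`); the PR does not show the real schema, and the function names are hypothetical.

```python
import re

# Assumed marker format, taken from the "→ Proven regression" wording above.
PROVEN_REGRESSION = re.compile(r"→\s*Proven regression")


def eval_probe_order(probes):
    """Eval 1 (index comparison): probes are listed in planned order and
    pass only if their recorded execution_index is strictly increasing."""
    indices = [p["execution_index"] for p in probes]
    return all(a < b for a, b in zip(indices, indices[1:]))


def eval_proven_regression(report_text, regressed_proven_areas):
    """Eval 2 (regex match): every regressed Proven area appears on a
    report line that carries the marker."""
    marked = [ln for ln in report_text.splitlines() if PROVEN_REGRESSION.search(ln)]
    return all(any(area in ln for ln in marked) for area in regressed_proven_areas)


def eval_p1_surfacing(report_text, findings):
    """Eval 3 (slug + priority matching): every P1 finding's slug appears
    on a report line that also mentions P1."""
    lines = report_text.splitlines()
    p1_slugs = [f["slug"] for f in findings if f["priority"] == "P1"]
    return all(any(slug in ln and "P1" in ln for ln in lines) for slug in p1_slugs)
```

Each function returns a plain boolean, so no grading step depends on subjective judgment.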

Commits

| Commit | Description |
| --- | --- |
| v2.49.0 | Cross-area probes, probe isolation, proactive browser restart (schema v7) |
| v2.50.0 | Compounding quality: writebacks, weakness synthesis, fingerprints, CLI adversarial (schema v8) |
| v2.51.0 | Multi-area journey testing (schema v9) |
| v2.52.0 | Self-eval loop with binary evals and mutation proposals (schema v10) |

Testing

  • Run /user-test on a real scenario, verify report persists to .user-test-last-report.md
  • Run /user-test-eval after a completed run, verify evals grade correctly
  • Verify skill-evals.json and skill-mutations.md are created/updated
  • Test eval with pre-v10 artifacts (should SKIP Eval 1 gracefully)
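
The graceful SKIP for pre-v10 artifacts could look roughly like the sketch below. It assumes the run artifact exposes an integer `schema_version` and a `probes` list; both field names and the `grade_eval_1` helper are hypothetical.

```python
def grade_eval_1(run):
    """Return "SKIP", "PASS", or "FAIL" for the probe-order eval."""
    # Artifacts written before schema v10 have no execution_index to compare.
    if run.get("schema_version", 0) < 10:
        return "SKIP"
    indices = [p.get("execution_index") for p in run.get("probes", [])]
    # Nothing to grade if no probes ran or any probe lacks an index.
    if not indices or any(i is None for i in indices):
        return "SKIP"
    in_order = all(a < b for a, b in zip(indices, indices[1:]))
    return "PASS" if in_order else "FAIL"
```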

Post-Deploy Monitoring & Validation

No additional operational monitoring required: plugin changes are client-side skill definitions with no server component.


Compound Engineered

Drewx-Design and others added 15 commits February 28, 2026 14:42
… system

Complete implementation of the user-test skill across three plan iterations:
- v1: Core skill with 5-phase browser testing, maturity model, thin wrapper commands
- v2: Schema migration, timing, CLI mode, auto-commit, quality scoring, performance thresholds
- v3: Bug registry, per-area score history, structured skip reasons, pass thresholds,
  queryable qualitative data, discovery-to-regression graduation, UX opportunities + good patterns

New files: SKILL.md (364 lines), 5 reference files, 3 thin wrapper commands, 3 plans, 2 learnings.
Version: 2.37.0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…dence, and novelty

Schema v6: adds probe confidence field (high/medium/low) with execution
ordering, stable query rotation lifecycle (active → [stable] → [retired]),
mandatory novelty budget per area, and first-run code-reading orientation
for structural hypothesis probes.

New reference files: orientation.md, probes.md, queries-and-multiturn.md,
run-targeting.md, verification-patterns.md.
…dedup

P1: Fix generated→generated_from field name, broaden orientation bash
commands beyond src/ (language-agnostic discovery), fix CLI discovery
step numbering (step 2→3 after orientation insertion).

P2: Fix "Phase 5"→"commit mode" reference, remove "(Phase 4)" label
conflation, clarify score-history.json stores 1 entry per commit (not
per iterate run), define novelty run 1 state boundary, add v5 migration
bullet to SKILL.md Phase 1.

P3: Document low confidence as reserved, specify Probes Run History
column format (P/F entries), remove YAGNI orientation design note,
deduplicate scoring boundaries across files, note [retired] as
long-term maturity state. Merge v3/v4 migration bullets to maintain
SKILL.md at 420-line ceiling.
…aring, protected artifacts

1. Interaction Method preamble — AskUserQuestion fallback for non-Claude
   LLMs (numbered list + wait pattern). Prevents silent auto-configuration.
2. CLI-only graceful degradation — when Chrome MCP unavailable but
   cli_test_command covers all scored_output areas, offer CLI-only mode.
3. Optional Proof sharing — after Phase 4 report, offer to POST session
   summary to proofeditor.ai for team review. Best-effort, skip on failure.
4. Protected artifacts — declare tests/user-flows/ as pipeline output
   that review agents must not flag for deletion.

Consolidated 7 individual reference sections into compact list to stay
within 420-line SKILL.md ceiling.
…ive browser restart (schema v7)

Cross-area probe table tests state carry-over between areas. Probe
isolation guidance separates multi-cause symptoms. Proactive restart
prevents browser connection degradation after ~15 MCP calls. Connection
resilience extracted to reference file. SKILL.md stays at 420 lines.
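
As an illustration only, a proactive restart can be driven by a simple call counter. The class below is a hypothetical sketch, and the threshold of 15 is taken from the "~15 MCP calls" figure above rather than from the skill's actual code.

```python
class McpCallCounter:
    """Counts browser MCP calls and signals when a proactive restart is due."""

    def __init__(self, restart_after=15):
        self.restart_after = restart_after
        self.calls = 0

    def record_call(self):
        self.calls += 1

    def restart_due(self):
        # Restart before the connection degrades, not after it fails.
        return self.calls >= self.restart_after

    def mark_restarted(self):
        self.calls = 0
```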
…e, lifecycle compression

- Fix stale back-reference in browser-input-patterns.md (now points to
  connection-resilience.md directly)
- Add related_bug to probe generation output format with storage guidance
  (inline in Generated From column)
- Fix UX010 example → B002 (bug ID, not UX opportunity ID)
- Add cross-area probe mention to orientation.md for multi-area seams
- Consolidate browser-input-patterns.md Proactive Restart (removed redundant
  timing/cross-area paragraphs, kept clears/preserves list)
- Compress cross-area Lifecycle subsection (6 lines → 2)
- Add deferred restart exception note to connection-resilience.md
- Replace "randomly" with round-robin for spot-check rotation
- Add cross-area probe table mention to SKILL.md commit mode step 4
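
The switch from random to round-robin spot-check rotation can be sketched as follows. The function name and the idea of persisting a cursor between runs are assumptions for illustration.

```python
def pick_spot_checks(items, cursor, k):
    """Pick k items starting at cursor, wrapping around the list.

    Returns the picked items and the cursor to store for the next run,
    so every item is eventually checked in a predictable order.
    """
    if not items:
        return [], 0
    k = min(k, len(items))
    picked = [items[(cursor + i) % len(items)] for i in range(k)]
    return picked, (cursor + k) % len(items)
```

Unlike random selection, this guarantees that no item goes unchecked for more than a bounded number of runs.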
…synthesis, fingerprints, CLI adversarial (schema v8)

Four changes that make each run smarter automatically:

1. Richer commit writebacks (C1): tactical notes in Notes column, verified
   selectors auto-appended to verify: blocks, weakness_class field written
   when 2+ probes share a failure pattern
2. Weakness-class synthesis (C2): cross-area synthesis pass generates
   [cross-area] Explore Next Run entries with adversarial instructions
   when a weakness_class appears in 2+ areas
3. Novelty fingerprint persistence (C3): compact fingerprints persisted
   in .user-test-last-run.json across runs with read-merge-write sequence,
   20-per-area cap, iterate mode exemption
4. CLI adversarial browser mode (C4): CLI score 3 triggers adversarial
   browser testing — skip happy path, front-load competing constraints,
   pre-emptive P1 probe, SKIP→PROBES-ONLY override
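
The "2+ areas" condition in the weakness-class synthesis (C2) amounts to a grouping step. The sketch below assumes each probe record has `area` and an optional `weakness_class`; the record shape and function name are hypothetical.

```python
from collections import defaultdict


def cross_area_weaknesses(probes):
    """Map each weakness_class to its areas, keeping only classes
    that appear in at least two distinct areas."""
    areas_by_class = defaultdict(set)
    for probe in probes:
        weakness = probe.get("weakness_class")
        if weakness:
            areas_by_class[weakness].add(probe["area"])
    return {
        weakness: sorted(areas)
        for weakness, areas in areas_by_class.items()
        if len(areas) >= 2
    }
```

Each key in the result would correspond to one [cross-area] Explore Next Run entry.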

JSON schema extracted from SKILL.md to references/last-run-schema.md
(52→2 lines inline). SKILL.md reduced from 421→369 lines.
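
The read-merge-write sequence with a 20-per-area cap (C3) might look like this sketch. The top-level `fingerprints` key and the choice to keep the most recent entries when the cap is hit are assumptions; the actual layout is defined in references/last-run-schema.md, which is not shown here.

```python
import json
from pathlib import Path

CAP_PER_AREA = 20


def merge_fingerprints(path, new_by_area):
    """Read the last-run file, merge in new fingerprints, write it back."""
    path = Path(path)
    # Read: start from the existing file so earlier runs are not lost.
    data = json.loads(path.read_text()) if path.exists() else {}
    stored = data.setdefault("fingerprints", {})
    # Merge: append unseen fingerprints, then trim to the cap.
    for area, new in new_by_area.items():
        merged = list(stored.get(area, []))
        for fingerprint in new:
            if fingerprint not in merged:
                merged.append(fingerprint)
        stored[area] = merged[-CAP_PER_AREA:]  # assumed: keep the newest
    # Write: persist the merged result for the next run.
    path.write_text(json.dumps(data, indent=2))
    return stored
```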
…ring, JSON dedup, synthesis timing

7 review findings fixed:
1. Adversarial trigger scope: compressed SKILL.md to brief-then-defer pattern,
   removing contradiction with reference file's broader trigger condition
2. Step numbering collision: SKILL.md commit step 9 → 8b to avoid clash
   with queries doc's internal step 9; updated "steps 8-10" → "steps 8-12"
3. JSON field duplication: replaced duplicate JSON blocks in probes.md and
   queries-and-multiturn.md with cross-references to last-run-schema.md
4. Duplicate fingerprint override section removed from CLI Adversarial Mode
5. Synthesis timing: added "as present at run start — ignore this run's commit"
   to SKILL.md Phase 4 Step 6 to prevent premature synthesis
6. Added section anchor hints to tactical notes and selector writeback refs
7. Added tactical_note/confirmed_selectors defaults to v7 migration rule

Add journey testing layer for accumulated state across 3+ areas without
resets. Journeys execute after cross-area probes, before per-area testing,
with checkpoints at each step. failing-at-N pinpoints which step broke.

New: references/journeys.md (lifecycle, budget, execution, interactions)
Schema: v8→v9 additive (missing Journeys = empty, forward compatible)
JSON: journeys_run array with per-step checkpoint data
SKILL.md: 376 lines (under 420 ceiling)
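
A rough sketch of how per-step checkpoints yield a failing-at-N status. Representing steps as (name, check) pairs and stopping at the first failure are simplifying assumptions; the sketch does not model the continue-mode behavior.

```python
def run_journey(steps):
    """Run journey steps in order without resets.

    steps is a list of (name, check) pairs where check() returns True
    when the checkpoint holds. Returns "passing" or "failing-at-N",
    with N being the 1-based index of the first broken step.
    """
    for n, (_name, check) in enumerate(steps, start=1):
        if not check():
            return f"failing-at-{n}"
    return "passing"
```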
…ring, JSON dedup, synthesis timing

- P1: Add escalated_to field for journey escalation dedup (prevents duplicate bugs)
- P1: Clarify continue-mode tracks per-step escalation independently
- P2: Add 3-journey spot-check cap for passing/stable journeys
- P2: Fix schema status enum (was missing untested/flaky/stable)
- P2: Add explicit count-change baseline capture instruction
- P3: Consolidate template comment, add schema hint
- P3: Add explicit "no graduation" statement for journeys
- P3: Define N-run summary terms (stabilized, persistent issues)
- P3: Clarify orientation synthesizes probes into journey suggestions
…ion proposals (schema v10)

Add /user-test-eval command and skill that grades skill output against 3 binary
evals (probe execution order, Proven regression distinction, P1 surfacing),
records scores in skill-evals.json, and proposes targeted mutations in
skill-mutations.md. Auto-triggers after commit mode. Adds execution_index
tracking for artifact-only probe order verification, and .user-test-last-report.md
persistence for presentation-layer eval grading.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…, frontmatter, dynamic version

- P1: Change auto-eval trigger to a prompt ("Run /user-test-eval") instead of
  inline invocation — preserves grading integrity and respects allowed-tools
  constraints on parent commands
- P2: Add disable-model-invocation: true to eval skill frontmatter
- P2: Replace hardcoded skill_version with dynamic plugin.json read
- P3: Add AskUserQuestion to staleness and already-evaluated prompts
- P3: Add references/ extraction as v2.53.0 out-of-scope consideration
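
Reading the version dynamically instead of hardcoding it is a small change. The sketch below assumes plugin.json has a top-level `version` string; the helper name is hypothetical.

```python
import json
from pathlib import Path


def read_skill_version(plugin_json_path):
    """Return the version string recorded in a plugin.json file."""
    manifest = json.loads(Path(plugin_json_path).read_text())
    return manifest["version"]  # assumed top-level key
```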
# Conflicts:
#	.claude-plugin/marketplace.json
#	plugins/compound-engineering/.claude-plugin/plugin.json
#	plugins/compound-engineering/CHANGELOG.md
#	plugins/compound-engineering/README.md
…39.0 pattern

Commands were migrated to skills in upstream v2.39.0. Move user-test-commit
and user-test-iterate from commands/ to skills/ as thin dispatcher skills.
Remove commands/ directory — user-test and user-test-eval were already skills.
@tmchow
Collaborator

tmchow commented Mar 18, 2026

@Drewx-Design have you considered how this might work with agent-browser (and possibly lightpanda engine with agent-browser)? I'm wondering if this testing can be headless?

Drewx-Design and others added 3 commits March 18, 2026 15:35
…rmation

Scale browser MCP budget by consecutive pass count (3/2/1 calls for
2-5/6-9/10+ passes) instead of flat 3 for all Proven areas. Add
non-deterministic probe confirmation requiring 2 consecutive passes
before treating LLM-dependent probes as genuinely passing.

Updates all 12 cross-file "3 MCP" references across 4 files to point
to the tiered system. Adds worked example for 1-call tier budget.
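
The tiered budget described above can be expressed as a small lookup. This is a sketch: the function name is hypothetical, and treating anything below 6 consecutive passes as the 3-call tier is an assumption, since the commit only specifies the 2-5 range.

```python
def mcp_budget(consecutive_passes):
    """Browser MCP call budget for a Proven area, by consecutive pass count."""
    if consecutive_passes >= 10:
        return 1
    if consecutive_passes >= 6:
        return 2
    return 3  # 2-5 passes per the commit; lower counts assumed to match


for passes in (2, 5, 6, 9, 10, 25):
    print(passes, mcp_budget(passes))
```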

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Fix stale "1 for Proven" in progressive narrowing table (P1)
- Fix worked example wording implying probes consume budget (P2)
- Add passing* inter-run persistence mechanism: keep status as
  failing/flaky in test file, track unconfirmed pass in last-run JSON (P2)
- Add novelty waiver cross-reference at mandatory probe rule (P2)
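
The two-consecutive-pass confirmation for non-deterministic probes could be tracked as below. This sketch assumes the test file holds `status` and the last-run JSON holds a boolean for the unconfirmed pass; the function and parameter names are hypothetical.

```python
def confirm_pass(status, unconfirmed_pass, passed_now):
    """Return (new_status, new_unconfirmed_pass) after one run.

    A first pass is only recorded as unconfirmed (passing*); the status
    in the test file stays failing/flaky until a second consecutive pass.
    """
    if not passed_now:
        return status, False      # streak broken, status unchanged
    if unconfirmed_pass:
        return "passing", False   # second consecutive pass confirms it
    return status, True           # first pass: remember it, keep status
```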

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Tighten run-targeting.md prose: remove redundant sentences, collapse
  6 paragraphs to 3 without losing meaning
- Add 2-call tier worked example to queries-and-multiturn.md
- Define freed calls formula: N = sum of (3 - tier_budget) across all
  Proven areas tested this run
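
A worked example of the freed calls formula, written as a sketch with a hypothetical function name. The input is the tier budget (3, 2, or 1) of each Proven area tested in the run.

```python
def freed_calls(tier_budgets):
    """N = sum of (3 - tier_budget) across all Proven areas tested this run."""
    return sum(3 - budget for budget in tier_budgets)


# Three Proven areas on the 3-, 2-, and 1-call tiers free 0 + 1 + 2 calls.
print(freed_calls([3, 2, 1]))  # prints 3
```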

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Drewx-Design
Author

Headless testing would be the eventual goal. I've been working on this for a week or two, testing on my project, but I'm really trying to nail the quality of the results first, with the compounding factor.

@tmchow
Collaborator

tmchow commented Mar 18, 2026

> Headless testing would be the eventual goal. I've been working on this for a week or two testing on my project, but really trying to nail the quality of the results first, with the compounding factor.

Yeah, I'd love to see this in a follow-up because it'll end up being WAY faster, and it would then also work for Codex and other non-Claude Code coding environments, which aligns with our cross-platform goal.

Lightpanda browser in particular, with the agent-browser CLI, is so ridiculously fast too, so there's that benefit. There are also lots of little features that are helpful for agent usage.

2 participants