
fix(cli): make --threshold override per-test score requirement#885

Merged
christso merged 6 commits into main from fix/result-verdict-use-mean-score on Mar 31, 2026

@christso christso commented Mar 31, 2026

Closes #882

Summary

--threshold now configures the per-test score requirement (default 0.8) and is threaded end-to-end through the orchestrator. The RESULT verdict, progress line, JSONL output, and exit code are all consistent.

Before (contradictory):

RESULT: FAIL  (28/31 passed, mean score: 0.927)
Suite score: 0.93 (threshold: 0.80) — PASS     ← exit code 0

After (consistent):

Without --threshold (no CI gate, default 0.8 for display):

RESULT: FAIL  (28/31 scored >= 0.8, mean: 0.927)    ← exit code 0 (no gate)

With --threshold 0.8 (CI gate enabled):

RESULT: FAIL  (28/31 scored >= 0.8, mean: 0.927)    ← exit code 1

With --threshold 0.2 (lower bar, all pass):

RESULT: PASS  (31/31 scored >= 0.2, mean: 0.927)    ← exit code 0

Changes

  • --threshold flows through orchestrator → classifyQualityStatus(), so progress lines and executionStatus respect the custom threshold
  • calculateEvaluationSummary() recomputes passed/failed counts from raw scores when a threshold is set
  • formatEvaluationSummary() shows threshold in RESULT line: scored >= 0.8
  • Exit code matches RESULT verdict — no separate threshold check
  • Removed formatThresholdSummary() (merged into single RESULT line)
  • Updated CLI help text and docs
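
The verdict logic described above can be sketched roughly as follows. This is a minimal illustration, not the agentv source: `summarize`, `formatResultLine`, `exitCodeFor`, and `DEFAULT_THRESHOLD` are hypothetical names, and the rule that the exit code only gates when `--threshold` is passed is taken from the "After" examples in the summary.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the real agentv API.
const DEFAULT_THRESHOLD = 0.8;

interface Summary {
  total: number;
  passed: number;
  failed: number;
  mean: number;
  threshold: number;
}

// Recompute pass/fail counts from raw scores against the effective threshold.
function summarize(scores: number[], cliThreshold?: number): Summary {
  const threshold = cliThreshold ?? DEFAULT_THRESHOLD;
  const passed = scores.filter((s) => s >= threshold).length;
  const sum = scores.reduce((acc, s) => acc + s, 0);
  // Assumption: an empty run reports a mean of 0 rather than NaN.
  const mean = scores.length > 0 ? sum / scores.length : 0;
  return { total: scores.length, passed, failed: scores.length - passed, mean, threshold };
}

// Single RESULT line: PASS only when every test meets the threshold.
function formatResultLine(s: Summary): string {
  const verdict = s.failed === 0 ? "PASS" : "FAIL";
  return `RESULT: ${verdict}  (${s.passed}/${s.total} scored >= ${s.threshold}, mean: ${s.mean.toFixed(3)})`;
}

// The exit code follows the verdict, but only gates CI when --threshold was given.
function exitCodeFor(s: Summary, cliThreshold?: number): number {
  return cliThreshold !== undefined && s.failed > 0 ? 1 : 0;
}
```

With scores `[1, 0.9, 0.25]`, this sketch reports FAIL with exit code 0 when no threshold is given, FAIL with exit code 1 for a threshold of 0.5, and PASS with exit code 0 for a threshold of 0.2.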

Test plan

  • All unit tests pass (including new threshold tests)
  • Build succeeds
  • Pre-push hooks pass
  • E2E verified: with a test scoring 0.25, --threshold 0.5 shows FAIL + exit 1, and --threshold 0.2 shows PASS + exit 0

🤖 Generated with Claude Code

cloudflare-workers-and-pages bot commented Mar 31, 2026

Deploying agentv with Cloudflare Pages

Latest commit: cd69e5d
Status: ✅  Deploy successful!
Preview URL: https://4433fa69.agentv.pages.dev
Branch Preview URL: https://fix-result-verdict-use-mean.agentv.pages.dev


christso and others added 3 commits March 31, 2026 13:01
The RESULT: PASS/FAIL line used all-must-pass logic (every individual
case must score >= 0.8), while --threshold used mean-based scoring.
This caused confusing contradictory output:

  RESULT: FAIL  (28/31 passed, mean score: 0.927)
  Suite score: 0.93 (threshold: 0.80) — PASS

Now the RESULT line uses mean >= 0.8, consistent with --threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The RESULT line and --threshold check now both use pass rate (fraction
of cases scoring >= 0.8) instead of inconsistent metrics. Previously
the RESULT line used all-must-pass while --threshold used mean score.

Before:
  RESULT: FAIL  (28/31 passed, mean score: 0.927)
  Suite score: 0.93 (threshold: 0.80) — PASS

After:
  RESULT: FAIL  (pass rate: 90.3%, 28/31 passed, mean score: 0.927)
  Suite pass rate: 90.3% (threshold: 80.0%) — PASS

Both paths now consistently use pass rate. The RESULT line is
informational (all-must-pass), while --threshold gates CI exit code
against a configurable pass rate minimum.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RESULT: FAIL  (28 passed, 3 failed, mean score: 0.927)

The failed count makes it immediately obvious why the verdict is FAIL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@christso christso force-pushed the fix/result-verdict-use-mean-score branch from 00f0b19 to 6803b40 March 31, 2026 13:06
@christso christso changed the title from "fix(cli): use mean score for RESULT verdict instead of all-must-pass" to "fix(cli): use pass rate for --threshold and clarify RESULT verdict" Mar 31, 2026
The --threshold flag now gates on pass rate, not mean score. Update
the CLI help text and docs site to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@christso christso changed the title from "fix(cli): use pass rate for --threshold and clarify RESULT verdict" to "fix(cli): make --threshold override per-test score requirement" Mar 31, 2026
christso and others added 2 commits March 31, 2026 13:36
--threshold now configures the per-test score requirement (default 0.8)
instead of comparing mean score. The RESULT verdict and exit code are
now consistent: exit 1 when any test scores below the threshold.

Before (contradictory):
  RESULT: FAIL  (28/31 passed, mean score: 0.927)
  Suite score: 0.93 (threshold: 0.80) — PASS  ← exit code 0

After (consistent):
  RESULT: PASS  (28/31 scored >= 0.8, mean: 0.927)  ← exit code 0

With --threshold 0.95:
  RESULT: FAIL  (20/31 scored >= 0.95, mean: 0.927)  ← exit code 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The threshold now flows from CLI → orchestrator → classifyQualityStatus,
so the live progress line (e.g., "0.750 FAIL") and executionStatus in
JSONL output both respect the custom threshold. Previously these were
hardcoded to PASS_THRESHOLD (0.8) regardless of --threshold.

Added threshold field to RunEvaluationOptions, RunEvalCaseOptions, and
all intermediate call sites (runBatchEvaluation, evaluateCandidate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
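
The threading described in the last commit might look something like the sketch below. The names `RunEvaluationOptions`, `RunEvalCaseOptions`, `classifyQualityStatus`, and `PASS_THRESHOLD` come from the commit message, but their shapes here are simplified stand-ins; the real fields and signatures are not shown in this PR. `runEvaluation` and `runEvalCase` are hypothetical helpers that take precomputed scores.

```typescript
// Simplified stand-ins; the real agentv types and signatures are assumptions here.
const PASS_THRESHOLD = 0.8;

type QualityStatus = "PASS" | "FAIL";

interface RunEvaluationOptions {
  threshold?: number; // set from --threshold when the flag is passed
}

interface RunEvalCaseOptions {
  threshold?: number; // forwarded unchanged from the suite-level options
}

// Falls back to PASS_THRESHOLD when no custom threshold was threaded through.
function classifyQualityStatus(score: number, threshold: number = PASS_THRESHOLD): QualityStatus {
  return score >= threshold ? "PASS" : "FAIL";
}

// Hypothetical per-case step: builds a progress line such as "0.750 FAIL".
function runEvalCase(score: number, options: RunEvalCaseOptions): string {
  const status = classifyQualityStatus(score, options.threshold);
  return `${score.toFixed(3)} ${status}`;
}

// Hypothetical suite-level step: passes the same options down to every case.
function runEvaluation(scores: number[], options: RunEvaluationOptions = {}): string[] {
  return scores.map((score) => runEvalCase(score, options));
}
```

In this sketch, `runEvaluation([0.75])` yields `"0.750 FAIL"` under the default, while `runEvaluation([0.75], { threshold: 0.5 })` yields `"0.750 PASS"`, which is the behavior the commit says was previously impossible because the threshold was hardcoded.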
@christso christso merged commit 0b53656 into main Mar 31, 2026
4 checks passed
@christso christso deleted the fix/result-verdict-use-mean-score branch March 31, 2026 21:13

Development

Successfully merging this pull request may close these issues.

bug(cli): --threshold compares mean score instead of per-test score
