investigate: Retry loop fails to fix trivially solvable QA failures — improve agent problem-solving feedback loop #125


Problem

During an E2E test run, the Lead Dev was tasked with writing a Python script that prints a Triforce from Zelda. The script executed successfully in the sandbox (exit_code=0), but QA identified a clear, specific formatting issue: the first line was missing its 6 leading spaces ("/\" instead of "      /\").

Despite QA providing an exact, unambiguous failure description on each retry — specifying the expected output, the actual output, and precisely what was wrong — the Lead Dev failed to fix the issue across all 3 retry attempts. The task exhausted its retry budget on what should have been a trivial one-line fix.

Why This Matters

This isn't about the Triforce. It reveals a systemic weakness in how the retry loop communicates failure context to the Lead Dev. If the agent can't fix a spacing issue given an exact error description, it will fail on any task that requires iterative correction, and iterative correction is the entire point of the retry mechanism.

Deep Dive Areas

1. QA Failure Report → Lead Dev Context Transfer

Current behavior: QA produces a verdict with a failure description. The Lead Dev receives this on retry but may not be using it effectively.

Questions to investigate:

  • What exactly does the Lead Dev see in its prompt on retry? Is the QA failure report included verbatim, or summarized/truncated? (A tracing sketch follows this list.)
  • Does the Lead Dev receive the current file contents alongside the failure report, or does it regenerate from scratch each time?
  • Is there a "diff mindset" — does the Dev understand it should make a targeted fix, or does it rewrite the entire file?
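
One way to answer the first question empirically is to log the fully rendered retry prompt at the orchestrator boundary, right before it is sent to the Lead Dev. A minimal sketch in Python (the function, field names, and trace location are all hypothetical, since the orchestrator's internals aren't shown in this issue):

```python
import json
import time
from pathlib import Path

TRACE_DIR = Path("traces/retry_prompts")  # hypothetical trace location

def log_retry_prompt(task_id: str, attempt: int, prompt: str, qa_report: str) -> None:
    """Dump the exact prompt the Lead Dev receives on a retry, for trace analysis."""
    TRACE_DIR.mkdir(parents=True, exist_ok=True)
    record = {
        "task_id": task_id,
        "attempt": attempt,
        "timestamp": time.time(),
        "qa_report": qa_report,                     # what QA actually produced
        "prompt": prompt,                           # what the Lead Dev actually sees
        "qa_report_verbatim": qa_report in prompt,  # verbatim, or summarized/truncated?
    }
    (TRACE_DIR / f"{task_id}_attempt{attempt}.json").write_text(json.dumps(record, indent=2))
```

The `qa_report_verbatim` flag answers the verbatim-vs-truncated question directly from the trace, without guessing from agent behavior.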

2. Retry Strategy: Targeted Fix vs Full Regeneration

Observed behavior: The Lead Dev appears to regenerate the file from scratch on each retry rather than applying a surgical fix to the specific issue QA identified.

Proposed improvements:

  • Structured failure context: On retry, the Lead Dev prompt should include (a) the exact QA failure description, (b) the current file contents as written to disk, and (c) an explicit instruction: "Apply the minimum change to fix the reported issue. Do NOT rewrite the file from scratch." (See the sketch after this list.)
  • Diff-mode retries: For simple failures (single file, clear fix), the retry prompt could instruct the Dev to output only the changed lines rather than the full file.
  • Failure classification: QA could tag failures as trivial-fix (formatting, off-by-one, missing import) vs structural (wrong approach, missing dependency, design flaw). Trivial fixes get a more constrained retry prompt.
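
A minimal sketch of what the structured retry context and constrained prompt could look like (all names are hypothetical; the orchestrator's real types aren't part of this issue):

```python
from dataclasses import dataclass

@dataclass
class RetryContext:
    """What the Lead Dev needs to make a targeted fix (hypothetical shape)."""
    qa_failure_report: str  # (a) the exact QA verdict, verbatim
    file_path: str
    file_contents: str      # (b) the file as currently written to disk
    failure_class: str      # "trivial-fix" or "structural"

TRIVIAL_FIX_INSTRUCTION = (
    "Apply the minimum change to fix the reported issue. "
    "Do NOT rewrite the file from scratch. "
    "Output only the changed lines as a unified diff."
)

def build_retry_prompt(ctx: RetryContext) -> str:
    """Render a constrained retry prompt; trivial failures get the diff-mode instruction."""
    if ctx.failure_class == "trivial-fix":
        instruction = TRIVIAL_FIX_INSTRUCTION
    else:
        instruction = "Revise your approach to address the failure described above."
    return (
        "QA rejected your previous attempt.\n\n"
        f"--- QA failure report (verbatim) ---\n{ctx.qa_failure_report}\n\n"
        f"--- Current contents of {ctx.file_path} ---\n{ctx.file_contents}\n\n"
        f"--- Instruction ---\n{instruction}"
    )
```

Note that `failure_class` doubles as the hook for both the diff-mode and failure-classification ideas above: one field, three of the proposed improvements.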

3. QA Acceptance Criteria Quality

Observed behavior: QA invented its own acceptance criteria for the Triforce layout (requiring exactly 6 leading spaces) that may not match what the user actually wanted. The task had no user-provided acceptance criteria.

Questions:

  • When acceptance criteria are empty, how does QA decide what "correct" means?
  • Should QA be more lenient on subjective outputs (ASCII art, formatting) when no explicit criteria exist?
  • Should the Planner inject default acceptance criteria like "script runs without errors and produces visible output" for tasks with no user-specified criteria?
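
If the Planner-injection route is taken, the fallback can be very small (sketch; the default wording below is illustrative, not a decided policy):

```python
DEFAULT_ACCEPTANCE_CRITERIA = [
    "The script runs without errors (exit_code == 0).",
    "The script produces visible output on stdout.",
]

def ensure_acceptance_criteria(user_criteria: list[str]) -> list[str]:
    """Fall back to lenient defaults when the user supplied no criteria,
    so QA doesn't have to invent strict ones (e.g. exact indentation)."""
    return user_criteria if user_criteria else list(DEFAULT_ACCEPTANCE_CRITERIA)
```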

4. Sandbox Output as Ground Truth

Key insight: The sandbox executed the script successfully (1 test passed), so the actual output already existed. QA could have compared the real execution output against expectations rather than doing a static code review of spacing.

Proposed improvement: On retry, include the sandbox stdout in the failure context so the Lead Dev can see exactly what was produced vs what QA expected.
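
A sketch of making that comparison concrete as a unified diff of expected vs. actual stdout, using difflib from the standard library (the function name and expected string are illustrative):

```python
import difflib

def format_output_mismatch(expected: str, actual: str) -> str:
    """Render a unified diff of expected vs. actual sandbox stdout for the retry prompt."""
    diff = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile="expected_stdout",
        tofile="sandbox_stdout",
    )
    return "".join(diff)

# The Triforce failure: the apex line missing its 6 leading spaces.
print(format_output_mismatch("      /\\\n", "/\\\n"))
```

A diff like this shows the Lead Dev exactly which characters differ, which is far harder to misread than a prose description of indentation.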

Acceptance Criteria

  • Document what the Lead Dev prompt looks like on retry (trace analysis)
  • Identify gaps in the failure context passed between QA → orchestrator → Lead Dev
  • Prototype at least one improvement (structured retry context OR failure classification)
  • Test the improvement against the same Triforce task (should pass in ≤1 retry)
  • Document findings and recommendations for remaining improvements

Related

Effort

Medium-Large (investigation + prototype)
