Problem
During an E2E test run, the Lead Dev was tasked with writing a Python script that prints a Triforce from Zelda. The sandbox executed the script successfully (exit_code=0), but QA identified a clear, specific formatting issue: the first line was missing its 6 leading spaces (`/\` instead of `      /\`).
Despite QA providing an exact, unambiguous failure description on each retry — specifying the expected output, the actual output, and precisely what was wrong — the Lead Dev failed to fix the issue across all 3 retry attempts. The task exhausted its retry budget on what should have been a trivial one-line fix.
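For reference, a minimal sketch of the output QA was checking for. The exact layout is an assumption (the issue does not include the full expected art), but it is consistent with the "6 leading spaces on the first line" criterion:

```python
# A plausible Triforce layout consistent with QA's criterion:
# the first line needs 6 leading spaces to center the top triangle
# over the two bottom ones.
triforce = [
    "      /\\",
    "     /  \\",
    "    /____\\",
    "  /\\      /\\",
    " /  \\    /  \\",
    "/____\\  /____\\",
]
print("\n".join(triforce))
```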
Why This Matters
This isn't about the Triforce. It reveals a systemic weakness in how the retry loop communicates failure context to the Lead Dev. If the agent can't fix a spacing issue given an exact error description, it will fail on anything requiring iterative correction — which is the entire point of the retry mechanism.
Deep Dive Areas
1. QA Failure Report → Lead Dev Context Transfer
Current behavior: QA produces a verdict with a failure description. The Lead Dev receives this on retry but may not be effectively using it.
Questions to investigate:
- What exactly does the Lead Dev see in its prompt on retry? Is the QA failure report included verbatim, or summarized/truncated?
- Does the Lead Dev receive the current file contents alongside the failure report, or does it regenerate from scratch each time?
- Is there a "diff mindset" — does the Dev understand it should make a targeted fix, or does it rewrite the entire file?
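One way to make the answers concrete is to build the retry prompt explicitly rather than relying on whatever the loop happens to pass through. A minimal sketch — all names here are hypothetical, not the actual orchestrator API:

```python
def build_retry_prompt(qa_report: str, file_path: str, file_contents: str) -> str:
    """Assemble a retry prompt that forces the QA failure report and the
    current on-disk file into the Lead Dev's context verbatim.

    Hypothetical helper for illustration -- argument names and prompt
    wording are assumptions, not the existing implementation.
    """
    return (
        "Your previous attempt failed QA. The failure report is included "
        "verbatim below -- do not summarize or reinterpret it.\n\n"
        f"=== QA FAILURE REPORT ===\n{qa_report}\n\n"
        f"=== CURRENT CONTENTS OF {file_path} ===\n{file_contents}\n\n"
        "Apply the minimum change that fixes the reported issue. "
        "Do NOT rewrite the file from scratch."
    )
```

With a builder like this, "what does the Lead Dev see on retry?" stops being a question and becomes a testable invariant.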
2. Retry Strategy: Targeted Fix vs Full Regeneration
Observed behavior: The Lead Dev appears to regenerate the file from scratch on each retry rather than applying a surgical fix to the specific issue QA identified.
Proposed improvements:
- Structured failure context: On retry, the Lead Dev prompt should include: (a) the exact QA failure description, (b) the current file contents as written to disk, (c) explicit instruction: "Apply the minimum change to fix the reported issue. Do NOT rewrite the file from scratch."
- Diff-mode retries: For simple failures (single file, clear fix), the retry prompt could instruct the Dev to output only the changed lines rather than the full file.
- Failure classification: QA could tag failures as `trivial-fix` (formatting, off-by-one, missing import) vs `structural` (wrong approach, missing dependency, design flaw). Trivial fixes get a more constrained retry prompt.
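A failure classifier of this kind could start as a simple keyword heuristic. A sketch, assuming QA reports are free text — the categories and keyword buckets below are illustrative, and a real classifier would more likely be QA-model-driven:

```python
import re

# Illustrative keyword buckets, not an exhaustive taxonomy.
TRIVIAL_PATTERNS = [
    r"leading spaces?", r"trailing whitespace", r"off[- ]by[- ]one",
    r"missing import", r"typo", r"formatting",
]
STRUCTURAL_PATTERNS = [
    r"wrong approach", r"missing dependency", r"design flaw",
    r"does not implement", r"architecture",
]

def classify_failure(report: str) -> str:
    """Tag a QA failure report as 'trivial-fix', 'structural', or 'unknown'."""
    text = report.lower()
    # Check structural first: a report mentioning both should not get
    # the constrained trivial-fix retry prompt.
    if any(re.search(p, text) for p in STRUCTURAL_PATTERNS):
        return "structural"
    if any(re.search(p, text) for p in TRIVIAL_PATTERNS):
        return "trivial-fix"
    return "unknown"
```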
3. QA Acceptance Criteria Quality
Observed behavior: QA invented its own acceptance criteria for the Triforce layout (requiring exactly 6 leading spaces) that may not match what the user actually wanted. The task had no user-provided acceptance criteria.
Questions:
- When acceptance criteria are empty, how does QA decide what "correct" means?
- Should QA be more lenient on subjective outputs (ASCII art, formatting) when no explicit criteria exist?
- Should the Planner inject default acceptance criteria like "script runs without errors and produces visible output" for tasks with no user-specified criteria?
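The last question could be prototyped as a small Planner-side guard. A sketch with a hypothetical task shape and default criteria — the real task schema is an assumption:

```python
DEFAULT_CRITERIA = [
    "script runs without errors (exit code 0)",
    "script produces visible output on stdout",
]

def ensure_acceptance_criteria(task: dict) -> dict:
    """If the user supplied no acceptance criteria, inject permissive
    defaults so QA has an explicit, lenient bar instead of inventing
    its own (e.g. 'exactly 6 leading spaces')."""
    if not task.get("acceptance_criteria"):
        task = {**task,
                "acceptance_criteria": list(DEFAULT_CRITERIA),
                "criteria_source": "planner-default"}
    return task
```

Tagging the source of the criteria also lets QA calibrate leniency: strict on user-provided criteria, lenient on injected defaults.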
4. Sandbox Output as Ground Truth
Key insight: The sandbox executed the script successfully (1 test passed). The actual output existed — QA could have compared the real execution output against expectations rather than doing a static code review of spacing.
Proposed improvement: On retry, include the sandbox stdout in the failure context so the Lead Dev can see exactly what was produced vs what QA expected.
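This could go one step further than raw stdout: since both the actual output and QA's expectation exist as text, the failure context could carry a unified diff, which makes a spacing error unmissable. A sketch using only the standard library (function name is hypothetical):

```python
import difflib

def output_diff(expected: str, actual: str) -> str:
    """Render expected-vs-actual sandbox stdout as a unified diff,
    suitable for embedding in the retry failure context."""
    return "\n".join(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected_stdout", tofile="sandbox_stdout",
        lineterm="",
    ))
```

For the Triforce case, the diff would pair a `-` line carrying the indented `/\` with a `+` line carrying the unindented one, showing the missing indentation directly where prose descriptions evidently did not get through.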
Acceptance Criteria
Related
- `orchestrator.py` conditional routing after QA failure
- `qa_node` in `orchestrator.py`
- `apply_code_node` reads from `parsed_files` or tool-written files on disk (PR fix: deduplicate memory_writes on retry + add min_score memory filtering #123)
Effort
Medium-Large (investigation + prototype)