Problem
During an E2E test run, the Lead Dev was tasked with writing a Python script that prints a Triforce from Zelda. The sandbox executed the script successfully (exit_code=0), but QA identified a clear, specific formatting issue: the first line was missing its 6 leading spaces (`/\` instead of `      /\`).
Despite QA providing an exact, unambiguous failure description on each retry — specifying the expected output, the actual output, and precisely what was wrong — the Lead Dev failed to fix the issue across all 3 retry attempts. The task exhausted its retry budget on what should have been a trivial one-line fix.
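For reference, a minimal sketch of the output QA was checking for. The exact layout is an assumption (the issue does not include the full expected art), but it is consistent with the "6 leading spaces on the first line" criterion:

```python
# A plausible Triforce layout consistent with QA's criterion:
# the first line needs 6 leading spaces to center the top triangle
# over the two bottom ones.
triforce = [
    "      /\\",
    "     /  \\",
    "    /____\\",
    "  /\\      /\\",
    " /  \\    /  \\",
    "/____\\  /____\\",
]
print("\n".join(triforce))
```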
Why This Matters
This isn't about the Triforce. It reveals a systemic weakness in how the retry loop communicates failure context to the Lead Dev. If the agent can't fix a spacing issue given an exact error description, it will fail on anything requiring iterative correction — which is the entire point of the retry mechanism.
Deep Dive Areas
1. QA Failure Report → Lead Dev Context Transfer
Current behavior: QA produces a verdict with a failure description. The Lead Dev receives this on retry but may not be effectively using it.
Questions to investigate:
- What exactly does the Lead Dev see in its prompt on retry? Is the QA failure report included verbatim, or summarized/truncated?
- Does the Lead Dev receive the current file contents alongside the failure report, or does it regenerate from scratch each time?
- Is there a "diff mindset" — does the Dev understand it should make a targeted fix, or does it rewrite the entire file?
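One way to make the answers concrete is to build the retry prompt explicitly rather than relying on whatever the loop happens to pass through. A minimal sketch — all names here are hypothetical, not the actual orchestrator API:

```python
def build_retry_prompt(qa_report: str, file_path: str, file_contents: str) -> str:
    """Assemble a retry prompt that forces the QA failure report and the
    current on-disk file into the Lead Dev's context verbatim.

    Hypothetical helper for illustration -- argument names and prompt
    wording are assumptions, not the existing implementation.
    """
    return (
        "Your previous attempt failed QA. The failure report is included "
        "verbatim below -- do not summarize or reinterpret it.\n\n"
        f"=== QA FAILURE REPORT ===\n{qa_report}\n\n"
        f"=== CURRENT CONTENTS OF {file_path} ===\n{file_contents}\n\n"
        "Apply the minimum change that fixes the reported issue. "
        "Do NOT rewrite the file from scratch."
    )
```

With a builder like this, "what does the Lead Dev see on retry?" stops being a question and becomes a testable invariant.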
2. Retry Strategy: Targeted Fix vs Full Regeneration
Observed behavior: The Lead Dev appears to regenerate the file from scratch on each retry rather than applying a surgical fix to the specific issue QA identified.
Proposed improvements:
- Structured failure context: On retry, the Lead Dev prompt should include: (a) the exact QA failure description, (b) the current file contents as written to disk, (c) explicit instruction: "Apply the minimum change to fix the reported issue. Do NOT rewrite the file from scratch."
- Diff-mode retries: For simple failures (single file, clear fix), the retry prompt could instruct the Dev to output only the changed lines rather than the full file.
- Failure classification: QA could tag failures as `trivial-fix` (formatting, off-by-one, missing import) vs `structural` (wrong approach, missing dependency, design flaw). Trivial fixes get a more constrained retry prompt.
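A failure classifier of this kind could start as a simple keyword heuristic. A sketch, assuming QA reports are free text — the categories and keyword buckets below are illustrative, and a real classifier would more likely be QA-model-driven:

```python
import re

# Illustrative keyword buckets, not an exhaustive taxonomy.
TRIVIAL_PATTERNS = [
    r"leading spaces?", r"trailing whitespace", r"off[- ]by[- ]one",
    r"missing import", r"typo", r"formatting",
]
STRUCTURAL_PATTERNS = [
    r"wrong approach", r"missing dependency", r"design flaw",
    r"does not implement", r"architecture",
]

def classify_failure(report: str) -> str:
    """Tag a QA failure report as 'trivial-fix', 'structural', or 'unknown'."""
    text = report.lower()
    # Check structural first: a report mentioning both should not get
    # the constrained trivial-fix retry prompt.
    if any(re.search(p, text) for p in STRUCTURAL_PATTERNS):
        return "structural"
    if any(re.search(p, text) for p in TRIVIAL_PATTERNS):
        return "trivial-fix"
    return "unknown"
```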
3. QA Acceptance Criteria Quality
Observed behavior: QA invented its own acceptance criteria for the Triforce layout (requiring exactly 6 leading spaces) that may not match what the user actually wanted. The task had no user-provided acceptance criteria.
Questions:
- When acceptance criteria are empty, how does QA decide what "correct" means?
- Should QA be more lenient on subjective outputs (ASCII art, formatting) when no explicit criteria exist?
- Should the Planner inject default acceptance criteria like "script runs without errors and produces visible output" for tasks with no user-specified criteria?
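The last question could be prototyped as a small Planner-side guard. A sketch with a hypothetical task shape and default criteria — the real task schema is an assumption:

```python
DEFAULT_CRITERIA = [
    "script runs without errors (exit code 0)",
    "script produces visible output on stdout",
]

def ensure_acceptance_criteria(task: dict) -> dict:
    """If the user supplied no acceptance criteria, inject permissive
    defaults so QA has an explicit, lenient bar instead of inventing
    its own (e.g. 'exactly 6 leading spaces')."""
    if not task.get("acceptance_criteria"):
        task = {**task,
                "acceptance_criteria": list(DEFAULT_CRITERIA),
                "criteria_source": "planner-default"}
    return task
```

Tagging the source of the criteria also lets QA calibrate leniency: strict on user-provided criteria, lenient on injected defaults.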
4. Sandbox Output as Ground Truth
Key insight: The sandbox executed the script successfully (1 test passed). The actual output existed — QA could have compared the real execution output against expectations rather than doing a static code review of spacing.
Proposed improvement: On retry, include the sandbox stdout in the failure context so the Lead Dev can see exactly what was produced vs what QA expected.
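This could go one step further than raw stdout: since both the actual output and QA's expectation exist as text, the failure context could carry a unified diff, which makes a spacing error unmissable. A sketch using only the standard library (function name is hypothetical):

```python
import difflib

def output_diff(expected: str, actual: str) -> str:
    """Render expected-vs-actual sandbox stdout as a unified diff,
    suitable for embedding in the retry failure context."""
    return "\n".join(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected_stdout", tofile="sandbox_stdout",
        lineterm="",
    ))
```

For the Triforce case, the diff would pair a `-` line carrying the indented `/\` with a `+` line carrying the unindented one, showing the missing indentation directly where prose descriptions evidently did not get through.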
Acceptance Criteria
Related
- `orchestrator.py` conditional routing after QA failure
- `qa_node` in `orchestrator.py`
- `apply_code_node` reads from `parsed_files` or tool-written files on disk (PR fix: deduplicate memory_writes on retry + add min_score memory filtering #123)
Effort
Medium-Large (investigation + prototype)