fix: deduplicate memory_writes on retry + add min_score memory filtering #123
Abernaughty merged 4 commits into main.
Conversation
Fixes two issues found via Langfuse trace review (task-ba22eafa):
1. Developer memory_writes dedup (High):
- developer_node was appending a new "Implemented blueprint" entry
on every retry pass, producing 3x identical entries.
- Now checks for existing entry with same (task_id, source_agent)
and replaces in-place instead of appending.
2. Memory context relevance filtering (Medium):
- _fetch_memory_context had no similarity threshold, so unrelated
entries from prior tasks leaked into context (5/7 entries were
noise about a "greet" function in a "triforce" task).
- Added min_score parameter to MemoryStore.query() protocol,
ChromaMemoryStore implementation, and _fetch_memory_context.
- Default min_score=0.3 (cosine) filters obvious noise while
keeping anything remotely relevant.
Both changes are backward-compatible (min_score=None preserves
existing behavior).
Tests: 65 passed (60 existing + 5 new), ruff clean.
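The replace-in-place dedup described in fix 1 can be sketched as follows. The entry fields (task_id, source_agent, content) mirror the PR description, but the exact record shape and helper name are assumptions, not the project's real code:

```python
def upsert_memory_write(memory_writes: list[dict], entry: dict) -> None:
    """Replace an existing entry with the same (task_id, source_agent)
    in-place (last-wins) instead of appending a duplicate on retry."""
    key = (entry["task_id"], entry["source_agent"])
    for i, existing in enumerate(memory_writes):
        if (existing["task_id"], existing["source_agent"]) == key:
            memory_writes[i] = entry  # retry pass: overwrite, don't append
            return
    memory_writes.append(entry)

# Simulate the failure mode from the trace: initial pass + 2 retries.
writes: list[dict] = []
for attempt in range(3):
    upsert_memory_write(writes, {
        "task_id": "task-ba22eafa",
        "source_agent": "developer",
        "content": f"Implemented blueprint (attempt {attempt})",
    })
```

With the old append-only behavior the loop above would leave three identical entries; with the upsert it leaves one, and entries from other agents (QA, etc.) are never touched because their key differs.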
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info: configuration defaults, review profile CHILL, plan Pro. 1 file selected for processing; 1 file skipped from review as similar to previous changes.

🚥 Pre-merge checks: 2 passed, 1 failed (warning).
🧹 Nitpick comments (1)
dev-suite/src/memory/chroma_store.py (1)
294-296: Docstring claims incorrect score range for edge cases.

The min_score parameter's docstring states "0.0-1.0", but with Chroma's cosine distance space (range [0, 2]), the formula score = 1 - dist can produce scores in the range [-1, 1]. While negative scores are unlikely in practice (requiring distances > 1, i.e. nearly opposite vectors), the documentation is technically inaccurate. Consider either:
- Clamping: score = max(0.0, 1 - dist)
- Updating the docstring to clarify the actual score range
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dev-suite/src/memory/chroma_store.py` around lines 294 - 296, The docstring for the min_score parameter is incorrect because the current computation score = 1 - dist (with Chroma cosine distance in [0,2]) can yield scores in [-1,1]; either clamp the computed score to the [0.0,1.0] range (e.g., replace the calculation with a clamped result such as max(0.0, 1 - dist) before comparing to min_score) or update the docstring for the function/method that declares the min_score parameter to state the true possible range ([-1.0, 1.0]) and note that negative scores indicate nearly opposite vectors; ensure you modify the same block that contains the variable score and the comparison against min_score.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 819b57d8-d140-4fa4-8415-54c03b98cc75
📒 Files selected for processing (5)
dev-suite/src/memory/chroma_store.py, dev-suite/src/memory/protocol.py, dev-suite/src/orchestrator.py, dev-suite/tests/test_memory.py, dev-suite/tests/test_orchestrator.py
When the developer uses filesystem_write tools, apply_code_node now reads the clean files from disk instead of re-parsing the LLM's generated_code text, which contains markdown fences and summary prose.

Root cause: trace review showed the developer wrote 567 chars of clean Python via filesystem_write, but apply_code re-parsed the 1533-char generated_code (which included ```python fences and ## Summary prose), producing corrupted 677-char output that failed sandbox validation. This caused 2 wasted retries on a task the developer got right on the first attempt.

Flow with fix:
1. Developer writes clean code via filesystem_write -> disk
2. apply_code detects a successful filesystem_write in tool_calls_log
3. Reads files from disk using blueprint.target_files
4. Passes clean content to sandbox (no re-parsing needed)
5. Falls back to the parser if the disk read fails or no tools were used

Tests: 4 new tests in TestApplyCodeDiskRead, all passing.
Actionable comments posted: 1
🧹 Nitpick comments (2)
dev-suite/tests/test_apply_code.py (2)
187-213: Consider extracting a shared state builder fixture.

_make_state here largely duplicates setup from TestApplyCodeNode; a shared fixture/helper would reduce drift and maintenance overhead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dev-suite/tests/test_apply_code.py` around lines 187 - 213, Extract the duplicated state builder into a shared test fixture/helper so both this test module and TestApplyCodeNode use the same setup: move the logic inside _make_state (which builds the base dict with Blueprint, generated_code, WorkflowStatus, etc.) into a reusable function or pytest fixture (e.g., make_state_fixture or shared_make_state) and update tests to call that helper instead of duplicating the dict; ensure the helper returns the same structure and accepts **overrides so existing callers of _make_state (and TestApplyCodeNode) can pass custom fields unchanged.
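The suggested shared builder might look like this. The default keys are placeholders, since the real state schema (Blueprint, WorkflowStatus, etc.) lives in the project's test suite:

```python
def make_state(**overrides) -> dict:
    """Build a baseline workflow-state dict; callers override any field.

    Accepting **overrides lets existing callers of _make_state and
    TestApplyCodeNode pass custom fields unchanged, as the review asks.
    """
    state = {
        "workspace_root": "/tmp/workspace",  # illustrative default
        "generated_code": "",
        "tool_calls_log": [],
        "parsed_files": [],
        "trace": [],
    }
    state.update(overrides)
    return state
```

Each test then calls make_state(tool_calls_log=[...]) instead of rebuilding the dict, so a schema change is made in one place.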
236-237: Avoid brittle assertions on free-form trace wording.

Checking substrings like "tool-written"/"falling back" makes tests fragile to harmless trace-text edits. Prefer asserting structured behavior (source/content of parsed_files) or stable trace event tokens.

Also applies to: 273-274
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dev-suite/tests/test_apply_code.py` around lines 236 - 237, The assertion is brittle because it inspects free-form trace text (result["trace"]) for substrings like "tool-written"; instead, change the test to assert structured behavior — e.g., check the parsed_files collection (parsed_files or result["parsed_files"]) for the expected source/content or file count, or inspect trace entries for stable tokens/fields (such as an explicit event type or code like event["type"] == "tool_written") rather than matching free-form messages; update both the assertion at the current location (assert any("tool-written" in t for t in result["trace"])) and the similar one around lines 273-274 to use those structured checks (result, result["trace"], parsed_files, event["type"] are the symbols to modify).
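One way to follow this suggestion, assuming trace entries were structured dicts rather than free-form strings (the event shape below is hypothetical; the project currently records plain messages):

```python
# Hypothetical structured trace events, per the review suggestion.
trace = [
    {"type": "tool_written", "path": "app.py"},
    {"type": "sandbox_ok"},
]


def has_event(events: list[dict], event_type: str) -> bool:
    """Stable check on an event-type token instead of message substrings,
    so cosmetic wording changes in trace messages can't break tests."""
    return any(event.get("type") == event_type for event in events)
```

A test would then assert has_event(trace, "tool_written") instead of grepping message text, and assertions on parsed_files content cover the behavioral side.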
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 095c5cec-7c26-4104-a9f6-6a5226b0ecf7
📒 Files selected for processing (1)
dev-suite/tests/test_apply_code.py
```python
def test_prefers_disk_over_parser_when_tools_used(self, tmp_path):
    """When filesystem_write was used, read files from disk instead of re-parsing."""
    from src.orchestrator import apply_code_node

    # Write clean code to disk (simulating what filesystem_write did)
    (tmp_path / "app.py").write_text("print('hello from disk')\n")

    # generated_code has fences that would corrupt parsing
    state = self._make_state(
        workspace_root=str(tmp_path),
        tool_calls_log=[
            {"agent": "developer", "turn": 1, "tool": "filesystem_write",
             "args_preview": "{}", "result_preview": "ok", "success": True},
        ],
    )

    result = apply_code_node(state)

    assert len(result["parsed_files"]) == 1
    assert result["parsed_files"][0]["path"] == "app.py"
    assert result["parsed_files"][0]["content"] == "print('hello from disk')\n"
    assert any("tool-written" in t for t in result["trace"])
```
Include the apply_code_node disk-read branch in this PR (or gate these tests until it lands).
These tests currently require behavior that is not present in src/orchestrator.py (see the parser-only flow around lines 461-505), which is why CI is failing here. Please ship the production disk-read path (successful developer filesystem_write → read from workspace → fallback to parser) in the same PR so this suite can pass.
Also applies to: 255-274
🧰 Tools
🪛 GitHub Actions: CI
[error] 235-235: Test failed: TestApplyCodeDiskRead.test_prefers_disk_over_parser_when_tools_used. Expected parsed_files[0].content to be "print('hello from disk')\n" but got "print('hello')".
Summary

Fixes two issues discovered via Langfuse trace review of task-ba22eafa (triforce script, PR #122).

Fix 1: Developer memory_writes dedup on retry (High)

Problem: developer_node blindly appended a new "Implemented blueprint" entry to memory_writes on every retry pass. With 2 retries, this produced 3x identical L1 entries in the state. The summarizer caught them at flush time, but the in-flight state was wasteful and would inflate context on longer tasks.

Fix: Before appending, check whether an entry with the same (task_id, source_agent="developer") already exists. If so, replace it in-place (last-wins). Other agents' entries (QA, etc.) are untouched.

Fix 2: Memory context relevance filtering (Medium)

Problem: _fetch_memory_context() queried Chroma with n_results=10 and no similarity threshold. In the triforce trace, 5 of 7 memory entries fed to the architect were about a completely unrelated "greet function" from prior tasks: pure noise wasting context-window tokens.

Fix: Added a min_score parameter to:
- the MemoryStore protocol (protocol.py)
- the ChromaMemoryStore.query() implementation (chroma_store.py)
- the _fetch_memory_context() caller in orchestrator.py, which uses min_score=0.3

The default min_score=None preserves backward compatibility for all existing callers.

Tests
- TestDeveloperMemoryDedup: verifies no duplicates on retry and that other agents' entries are preserved
- TestEnrichedMetadata.test_query_min_score_*: verifies filtering, backward compat, and a high threshold

Files Changed

- dev-suite/src/orchestrator.py: developer_node dedup; min_score=0.3 in _fetch_memory_context
- dev-suite/src/memory/chroma_store.py: min_score param in query()
- dev-suite/src/memory/protocol.py: min_score param in the protocol signature
- dev-suite/tests/test_orchestrator.py
- dev-suite/tests/test_memory.py
New Features
Bug Fixes
Tests