fix: deduplicate memory_writes on retry + add min_score memory filtering #123
Abernaughty merged 4 commits into main.
Conversation
Fixes two issues found via Langfuse trace review (task-ba22eafa):
1. Developer memory_writes dedup (High):
- developer_node was appending a new "Implemented blueprint" entry
on every retry pass, producing 3x identical entries.
- Now checks for existing entry with same (task_id, source_agent)
and replaces in-place instead of appending.
2. Memory context relevance filtering (Medium):
- _fetch_memory_context had no similarity threshold, so unrelated
entries from prior tasks leaked into context (5/7 entries were
noise about a "greet" function in a "triforce" task).
- Added min_score parameter to MemoryStore.query() protocol,
ChromaMemoryStore implementation, and _fetch_memory_context.
- Default min_score=0.3 (cosine) filters obvious noise while
keeping anything remotely relevant.
Both changes are backward-compatible (min_score=None preserves
existing behavior).
Tests: 65 passed (60 existing + 5 new), ruff clean.
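The replace-in-place dedup described in fix 1 can be sketched as follows. The entry fields (task_id, source_agent, content) mirror the PR description, but the exact record shape and helper name are assumptions, not the project's real code:

```python
def upsert_memory_write(memory_writes: list[dict], entry: dict) -> None:
    """Replace an existing entry with the same (task_id, source_agent)
    in-place (last-wins) instead of appending a duplicate on retry."""
    key = (entry["task_id"], entry["source_agent"])
    for i, existing in enumerate(memory_writes):
        if (existing["task_id"], existing["source_agent"]) == key:
            memory_writes[i] = entry  # retry pass: overwrite, don't append
            return
    memory_writes.append(entry)

# Simulate the failure mode from the trace: initial pass + 2 retries.
writes: list[dict] = []
for attempt in range(3):
    upsert_memory_write(writes, {
        "task_id": "task-ba22eafa",
        "source_agent": "developer",
        "content": f"Implemented blueprint (attempt {attempt})",
    })
```

With the old append-only behavior the loop above would leave three identical entries; with the upsert it leaves one, and entries from other agents (QA, etc.) are never touched because their key differs.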
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info: configuration defaults, review profile CHILL, plan Pro. 1 file selected for processing; 1 file skipped from review as similar to previous changes.

🚥 Pre-merge checks: 2 passed, 1 failed (warning).
🧹 Nitpick comments (1)
dev-suite/src/memory/chroma_store.py (1)
294-296: Docstring claims incorrect score range for edge cases.

The min_score parameter's docstring states "0.0-1.0", but with Chroma's cosine distance space (range [0, 2]), the formula score = 1 - dist can produce scores in the range [-1, 1]. While negative scores are unlikely in practice (requiring distances > 1, i.e. nearly opposite vectors), the documentation is technically inaccurate. Consider either:
- Clamping: score = max(0.0, 1 - dist)
- Updating the docstring to clarify the actual score range
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dev-suite/src/memory/chroma_store.py` around lines 294 - 296, The docstring for the min_score parameter is incorrect because the current computation score = 1 - dist (with Chroma cosine distance in [0,2]) can yield scores in [-1,1]; either clamp the computed score to the [0.0,1.0] range (e.g., replace the calculation with a clamped result such as max(0.0, 1 - dist) before comparing to min_score) or update the docstring for the function/method that declares the min_score parameter to state the true possible range ([-1.0, 1.0]) and note that negative scores indicate nearly opposite vectors; ensure you modify the same block that contains the variable score and the comparison against min_score.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 819b57d8-d140-4fa4-8415-54c03b98cc75
📒 Files selected for processing (5)
dev-suite/src/memory/chroma_store.py, dev-suite/src/memory/protocol.py, dev-suite/src/orchestrator.py, dev-suite/tests/test_memory.py, dev-suite/tests/test_orchestrator.py
When the developer uses filesystem_write tools, apply_code_node now reads the clean files from disk instead of re-parsing the LLM's generated_code text, which contains markdown fences and summary prose.

Root cause: trace review showed the developer wrote 567 chars of clean Python via filesystem_write, but apply_code re-parsed the 1533-char generated_code (which included ```python fences and ## Summary prose), producing corrupted 677-char output that failed sandbox validation. This caused 2 wasted retries on a task the developer got right on the first attempt.

Flow with fix:
1. Developer writes clean code via filesystem_write -> disk
2. apply_code detects a successful filesystem_write in tool_calls_log
3. Reads files from disk using blueprint.target_files
4. Passes clean content to sandbox (no re-parsing needed)
5. Falls back to the parser if the disk read fails or no tools were used

Tests: 4 new tests in TestApplyCodeDiskRead, all passing.
Actionable comments posted: 1
🧹 Nitpick comments (2)
dev-suite/tests/test_apply_code.py (2)
187-213: Consider extracting a shared state builder fixture.

_make_state here largely duplicates setup from TestApplyCodeNode; a shared fixture/helper would reduce drift and maintenance overhead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dev-suite/tests/test_apply_code.py` around lines 187 - 213, Extract the duplicated state builder into a shared test fixture/helper so both this test module and TestApplyCodeNode use the same setup: move the logic inside _make_state (which builds the base dict with Blueprint, generated_code, WorkflowStatus, etc.) into a reusable function or pytest fixture (e.g., make_state_fixture or shared_make_state) and update tests to call that helper instead of duplicating the dict; ensure the helper returns the same structure and accepts **overrides so existing callers of _make_state (and TestApplyCodeNode) can pass custom fields unchanged.
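The suggested shared builder might look like this. The default keys are placeholders, since the real state schema (Blueprint, WorkflowStatus, etc.) lives in the project's test suite:

```python
def make_state(**overrides) -> dict:
    """Build a baseline workflow-state dict; callers override any field.

    Accepting **overrides lets existing callers of _make_state and
    TestApplyCodeNode pass custom fields unchanged, as the review asks.
    """
    state = {
        "workspace_root": "/tmp/workspace",  # illustrative default
        "generated_code": "",
        "tool_calls_log": [],
        "parsed_files": [],
        "trace": [],
    }
    state.update(overrides)
    return state
```

Each test then calls make_state(tool_calls_log=[...]) instead of rebuilding the dict, so a schema change is made in one place.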
236-237: Avoid brittle assertions on free-form trace wording.

Checking substrings like "tool-written"/"falling back" makes tests fragile to harmless trace-text edits. Prefer asserting structured behavior (source/content of parsed_files) or stable trace event tokens.

Also applies to: 273-274
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dev-suite/tests/test_apply_code.py` around lines 236 - 237, The assertion is brittle because it inspects free-form trace text (result["trace"]) for substrings like "tool-written"; instead, change the test to assert structured behavior — e.g., check the parsed_files collection (parsed_files or result["parsed_files"]) for the expected source/content or file count, or inspect trace entries for stable tokens/fields (such as an explicit event type or code like event["type"] == "tool_written") rather than matching free-form messages; update both the assertion at the current location (assert any("tool-written" in t for t in result["trace"])) and the similar one around lines 273-274 to use those structured checks (result, result["trace"], parsed_files, event["type"] are the symbols to modify).
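One way to follow this suggestion, assuming trace entries were structured dicts rather than free-form strings (the event shape below is hypothetical; the project currently records plain messages):

```python
# Hypothetical structured trace events, per the review suggestion.
trace = [
    {"type": "tool_written", "path": "app.py"},
    {"type": "sandbox_ok"},
]


def has_event(events: list[dict], event_type: str) -> bool:
    """Stable check on an event-type token instead of message substrings,
    so cosmetic wording changes in trace messages can't break tests."""
    return any(event.get("type") == event_type for event in events)
```

A test would then assert has_event(trace, "tool_written") instead of grepping message text, and assertions on parsed_files content cover the behavioral side.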
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 095c5cec-7c26-4104-a9f6-6a5226b0ecf7
📒 Files selected for processing (1)
dev-suite/tests/test_apply_code.py
```python
def test_prefers_disk_over_parser_when_tools_used(self, tmp_path):
    """When filesystem_write was used, read files from disk instead of re-parsing."""
    from src.orchestrator import apply_code_node

    # Write clean code to disk (simulating what filesystem_write did)
    (tmp_path / "app.py").write_text("print('hello from disk')\n")

    # generated_code has fences that would corrupt parsing
    state = self._make_state(
        workspace_root=str(tmp_path),
        tool_calls_log=[
            {"agent": "developer", "turn": 1, "tool": "filesystem_write",
             "args_preview": "{}", "result_preview": "ok", "success": True},
        ],
    )

    result = apply_code_node(state)

    assert len(result["parsed_files"]) == 1
    assert result["parsed_files"][0]["path"] == "app.py"
    assert result["parsed_files"][0]["content"] == "print('hello from disk')\n"
    assert any("tool-written" in t for t in result["trace"])
```
Include the apply_code_node disk-read branch in this PR (or gate these tests until it lands).
These tests currently require behavior that is not present in src/orchestrator.py (see the parser-only flow around lines 461-505), which is why CI is failing here. Please ship the production disk-read path (successful developer filesystem_write → read from workspace → fallback to parser) in the same PR so this suite can pass.
Also applies to: 255-274
🧰 Tools
🪛 GitHub Actions: CI
[error] 235-235: Test failed: TestApplyCodeDiskRead.test_prefers_disk_over_parser_when_tools_used. Expected parsed_files[0].content to be "print('hello from disk')\n" but got "print('hello')".
Summary

Fixes two issues discovered via Langfuse trace review of task-ba22eafa (triforce script, PR #122).

Fix 1: Developer memory_writes dedup on retry (High)

Problem: developer_node blindly appended a new "Implemented blueprint" entry to memory_writes on every retry pass. With 2 retries, this produced 3x identical L1 entries in the state. The summarizer caught them at flush time, but the in-flight state was wasteful and would inflate context on longer tasks.

Fix: Before appending, check whether an entry with the same (task_id, source_agent="developer") already exists. If so, replace it in-place (last-wins). Other agents' entries (QA, etc.) are untouched.

Fix 2: Memory context relevance filtering (Medium)

Problem: _fetch_memory_context() queried Chroma with n_results=10 and no similarity threshold. In the triforce trace, 5 of 7 memory entries fed to the architect were about a completely unrelated "greet function" from prior tasks: pure noise wasting context-window tokens.

Fix: Added a min_score parameter to:
- the MemoryStore protocol (protocol.py)
- the ChromaMemoryStore.query() implementation (chroma_store.py)
- the _fetch_memory_context() caller in orchestrator.py, which uses min_score=0.3

The default min_score=None preserves backward compatibility for all existing callers.

Tests
- TestDeveloperMemoryDedup: verifies no duplicates on retry and that other agents' entries are preserved
- TestEnrichedMetadata.test_query_min_score_*: verifies filtering, backward compat, and a high threshold

Files Changed

- dev-suite/src/orchestrator.py: developer_node dedup; min_score=0.3 in _fetch_memory_context
- dev-suite/src/memory/chroma_store.py: min_score param in query()
- dev-suite/src/memory/protocol.py: min_score param in the protocol signature
- dev-suite/tests/test_orchestrator.py
- dev-suite/tests/test_memory.py
New Features
Bug Fixes
Tests