
fix: deduplicate memory_writes on retry + add min_score memory filtering #123

Merged
Abernaughty merged 4 commits into main from fix/trace-dedup-memory-relevance
Apr 5, 2026

Conversation


@Abernaughty Abernaughty commented Apr 5, 2026

Summary

Fixes two issues discovered via Langfuse trace review of task-ba22eafa (triforce script, PR #122).

Fix 1: Developer memory_writes dedup on retry (High)

Problem: developer_node blindly appended a new "Implemented blueprint" entry to memory_writes on every retry pass. With 2 retries, this produced 3x identical L1 entries in the state. The summarizer caught them at flush time, but the in-flight state was wasteful and would inflate context on longer tasks.

Fix: Before appending, check if an entry with the same (task_id, source_agent="developer") already exists. If so, replace in-place (last-wins). Other agents' entries (QA, etc.) are untouched.
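The dedup can be sketched as a small upsert helper. The helper name and entry shape below are assumptions based on this description, not the project's exact code:

```python
def upsert_developer_entry(memory_writes: list[dict], new_entry: dict) -> list[dict]:
    """Replace an existing developer entry for the same task (last-wins);
    entries written by other agents (QA, etc.) are left untouched."""
    for i, entry in enumerate(memory_writes):
        if (entry.get("task_id") == new_entry["task_id"]
                and entry.get("source_agent") == "developer"):
            memory_writes[i] = new_entry  # retry pass: replace in place
            return memory_writes
    memory_writes.append(new_entry)  # first pass: nothing to replace
    return memory_writes
```

With this shape, N retries leave exactly one developer entry per task instead of N+1.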

Fix 2: Memory context relevance filtering (Medium)

Problem: _fetch_memory_context() queried Chroma with n_results=10 and no similarity threshold. In the triforce trace, 5 of 7 memory entries fed to the architect were about a completely unrelated "greet function" from prior tasks — pure noise wasting context window tokens.

Fix: Added min_score parameter to:

  • MemoryStore protocol (protocol.py)
  • ChromaMemoryStore.query() implementation (chroma_store.py)
  • _fetch_memory_context() caller in orchestrator.py — uses min_score=0.3

Default min_score=None preserves backward compatibility for all existing callers.

Tests

  • 65 passed (60 existing + 5 new)
  • TestDeveloperMemoryDedup — verifies no duplicates on retry, preserves other agents' entries
  • TestEnrichedMetadata.test_query_min_score_* — verifies filtering, backward compat, high threshold
  • Ruff clean

Files Changed

| File | Change |
| --- | --- |
| dev-suite/src/orchestrator.py | Dedup logic in developer_node, min_score=0.3 in _fetch_memory_context |
| dev-suite/src/memory/chroma_store.py | min_score param in query() |
| dev-suite/src/memory/protocol.py | min_score param in protocol signature |
| dev-suite/tests/test_orchestrator.py | 2 new dedup tests |
| dev-suite/tests/test_memory.py | 3 new min_score tests |

Summary by CodeRabbit

  • New Features

    • Optional relevance-score filtering for memory queries with a moderate default threshold to tighten retrieved context.
    • When developer tools have written files, the system now prefers reading the actual files on disk as the source of code content.
  • Bug Fixes

    • Avoids duplicate developer memory entries on task retry by replacing prior developer writes.
  • Tests

    • Added tests for score filtering, memory deduplication, and disk-read behavior after tool writes.

Fixes two issues found via Langfuse trace review (task-ba22eafa):

1. Developer memory_writes dedup (High):
   - developer_node was appending a new "Implemented blueprint" entry
     on every retry pass, producing 3x identical entries.
   - Now checks for existing entry with same (task_id, source_agent)
     and replaces in-place instead of appending.

2. Memory context relevance filtering (Medium):
   - _fetch_memory_context had no similarity threshold, so unrelated
     entries from prior tasks leaked into context (5/7 entries were
     noise about a "greet" function in a "triforce" task).
   - Added min_score parameter to MemoryStore.query() protocol,
     ChromaMemoryStore implementation, and _fetch_memory_context.
   - Default min_score=0.3 (cosine) filters obvious noise while
     keeping anything remotely relevant.

Both changes are backward-compatible (min_score=None preserves
existing behavior).

Tests: 65 passed (60 existing + 5 new), ruff clean.

coderabbitai bot commented Apr 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dce965e4-f5f5-47e9-86bd-91b2cf407e29

📥 Commits

Reviewing files that changed from the base of the PR and between bb1a08b and de87d0f.

📒 Files selected for processing (1)
  • dev-suite/src/orchestrator.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • dev-suite/src/orchestrator.py

📝 Walkthrough

Walkthrough

Added an optional min_score filter to memory queries (protocol and Chroma implementation), applied min_score=0.3 in orchestrator memory fetches, changed developer memory write handling to replace on retry (dedupe), and updated apply_code to prefer reading tool-written files from disk when available. Tests added for these behaviors.

Changes

  • Protocol & Contract (dev-suite/src/memory/protocol.py): Extended the MemoryStore.query() signature with optional min_score: float | None = None and updated the docstring to document the 0.0–1.0 similarity-score filter.
  • Core Implementation, Chroma (dev-suite/src/memory/chroma_store.py): Added min_score handling in ChromaMemoryStore.query(); computes score = 1 - dist, filters out entries below min_score, and assigns MemoryQueryResult.score from the computed score.
  • Orchestration & Usage (dev-suite/src/orchestrator.py): _fetch_memory_context() now queries memory with min_score=0.3; developer_node constructs a new_entry and deduplicates developer memory_writes by task_id on retry (replacing an existing developer entry if present). apply_code_node resolves workspace_root (with a fallback) and, when filesystem_write tools succeeded, prefers reading blueprint.target_files from disk into parsed_files, falling back to parsing generated_code if reads fail. Minor docstring/comment removals.
  • Tests — Memory (dev-suite/tests/test_memory.py): Added three tests covering store.query() with min_score=None, a moderate threshold filter, and a high threshold (min_score=0.99) to verify filtering behavior.
  • Tests — Orchestrator (dev-suite/tests/test_orchestrator.py): Added a TestDeveloperMemoryDedup suite with tests ensuring no duplicate developer memory writes on retry and that dedup preserves other agents' entries.
  • Tests — Apply Code (dev-suite/tests/test_apply_code.py): Added TestApplyCodeDiskRead tests verifying apply_code_node prefers reading tool-written files from disk when filesystem_write succeeded and falls back correctly when files are missing or writes failed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hop through memories, sorting by their score,
Replacing repeats and peeking at the floor.
If tools wrote files, I'll read them from the root,
Else parse what was made and keep the logs astute. 🌱

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 60.87%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and specifically describes the two main changes, deduplicating memory_writes on retry and adding min_score memory filtering, matching the core fixes in the changeset.




@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
dev-suite/src/memory/chroma_store.py (1)

294-296: Docstring claims incorrect score range for edge cases.

The min_score parameter's docstring states "0.0-1.0", but with Chroma's cosine distance space (range [0, 2]), the formula score = 1 - dist can produce scores in the range [-1, 1]. While negative scores are unlikely in practice (requiring distances > 1, meaning nearly opposite vectors), the documentation is technically inaccurate.

Consider either:

  1. Clamping: score = max(0.0, 1 - dist)
  2. Updating the docstring to clarify the actual score range
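Option 1 is a one-line change; a sketch with a hypothetical helper, shown only to make the ranges concrete:

```python
def distance_to_score(dist: float, clamp: bool = True) -> float:
    """Chroma cosine distance lies in [0, 2], so 1 - dist lies in [-1, 1].
    Clamping restores the documented [0.0, 1.0] range for near-opposite vectors."""
    score = 1.0 - dist
    return max(0.0, score) if clamp else score
```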
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/src/memory/chroma_store.py` around lines 294 - 296, The docstring
for the min_score parameter is incorrect because the current computation score =
1 - dist (with Chroma cosine distance in [0,2]) can yield scores in [-1,1];
either clamp the computed score to the [0.0,1.0] range (e.g., replace the
calculation with a clamped result such as max(0.0, 1 - dist) before comparing to
min_score) or update the docstring for the function/method that declares the
min_score parameter to state the true possible range ([-1.0, 1.0]) and note that
negative scores indicate nearly opposite vectors; ensure you modify the same
block that contains the variable score and the comparison against min_score.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 819b57d8-d140-4fa4-8415-54c03b98cc75

📥 Commits

Reviewing files that changed from the base of the PR and between c92efbc and a701a6e.

📒 Files selected for processing (5)
  • dev-suite/src/memory/chroma_store.py
  • dev-suite/src/memory/protocol.py
  • dev-suite/src/orchestrator.py
  • dev-suite/tests/test_memory.py
  • dev-suite/tests/test_orchestrator.py

When developer uses filesystem_write tools, apply_code_node now reads
the clean files from disk instead of re-parsing the LLM's generated_code
text which contains markdown fences and summary prose.

Root cause: trace review showed the developer wrote 567 chars of clean
Python via filesystem_write, but apply_code re-parsed the 1533-char
generated_code (which included ```python fences and ## Summary prose),
producing corrupted 677-char output that failed sandbox validation.
This caused 2 wasted retries on a task that the developer got right
on the first attempt.

Flow with fix:
1. Developer writes clean code via filesystem_write -> disk
2. apply_code detects successful filesystem_write in tool_calls_log
3. Reads files from disk using blueprint.target_files
4. Passes clean content to sandbox (no re-parsing needed)
5. Falls back to parser if disk read fails or no tools were used

Tests: 4 new tests in TestApplyCodeDiskRead, all passing.
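The five-step flow above can be sketched roughly as follows. The state keys (tool_calls_log, workspace_root, blueprint.target_files) come from this description, but the helper name and exact shapes are assumptions:

```python
from pathlib import Path


def resolve_code_files(state: dict, parse_generated_code) -> list[dict]:
    """Prefer clean tool-written files on disk over re-parsing the LLM's
    generated_code text, which may contain markdown fences and summary prose."""
    wrote_to_disk = any(
        call["tool"] == "filesystem_write" and call.get("success")
        for call in state.get("tool_calls_log", [])
    )
    if wrote_to_disk:
        root = Path(state["workspace_root"])
        files = []
        for rel_path in state["blueprint"]["target_files"]:
            path = root / rel_path
            if not path.is_file():
                break  # missing file: abandon disk read, fall back to parser
            files.append({"path": rel_path, "content": path.read_text()})
        else:
            return files  # all target files read cleanly from disk
    # Fallback: parse the raw generated_code text
    return parse_generated_code(state["generated_code"])
```

The for/else keeps the fallback reachable: any missing target file abandons the disk read and re-parses generated_code instead.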

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
dev-suite/tests/test_apply_code.py (2)

187-213: Consider extracting a shared state builder fixture.

_make_state here largely duplicates setup from TestApplyCodeNode; a shared fixture/helper would reduce drift and maintenance overhead.
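One possible shape for the suggested shared builder; the base keys below are illustrative, not the project's actual schema:

```python
def make_state(**overrides) -> dict:
    """Shared builder for apply_code test state. Per-test fields arrive as
    keyword overrides, so both test suites stay in sync with one base dict."""
    state = {
        "workspace_root": "/tmp/workspace",
        "blueprint": {"target_files": ["app.py"]},
        "generated_code": "print('hello')\n",
        "tool_calls_log": [],
    }
    state.update(overrides)
    return state
```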

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/tests/test_apply_code.py` around lines 187 - 213, Extract the
duplicated state builder into a shared test fixture/helper so both this test
module and TestApplyCodeNode use the same setup: move the logic inside
_make_state (which builds the base dict with Blueprint, generated_code,
WorkflowStatus, etc.) into a reusable function or pytest fixture (e.g.,
make_state_fixture or shared_make_state) and update tests to call that helper
instead of duplicating the dict; ensure the helper returns the same structure
and accepts **overrides so existing callers of _make_state (and
TestApplyCodeNode) can pass custom fields unchanged.

236-237: Avoid brittle assertions on free-form trace wording.

Checking substrings like "tool-written" / "falling back" makes tests fragile to harmless trace-text edits. Prefer asserting structured behavior (source/content of parsed_files) or stable trace event tokens.

Also applies to: 273-274

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-suite/tests/test_apply_code.py` around lines 236 - 237, The assertion is
brittle because it inspects free-form trace text (result["trace"]) for
substrings like "tool-written"; instead, change the test to assert structured
behavior — e.g., check the parsed_files collection (parsed_files or
result["parsed_files"]) for the expected source/content or file count, or
inspect trace entries for stable tokens/fields (such as an explicit event type
or code like event["type"] == "tool_written") rather than matching free-form
messages; update both the assertion at the current location (assert
any("tool-written" in t for t in result["trace"])) and the similar one around
lines 273-274 to use those structured checks (result, result["trace"],
parsed_files, event["type"] are the symbols to modify).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 095c5cec-7c26-4104-a9f6-6a5226b0ecf7

📥 Commits

Reviewing files that changed from the base of the PR and between a701a6e and bb1a08b.

📒 Files selected for processing (1)
  • dev-suite/tests/test_apply_code.py

Comment on lines +215 to +237
def test_prefers_disk_over_parser_when_tools_used(self, tmp_path):
    """When filesystem_write was used, read files from disk instead of re-parsing."""
    from src.orchestrator import apply_code_node

    # Write clean code to disk (simulating what filesystem_write did)
    (tmp_path / "app.py").write_text("print('hello from disk')\n")

    # generated_code has fences that would corrupt parsing
    state = self._make_state(
        workspace_root=str(tmp_path),
        tool_calls_log=[
            {"agent": "developer", "turn": 1, "tool": "filesystem_write",
             "args_preview": "{}", "result_preview": "ok", "success": True},
        ],
    )

    result = apply_code_node(state)

    assert len(result["parsed_files"]) == 1
    assert result["parsed_files"][0]["path"] == "app.py"
    assert result["parsed_files"][0]["content"] == "print('hello from disk')\n"
    assert any("tool-written" in t for t in result["trace"])


⚠️ Potential issue | 🟠 Major

Include the apply_code_node disk-read branch in this PR (or gate these tests until it lands).

These tests currently require behavior that is not present in src/orchestrator.py (see the parser-only flow around lines 461-505), which is why CI is failing here. Please ship the production disk-read path (successful developer filesystem_write → read from workspace → fallback to parser) in the same PR so this suite can pass.

Also applies to: 255-274

🧰 Tools
🪛 GitHub Actions: CI

[error] 235-235: Test failed: TestApplyCodeDiskRead.test_prefers_disk_over_parser_when_tools_used. Expected parsed_files[0].content to be "print('hello from disk')\n" but got "print('hello')".

@Abernaughty Abernaughty merged commit 9e3f90e into main Apr 5, 2026
3 checks passed
@Abernaughty Abernaughty deleted the fix/trace-dedup-memory-relevance branch April 5, 2026 20:26