Skip to content

fix: add structural noise filtering for system traces and raw blobs#149

Open
eisen0419 wants to merge 1 commit intoCortexReach:masterfrom
eisen0419:fix/issue-127-noise-filter-structural
Open

fix: add structural noise filtering for system traces and raw blobs#149
eisen0419 wants to merge 1 commit intoCortexReach:masterfrom
eisen0419:fix/issue-127-noise-filter-structural

Conversation

@eisen0419
Copy link

Summary

noise-filter.ts previously only filtered surface-level noise (greetings, denials, meta-questions), but missed structural memory contamination: runtime traces, raw conversation blobs, and metadata wrappers that silently degrade memory corpus quality over time.

Changes

  • src/noise-filter.ts: Added STRUCTURAL_NOISE_PATTERNS array covering:
    • System: prefixed runtime messages
    • Compaction/model-switch/session-management traces
    • OpenClaw (untrusted metadata): wrapper remnants
    • Pure JSON objects ({...})
    • XML-wrapped content (<tag>...</tag>)
    • Multi-line blockquote blobs (>= 3 quoted lines)
  • New filterStructuralNoise option in NoiseFilterOptions (default: true)
  • test/noise-filter-structural.test.mjs: 10 test cases covering all new patterns + opt-out behavior

Test plan

  • All 10 new structural noise tests pass
  • Existing isNoise() behavior preserved (new patterns are additive)
  • Option filterStructuralNoise: false disables the new filters

Closes #127

Add STRUCTURAL_NOISE_PATTERNS to noise-filter.ts covering:
- System: prefixed runtime messages
- Compaction/model-switch/session-management traces
- OpenClaw (untrusted metadata) wrapper remnants
- Pure JSON objects and XML-wrapped content
- Multi-line blockquote blobs (>= 3 quoted lines)

New filterStructuralNoise option (default: true) in NoiseFilterOptions.

Closes CortexReach#127
@eisen0419
Copy link
Author

CI failure note

The CI failure is a pre-existing issue on master, unrelated to this PR's changes.

The plugin-manifest-regression.mjs test fails because openclaw.plugin.json has version 1.1.0-beta.4 while package.json has 1.1.0-beta.5:

AssertionError [ERR_ASSERTION]: openclaw.plugin.json version should stay aligned with package.json
+ '1.1.0-beta.4'
- '1.1.0-beta.5'

This version mismatch exists on the current master branch and affects all PRs based on it.

@AliceLJY
Copy link
Collaborator

Review: fix: add structural noise filtering for system traces and raw blobs

Verdict: Fix-then-merge — two pattern-level issues need addressing before this is safe to ship.


✅ What's working

  • filterStructuralNoise opt-out flag is correctly wired and tested
  • compacti(ng|on)\s*(context|safeguard) and (untrusted metadata)\s*: are highly specific — negligible false positive risk
  • /^\{[\s\S]*\}$/ anchored to full string correctly targets raw JSON blob memories
  • /(?:^>.*\n){3,}/m (3+ blockquote lines) is reasonable
  • All 10 tests pass

🔴 Blocking

1. npm test not updatednoise-filter-structural.test.mjs missing from package.json scripts.

2. XML/HTML pattern is not anchored and will false-positive on real user content

/<[a-z-]+>[\s\S]*<\/[a-z-]+>/i matches anywhere in a string and doesn't require matching tags. Any memory mentioning HTML/XML gets silently dropped:

"I learned that <div>content</div> structures HTML"   → filtered ❌
"response wrapped in <answer>...</answer> tags"        → filtered ❌
"The <tool_call> format used by Claude"               → filtered ❌ (if any closing tag exists later)

For the intended use case (filtering OpenClaw's <relevant-memories> injection), a targeted pattern is safer:

/^<(relevant-memories|memory-context|system-context)>[\s\S]*<\/\1>/i

Or anchor to the full string (^ + $) if broader XML filtering is intended.

🟡 Suggested before merge

3. model and session patterns lack word boundaries — risk of over-filtering

/model\s*(switch|switched|changed|swap)/i and /session\s*(reset|restart|start|end)/i have no anchors and no \b boundaries. Real user memories that would be silently dropped:

"Our business model changed after the pivot"  → matches /model\s*changed/  ❌
"The session started at 9am"                  → matches /session\s*start/  ❌
  ("started" begins with "start" — no word boundary)
"We need to model switching costs"            → matches /model\s*switch/   ❌

Suggested fix:

/\bmodel\s+(switched?|changed|swapped?)\b/i
/\bsession\s+(reset|restarted?|started?|ended?)\b/i

4. No negative test cases for the three risky patterns — especially important here given the false positive surface.


The compaction and untrusted metadata patterns are solid. The JSON blob pattern is clean. The XML and model/session patterns need scoping before this is ready to ship.

@AliceLJY
Copy link
Collaborator

Correction to my previous comment — the suggested regex fix was wrong on two counts:

  1. switched? does not match "model switch" (misses true positive)
  2. started? still matches "session started" (false positive remains)

The correct minimal fix is to add \b at the end of each pattern without changing the word list:

/\bmodel\s+(switch|switched|changed|swap)\b/i
/\bsession\s+(reset|restart|start|end)\b/i

With this:

  • "model switching costs" → no match ✅ (\b fails mid-word)
  • "session started at 9am" → no match ✅ (\b fails because ed follows)
  • "model switch detected" → match ✅
  • "session reset" → match ✅

One remaining limitation: "Our business model changed after the pivot" still matches /model\s+changed\b/ because model changed happens to appear as a valid word sequence. This is a semantic ambiguity that \b alone cannot resolve — worth noting as an accepted trade-off or handled with a deny-list of preceding words (e.g. business model, mental model).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Noise filter misses structural memory contamination at write time (System traces / raw blobs / fragments)

2 participants