feat: sanitize transport metadata before embedding/display#103

Open
76159482 wants to merge 1 commit into CortexReach:main from 76159482:feat/sanitize-transport-metadata

76159482 commented Mar 8, 2026

Problem

OpenClaw's chat transport layers (Telegram, Discord, etc.) inject metadata blocks into message content:

  • Conversation info (untrusted metadata): ```json ... ```
  • Sender (untrusted metadata): ```json ... ```
  • [Queued messages while agent was busy]

These wrappers pollute memory embeddings and retrieval results, causing:

  1. Irrelevant memories to match on metadata keywords
  2. Noisy display in memory_recall output
  3. Wasted embedding API calls on transport boilerplate

Solution

Add a sanitizeMemoryText() function that strips known metadata patterns:

  • Removes metadata blocks via regex patterns
  • Compresses excessive whitespace
  • Preserves human-readable content

Integrated at 3 key points:

  1. memory_recall - Clean text before display
  2. memory_store - Clean text before embedding
  3. memory_update - Clean text before re-embedding

Implementation

  • noise-filter.ts: Add METADATA_BLOCK_PATTERNS + sanitizeMemoryText()
  • tools.ts: Call sanitizeMemoryText() at embedding/display points
  • Backward compatible: No config changes required
  • Fail-safe: Returns original text if sanitization fails
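
For illustration, a minimal sketch of what such a sanitizer could look like. The diff itself is not shown in this thread, so the regular expressions below are assumptions derived from the three wrapper formats listed under Problem, not the PR's actual METADATA_BLOCK_PATTERNS:

```typescript
// Hypothetical patterns inferred from the wrapper formats in the PR description.
// `{3} matches a literal triple-backtick fence.
const METADATA_BLOCK_PATTERNS: RegExp[] = [
  /Conversation info \(untrusted metadata\):\s*`{3}json[\s\S]*?`{3}/g,
  /Sender \(untrusted metadata\):\s*`{3}json[\s\S]*?`{3}/g,
  /\[Queued messages while agent was busy\]/g,
];

function sanitizeMemoryText(text: string): string {
  try {
    let cleaned = text;
    // Remove every known transport metadata block.
    for (const pattern of METADATA_BLOCK_PATTERNS) {
      cleaned = cleaned.replace(pattern, "");
    }
    // Compress runs of spaces/tabs and runs of blank lines, then trim the ends.
    return cleaned
      .replace(/[ \t]+/g, " ")
      .replace(/\n{3,}/g, "\n\n")
      .trim();
  } catch {
    // Fail-safe: fall back to the original text (e.g. if a non-string slips in at runtime).
    return text;
  }
}
```

Text without any wrapper passes through unchanged apart from whitespace compression, which is what keeps the change backward compatible.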

Testing

Tested in production for 2+ weeks with OpenClaw Telegram transport.

Observed improvements:

  • Reduced false-positive matches on metadata keywords
  • Cleaner memory_recall output
  • No regressions in existing functionality

rwmjhb (Collaborator) commented Mar 8, 2026

Thanks for tackling this.

I do see 3 issues with the current implementation:

  1. It only covers part of the ingestion path
    This PR updates memory_recall, memory_store, and memory_update, but the plugin's autoCapture path still reads raw event.messages content and directly embeds/stores it without applying this sanitization. So these wrappers can still enter the database through auto-capture.

  2. memory_store is internally inconsistent
    It uses cleanedText for embedding and duplicate detection, but still writes the original text to store.store() and mdMirror. That means the vector and persisted text can diverge, and the transport metadata is still being stored even though recall output looks cleaner.

  3. There is no automated test coverage for these wrapper patterns
    Since this is a behavior fix for a real production issue, the missing tests are a gap. Right now there is nothing locking in the expected behavior for these exact metadata envelopes.

So my assessment is: the problem is real and the direction is correct, but the current implementation is still partial and does not fully fix it.
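
One way to address points 1 and 2 together is to sanitize once and route every write path through the same helper, so the embedded text and the persisted text cannot diverge. The sketch below is only illustrative: `MemoryEntry`, `MemorySinks`, `storeSanitized`, and `autoCapture` are hypothetical names, the plugin's real store and embedder APIs are not shown in this thread, and the embedding call is kept synchronous for brevity.

```typescript
// Hypothetical shapes standing in for the plugin's real store/embedder.
interface MemoryEntry {
  text: string;
  vector: number[];
}

interface MemorySinks {
  embed: (text: string) => number[];     // stand-in for the embedding call
  persist: (entry: MemoryEntry) => void; // stand-in for store.store() and mdMirror
}

// Sanitize once, then use the same cleaned text for the vector and the stored record.
function storeSanitized(
  rawText: string,
  sanitize: (text: string) => string,
  sinks: MemorySinks,
): MemoryEntry | null {
  const cleanedText = sanitize(rawText);
  if (cleanedText.length === 0) return null; // message was only transport boilerplate
  const entry: MemoryEntry = { text: cleanedText, vector: sinks.embed(cleanedText) };
  sinks.persist(entry);
  return entry;
}

// Auto-capture goes through the same helper instead of embedding raw message content.
function autoCapture(
  messages: { content: string }[],
  sanitize: (text: string) => string,
  sinks: MemorySinks,
): MemoryEntry[] {
  const stored: MemoryEntry[] = [];
  for (const message of messages) {
    const entry = storeSanitized(message.content, sanitize, sinks);
    if (entry !== null) stored.push(entry);
  }
  return stored;
}
```

Because the sanitizer and sinks are passed in, the same helper can be exercised in unit tests with stubbed dependencies, which would also help close the coverage gap raised in point 3.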
