feat: sanitize transport metadata before embedding/display #103
76159482 wants to merge 1 commit into CortexReach:main
## Problem

OpenClaw's chat transport layers (Telegram, Discord, etc.) inject metadata blocks into message content:

- `Conversation info (untrusted metadata):` followed by a fenced `json` block
- `Sender (untrusted metadata):` followed by a fenced `json` block
- `[Queued messages while agent was busy]`

These wrappers pollute memory embeddings and retrieval results, causing:

1. Irrelevant memories to match on metadata keywords
2. Noisy display in `memory_recall` output
3. Wasted embedding API calls on transport boilerplate

## Solution

Add a `sanitizeMemoryText()` function to strip known metadata patterns:

- Removes metadata blocks via regex patterns
- Compresses excessive whitespace
- Preserves human-readable content

Integrated at 3 key points:

1. `memory_recall` - Clean text before display
2. `memory_store` - Clean text before embedding
3. `memory_update` - Clean text before re-embedding

## Implementation

- `noise-filter.ts`: Add `METADATA_BLOCK_PATTERNS` + `sanitizeMemoryText()`
- `tools.ts`: Call `sanitizeMemoryText()` at embedding/display points
- Backward compatible: No config changes required
- Fail-safe: Returns original text if sanitization fails

## Testing

Tested in production for 2+ weeks with OpenClaw Telegram transport.

Observed improvements:

- Reduced false-positive matches on metadata keywords
- Cleaner `memory_recall` output
- No regressions in existing functionality
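For readers following along without the diff, the sanitizer described under Solution could be sketched roughly as below. This is an illustration only: the actual contents of `METADATA_BLOCK_PATTERNS` in `noise-filter.ts` are not shown in this thread, so the three regexes are assumptions based on the wrapper strings listed under Problem, and the whitespace rules are one plausible reading of "compresses excessive whitespace".

```typescript
// Hypothetical patterns: the real METADATA_BLOCK_PATTERNS are not quoted in
// this PR thread, so these regexes are assumptions for illustration.
const METADATA_BLOCK_PATTERNS: RegExp[] = [
  // Header line plus an optional fenced block (three backticks ... three backticks).
  /Conversation info \(untrusted metadata\):\s*(?:`{3}[\s\S]*?`{3})?/g,
  /Sender \(untrusted metadata\):\s*(?:`{3}[\s\S]*?`{3})?/g,
  // Marker the transport adds when messages were queued.
  /\[Queued messages while agent was busy\]/g,
];

function sanitizeMemoryText(text: string): string {
  try {
    let cleaned = text;
    for (const pattern of METADATA_BLOCK_PATTERNS) {
      // Replace with a space so neighbouring words are not glued together.
      cleaned = cleaned.replace(pattern, " ");
    }
    return cleaned
      .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
      .replace(/ ?\n ?/g, "\n") // drop spaces around line breaks
      .replace(/\n{3,}/g, "\n\n") // keep at most one blank line
      .trim();
  } catch {
    // Fail-safe as described in the PR: fall back to the original text.
    return text;
  }
}
```

Under these assumptions, a call site in `tools.ts` would pass the raw message through `sanitizeMemoryText()` before handing it to the embedder or before rendering `memory_recall` results. Text that contains none of the wrappers passes through unchanged apart from whitespace normalization.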
Thanks for tackling this. I do see 3 issues with the current implementation:
So my assessment is: the problem is real and the direction is correct, but the current implementation is still partial and does not fully fix it.