fix(security): patch command injection and SONA bugs, publish mincut-wasm#266
Merged
fix(security): patch command injection and SONA bugs, publish mincut-wasm#266
Conversation
…DR-115) Implements tier-aware product quantization for embedding compression: - 3-bit (CentroidMerged): 8.68x compression, 99.05% recall - 4-bit (DeltaCompressed): 6.83x compression, 99.78% recall - 2-bit (Archived): 11.91x compression, 95.43% recall Key changes: - Add quantization.rs with PiQQuantizer and QuantizedEmbedding types - Integrate quantization into web_ingest.rs Phase 5 - Add quantized_embedding field to WebMemory struct - Update ADR-115 with POC validation results Throughput: 97K-134K embeddings/sec on Apple Silicon Co-Authored-By: claude-flow <ruv@ruv.net>
- Add web_store and crawl_adapter fields to AppState (types.rs) - Initialize persistent adapter and web store in create_router (routes.rs) - Update crawl/discover endpoint to use persistent adapter - Update crawl/stats endpoint to include WebMemoryStore metrics - Stats now show tier distribution (full/delta/centroid/archived) This enables persistent stats accumulation across requests and prepares for production Common Crawl ingestion per ADR-115. Co-Authored-By: claude-flow <ruv@ruv.net>
- Add CdxCacheEntry struct with TTL (24h expiration) - Add cdx_cache DashMap to CommonCrawlAdapter - Cache CDX query results before URL filtering - Track cache hits/misses in CommonCrawlStats - Expose cache stats in /v1/pipeline/crawl/stats endpoint - Calculate and display cache hit rate percentage This eliminates redundant CDX API calls when querying the same domain pattern multiple times, reducing latency and API load. Co-Authored-By: claude-flow <ruv@ruv.net>
Common Crawl CDX API returns length and offset as strings, not integers. Add custom deserialize_string_to_u64 function to handle the type conversion. Co-Authored-By: claude-flow <ruv@ruv.net>
- Increase request timeout to 120s for slow CDX responses - Add connect_timeout (30s) and pool_idle_timeout (90s) - Disable default MIME/status filters for simpler queries - Update default crawl index to CC-MAIN-2026-08 - Use expect() instead of unwrap_or_default() for clearer errors Co-Authored-By: claude-flow <ruv@ruv.net>
- Add /v1/pipeline/crawl/test endpoint for diagnosing CDX issues - Add tracing for CDX query URLs and errors - Tests connectivity to Common Crawl index API Co-Authored-By: claude-flow <ruv@ruv.net>
Common Crawl servers don't send proper TLS close_notify, causing rustls to error. Switch to native-tls which is more lenient. - Change reqwest feature from rustls-tls to native-tls - Add openssl to build dependencies - Add libssl3 to runtime image Co-Authored-By: claude-flow <ruv@ruv.net>
…n Crawl Common Crawl CDX servers have issues with HTTP/2 and connection reuse: - Force HTTP/1.1 with http1_only() to avoid protocol issues - Disable connection pooling (pool_max_idle_per_host=0) since CC closes connections - Add tcp_nodelay for lower latency
The diagnostic endpoint was using reqwest::get() which creates a new client with default settings, potentially using rustls instead of our configured native-tls client. Now uses adapter.test_connectivity() which uses the properly configured HTTP client.
Compare Common Crawl connectivity against httpbin.org to determine if the issue is Cloud Run networking or specifically Common Crawl.
The discover endpoint was calling query_cdx twice: 1. Once explicitly to get cdx_records_found 2. Again inside discover_domain Due to URL deduplication in query_cdx, the second call returned 0 records. Fixed by adding discover_from_records() which accepts pre-fetched CDX records.
Common Crawl CDX servers are flaky and sometimes return incomplete responses. Added 3-attempt retry with exponential backoff (1s, 2s) for both CDX queries and connectivity tests.
Test Internet Archive CDX, data.commoncrawl.org, and httpbin.org to diagnose if the issue is specific to index.commoncrawl.org.
Try adding HTTP headers that might help with server compatibility: - Accept: application/json - Connection: close (avoid keep-alive issues)
When the CDX API at index.commoncrawl.org is unreachable from Cloud Run, fall back to pre-computed sample CDX records for demonstration purposes. This allows testing the full pipeline (WARC fetch, extraction, injection) while the CDX connectivity issue is being investigated.
…wasm Security: - Fix #256: Add sanitizeShellArg() to MCP workers_create handler preventing shell command injection via name/preset/triggers params Bug fixes: - Fix #257: Add fallback parser in sona-wrapper.js for Rust debug format strings from SonaEngine.getStats() - Fix #258: Add force parameter to BackgroundLoop::run_cycle() so forceLearn() bypasses 100-trajectory minimum requirement Features: - Fix #254: Build and publish @ruvector/mincut-wasm@0.1.0 to npm - Add Wayback Machine fallback for Common Crawl CDX API Published: - @ruvector/mincut-wasm@0.1.0 - ruvector@0.2.13 Co-Authored-By: claude-flow <ruv@ruv.net>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Changes
Security (Critical)
sanitizeShellArg()function to strip shell metacharacters from user inputname,preset,triggersinworkers_createhandlerBug Fixes
getStats()forceparameter torun_cycle()to bypass 100-trajectory minimumforce=truefromforce_background()forforceLearn()APIFeatures
Published Packages
@ruvector/mincut-wasm@0.1.0- WASM bindings for dynamic minimum cutruvector@0.2.13- Security fix releaseTest Results
Issues Resolved
Test plan
cargo test -p ruvector-sona(85 pass)npm testin ruvector package (69 pass)🤖 Generated with claude-flow