Description
Background
StringyMcStringFace's semantic classifier needs to identify and tag various string patterns extracted from binaries, including:
- URLs and domains
- IP addresses (IPv4/IPv6)
- File paths (POSIX/Windows)
- Registry keys
- GUIDs
- Email addresses
- JWT tokens
- Base64 sequences
- Format strings (printf-style)
- User agent strings
Each classification requires regex pattern matching, and the classifier will process thousands of strings per binary. Without caching, regex compilation overhead becomes a significant performance bottleneck.
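The classifier code later in this issue refers to `Tag::Url` and `Tag::Guid` without defining `Tag`. Below is a minimal sketch of such a type, assuming one variant per category listed above. The variant names other than `Url` and `Guid`, the `ALL` constant, and the `label` method are hypothetical and not taken from the existing codebase.

```rust
/// Hypothetical tag type: one variant per string category in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Tag {
    Url,
    Domain,
    Ipv4,
    Ipv6,
    FilePathPosix,
    FilePathWindows,
    RegistryKey,
    Guid,
    Email,
    Jwt,
    Base64,
    FormatString,
    UserAgent,
}

impl Tag {
    /// Every variant in a fixed order, so callers can iterate over all categories.
    pub const ALL: [Tag; 13] = [
        Tag::Url,
        Tag::Domain,
        Tag::Ipv4,
        Tag::Ipv6,
        Tag::FilePathPosix,
        Tag::FilePathWindows,
        Tag::RegistryKey,
        Tag::Guid,
        Tag::Email,
        Tag::Jwt,
        Tag::Base64,
        Tag::FormatString,
        Tag::UserAgent,
    ];

    /// Short label for reports. The exact label strings are an assumption.
    pub fn label(self) -> &'static str {
        match self {
            Tag::Url => "url",
            Tag::Domain => "domain",
            Tag::Ipv4 => "ipv4",
            Tag::Ipv6 => "ipv6",
            Tag::FilePathPosix => "filepath_posix",
            Tag::FilePathWindows => "filepath_windows",
            Tag::RegistryKey => "registry_key",
            Tag::Guid => "guid",
            Tag::Email => "email",
            Tag::Jwt => "jwt",
            Tag::Base64 => "base64",
            Tag::FormatString => "format_string",
            Tag::UserAgent => "user_agent",
        }
    }
}
```

Keeping the enum `Copy` lets `classify_string` push tags into a `Vec<Tag>` without cloning, and the exhaustive `match` in `label` makes the compiler flag any category added later without a label.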
Problem Statement
Currently, the classification module (src/classification/mod.rs) is minimal. When implemented, it will need to compile multiple complex regex patterns. Compiling regex patterns on-demand for each string classification would:
- Add significant CPU overhead (regex compilation is expensive)
- Create unnecessary memory allocations
- Slow down the overall analysis pipeline
- Make the tool less suitable for batch processing
Proposed Solution
Implement a regex caching strategy using lazy initialization:
1. Add Dependencies
Add regex and once_cell to Cargo.toml. On Rust 1.80 or later, std::sync::LazyLock from the standard library can replace once_cell:
[dependencies]
regex = "1.11"
once_cell = "1.20" # or use std::sync::LazyLock
2. Implement Lazy-Initialized Regex Cache
Create a module with static, lazily-initialized regex patterns:
use once_cell::sync::Lazy;
use regex::Regex;

/// Bundles references to the static patterns so callers can pass them around together.
pub struct PatternCache {
    pub url: &'static Lazy<Regex>,
    pub domain: &'static Lazy<Regex>,
    pub ipv4: &'static Lazy<Regex>,
    pub ipv6: &'static Lazy<Regex>,
    pub filepath_posix: &'static Lazy<Regex>,
    pub filepath_windows: &'static Lazy<Regex>,
    pub registry_key: &'static Lazy<Regex>,
    pub guid: &'static Lazy<Regex>,
    pub email: &'static Lazy<Regex>,
    pub jwt: &'static Lazy<Regex>,
    pub base64: &'static Lazy<Regex>,
    pub format_string: &'static Lazy<Regex>,
    pub user_agent: &'static Lazy<Regex>,
}

// Each pattern is compiled on first access and reused for every later call.
static URL_PATTERN: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"https?://[\w.-]+(?:/[\w./?%&=-]*)?").expect("URL pattern is valid")
});
static GUID_PATTERN: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"(?i)[{]?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}[}]?")
        .expect("GUID pattern is valid")
});
// ... additional patterns
3. Use Cached Patterns in Classifier
The semantic classifier should reference these static patterns:
pub fn classify_string(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if URL_PATTERN.is_match(text) {
        tags.push(Tag::Url);
    }
    if GUID_PATTERN.is_match(text) {
        tags.push(Tag::Guid);
    }
    // ... additional classifications
    tags
}
4. Consider RegexSet for Multiple Patterns
When many patterns are tested against the same text, RegexSet can check all of them in a single pass and report which ones matched:
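use regex::RegexSet;
// Pattern order matters: the index reported by RegexSet identifies the pattern.
static PATTERN_SET: Lazy<RegexSet> = Lazy::new(|| {
    RegexSet::new(&[
        r"https?://[\w.-]+(?:/[\w./?%&=-]*)?", // 0: URL
        r"(?i)[{]?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}[}]?", // 1: GUID
        // ... more patterns
    ]).unwrap()
});
Implementation Tasks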
- Add regex dependency to Cargo.toml
- Add once_cell or use std::sync::LazyLock for lazy initialization
- Create src/classification/patterns.rs with cached regex patterns
- Implement pattern definitions for all semantic tags
- Create classify_string() function using cached patterns
- Add unit tests for each pattern
- Create benchmark suite to measure performance improvement
- Document pattern syntax and matching behavior
Success Criteria
- All regex patterns compiled once on first use
- Zero regex compilation overhead during string classification after each pattern's first use
- Measurable performance improvement (target: >90% reduction in classification time relative to a baseline that recompiles patterns for every string)
- Comprehensive test coverage for all patterns
- Benchmarks demonstrating caching effectiveness
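One way to check the first two criteria is to count how often a pattern's initializer runs. The sketch below uses `std::sync::LazyLock` (Rust 1.80+) with a stand-in `CompiledPattern` type in place of `regex::Regex`, so that it builds without external crates. `CompiledPattern`, `COMPILE_COUNT`, `count_url_like`, and `compile_count` are illustrative names, not part of the project.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the initializer below has run.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a compiled pattern. In the real module this would be a
/// `regex::Regex` produced by `Regex::new(...)`.
pub struct CompiledPattern {
    needle: &'static str,
}

impl CompiledPattern {
    /// Simplified matcher: a substring test instead of a regex search.
    pub fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle)
    }
}

/// Initialized on first access, then reused by every later call.
static URL_PATTERN: LazyLock<CompiledPattern> = LazyLock::new(|| {
    COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
    CompiledPattern { needle: "://" }
});

/// Classifies a batch and returns how many strings matched.
pub fn count_url_like(strings: &[&str]) -> usize {
    strings.iter().filter(|s| URL_PATTERN.is_match(s)).count()
}

/// Number of times the pattern has been built so far.
pub fn compile_count() -> usize {
    COMPILE_COUNT.load(Ordering::SeqCst)
}
```

A unit test could call `count_url_like` on several batches and assert that `compile_count()` stays at 1. The same counter approach should carry over to the real `Regex` statics, for example behind a `#[cfg(test)]` gate.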
Performance Benchmarks
Create benchmarks in benches/regex_caching.rs to measure:
- Compilation overhead: Time to compile patterns with/without caching
- Classification throughput: Strings classified per second
- Memory usage: Compare cached vs. on-demand compilation
- Batch processing: Time to classify 10k, 100k, 1M strings
Expected improvement: 10-100x faster classification than per-string compilation, depending on pattern complexity and string volume.
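The throughput and batch measurements could start from a small timing helper like the one below. This is a sketch that uses only `std::time`, with the classifier passed in as a closure; `Throughput`, `measure`, and `synthetic_batch` are hypothetical names, and the synthetic inputs are placeholders rather than strings taken from real binaries.

```rust
use std::time::{Duration, Instant};

/// Result of one timed classification run.
pub struct Throughput {
    pub strings: usize,
    pub matched: usize,
    pub elapsed: Duration,
}

impl Throughput {
    /// Strings classified per second. The elapsed time is clamped to at
    /// least one nanosecond so very fast runs do not divide by zero.
    pub fn per_second(&self) -> f64 {
        let secs = self.elapsed.as_secs_f64().max(1e-9);
        self.strings as f64 / secs
    }
}

/// Runs `classify` once per input and records the wall-clock time.
/// `classify` returns true when the string received at least one tag.
pub fn measure<F>(inputs: &[String], mut classify: F) -> Throughput
where
    F: FnMut(&str) -> bool,
{
    let start = Instant::now();
    let mut matched = 0;
    for s in inputs {
        if classify(s) {
            matched += 1;
        }
    }
    Throughput {
        strings: inputs.len(),
        matched,
        elapsed: start.elapsed(),
    }
}

/// Builds a synthetic batch in which every third string looks like a URL.
pub fn synthetic_batch(n: usize) -> Vec<String> {
    (0..n)
        .map(|i| {
            if i % 3 == 0 {
                format!("http://host{i}.example/path")
            } else {
                format!("plain string {i}")
            }
        })
        .collect()
}
```

Calling `measure` once with a closure that compiles its pattern on every call and once with a closure that reads a cached static would give the two figures to compare for 10k, 100k, and 1M strings. A single wall-clock run is noisy, so a statistics-aware harness such as Criterion would likely be a better fit for the final benches/regex_caching.rs.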
References
- Rust regex crate documentation: https://docs.rs/regex/latest/regex/
- Performance tips: https://docs.rs/regex/latest/regex/index.html#performance
- once_cell for lazy statics: https://docs.rs/once_cell/latest/once_cell/
Related
- Part of v0.1 MVP milestone
- Blocks efficient implementation of semantic classification (requirement 8.3)
- Required for achieving acceptable performance on large binaries