Overview
Implement a noise penalty detection system that identifies and penalizes low-quality strings during the ranking phase. This feature is critical for ensuring that StringyMcStringFace surfaces meaningful strings while filtering out garbage data, padding bytes, encrypted content, and table artifacts that plague traditional strings output.
Problem Statement
Binary files contain various types of noise that should be deprioritized or filtered:
- High Entropy Strings: Random-looking data from encryption, compression, or binary blobs (e.g., xK9#mP@vQ2%nR)
- Excessive Length: Extremely long strings are often padding, base64 blobs, or table data rather than meaningful text
- Repeated Patterns: Padding sequences (AAAA..., \x00\x00...) or table delimiters that add no semantic value
- Table Data: Structured binary data that appears as strings but contains mostly non-printable or low-information content
Without noise detection, the ranking system will surface these low-value strings alongside genuinely useful data like URLs, file paths, and error messages.
Proposed Solution
High Entropy Detection
Calculate Shannon entropy for each extracted string and apply penalties for entropy above threshold:
```rust
use std::collections::HashMap;

fn calculate_entropy(s: &str) -> f64 {
    let mut frequencies = HashMap::new();
    for c in s.chars() {
        *frequencies.entry(c).or_insert(0u32) += 1;
    }
    // Use the character count, not the byte length, so multi-byte
    // UTF-8 characters do not skew the probabilities.
    let len = s.chars().count() as f64;
    frequencies.values()
        .map(|&count| {
            let p = count as f64 / len;
            -p * p.log2()
        })
        .sum()
}

// Apply a penalty when entropy exceeds the threshold (tunable; see tiers below).
// Natural language typically has entropy in the 3.5-4.5 range.
// Random printable ASCII approaches log2(95) ≈ 6.6.
```

Penalty Approach:
- Entropy < 4.0: no penalty (natural language)
- Entropy 4.0-5.0: penalty scales linearly from 0 to -30 points
- Entropy > 5.0: fixed penalty of -30 points
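The tiers above could be mapped to a score penalty along these lines; this is a minimal sketch, and the function name and exact thresholds are assumptions rather than final values:

```rust
/// Map Shannon entropy to a score penalty using the tiered approach above.
/// Thresholds (4.0, 5.0) and the -30 cap are illustrative defaults.
fn entropy_penalty(entropy: f64) -> i32 {
    if entropy < 4.0 {
        0 // natural-language range: no penalty
    } else if entropy <= 5.0 {
        // linear ramp: 4.0 maps to 0 points, 5.0 maps to -30 points
        (-30.0 * (entropy - 4.0)).round() as i32
    } else {
        -30 // high entropy: fixed maximum penalty
    }
}

fn main() {
    assert_eq!(entropy_penalty(3.5), 0);
    assert_eq!(entropy_penalty(4.5), -15);
    assert_eq!(entropy_penalty(6.0), -30);
    println!("entropy penalty tiers ok");
}
```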
Excessive Length Penalty
Penalize strings beyond reasonable lengths using a tiered approach:
```rust
fn length_penalty(len: usize) -> i32 {
    match len {
        0..=256 => 0,      // Normal strings
        257..=512 => -10,  // Slightly long
        513..=1024 => -20, // Very long (likely base64/table)
        _ => -30,          // Extremely long (definitely noise)
    }
}
```

Repeated Pattern Detection
Detect strings dominated by repeated characters or short sequences:
```rust
use std::collections::HashSet;

fn detect_repeated_patterns(s: &str) -> Option<i32> {
    // Check for single-character repetition.
    let unique_chars = s.chars().collect::<HashSet<_>>().len();
    let total_chars = s.chars().count().max(1);
    let repetition_ratio = unique_chars as f64 / total_chars as f64;
    if repetition_ratio < 0.1 {
        return Some(-25); // fewer than 10% unique characters
    }
    // Check for short repeating sequences (2-4 chars),
    // e.g. "ABABAB", "123123123".
    for pattern_len in 2..=4 {
        if is_repeating_pattern(s, pattern_len) {
            return Some(-20);
        }
    }
    None
}

/// True if `s` is the same `pattern_len`-byte sequence repeated end to end.
fn is_repeating_pattern(s: &str, pattern_len: usize) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() < pattern_len * 2 || bytes.len() % pattern_len != 0 {
        return false;
    }
    let pattern = &bytes[..pattern_len];
    bytes.chunks_exact(pattern_len).all(|chunk| chunk == pattern)
}
```

Table Data Heuristics
Detect table-like data using multiple indicators:
- Delimiter Density: High frequency of delimiters (commas, pipes, tabs, nulls)
- Binary Interleaving: Printable characters interspersed with non-printable
- Low Alphanumeric Ratio: Less than 60% alphanumeric content
```rust
fn is_table_data(s: &str) -> bool {
    let total = s.chars().count().max(1) as f64;
    let delimiter_count = s.chars()
        .filter(|&c| matches!(c, ',' | '|' | '\t' | '\0'))
        .count();
    let delimiter_ratio = delimiter_count as f64 / total;
    let alphanum_count = s.chars().filter(|c| c.is_alphanumeric()).count();
    let alphanum_ratio = alphanum_count as f64 / total;
    delimiter_ratio > 0.3 || alphanum_ratio < 0.6
}
```

Implementation Plan
1. Add noise detection module (src/ranker/noise.rs)
   - Entropy calculation function
   - Length penalty function
   - Pattern detection functions
   - Table data heuristics
2. Integrate with ranking system
   - Add noise penalty field to ExtractedString score calculation
   - Apply penalties during ranking phase
   - Ensure penalties are composable (multiple penalties can apply)
3. Configuration
   - Make thresholds tunable via CLI flags or config file
   - Default to conservative values that work for most binaries
4. Testing
   - Unit tests for each detection function with known inputs
   - Integration tests with real binary samples
   - Snapshot tests to verify penalty application
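The "composable penalties" requirement could look like the sketch below: each detector contributes independently and the results are summed. The detector bodies here are simplified stand-ins so the example is self-contained, and the function names are illustrative, not the final module API:

```rust
use std::collections::HashSet;

// Stand-in for the tiered length penalty described above.
fn length_penalty(len: usize) -> i32 {
    match len {
        0..=256 => 0,
        257..=512 => -10,
        513..=1024 => -20,
        _ => -30,
    }
}

// Stand-in for single-character repetition detection.
fn repetition_penalty(s: &str) -> i32 {
    let unique: HashSet<char> = s.chars().collect();
    let total = s.chars().count().max(1);
    if (unique.len() as f64) / (total as f64) < 0.1 { -25 } else { 0 }
}

/// Penalties are additive, so a long, repetitive string accumulates both.
fn total_noise_penalty(s: &str) -> i32 {
    length_penalty(s.chars().count()) + repetition_penalty(s)
}

fn main() {
    let padding = "A".repeat(600);
    // 600 chars -> -20 length penalty; a single repeated char -> -25
    assert_eq!(total_noise_penalty(&padding), -45);
    // A short, varied string collects no penalties.
    assert_eq!(total_noise_penalty("GET /index.html"), 0);
    println!("composable penalties ok");
}
```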
Acceptance Criteria
- Entropy calculation implemented with configurable threshold
- Length penalty with tiered approach (256/512/1024 byte thresholds)
- Repeated pattern detection for single chars and short sequences
- Table data heuristics with delimiter and alphanumeric ratio checks
- Integration with ranking system (penalties affect final scores)
- Unit tests achieving >90% coverage for noise detection module
- Documentation with examples of penalized vs. non-penalized strings
- Performance: Noise detection adds <10% overhead to overall extraction time
Technical Considerations
- Performance: Entropy calculation is O(n) per string; consider caching for reused strings
- Encoding-Aware: UTF-16 strings need byte-pair entropy, not character entropy
- Threshold Tuning: Default values should be validated against diverse binary corpus (ELF, PE, Mach-O)
- False Positives: Some legitimate strings may have high entropy (e.g., crypto keys, hashes); use context tags to avoid over-penalizing
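For the UTF-16 consideration above, one way to compute entropy over 16-bit code units instead of decoded characters is sketched below; the function name and little-endian assumption are illustrative:

```rust
use std::collections::HashMap;

/// Shannon entropy over 16-bit code units of a raw UTF-16LE byte buffer.
/// Working on code units (rather than decoded chars) keeps the byte-pair
/// structure, e.g. the interleaved NULs of UTF-16LE ASCII, in the measure.
fn utf16_unit_entropy(bytes: &[u8]) -> f64 {
    let mut freq: HashMap<u16, usize> = HashMap::new();
    for pair in bytes.chunks_exact(2) {
        let unit = u16::from_le_bytes([pair[0], pair[1]]);
        *freq.entry(unit).or_insert(0) += 1;
    }
    let n = (bytes.len() / 2) as f64;
    if n == 0.0 {
        return 0.0;
    }
    freq.values()
        .map(|&count| {
            let p = count as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // "AAAA" encoded as UTF-16LE: one distinct code unit, so entropy is 0.
    let bytes: Vec<u8> = "AAAA"
        .encode_utf16()
        .flat_map(|u| u.to_le_bytes())
        .collect();
    assert_eq!(utf16_unit_entropy(&bytes), 0.0);
    println!("utf16 unit entropy ok");
}
```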
Dependencies
- Blocked by: Ranking System Foundation (task: stringy-analyzer/ranking-system)
- Requires: ExtractedString type with score field
- Enables: High-quality output that rivals manual analysis
Requirements
Requirement: 5.3
Task ID
stringy-analyzer/noise-penalty-detection
Related
This feature directly addresses the "Eliminates noise" advantage mentioned in the README, helping StringyMcStringFace stop "dumping padding, tables, and interleaved garbage" that plagues standard strings output.