
Implement Noise Penalty Detection for String Quality Scoring #24

@unclesp1d3r

Description


Overview

Implement a noise penalty detection system that identifies and penalizes low-quality strings during the ranking phase. This feature is critical for ensuring that StringyMcStringFace surfaces meaningful strings while filtering out garbage data, padding bytes, encrypted content, and table artifacts that plague traditional strings output.

Problem Statement

Binary files contain various types of noise that should be deprioritized or filtered:

  1. High Entropy Strings: Random-looking data from encryption, compression, or binary blobs (e.g., xK9#mP@vQ2%nR)
  2. Excessive Length: Extremely long strings are often padding, base64 blobs, or table data rather than meaningful text
  3. Repeated Patterns: Padding sequences (AAAA..., \x00\x00...) or table delimiters that add no semantic value
  4. Table Data: Structured binary data that appears as strings but contains mostly non-printable or low-information content

Without noise detection, the ranking system will surface these low-value strings alongside genuinely useful data like URLs, file paths, and error messages.

Proposed Solution

High Entropy Detection

Calculate Shannon entropy for each extracted string and apply penalties for entropy above threshold:

use std::collections::HashMap;

fn calculate_entropy(s: &str) -> f64 {
    let mut frequencies = HashMap::new();
    for c in s.chars() {
        *frequencies.entry(c).or_insert(0) += 1;
    }

    // Use the character count, not the byte length, so multi-byte
    // UTF-8 characters don't skew the probabilities.
    let len = s.chars().count() as f64;
    frequencies.values()
        .map(|&count| {
            let p = count as f64 / len;
            -p * p.log2()
        })
        .sum()
}

// Apply a penalty when entropy exceeds the threshold (tunable; see tiers below)
// Natural language typically has entropy of 3.5-4.5 bits per character
// Random printable ASCII approaches log2(95) ≈ 6.6; arbitrary bytes approach 8.0

Penalty Approach:

  • Entropy < 4.0: No penalty (typical natural language)
  • Entropy 4.0-5.0: Linear penalty scaling from 0 to -30 points
  • Entropy > 5.0: Fixed penalty of -30 points
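
The tiers above could be mapped to a score adjustment roughly as follows; the function name and the linear interpolation are illustrative, and the 4.0/5.0 thresholds would ultimately come from configuration:

```rust
// Illustrative mapping from Shannon entropy to a score adjustment.
// The 4.0/5.0 thresholds and the -30 cap mirror the tiers above.
fn entropy_penalty(entropy: f64) -> i32 {
    if entropy < 4.0 {
        0 // natural language: no penalty
    } else if entropy <= 5.0 {
        // Linear ramp: 4.0 maps to 0, 5.0 maps to -30
        (-30.0 * (entropy - 4.0)).round() as i32
    } else {
        -30 // fixed cap for very high entropy
    }
}
```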

Excessive Length Penalty

Penalize strings beyond reasonable lengths using a tiered approach:

fn length_penalty(len: usize) -> i32 {
    match len {
        0..=256 => 0,           // Normal strings
        257..=512 => -10,       // Slightly long
        513..=1024 => -20,      // Very long (likely base64/table)
        _ => -30                // Extremely long (definitely noise)
    }
}

Repeated Pattern Detection

Detect strings dominated by repeated characters or short sequences:

use std::collections::HashSet;

fn detect_repeated_patterns(s: &str) -> Option<i32> {
    let char_count = s.chars().count();
    if char_count == 0 {
        return None;
    }

    // Check for single-character (or near-single-character) repetition
    let unique_chars = s.chars().collect::<HashSet<_>>().len();
    let repetition_ratio = unique_chars as f64 / char_count as f64;

    if repetition_ratio < 0.1 {
        return Some(-25); // fewer than 10% distinct characters
    }

    // Check for short repeating sequences (2-4 chars),
    // e.g., "ABABAB", "123123123"
    for pattern_len in 2..=4 {
        if is_repeating_pattern(s, pattern_len) {
            return Some(-20);
        }
    }

    None
}
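
The sketch above calls an `is_repeating_pattern` helper that is not shown; a minimal version (name and signature assumed from the call site) could be:

```rust
// Returns true if `s` is an exact repetition of its first `pattern_len`
// characters, e.g. "ABABAB" with pattern_len = 2. Assumed helper for the
// detect_repeated_patterns sketch above.
fn is_repeating_pattern(s: &str, pattern_len: usize) -> bool {
    let chars: Vec<char> = s.chars().collect();
    // Require at least two full repetitions and an exact multiple
    if pattern_len == 0 || chars.len() < pattern_len * 2 || chars.len() % pattern_len != 0 {
        return false;
    }
    let pattern = &chars[..pattern_len];
    chars.chunks(pattern_len).all(|chunk| chunk == pattern)
}
```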

Table Data Heuristics

Detect table-like data using multiple indicators:

  • Delimiter Density: High frequency of delimiters (commas, pipes, tabs, nulls)
  • Binary Interleaving: Printable characters interspersed with non-printable
  • Low Alphanumeric Ratio: Less than 60% alphanumeric content

fn is_table_data(s: &str) -> bool {
    let char_count = s.chars().count();
    if char_count == 0 {
        return false;
    }

    let delimiter_count = s.chars().filter(|&c| matches!(c, ',' | '|' | '\t' | '\0')).count();
    let delimiter_ratio = delimiter_count as f64 / char_count as f64;

    let alphanum_count = s.chars().filter(|c| c.is_alphanumeric()).count();
    let alphanum_ratio = alphanum_count as f64 / char_count as f64;

    delimiter_ratio > 0.3 || alphanum_ratio < 0.6
}

Implementation Plan

  1. Add noise detection module (src/ranker/noise.rs)

    • Entropy calculation function
    • Length penalty function
    • Pattern detection functions
    • Table data heuristics
  2. Integrate with ranking system

    • Add noise penalty field to ExtractedString score calculation
    • Apply penalties during ranking phase
    • Ensure penalties are composable (multiple penalties can apply)
  3. Configuration

    • Make thresholds tunable via CLI flags or config file
    • Default conservative values that work for most binaries
  4. Testing

    • Unit tests for each detection function with known inputs
    • Integration tests with real binary samples
    • Snapshot tests to verify penalty application
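
Step 2's requirement that penalties be composable could be met by having each detector return an optional penalty and summing whatever applies. A minimal sketch; the aggregate function and the inlined detectors are illustrative stand-ins for the real detection functions:

```rust
use std::collections::HashSet;

// Hypothetical aggregation: each detector returns an optional penalty
// and the results are summed, so multiple penalties can stack.
fn total_noise_penalty(s: &str) -> i32 {
    let detectors: [fn(&str) -> Option<i32>; 2] = [
        // Length check (first tier only, for brevity)
        |s| if s.chars().count() > 256 { Some(-10) } else { None },
        // Single-character repetition check
        |s| {
            let unique = s.chars().collect::<HashSet<_>>().len();
            let count = s.chars().count().max(1);
            if (unique as f64) / (count as f64) < 0.1 { Some(-25) } else { None }
        },
    ];
    detectors.iter().filter_map(|d| d(s)).sum()
}
```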

Acceptance Criteria

  • Entropy calculation implemented with configurable threshold
  • Length penalty with tiered approach (256/512/1024 byte thresholds)
  • Repeated pattern detection for single chars and short sequences
  • Table data heuristics with delimiter and alphanumeric ratio checks
  • Integration with ranking system (penalties affect final scores)
  • Unit tests achieving >90% coverage for noise detection module
  • Documentation with examples of penalized vs. non-penalized strings
  • Performance: Noise detection adds <10% overhead to overall extraction time

Technical Considerations

  • Performance: Entropy calculation is O(n) per string; consider caching for reused strings
  • Encoding-Aware: UTF-16 strings need byte-pair entropy, not character entropy
  • Threshold Tuning: Default values should be validated against diverse binary corpus (ELF, PE, Mach-O)
  • False Positives: Some legitimate strings may have high entropy (e.g., crypto keys, hashes); use context tags to avoid over-penalizing
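
For the UTF-16 point above, entropy could be computed over 16-bit code units rather than decoded characters. A sketch, assuming the extractor keeps the raw bytes of UTF-16LE regions alongside the decoded text:

```rust
use std::collections::HashMap;

// Sketch: Shannon entropy over 16-bit little-endian code units, for
// strings extracted from UTF-16LE regions. Assumes raw bytes are
// available; a trailing odd byte is ignored by chunks_exact.
fn utf16le_entropy(bytes: &[u8]) -> f64 {
    let mut frequencies: HashMap<u16, u32> = HashMap::new();
    for pair in bytes.chunks_exact(2) {
        let unit = u16::from_le_bytes([pair[0], pair[1]]);
        *frequencies.entry(unit).or_insert(0) += 1;
    }
    let total = frequencies.values().sum::<u32>() as f64;
    if total == 0.0 {
        return 0.0;
    }
    frequencies
        .values()
        .map(|&count| {
            let p = count as f64 / total;
            -p * p.log2()
        })
        .sum()
}
```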

Dependencies

  • Blocked by: Ranking System Foundation (task: stringy-analyzer/ranking-system)
  • Requires: ExtractedString type with score field
  • Enables: High-quality output that rivals manual analysis

Requirements

Requirement: 5.3

Task ID

stringy-analyzer/noise-penalty-detection

Related

This feature directly addresses the "Eliminates noise" advantage mentioned in the README, helping StringyMcStringFace stop "dumping padding, tables, and interleaved garbage" that plagues standard strings output.
