
Performance: Implement Regex Compilation Caching for Semantic Classifier #33

@unclesp1d3r

Description

Background

StringyMcStringFace's semantic classifier needs to identify and tag various string patterns extracted from binaries, including:

  • URLs and domains
  • IP addresses (IPv4/IPv6)
  • File paths (POSIX/Windows)
  • Registry keys
  • GUIDs
  • Email addresses
  • JWT tokens
  • Base64 sequences
  • Format strings (printf-style)
  • User agent strings

Each classification requires regex pattern matching, and the classifier will process thousands of strings per binary. Without caching, regex compilation overhead becomes a significant performance bottleneck.

Problem Statement

Currently, the classification module (src/classification/mod.rs) is minimal. When implemented, it will need to compile multiple complex regex patterns. Compiling regex patterns on-demand for each string classification would:

  • Add significant CPU overhead (regex compilation is expensive)
  • Create unnecessary memory allocations
  • Slow down the overall analysis pipeline
  • Make the tool less suitable for batch processing

Proposed Solution

Implement a regex caching strategy using lazy initialization:

1. Add Dependencies

Add regex and once_cell to Cargo.toml (or skip once_cell and use std::sync::LazyLock, stable since Rust 1.80):

[dependencies]
regex = "1.11"
once_cell = "1.20"  # or use std::sync::LazyLock

2. Implement Lazy-Initialized Regex Cache

Create a module with static, lazily-initialized regex patterns:

use once_cell::sync::Lazy;
use regex::Regex;

pub struct PatternCache {
    pub url: &'static Lazy<Regex>,
    pub domain: &'static Lazy<Regex>,
    pub ipv4: &'static Lazy<Regex>,
    pub ipv6: &'static Lazy<Regex>,
    pub filepath_posix: &'static Lazy<Regex>,
    pub filepath_windows: &'static Lazy<Regex>,
    pub registry_key: &'static Lazy<Regex>,
    pub guid: &'static Lazy<Regex>,
    pub email: &'static Lazy<Regex>,
    pub base64: &'static Lazy<Regex>,
    pub format_string: &'static Lazy<Regex>,
    pub user_agent: &'static Lazy<Regex>,
}

static URL_PATTERN: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"https?://[\w.-]+(?:/[\w./?%&=-]*)?").unwrap()
});

static GUID_PATTERN: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"(?i)[{]?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}[}]?").unwrap()
});

// ... additional patterns

3. Use Cached Patterns in Classifier

The semantic classifier should reference these static patterns:

pub fn classify_string(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    
    if URL_PATTERN.is_match(text) {
        tags.push(Tag::Url);
    }
    if GUID_PATTERN.is_match(text) {
        tags.push(Tag::Guid);
    }
    // ... additional classifications
    
    tags
}

4. Consider RegexSet for Multiple Patterns

For optimal performance when checking many patterns against the same text, consider RegexSet, which tests all patterns in a single scan and reports which ones matched:

use regex::RegexSet;

static PATTERN_SET: Lazy<RegexSet> = Lazy::new(|| {
    RegexSet::new(&[
        r"https?://.*",  // URL
        r"[{]?[0-9a-f]{8}-.*",  // GUID
        // ... more patterns
    ]).unwrap()
});

Implementation Tasks

  • Add regex dependency to Cargo.toml
  • Add once_cell or use std::sync::LazyLock for lazy initialization
  • Create src/classification/patterns.rs with cached regex patterns
  • Implement pattern definitions for all semantic tags
  • Create classify_string() function using cached patterns
  • Add unit tests for each pattern
  • Create benchmark suite to measure performance improvement
  • Document pattern syntax and matching behavior

Success Criteria

  • All regex patterns compiled once on first use
  • Zero regex compilation overhead during string classification
  • Measurable performance improvement (target: >90% reduction in classification time)
  • Comprehensive test coverage for all patterns
  • Benchmarks demonstrating caching effectiveness

Performance Benchmarks

Create benchmarks in benches/regex_caching.rs to measure:

  1. Compilation overhead: Time to compile patterns with/without caching
  2. Classification throughput: Strings classified per second
  3. Memory usage: Compare cached vs. on-demand compilation
  4. Batch processing: Time to classify 10k, 100k, 1M strings

Expected improvement: 10-100x faster classification depending on pattern complexity and string volume.

Related

  • Part of v0.1 MVP milestone
  • Blocks efficient implementation of semantic classification (requirement 8.3)
  • Required for achieving acceptable performance on large binaries
