
Performance: Implement Regex Compilation Caching for Semantic Classifier #33

@unclesp1d3r

Description

Background

StringyMcStringFace's semantic classifier needs to identify and tag various string patterns extracted from binaries, including:

  • URLs and domains
  • IP addresses (IPv4/IPv6)
  • File paths (POSIX/Windows)
  • Registry keys
  • GUIDs
  • Email addresses
  • JWT tokens
  • Base64 sequences
  • Format strings (printf-style)
  • User agent strings

Each classification requires regex pattern matching, and the classifier will process thousands of strings per binary. Without caching, regex compilation overhead becomes a significant performance bottleneck.

Problem Statement

Currently, the classification module (src/classification/mod.rs) is minimal. When implemented, it will need to compile multiple complex regex patterns. Compiling regex patterns on-demand for each string classification would:

  • Add significant CPU overhead (regex compilation is expensive)
  • Create unnecessary memory allocations
  • Slow down the overall analysis pipeline
  • Make the tool less suitable for batch processing

Proposed Solution

Implement a regex caching strategy using lazy initialization:

1. Add Dependencies

Add regex and once_cell to Cargo.toml (or skip once_cell and use std::sync::LazyLock, stable since Rust 1.80):

[dependencies]
regex = "1.11"
once_cell = "1.20"  # or use std::sync::LazyLock

2. Implement Lazy-Initialized Regex Cache

Create a module with static, lazily-initialized regex patterns:

use once_cell::sync::Lazy;
use regex::Regex;

pub struct PatternCache {
    pub url: &'static Lazy<Regex>,
    pub domain: &'static Lazy<Regex>,
    pub ipv4: &'static Lazy<Regex>,
    pub ipv6: &'static Lazy<Regex>,
    pub filepath_posix: &'static Lazy<Regex>,
    pub filepath_windows: &'static Lazy<Regex>,
    pub registry_key: &'static Lazy<Regex>,
    pub guid: &'static Lazy<Regex>,
    pub email: &'static Lazy<Regex>,
    pub base64: &'static Lazy<Regex>,
    pub format_string: &'static Lazy<Regex>,
    pub user_agent: &'static Lazy<Regex>,
}

static URL_PATTERN: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"https?://[\w.-]+(?:/[\w./?%&=-]*)?").unwrap()
});

static GUID_PATTERN: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"(?i)[{]?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}[}]?").unwrap()
});

// ... additional patterns

3. Use Cached Patterns in Classifier

The semantic classifier should reference these static patterns:

pub fn classify_string(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    
    if URL_PATTERN.is_match(text) {
        tags.push(Tag::Url);
    }
    if GUID_PATTERN.is_match(text) {
        tags.push(Tag::Guid);
    }
    // ... additional classifications
    
    tags
}

4. Consider RegexSet for Multiple Patterns

For optimal performance when checking many patterns against the same text, consider RegexSet, which tests all patterns in a single scan and reports which ones matched:

use regex::RegexSet;

static PATTERN_SET: Lazy<RegexSet> = Lazy::new(|| {
    RegexSet::new(&[
        r"https?://.*",  // URL
        r"[{]?[0-9a-f]{8}-.*",  // GUID
        // ... more patterns
    ]).unwrap()
});

Implementation Tasks

  • Add regex dependency to Cargo.toml
  • Add once_cell or use std::sync::LazyLock for lazy initialization
  • Create src/classification/patterns.rs with cached regex patterns
  • Implement pattern definitions for all semantic tags
  • Create classify_string() function using cached patterns
  • Add unit tests for each pattern
  • Create benchmark suite to measure performance improvement
  • Document pattern syntax and matching behavior

Success Criteria

  • All regex patterns compiled once on first use
  • Zero regex compilation overhead during string classification
  • Measurable performance improvement (target: >90% reduction in classification time)
  • Comprehensive test coverage for all patterns
  • Benchmarks demonstrating caching effectiveness

Performance Benchmarks

Create benchmarks in benches/regex_caching.rs to measure:

  1. Compilation overhead: Time to compile patterns with/without caching
  2. Classification throughput: Strings classified per second
  3. Memory usage: Compare cached vs. on-demand compilation
  4. Batch processing: Time to classify 10k, 100k, 1M strings

Expected improvement: 10-100x faster classification depending on pattern complexity and string volume.

Related

  • Part of v0.1 MVP milestone
  • Blocks efficient implementation of semantic classification (requirement 8.3)
  • Required for achieving acceptable performance on large binaries
