Summary
Create the main extraction orchestrator that serves as the primary API for analyzing binaries and extracting meaningful strings. This orchestrator will wire together all components (format detection, parsing, extraction, classification, and ranking) into a cohesive pipeline.
Background
StringyMcStringFace aims to be a smarter alternative to the standard strings command by being data-structure aware, section-aware, and semantically intelligent. The foundation is solid with working binary format parsers (PE, ELF, Mach-O) via goblin, but the core extraction pipeline that coordinates all components needs to be implemented.
Current Status:
- ✅ Format detection (`container::detect_format`)
- ✅ Container parsers (`ContainerParser` trait with PE/ELF/Mach-O implementations)
- ✅ Type definitions (`FoundString`, `ContainerInfo`, etc.)
- ❌ String extraction engine (empty `extraction/mod.rs`)
- ❌ Classification/tagging system (empty `classification/mod.rs`)
- ❌ Ranking/scoring algorithm
- ❌ Main orchestrator API
Proposed Solution
Architecture
Create a `StringAnalyzer` orchestrator in `src/lib.rs` that provides a clean public API:

```rust
pub struct StringAnalyzer {
    min_length: usize,
    encodings: Vec<Encoding>,
    // ... configuration
}

impl StringAnalyzer {
    pub fn new() -> Self { /* ... */ }

    pub fn analyze(&self, data: &[u8]) -> Result<Vec<FoundString>> {
        // 1. Detect format
        // 2. Parse container metadata
        // 3. Extract strings from prioritized sections
        // 4. Classify/tag strings
        // 5. Score/rank strings
        // 6. Return sorted results
    }
}
```

Implementation Plan
Phase 1: Core Extraction (`extraction/mod.rs`)
- Implement a `StringExtractor` trait with methods:
  - `extract_ascii_utf8(data: &[u8], offset: usize) -> Vec<FoundString>`
  - `extract_utf16le(data: &[u8], offset: usize) -> Vec<FoundString>`
  - `extract_utf16be(data: &[u8], offset: usize) -> Vec<FoundString>`
- Section-aware extraction that respects `SectionInfo` boundaries
- Configurable minimum length (default 4)
- Track source (section name, offset, RVA)
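A minimal sketch of what the ASCII scanning loop could look like. The `FoundString` here is a simplified stand-in with just `text` and `offset` fields, not the crate's actual type, and the printable-byte test is an assumption:

```rust
/// Simplified stand-in for the crate's FoundString type.
#[derive(Debug, PartialEq)]
pub struct FoundString {
    pub text: String,
    pub offset: usize, // byte offset within the scanned slice
}

/// Scan `data` for runs of printable ASCII bytes of at least `min_len`.
pub fn extract_ascii(data: &[u8], min_len: usize) -> Vec<FoundString> {
    let mut results = Vec::new();
    let mut start = None;
    for (i, &b) in data.iter().enumerate() {
        let printable = (0x20..0x7f).contains(&b) || b == b'\t';
        match (printable, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                if i - s >= min_len {
                    results.push(FoundString {
                        text: String::from_utf8_lossy(&data[s..i]).into_owned(),
                        offset: s,
                    });
                }
                start = None;
            }
            _ => {}
        }
    }
    // Flush a run that extends to the end of the buffer.
    if let Some(s) = start {
        if data.len() - s >= min_len {
            results.push(FoundString {
                text: String::from_utf8_lossy(&data[s..]).into_owned(),
                offset: s,
            });
        }
    }
    results
}
```

The UTF-16 variants would follow the same shape, stepping two bytes at a time and decoding code units in the chosen byte order.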
Phase 2: Classification System (`classification/mod.rs`)
- Implement `StringClassifier` with pattern matching for:
  - URLs (`http://`, `https://`, `ftp://`)
  - Domains (DNS patterns)
  - IP addresses (IPv4/IPv6)
  - File paths (Unix: `/`, Windows: `C:\`)
  - Registry keys (`HKEY_*`)
  - GUIDs (`{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}`)
  - User agents
  - Format strings (`%s`, `%d`, `{}`)
  - Base64 patterns
  - Crypto constants (known algorithm identifiers)
- Return a `Vec<Tag>` for each string
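A sketch of the classifier's shape, assuming a small `Tag` enum. The real implementation would likely use the `regex` crate already in the dependency list; plain string checks keep this sketch self-contained:

```rust
/// Simplified stand-in for the crate's Tag type.
#[derive(Debug, PartialEq, Eq)]
pub enum Tag {
    Url,
    WindowsPath,
    UnixPath,
    RegistryKey,
    Guid,
    FormatString,
}

/// Assign zero or more tags to a candidate string.
pub fn classify(s: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if s.starts_with("http://") || s.starts_with("https://") || s.starts_with("ftp://") {
        tags.push(Tag::Url);
    }
    let b = s.as_bytes();
    // Drive-letter paths like `C:\...`; byte comparisons avoid char-boundary panics.
    if b.len() >= 3 && b[0].is_ascii_alphabetic() && b[1] == b':' && b[2] == b'\\' {
        tags.push(Tag::WindowsPath);
    } else if s.starts_with('/') {
        tags.push(Tag::UnixPath);
    }
    if s.starts_with("HKEY_") {
        tags.push(Tag::RegistryKey);
    }
    // GUID shape: {8-4-4-4-12} hex digits, 38 chars with braces.
    if s.len() == 38 && s.starts_with('{') && s.ends_with('}') {
        let ok = s[1..37].chars().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        });
        if ok {
            tags.push(Tag::Guid);
        }
    }
    if s.contains("%s") || s.contains("%d") {
        tags.push(Tag::FormatString);
    }
    tags
}
```

Returning all matching tags (rather than the first) lets the ranking phase weigh combinations, e.g. a URL that is also a format string.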
Phase 3: Ranking Algorithm
- Implement a scoring system based on:
  - Section type priority (`.rodata`/`.rdata` > `.data` > `.text`)
  - Tag relevance (URL/domain/GUID = high, format string = medium)
  - String length (longer = more informative, up to a point)
  - Encoding confidence
  - Import/export name bonus
- Score range: 0-100
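One way to combine these factors is additively, with a length cap so long blobs do not dominate, clamped to the 0-100 range. All weights below are illustrative assumptions, not a settled formula:

```rust
// Section priority, per the ordering above. Names and weights are illustrative.
fn section_score(section: &str) -> u32 {
    match section {
        ".rodata" | ".rdata" => 30,
        ".data" => 20,
        ".text" => 10,
        _ => 5,
    }
}

// Take the strongest tag rather than summing, so heavily tagged
// strings don't blow past semantically richer ones.
fn tag_score(tags: &[&str]) -> u32 {
    tags.iter()
        .map(|t| match *t {
            "url" | "domain" | "guid" => 30,
            "format-string" => 15,
            _ => 5,
        })
        .max()
        .unwrap_or(0)
}

/// Combine the factors into a 0-100 score; length contributes up to
/// 20 points (capped at 40 bytes).
pub fn score(section: &str, tags: &[&str], len: usize) -> u32 {
    let length_score = (len as u32).min(40) / 2;
    (section_score(section) + tag_score(tags) + length_score).min(100)
}
```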
Phase 4: Main Orchestrator
- Wire components together in `StringAnalyzer`
- Proper error handling with context propagation
- Configuration options (min length, encoding filters, tag filters)
- Memory-efficient processing for large binaries
- Integration with the `main.rs` CLI
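The configuration options could be exposed builder-style so the CLI can chain them. The method and field names below are assumptions, not a settled API:

```rust
// Hypothetical builder-style configuration for StringAnalyzer.
pub struct StringAnalyzer {
    min_length: usize,
    include_utf16: bool,
}

impl StringAnalyzer {
    /// Defaults: minimum length 4, UTF-16 scanning enabled.
    pub fn new() -> Self {
        StringAnalyzer { min_length: 4, include_utf16: true }
    }

    /// Set the minimum string length to report.
    pub fn min_length(mut self, n: usize) -> Self {
        self.min_length = n;
        self
    }

    /// Enable or disable UTF-16LE/BE extraction.
    pub fn include_utf16(mut self, yes: bool) -> Self {
        self.include_utf16 = yes;
        self
    }
}
```

Consuming-`self` builders keep call sites short (`StringAnalyzer::new().min_length(6)`) and avoid a separate config struct.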
Pipeline Flow

```
Binary Input
    ↓
Format Detection (container::detect_format)
    ↓
Parser Creation (container::create_parser)
    ↓
Container Parsing (parser.parse())
    ↓
Section Prioritization (by SectionType)
    ↓
String Extraction (extraction module)
    ↓
Classification/Tagging (classification module)
    ↓
Scoring/Ranking
    ↓
Sorted Results (Vec<FoundString>)
```
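The flow above can be sketched end to end with stub stages standing in for the container, extraction, and classification modules (everything here is a simplified placeholder for the real pipeline):

```rust
// Stand-in result type; the real pipeline returns Vec<FoundString>.
#[derive(Debug)]
struct Found {
    text: String,
    score: u32,
}

// Stand-in for section-aware extraction: split on NUL bytes and keep
// printable runs of at least 4 bytes.
fn extract(data: &[u8]) -> Vec<String> {
    data.split(|&b| b == 0)
        .filter(|s| s.len() >= 4 && s.iter().all(u8::is_ascii_graphic))
        .map(|s| String::from_utf8_lossy(s).into_owned())
        .collect()
}

// Stand-in for classification + ranking: longer strings rank higher.
fn score(s: &str) -> u32 {
    (s.len() as u32).min(100)
}

// The orchestrator: extract, score, then sort highest-first.
fn analyze(data: &[u8]) -> Vec<Found> {
    let mut found: Vec<Found> = extract(data)
        .into_iter()
        .map(|text| Found { score: score(&text), text })
        .collect();
    found.sort_by(|a, b| b.score.cmp(&a.score));
    found
}
```

The real `analyze` would insert the format-detection and container-parsing stages before extraction and propagate errors instead of assuming success.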
Implementation Requirements
- Error Handling: Use `StringyError` throughout with proper context
- Testing: Unit tests for each component, integration test for the full pipeline
- Performance: Process large binaries (100 MB+) efficiently
- Documentation: Doc comments for the public API with examples
- Integration: Wire into the `main.rs` CLI to replace the TODO
Dependencies
- Existing: `goblin`, `bstr`, `regex`, `serde`
- May need: pattern-matching crates for classification
Acceptance Criteria
- `StringAnalyzer` public API implemented in `src/lib.rs`
- String extraction engine working for ASCII/UTF-8 and UTF-16LE/BE
- Classification system tagging strings with semantic categories
- Ranking algorithm scoring strings by relevance
- Full pipeline integration test with real PE/ELF/Mach-O binaries
- CLI in `main.rs` calls the orchestrator and displays results
- Documentation and examples in doc comments
- All tests passing
Related Issues
This is the main integration point that blocks most other features. Once complete, we can add:
- Output formatters (JSON, YARA, human-readable)
- Advanced filtering and search
- Performance optimizations
- Additional classification patterns