Implement Main String Extraction Orchestrator and Pipeline Integration #36

@unclesp1d3r

Description

Summary

Create the main extraction orchestrator that serves as the primary API for analyzing binaries and extracting meaningful strings. This orchestrator will wire together all components (format detection, parsing, extraction, classification, and ranking) into a cohesive pipeline.

Background

StringyMcStringFace aims to be a smarter alternative to the standard strings command by being data-structure aware, section-aware, and semantically intelligent. The foundation is solid with working binary format parsers (PE, ELF, Mach-O) via goblin, but the core extraction pipeline that coordinates all components needs to be implemented.

Current Status:

  • ✅ Format detection (container::detect_format)
  • ✅ Container parsers (ContainerParser trait with PE/ELF/Mach-O implementations)
  • ✅ Type definitions (FoundString, ContainerInfo, etc.)
  • ❌ String extraction engine (empty extraction/mod.rs)
  • ❌ Classification/tagging system (empty classification/mod.rs)
  • ❌ Ranking/scoring algorithm
  • ❌ Main orchestrator API

Proposed Solution

Architecture

Create a StringAnalyzer orchestrator in src/lib.rs that provides a clean public API:

```rust
pub struct StringAnalyzer {
    min_length: usize,
    encodings: Vec<Encoding>,
    // ... configuration
}

impl StringAnalyzer {
    pub fn new() -> Self { /* ... */ }

    pub fn analyze(&self, data: &[u8]) -> Result<Vec<FoundString>> {
        // 1. Detect format
        // 2. Parse container metadata
        // 3. Extract strings from prioritized sections
        // 4. Classify/tag strings
        // 5. Score/rank strings
        // 6. Return sorted results
    }
}
```

Implementation Plan

Phase 1: Core Extraction (extraction/mod.rs)

  • Implement StringExtractor trait with methods:
    • extract_ascii_utf8(data: &[u8], offset: usize) -> Vec<FoundString>
    • extract_utf16le(data: &[u8], offset: usize) -> Vec<FoundString>
    • extract_utf16be(data: &[u8], offset: usize) -> Vec<FoundString>
  • Section-aware extraction that respects SectionInfo boundaries
  • Configurable minimum length (default 4)
  • Track source (section name, offset, RVA)
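A minimal sketch of the ASCII scanner described above. `FoundString` is reduced to a `(text, offset)` pair and the section/RVA plumbing is omitted, so the names here are illustrative, not the final API:

```rust
// Scan a byte slice for runs of printable ASCII of at least `min_len`
// characters, tracking each run's offset relative to `base_offset`
// (e.g. a section's file offset).
fn extract_ascii(data: &[u8], base_offset: usize, min_len: usize) -> Vec<(String, usize)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in data.iter().enumerate() {
        let printable = (0x20..=0x7e).contains(&b) || b == b'\t';
        match (printable, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                if i - s >= min_len {
                    out.push((String::from_utf8_lossy(&data[s..i]).into_owned(), base_offset + s));
                }
                start = None;
            }
            _ => {}
        }
    }
    // Flush a run that reaches the end of the buffer.
    if let Some(s) = start {
        if data.len() - s >= min_len {
            out.push((String::from_utf8_lossy(&data[s..]).into_owned(), base_offset + s));
        }
    }
    out
}

fn main() {
    let buf = b"\x00\x01hello\x00hi\x00world!\x00";
    for (s, off) in extract_ascii(buf, 0x1000, 4) {
        println!("{off:#x}: {s}"); // "hi" is dropped by the min-length filter
    }
}
```

The UTF-16LE/BE variants would walk the buffer in 2-byte steps with the same run-length logic; multi-byte UTF-8 sequences need a proper validity check rather than this printable-ASCII test.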

Phase 2: Classification System (classification/mod.rs)

  • Implement StringClassifier with pattern matching for:
    • URLs (http://, https://, ftp://)
    • Domains (DNS patterns)
    • IP addresses (IPv4/IPv6)
  • File paths (Unix: /, Windows: C:\)
    • Registry keys (HKEY_*)
    • GUIDs ({xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx})
    • User agents
    • Format strings (%s, %d, {})
    • Base64 patterns
    • Crypto constants (known algorithm identifiers)
  • Return Vec<Tag> for each string
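A hedged sketch of what the tagger could look like for a few of the categories above. The real implementation would likely use the `regex` crate from the dependency list; this version sticks to stdlib string checks for illustration, and the tag names are placeholders:

```rust
// Assign zero or more semantic tags to a candidate string.
fn classify(s: &str) -> Vec<&'static str> {
    let mut tags = Vec::new();
    if s.starts_with("http://") || s.starts_with("https://") || s.starts_with("ftp://") {
        tags.push("url");
    }
    if s.starts_with("HKEY_") {
        tags.push("registry");
    }
    // {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}: 32 hex digits, 4 dashes, 2 braces.
    if s.len() == 38
        && s.starts_with('{')
        && s.ends_with('}')
        && s.bytes().filter(|&b| b == b'-').count() == 4
    {
        tags.push("guid");
    }
    if s.contains("%s") || s.contains("%d") {
        tags.push("format-string");
    }
    tags
}

fn main() {
    println!("{:?}", classify("https://example.com"));
    println!("{:?}", classify("HKEY_LOCAL_MACHINE\\Software"));
}
```

Returning a `Vec` (rather than a single tag) matters because strings can match multiple categories, e.g. a URL that is also a format string.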

Phase 3: Ranking Algorithm

  • Implement scoring system based on:
    • Section type priority (.rodata/.rdata > .data > .text)
    • Tag relevance (URL/domain/GUID = high, format string = medium)
    • String length (longer = more informative, up to a point)
    • Encoding confidence
    • Import/export name bonus
  • Score range: 0-100
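The factors above could combine along these lines; the weights here are illustrative stand-ins, not the final tuning:

```rust
// Combine section priority, best tag relevance, and capped length into
// a 0-100 score. Weights are placeholders for discussion.
fn score(section: &str, tags: &[&str], len: usize) -> u32 {
    let section_score = match section {
        ".rodata" | ".rdata" => 40,
        ".data" => 25,
        ".text" => 10,
        _ => 15,
    };
    // Take the single most relevant tag rather than summing, so a string
    // with many weak tags doesn't outrank one strong indicator.
    let tag_score: u32 = tags
        .iter()
        .map(|t| match *t {
            "url" | "domain" | "guid" => 30,
            "format-string" => 10,
            _ => 5,
        })
        .max()
        .unwrap_or(0);
    // Length helps, but only "up to a point": cap its contribution.
    let len_score = (len as u32).min(30);
    (section_score + tag_score + len_score).min(100)
}

fn main() {
    println!("{}", score(".rodata", &["url"], 20));
    println!("{}", score(".text", &[], 4));
}
```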

Phase 4: Main Orchestrator

  • Wire components together in StringAnalyzer
  • Proper error handling with context propagation
  • Configuration options (min length, encoding filters, tag filters)
  • Memory-efficient processing for large binaries
  • Integration with main.rs CLI
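For the configuration options, a builder-style API on the struct sketched earlier would keep `new()` defaults sensible while letting the CLI override them. The method names below are suggestions, and the `Encoding` enum variants are assumed:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Encoding { Ascii, Utf16Le, Utf16Be }

struct StringAnalyzer {
    min_length: usize,
    encodings: Vec<Encoding>,
}

impl StringAnalyzer {
    // Defaults: min length 4 (per Phase 1), all encodings enabled.
    fn new() -> Self {
        Self {
            min_length: 4,
            encodings: vec![Encoding::Ascii, Encoding::Utf16Le, Encoding::Utf16Be],
        }
    }
    fn min_length(mut self, n: usize) -> Self {
        self.min_length = n;
        self
    }
    fn encodings(mut self, e: Vec<Encoding>) -> Self {
        self.encodings = e;
        self
    }
}

fn main() {
    let analyzer = StringAnalyzer::new()
        .min_length(6)
        .encodings(vec![Encoding::Ascii]);
    println!("min_length = {}", analyzer.min_length);
}
```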

Pipeline Flow

Binary Input
    ↓
Format Detection (container::detect_format)
    ↓
Parser Creation (container::create_parser)
    ↓
Container Parsing (parser.parse())
    ↓
Section Prioritization (by SectionType)
    ↓
String Extraction (extraction module)
    ↓
Classification/Tagging (classification module)
    ↓
Scoring/Ranking
    ↓
Sorted Results (Vec<FoundString>)
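The flow above, end to end, with every stage stubbed to a toy version so the shape is visible in one place. Format detection, parsing, classification, and scoring here are deliberate oversimplifications of the real modules:

```rust
// Toy FoundString with just enough fields for the demo.
struct Found { text: String, section: String, score: u32 }

fn analyze(data: &[u8]) -> Vec<Found> {
    // 1-2. Format detection + container parsing stubbed: pretend the whole
    //      buffer is one ".rodata" section.
    let sections = vec![(".rodata".to_string(), 0usize, data.len())];
    let mut results = Vec::new();
    // 3. Extract printable-ASCII runs of length >= 4 from each section.
    for (name, start, end) in sections {
        let mut run: Vec<u8> = Vec::new();
        for &b in &data[start..end] {
            if (0x20..=0x7e).contains(&b) {
                run.push(b);
            } else {
                flush(&mut run, &name, &mut results);
            }
        }
        flush(&mut run, &name, &mut results);
    }
    // 6. Highest score first.
    results.sort_by(|a, b| b.score.cmp(&a.score));
    results
}

fn flush(run: &mut Vec<u8>, section: &str, out: &mut Vec<Found>) {
    if run.len() >= 4 {
        let text = String::from_utf8_lossy(run).into_owned();
        // 4-5. Classification and scoring collapsed into one toy rule.
        let score = if text.starts_with("http") { 90 } else { 50 };
        out.push(Found { text, section: section.to_string(), score });
    }
    run.clear();
}

fn main() {
    let data = b"\x00https://example.com\x00padding\x00ab\x00";
    for f in analyze(data) {
        println!("{:3} [{}] {}", f.score, f.section, f.text);
    }
}
```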

Implementation Requirements

  1. Error Handling: Use StringyError throughout with proper context
  2. Testing: Unit tests for each component, integration test for full pipeline
  3. Performance: Process large binaries (100MB+) efficiently
  4. Documentation: Doc comments for public API with examples
  5. Integration: Wire into main.rs CLI to replace TODO
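On requirement 1: the crate already defines `StringyError`, so this is only a sketch of the context-propagation idea, with hypothetical variant names:

```rust
use std::fmt;

// Illustrative error shape: each variant carries enough context to
// explain *where* in the pipeline the failure happened.
#[derive(Debug)]
enum StringyError {
    UnknownFormat,
    Parse { context: String },
}

impl fmt::Display for StringyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StringyError::UnknownFormat => write!(f, "unrecognized binary format"),
            StringyError::Parse { context } => write!(f, "parse error: {context}"),
        }
    }
}

impl std::error::Error for StringyError {}

fn main() {
    let err = StringyError::Parse { context: "ELF section header out of bounds".into() };
    println!("{err}");
}
```

With `std::error::Error` implemented, the orchestrator's `Result<Vec<FoundString>>` propagates cleanly through `?` in `analyze` and in the `main.rs` CLI.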

Dependencies

  • Existing: goblin, bstr, regex, serde
  • May need: Pattern matching crates for classification

Acceptance Criteria

  • StringAnalyzer public API implemented in src/lib.rs
  • String extraction engine working for ASCII/UTF-8 and UTF-16LE/BE
  • Classification system tagging strings with semantic categories
  • Ranking algorithm scoring strings by relevance
  • Full pipeline integration test with real PE/ELF/Mach-O binaries
  • CLI in main.rs calls orchestrator and displays results
  • Documentation and examples in doc comments
  • All tests passing

Related Issues

This is the main integration point that blocks most other features. Once complete, we can add:

  • Output formatters (JSON, YARA, human-readable)
  • Advanced filtering and search
  • Performance optimizations
  • Additional classification patterns

Task-ID: stringy-analyzer/main-extraction-pipeline
