Description
Overview
This epic tracks the implementation of the core MVP functionality for StringyMcStringFace - the complete pipeline from binary parsing to user output. This represents the minimum viable product for a weekend demonstration.
Context
StringyMcStringFace is a smarter alternative to the Unix strings command that uses binary analysis to extract meaningful strings from executables. The project foundation is complete with:
- ✅ Core infrastructure and data types
- ✅ Format detection (ELF, PE, Mach-O) via goblin
- ✅ Container parsers with section classification
This epic covers the remaining components needed for an end-to-end working demo.
Pipeline Architecture
Binary Input
↓
[goblin] Parse format & extract sections
↓
[Section List] Classify sections by string likelihood
↓
[Extraction] ASCII/UTF-8 + UTF-16LE/BE extraction
↓
[Classification] Tag strings (URL, path, GUID, etc.)
↓
[Ranking] Score by relevance & section importance
↓
[Output] JSONL format + Human-readable TTY view
Scope
In Scope for MVP
- String Extraction Engine (src/extraction/mod.rs)
  - ASCII/UTF-8 extraction from byte streams
  - UTF-16LE/BE extraction (critical for PE binaries)
  - Minimum length filtering (default: 4 chars)
  - Confidence scoring for encoding detection
- Semantic Classification (src/classification/mod.rs)
  - Pattern matching for high-value strings:
    - URLs (http://, https://)
    - File paths (Unix: /, Windows: C:\)
    - GUIDs ({xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx})
    - IP addresses
    - Format strings (%s, %d, etc.)
  - Tagging system for multiple classifications per string
- Ranking System
  - Section-based scoring (e.g., .rodata > .data)
  - Classification boost (URLs, GUIDs rank higher)
  - Length penalties for very short/long strings
  - Top-N selection for output
- Output Formats (src/output/mod.rs)
  - JSONL: One JSON object per line for pipeline integration
  - TTY/Human-readable: Formatted table with columns:
    - Score
    - Offset (hex)
    - Section name
    - Tags
    - String content (truncated if needed)
- CLI Integration (src/main.rs)
  - Accept binary path as positional argument
  - Basic flags: --json, --format, --min-len
  - Wire up the complete pipeline
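As a rough illustration of the extraction step listed above, here is a minimal sketch of printable-run scanning with a minimum length filter. The names (`Found`, `extract_ascii`, `extract_utf16le`) are hypothetical and not taken from the project. The UTF-16LE scan only recognizes code units in the ASCII range, and UTF-16BE, multi-byte UTF-8, and confidence scoring are left out.

```rust
/// A string found in a byte buffer, with its starting byte offset.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct Found {
    offset: usize,
    text: String,
}

/// Treats space through tilde, plus tab, as printable.
fn is_printable(b: u8) -> bool {
    b == b'\t' || (0x20..=0x7e).contains(&b)
}

/// Collects runs of printable ASCII bytes that are at least `min_len` long.
fn extract_ascii(data: &[u8], min_len: usize) -> Vec<Found> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, &b) in data.iter().enumerate() {
        if !is_printable(b) {
            if i - start >= min_len {
                let text = String::from_utf8_lossy(&data[start..i]).into_owned();
                out.push(Found { offset: start, text });
            }
            start = i + 1;
        }
    }
    // Flush a run that reaches the end of the buffer.
    if data.len() - start >= min_len {
        let text = String::from_utf8_lossy(&data[start..]).into_owned();
        out.push(Found { offset: start, text });
    }
    out
}

/// Collects runs of UTF-16LE code units whose low byte is printable ASCII
/// and whose high byte is zero. Simplification: non-ASCII code units end a run.
fn extract_utf16le(data: &[u8], min_len: usize) -> Vec<Found> {
    let mut out = Vec::new();
    let mut i = 0;
    while i + 1 < data.len() {
        let start = i;
        let mut text = String::new();
        while i + 1 < data.len() && is_printable(data[i]) && data[i + 1] == 0 {
            text.push(data[i] as char);
            i += 2;
        }
        if text.len() >= min_len {
            out.push(Found { offset: start, text });
        }
        // Nothing consumed: step one byte so odd alignments are tried too.
        if i == start {
            i += 1;
        }
    }
    out
}
```

Calling `extract_ascii(bytes, 4)` on a section's bytes would apply the default minimum length of 4 from the scope list. Offsets here are relative to the buffer passed in, so a caller would need to add the section's file offset.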
Out of Scope (Post-MVP)
- Advanced filters (--only url,filepath)
- YARA output format
- PE-specific resource extraction
- Rust symbol demangling in output
- Configuration file support
- Progress indicators for large binaries
Acceptance Criteria
- Can extract ASCII/UTF-8 strings from all three binary formats (ELF, PE, Mach-O)
- UTF-16 extraction works correctly on Windows PE files
- At least 5 semantic tags implemented (URL, path, GUID, IP, format string)
- Ranking algorithm deprioritizes low-value sections like .bss
- JSONL output is valid and parseable by jq
- TTY output displays top 50 strings by default with proper formatting
- CLI can process a real-world binary (e.g., /bin/ls) without errors
- Output is significantly less noisy than the standard strings command
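To make the five-tag criterion concrete, here is a minimal sketch of a tagger that can attach several tags to one string. It uses hand-written checks from the standard library instead of regular expressions, which is an assumption for the sake of a self-contained example. The `Tag` enum and `classify` function are hypothetical names, and the heuristics are deliberately loose (for example, the IP check only matches a string that is an IPv4 address in its entirety).

```rust
use std::net::Ipv4Addr;

/// Hypothetical tag set covering the five categories named in the criteria.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tag {
    Url,
    FilePath,
    Guid,
    IpAddress,
    FormatString,
}

fn is_url(s: &str) -> bool {
    ["http://", "https://"]
        .iter()
        .any(|p| s.starts_with(p) && s.len() > p.len())
}

fn is_file_path(s: &str) -> bool {
    let b = s.as_bytes();
    let unix = b.len() > 1 && b[0] == b'/';
    let windows = b.len() >= 3 && b[0].is_ascii_alphabetic() && b[1] == b':' && b[2] == b'\\';
    unix || windows
}

/// Accepts the braced form {8-4-4-4-12} made of hex digits.
fn is_guid(s: &str) -> bool {
    let inner = match s.strip_prefix('{').and_then(|r| r.strip_suffix('}')) {
        Some(inner) => inner,
        None => return false,
    };
    let groups: Vec<&str> = inner.split('-').collect();
    let lens = [8usize, 4, 4, 4, 12];
    groups.len() == lens.len()
        && groups
            .iter()
            .zip(lens.iter())
            .all(|(g, n)| g.len() == *n && g.chars().all(|c| c.is_ascii_hexdigit()))
}

/// Looks for a '%' directly followed by a common printf conversion character.
fn is_format_string(s: &str) -> bool {
    s.as_bytes()
        .windows(2)
        .any(|w| w[0] == b'%' && b"sdiuxXfcp".contains(&w[1]))
}

/// Returns every tag that applies; a string may carry more than one.
fn classify(s: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if is_url(s) {
        tags.push(Tag::Url);
    }
    if is_file_path(s) {
        tags.push(Tag::FilePath);
    }
    if is_guid(s) {
        tags.push(Tag::Guid);
    }
    if s.parse::<Ipv4Addr>().is_ok() {
        tags.push(Tag::IpAddress);
    }
    if is_format_string(s) {
        tags.push(Tag::FormatString);
    }
    tags
}
```

A string such as `/usr/lib/%s.so` would receive both the path and the format-string tag, which is the multi-tag behaviour the scope list asks for. A regex-based classifier would likely be more precise, at the cost of a dependency.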
Implementation Order
- String Extraction - Foundation for everything else
- Basic Classification - URL + filepath patterns first
- Ranking System - Section scoring + classification boost
- JSONL Output - Easiest output format
- TTY Output - Human-friendly display
- CLI Wiring - Connect all components
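The ranking step in the order above (section scoring plus classification boost) could be sketched roughly as follows. All weights, thresholds, and section names other than .rodata, .data, and .bss are placeholder assumptions chosen for illustration, and `Candidate`, `score`, and `top_n` are hypothetical names.

```rust
/// Hypothetical candidate string with the metadata ranking needs.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    text: String,
    section: String,
    tag_count: usize,
}

/// Placeholder weights: read-only data ranks above writable data,
/// and .bss is pushed down.
fn section_weight(name: &str) -> i32 {
    match name {
        ".rodata" | ".rdata" | "__cstring" => 30,
        ".data" => 15,
        ".bss" => -20,
        _ => 0,
    }
}

/// Placeholder penalties for very short or very long strings.
fn length_penalty(len: usize) -> i32 {
    if len < 6 {
        -10
    } else if len > 200 {
        -15
    } else {
        0
    }
}

/// Section weight, plus a fixed boost per semantic tag, plus the length penalty.
fn score(c: &Candidate) -> i32 {
    section_weight(&c.section) + 20 * c.tag_count as i32 + length_penalty(c.text.chars().count())
}

/// Sorts by descending score and keeps the first `n` entries.
fn top_n(mut items: Vec<Candidate>, n: usize) -> Vec<Candidate> {
    items.sort_by_key(|c| std::cmp::Reverse(score(c)));
    items.truncate(n);
    items
}
```

With these placeholder numbers, a tagged URL in .rodata outranks an untagged message in .data, and a four-character string in .bss ends up at the bottom. The sort is stable, so ties keep their extraction order.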
Testing Strategy
For MVP, manual testing is acceptable:
- Test against /bin/ls (ELF, Unix)
- Test against notepad.exe (PE, Windows) if available
- Compare output quality vs. the strings command
- Verify JSON is valid with jq
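For the jq validity check, it may help to see what one JSONL record could look like. The sketch below builds a line by hand with only the standard library. The field names (`score`, `offset`, `section`, `tags`, `text`) and the `Record` type are assumptions, not the project's schema, and a real implementation would more likely use a serialization crate such as serde_json.

```rust
/// Hypothetical output record; the field set mirrors the TTY columns.
struct Record<'a> {
    score: i32,
    offset: u64,
    section: &'a str,
    tags: &'a [&'a str],
    text: &'a str,
}

/// Escapes a string for use inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be written as \u00XX.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders one record as a single JSON object with no embedded newlines.
fn to_jsonl_line(r: &Record) -> String {
    let tags: Vec<String> = r
        .tags
        .iter()
        .map(|t| format!("\"{}\"", escape_json(t)))
        .collect();
    format!(
        "{{\"score\":{},\"offset\":{},\"section\":\"{}\",\"tags\":[{}],\"text\":\"{}\"}}",
        r.score,
        r.offset,
        escape_json(r.section),
        tags.join(","),
        escape_json(r.text)
    )
}
```

Printing one such line per string yields output that jq can consume record by record. Escaping matters here because extracted strings routinely contain quotes, backslashes, and control characters.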
Success Metrics
- Noise Reduction: 50%+ fewer irrelevant strings than strings
- Signal Boost: URLs, paths, GUIDs appear in top 50 results
- Performance: Processes typical binary (<10MB) in under 1 second
- Demo-Ready: Can show side-by-side comparison with strings
Related Issues
Child Tasks (v0.1 Milestone)
- Complete End-to-End Pipeline Integration with Error Recovery and Testing #37
- Implement Main String Extraction Orchestrator and Pipeline Integration #36
- Add Comprehensive Integration Test Suite with Benchmarks and Snapshot Testing #35 - Comprehensive Integration Tests
- Establish Test Infrastructure with Multi-Format Binary Fixtures #34
- Performance: Implement Regex Compilation Caching for Semantic Classifier #33 - Regex Caching
- Implement Memory-Mapped File I/O for Efficient Large Binary Analysis #32
- CLI: Implement output format selection (--top, --json flags) #31 - Output Format CLI Arguments
- Implement CLI Filtering Arguments for String Extraction Control #30 - Filtering CLI Arguments
- Enhance CLI with Advanced Argument Parsing and Error Handling #29
- Implement YARA Rule Output Formatter with String Escaping and Modifiers #28 - YARA-Friendly Output
- Implement Human-Readable Table Formatter for String Analysis Results #27 - Human-Readable Output
- Implement JSONL (JSON Lines) Output Formatter for Structured String Data #26 - JSONL Output Format
- Implement Output Formatter Framework with Trait-Based Architecture #25
- Implement Noise Penalty Detection for String Quality Scoring #24
- Implement Semantic Boost Scoring for String Ranking #23 - Semantic Boost Scoring
Related Epics
- Epic: v0.2 - PE Resources, Symbol Demangling & Import/Export Enhancement #40 - v0.2: PE Resources, Symbol Demangling & Import/Export Enhancement
- Add relocation-hinted string reference detection with Capstone disassembly #41 - v0.3: Relocation-hinted string reference detection with Capstone
- Epic: v0.4 - Advanced Binary Analysis Features (DWARF, Mach-O Load Commands, Go Build Info) #42 - v0.4: Advanced Binary Analysis Features (DWARF, Mach-O, Go)
- Project: StringyMcStringFace v1.0 Production Release #38 - v1.0: Production Release - Full-Featured Binary String Analyzer
Notes
This is a time-boxed weekend implementation. Focus on "working" over "perfect". Code quality and comprehensive testing can be improved post-MVP.