Skip to content

Epic: MVP Weekend Implementation - Complete String Extraction Pipeline #39

@unclesp1d3r

Description

@unclesp1d3r

Overview

This epic tracks the implementation of the core MVP functionality for StringyMcStringFace - the complete pipeline from binary parsing to user output. This represents the minimal viable product for a weekend demonstration.

Context

StringyMcStringFace is a smarter alternative to the Unix strings command that uses binary analysis to extract meaningful strings from executables. The project foundation is complete with:

  • ✅ Core infrastructure and data types
  • ✅ Format detection (ELF, PE, Mach-O) via goblin
  • ✅ Container parsers with section classification

This epic covers the remaining components needed for an end-to-end working demo.

Pipeline Architecture

Binary Input
    ↓
[goblin] Parse format & extract sections
    ↓
[Section List] Classify sections by string likelihood
    ↓
[Extraction] ASCII/UTF-8 + UTF-16LE/BE extraction
    ↓
[Classification] Tag strings (URL, path, GUID, etc.)
    ↓
[Ranking] Score by relevance & section importance
    ↓
[Output] JSONL format + Human-readable TTY view

Scope

In Scope for MVP

  1. String Extraction Engine (src/extraction/mod.rs)

    • ASCII/UTF-8 extraction from byte streams
    • UTF-16LE/BE extraction (critical for PE binaries)
    • Minimum length filtering (default: 4 chars)
    • Confidence scoring for encoding detection
  2. Semantic Classification (src/classification/mod.rs)

    • Pattern matching for high-value strings:
      • URLs (http://, https://)
      • File paths (Unix: /, Windows: C:\)
      • GUIDs ({xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx})
      • IP addresses
      • Format strings (%s, %d, etc.)
    • Tagging system for multiple classifications per string
  3. Ranking System

    • Section-based scoring (e.g., .rodata > .data)
    • Classification boost (URLs, GUIDs rank higher)
    • Length penalties for very short/long strings
    • Top-N selection for output
  4. Output Formats (src/output/mod.rs)

    • JSONL: One JSON object per line for pipeline integration
    • TTY/Human-readable: Formatted table with columns:
      • Score
      • Offset (hex)
      • Section name
      • Tags
      • String content (truncated if needed)
  5. CLI Integration (src/main.rs)

    • Accept binary path as positional argument
    • Basic flags: --json, --format, --min-len
    • Wire up the complete pipeline

Out of Scope (Post-MVP)

  • Advanced filters (--only url,filepath)
  • YARA output format
  • PE-specific resource extraction
  • Rust symbol demangling in output
  • Configuration file support
  • Progress indicators for large binaries

Acceptance Criteria

  • Can extract ASCII/UTF-8 strings from all three binary formats (ELF, PE, Mach-O)
  • UTF-16 extraction works correctly on Windows PE files
  • At least 5 semantic tags implemented (URL, path, GUID, IP, format string)
  • Ranking algorithm deprioritizes low-value sections like .bss
  • JSONL output is valid and parseable by jq
  • TTY output displays top 50 strings by default with proper formatting
  • CLI can process a real-world binary (e.g., /bin/ls) without errors
  • Output is significantly less noisy than standard strings command

Implementation Order

  1. String Extraction - Foundation for everything else
  2. Basic Classification - URL + filepath patterns first
  3. Ranking System - Section scoring + classification boost
  4. JSONL Output - Easiest output format
  5. TTY Output - Human-friendly display
  6. CLI Wiring - Connect all components

Testing Strategy

For MVP, manual testing is acceptable:

  • Test against /bin/ls (ELF, Unix)
  • Test against notepad.exe (PE, Windows) if available
  • Compare output quality vs. strings command
  • Verify JSON is valid with jq

Success Metrics

  • Noise Reduction: 50%+ fewer irrelevant strings than strings
  • Signal Boost: URLs, paths, GUIDs appear in top 50 results
  • Performance: Processes typical binary (<10MB) in under 1 second
  • Demo-Ready: Can show side-by-side comparison with strings

Related Issues

Child Tasks (v0.1 Milestone)

Related Epics

Notes

This is a time-boxed weekend implementation. Focus on "working" over "perfect". Code quality and comprehensive testing can be improved post-MVP.

Sub-issues

Metadata

Metadata

Assignees

Labels

MVPMinimum viable product featuresenhancementNew feature or requestepicLarge feature or initiative spanning multiple tasksstory-points: 2121 story points

Type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions