Implement Output Formatter Framework with Trait-Based Architecture #25

@unclesp1d3r

Context

StringyMcStringFace is a binary string extraction tool that analyzes executables and extracts meaningful strings with semantic classification and scoring. The tool needs a flexible output formatting system to support multiple output formats optimized for different use cases (interactive analysis, automation, YARA rule generation).

Currently, src/output/mod.rs exists but contains only a placeholder comment. The architecture is already defined in docs/src/output-formats.md and docs/src/architecture.md, and the core data structures (FoundString, Encoding, Tag, etc.) are already implemented in src/types.rs.

Problem Statement

The tool requires a robust output formatting framework that:

  • Supports multiple output formats (Human-readable tables, JSON Lines, YARA rules)
  • Provides a consistent trait-based interface for formatters
  • Allows easy extension for future formats (CSV, XML, Markdown)
  • Handles edge cases (special character escaping, long string truncation, color coding)
  • Integrates with the existing FoundString data structure
  • Supports output customization (filtering, sorting, field selection)

Requirements (from Requirement 6.1)

  1. Trait Design: Create a Formatter trait that defines the interface for all output formatters
  2. Configuration: Implement output configuration options (format selection, color support, truncation limits)
  3. Core Implementations: Provide initial implementations for:
    • Human-readable table format (default)
    • JSON Lines format (machine-readable)
    • YARA rule format (detection rules)
  4. Error Handling: Proper error types for formatting failures
  5. Testing: Unit tests for each formatter with edge cases
  6. Documentation: Inline documentation for the trait and configuration options
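
For requirement 4, a minimal sketch of what the error type could look like, using only the standard library. The names FormatError and FormatResult and both variants are hypothetical; nothing in src/ defines them yet, and a crate such as thiserror could generate the same impls.

```rust
use std::fmt;
use std::io;

/// Hypothetical error type for the output module.
#[derive(Debug)]
pub enum FormatError {
    /// Writing to the underlying writer failed.
    Io(io::Error),
    /// A string could not be represented in the target format.
    Unrepresentable { format: &'static str, reason: String },
}

impl fmt::Display for FormatError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FormatError::Io(err) => write!(f, "output I/O error: {}", err),
            FormatError::Unrepresentable { format, reason } => {
                write!(f, "cannot render string as {}: {}", format, reason)
            }
        }
    }
}

impl std::error::Error for FormatError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            FormatError::Io(err) => Some(err),
            FormatError::Unrepresentable { .. } => None,
        }
    }
}

/// Lets formatter code use `?` on writer calls.
impl From<io::Error> for FormatError {
    fn from(err: io::Error) -> Self {
        FormatError::Io(err)
    }
}

/// Hypothetical alias the formatter methods could return.
pub type FormatResult<T> = std::result::Result<T, FormatError>;
```

Keeping I/O failures separate from representation failures would let the CLI report a broken pipe differently from a string that cannot be rendered in a given format.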

Proposed Solution

1. Module Structure

src/output/
├── mod.rs           # Public API, Formatter trait, OutputConfig
├── human.rs         # HumanFormatter (interactive table view)
├── json.rs          # JsonFormatter (JSON Lines)
└── yara.rs          # YaraFormatter (YARA rules)

2. Trait Design

use std::io::Write;

/// Common interface implemented by every output formatter.
/// `Result<T>` is assumed to be a single-parameter alias over the module's
/// error type (see requirement 4, Error Handling).
pub trait Formatter {
    /// Format a collection of strings for output
    fn format(&self, strings: &[FoundString], config: &OutputConfig) -> Result<String>;

    /// Format a single string (for streaming output)
    fn format_one(&self, string: &FoundString, config: &OutputConfig) -> Result<String>;

    /// Write header/preamble if needed
    fn write_header(&self, writer: &mut dyn Write, config: &OutputConfig) -> Result<()>;

    /// Write footer/postamble if needed
    fn write_footer(&self, writer: &mut dyn Write, config: &OutputConfig) -> Result<()>;
}
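
To show how the four methods could fit together, here is a rough sketch of one implementation. Everything outside the trait shape is an assumption for illustration: FoundString and OutputConfig are reduced stand-ins with hypothetical fields, FmtResult stands in for the crate's result alias, and TsvFormatter is not one of the three planned formatters.

```rust
use std::io::Write;

/// Stand-in for the crate's result alias (the error type is not decided yet).
type FmtResult<T> = std::result::Result<T, Box<dyn std::error::Error>>;

/// Reduced stand-in for `FoundString` in src/types.rs; fields are hypothetical.
pub struct FoundString {
    pub text: String,
    pub score: i32,
}

/// Reduced stand-in for `OutputConfig`.
pub struct OutputConfig {
    pub truncate_length: Option<usize>,
}

pub trait Formatter {
    fn format(&self, strings: &[FoundString], config: &OutputConfig) -> FmtResult<String>;
    fn format_one(&self, string: &FoundString, config: &OutputConfig) -> FmtResult<String>;
    fn write_header(&self, writer: &mut dyn Write, config: &OutputConfig) -> FmtResult<()>;
    fn write_footer(&self, writer: &mut dyn Write, config: &OutputConfig) -> FmtResult<()>;
}

/// Hypothetical formatter: one "score<TAB>text" line per string.
pub struct TsvFormatter;

impl Formatter for TsvFormatter {
    /// Batch mode is built from the streaming pieces: header, rows, footer.
    fn format(&self, strings: &[FoundString], config: &OutputConfig) -> FmtResult<String> {
        let mut buf: Vec<u8> = Vec::new();
        self.write_header(&mut buf, config)?;
        for s in strings {
            writeln!(buf, "{}", self.format_one(s, config)?)?;
        }
        self.write_footer(&mut buf, config)?;
        Ok(String::from_utf8(buf)?)
    }

    /// Truncates by character count so multi-byte text is never split.
    fn format_one(&self, string: &FoundString, config: &OutputConfig) -> FmtResult<String> {
        let text: String = match config.truncate_length {
            Some(max) => string.text.chars().take(max).collect(),
            None => string.text.clone(),
        };
        Ok(format!("{}\t{}", string.score, text))
    }

    fn write_header(&self, writer: &mut dyn Write, _config: &OutputConfig) -> FmtResult<()> {
        writeln!(writer, "score\ttext")?;
        Ok(())
    }

    /// This format has no postamble.
    fn write_footer(&self, _writer: &mut dyn Write, _config: &OutputConfig) -> FmtResult<()> {
        Ok(())
    }
}
```

Writing format in terms of the other three methods suggests it could become a default method on the trait, so each formatter only has to supply format_one and, where needed, a header and footer.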

3. Configuration Structure

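/// Options shared by all formatters.
pub struct OutputConfig {
    /// Which formatter to use
    pub format: OutputFormat,
    /// Enable color output
    pub color: bool,
    /// Maximum displayed string length; `None` disables truncation
    pub truncate_length: Option<usize>,
    /// Drop strings scoring below this value
    pub min_score: Option<i32>,
    /// Restrict output to these sections
    pub sections_filter: Option<Vec<String>>,
    /// Restrict output to strings matching these tags
    pub tags_filter: Option<Vec<Tag>>,
    /// Cap on the number of results emitted
    pub max_results: Option<usize>,
}

/// Supported output formats.
pub enum OutputFormat {
    Human,
    Json,
    Yara,
}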
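
A sketch of how the filter fields might be applied before formatting. The semantics are assumptions, not decided behavior: filters are combined with AND, a string with no section fails a section filter, a tag filter matches if any tag is listed, and Human is the default format (per requirement 3). Tag and FoundString here are stand-ins with hypothetical variants and fields, and apply_filters is a hypothetical helper.

```rust
/// Stand-in for `Tag` in src/types.rs; variants are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum Tag {
    Url,
    Guid,
    FilePath,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OutputFormat {
    Human,
    Json,
    Yara,
}

/// Stand-in for `FoundString`; fields are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct FoundString {
    pub text: String,
    pub score: i32,
    pub section: Option<String>,
    pub tags: Vec<Tag>,
}

#[derive(Debug, Clone)]
pub struct OutputConfig {
    pub format: OutputFormat,
    pub color: bool,
    pub truncate_length: Option<usize>,
    pub min_score: Option<i32>,
    pub sections_filter: Option<Vec<String>>,
    pub tags_filter: Option<Vec<Tag>>,
    pub max_results: Option<usize>,
}

impl Default for OutputConfig {
    /// Human format, no color, no truncation, no filtering.
    fn default() -> Self {
        OutputConfig {
            format: OutputFormat::Human,
            color: false,
            truncate_length: None,
            min_score: None,
            sections_filter: None,
            tags_filter: None,
            max_results: None,
        }
    }
}

/// Keep only the strings that pass every configured filter, then cap the count.
pub fn apply_filters<'a>(
    strings: &'a [FoundString],
    config: &OutputConfig,
) -> Vec<&'a FoundString> {
    let kept = strings.iter().filter(|s| {
        let score_ok = config.min_score.map_or(true, |min| s.score >= min);
        let section_ok = config.sections_filter.as_ref().map_or(true, |wanted| {
            s.section.as_ref().map_or(false, |sec| wanted.contains(sec))
        });
        let tags_ok = config
            .tags_filter
            .as_ref()
            .map_or(true, |wanted| s.tags.iter().any(|t| wanted.contains(t)));
        score_ok && section_ok && tags_ok
    });
    match config.max_results {
        Some(limit) => kept.take(limit).collect(),
        None => kept.collect(),
    }
}
```

Doing the filtering once, ahead of the formatters, would keep each Formatter implementation free of filter logic; whether that belongs in src/output or earlier in the pipeline is an open design question.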

4. Implementation Details

Human Formatter

  • Use comfy-table or similar crate for table rendering
  • Implement color coding based on score ranges (green: 80+, yellow: 60-79, red: <60)
  • Handle truncation with "..." indicator for long strings
  • Sort by score (descending) by default
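
The truncation and color rules above could be sketched with the standard library alone, before committing to a table or color crate. The function names are hypothetical, the ANSI codes are written by hand where colored or termcolor would normally be used, and truncation counts characters rather than terminal display width.

```rust
/// Truncate to at most `max_chars` characters, ending in "..." when text was cut.
pub fn truncate_with_ellipsis(text: &str, max_chars: usize) -> String {
    const ELLIPSIS: &str = "...";
    if text.chars().count() <= max_chars {
        return text.to_string();
    }
    if max_chars <= ELLIPSIS.len() {
        // Too narrow for any text plus the indicator; emit dots only.
        return ELLIPSIS.chars().take(max_chars).collect();
    }
    let kept: String = text.chars().take(max_chars - ELLIPSIS.len()).collect();
    format!("{}{}", kept, ELLIPSIS)
}

/// ANSI color for a score: green for 80+, yellow for 60-79, red below 60.
pub fn score_color(score: i32) -> &'static str {
    match score {
        s if s >= 80 => "\x1b[32m", // green
        s if s >= 60 => "\x1b[33m", // yellow
        _ => "\x1b[31m",            // red
    }
}

/// Render a score, wrapped in color codes only when color output is enabled.
pub fn paint_score(score: i32, color: bool) -> String {
    if color {
        format!("{}{}\x1b[0m", score_color(score), score)
    } else {
        score.to_string()
    }
}
```

Counting characters instead of bytes avoids slicing in the middle of a multi-byte UTF-8 sequence, which would panic with plain string indexing.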

JSON Formatter

  • One JSON object per line (JSON Lines format)
  • Leverage existing serde derives on FoundString
  • No pretty-printing for pipeline compatibility
  • Each object contains all fields from FoundString
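
With the serde derives in place, each line would most likely come from serde_json::to_string on a FoundString. The sketch below deliberately does not do that: it hand-rolls the serialization with the standard library only, to make the one-object-per-line shape and the escaping requirements visible. FoundString is a reduced stand-in with hypothetical fields, and the helper names are made up.

```rust
/// Reduced stand-in for `FoundString`; fields are hypothetical.
pub struct FoundString {
    pub text: String,
    pub score: i32,
    pub section: Option<String>,
}

/// Escape a string for use inside a JSON string literal.
fn json_escape(text: &str) -> String {
    let mut out = String::with_capacity(text.len() + 2);
    for c in text.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize one string as a compact, single-line JSON object.
pub fn to_json_line(s: &FoundString) -> String {
    let section = match &s.section {
        Some(name) => format!("\"{}\"", json_escape(name)),
        None => "null".to_string(),
    };
    format!(
        "{{\"text\":\"{}\",\"score\":{},\"section\":{}}}",
        json_escape(&s.text),
        s.score,
        section
    )
}

/// JSON Lines: one object per line, each terminated by a newline.
pub fn to_json_lines(strings: &[FoundString]) -> String {
    strings.iter().map(|s| to_json_line(s) + "\n").collect()
}
```

Because newlines inside string values are escaped, every record stays on one physical line, which is what lets line-oriented tools consume the output.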

YARA Formatter

  • Proper escaping for YARA string syntax
  • Group strings by semantic tag (URLs, GUIDs, paths, etc.)
  • Include metadata (source file, generation timestamp)
  • Add appropriate ascii/wide modifiers based on encoding
  • Filter to high-confidence strings (score >= 80) by default
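
A sketch of the escaping and modifier selection, assuming a conservative rule: quotes, backslashes, tabs and newlines use YARA's named escapes, printable ASCII passes through, and every other byte becomes \xNN. Encoding is a stand-in with hypothetical variants, and the function names and the $sN identifier scheme are made up for illustration.

```rust
/// Stand-in for `Encoding` in src/types.rs; variants are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Encoding {
    Ascii,
    Utf16Le,
}

/// Escape text for use inside a double-quoted YARA text string.
pub fn yara_escape(text: &str) -> String {
    let mut out = String::new();
    for &b in text.as_bytes() {
        match b {
            b'"' => out.push_str("\\\""),
            b'\\' => out.push_str("\\\\"),
            b'\t' => out.push_str("\\t"),
            b'\n' => out.push_str("\\n"),
            // Printable ASCII is emitted unchanged.
            0x20..=0x7e => out.push(b as char),
            // Everything else, including UTF-8 continuation bytes, as hex.
            _ => out.push_str(&format!("\\x{:02x}", b)),
        }
    }
    out
}

/// Pick the YARA modifier matching how the string was stored in the binary.
pub fn yara_modifier(encoding: Encoding) -> &'static str {
    match encoding {
        Encoding::Ascii => "ascii",
        Encoding::Utf16Le => "wide",
    }
}

/// Render one entry of a rule's `strings:` section.
pub fn yara_string_line(index: usize, text: &str, encoding: Encoding) -> String {
    format!(
        "        $s{} = \"{}\" {}",
        index,
        yara_escape(text),
        yara_modifier(encoding)
    )
}
```

One caveat for the UTF-16 test case: YARA's wide modifier interleaves zero bytes with the characters of the string, so it only matches UTF-16 text whose characters are ASCII. Strings with other characters would likely need a hex string or to be skipped.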

5. Testing Strategy

  • Unit tests for each formatter with sample FoundString instances
  • Edge cases:
    • Empty string collections
    • Strings with special characters (quotes, backslashes, newlines)
    • Very long strings (truncation)
    • UTF-16 encoded strings in YARA output
    • Null/missing fields (section, rva)
  • Integration tests with real binary analysis output

6. Integration Points

  • Called from main.rs after classification and ranking phases
  • Receives sorted Vec<FoundString> from the analysis pipeline
  • Configuration driven by CLI arguments (parsed by clap)
  • Output to stdout by default, file redirection supported
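
The hand-off from the CLI to a formatter could look roughly like the sketch below. It is heavily reduced: the trait has a single method instead of the four proposed, strings are passed as (text, score) pairs instead of FoundString, the three formatter bodies are placeholders, and formatter_for and write_all are hypothetical names.

```rust
use std::io::{self, Write};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OutputFormat {
    Human,
    Json,
    Yara,
}

/// Reduced to one method for brevity; the proposed trait has four.
pub trait Formatter {
    fn format_one(&self, text: &str, score: i32) -> String;
}

struct HumanFormatter;
struct JsonFormatter;
struct YaraFormatter;

// Placeholder bodies: each one only tags the line with its format name.
impl Formatter for HumanFormatter {
    fn format_one(&self, text: &str, score: i32) -> String {
        format!("[human] {} {}", score, text)
    }
}
impl Formatter for JsonFormatter {
    fn format_one(&self, text: &str, score: i32) -> String {
        format!("[json] {} {}", score, text)
    }
}
impl Formatter for YaraFormatter {
    fn format_one(&self, text: &str, score: i32) -> String {
        format!("[yara] {} {}", score, text)
    }
}

/// Map the CLI's format selection to a trait object.
pub fn formatter_for(format: OutputFormat) -> Box<dyn Formatter> {
    match format {
        OutputFormat::Human => Box::new(HumanFormatter),
        OutputFormat::Json => Box::new(JsonFormatter),
        OutputFormat::Yara => Box::new(YaraFormatter),
    }
}

/// Stream every string through the selected formatter into `writer`.
pub fn write_all(
    writer: &mut dyn Write,
    format: OutputFormat,
    strings: &[(String, i32)],
) -> io::Result<()> {
    let formatter = formatter_for(format);
    for (text, score) in strings {
        writeln!(writer, "{}", formatter.format_one(text, *score))?;
    }
    writer.flush()
}
```

From main.rs this would be called with a locked stdout handle, for example write_all(&mut io::stdout().lock(), format, &strings). Taking a dyn Write instead of printing directly is what makes the formatters testable against an in-memory buffer.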

Acceptance Criteria

  • Formatter trait defined with documentation
  • OutputConfig structure implemented with all options
  • HumanFormatter implementation with color coding and tables
  • JsonFormatter implementation producing valid JSON Lines
  • YaraFormatter implementation with proper escaping and modifiers
  • Unit tests achieving >80% code coverage for output module
  • Integration test demonstrating all three formats
  • Inline documentation for public APIs
  • README updated with output format examples (reference existing docs/src/output-formats.md)

References

  • Architecture Documentation: docs/src/architecture.md
  • Output Format Specifications: docs/src/output-formats.md
  • Core Data Structures: src/types.rs
  • Related Issue: Task-ID stringy-analyzer/output-formatting-framework

Dependencies

  • serde / serde_json (already in Cargo.toml)
  • Consider adding: comfy-table or prettytable-rs for human-readable output
  • Consider adding: colored or termcolor for color support

Estimated Effort

Medium complexity: requires trait design, three formatter implementations, and comprehensive testing. Estimated 1-2 days of focused development.
