Implement YARA Rule Output Formatter with String Escaping and Modifiers #28

@unclesp1d3r

Overview

This task implements a YARA rule formatter for StringyMcStringFace that emits extracted strings as valid YARA rules. YARA is a widely used pattern-matching tool in malware research and incident response, which makes YARA-compatible output a critical feature for security analysts who want to build detection rules from extracted strings.

Context

The binary analyzer currently extracts strings from ELF, PE, and Mach-O binaries with rich metadata including:

  • Encoding types (ASCII, UTF-8, UTF-16LE, UTF-16BE)
  • Semantic tags (URLs, domains, IPs, file paths, etc.)
  • Offsets, RVAs, and section information
  • Relevance scores

To maximize utility for security practitioners, we need to format this data as valid YARA rules that can be immediately used for threat hunting and detection.

Technical Requirements

1. String Escaping

Implement proper C-style escaping for YARA text strings:

  • Escape double quotes: \"
  • Escape backslashes: \\
  • Escape newlines: \n
  • Escape carriage returns: \r
  • Escape tabs: \t
  • Non-printable bytes: \xNN (hex notation)
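
The escaping rules above can be sketched as a single pass over the UTF-8 bytes of the input. This is a minimal illustration, not the project's implementation: the free function `escape_yara_text` is a hypothetical name, and it assumes that any byte outside printable ASCII (including each byte of a multi-byte UTF-8 character) should be written as `\xNN`.

```rust
/// Hypothetical helper: escapes `input` for use inside a double-quoted
/// YARA text string. Works byte by byte, so non-ASCII characters become
/// one `\xNN` escape per UTF-8 byte (an assumption of this sketch).
pub fn escape_yara_text(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for &b in input.as_bytes() {
        match b {
            b'"' => out.push_str("\\\""),
            b'\\' => out.push_str("\\\\"),
            b'\n' => out.push_str("\\n"),
            b'\r' => out.push_str("\\r"),
            b'\t' => out.push_str("\\t"),
            // Remaining printable ASCII is copied through unchanged.
            0x20..=0x7e => out.push(b as char),
            // Everything else is emitted in two-digit hex notation.
            _ => out.push_str(&format!("\\x{:02x}", b)),
        }
    }
    out
}
```

Because the function always escapes a raw backslash, callers are expected to pass the unescaped string exactly once; escaping the result a second time would double every backslash.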

2. Encoding Mapping

Map FoundString encoding types to YARA string modifiers:

  • Encoding::Ascii / Encoding::Utf8 → ascii (default)
  • Encoding::Utf16Le / Encoding::Utf16Be → wide modifier
  • Consider adding ascii wide for broader matching when appropriate
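
A minimal sketch of this mapping follows. The `Encoding` enum here is a stand-in that reuses the variant names from the list above, and `yara_modifiers` and its `also_match_ascii` flag are hypothetical. One caveat worth keeping in mind: YARA's `wide` modifier interleaves a zero byte after each character, which corresponds to a little-endian layout, so mapping `Utf16Be` to `wide` follows the list above but may not match big-endian data reliably.

```rust
/// Stand-in for the analyzer's encoding type (variant names taken from
/// the issue text; the real definition may differ).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Hypothetical helper: returns the YARA modifiers for one string.
/// `also_match_ascii` opts into the broader `ascii wide` combination.
pub fn yara_modifiers(encoding: Encoding, also_match_ascii: bool) -> Vec<&'static str> {
    match encoding {
        Encoding::Ascii | Encoding::Utf8 => vec!["ascii"],
        Encoding::Utf16Le | Encoding::Utf16Be => {
            if also_match_ascii {
                vec!["ascii", "wide"]
            } else {
                vec!["wide"]
            }
        }
    }
}
```

Returning `&'static str` values keeps the result independent of the lifetime of the string being formatted.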

3. String Truncation

Apply truncation rules for excessively long strings:

  • Maximum string length: 256 bytes (configurable)
  • Truncate with indicator comment (e.g., // truncated from N bytes)
  • Consider hex string format for binary data or very long strings
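
One way to apply the byte limit without splitting a multi-byte UTF-8 character is sketched below. The names `truncate_for_yara` and `truncation_comment` are hypothetical, and the sketch assumes truncation happens on the raw string before escaping.

```rust
/// Hypothetical helper: returns the text cut to at most `max_len` bytes,
/// plus the original byte length when truncation occurred.
pub fn truncate_for_yara(s: &str, max_len: usize) -> (&str, Option<usize>) {
    if s.len() <= max_len {
        return (s, None);
    }
    // Back up to the nearest character boundary so the slice stays valid UTF-8.
    let mut end = max_len;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    (&s[..end], Some(s.len()))
}

/// Hypothetical helper: renders the indicator comment for a truncated string.
pub fn truncation_comment(original_len: usize) -> String {
    format!("// truncated from {} bytes", original_len)
}
```

The returned slice borrows from the input, so no allocation happens for strings under the limit.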

4. Semantic Tag Integration

Leverage semantic tags to enhance rule conditions:

  • Group strings by tag type in rule conditions
  • Add metadata comments indicating tag classifications
  • Generate compound conditions (e.g., any of ($url*))
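
The compound-condition idea can be sketched by deriving a prefix from each string identifier and emitting one `any of ($prefix*)` clause per group. This assumes identifiers are built as a tag-derived prefix plus a counter (for example `url1`, `ip2`); the function name `condition_from_identifiers` is hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: builds a condition from identifiers such as
/// "url1" or "ip2" (given without the leading '$'). Returns `None` when
/// there is nothing to group, leaving the fallback to the caller.
pub fn condition_from_identifiers(identifiers: &[&str]) -> Option<String> {
    // Strip the trailing counter to recover the group prefix; a BTreeSet
    // deduplicates prefixes and keeps the output order deterministic.
    let prefixes: BTreeSet<&str> = identifiers
        .iter()
        .map(|id| id.trim_end_matches(|c: char| c.is_ascii_digit()))
        .filter(|prefix| !prefix.is_empty())
        .collect();
    if prefixes.is_empty() {
        return None;
    }
    let clauses: Vec<String> = prefixes
        .iter()
        .map(|prefix| format!("any of (${}*)", prefix))
        .collect();
    Some(clauses.join(" or "))
}
```

Thresholds such as `2 of ($imp*)` would need per-tag configuration, which this sketch leaves out.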

5. Rule Structure

Generate complete, valid YARA rules:

rule binary_name_strings {
    meta:
        description = "Extracted strings from binary_name"
        format = "ELF/PE/MachO"
        generated_by = "StringyMcStringFace"
    
    strings:
        $s1 = "extracted_string" ascii
        $s2 = "wide_string" wide
        $url1 = "https://example.com" nocase
    
    condition:
        any of them
}
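
Assembling the three sections can be sketched as plain string building. The `YaraString` struct and `render_rule` function are hypothetical, and the sketch assumes that the rule name is already a valid identifier and that all text and meta values have been escaped beforehand.

```rust
/// Hypothetical intermediate form of one entry in the strings section.
pub struct YaraString {
    pub id: String,                   // identifier without the leading '$'
    pub text: String,                 // already escaped for YARA
    pub modifiers: Vec<&'static str>, // e.g. ["ascii"] or ["wide"]
}

/// Hypothetical helper: renders a rule with meta, strings, and condition
/// sections, using four-space indentation as in the template above.
pub fn render_rule(
    name: &str,
    meta: &[(&str, &str)],
    strings: &[YaraString],
    condition: &str,
) -> String {
    let mut out = format!("rule {} {{\n", name);
    out.push_str("    meta:\n");
    for (key, value) in meta {
        // Meta values are assumed to be pre-escaped string literals.
        out.push_str(&format!("        {} = \"{}\"\n", key, value));
    }
    out.push_str("\n    strings:\n");
    for s in strings {
        let mut line = format!("        ${} = \"{}\"", s.id, s.text);
        for modifier in &s.modifiers {
            line.push(' ');
            line.push_str(modifier);
        }
        line.push('\n');
        out.push_str(&line);
    }
    out.push_str("\n    condition:\n");
    out.push_str(&format!("        {}\n}}\n", condition));
    out
}
```

Keeping rendering separate from escaping and modifier selection makes each step testable on its own.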

Proposed Implementation

File: src/output/yara.rs

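pub struct YaraFormatter {
    max_string_length: usize, // truncation threshold in bytes
    include_metadata: bool,
    rule_name: String,
}

impl YaraFormatter {
    /// Renders a complete rule (meta, strings, condition) for one binary.
    pub fn format_rule(&self, strings: &[FoundString], binary_info: &ContainerInfo) -> String { todo!() }
    /// Escapes quotes, backslashes, control characters, and non-printable bytes.
    fn escape_string(&self, s: &str) -> String { todo!() }
    /// Maps the string's encoding to YARA modifiers such as `ascii` or `wide`.
    fn get_string_modifiers(&self, string: &FoundString) -> Vec<&'static str> { todo!() }
    /// Returns the (possibly shortened) text and whether truncation happened.
    fn truncate_if_needed(&self, s: &str) -> (String, bool) { todo!() }
    /// Builds the condition expression from the strings' semantic tags.
    fn generate_condition(&self, strings: &[FoundString]) -> String { todo!() }
}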

Module Registration: src/output/mod.rs

pub mod yara;

pub enum OutputFormat {
    Json,
    Yara,
    // ... other formats
}

pub trait Formatter {
    fn format(&self, strings: &[FoundString], info: &ContainerInfo) -> String;
}
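
How the enum and trait might fit together is sketched below with trait objects. Everything beyond the `OutputFormat` and `Formatter` shapes shown above is an assumption: `FoundString` and `ContainerInfo` are reduced to single-field stand-ins, the formatter bodies are placeholders, and `formatter_for` is a hypothetical dispatch function.

```rust
// Stand-ins for the analyzer's real types (assumed shapes).
pub struct FoundString {
    pub text: String,
}
pub struct ContainerInfo {
    pub name: String,
}

pub enum OutputFormat {
    Json,
    Yara,
}

pub trait Formatter {
    fn format(&self, strings: &[FoundString], info: &ContainerInfo) -> String;
}

pub struct JsonFormatter;
pub struct YaraFormatter;

impl Formatter for JsonFormatter {
    // Placeholder body: a real formatter would escape and serialize properly.
    fn format(&self, strings: &[FoundString], info: &ContainerInfo) -> String {
        format!("{{\"binary\":\"{}\",\"count\":{}}}", info.name, strings.len())
    }
}

impl Formatter for YaraFormatter {
    // Placeholder body: a real formatter would emit a full rule.
    fn format(&self, strings: &[FoundString], info: &ContainerInfo) -> String {
        format!("// {} strings from {}", strings.len(), info.name)
    }
}

/// Hypothetical dispatch: picks a formatter for the requested output format.
pub fn formatter_for(format: &OutputFormat) -> Box<dyn Formatter> {
    match format {
        OutputFormat::Json => Box::new(JsonFormatter),
        OutputFormat::Yara => Box::new(YaraFormatter),
    }
}
```

With this shape, adding a format means adding one enum variant, one `Formatter` implementation, and one match arm.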

Example Output

Given extracted strings from a binary, the formatter should produce:

rule suspicious_binary_strings {
    meta:
        description = "Strings extracted from suspicious.exe"
        format = "PE"
        generated_by = "StringyMcStringFace v0.1"
        extracted_count = 47
    
    strings:
        // URLs and network indicators
        $url1 = "http://malicious.example.com/payload" nocase
        $ip1 = "192.168.1.100" ascii
        
        // File paths
        $path1 = "C:\\Windows\\System32\\evil.dll" nocase
        
        // Wide strings (UTF-16)
        $wide1 = "WideString" wide
        
        // Import/Export names
        $imp1 = "CreateRemoteThread" ascii
        $imp2 = "VirtualAllocEx" ascii
    
    condition:
        any of ($url*) or 
        2 of ($imp*) or
        any of ($path*)
}

Test Scenarios

Unit tests should cover:

  1. String Escaping

    • Quotes, backslashes, newlines in strings
    • Non-ASCII characters (hex escape)
    • Already-escaped content (no double-escaping)
  2. Encoding Handling

    • ASCII strings → ascii modifier
    • UTF-16LE/BE → wide modifier
    • Mixed encoding in single rule
  3. Truncation

    • Strings under limit → no truncation
    • Strings over limit → truncated with comment
    • Extreme cases (empty strings, very long strings)
  4. Rule Generation

    • Valid YARA syntax (parseable by YARA)
    • Proper section formatting (meta, strings, condition)
    • Special characters in rule names
  5. Semantic Tags

    • URLs grouped and commented
    • Network indicators (IPs, domains)
    • Import/Export grouping
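
For the "special characters in rule names" scenario, a sanitizer along these lines could serve as the unit under test. It is a sketch under stated assumptions: `sanitize_rule_name` is a hypothetical name, it relies on YARA identifiers being limited to ASCII letters, digits, and underscores, not starting with a digit, and not exceeding 128 characters, and it does not check for YARA reserved keywords.

```rust
/// Hypothetical helper: turns an arbitrary binary name into a string that
/// is shaped like a YARA rule identifier. Reserved keywords are not handled.
pub fn sanitize_rule_name(raw: &str) -> String {
    // Replace every character outside [A-Za-z0-9_] with an underscore
    // and cap the length at 128 characters.
    let mut name: String = raw
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .take(128)
        .collect();
    // Identifiers cannot be empty or start with a digit, so prefix '_'.
    if name.is_empty() || name.starts_with(|c: char| c.is_ascii_digit()) {
        name.insert(0, '_');
        name.truncate(128);
    }
    name
}
```

Distinct inputs can collapse to the same output (for example `a.b` and `a-b`), so a caller generating several rules in one file may still need to deduplicate names.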

Acceptance Criteria

  • src/output/yara.rs implements YaraFormatter
  • All special characters properly escaped per YARA specification
  • Encoding types correctly mapped to YARA modifiers
  • Long strings truncated at configurable threshold
  • Generated rules pass YARA syntax validation
  • Comprehensive unit tests with >90% coverage
  • Integration test with sample binaries (ELF, PE, Mach-O)
  • Documentation with usage examples

Related Issues

  • Output Formatting Framework (dependency)
  • Requirement 6.3 implementation

Task ID: stringy-analyzer/yara-friendly-output
