
Implement JSONL (JSON Lines) Output Formatter for Structured String Data #26

@unclesp1d3r

Description


Context

StringyMcStringFace is a binary string extraction and analysis tool that extracts meaningful strings from executable files (PE, ELF, Mach-O) with semantic classification, tagging, and scoring. The tool requires multiple output formats to serve different use cases: human-readable for interactive analysis, JSONL for automation and data pipelines, and YARA for security rule generation.

This issue focuses on implementing the JSONL (JSON Lines) output formatter, which provides machine-readable structured output where each line is a complete JSON object representing a FoundString. This format is ideal for:

  • Pipeline integration and streaming processing
  • Database ingestion and batch analysis
  • Automated security tooling and SIEM integration
  • Post-processing with jq, Python, or other tools

Problem Statement

The tool currently has a placeholder src/output/mod.rs but no concrete formatter implementations. This issue implements the JSONL formatter as part of the output formatting system defined in issue #25.

Data Structure

The JSONL formatter serializes FoundString instances with all fields:

pub struct FoundString {
    pub text: String,              // The extracted string
    pub encoding: Encoding,        // Ascii, Utf8, Utf16Le, Utf16Be
    pub offset: u64,               // File offset
    pub rva: Option<u64>,          // Relative Virtual Address (if available)
    pub section: Option<String>,   // Section name (.text, .rdata, etc.)
    pub length: u32,               // Length in bytes
    pub tags: Vec<Tag>,            // Semantic tags (Url, FilePath, Guid, etc.)
    pub score: i32,                // Relevance score for ranking
    pub source: StringSource,      // SectionData, ImportName, ExportName, etc.
}

Requirements

  • Requirement 6.1: Implement output formatting framework with trait-based architecture
  • Requirement 6.4: Support machine-readable JSON Lines format for automation

Proposed Solution

1. File Creation

Create src/output/json.rs implementing the JSONL formatter.

2. Implementation Approach

use crate::types::FoundString;

pub struct JsonFormatter;

impl JsonFormatter {
    pub fn new() -> Self {
        Self
    }
    
    /// Format strings as JSON Lines (one JSON object per line)
    pub fn format(&self, strings: &[FoundString]) -> crate::Result<String> {
        let mut output = String::new();
        for found_string in strings {
            let json = serde_json::to_string(found_string)?;
            output.push_str(&json);
            output.push('\n');
        }
        Ok(output)
    }
    
    /// Format a single string for streaming output
    pub fn format_one(&self, found_string: &FoundString) -> crate::Result<String> {
        let json = serde_json::to_string(found_string)?;
        Ok(format!("{}\n", json))
    }
}

3. Key Design Decisions

JSON Lines Format

  • One object per line: Each line is a complete, valid JSON object
  • No pretty-printing: Compact format for efficient parsing and storage
  • UTF-8 encoding: Standard for JSON
  • No array wrapper: Unlike standard JSON arrays, JSONL has no [] wrapper

Field Serialization

  • All fields included: Complete FoundString data in every record
  • Null handling: Option<T> fields serialize as null when absent
  • Enum serialization: Leverage existing serde derives on Encoding, Tag, StringSource
  • String escaping: serde_json handles special characters automatically
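The last two bullets assume serde derives already exist on these types. As a sketch only, this is the shape the JSONL formatter relies on (the real definitions live in `src/types` and may carry additional attributes):

```rust
use serde::Serialize;

// Sketch: derives assumed present (or to be added) on the real types.
// Without a #[serde(rename_all = "...")] attribute, enum variants
// serialize under their Rust names ("Ascii", "Utf16Le", ...), which is
// what the example output below shows.
#[derive(Serialize)]
pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}
```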

Example Output

{"text":"kernel32.dll","encoding":"Ascii","offset":4096,"rva":8192,"section":".idata","length":12,"tags":["ImportName"],"score":95,"source":"ImportName"}
{"text":"https://api.example.com/v1","encoding":"Utf8","offset":16384,"rva":20480,"section":".rdata","length":26,"tags":["Url","Domain"],"score":88,"source":"SectionData"}
{"text":"C:\\Windows\\System32\\config","encoding":"Utf16Le","offset":32768,"rva":null,"section":null,"length":52,"tags":["FilePath"],"score":72,"source":"SectionData"}

4. Integration with Framework

Once issue #25 (Output Formatter Framework) is complete, this implementation should:

  1. Implement the Formatter trait defined in src/output/mod.rs
  2. Integrate with OutputConfig for filtering options
  3. Support streaming output for large result sets
  4. Be selectable via CLI --json or --format json flags
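Since issue #25 has not landed, the trait's exact shape is unknown; one plausible sketch (the trait name and signature are assumptions that must be reconciled with #25 once it merges):

```rust
use crate::types::FoundString;

// Hypothetical trait shape -- the real definition comes from issue #25
// in src/output/mod.rs and may differ.
pub trait Formatter {
    fn format(&self, strings: &[FoundString]) -> crate::Result<String>;
}

impl Formatter for JsonFormatter {
    fn format(&self, strings: &[FoundString]) -> crate::Result<String> {
        // Delegate to the inherent method from the implementation sketch;
        // inherent methods take precedence, so this does not recurse.
        JsonFormatter::format(self, strings)
    }
}
```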

5. Error Handling

use thiserror::Error;

#[derive(Error, Debug)]
pub enum JsonFormatterError {
    #[error("Failed to serialize string: {0}")]
    SerializationError(#[from] serde_json::Error),
    
    #[error("I/O error: {0}")]
    IoError(#[from] std::io::Error),
}

6. Testing Requirements

Unit Tests

#[cfg(test)]
mod tests {
    use super::*;
    use crate::types::{Encoding, Tag, StringSource};

    #[test]
    fn test_basic_jsonl_output() {
        let strings = vec![
            FoundString {
                text: "test".to_string(),
                encoding: Encoding::Ascii,
                offset: 0,
                rva: Some(4096),
                section: Some(".text".to_string()),
                length: 4,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
        ];
        
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        
        // Should have one line with newline
        assert_eq!(output.lines().count(), 1);
        
        // Should be valid JSON
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert_eq!(parsed["text"], "test");
        assert_eq!(parsed["offset"], 0);
    }

    #[test]
    fn test_special_characters_escaping() {
        // Test strings with quotes, backslashes, newlines
        let strings = vec![
            FoundString {
                text: "path\\to\\file\"quoted\"".to_string(),
                encoding: Encoding::Ascii,
                offset: 100,
                rva: None,
                section: None,
                length: 20,
                tags: vec![Tag::FilePath],
                score: 60,
                source: StringSource::SectionData,
            },
        ];
        
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        
        // Should be valid JSON despite special characters
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert!(parsed["text"].as_str().unwrap().contains("file"));
    }

    #[test]
    fn test_null_optional_fields() {
        let strings = vec![
            FoundString {
                text: "test".to_string(),
                encoding: Encoding::Utf8,
                offset: 0,
                rva: None,  // Optional field
                section: None,  // Optional field
                length: 4,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
        ];
        
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        
        assert!(parsed["rva"].is_null());
        assert!(parsed["section"].is_null());
    }

    #[test]
    fn test_multiple_strings() {
        let strings = vec![
            FoundString {
                text: "first".to_string(),
                encoding: Encoding::Ascii,
                offset: 0,
                rva: Some(100),
                section: Some(".text".to_string()),
                length: 5,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
            FoundString {
                text: "second".to_string(),
                encoding: Encoding::Utf8,
                offset: 100,
                rva: Some(200),
                section: Some(".data".to_string()),
                length: 6,
                tags: vec![Tag::Url],
                score: 75,
                source: StringSource::ImportName,
            },
        ];
        
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        
        // Should have two lines
        assert_eq!(output.lines().count(), 2);
        
        // Each line should be valid JSON
        for line in output.lines() {
            serde_json::from_str::<serde_json::Value>(line).unwrap();
        }
    }

    #[test]
    fn test_empty_collection() {
        let strings: Vec<FoundString> = vec![];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        
        assert_eq!(output, "");
    }

    #[test]
    fn test_utf16_encoding() {
        let strings = vec![
            FoundString {
                text: "wide string".to_string(),
                encoding: Encoding::Utf16Le,
                offset: 1000,
                rva: Some(2000),
                section: Some(".rdata".to_string()),
                length: 22,  // "wide string" is 11 chars × 2 bytes in UTF-16LE
                tags: vec![],
                score: 65,
                source: StringSource::ResourceString,
            },
        ];
        
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        
        assert_eq!(parsed["encoding"], "Utf16Le");
        assert_eq!(parsed["length"], 22);
    }
}

Integration Tests

  • Test with real FoundString collections from binary analysis
  • Verify output can be parsed by jq and other JSON tools
  • Test large collections (10k+ strings) for performance
  • Verify streaming output for memory efficiency

7. Documentation Requirements

  • Inline documentation for public functions
  • Examples in doc comments showing usage
  • Reference to JSON Lines specification: https://jsonlines.org/
  • CLI usage examples in docs/src/output-formats.md

Acceptance Criteria

  • src/output/json.rs created with JsonFormatter struct
  • format() method serializes Vec<FoundString> to JSONL format
  • format_one() method for streaming single strings
  • All FoundString fields serialized correctly
  • Special characters properly escaped by serde_json
  • Optional fields (rva, section) serialize as null when absent
  • Unit tests for:
    • Basic output formatting
    • Special character escaping
    • Null optional fields
    • Multiple strings
    • Empty collections
    • All encoding types (Ascii, Utf8, Utf16Le, Utf16Be)
    • All source types
  • Test coverage ≥85% for json.rs module
  • Documentation with examples
  • Integration with Formatter trait (once issue #25, Output Formatter Framework with Trait-Based Architecture, is complete)
  • Verify output parseable by jq and standard JSON parsers

Edge Cases to Handle

  1. Very long strings: No truncation in JSONL (unlike human format)
  2. Binary/invalid UTF-8: Already handled by String type in FoundString
  3. Empty string text: Should serialize as {"text":"", ...}
  4. Zero-length collections: Should produce empty output (no lines)
  5. Unicode characters: serde_json handles UTF-8 automatically
  6. Control characters: serde_json escapes appropriately (\n, \r, \t)

Estimated Effort

Low-Medium complexity - Straightforward serialization using existing serde infrastructure. Primary work is comprehensive testing and edge case handling. Estimated 4-6 hours of development time.

Example CLI Usage (Post-Integration)

# Basic JSONL output
stringy --json malware.exe

# Save to file
stringy --json binary.elf > strings.jsonl

# Pipeline with jq
stringy --json app.exe | jq 'select(.score > 80)'

# Filter URLs
stringy --json binary | jq 'select(.tags | contains(["Url"]))'

# Count by section
stringy --json binary | jq -r '.section' | sort | uniq -c
