Description
Context
StringyMcStringFace is a binary string extraction and analysis tool that extracts meaningful strings from executable files (PE, ELF, Mach-O) with semantic classification, tagging, and scoring. The tool requires multiple output formats to serve different use cases: human-readable for interactive analysis, JSONL for automation and data pipelines, and YARA for security rule generation.
This issue focuses on implementing the JSONL (JSON Lines) output formatter, which provides machine-readable structured output where each line is a complete JSON object representing a FoundString. This format is ideal for:
- Pipeline integration and streaming processing
- Database ingestion and batch analysis
- Automated security tooling and SIEM integration
- Post-processing with jq, Python, or other tools
Problem Statement
The tool currently has a placeholder src/output/mod.rs but no concrete formatter implementations. This issue implements the JSONL formatter as part of the output formatting system defined in issue #25.
Data Structure
The JSONL formatter serializes FoundString instances with all fields:
pub struct FoundString {
    pub text: String,            // The extracted string
    pub encoding: Encoding,      // Ascii, Utf8, Utf16Le, Utf16Be
    pub offset: u64,             // File offset
    pub rva: Option<u64>,        // Relative Virtual Address (if available)
    pub section: Option<String>, // Section name (.text, .rdata, etc.)
    pub length: u32,             // Length in bytes
    pub tags: Vec<Tag>,          // Semantic tags (Url, FilePath, Guid, etc.)
    pub score: i32,              // Relevance score for ranking
    pub source: StringSource,    // SectionData, ImportName, ExportName, etc.
}
Requirements
- Requirement 6.1: Implement output formatting framework with trait-based architecture
- Requirement 6.4: Support machine-readable JSON Lines format for automation
Proposed Solution
1. File Creation
Create src/output/json.rs implementing the JSONL formatter.
2. Implementation Approach
use crate::types::FoundString;

pub struct JsonFormatter;

impl JsonFormatter {
    pub fn new() -> Self {
        Self
    }

    /// Format strings as JSON Lines (one JSON object per line)
    pub fn format(&self, strings: &[FoundString]) -> crate::Result<String> {
        let mut output = String::new();
        for found_string in strings {
            let json = serde_json::to_string(found_string)?;
            output.push_str(&json);
            output.push('\n');
        }
        Ok(output)
    }

    /// Format a single string for streaming output
    pub fn format_one(&self, found_string: &FoundString) -> crate::Result<String> {
        let json = serde_json::to_string(found_string)?;
        Ok(format!("{}\n", json))
    }
}
3. Key Design Decisions
JSON Lines Format
- One object per line: Each line is a complete, valid JSON object
- No pretty-printing: Compact format for efficient parsing and storage
- UTF-8 encoding: Standard for JSON
- No array wrapper: Unlike standard JSON arrays, JSONL has no [] wrapper
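To illustrate why dropping the array wrapper matters, here is a minimal std-only sketch that appends records to a JSONL buffer and splits the buffer back into records. It treats each record as an already-serialized JSON string. The helper names append_record and records are hypothetical and not part of the proposed API.

```rust
/// Appends one already-serialized JSON object to a JSONL buffer.
/// Assumes `json` contains no raw newline, which holds for compact JSON
/// because newlines inside string values are escaped as `\n`.
fn append_record(buffer: &mut String, json: &str) {
    debug_assert!(!json.contains('\n'), "compact JSON must stay on one line");
    buffer.push_str(json);
    buffer.push('\n');
}

/// Splits a JSONL buffer back into its records, one per non-empty line.
fn records(buffer: &str) -> Vec<&str> {
    buffer.lines().filter(|line| !line.is_empty()).collect()
}
```

Because no closing bracket has to be rewritten, a new record can be appended to existing output at any time, and a consumer can process each line as soon as it arrives.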
Field Serialization
- All fields included: Complete FoundString data in every record
- Null handling: Option<T> fields serialize as null when absent
- Enum serialization: Leverage existing serde derives on Encoding, Tag, StringSource
- String escaping: serde_json handles special characters automatically
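For readers unfamiliar with what "string escaping" covers, the following std-only sketch applies the escaping rules from the JSON specification (RFC 8259) by hand. It is an illustration only: the formatter relies on serde_json for this, and escape_json_string is a hypothetical name. The sketch emits the \u00XX form for every control character other than \n, \r and \t, which is valid JSON but may differ in form from what serde_json emits for some characters.

```rust
/// Escapes `input` for use inside a JSON string literal (without the
/// surrounding quotes), following the rules in RFC 8259.
/// Illustration only; serde_json performs this step in the real formatter.
fn escape_json_string(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters (below U+0020) use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            // Everything else, including non-ASCII text, passes through unchanged.
            c => out.push(c),
        }
    }
    out
}
```

Escaping raw newlines is what keeps every record on a single line, which the JSONL format depends on.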
Example Output
{"text":"kernel32.dll","encoding":"Ascii","offset":4096,"rva":8192,"section":".idata","length":12,"tags":["ImportName"],"score":95,"source":"ImportName"}
{"text":"https://api.example.com/v1","encoding":"Utf8","offset":16384,"rva":20480,"section":".rdata","length":26,"tags":["Url","Domain"],"score":88,"source":"SectionData"}
{"text":"C:\\Windows\\System32\\config","encoding":"Utf16Le","offset":32768,"rva":null,"section":null,"length":52,"tags":["FilePath"],"score":72,"source":"SectionData"}
4. Integration with Framework
Once issue #25 (Output Formatter Framework) is complete, this implementation should:
- Implement the Formatter trait defined in src/output/mod.rs
- Integrate with OutputConfig for filtering options
- Support streaming output for large result sets
- Be selectable via CLI --json or --format json flags
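The streaming requirement above could be met by writing each record directly to a std::io::Write sink instead of building one large String. The sketch below is std-only and makes two assumptions: the function name write_jsonl is hypothetical, and the to_json closure is a stand-in for a real serializer such as serde_json::to_string. The actual Formatter trait from issue #25 may look different.

```rust
use std::io::{self, Write};

/// Writes one line per record to `writer` without holding the full output
/// in memory. `to_json` is a stand-in for a real serializer.
/// Returns the number of records written.
fn write_jsonl<W, T, F>(mut writer: W, items: &[T], to_json: F) -> io::Result<usize>
where
    W: Write,
    F: Fn(&T) -> String,
{
    let mut written = 0;
    for item in items {
        writer.write_all(to_json(item).as_bytes())?;
        writer.write_all(b"\n")?;
        written += 1;
    }
    // Flush once at the end so buffered sinks emit the final records.
    writer.flush()?;
    Ok(written)
}
```

When the sink is standard output, wrapping a locked handle in std::io::BufWriter avoids a system call per record, which matters for result sets in the 10k+ range.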
5. Error Handling
use thiserror::Error;

#[derive(Error, Debug)]
pub enum JsonFormatterError {
    #[error("Failed to serialize string: {0}")]
    SerializationError(#[from] serde_json::Error),
    #[error("I/O error: {0}")]
    IoError(#[from] std::io::Error),
}
6. Testing Requirements
Unit Tests
#[cfg(test)]
mod tests {
    use super::*;
    use crate::types::{Encoding, StringSource, Tag};

    #[test]
    fn test_basic_jsonl_output() {
        let strings = vec![
            FoundString {
                text: "test".to_string(),
                encoding: Encoding::Ascii,
                offset: 0,
                rva: Some(4096),
                section: Some(".text".to_string()),
                length: 4,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        // Should have one line with newline
        assert_eq!(output.lines().count(), 1);
        // Should be valid JSON
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert_eq!(parsed["text"], "test");
        assert_eq!(parsed["offset"], 0);
    }

    #[test]
    fn test_special_characters_escaping() {
        // Test a string containing backslashes and double quotes
        let strings = vec![
            FoundString {
                text: "path\\to\\file\"quoted\"".to_string(),
                encoding: Encoding::Ascii,
                offset: 100,
                rva: None,
                section: None,
                length: 20,
                tags: vec![Tag::FilePath],
                score: 60,
                source: StringSource::SectionData,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        // Should be valid JSON despite special characters
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert!(parsed["text"].as_str().unwrap().contains("file"));
    }

    #[test]
    fn test_null_optional_fields() {
        let strings = vec![
            FoundString {
                text: "test".to_string(),
                encoding: Encoding::Utf8,
                offset: 0,
                rva: None,     // Optional field
                section: None, // Optional field
                length: 4,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert!(parsed["rva"].is_null());
        assert!(parsed["section"].is_null());
    }

    #[test]
    fn test_multiple_strings() {
        let strings = vec![
            FoundString {
                text: "first".to_string(),
                encoding: Encoding::Ascii,
                offset: 0,
                rva: Some(100),
                section: Some(".text".to_string()),
                length: 5,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
            FoundString {
                text: "second".to_string(),
                encoding: Encoding::Utf8,
                offset: 100,
                rva: Some(200),
                section: Some(".data".to_string()),
                length: 6,
                tags: vec![Tag::Url],
                score: 75,
                source: StringSource::ImportName,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        // Should have two lines
        assert_eq!(output.lines().count(), 2);
        // Each line should be valid JSON
        for line in output.lines() {
            serde_json::from_str::<serde_json::Value>(line).unwrap();
        }
    }

    #[test]
    fn test_empty_collection() {
        let strings: Vec<FoundString> = vec![];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        assert_eq!(output, "");
    }

    #[test]
    fn test_utf16_encoding() {
        let strings = vec![
            FoundString {
                text: "wide string".to_string(),
                encoding: Encoding::Utf16Le,
                offset: 1000,
                rva: Some(2000),
                section: Some(".rdata".to_string()),
                length: 22, // 11 chars x 2 bytes per char
                tags: vec![],
                score: 65,
                source: StringSource::ResourceString,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert_eq!(parsed["encoding"], "Utf16Le");
        assert_eq!(parsed["length"], 22);
    }
}
Integration Tests
- Test with real FoundString collections from binary analysis
- Verify output can be parsed by jq and other JSON tools
- Test large collections (10k+ strings) for performance
- Verify streaming output for memory efficiency
7. Documentation Requirements
- Inline documentation for public functions
- Examples in doc comments showing usage
- Reference to JSON Lines specification: https://jsonlines.org/
- CLI usage examples in docs/src/output-formats.md
Acceptance Criteria
- src/output/json.rs created with JsonFormatter struct
- format() method serializes Vec<FoundString> to JSONL format
- format_one() method for streaming single strings
- All FoundString fields serialized correctly
- Special characters properly escaped by serde_json
- Optional fields (rva, section) serialize as null when absent
- Unit tests for:
  - Basic output formatting
  - Special character escaping
  - Null optional fields
  - Multiple strings
  - Empty collections
  - All encoding types (Ascii, Utf8, Utf16Le, Utf16Be)
  - All source types
- Test coverage ≥85% for json.rs module
- Documentation with examples
- Integration with Formatter trait (once issue #25, Implement Output Formatter Framework with Trait-Based Architecture, is complete)
- Verify output parseable by jq and standard JSON parsers
Edge Cases to Handle
- Very long strings: No truncation in JSONL (unlike human format)
- Binary/invalid UTF-8: Already handled by String type in FoundString
- Empty string text: Should serialize as {"text":"", ...}
- Zero-length collections: Should produce empty output (no lines)
- Unicode characters: serde_json handles UTF-8 automatically
- Control characters: serde_json escapes appropriately (\n, \r, \t)
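The invalid UTF-8 case rests on the guarantee that a Rust String always holds valid UTF-8, so the formatter never sees raw invalid bytes. As a sketch of one standard-library route by which bytes become such a String, the hypothetical helper below uses String::from_utf8_lossy. Whether the extraction code decodes this way is an assumption; it may reject or skip invalid sequences instead.

```rust
/// Converts raw bytes to a `String`, replacing each invalid UTF-8 sequence
/// with U+FFFD (the Unicode replacement character).
/// Hypothetical helper; the real extractor may decode differently.
fn bytes_to_text(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}
```

Whatever decoding strategy is used upstream, the text field reaching the formatter is valid UTF-8, so serialization cannot fail on encoding grounds.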
Dependencies
- Blocked by: Issue #25 (Implement Output Formatter Framework with Trait-Based Architecture) - Trait definition required
- Requires: serde_json (already in Cargo.toml)
- Uses: serde derives on FoundString, Encoding, Tag, StringSource (already implemented)
References
- Architecture: docs/src/architecture.md (lines 289-294)
- Output Format Spec: docs/src/output-formats.md
- Data Structures: src/types.rs (line 144: FoundString definition)
- JSON Lines Spec: https://jsonlines.org/
- Issue #25: Implement Output Formatter Framework with Trait-Based Architecture (blocker)
- Task-ID: stringy-analyzer/jsonl-output-format
Estimated Effort
Low-Medium complexity - Straightforward serialization using existing serde infrastructure. Primary work is comprehensive testing and edge case handling. Estimated 4-6 hours of development time.
Example CLI Usage (Post-Integration)
# Basic JSONL output
stringy --json malware.exe
# Save to file
stringy --json binary.elf > strings.jsonl
# Pipeline with jq
stringy --json app.exe | jq 'select(.score > 80)'
# Filter URLs
stringy --json binary | jq 'select(.tags | contains(["Url"]))'
# Count by section
stringy --json binary | jq -r '.section' | sort | uniq -c