Description
Background
Stringy is designed to analyze binary files for string extraction, which often involves processing large executables, malware samples, system libraries, and packed binaries that can range from megabytes to gigabytes in size. Loading entire files into memory using traditional std::fs::read() is inefficient and can cause memory pressure, especially when analyzing multiple files or very large binaries.
Problem
Currently, the extraction pipeline is not yet implemented, but when it is, it will need to read binary files efficiently. The container parsing infrastructure already works with byte slices (&[u8]), which is perfect for memory-mapped files. Without memory mapping:
- Large files (>100MB) consume excessive RAM
- Memory allocation overhead impacts performance
- No benefit from OS page caching
- Potential OOM errors on systems with limited RAM
- Slower startup time for large binaries
Proposed Solution
Implement a size-aware file reading strategy using the memmap2 crate:
1. Add Dependencies
Add memmap2 to Cargo.toml:
```toml
[dependencies]
memmap2 = "0.9"
```
2. Create File Reader Module
Create src/io.rs or src/file_reader.rs with:
- `FileReader` trait or struct that abstracts file reading
- Memory-mapped reading for files above a threshold (e.g., 10MB)
- Direct `std::fs::read()` for smaller files (avoids mmap overhead)
- Safe handling of memory mapping (read-only, error handling)
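The size-based dispatch can be sketched independently of mmap itself. This is a minimal sketch; the 10MB threshold comes from the issue text, and the `ReadStrategy`/`choose_strategy` names are illustrative, not part of the proposed API:

```rust
/// Threshold assumed from the issue text: files above 10 MB get mmap'd.
const MMAP_THRESHOLD: u64 = 10 * 1024 * 1024;

/// Which read path to take for a file of a given size.
#[derive(Debug, PartialEq)]
pub enum ReadStrategy {
    /// Small file: plain `std::fs::read()` into a `Vec<u8>`.
    Buffered,
    /// Large file: memory-map it with memmap2.
    MemoryMapped,
}

/// Pure decision function, easy to unit-test without touching the filesystem.
pub fn choose_strategy(file_len: u64) -> ReadStrategy {
    if file_len > MMAP_THRESHOLD {
        ReadStrategy::MemoryMapped
    } else {
        ReadStrategy::Buffered
    }
}
```

Keeping the decision in a pure function means the threshold boundary can be tested exactly, without creating multi-megabyte fixture files.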
3. Integration Points
- Update `src/main.rs` to use the new file reader
- Ensure `src/container/mod.rs` parsers work with memory-mapped data
- Handle edge cases (empty files, special files, pipes)
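The edge-case handling can be front-loaded into a metadata check before any mapping is attempted. This sketch uses only the standard library; the `check_mappable` name is hypothetical:

```rust
use std::fs;
use std::io::{Error, ErrorKind, Result};
use std::path::Path;

/// Reject inputs that mmap cannot handle sanely: directories, pipes,
/// and other non-regular files. Returns the file length on success so
/// the caller can pick a read strategy. Zero-length files are allowed
/// here, but callers should skip mmap for them (mapping an empty file
/// fails on some platforms) and fall back to an empty Vec.
pub fn check_mappable(path: &Path) -> Result<u64> {
    let metadata = fs::metadata(path)?;
    if !metadata.is_file() {
        return Err(Error::new(
            ErrorKind::InvalidInput,
            "not a regular file (directory, pipe, or device)",
        ));
    }
    Ok(metadata.len())
}
```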
4. Implementation Example
```rust
use std::fs::File;
use std::io::Result;
use std::path::Path;

use memmap2::Mmap;

/// Files larger than this are memory-mapped; smaller ones are read directly.
const MMAP_THRESHOLD: u64 = 10 * 1024 * 1024; // 10 MB

pub struct FileReader {
    _mmap: Option<Mmap>,
    data: Vec<u8>,
}

impl FileReader {
    pub fn open<P: AsRef<Path>>(path: P) -> Result<Self> {
        let path = path.as_ref();
        let file = File::open(path)?;
        let metadata = file.metadata()?;
        if metadata.len() > MMAP_THRESHOLD {
            // Use memory mapping for large files.
            // SAFETY: the map is read-only; callers must not truncate the
            // underlying file while the map is alive.
            let mmap = unsafe { Mmap::map(&file)? };
            Ok(Self {
                _mmap: Some(mmap),
                data: Vec::new(),
            })
        } else {
            // Read small files directly to avoid mmap setup overhead.
            let data = std::fs::read(path)?;
            Ok(Self { _mmap: None, data })
        }
    }

    pub fn as_slice(&self) -> &[u8] {
        self._mmap
            .as_ref()
            .map(|m| m.as_ref())
            .unwrap_or(&self.data)
    }
}
```
Benefits
- Performance: Eliminates large memory allocations and copies
- Scalability: Enables analysis of multi-gigabyte files without memory issues
- Efficiency: OS handles paging and caching automatically
- User Experience: Faster startup and lower memory footprint
- Architecture: Clean abstraction that doesn't affect existing parser APIs
Testing Requirements
- Unit Tests:
  - Test small file reading (< threshold)
  - Test large file reading (> threshold)
  - Test empty files and edge cases
  - Verify correct byte content for both paths
- Integration Tests:
  - Test with real binary files (ELF, PE, Mach-O)
  - Verify parser compatibility with memory-mapped data
  - Test error handling for invalid/missing files
- Benchmarks:
  - Compare performance vs. `std::fs::read()` for various file sizes
  - Measure memory usage with large files
  - Use `criterion` (already in dev-dependencies)
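A byte-content unit test for both paths could take the following shape. Since `FileReader` is not yet merged, `fs::read` stands in for `FileReader::open(...).as_slice()` here; the helper name and temp-file path are illustrative:

```rust
use std::fs;

/// Write `bytes` to a temp file, read them back, and return the result.
/// In the real test suite, the read-back step would go through
/// `FileReader` so both the buffered and mmap'd paths are exercised.
fn round_trip(bytes: &[u8]) -> Vec<u8> {
    let path = std::env::temp_dir().join("stringy_roundtrip_test.bin");
    fs::write(&path, bytes).unwrap();
    let got = fs::read(&path).unwrap();
    fs::remove_file(&path).unwrap();
    got
}
```

For the large-file path, the same round trip would run on a generated file just over the threshold, asserting that the mapped slice matches the written bytes.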
Acceptance Criteria
- `memmap2` dependency added to `Cargo.toml`
- File reading abstraction implemented with size-based strategy
- Unit tests achieve >90% code coverage for file reading module
- Integration tests verify compatibility with all binary formats
- Benchmarks demonstrate performance improvement for files >100MB
- Documentation includes usage examples and threshold rationale
- Error handling covers all edge cases (permissions, special files, etc.)
Related
- Part of milestone v0.1 (Binary Analyzer MVP)
- Foundation for extraction pipeline implementation
- Enables efficient processing of large malware samples and system binaries
Task-ID
stringy-analyzer/memory-mapping-support