Implement CLI Filtering Arguments for String Extraction Control #30

@unclesp1d3r

Overview

This issue adds command-line filtering arguments that give users fine-grained control over which strings Stringy reports from a binary. Filtering reduces noise and focuses analysis on relevant string types, making Stringy more practical for real-world reverse engineering and malware analysis workflows.

Context

Stringy extracts strings from binaries in multiple encodings (ASCII, UTF-8, UTF-16LE/BE) and categorizes them with semantic tags (URLs, file paths, GUIDs, format strings, etc.). Without filtering, users receive every extracted string regardless of their analysis goals. Filtering addresses several needs:

  • Malware analysts often need only URLs and network indicators
  • Reverse engineers may focus on format strings and error messages
  • Performance: filtering reduces output size and, where unwanted encodings are skipped, extraction work
  • Pipelines: filtered JSON output enables targeted downstream processing

Requirements

This implements requirements 7.1, 7.2, 7.3, and 7.4 from the specification.

Proposed Solution

CLI Arguments

Implement the following filtering arguments using the clap crate:

  1. --min-len <LENGTH>

    • Exclude strings shorter than the specified length
    • Type: usize
    • Default: 4 (consistent with standard strings command)
    • Example: stringy --min-len 8 binary.exe
  2. --enc <ENCODINGS>

    • Comma-separated list of encodings to include
    • Type: Vec<String>
    • Valid values: ascii, utf8, utf16le, utf16be
    • Default: all encodings
    • Example: stringy --enc ascii,utf8 binary.exe
  3. --only-tags <TAGS>

    • Comma-separated list of semantic tags to include (whitelist)
    • Type: Vec<String>
    • Valid values: url, domain, ip, filepath, registry, guid, useragent, fmt, base64, crypto
    • Mutually exclusive with --notags
    • Example: stringy --only-tags url,domain,ip binary.exe
  4. --notags

    • Exclude strings without any semantic tags (show only classified strings)
    • Type: bool flag
    • Mutually exclusive with --only-tags
    • Example: stringy --notags binary.exe
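As a minimal sketch of the validation that --enc calls for, the following parses a comma-separated list into a typed enum and reports the valid values on failure. The Encoding enum and the parse_encodings helper are hypothetical names for illustration, not existing Stringy code. The sketch uses only the standard library so it stands alone; in the real CLI the same FromStr impl could presumably be handed to clap instead. The same pattern would apply to --only-tags.

```rust
use std::str::FromStr;

/// Encodings listed as valid values for `--enc` in this issue.
/// (Hypothetical type; the real Stringy type may differ.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Accepted spellings, reused in the error message.
const VALID_ENCODINGS: &str = "ascii, utf8, utf16le, utf16be";

impl FromStr for Encoding {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Surrounding whitespace is ignored (an assumption, so that
        // "ascii, utf8" is accepted as well as "ascii,utf8").
        match s.trim() {
            "ascii" => Ok(Encoding::Ascii),
            "utf8" => Ok(Encoding::Utf8),
            "utf16le" => Ok(Encoding::Utf16Le),
            "utf16be" => Ok(Encoding::Utf16Be),
            other => Err(format!(
                "invalid encoding '{}' (valid values: {})",
                other, VALID_ENCODINGS
            )),
        }
    }
}

/// Hypothetical helper: parse a comma-separated `--enc` value.
/// Stops at the first invalid name and returns its error message.
pub fn parse_encodings(list: &str) -> Result<Vec<Encoding>, String> {
    list.split(',').map(str::parse::<Encoding>).collect()
}
```

With this sketch, parse_encodings("ascii,utf8") yields the two matching variants, while parse_encodings("ascii,latin1") returns an error that names latin1 and lists the accepted values.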

Implementation Approach

  1. Argument Parsing (src/cli.rs or main CLI module)

    • Define filtering arguments in the clap Args struct
    • Implement validation for encoding names and tag names
    • Add mutual exclusivity check for --only-tags and --notags
  2. Filter Configuration (src/types.rs or new src/filter.rs)

    • Create a FilterConfig struct to hold parsed filter settings
    • Implement FilterConfig::from_args() to convert CLI args
    • Add validation methods for encoding and tag names
  3. Filtering Logic (in string extraction pipeline)

    • Apply min-len filter during or after extraction
    • Apply encoding filter by skipping unwanted encoding attempts
    • Apply tag filters when assembling final output
    • Ensure filtering preserves ranking/scoring when applicable
  4. Error Handling

    • Invalid encoding names → clear error message with valid options
    • Invalid tag names → clear error message with valid options
    • Conflicting --only-tags and --notags → rejected at parse time by clap (for example via conflicts_with)
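The steps above can be sketched roughly as follows. FilterConfig is the name proposed in this issue, but its fields, the FoundString stand-in type, and the new, matches, and apply methods are assumptions made for illustration. The real pipeline types may look different, and FilterConfig::from_args() would build the same data from the parsed clap arguments. Length is counted in characters here, which is also an assumption.

```rust
/// Hypothetical stand-in for an extracted string.
#[derive(Debug, Clone, PartialEq)]
pub struct FoundString {
    pub text: String,
    pub encoding: String,  // e.g. "ascii", "utf16le"
    pub tags: Vec<String>, // e.g. ["url"]; empty if unclassified
}

/// Parsed filter settings (field layout is an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct FilterConfig {
    pub min_len: usize,
    pub encodings: Option<Vec<String>>, // None = all encodings
    pub only_tags: Option<Vec<String>>, // None = no tag allow-list
    pub notags: bool,                   // true = drop untagged strings
}

impl FilterConfig {
    /// Builds a config, rejecting the conflicting flag combination.
    pub fn new(
        min_len: usize,
        encodings: Option<Vec<String>>,
        only_tags: Option<Vec<String>>,
        notags: bool,
    ) -> Result<Self, String> {
        if only_tags.is_some() && notags {
            return Err("--only-tags and --notags are mutually exclusive".to_string());
        }
        Ok(Self { min_len, encodings, only_tags, notags })
    }

    /// Returns true if a string passes every configured filter.
    pub fn matches(&self, s: &FoundString) -> bool {
        if s.text.chars().count() < self.min_len {
            return false;
        }
        if let Some(encs) = &self.encodings {
            if !encs.contains(&s.encoding) {
                return false;
            }
        }
        if self.notags && s.tags.is_empty() {
            return false;
        }
        if let Some(wanted) = &self.only_tags {
            if !s.tags.iter().any(|t| wanted.contains(t)) {
                return false;
            }
        }
        true
    }

    /// Filters in place of the input order, so any ranking applied
    /// before this step is preserved.
    pub fn apply(&self, strings: Vec<FoundString>) -> Vec<FoundString> {
        strings.into_iter().filter(|s| self.matches(s)).collect()
    }
}
```

Filtering after extraction keeps the sketch simple. Skipping unwanted encodings during extraction, as the list suggests, would save work but needs hooks in the extractor itself.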

Example Usage

# Extract only URLs and domains with minimum length 10
stringy --only-tags url,domain --min-len 10 malware.exe

# Extract only UTF-16LE strings (common in PE files)
stringy --enc utf16le malware.exe

# Extract only tagged/classified strings
stringy --notags suspicious.elf

# Combine filters for precision
stringy --min-len 8 --enc ascii,utf8 --only-tags filepath,registry binary.dll

Acceptance Criteria

  • --min-len argument filters strings by minimum length
  • --enc argument accepts comma-separated encoding list and filters accordingly
  • --only-tags argument accepts comma-separated tag list and shows only matching strings
  • --notags flag excludes untagged strings
  • --only-tags and --notags are mutually exclusive (enforced by clap)
  • Invalid encoding names produce helpful error messages
  • Invalid tag names produce helpful error messages
  • Unit tests cover:
    • Argument parsing for all flags
    • Filter configuration validation
    • Filtering logic for each argument type
    • Mutual exclusivity enforcement
    • Edge cases (empty results, no matches, etc.)
  • Documentation updated with filtering examples
  • Filtering works correctly in both human-readable and JSON output modes

Dependencies

  • Blocked by: Basic CLI Structure (must be completed first)
  • Depends on: String extraction and tagging system (framework exists)

Testing Strategy

  1. Unit Tests (tests/cli_filtering.rs):

    • Parse valid and invalid argument combinations
    • Validate filter configuration creation
    • Test filter application logic in isolation
  2. Integration Tests:

    • Run stringy with various filter combinations on test binaries
    • Verify output contains only expected strings
    • Test edge cases (no matches, all filtered out, etc.)
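To illustrate the shape such edge-case tests might take, here is a small sketch built around a hypothetical filter_min_len helper. The helper exists only for this example and is not part of Stringy; the point is the set of cases covered (empty input, everything filtered out, and the length boundary).

```rust
/// Hypothetical helper used only to illustrate the test shape:
/// keeps strings with at least `min_len` characters.
pub fn filter_min_len<'a>(strings: &[&'a str], min_len: usize) -> Vec<&'a str> {
    strings
        .iter()
        .copied()
        .filter(|s| s.chars().count() >= min_len)
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn empty_input_yields_empty_output() {
        assert!(filter_min_len(&[], 4).is_empty());
    }

    #[test]
    fn all_strings_can_be_filtered_out() {
        assert!(filter_min_len(&["ab", "abc"], 4).is_empty());
    }

    #[test]
    fn boundary_length_is_kept() {
        // A string of exactly `min_len` characters passes the filter.
        assert_eq!(filter_min_len(&["abcd", "abc"], 4), vec!["abcd"]);
    }
}
```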

Task ID

stringy-analyzer/filtering-cli-arguments

Related Issues

  • Basic CLI Structure (prerequisite)
  • String Extraction Engine (#TBD)
  • Semantic Tagging System (#TBD)
