Implement String Ranking Engine with Configurable Scoring System #21

@unclesp1d3r

Summary

Create a flexible RankingEngine that assigns importance scores to extracted strings based on their semantic tags, source location, and section characteristics. This enables prioritization of potentially interesting strings in binary analysis.

Background

The string analyzer extracts and classifies strings from binaries with semantic tags (URLs, IPs, file paths, etc.), section types (code, data, resources), and source locations (imports, exports, section data). However, not all strings are equally interesting for analysis. A ranking system is needed to:

  • Prioritize high-value strings (e.g., network indicators, file paths, registry keys)
  • Deprioritize noise (e.g., common debug strings, version info)
  • Weight by context (strings from executable sections vs. debug sections)
  • Enable customizable scoring for different analysis scenarios (malware analysis, reverse engineering, compliance scanning)

Proposed Solution

Architecture

Create src/classification/ranking.rs with the following components:

  1. RankingEngine struct: Main scoring engine with configurable weights
  2. ScoreConfig struct: Configuration for tag weights, source weights, and section type multipliers
  3. StringScore struct: Returned score with breakdown for transparency
  4. Default scoring profiles: Presets for common use cases (malware analysis, general strings, etc.)

Scoring Algorithm

final_score = (tag_weight + source_weight) × section_type_multiplier

Tag Weights (base importance):

  • High value (8-10): URLs, Domains, IPv4/IPv6, Email, Registry paths
  • Medium value (5-7): File paths, GUIDs, Base64 (potential encoding)
  • Lower value (2-4): Format strings, User agents
  • Contextual (variable): Imports/Exports (depends on name), Version strings

Source Weights:

  • ImportName/ExportName: +3 (API calls are interesting)
  • SectionData: +2 (hardcoded strings)
  • ResourceString: +1 (UI strings, less critical)
  • DebugInfo: -2 (usually noise)

Section Type Multipliers:

  • Code sections: ×1.5 (strings in executable code are unusual)
  • StringData/ReadOnlyData: ×1.0 (expected location)
  • WritableData: ×1.2 (potentially modified at runtime)
  • Resources: ×0.8 (often benign UI strings)
  • Debug: ×0.3 (low priority noise)
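To make the formula concrete, here is a minimal sketch that evaluates it for two cases. The tag weights used (10.0 for a URL, 3.0 for a format string) are assumed picks from the ranges above, not values fixed by this issue, and `final_score` is a hypothetical helper name.

```rust
/// Combine the three components as in the formula above:
/// (tag_weight + source_weight) * section_type_multiplier.
fn final_score(tag_weight: f32, source_weight: f32, section_multiplier: f32) -> f32 {
    (tag_weight + source_weight) * section_multiplier
}

fn main() {
    // URL (assumed weight 10.0) from SectionData (+2) in a code section (x1.5).
    let url_in_code = final_score(10.0, 2.0, 1.5);
    // Format string (assumed weight 3.0) from DebugInfo (-2) in a debug section (x0.3).
    let fmt_in_debug = final_score(3.0, -2.0, 0.3);

    println!("URL in code section:    {url_in_code:.2}");
    println!("format string in debug: {fmt_in_debug:.2}");

    // The network indicator should outrank the debug noise.
    assert!(url_in_code > fmt_in_debug);
}
```

Under these assumptions the URL scores (10 + 2) × 1.5 = 18.0 and the debug format string scores (3 − 2) × 0.3 = 0.3, which matches the intended ordering of high-value strings over noise.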

Implementation Details

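use std::collections::HashMap;

use super::{SectionType, StringSource, Tag};

/// Main scoring engine; owns the weights it scores with.
pub struct RankingEngine {
    config: ScoreConfig,
}

/// Weights and multipliers consumed by the scoring formula.
/// The key types must implement Eq + Hash to be used in a HashMap.
pub struct ScoreConfig {
    tag_weights: HashMap<Tag, f32>,
    source_weights: HashMap<StringSource, f32>,
    section_multipliers: HashMap<SectionType, f32>,
}

/// Final score plus the components that produced it.
pub struct StringScore {
    pub total: f32,
    pub tag_weight: f32,
    pub source_weight: f32,
    pub section_multiplier: f32,
}

// Signatures only; method bodies are left to the implementation.
impl RankingEngine {
    /// Build an engine from a caller-supplied configuration.
    pub fn new(config: ScoreConfig) -> Self;
    /// Build an engine with the default scoring profile.
    pub fn with_defaults() -> Self;
    /// Score one string from its tag, source, and containing section type.
    pub fn score(&self, tag: &Tag, source: StringSource, section: SectionType) -> StringScore;
}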
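One way the signatures above could be filled in is sketched below. The three enums are stand-ins with only a few assumed variants, since the real `Tag`, `StringSource`, and `SectionType` live in src/classification/mod.rs. The concrete default weights and the fallback for missing entries (0.0 for weights, 1.0 for multipliers) are also assumptions, not decisions made in this issue.

```rust
use std::collections::HashMap;

// Stand-in enums; the real types in src/classification/mod.rs may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Tag { Url, FilePath, FormatString }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum StringSource { ImportName, SectionData, DebugInfo }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum SectionType { Code, StringData, Debug }

struct ScoreConfig {
    tag_weights: HashMap<Tag, f32>,
    source_weights: HashMap<StringSource, f32>,
    section_multipliers: HashMap<SectionType, f32>,
}

struct StringScore {
    total: f32,
    tag_weight: f32,
    source_weight: f32,
    section_multiplier: f32,
}

struct RankingEngine { config: ScoreConfig }

impl RankingEngine {
    fn new(config: ScoreConfig) -> Self { Self { config } }

    fn with_defaults() -> Self {
        // Assumed values picked from the ranges listed in the issue.
        Self::new(ScoreConfig {
            tag_weights: HashMap::from([
                (Tag::Url, 10.0), (Tag::FilePath, 6.0), (Tag::FormatString, 3.0),
            ]),
            source_weights: HashMap::from([
                (StringSource::ImportName, 3.0),
                (StringSource::SectionData, 2.0),
                (StringSource::DebugInfo, -2.0),
            ]),
            section_multipliers: HashMap::from([
                (SectionType::Code, 1.5), (SectionType::StringData, 1.0), (SectionType::Debug, 0.3),
            ]),
        })
    }

    fn score(&self, tag: &Tag, source: StringSource, section: SectionType) -> StringScore {
        // Missing entries fall back to neutral values (an assumption of this sketch).
        let tag_weight = self.config.tag_weights.get(tag).copied().unwrap_or(0.0);
        let source_weight = self.config.source_weights.get(&source).copied().unwrap_or(0.0);
        let section_multiplier =
            self.config.section_multipliers.get(&section).copied().unwrap_or(1.0);
        StringScore {
            total: (tag_weight + source_weight) * section_multiplier,
            tag_weight,
            source_weight,
            section_multiplier,
        }
    }
}

fn main() {
    let engine = RankingEngine::with_defaults();
    let s = engine.score(&Tag::Url, StringSource::SectionData, SectionType::Code);
    // Print the breakdown alongside the total for transparency.
    println!(
        "total={:.2} tag={} source={} multiplier={}",
        s.total, s.tag_weight, s.source_weight, s.section_multiplier
    );
}
```

Returning the components alongside `total` keeps the breakdown available to callers, which is what the StringScore struct is meant to provide.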

Acceptance Criteria

  • RankingEngine struct created with configurable scoring
  • ScoreConfig supports custom weights for tags, sources, and sections
  • Default scoring profile implemented with sensible weights
  • score() method returns detailed StringScore with breakdown
  • Unit tests for various scoring combinations
  • Documentation with examples of customization
  • Integration point in classification module (mod.rs)

Technical Notes

  • Use f32 for scores to allow fractional weights
  • Consider using builder pattern for ScoreConfig customization
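  • Scores should be normalized (0-100 range recommended). With the weights listed above, the raw formula tops out at (10 + 3) × 1.5 = 19.5, and a low-weight contextual tag from DebugInfo can push the sum below zero, so normalization needs an explicit scale-and-clamp step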
  • Future enhancement: Machine learning-based weight tuning
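The builder and normalization notes above could look roughly like the sketch below. `ScoreConfigBuilder`, its `tag_weight` setter, and `normalize` are hypothetical names introduced here for illustration, and the `Tag` enum is a two-variant stand-in for the real type. Only tag weights are shown; source weights and section multipliers would follow the same shape.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real Tag type in src/classification/mod.rs.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Tag { Url, FormatString }

#[derive(Debug, Default)]
struct ScoreConfig {
    tag_weights: HashMap<Tag, f32>,
}

/// Hypothetical builder: each setter consumes and returns the builder,
/// so overrides can be chained before calling `build`.
#[derive(Default)]
struct ScoreConfigBuilder { config: ScoreConfig }

impl ScoreConfigBuilder {
    fn new() -> Self { Self::default() }

    fn tag_weight(mut self, tag: Tag, weight: f32) -> Self {
        self.config.tag_weights.insert(tag, weight);
        self
    }

    fn build(self) -> ScoreConfig { self.config }
}

/// Map a raw score onto 0..=100, given the largest raw score the
/// configuration can produce. Negative raw scores clamp to 0.
fn normalize(raw: f32, max_raw: f32) -> f32 {
    if max_raw <= 0.0 {
        return 0.0;
    }
    (raw / max_raw * 100.0).clamp(0.0, 100.0)
}

fn main() {
    let config = ScoreConfigBuilder::new()
        .tag_weight(Tag::Url, 10.0)
        .tag_weight(Tag::FormatString, 3.0)
        .build();
    println!("URL weight: {:?}", config.tag_weights.get(&Tag::Url));

    // With the weights listed in the issue, the largest raw score
    // is (10 + 3) * 1.5 = 19.5.
    println!("normalized: {:.1}", normalize(18.0, 19.5));
}
```

Scaling by the configuration's own maximum keeps custom profiles on the same 0-100 range, at the cost of recomputing that maximum whenever weights change. A fixed divisor would be simpler but could saturate under aggressive custom weights.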

Dependencies

  • Requires existing types from src/classification/mod.rs: Tag, StringSource, SectionType
  • No external crate dependencies expected for MVP

References

  • Requirements: 5.1
  • Task-ID: stringy-analyzer/ranking-system-foundation
