Implement Encoding Confidence Scoring and Integrate Section Weights into Ranking System #22

@unclesp1d3r

Summary

Implement encoding confidence scoring and integrate existing section weights into the string ranking system to support the overall string relevance scoring algorithm.

Current State

Already Implemented:

  • Section weight calculation based on SectionType (PE, ELF, Mach-O)
  • Section classification logic
  • Unit tests for section weight calculation
  • SectionInfo.weight field populated during binary parsing

Not Yet Implemented:

  • Encoding confidence scoring algorithm
  • String relevance scoring integration
  • Population of FoundString.score field

Requirements

From the architecture design (concept.md), the ranking algorithm is:

Score = SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty

This issue addresses the first two components: SectionWeight (integration) and EncodingConfidence (implementation).
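To make the formula concrete, here is a minimal sketch of how the four terms could be combined. The `ScoreComponents` struct and its field names are assumptions made for illustration and do not come from the codebase; the two terms outside this issue's scope simply default to 0.0.

```rust
/// Hypothetical container for the four terms of the concept.md formula.
/// The struct and its field names are assumptions made for this sketch.
#[derive(Debug, Clone, Copy, Default)]
pub struct ScoreComponents {
    pub section_weight: f32,
    pub encoding_confidence: f32,
    /// Out of scope for this issue; stays at 0.0 until implemented.
    pub semantic_boost: f32,
    /// Out of scope for this issue; stays at 0.0 until implemented.
    pub noise_penalty: f32,
}

impl ScoreComponents {
    /// Score = SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty
    pub fn total(&self) -> f32 {
        self.section_weight + self.encoding_confidence + self.semantic_boost
            - self.noise_penalty
    }
}
```

With only the first two terms populated, `total()` reduces to SectionWeight + EncodingConfidence, which is the part this issue delivers.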

5.1: Encoding Confidence Scoring

Implement confidence scoring for each detected encoding type based on:

  1. Character validity: Percentage of valid characters in detected encoding
  2. Printability: Ratio of printable to non-printable characters
  3. Null termination: Proper null termination for C-style strings
  4. Byte alignment: Correct alignment for multi-byte encodings (UTF-16)
  5. Control character ratio: Lower scores for high control character density
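As a rough illustration of factors 1, 2, 3, and 5 for single-byte candidates, the sketch below measures them directly from the raw bytes. `ByteMetrics` and `measure_bytes` are hypothetical names, and treating "valid" as "within the 7-bit ASCII range" is an assumption of this sketch. Factor 4 (alignment) is omitted here because it only applies to multi-byte encodings.

```rust
/// Hypothetical measurements for a single-byte (ASCII-like) candidate string.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ByteMetrics {
    pub valid_ratio: f32,
    pub printable_ratio: f32,
    pub control_ratio: f32,
    pub null_terminated: bool,
}

pub fn measure_bytes(raw_bytes: &[u8]) -> ByteMetrics {
    // Factor 3: a trailing NUL marks a C-style string; exclude it from the ratios.
    let null_terminated = raw_bytes.last() == Some(&0);
    let payload = if null_terminated {
        &raw_bytes[..raw_bytes.len() - 1]
    } else {
        raw_bytes
    };
    // max(1) avoids division by zero for an empty payload.
    let total = payload.len().max(1) as f32;

    // Factor 1: bytes in the 7-bit ASCII range count as valid (assumption).
    let valid = payload.iter().filter(|b| b.is_ascii()).count() as f32;
    // Factor 2: graphic characters plus the space character.
    let printable = payload
        .iter()
        .filter(|b| b.is_ascii_graphic() || **b == b' ')
        .count() as f32;
    // Factor 5: ASCII control characters (0x00-0x1F and 0x7F).
    let control = payload.iter().filter(|b| b.is_ascii_control()).count() as f32;

    ByteMetrics {
        valid_ratio: valid / total,
        printable_ratio: printable / total,
        control_ratio: control / total,
        null_terminated,
    }
}
```

How these ratios are weighted into a single 0.0 to 10.0 value is left open; the sketch only shows that each factor can be computed in one pass over the bytes.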

Proposed Confidence Scale: 0.0 to 10.0

  • 9.0-10.0: High confidence (clean ASCII/UTF-8, properly aligned UTF-16)
  • 7.0-8.9: Medium confidence (some irregularities but valid)
  • 5.0-6.9: Low confidence (marginal validity, needs context)
  • < 5.0: Very low confidence (likely noise)
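The bands above could be mapped to a type along these lines. `ConfidenceBand` and `confidence_band` are hypothetical names. Because the listed ranges leave small gaps (for example between 8.9 and 9.0), this sketch assumes half-open intervals, so a score of 8.95 counts as medium.

```rust
/// Hypothetical classification of the proposed 0.0 to 10.0 confidence scale.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceBand {
    High,
    Medium,
    Low,
    VeryLow,
}

pub fn confidence_band(score: f32) -> ConfidenceBand {
    if score >= 9.0 {
        ConfidenceBand::High
    } else if score >= 7.0 {
        ConfidenceBand::Medium
    } else if score >= 5.0 {
        ConfidenceBand::Low
    } else {
        // Also reached for NaN, since every comparison with NaN is false.
        ConfidenceBand::VeryLow
    }
}
```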

5.5: Section Weight Integration

Integrate existing section weights into the string scoring system:

  1. Add section weight to FoundString scoring calculation
  2. Apply section weight as an additive component, as in the Score formula above (a multiplier is the alternative, but it would depart from concept.md)
  3. Ensure strings from high-weight sections (e.g., .rodata, __cstring) rank higher
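The ranking effect in point 3 can be sketched as follows. `Candidate` and `rank` are hypothetical names, and the weights used in the usage note are placeholders rather than the project's real section weights. The sketch assumes the additive form of the formula.

```rust
/// Hypothetical stand-in for a scored string; not the project's FoundString.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub text: String,
    pub section_weight: f32,
    pub encoding_confidence: f32,
}

impl Candidate {
    /// Additive combination of the two terms covered by this issue.
    pub fn score(&self) -> f32 {
        self.section_weight + self.encoding_confidence
    }
}

/// Sorts candidates from highest to lowest score.
pub fn rank(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    // total_cmp gives a total order on f32, so NaN cannot break the sort.
    candidates.sort_by(|a, b| b.score().total_cmp(&a.score()));
    candidates
}
```

Given two strings with the same encoding confidence of 9.0, one with a placeholder weight of 8.0 and one with 2.0, `rank` places the higher-weight string first.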

Proposed Implementation

1. Encoding Confidence Module (src/scoring/encoding_confidence.rs)

pub fn calculate_encoding_confidence(
    text: &str,
    encoding: Encoding,
    raw_bytes: &[u8],
) -> f32 {
    match encoding {
        Encoding::Ascii => score_ascii_confidence(text, raw_bytes),
        Encoding::Utf8 => score_utf8_confidence(text, raw_bytes),
        Encoding::Utf16Le | Encoding::Utf16Be => score_utf16_confidence(text, raw_bytes, encoding),
    }
}

fn score_ascii_confidence(text: &str, raw_bytes: &[u8]) -> f32 {
    // Count characters that are printable ASCII or ASCII whitespace.
    let printable_count = text
        .chars()
        .filter(|c| c.is_ascii_graphic() || c.is_ascii_whitespace())
        .count();
    // max(1) avoids division by zero; an empty string scores 0.0.
    let total_chars = text.chars().count().max(1);
    let base_score = printable_count as f32 / total_chars as f32 * 10.0;

    // Strings without a trailing NUL lose 5% and are capped at 9.5,
    // so only null-terminated strings can reach the full 10.0.
    if raw_bytes.last() == Some(&0) {
        base_score.min(10.0)
    } else {
        (base_score * 0.95).min(9.5)
    }
}
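calculate_encoding_confidence dispatches to score_utf16_confidence, which this issue does not spell out. One possible shape is sketched below, keeping the same signature. The `Encoding` enum here is a local stand-in for the project's type, and the 0.5 multiplier for misaligned input is an assumed value, not one taken from the design.

```rust
/// Local stand-in for the project's Encoding enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

pub fn score_utf16_confidence(_text: &str, raw_bytes: &[u8], encoding: Encoding) -> f32 {
    // Build 16-bit code units; chunks_exact drops a trailing odd byte.
    let units: Vec<u16> = raw_bytes
        .chunks_exact(2)
        .map(|pair| match encoding {
            Encoding::Utf16Be => u16::from_be_bytes([pair[0], pair[1]]),
            _ => u16::from_le_bytes([pair[0], pair[1]]),
        })
        .collect();
    if units.is_empty() {
        return 0.0;
    }

    // Null termination: a trailing 0x0000 unit, excluded from the ratio.
    let null_terminated = units.last() == Some(&0);
    let payload = if null_terminated {
        &units[..units.len() - 1]
    } else {
        &units[..]
    };

    // Validity and printability: unpaired surrogates decode to Err and
    // control characters are rejected, so neither counts as printable.
    let total = char::decode_utf16(payload.iter().copied()).count().max(1);
    let printable = char::decode_utf16(payload.iter().copied())
        .filter(|r| matches!(r, Ok(c) if !c.is_control()))
        .count();
    let mut score = printable as f32 / total as f32 * 10.0;

    // Byte alignment: an odd byte count halves the score (assumed penalty).
    if raw_bytes.len() % 2 != 0 {
        score *= 0.5;
    }
    // Same rule as the ASCII scorer for strings without a terminator.
    if !null_terminated {
        score = (score * 0.95).min(9.5);
    }
    score
}
```

The `_text` parameter is unused in this sketch because all checks run on the raw code units; a UTF-8 scorer could follow the same pattern using `std::str::from_utf8`.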

2. String Scoring Integration (src/scoring/mod.rs)

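pub fn calculate_string_score(
    _string: &FoundString, // unused by the two terms covered in this issue
    section_weight: f32,
    encoding_confidence: f32,
) -> i32 {
    // SemanticBoost and NoisePenalty are not part of this issue.
    let base_score = section_weight + encoding_confidence;
    // Round to the nearest integer for storage in FoundString.score
    base_score.round() as i32
}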

3. Update String Extraction

Modify string extraction logic to:

  1. Calculate encoding confidence when creating FoundString
  2. Look up section weight from ContainerInfo
  3. Compute and assign final score
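The three steps above might fit together roughly as follows. The struct definitions are simplified stand-ins with assumed fields (the real FoundString, SectionInfo, and ContainerInfo will differ), and `assign_score` is a hypothetical helper. Falling back to a weight of 0.0 for strings outside any known section is also an assumption.

```rust
// Simplified stand-ins for the project's types; field layout is assumed.
pub struct SectionInfo {
    pub name: String,
    pub weight: f32,
}

pub struct ContainerInfo {
    pub sections: Vec<SectionInfo>,
}

pub struct FoundString {
    pub text: String,
    pub section: Option<String>,
    pub score: i32,
}

/// Hypothetical helper: takes the confidence from step 1, performs the
/// section weight lookup (step 2), and assigns the final score (step 3).
pub fn assign_score(
    found: &mut FoundString,
    container: &ContainerInfo,
    encoding_confidence: f32,
) {
    // Step 2: find the owning section by name; unknown sections weigh 0.0.
    let section_weight = found
        .section
        .as_deref()
        .and_then(|name| container.sections.iter().find(|s| s.name == name))
        .map_or(0.0, |s| s.weight);

    // Step 3: additive score, rounded for storage.
    found.score = (section_weight + encoding_confidence).round() as i32;
}
```

Looking sections up by name is only for brevity here; an index or address-range lookup may be a better fit for the real extraction code.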

Testing Requirements

Unit Tests

  • test_ascii_confidence_scoring: Various ASCII string qualities
  • test_utf8_confidence_scoring: Valid/invalid UTF-8 sequences
  • test_utf16_confidence_scoring: Aligned/misaligned UTF-16 data
  • test_section_weight_integration: Score calculation with different section weights
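As an example of what the first test in this list might assert, the sketch below repeats the proposed ASCII scorer so that it compiles on its own. The specific inputs and thresholds are illustrative choices, not requirements from the design.

```rust
// Copy of the proposed ASCII scorer so the sketch compiles on its own.
fn score_ascii_confidence(text: &str, raw_bytes: &[u8]) -> f32 {
    let printable = text
        .chars()
        .filter(|c| c.is_ascii_graphic() || c.is_ascii_whitespace())
        .count();
    let total = text.chars().count().max(1);
    let base_score = printable as f32 / total as f32 * 10.0;
    if raw_bytes.last() == Some(&0) {
        base_score.min(10.0)
    } else {
        (base_score * 0.95).min(9.5)
    }
}

#[cfg(test)]
mod tests {
    use super::score_ascii_confidence;

    #[test]
    fn test_ascii_confidence_scoring() {
        // Clean, null-terminated ASCII reaches the top of the scale.
        assert_eq!(score_ascii_confidence("hello", b"hello\0"), 10.0);
        // The same text without a terminator is capped at 9.5.
        assert!(score_ascii_confidence("hello", b"hello") <= 9.5);
        // Mostly control characters should fall below 5.0 (very low).
        assert!(score_ascii_confidence("a\x01\x02\x03", b"a\x01\x02\x03") < 5.0);
    }
}
```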

Integration Tests

  • Extract strings from real binaries and verify scoring
  • High-weight sections should produce higher-scored strings
  • Clean encodings should score higher than noisy ones

Acceptance Criteria

  • Encoding confidence scoring implemented for all encoding types
  • Section weight integrated into string scoring
  • FoundString.score field populated during extraction
  • Unit tests achieve >90% coverage for scoring logic
  • Integration tests verify end-to-end scoring behavior
  • Documentation updated with scoring algorithm details

Dependencies

  • Blocked by: Ranking System Foundation
  • Blocks: Semantic Tagging (#TBD), Noise Penalty Calculation (#TBD)

References

  • Architecture: concept.md - Ranking Algorithm section
  • Related code: src/container/{pe,elf,macho}.rs - Section weight calculation
  • Requirements: 5.1 (Encoding Confidence), 5.5 (Section Weight Integration)

Task-ID: stringy-analyzer/section-weight-scoring
