Open
Labels
- MVP (Minimum viable product features)
- area:analyzer (Binary analyzer functionality)
- lang:rust (Rust implementation)
- needs:tests (Needs test coverage)
- priority:medium (Medium priority task)
- status:backlog (Task in backlog)
- story-points: 8 (8 story points)
- type:enhancement (New feature or request)
Description
Summary
Implement encoding confidence scoring and integrate existing section weights into the string ranking system to support the overall string relevance scoring algorithm.
Current State
✅ Already Implemented:
- Section weight calculation based on SectionType (PE, ELF, Mach-O)
- Section classification logic
- Unit tests for section weight calculation
- SectionInfo.weight field populated during binary parsing
❌ Not Yet Implemented:
- Encoding confidence scoring algorithm
- String relevance scoring integration
- Population of the FoundString.score field
Requirements
From the architecture design (concept.md), the ranking algorithm is:
Score = SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty
This issue addresses the first two components: SectionWeight (integration) and EncodingConfidence (implementation).
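A minimal sketch of how the four terms could combine, assuming each component is an f32. The ScoreComponents struct and its field names are hypothetical and not part of the existing code; SemanticBoost and NoisePenalty are outside this issue and default to 0.0 here.

```rust
/// Hypothetical container for the four terms of the ranking formula.
#[derive(Debug, Clone, Copy, Default)]
pub struct ScoreComponents {
    pub section_weight: f32,
    pub encoding_confidence: f32,
    /// Not covered by this issue; stays 0.0 until semantic tagging exists.
    pub semantic_boost: f32,
    /// Not covered by this issue; stays 0.0 until the noise penalty exists.
    pub noise_penalty: f32,
}

impl ScoreComponents {
    /// Score = SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty
    pub fn total(&self) -> f32 {
        self.section_weight + self.encoding_confidence + self.semantic_boost
            - self.noise_penalty
    }
}
```

With only the first two terms filled in, total() reduces to SectionWeight + EncodingConfidence, which is the scope of this issue.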
5.1: Encoding Confidence Scoring
Implement confidence scoring for each detected encoding type based on:
- Character validity: Percentage of valid characters in detected encoding
- Printability: Ratio of printable to non-printable characters
- Null termination: Proper null termination for C-style strings
- Byte alignment: Correct alignment for multi-byte encodings (UTF-16)
- Control character ratio: Lower scores for high control character density
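Two of the criteria above (control character ratio and UTF-16 alignment/validity) could be measured roughly as follows. This is a sketch under assumptions: the function names are hypothetical, tabs and line breaks are not counted as noise, and only little-endian UTF-16 is shown.

```rust
/// Fraction of characters that are control characters.
/// Assumption: tab, LF and CR are common in real strings and are not counted.
pub fn control_char_ratio(text: &str) -> f32 {
    let total = text.chars().count().max(1) as f32;
    let control = text
        .chars()
        .filter(|c| c.is_control() && !matches!(c, '\t' | '\n' | '\r'))
        .count() as f32;
    control / total
}

/// Fraction of decoded UTF-16LE items that are valid scalar values.
/// Odd-length (misaligned) or empty input yields 0.0.
pub fn utf16le_valid_ratio(raw_bytes: &[u8]) -> f32 {
    if raw_bytes.is_empty() || raw_bytes.len() % 2 != 0 {
        return 0.0;
    }
    let units = raw_bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    let (mut valid, mut total) = (0usize, 0usize);
    // decode_utf16 yields Err for each unpaired surrogate.
    for item in std::char::decode_utf16(units) {
        total += 1;
        if item.is_ok() {
            valid += 1;
        }
    }
    valid as f32 / total.max(1) as f32
}
```

Both ratios fall in 0.0..=1.0, so they can be scaled onto the confidence scale in whatever way the final algorithm settles on.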
Proposed Confidence Scale: 0.0 to 10.0
- 9.0-10.0: High confidence (clean ASCII/UTF-8, properly aligned UTF-16)
- 7.0-8.9: Medium confidence (some irregularities but valid)
- 5.0-6.9: Low confidence (marginal validity, needs context)
- < 5.0: Very low confidence (likely noise)
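The bands above could be mapped to a type for use in tests and debug output. A sketch, with hypothetical names; it treats the lower bound of each band as a threshold, so a value such as 8.95, which falls between the listed ranges, lands in Medium.

```rust
/// Hypothetical labels for the proposed confidence bands.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceBand {
    High,
    Medium,
    Low,
    VeryLow,
}

/// Maps a 0.0 to 10.0 confidence score to its band.
/// NaN fails every comparison and therefore maps to VeryLow.
pub fn classify_confidence(score: f32) -> ConfidenceBand {
    if score >= 9.0 {
        ConfidenceBand::High
    } else if score >= 7.0 {
        ConfidenceBand::Medium
    } else if score >= 5.0 {
        ConfidenceBand::Low
    } else {
        ConfidenceBand::VeryLow
    }
}
```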
5.5: Section Weight Integration
Integrate existing section weights into the string scoring system:
- Add section weight to the FoundString scoring calculation
- Apply section weight as a multiplier or additive component
- Ensure strings from high-weight sections (e.g., .rodata, __cstring) rank higher
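To illustrate the ranking effect, here is a sketch that uses the additive form from the Score formula. ScoredString is a hypothetical, trimmed-down stand-in for FoundString, and any weights passed in are illustrative values rather than the ones computed in src/container.

```rust
/// Hypothetical stand-in for FoundString; the real struct has more fields.
#[derive(Debug, Clone)]
pub struct ScoredString {
    pub text: String,
    pub section: String,
    pub score: i32,
}

/// Additive combination, matching Score = SectionWeight + EncodingConfidence + ...
pub fn additive_score(section_weight: f32, encoding_confidence: f32) -> i32 {
    (section_weight + encoding_confidence).round() as i32
}

/// Orders strings from highest to lowest score.
/// The sort is stable, so equal scores keep their extraction order.
pub fn rank_by_score(strings: &mut [ScoredString]) {
    strings.sort_by(|a, b| b.score.cmp(&a.score));
}
```

With equal encoding confidence, a string from a section weighted 8.0 outranks one from a section weighted 3.0, which is the behavior the last bullet asks for.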
Proposed Implementation
1. Encoding Confidence Module (src/scoring/encoding_confidence.rs)
pub fn calculate_encoding_confidence(
    text: &str,
    encoding: Encoding,
    raw_bytes: &[u8],
) -> f32 {
    match encoding {
        Encoding::Ascii => score_ascii_confidence(text, raw_bytes),
        Encoding::Utf8 => score_utf8_confidence(text, raw_bytes),
        Encoding::Utf16Le | Encoding::Utf16Be => score_utf16_confidence(text, raw_bytes, encoding),
    }
}
fn score_ascii_confidence(text: &str, raw_bytes: &[u8]) -> f32 {
    let printable_count = text
        .chars()
        .filter(|c| c.is_ascii_graphic() || c.is_ascii_whitespace())
        .count();
    let total_chars = text.chars().count();
    // max(1) avoids a division by zero on empty input
    let printable_ratio = printable_count as f32 / total_chars.max(1) as f32;
    // Strings without a null terminator are scaled by 0.95 and capped at 9.5
    let null_terminated = raw_bytes.last() == Some(&0);
    let base_score = printable_ratio * 10.0;
    if null_terminated {
        base_score.min(10.0)
    } else {
        (base_score * 0.95).min(9.5)
    }
}
2. String Scoring Integration (src/scoring/mod.rs)
pub fn calculate_string_score(
    string: &FoundString,
    section_weight: f32,
    encoding_confidence: f32,
) -> i32 {
    let base_score = section_weight + encoding_confidence;
    // Round to integer for storage
    base_score.round() as i32
}
3. Update String Extraction
Modify string extraction logic to:
- Calculate encoding confidence when creating FoundString
- Look up section weight from ContainerInfo
- Compute and assign final score
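The three steps could look roughly like this. The structs below are minimal hypothetical stand-ins so the sketch compiles on its own; the real SectionInfo, ContainerInfo and FoundString have more fields, and both the section_weight lookup helper and the 0.0 fallback for unknown sections are assumptions.

```rust
// Hypothetical minimal stand-ins for the project's types.
pub struct SectionInfo {
    pub name: String,
    pub weight: f32,
}

pub struct ContainerInfo {
    pub sections: Vec<SectionInfo>,
}

pub struct FoundString {
    pub text: String,
    pub section: Option<String>,
    pub score: i32,
}

impl ContainerInfo {
    /// Weight of the named section, or 0.0 when the section is unknown (assumed fallback).
    pub fn section_weight(&self, name: &str) -> f32 {
        self.sections
            .iter()
            .find(|s| s.name == name)
            .map_or(0.0, |s| s.weight)
    }
}

/// Looks up the section weight, adds the encoding confidence, and stores the rounded sum.
pub fn assign_score(found: &mut FoundString, container: &ContainerInfo, encoding_confidence: f32) {
    let weight = found
        .section
        .as_deref()
        .map_or(0.0, |name| container.section_weight(name));
    found.score = (weight + encoding_confidence).round() as i32;
}
```

Strings with no section (for example, data found outside any mapped section) fall back to the encoding confidence alone in this sketch; whether that is the desired behavior is an open design choice.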
Testing Requirements
Unit Tests
- test_ascii_confidence_scoring: Various ASCII string qualities
- test_utf8_confidence_scoring: Valid/invalid UTF-8 sequences
- test_utf16_confidence_scoring: Aligned/misaligned UTF-16 data
- test_section_weight_integration: Score calculation with different section weights
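As a starting point, the first of these tests might look like the sketch below. It repeats the proposed ASCII scorer so the block compiles on its own; the chosen inputs and the tolerance are assumptions, not agreed test cases.

```rust
// Copy of the proposed ASCII scorer so the test below is self-contained.
fn score_ascii_confidence(text: &str, raw_bytes: &[u8]) -> f32 {
    let printable_count = text
        .chars()
        .filter(|c| c.is_ascii_graphic() || c.is_ascii_whitespace())
        .count();
    let total_chars = text.chars().count();
    let printable_ratio = printable_count as f32 / total_chars.max(1) as f32;
    let null_terminated = raw_bytes.last() == Some(&0);
    let base_score = printable_ratio * 10.0;
    if null_terminated {
        base_score.min(10.0)
    } else {
        (base_score * 0.95).min(9.5)
    }
}

#[cfg(test)]
mod tests {
    use super::score_ascii_confidence;

    // Floating-point results are compared with a small tolerance.
    fn close(a: f32, b: f32) -> bool {
        (a - b).abs() < 1e-3
    }

    #[test]
    fn test_ascii_confidence_scoring() {
        // Clean, null-terminated ASCII reaches the top of the scale.
        assert!(close(score_ascii_confidence("hello", b"hello\0"), 10.0));
        // Missing terminator: scaled by 0.95 and capped at 9.5.
        assert!(close(score_ascii_confidence("hello", b"hello"), 9.5));
        // Half of the characters are control characters.
        assert!(close(score_ascii_confidence("a\u{1}b\u{2}", b"a\x01b\x02\0"), 5.0));
        // Empty input must not divide by zero.
        assert!(close(score_ascii_confidence("", b""), 0.0));
    }
}
```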
Integration Tests
- Extract strings from real binaries and verify scoring
- High-weight sections should produce higher-scored strings
- Clean encodings should score higher than noisy ones
Acceptance Criteria
- Encoding confidence scoring implemented for all encoding types
- Section weight integrated into string scoring
- FoundString.score field populated during extraction
- Unit tests achieve >90% coverage for scoring logic
- Integration tests verify end-to-end scoring behavior
- Documentation updated with scoring algorithm details
Dependencies
- Blocked by: Ranking System Foundation
- Blocks: Semantic Tagging (#TBD), Noise Penalty Calculation (#TBD)
References
- Architecture: concept.md - Ranking Algorithm section
- Related code: src/container/{pe,elf,macho}.rs - Section weight calculation
- Requirements: 5.1 (Encoding Confidence), 5.5 (Section Weight Integration)
Task-ID: stringy-analyzer/section-weight-scoring