Implement Symbol Classification and Import/Export Tagging #20

@unclesp1d3r

Summary

Implement the classification layer for import/export symbols and section names, converting parsed binary metadata into tagged, scored FoundString objects for analysis.

Current State

Container Parsers Implemented: All three parsers (ELF, PE, Mach-O) successfully extract import/export metadata:

  • ImportInfo: Symbol names, library names, addresses
  • ExportInfo: Symbol names, addresses, ordinals
  • Section metadata with full classification

🚧 Classification Module Empty: src/classification/mod.rs contains only a comment. The symbol processing logic still needs to be implemented.

Requirements

4.2: Import Name Identification and Tagging

  • Convert ImportInfo objects into FoundString with Tag::Import
  • Set source: StringSource::ImportName
  • Apply semantic classification (e.g., crypto APIs, network APIs)
  • Boost relevance scores for security-relevant imports

4.3: Export Name Identification and Tagging

  • Convert ExportInfo objects into FoundString with Tag::Export
  • Set source: StringSource::ExportName
  • Demangle Rust symbols using rustc-demangle
  • Identify entry points and special exports

4.4: Section Name Processing and Classification

  • Treat section names as high-value strings
  • Tag with relevant semantic categories
  • Use section names to provide context for other strings

Proposed Solution

Implementation Structure

Create a src/classification/symbols.rs module with:

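use std::collections::HashSet;

pub struct SymbolClassifier {
    crypto_apis: HashSet<String>,
    network_apis: HashSet<String>,
    file_apis: HashSet<String>,
}

// Proposed signatures; bodies are left as todo!() until implemented.
impl SymbolClassifier {
    pub fn new() -> Self { todo!() }
    pub fn process_imports(&self, imports: &[ImportInfo]) -> Vec<FoundString> { todo!() }
    pub fn process_exports(&self, exports: &[ExportInfo]) -> Vec<FoundString> { todo!() }
    pub fn classify_symbol(&self, name: &str) -> Vec<Tag> { todo!() }
    pub fn demangle_rust_symbol(name: &str) -> Option<String> { todo!() }
}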
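
To make the intended behaviour of classify_symbol concrete, here is a minimal sketch. It assumes the three HashSets hold exact API names and that the wildcard patterns listed under Key Features (EVP_*, crypto_*, WSA*, curl_*) are handled as prefix checks. The Tag variants Crypto, Network, and FileIo and the seed names are illustrative assumptions, not the project's actual definitions.

```rust
use std::collections::HashSet;

/// Semantic tags; this variant set is an assumption for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Tag {
    Crypto,
    Network,
    FileIo,
}

pub struct SymbolClassifier {
    crypto_apis: HashSet<String>,
    network_apis: HashSet<String>,
    file_apis: HashSet<String>,
}

/// Helper (hypothetical) that builds an owned set from string literals.
fn to_set(names: &[&str]) -> HashSet<String> {
    names.iter().map(|n| n.to_string()).collect()
}

impl SymbolClassifier {
    pub fn new() -> Self {
        Self {
            crypto_apis: to_set(&["CryptEncrypt", "AES_encrypt"]),
            network_apis: to_set(&["socket", "connect"]),
            file_apis: to_set(&["open", "fopen"]),
        }
    }

    /// Returns every semantic tag whose exact-name set or prefix pattern matches.
    pub fn classify_symbol(&self, name: &str) -> Vec<Tag> {
        let mut tags = Vec::new();
        if self.crypto_apis.contains(name)
            || name.starts_with("EVP_")
            || name.starts_with("crypto_")
        {
            tags.push(Tag::Crypto);
        }
        if self.network_apis.contains(name)
            || name.starts_with("WSA")
            || name.starts_with("curl_")
        {
            tags.push(Tag::Network);
        }
        // Windows import tables list CreateFileA / CreateFileW, so match by prefix.
        if self.file_apis.contains(name) || name.starts_with("CreateFile") {
            tags.push(Tag::FileIo);
        }
        tags
    }
}
```

With this shape, classify_symbol("WSAStartup") yields the network tag and an unknown name such as printf yields an empty list, so callers can still apply Tag::Import or Tag::Export on their own.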

Key Features

  1. Import Processing

    • Convert each ImportInfo to FoundString
    • Base score: +20 points (high value)
    • Additional tags based on API name patterns:
      • Crypto: CryptEncrypt, EVP_*, crypto_*
      • Network: socket, connect, WSA*, curl_*
      • File I/O: CreateFile, open, fopen
    • Preserve library information in metadata
  2. Export Processing

    • Convert each ExportInfo to FoundString
    • Attempt Rust symbol demangling
    • Identify special exports:
      • Entry points: main, DllMain, _start
      • Mangled Rust functions
    • Base score: +15 points
  3. Symbol Demangling

    • Use rustc-demangle crate for Rust symbols
    • Preserve both mangled and demangled forms
    • Add context-specific tags (panic, alloc, etc.)
  4. Section Name Processing

    • Extract section names from SectionInfo
    • High relevance score (+10)
    • Use for contextualizing other strings in the same section
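
The scoring and entry-point rules above can be sketched as follows. The base scores come from the list. The per-tag boost of 5 is an assumption, chosen only because it reproduces the score of 25 shown for a crypto import under Example Output. SymbolKind, score, and is_entry_point are hypothetical names.

```rust
/// Base scores taken from the Key Features list.
const IMPORT_BASE: i32 = 20;
const EXPORT_BASE: i32 = 15;
const SECTION_BASE: i32 = 10;
/// Assumed boost per semantic tag (not specified in the issue).
const SEMANTIC_BOOST: i32 = 5;

/// Where a symbol string came from (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymbolKind {
    Import,
    Export,
    Section,
}

/// Additive scoring: base score for the kind plus one boost per semantic tag.
pub fn score(kind: SymbolKind, semantic_tag_count: usize) -> i32 {
    let base = match kind {
        SymbolKind::Import => IMPORT_BASE,
        SymbolKind::Export => EXPORT_BASE,
        SymbolKind::Section => SECTION_BASE,
    };
    base + SEMANTIC_BOOST * semantic_tag_count as i32
}

/// Exact-name check for the entry points named in the list.
pub fn is_entry_point(name: &str) -> bool {
    matches!(name, "main" | "DllMain" | "_start")
}
```

Keeping the constants in one place makes it easy to retune the boosts later without touching the classification logic.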

API Design

// Main entry point for symbol processing
pub fn extract_symbol_strings(container_info: &ContainerInfo) -> Vec<FoundString> {
    let mut strings = Vec::new();
    
    let classifier = SymbolClassifier::new();
    strings.extend(classifier.process_imports(&container_info.imports));
    strings.extend(classifier.process_exports(&container_info.exports));
    strings.extend(extract_section_names(&container_info.sections));
    
    strings
}
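
The entry point above calls extract_section_names, which the issue does not define. One possible shape is sketched below. The SectionInfo and FoundString structs here are cut-down stand-ins for the project's real types, and the "section" tag string is an assumption. The score of 10 is the section base score from Key Features.

```rust
/// Stand-in for the parser's section metadata (only the name is modelled).
pub struct SectionInfo {
    pub name: String,
}

/// Stand-in for the project's FoundString (fields reduced for illustration).
#[derive(Debug, PartialEq)]
pub struct FoundString {
    pub text: String,
    pub tags: Vec<String>,
    pub score: i32,
}

/// Turns each non-empty section name into a scored FoundString.
pub fn extract_section_names(sections: &[SectionInfo]) -> Vec<FoundString> {
    sections
        .iter()
        // Stripped or malformed binaries can carry empty section names; skip them.
        .filter(|s| !s.name.is_empty())
        .map(|s| FoundString {
            text: s.name.clone(),
            tags: vec!["section".to_string()],
            score: 10,
        })
        .collect()
}
```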

Test Requirements

Create tests/classification_symbols.rs with:

  1. Import Classification Tests

    • Test crypto API detection (CryptEncrypt, AES_encrypt)
    • Test network API detection (socket, WSAStartup)
    • Verify Tag::Import applied correctly
    • Check library attribution preserved
  2. Export Classification Tests

    • Test basic export tagging
    • Test entry point detection (main, DllMain)
    • Verify Tag::Export applied
  3. Rust Demangling Tests

    • Test successful demangling of Rust symbols
    • Verify both forms preserved
    • Test context tag detection (panic, main, etc.)
    • Handle non-Rust symbols gracefully
  4. Section Name Tests

    • Verify section names extracted as strings
    • Check appropriate scoring applied
    • Test all three formats (ELF, PE, Mach-O)

Success Criteria

  • SymbolClassifier implemented in src/classification/symbols.rs
  • All imports converted to FoundString with Tag::Import
  • All exports converted to FoundString with Tag::Export
  • Rust symbol demangling working with rustc-demangle
  • Semantic API classification (crypto, network, file I/O)
  • Section names processed as high-value strings
  • Unit tests achieving >90% coverage
  • Integration tests for all three formats
  • Documentation with examples

Dependencies

  • ✅ Symbol Processing (container parsers extract symbols)
  • 📦 rustc-demangle crate (add to Cargo.toml)
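
Since most symbols in a typical binary are not Rust, a cheap pre-filter can decide whether a name is worth passing to rustc-demangle at all. The sketch below is a heuristic using only the standard library, not a replacement for the crate. It relies on two known properties of Rust mangling: legacy names start with _ZN and end with a 17h hash segment of 16 hex digits followed by E, and v0 names start with _R. The extra leading underscore accounts for Mach-O symbol prefixes. The function names are hypothetical.

```rust
/// Legacy Rust names end with "17h" + 16 hex digits + "E".
fn has_legacy_hash(name: &str) -> bool {
    let bytes = name.as_bytes();
    if bytes.len() < 20 || bytes[bytes.len() - 1] != b'E' {
        return false;
    }
    // The 19 bytes before the final 'E': "17h" followed by 16 hex digits.
    let tail = &bytes[bytes.len() - 20..bytes.len() - 1];
    tail.starts_with(b"17h") && tail[3..].iter().all(|b| b.is_ascii_hexdigit())
}

/// v0 names start with "_R" (or "__R" on Mach-O) followed by a digit or uppercase tag.
fn has_v0_prefix(name: &str) -> bool {
    let rest = name
        .strip_prefix("__R")
        .or_else(|| name.strip_prefix("_R"));
    match rest.and_then(|r| r.bytes().next()) {
        Some(b) => b.is_ascii_uppercase() || b.is_ascii_digit(),
        None => false,
    }
}

/// Heuristic pre-filter: true if the symbol plausibly uses Rust mangling.
pub fn looks_like_rust_symbol(name: &str) -> bool {
    let legacy =
        (name.starts_with("_ZN") || name.starts_with("__ZN")) && has_legacy_hash(name);
    legacy || has_v0_prefix(name)
}
```

A false positive only costs one failed demangling attempt, so the check can stay deliberately loose. C++ symbols that share the _ZN prefix are rejected because they lack the hash suffix.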

Technical Notes

  • Use lazy_static for compiled API pattern sets
  • Preserve all original metadata (addresses, ordinals, library names)
  • Scoring should be additive: base score + semantic boosts (the Example Output score of 25 for a crypto import is consistent with base 20 plus a boost of 5)
  • Handle edge cases: empty names, duplicate symbols, stripped binaries
  • Consider memory efficiency with large symbol tables
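
Two of these notes can be sketched together: a pattern set built once and reused, and a pass that drops empty and duplicate symbol names before conversion. To stay self-contained, this example swaps lazy_static for std::sync::OnceLock from the standard library. The set contents and the names network_apis and unique_nonempty are assumptions for illustration.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

/// Built on first use and shared afterwards (stand-in for a lazy_static set).
static NETWORK_APIS: OnceLock<HashSet<&'static str>> = OnceLock::new();

pub fn network_apis() -> &'static HashSet<&'static str> {
    NETWORK_APIS.get_or_init(|| ["socket", "connect"].iter().copied().collect())
}

/// Drops empty names and repeats, keeping the first occurrence in input order.
/// Borrows the names instead of cloning them, which helps with large symbol tables.
pub fn unique_nonempty<'a>(names: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    names
        .iter()
        .copied()
        // HashSet::insert returns false for a name that was already seen.
        .filter(|n| !n.is_empty() && seen.insert(*n))
        .collect()
}
```

A stripped binary with no symbols simply produces an empty input slice here, so the function returns an empty Vec without special handling.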

Example Output

{
  "text": "CryptEncrypt",
  "encoding": "Ascii",
  "offset": 0,
  "section": ".idata",
  "tags": ["import", "crypto"],
  "score": 25,
  "source": "ImportName"
}

References

  • Architecture docs: docs/src/architecture.md (Symbol Classification section)
  • Classification docs: docs/src/classification.md (Symbol Classifier implementation)
  • Concept doc: concept.md (Demangling requirements)
  • Related: Container parsing implementation (✅ complete)
