Implement Main String Extraction Orchestrator and Pipeline Integration #36

@unclesp1d3r

Description

Summary

Create the main extraction orchestrator that serves as the primary API for analyzing binaries and extracting meaningful strings. This orchestrator will wire together all components (format detection, parsing, extraction, classification, and ranking) into a cohesive pipeline.

Background

StringyMcStringFace aims to be a smarter alternative to the standard strings command by being data-structure aware, section-aware, and semantically intelligent. The foundation is solid with working binary format parsers (PE, ELF, Mach-O) via goblin, but the core extraction pipeline that coordinates all components needs to be implemented.

Current Status:

  • ✅ Format detection (container::detect_format)
  • ✅ Container parsers (ContainerParser trait with PE/ELF/Mach-O implementations)
  • ✅ Type definitions (FoundString, ContainerInfo, etc.)
  • ❌ String extraction engine (empty extraction/mod.rs)
  • ❌ Classification/tagging system (empty classification/mod.rs)
  • ❌ Ranking/scoring algorithm
  • ❌ Main orchestrator API

Proposed Solution

Architecture

Create a StringAnalyzer orchestrator in src/lib.rs that provides a clean public API:

```rust
pub struct StringAnalyzer {
    min_length: usize,
    encodings: Vec<Encoding>,
    // ... configuration
}

impl StringAnalyzer {
    pub fn new() -> Self { /* ... */ }

    pub fn analyze(&self, data: &[u8]) -> Result<Vec<FoundString>> {
        // 1. Detect format
        // 2. Parse container metadata
        // 3. Extract strings from prioritized sections
        // 4. Classify/tag strings
        // 5. Score/rank strings
        // 6. Return sorted results
    }
}
```

Implementation Plan

Phase 1: Core Extraction (extraction/mod.rs)

  • Implement StringExtractor trait with methods:
    • extract_ascii_utf8(data: &[u8], offset: usize) -> Vec<FoundString>
    • extract_utf16le(data: &[u8], offset: usize) -> Vec<FoundString>
    • extract_utf16be(data: &[u8], offset: usize) -> Vec<FoundString>
  • Section-aware extraction that respects SectionInfo boundaries
  • Configurable minimum length (default 4)
  • Track source (section name, offset, RVA)
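A minimal sketch of the ASCII scanner described above. `FoundString` is reduced to a `(text, offset)` pair and the section/RVA plumbing is omitted, so the names here are illustrative, not the final API:

```rust
// Scan a byte slice for runs of printable ASCII of at least `min_len`
// characters, tracking each run's offset relative to `base_offset`
// (e.g. a section's file offset).
fn extract_ascii(data: &[u8], base_offset: usize, min_len: usize) -> Vec<(String, usize)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in data.iter().enumerate() {
        let printable = (0x20..=0x7e).contains(&b) || b == b'\t';
        match (printable, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                if i - s >= min_len {
                    out.push((String::from_utf8_lossy(&data[s..i]).into_owned(), base_offset + s));
                }
                start = None;
            }
            _ => {}
        }
    }
    // Flush a run that reaches the end of the buffer.
    if let Some(s) = start {
        if data.len() - s >= min_len {
            out.push((String::from_utf8_lossy(&data[s..]).into_owned(), base_offset + s));
        }
    }
    out
}

fn main() {
    let buf = b"\x00\x01hello\x00hi\x00world!\x00";
    for (s, off) in extract_ascii(buf, 0x1000, 4) {
        println!("{off:#x}: {s}"); // "hi" is dropped by the min-length filter
    }
}
```

The UTF-16LE/BE variants would walk the buffer in 2-byte steps with the same run-length logic; multi-byte UTF-8 sequences need a proper validity check rather than this printable-ASCII test.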

Phase 2: Classification System (classification/mod.rs)

  • Implement StringClassifier with pattern matching for:
    • URLs (http://, https://, ftp://)
    • Domains (DNS patterns)
    • IP addresses (IPv4/IPv6)
  • File paths (Unix: /, Windows: C:\)
    • Registry keys (HKEY_*)
    • GUIDs ({xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx})
    • User agents
    • Format strings (%s, %d, {})
    • Base64 patterns
    • Crypto constants (known algorithm identifiers)
  • Return Vec<Tag> for each string
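A hedged sketch of what the tagger could look like for a few of the categories above. The real implementation would likely use the `regex` crate from the dependency list; this version sticks to stdlib string checks for illustration, and the tag names are placeholders:

```rust
// Assign zero or more semantic tags to a candidate string.
fn classify(s: &str) -> Vec<&'static str> {
    let mut tags = Vec::new();
    if s.starts_with("http://") || s.starts_with("https://") || s.starts_with("ftp://") {
        tags.push("url");
    }
    if s.starts_with("HKEY_") {
        tags.push("registry");
    }
    // {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}: 32 hex digits, 4 dashes, 2 braces.
    if s.len() == 38
        && s.starts_with('{')
        && s.ends_with('}')
        && s.bytes().filter(|&b| b == b'-').count() == 4
    {
        tags.push("guid");
    }
    if s.contains("%s") || s.contains("%d") {
        tags.push("format-string");
    }
    tags
}

fn main() {
    println!("{:?}", classify("https://example.com"));
    println!("{:?}", classify("HKEY_LOCAL_MACHINE\\Software"));
}
```

Returning a `Vec` (rather than a single tag) matters because strings can match multiple categories, e.g. a URL that is also a format string.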

Phase 3: Ranking Algorithm

  • Implement scoring system based on:
    • Section type priority (.rodata/.rdata > .data > .text)
    • Tag relevance (URL/domain/GUID = high, format string = medium)
    • String length (longer = more informative, up to a point)
    • Encoding confidence
    • Import/export name bonus
  • Score range: 0-100
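The factors above could combine along these lines; the weights here are illustrative stand-ins, not the final tuning:

```rust
// Combine section priority, best tag relevance, and capped length into
// a 0-100 score. Weights are placeholders for discussion.
fn score(section: &str, tags: &[&str], len: usize) -> u32 {
    let section_score = match section {
        ".rodata" | ".rdata" => 40,
        ".data" => 25,
        ".text" => 10,
        _ => 15,
    };
    // Take the single most relevant tag rather than summing, so a string
    // with many weak tags doesn't outrank one strong indicator.
    let tag_score: u32 = tags
        .iter()
        .map(|t| match *t {
            "url" | "domain" | "guid" => 30,
            "format-string" => 10,
            _ => 5,
        })
        .max()
        .unwrap_or(0);
    // Length helps, but only "up to a point": cap its contribution.
    let len_score = (len as u32).min(30);
    (section_score + tag_score + len_score).min(100)
}

fn main() {
    println!("{}", score(".rodata", &["url"], 20));
    println!("{}", score(".text", &[], 4));
}
```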

Phase 4: Main Orchestrator

  • Wire components together in StringAnalyzer
  • Proper error handling with context propagation
  • Configuration options (min length, encoding filters, tag filters)
  • Memory-efficient processing for large binaries
  • Integration with main.rs CLI
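For the configuration options, a builder-style API on the struct sketched earlier would keep `new()` defaults sensible while letting the CLI override them. The method names below are suggestions, and the `Encoding` enum variants are assumed:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Encoding { Ascii, Utf16Le, Utf16Be }

struct StringAnalyzer {
    min_length: usize,
    encodings: Vec<Encoding>,
}

impl StringAnalyzer {
    // Defaults: min length 4 (per Phase 1), all encodings enabled.
    fn new() -> Self {
        Self {
            min_length: 4,
            encodings: vec![Encoding::Ascii, Encoding::Utf16Le, Encoding::Utf16Be],
        }
    }
    fn min_length(mut self, n: usize) -> Self {
        self.min_length = n;
        self
    }
    fn encodings(mut self, e: Vec<Encoding>) -> Self {
        self.encodings = e;
        self
    }
}

fn main() {
    let analyzer = StringAnalyzer::new()
        .min_length(6)
        .encodings(vec![Encoding::Ascii]);
    println!("min_length = {}", analyzer.min_length);
}
```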

Pipeline Flow

Binary Input
    ↓
Format Detection (container::detect_format)
    ↓
Parser Creation (container::create_parser)
    ↓
Container Parsing (parser.parse())
    ↓
Section Prioritization (by SectionType)
    ↓
String Extraction (extraction module)
    ↓
Classification/Tagging (classification module)
    ↓
Scoring/Ranking
    ↓
Sorted Results (Vec<FoundString>)
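The flow above, end to end, with every stage stubbed to a toy version so the shape is visible in one place. Format detection, parsing, classification, and scoring here are deliberate oversimplifications of the real modules:

```rust
// Toy FoundString with just enough fields for the demo.
struct Found { text: String, section: String, score: u32 }

fn analyze(data: &[u8]) -> Vec<Found> {
    // 1-2. Format detection + container parsing stubbed: pretend the whole
    //      buffer is one ".rodata" section.
    let sections = vec![(".rodata".to_string(), 0usize, data.len())];
    let mut results = Vec::new();
    // 3. Extract printable-ASCII runs of length >= 4 from each section.
    for (name, start, end) in sections {
        let mut run: Vec<u8> = Vec::new();
        for &b in &data[start..end] {
            if (0x20..=0x7e).contains(&b) {
                run.push(b);
            } else {
                flush(&mut run, &name, &mut results);
            }
        }
        flush(&mut run, &name, &mut results);
    }
    // 6. Highest score first.
    results.sort_by(|a, b| b.score.cmp(&a.score));
    results
}

fn flush(run: &mut Vec<u8>, section: &str, out: &mut Vec<Found>) {
    if run.len() >= 4 {
        let text = String::from_utf8_lossy(run).into_owned();
        // 4-5. Classification and scoring collapsed into one toy rule.
        let score = if text.starts_with("http") { 90 } else { 50 };
        out.push(Found { text, section: section.to_string(), score });
    }
    run.clear();
}

fn main() {
    let data = b"\x00https://example.com\x00padding\x00ab\x00";
    for f in analyze(data) {
        println!("{:3} [{}] {}", f.score, f.section, f.text);
    }
}
```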

Implementation Requirements

  1. Error Handling: Use StringyError throughout with proper context
  2. Testing: Unit tests for each component, integration test for full pipeline
  3. Performance: Process large binaries (100MB+) efficiently
  4. Documentation: Doc comments for public API with examples
  5. Integration: Wire into main.rs CLI to replace TODO
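On requirement 1: the crate already defines `StringyError`, so this is only a sketch of the context-propagation idea, with hypothetical variant names:

```rust
use std::fmt;

// Illustrative error shape: each variant carries enough context to
// explain *where* in the pipeline the failure happened.
#[derive(Debug)]
enum StringyError {
    UnknownFormat,
    Parse { context: String },
}

impl fmt::Display for StringyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StringyError::UnknownFormat => write!(f, "unrecognized binary format"),
            StringyError::Parse { context } => write!(f, "parse error: {context}"),
        }
    }
}

impl std::error::Error for StringyError {}

fn main() {
    let err = StringyError::Parse { context: "ELF section header out of bounds".into() };
    println!("{err}");
}
```

With `std::error::Error` implemented, the orchestrator's `Result<Vec<FoundString>>` propagates cleanly through `?` in `analyze` and in the `main.rs` CLI.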

Dependencies

  • Existing: goblin, bstr, regex, serde
  • May need: Pattern matching crates for classification

Acceptance Criteria

  • StringAnalyzer public API implemented in src/lib.rs
  • String extraction engine working for ASCII/UTF-8 and UTF-16LE/BE
  • Classification system tagging strings with semantic categories
  • Ranking algorithm scoring strings by relevance
  • Full pipeline integration test with real PE/ELF/Mach-O binaries
  • CLI in main.rs calls orchestrator and displays results
  • Documentation and examples in doc comments
  • All tests passing

Related Issues

This is the main integration point that blocks most other features. Once complete, we can add:

  • Output formatters (JSON, YARA, human-readable)
  • Advanced filtering and search
  • Performance optimizations
  • Additional classification patterns

Task-ID: stringy-analyzer/main-extraction-pipeline
