[P0] Standardize cross-run comparability

## Problem
Runs have used different attack counts (34, 14, and 12), so comparing their results directly is misleading.

## Tasks
- [ ] Create canonical attack subsets (e.g., core-14, full-61)
- [ ] Add --attack-set flag to CLI
- [ ] Warn when comparing runs with different attack sets
- [ ] Add comparison tooling that checks both runs used the same attack set
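
A minimal sketch of how the first two tasks could fit together, assuming a Python CLI built on `argparse`. The attack IDs, the `ATTACK_SETS` registry, and the `resolve_attacks` helper are hypothetical placeholders. Treating `core-14` as a subset of `full-61` and defaulting to `core-14` are assumptions as well, not decisions made in this issue.

```python
import argparse

# Hypothetical placeholder IDs; the real attack names are not listed in this issue.
ALL_ATTACKS = [f"attack_{i:02d}" for i in range(1, 62)]

# Canonical, named subsets. Assumption: core-14 is a fixed subset of full-61.
ATTACK_SETS = {
    "core-14": tuple(ALL_ATTACKS[:14]),
    "full-61": tuple(ALL_ATTACKS),
}


def build_parser():
    """Build a parser exposing --attack-set, restricted to the canonical names."""
    parser = argparse.ArgumentParser(description="Run an evaluation (sketch).")
    parser.add_argument(
        "--attack-set",
        choices=sorted(ATTACK_SETS),
        default="core-14",  # assumed default
        help="Name of the canonical attack subset to run.",
    )
    return parser


def resolve_attacks(argv):
    """Return (set name, attack IDs) for the given command-line arguments."""
    args = build_parser().parse_args(argv)
    return args.attack_set, ATTACK_SETS[args.attack_set]
```

Storing the resolved set name alongside each run's results would give later comparison tooling something concrete to check.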

## Acceptance Criteria
- The CLI prevents accidental comparisons between runs that used different attack sets
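
One way this criterion could be enforced is sketched below. It assumes each run record is a dict carrying the list of attack IDs it executed under an `"attacks"` key. The record layout, the `AttackSetMismatch` exception, and the `allow_mismatch` override are all hypothetical. The check fails by default and only downgrades to a warning when the caller opts in explicitly.

```python
import warnings


class AttackSetMismatch(Exception):
    """Raised when two runs were not evaluated on the same attacks."""


def check_alignment(run_a, run_b, allow_mismatch=False):
    """Return the sorted attack IDs that are safe to compare across two runs.

    Each run is assumed to be a dict with an "attacks" list (hypothetical layout).
    """
    a, b = set(run_a["attacks"]), set(run_b["attacks"])
    if a == b:
        return sorted(a)

    shared = sorted(a & b)
    msg = (
        f"Runs used different attack sets: {len(a)} vs {len(b)} attacks, "
        f"{len(shared)} shared."
    )
    if not allow_mismatch:
        # Default: refuse the comparison outright.
        raise AttackSetMismatch(msg)

    # Explicit opt-in: warn and restrict the comparison to the shared attacks.
    warnings.warn(msg)
    return shared
```

Failing by default makes a mismatched comparison a deliberate choice rather than an accident, while the override still allows comparing on the intersection when that is what the user wants.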