
Add selective X-propagation #21

Open
robtaylor wants to merge 12 commits into feature/multi-clock-domains from multivalue


Summary

  • Static analysis marks DFF Q outputs and SRAM read ports as X-sources, then computes their forward cone with fixpoint iteration, classifying ~5% of signals as X-capable
  • Partition-level classification: only X-capable partitions run the dual-lane (value + X-mask) kernel; X-free partitions have zero overhead
  • Full GPU kernel support (Metal + CUDA) with shared memory sideband (shared_state_x[256], shared_writeouts_x[256]) and SRAM X-mask shadow
  • CLI: --xprop flag on loom map and loom sim; VCD output emits Value::X for unknown primary outputs
  • CPU reference kernel + sanity check for GPU validation
  • Criterion benchmarks confirm: X-free ~0% overhead, X-capable ~1.5-1.6x (within 2x budget)

Test plan

  • cargo test passes (unit tests for X-source analysis, CPU kernel correctness)
  • cargo bench --bench xprop runs successfully
  • Map a design with --xprop and verify X-capable pin/partition stats in log output
  • Simulate with --xprop and verify VCD contains X values for uninitialised outputs
  • Simulate without --xprop and verify identical behaviour to baseline (backward compat)
  • Load old .gemparts (without xprop fields) and verify xprop_enabled == false

@robtaylor robtaylor force-pushed the feature/multi-clock-domains branch from 2f71182 to 11bb424 Compare February 23, 2026 23:39
An error occurred while trying to automatically change base from feature/multi-clock-domains to main February 24, 2026 01:38
@robtaylor robtaylor force-pushed the feature/multi-clock-domains branch from fbefda6 to 3b7fc26 Compare February 24, 2026 17:19
The prefix-based `is_sequential_cell()` matched `starts_with("dl")`
which incorrectly classified `dlygate4sd3` (a combinational delay
buffer used for hold-time fixing) as a sequential element. This
inserted a phantom DFF in the logic path, breaking simulation.

Replace prefix matching with an exhaustive table of 32 sequential
cells derived from the PDK by grepping for `udp_dff`/`udp_dlatch`
primitives in behavioral Verilog models.

Regression introduced in d11e914.

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
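
The misclassification described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: `base_cell_name`, `is_sequential_by_prefix`, and `is_sequential_by_table` are hypothetical helpers, the `"df"` prefix is an assumption, and `SEQUENTIAL_CELLS` is a small illustrative subset rather than the 32-entry table derived from the PDK.

```rust
/// Strip a SKY130-style library prefix and drive-strength suffix,
/// e.g. "sky130_fd_sc_hd__dfxtp_1" -> "dfxtp". (Hypothetical helper.)
fn base_cell_name(cell: &str) -> &str {
    // Take the part after the last "__" (or the whole name if there is none).
    let name = cell.rsplit("__").next().unwrap_or(cell);
    match name.rfind('_') {
        // Drop a trailing "_<digits>" drive-strength suffix.
        Some(i) if i + 1 < name.len()
            && name[i + 1..].chars().all(|c| c.is_ascii_digit()) =>
        {
            &name[..i]
        }
        _ => name,
    }
}

/// Prefix heuristic in the spirit of the old check: anything starting
/// with "dl" (or, by assumption, "df") is treated as sequential.
fn is_sequential_by_prefix(cell: &str) -> bool {
    let base = base_cell_name(cell);
    base.starts_with("df") || base.starts_with("dl")
}

/// Illustrative subset only; the real table lists 32 cells.
const SEQUENTIAL_CELLS: &[&str] = &["dfxtp", "dfrtp", "dfstp", "dlxtp", "dlrtp"];

/// Exact-match lookup: a cell is sequential only if it is listed.
fn is_sequential_by_table(cell: &str) -> bool {
    let base = base_cell_name(cell);
    SEQUENTIAL_CELLS.iter().any(|&c| c == base)
}
```

With these definitions, `sky130_fd_sc_hd__dlygate4sd3_1` is sequential according to the prefix check but not according to the table, while a genuine flop such as `sky130_fd_sc_hd__dfxtp_1` is accepted by both.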
Covers the full process from library detection through cell
classification, behavioral model parsing, AIG decomposition, and
testing. Based on the SKY130 enablement experience, including the
dlygate4sd3 misclassification pitfall.

Co-developed-by: Claude Code v2.1.44 (claude-opus-4-6)
An error occurred while trying to automatically change base from feature/multi-clock-domains to main February 24, 2026 17:31
Proposes a compile-time static analysis approach to identify X-capable
signals in the AIG, enabling mixed two-state/four-state simulation.
Only partitions with genuinely unknown signals pay the ~2.3x ALU
overhead; the rest continue at full two-state speed.

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Add compute_x_sources() and compute_x_capable_pins() to AIG for
identifying signals that can carry unknown (X) values during simulation.

Algorithm: mark DFF Q outputs and SRAM read data ports as X sources,
forward-propagate through AND gates (leveraging topological order),
then fixpoint-iterate through DFF feedback loops until convergence.

This is Stage 1 of the selective X-propagation feature: compile-time
analysis that identifies the ~5% of signals needing X-aware simulation.

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)
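
The analysis in this commit can be sketched roughly as follows. The `Node` enum and `compute_x_capable` are illustrative stand-ins, not the actual AIG types or the signature of `compute_x_capable_pins()`. The sketch also assumes that some DFFs may be left out of the initial source set (a hypothetical `x_source` flag), which is what gives the fixpoint loop work to do; if every DFF is a source, the loop converges after a single pass.

```rust
/// Simplified AIG node. AND operands always have lower indices than the
/// node itself (topological order); a DFF's `d` input may point forward.
#[derive(Clone, Copy)]
enum Node {
    Input,
    And(usize, usize),
    Dff { d: usize, x_source: bool },
    SramRead,
}

/// Returns one flag per node: true if the node may carry X at runtime.
fn compute_x_capable(nodes: &[Node]) -> Vec<bool> {
    // Seed: SRAM read ports and DFFs flagged as sources start out X-capable.
    let mut x: Vec<bool> = nodes
        .iter()
        .map(|n| matches!(*n, Node::SramRead | Node::Dff { x_source: true, .. }))
        .collect();

    loop {
        let mut changed = false;
        // One forward sweep in topological order.
        for (i, n) in nodes.iter().enumerate() {
            let now = match *n {
                // An AND output can be X if either operand can be X.
                Node::And(a, b) => x[a] || x[b],
                // A DFF inherits X-capability from its D input (feedback).
                Node::Dff { d, .. } => x[i] || x[d],
                Node::Input | Node::SramRead => x[i],
            };
            if now && !x[i] {
                x[i] = true;
                changed = true;
            }
        }
        // Flags only ever go false -> true, so this terminates.
        if !changed {
            return x;
        }
    }
}
```

Edge inversions are omitted because inverting a signal does not change whether it can be X.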
Add xprop_enabled, partition_x_capable, and xprop_state_offset fields
to FlattenedScriptV1. Add from_with_xprop() constructor that accepts
x_capable_pins from AIG analysis and classifies partitions.

Classification: a partition is X-capable if any of its AIG pins are
X-capable, with fixpoint propagation for inter-partition reads.
Metadata words 8 (is_x_capable) and 9 (xmask_state_offset) are
patched into each partition's script for GPU kernel consumption.

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)
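
A rough sketch of the partition classification rule, under stated assumptions: `Partition` and `classify_partitions` are hypothetical, and the sketch conservatively treats any read from an X-capable partition as making the reader X-capable. The real `from_with_xprop()` may track this at a finer grain.

```rust
/// Hypothetical, simplified view of a partition.
struct Partition {
    /// AIG pin indices evaluated by this partition.
    pins: Vec<usize>,
    /// Indices of partitions whose outputs this partition reads.
    reads_from: Vec<usize>,
}

/// Returns one flag per partition: true if it needs the dual-lane kernel.
fn classify_partitions(parts: &[Partition], x_capable_pins: &[bool]) -> Vec<bool> {
    // Step 1: a partition is X-capable if any of its own pins is.
    let mut flags: Vec<bool> = parts
        .iter()
        .map(|p| p.pins.iter().any(|&pin| x_capable_pins[pin]))
        .collect();

    // Step 2: propagate across inter-partition reads until nothing changes.
    loop {
        let mut changed = false;
        for (i, p) in parts.iter().enumerate() {
            if !flags[i] && p.reads_from.iter().any(|&src| flags[src]) {
                flags[i] = true;
                changed = true;
            }
        }
        if !changed {
            return flags;
        }
    }
}
```

The resulting flag is what would end up in metadata word 8 (`is_x_capable`) of each partition's script.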
Implement simulate_block_v1_xprop() that mirrors the standard kernel but
tracks a parallel X-mask sideband through global reads, boomerang stages,
SRAM, clock enable gating, and DFF writeout. Non-X-capable partitions
delegate to the standard kernel with zero overhead.

Also adds sanity_check_cpu_xprop() for GPU validation and 13 unit tests
covering the AND gate X-prop formula, DFF clock-enable behavior, SRAM
X-mask semantics, and end-to-end kernel execution.

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)
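
For readers unfamiliar with dual-lane evaluation, the following sketch shows one common way to compute a bit-parallel AND over a value word and an X-mask word. It is an assumption that the kernel uses this exact formula and encoding; here an X bit is represented as `xmask = 1` with the value bit forced to 0, and `and_x` / `not_x` are illustrative names.

```rust
/// Bit-parallel three-valued AND on 32 lanes.
/// Inputs and outputs are (value, xmask) pairs.
fn and_x(a_v: u32, a_x: u32, b_v: u32, b_x: u32) -> (u32, u32) {
    // A known 0 on either input forces a known 0 on the output,
    // regardless of whether the other input is X.
    let known_zero = (!a_v & !a_x) | (!b_v & !b_x);
    // Otherwise the output is X if either input is X.
    let out_x = (a_x | b_x) & !known_zero;
    // Keep the value lane canonical: 0 wherever the result is X.
    let out_v = a_v & b_v & !out_x;
    (out_v, out_x)
}

/// Inversion flips known bits and leaves X bits as X.
fn not_x(v: u32, x: u32) -> (u32, u32) {
    (!v & !x, x)
}
```

The second lane roughly doubles the ALU work per gate, which is why only X-capable partitions take this path.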
Wire selective X-propagation through the CLI and simulation pipeline:
- Add --xprop flag to `loom map` (informational X-analysis report)
- Add --xprop flag to `loom sim` (enables X-prop in FlattenedScript)
- Add xprop field to DesignArgs; load_design runs compute_x_capable_pins
  and builds script via from_with_xprop when enabled
- Add write_output_vcd_xprop to emit Value::X for X-masked output signals
- Log X-capable partition count and warn about X transitions in VCD

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)
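
The output decoding step can be pictured as below. This is only a sketch: the local `Value` enum is a stand-in for whatever VCD value type the project uses, and `decode_output` with its word/bit addressing is a hypothetical helper rather than part of `write_output_vcd_xprop`.

```rust
/// Stand-in for a VCD scalar value type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Value {
    V0,
    V1,
    X,
}

/// Decode the output bit at position `pos` from packed 32-bit words.
/// The X-mask takes priority: a masked bit is reported as X whatever
/// the value lane holds.
fn decode_output(values: &[u32], xmask: &[u32], pos: usize) -> Value {
    let (word, bit) = (pos / 32, pos % 32);
    if (xmask[word] >> bit) & 1 == 1 {
        Value::X
    } else if (values[word] >> bit) & 1 == 1 {
        Value::V1
    } else {
        Value::V0
    }
}
```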
Dual-lane X-mask tracking through all kernel phases: global read,
boomerang shuffle/reduction (hier[0]-[12]), writeout hooks,
SRAM/duplicate permutation, SRAM commit with X-mask shadow,
clock enable permutation, and DFF writeout with X-aware gating.

X-free partitions (metadata word 8 == 0) execute unchanged code
path with zero overhead — the is_x_capable branch is uniform per
threadgroup so there is no warp/SIMD divergence.

Shared memory additions: shared_state_x[256] and shared_writeouts_x[256]
(+2KB, well within 32KB threadgroup memory limit).

Updates all Rust dispatch code (loom.rs, metal_test.rs, cuda_test.rs,
cuda_dummy_test.rs) to allocate and pass sram_xmask buffers.

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)
Fix critical state buffer sizing: when xprop is enabled, the GPU kernel
expects a doubled state buffer (values + X-mask) per cycle. Add
`effective_state_size()` method and `expand_states_for_xprop()` /
`split_xprop_states()` helpers for correct buffer management.

State buffer layout: [values (reg_io_state_size) | xmask (reg_io_state_size)]
per cycle. X-mask initialized to 0xFFFFFFFF for DFF positions (unknown)
and 0 for primary input positions (known from VCD).

Post-simulation diagnostics:
- First-cycle-X-free detection with log message
- Warning when X values persist at primary outputs at final cycle
- CPU sanity check now uses xprop variant when enabled

Wire xprop through all dispatch paths (loom sim Metal/CUDA,
metal_test, cuda_test, cuda_dummy_test) with effective_state_size.

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)
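
The doubled per-cycle layout described in this commit can be illustrated with a small roundtrip. The functions `expand_states` and `split_states` below are simplified stand-ins with assumed signatures, not the real `expand_states_for_xprop()` / `split_xprop_states()`; the sketch also assumes the same X-mask template is applied to every cycle.

```rust
/// Interleave per-cycle value words with an X-mask template:
/// [values (n words) | xmask (n words)] for each cycle.
fn expand_states(values: &[u32], words_per_cycle: usize, xmask_template: &[u32]) -> Vec<u32> {
    assert!(words_per_cycle > 0);
    assert_eq!(values.len() % words_per_cycle, 0);
    assert_eq!(xmask_template.len(), words_per_cycle);
    let mut out = Vec::with_capacity(values.len() * 2);
    for cycle in values.chunks(words_per_cycle) {
        out.extend_from_slice(cycle);
        out.extend_from_slice(xmask_template);
    }
    out
}

/// Inverse of `expand_states`: separate the value and X-mask halves.
fn split_states(expanded: &[u32], words_per_cycle: usize) -> (Vec<u32>, Vec<u32>) {
    assert!(words_per_cycle > 0);
    assert_eq!(expanded.len() % (2 * words_per_cycle), 0);
    let mut values = Vec::with_capacity(expanded.len() / 2);
    let mut xmask = Vec::with_capacity(expanded.len() / 2);
    for cycle in expanded.chunks(2 * words_per_cycle) {
        values.extend_from_slice(&cycle[..words_per_cycle]);
        xmask.extend_from_slice(&cycle[words_per_cycle..]);
    }
    (values, xmask)
}
```

In this sketch a template word of `0xFFFF_FFFF` marks a word of DFF state as unknown and `0` marks a word of primary inputs as known, matching the initialization described above.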
Measures three scenarios across varying IO counts and stage depths:
- two_state: baseline kernel (no xprop)
- xprop_xfree: X-aware kernel on non-X-capable partition (zero overhead)
- xprop_xcapable: X-aware kernel on X-capable partition (~1.5-1.6x)

Results confirm X-free partitions pay no overhead, and X-capable
partitions stay well within the 2x budget.

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)
Replace the speculative Open Questions section with Design Decisions
documenting the choices made during implementation: conservative
whole-SRAM X granularity, skipping reset-aware analysis, VCD X output
enabled, partition-level granularity, runtime CLI flag, and state
buffer layout with metadata words 8/9.

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)
Unit tests for vcd_io.rs xprop helpers (expand/split roundtrip, cycle
structure preservation, X-mask template correctness) and flatten.rs
xprop fields (effective_state_size, xprop_state_offset, partition
metadata words 8/9).

CI updates: run xprop benchmark alongside event_buffer, add E2E xprop
simulation steps to both Metal and CUDA jobs (map --xprop, sim --xprop,
verify VCD contains X values).

Co-developed-by: Claude Code v2.1.39 (claude-opus-4-6)