Stochastic volatility tracking across 5000 ticks: true latent log-volatility (white) vs. BPF filtered estimate (blue). All scenarios use zero model mismatch (the filter knows the true DGP parameters).
Performance: 51M particles, ~89 µs/tick on RTX 5080. Hand-written PTX runs 15% faster than nvcc-compiled CUDA C — ex2.approx, rcp.approx, no cvta.to.global overhead, tighter register allocation.
A complete Bootstrap Particle Filter for Stochastic Volatility estimation, written in raw NVIDIA PTX assembly targeting SM_120 (Blackwell / RTX 5080).
8 hand-written kernels implement the full BPF pipeline: propagate + weight, log-weight normalization (reduce-max, exp-sub, reduce-sum, scale), log-likelihood accumulation, and systematic resampling via binary search.
This is an educational companion to the CUDA C implementation (gpu_bpf.cu).
Every kernel that the CUDA compiler (nvcc) would generate is written here by hand,
so you can see exactly what happens at the instruction level.
├── gpu_bpf/
│ ├── gpu_bpf.cu / gpu_bpf.cuh — nvcc-compiled BPF/APF/IMM runtime
│ ├── gpu_bpf_ptx.cu / gpu_bpf.cuh — Hybrid: PTX weights + cuRAND propagation
│ ├── gpu_bpf_ptx_full.cu / gpu_bpf_full.cuh — Full PTX: 13 BPF kernels, PCG32 RNG
│ └── bpf_ptx_host.cu — Standalone PTX loader demo
├── ptx_assembly/
│ ├── bpf_kernels.ptx — 7 weight-processing kernels (hybrid path)
│ ├── bpf_kernels_full.ptx — All 13 BPF kernels: PCG32, ICDF, propagate
│ └── *.cubin — Pre-compiled binaries (SM_120)
├── tests/
│ ├── test_bpf_matched_dgp.cu — Matched-DGP test suite (all filter variants)
│ ├── bench_particle_scaling.cu — Latency vs particle count benchmark
│ └── test_icdf_diag.cu — ICDF/RNG accuracy diagnostics
├── csv_bank/ — Output data files
├── CMakeLists.txt — Cross-platform build (MSVC / ICX / GCC)
└── README.md
# Option 1: Ahead-of-time compile PTX → cubin, then load at runtime
ptxas -arch=sm_120 -o bpf_kernels.cubin bpf_kernels.ptx
nvcc -o bpf_demo bpf_ptx_host.cu -lcuda -arch=sm_120
./bpf_demo bpf_kernels.ptx
# Option 2: JIT — the driver API compiles PTX at load time
# (cuModuleLoadData handles this automatically)
nvcc -o bpf_demo bpf_ptx_host.cu -lcuda
./bpf_demo bpf_kernels.ptx

CUDA C → nvcc → PTX (virtual ISA) → ptxas → SASS (real machine code)
                 ↑
            YOU ARE HERE
- PTX is a virtual ISA — it looks like assembly but is hardware-independent
- SASS is the actual GPU machine code (different per SM generation)
- `ptxas` converts PTX → SASS; the Driver API can also JIT this at load time
- Writing PTX by hand lets you bypass nvcc's optimizer and control instruction selection
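A minimal sketch of that loading path — roughly what bpf_ptx_host.cu does with the Driver API. The kernel's parameter list here is an assumption for illustration; the real signature is whatever the `.entry` declaration in the `.ptx` file says.

```cuda
// Sketch: load PTX text with the driver API, pull a kernel out by name, launch it.
// The driver JITs PTX -> SASS inside cuModuleLoadData.
#include <cuda.h>

void run_reduce_max_from_ptx(const char* ptx_source,
                             CUdeviceptr d_logw, CUdeviceptr d_max, int n) {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;   cuModuleLoadData(&mod, ptx_source);        // JIT happens here
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "bpf_reduce_max");

    void* args[] = { &d_logw, &d_max, &n };                    // assumed parameter list
    int block = 256;
    int grid  = (n + block - 1) / block;
    cuLaunchKernel(fn, grid, 1, 1,             // grid dims
                       block, 1, 1,            // block dims
                       block * sizeof(float),  // dynamic shared memory for the reduction
                       nullptr, args, nullptr);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
```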
.reg .u32 %r1; // 32-bit unsigned int
.reg .s32 %r2; // 32-bit signed int
.reg .u64 %rd1; // 64-bit unsigned (pointers)
.reg .f32 %f1; // 32-bit float (what we use everywhere)
.reg .f64 %fd1; // 64-bit double
.reg .pred %p1; // 1-bit predicate (for conditional execution)
| PTX Instruction | CUDA C Equivalent | Notes |
|---|---|---|
| `ld.global.f32` | `x = arr[i]` | Load from global memory |
| `st.global.f32` | `arr[i] = x` | Store to global memory |
| `ld.shared.f32` | `sdata[tid]` | Load from shared memory |
| `mad.lo.u32 %idx, %ctaid, %ntid, %tid` | `blockIdx.x * blockDim.x + threadIdx.x` | Thread index |
| `mul.wide.u32` | 32×32→64 multiply | Used for byte offset calc |
| `fma.rn.f32` | `a*b + c` (fused) | Single rounding, more precise |
| `ex2.approx.f32` | `exp2f(x)` / `__expf()` | Hardware exp2 unit |
| `lg2.approx.f32` | `log2f(x)` | Hardware log2 unit |
| `rcp.approx.f32` | `1.0f / x` | Hardware reciprocal (SFU) |
| `max.f32` | `fmaxf(a, b)` | Float max |
| `bar.sync 0` | `__syncthreads()` | Block barrier |
| `atom.global.add.f32` | `atomicAdd()` | Atomic float add |
| `atom.global.cas.b32` | `atomicCAS()` | Compare-and-swap |
| `setp.lt.f32 %p, a, b` | `if (a < b)` | Set predicate |
| `@%p instruction` | conditional exec | Predicated (branchless) |
In CUDA C you write __expf(x). The compiler generates:
// exp(x) = exp2(x * log2(e))
mul.f32 %f, %x, 0f3FB8AA3B; // x * 1.4426950 (log2(e))
ex2.approx.f32 %f, %f; // hardware exp2 in SFU
The GPU has a Special Function Unit (SFU) that computes exp2, log2, rcp,
rsqrt, and sin/cos in hardware. Everything else is built on top of these.
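As a concrete use in this pipeline, here is a sketch (signature assumed, not the repo's exact kernel) of the exp-sub normalization step in CUDA C. The `__expf()` call is what lowers to the `mul` + `ex2.approx.f32` pair above; plain `expf()` would instead call a slower, fully accurate software routine.

```cuda
// Sketch: w[i] = exp(logw[i] - max_logw). Subtracting the running max keeps
// the argument <= 0, so the fast hardware exponential can never overflow.
__global__ void exp_sub(const float* logw, float* w, const float* max_logw, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        w[i] = __expf(logw[i] - *max_logw);   // mul.f32 + ex2.approx.f32
    }
}
```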
The shared-memory tree reduction for bpf_reduce_max:
Thread: 0 1 2 3 4 5 6 7 (blockDim=8)
sdata: [3.1, 0.5, 2.8, 1.0, 4.2, 0.1, 3.5, 2.0]
Step s=4: threads 0-3 compare with threads 4-7
sdata: [4.2, 0.5, 3.5, 2.0, -, -, -, -]
Step s=2: threads 0-1 compare with threads 2-3
sdata: [4.2, 2.0, -, -, -, -, -, -]
Step s=1: thread 0 compares with thread 1
sdata: [4.2, -, -, -, -, -, -, -]
Thread 0 → atomicMax to global scalar
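In CUDA C terms the same reduction looks roughly like this — a sketch rather than the repo's exact kernel, with `atomicMaxFloat` standing in for the CAS loop shown in the next section:

```cuda
#include <float.h>

// Forward declaration: CAS-based float atomicMax, defined in the next section's sketch.
__device__ float atomicMaxFloat(float* addr, float val);

// Sketch of a block-level max reduction over log-weights. Each block reduces its
// slice in shared memory, then thread 0 folds the block result into a global scalar.
__global__ void reduce_max(const float* logw, float* global_max, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? logw[i] : -FLT_MAX;
    __syncthreads();

    // Halve the active range each step: s = blockDim/2, blockDim/4, ..., 1.
    for (int s = blockDim.x >> 1; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0) atomicMaxFloat(global_max, sdata[0]);
}
```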
CUDA doesn't have native atomicMax for float. We use CAS (compare-and-swap):
// Reinterpret float bits as int for CAS
LOOP:
mov.b32 %r_assumed, %r_old; // save old int bits
mov.b32 %f_old, %r_assumed; // reinterpret as float
max.f32 %f_new, %f_val, %f_old; // float max
mov.b32 %r_new, %f_new; // back to int bits
atom.global.cas.b32 %r_old, [addr], %r_assumed, %r_new;
setp.ne.b32 %p, %r_old, %r_assumed; // did someone else write?
@%p bra LOOP; // retry if so
This works because IEEE 754 floats (positive) have the same ordering as integers.
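The same loop in CUDA C, for comparison — a standard CAS idiom, sketched here rather than copied from the repo:

```cuda
// Float atomicMax built from atomicCAS: reinterpret the bits as int for the CAS,
// do the actual comparison in float space, and retry until no other thread has
// raced us between the read and the swap.
__device__ float atomicMaxFloat(float* addr, float val) {
    int* addr_as_int = (int*)addr;
    int old = *addr_as_int;
    int assumed;
    do {
        assumed = old;
        float f_new = fmaxf(val, __int_as_float(assumed));
        old = atomicCAS(addr_as_int, assumed, __float_as_int(f_new));
    } while (old != assumed);   // someone else wrote — retry
    return __int_as_float(old);
}
```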
Instead of branching, PTX uses predicates:
setp.lt.f32 %p_lt, %f_cdf_mid, %f_target; // set predicate
@%p_lt add.s32 %r_lo, %r_mid, 1; // execute only if true
@!%p_lt mov.s32 %r_hi, %r_mid; // execute only if false
This avoids warp divergence — all threads execute both instructions but only one actually writes. Critical for the binary search in resampling.
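In CUDA C the same search can be written so the compiler emits exactly these predicated instructions. A sketch of the per-particle lookup, assuming an inclusive weight CDF:

```cuda
// Sketch: find the smallest index whose CDF value reaches the target u.
// The ternary selects compile to setp + predicated add/mov, so every thread
// in the warp runs the same instruction stream regardless of its comparison.
__device__ int find_ancestor(const float* cdf, int n, float u) {
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int mid = (lo + hi) >> 1;
        bool lt = cdf[mid] < u;    // setp.lt.f32
        lo = lt ? mid + 1 : lo;    // @%p  add.s32
        hi = lt ? hi : mid;        // @!%p mov.s32
    }
    return lo;
}
```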
- PCG32 PRNG — Full PCG XSH-RR in PTX (64-bit state, 16 bytes/particle), seeded per-particle with unique odd increments (see the sketch after this list).
- Inverse CDF Normal Generation — Full Acklam rational approximation (degree-6/5 central, degree-6/4 tails), ~1e-9 relative accuracy in float32.
- Fused Propagate+Weight Kernel — OU transition, Student-t state noise (via host-pregenerated chi² from cuRAND), and Student-t observation log-weights with a precomputed lgamma constant. One kernel launch per tick.
- All 13 BPF Kernels in PTX — init_rng, init_particles, propagate_weight, set_scalar, reduce_max, reduce_sum, exp_sub, scale_wh, compute_loglik, resample, compute_var, gen_noise, silverman_jitter.
- Thrust Prefix Scan — the resampling CDF still uses `thrust::inclusive_scan`. Writing a Blelloch scan in PTX is ~200 lines but would eliminate the last library dependency in the BPF path.
- cuRAND for Chi² Generation — bulk normal generation + square-sum for Student-t state noise. Fast (adds ~10 µs/tick) but means the BPF isn't fully cuRAND-free when `nu_state > 0`.
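For reference, the PCG32 XSH-RR step these kernels implement, written as a CUDA device function — a sketch of the published algorithm, not lifted from the `.ptx`:

```cuda
// PCG32 XSH-RR: 64-bit LCG state advanced with a per-particle odd increment;
// output = xorshift-high of the old state, folded to 32 bits and rotated by
// the top state bits.
struct pcg32_state { unsigned long long state, inc; };

__device__ unsigned int pcg32_next(pcg32_state* rng) {
    unsigned long long old = rng->state;
    rng->state = old * 6364136223846793005ULL + rng->inc;                  // LCG step
    unsigned int xorshifted = (unsigned int)(((old >> 18u) ^ old) >> 27u); // XSH
    unsigned int rot = (unsigned int)(old >> 59u);                         // RR
    return (xorshifted >> rot) | (xorshifted << ((32u - rot) & 31u));
}

// Uniform in [0, 1): scale the 32-bit output by 2^-32.
__device__ float pcg32_uniform(pcg32_state* rng) {
    return (float)pcg32_next(rng) * 2.3283064365386963e-10f;
}
```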
- 128 CUDA cores per SM (same as Ada)
- Native `atom.global.add.f32` — no CAS needed for float atomicAdd
- `ex2.approx.f32` precision: ~23 bits mantissa (same as SM_100)
- 256 KB shared memory per SM (configurable L1/shared split)
- Max 1024 threads per block, 32 warps per SM
- PTX ISA 8.8+ required for SM_120 features
CUDA C                           PTX Kernel
──────────────────────────────   ─────────────────────────
bpf_propagate_weight<<<>>>     → bpf_propagate_weight
bpf_set_scalar<<<1,1>>>        → bpf_set_scalar
bpf_reduce_max<<<>>>           → bpf_reduce_max
bpf_exp_sub_dev<<<>>>          → bpf_exp_sub
bpf_set_scalar<<<1,1>>>        → bpf_set_scalar
bpf_reduce_sum<<<>>>           → bpf_reduce_sum
bpf_scale_wh_dev<<<>>>         → bpf_scale_wh
bpf_set_scalar<<<1,1>>>        → bpf_set_scalar
bpf_reduce_sum<<<>>>           → bpf_reduce_sum
bpf_compute_loglik<<<1,1>>>    → bpf_compute_loglik
thrust::inclusive_scan         → [still thrust — scan is hard]
bpf_resample<<<>>>             → bpf_resample
Open source — educational use. Do whatever you want with it.