lowram: Stream matrix A element-by-element to reduce memory by mkannwischer · Pull Request #1019 · pq-code-package/mldsa-native

mkannwischer · 2026-04-05T06:24:46Z

Replace the row-level matrix buffer (mld_polyvecl) with a single-poly
buffer in REDUCE_RAM mode. In the lazy path, matrix elements A[k][l]
are sampled on demand one at a time, and the matrix-vector product
accumulates element-by-element instead of row-by-row.

Restructure polymat into eager/lazy variants following the same pattern
as s1hat/s2hat/t0hat:

- mld_polymat_eager: stores full K x L matrix
- mld_polymat_lazy: stores rho + single poly_buffer + tmp
- mld_polyvec_matrix_expand_eager/_lazy: separate implementations
- mld_polyvec_matrix_pointwise_montgomery_eager/_lazy: separate
  implementations with CBMC contracts only on the eager variants

Move all polymat-related code from polyvec.h/polyvec.c into
polyvec_lazy.h/polyvec_lazy.c.
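
To make the eager/lazy split concrete, here is a minimal sketch of the two matrix-vector products. It is not the PR's code: the `toy_` names, the tiny dimensions, the plain `%` reduction (instead of Montgomery arithmetic in the NTT domain) and the `toy_sample` stand-in for SHAKE-based sampling are all assumptions made for illustration. Only the shape of the two structs and of the accumulation loop follows the description above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_N 4       /* coefficients per polynomial (256 in ML-DSA) */
#define TOY_K 2       /* matrix rows */
#define TOY_L 3       /* matrix columns */
#define TOY_Q 8380417 /* ML-DSA modulus */

typedef struct { int32_t coeffs[TOY_N]; } toy_poly;
typedef struct { toy_poly vec[TOY_L]; } toy_polyvecl;
typedef struct { toy_poly vec[TOY_K]; } toy_polyveck;

/* Eager: the full K x L matrix is stored. */
typedef struct { toy_polyvecl row[TOY_K]; } toy_polymat_eager;

/* Lazy: only the seed, one element buffer and one scratch poly are stored. */
typedef struct {
  uint8_t rho[4];
  toy_poly poly_buffer;
  toy_poly tmp;
} toy_polymat_lazy;

/* Hypothetical stand-in for sampling A[k][l] from rho; not a real XOF. */
static void toy_sample(toy_poly *a, const uint8_t rho[4], unsigned k, unsigned l)
{
  unsigned i;
  assert(k < TOY_K && l < TOY_L);
  for (i = 0; i < TOY_N; i++)
  {
    a->coeffs[i] = (int32_t)((rho[i % 4] + 7u * k + 13u * l + i) % 17u);
  }
}

/* c = a * b coefficient-wise, reduced mod q (plain, not Montgomery). */
static void toy_pointwise(toy_poly *c, const toy_poly *a, const toy_poly *b)
{
  unsigned i;
  for (i = 0; i < TOY_N; i++)
  {
    c->coeffs[i] = (int32_t)(((int64_t)a->coeffs[i] * b->coeffs[i]) % TOY_Q);
  }
}

/* acc += b coefficient-wise, reduced mod q. */
static void toy_add(toy_poly *acc, const toy_poly *b)
{
  unsigned i;
  for (i = 0; i < TOY_N; i++)
  {
    acc->coeffs[i] = (int32_t)(((int64_t)acc->coeffs[i] + b->coeffs[i]) % TOY_Q);
  }
}

/* Eager expansion: sample and store every element once. */
static void toy_expand_eager(toy_polymat_eager *mat, const uint8_t rho[4])
{
  unsigned k, l;
  for (k = 0; k < TOY_K; k++)
  {
    for (l = 0; l < TOY_L; l++)
    {
      toy_sample(&mat->row[k].vec[l], rho, k, l);
    }
  }
}

/* Product over a stored matrix. */
static void toy_matvec_eager(toy_polyveck *t, const toy_polymat_eager *mat,
                             const toy_polyvecl *v)
{
  unsigned k, l;
  toy_poly tmp;
  for (k = 0; k < TOY_K; k++)
  {
    toy_pointwise(&t->vec[k], &mat->row[k].vec[0], &v->vec[0]);
    for (l = 1; l < TOY_L; l++)
    {
      toy_pointwise(&tmp, &mat->row[k].vec[l], &v->vec[l]);
      toy_add(&t->vec[k], &tmp);
    }
  }
}

/* Element-by-element product: A[k][l] is resampled on demand. */
static void toy_matvec_lazy(toy_polyveck *t, toy_polymat_lazy *mat,
                            const toy_polyvecl *v)
{
  unsigned k, l;
  for (k = 0; k < TOY_K; k++)
  {
    toy_sample(&mat->poly_buffer, mat->rho, k, 0);
    toy_pointwise(&t->vec[k], &mat->poly_buffer, &v->vec[0]);
    for (l = 1; l < TOY_L; l++)
    {
      toy_sample(&mat->poly_buffer, mat->rho, k, l);
      toy_pointwise(&mat->tmp, &mat->poly_buffer, &v->vec[l]);
      toy_add(&t->vec[k], &mat->tmp);
    }
  }
}
```

Both variants produce the same vector in this sketch; the lazy one trades repeated sampling for a working set of two polynomials plus the seed.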

Hoisted out from #1005 (PoC: Reduce large struct allocations to <= 13/17/21 KiB for ML-DSA-44/65/87).
Depends on #1015 (lowram: Compute h incrementally in signing).

oqs-bot · 2026-04-05T07:28:27Z

CBMC Results (ML-DSA-65)

⚠️ Attention Required

Proof	Status	Current	Previous	Change
`polyvecl_pointwise_acc_montgomery_c`	⚠️	473s	295s	+60%

Full Results (184 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	2594s	2418s	+7.3%
`polyvecl_pointwise_acc_montgomery_c`	⚠️	473s	295s	+60%
`sign_verify_internal`	✅	261s	289s	-10%
`poly_pointwise_montgomery_c`	✅	165s	164s	+1%
`polyvec_matrix_expand_eager`	✅	164s	-	new
`rej_uniform_native`	✅	157s	154s	+2%
`mld_invntt_layer`	✅	100s	98s	+2%
`mld_attempt_signature_generation`	✅	92s	110s	-16%
`polyvec_matrix_expand_eager_serial`	✅	82s	-	new
`mld_ct_memcmp`	✅	81s	80s	+1%
`mld_ntt_layer`	✅	59s	56s	+5%
`sign_signature_internal`	✅	43s	30s	+43%
`polymat_permute_bitrev_to_custom`	✅	35s	34s	+3%
`mld_compute_t0_t1_tr_from_sk_components`	✅	24s	28s	-14%
`fqmul`	✅	22s	20s	+10%
`poly_chknorm_c`	✅	22s	21s	+5%
`rej_uniform`	✅	21s	22s	-5%
`polyveck_decompose`	✅	18s	18s	+0%
`poly_uniform_4x`	✅	17s	17s	+0%
`poly_uniform_eta_4x`	✅	16s	19s	-16%
`rej_uniform_c`	✅	16s	15s	+7%
`keccakf1600x4_permute_native`	✅	14s	14s	+0%
`mld_ntt_butterfly_block`	✅	14s	12s	+17%
`poly_add`	✅	14s	10s	+40%
`polyt0_unpack`	✅	14s	17s	-18%
`polyvec_matrix_pointwise_montgomery_eager`	✅	14s	-	new
`polyveck_add`	✅	14s	8s	+75%
`keccak_absorb_once_x4`	✅	12s	11s	+9%
`keccakf1600_permute_native`	✅	10s	8s	+25%
`mld_polyvecl_permute_bitrev_to_custom_native`	✅	10s	10s	+0%
`polyveck_ntt`	✅	10s	9s	+11%
`polyvecl_chknorm`	✅	10s	3s	+233%
`keccakf1600_permute`	✅	9s	9s	+0%
`polyveck_caddq`	✅	9s	8s	+12%
`polyveck_power2round`	✅	9s	14s	-36%
`polyveck_use_hint`	✅	9s	24s	-62%
`mld_check_pct`	✅	8s	14s	-43%
`poly_decompose_c`	✅	8s	9s	-11%
`poly_invntt_tomont_c`	✅	8s	6s	+33%
`polyveck_shiftl`	✅	8s	9s	-11%
`polyvecl_ntt`	✅	8s	8s	+0%
`unpack_hints`	✅	8s	7s	+14%
`caddq`	✅	7s	4s	+75%
`keccak_squeezeblocks_x4`	✅	7s	7s	+0%
`mld_compute_pack_z`	✅	7s	6s	+17%
`mld_h`	✅	7s	5s	+40%
`polyveck_invntt_tomont`	✅	7s	7s	+0%
`polyveck_pointwise_poly_montgomery`	✅	7s	8s	-12%
`polyveck_reduce`	✅	7s	5s	+40%
`sign_pk_from_sk`	✅	7s	9s	-22%
`intt_native_x86_64`	✅	6s	4s	+50%
`poly_caddq_c`	✅	6s	5s	+20%
`poly_power2round`	✅	6s	4s	+50%
`polyeta_unpack`	✅	6s	8s	-25%
`polyveck_sub`	✅	6s	6s	+0%
`polyveck_unpack_eta`	✅	6s	5s	+20%
`polyvecl_pointwise_acc_montgomery_native`	✅	6s	5s	+20%
`shake256_init`	✅	6s	1s	+500%
`sign`	✅	6s	9s	-33%
`sign_verify_extmu`	✅	6s	4s	+50%
`unpack_sk`	✅	6s	9s	-33%
`keccak_absorb`	✅	5s	8s	-38%
`keccak_squeeze`	✅	5s	2s	+150%
`mld_sample_s1_s2`	✅	5s	5s	+0%
`mld_sample_s1_s2_serial`	✅	5s	3s	+67%
`montgomery_reduce`	✅	5s	4s	+25%
`poly_challenge`	✅	5s	2s	+150%
`poly_decompose_native`	✅	5s	3s	+67%
`poly_ntt`	✅	5s	4s	+25%
`poly_shiftl`	✅	5s	2s	+150%
`poly_use_hint_native`	✅	5s	3s	+67%
`polyveck_chknorm`	✅	5s	10s	-50%
`rej_eta_native`	✅	5s	6s	-17%
`sign_keypair_internal`	✅	5s	7s	-29%
`sign_open`	✅	5s	5s	+0%
`sign_signature`	✅	5s	4s	+25%
`sign_signature_extmu`	✅	5s	5s	+0%
`keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid`	✅	4s	3s	+33%
`keccak_finalize`	✅	4s	1s	+300%
`keccakf1600x4_extract_bytes`	✅	4s	2s	+100%
`keccakf1600x4_xor_bytes`	✅	4s	3s	+33%
`mld_ct_abs_i32`	✅	4s	1s	+300%
`mld_ct_get_optblocker_i64`	✅	4s	2s	+100%
`mld_keccakf1600_extract_bytes`	✅	4s	2s	+100%
`mld_prepare_domain_separation_prefix`	✅	4s	6s	-33%
`mld_value_barrier_u32`	✅	4s	3s	+33%
`ntt_native_aarch64`	✅	4s	4s	+0%
`poly_chknorm`	✅	4s	2s	+100%
`poly_make_hint`	✅	4s	3s	+33%
`poly_uniform`	✅	4s	4s	+0%
`poly_uniform_gamma1`	✅	4s	2s	+100%
`poly_use_hint`	✅	4s	3s	+33%
`polyeta_pack`	✅	4s	4s	+0%
`polyt0_pack`	✅	4s	2s	+100%
`polyveck_unpack_t0`	✅	4s	4s	+0%
`polyvecl_uniform_gamma1_serial`	✅	4s	5s	-20%
`polyvecl_unpack_eta`	✅	4s	3s	+33%
`polyvecl_unpack_z`	✅	4s	4s	+0%
`polyz_pack`	✅	4s	3s	+33%
`polyz_unpack_c`	✅	4s	4s	+0%
`polyz_unpack_native`	✅	4s	1s	+300%
`sign_signature_pre_hash_internal`	✅	4s	5s	-20%
`sign_verify`	✅	4s	6s	-33%
`sign_verify_pre_hash_shake256`	✅	4s	4s	+0%
`decompose`	✅	3s	4s	-25%
`keccak_f1600_x1_native_aarch64`	✅	3s	4s	-25%
`keccakf1600x4_permute`	✅	3s	3s	+0%
`mld_ct_cmask_neg_i32`	✅	3s	3s	+0%
`mld_ct_cmask_nonzero_u32`	✅	3s	2s	+50%
`mld_ct_get_optblocker_u32`	✅	3s	2s	+50%
`pack_pk`	✅	3s	3s	+0%
`pack_sig_c`	✅	3s	2s	+50%
`pack_sig_z`	✅	3s	3s	+0%
`poly_caddq`	✅	3s	4s	-25%
`poly_caddq_native`	✅	3s	3s	+0%
`poly_chknorm_native_aarch64`	✅	3s	3s	+0%
`poly_invntt_tomont`	✅	3s	3s	+0%
`poly_ntt_c`	✅	3s	4s	-25%
`poly_ntt_native`	✅	3s	3s	+0%
`poly_reduce`	✅	3s	3s	+0%
`poly_sub`	✅	3s	2s	+50%
`poly_uniform_eta`	✅	3s	6s	-50%
`poly_uniform_gamma1_4x`	✅	3s	5s	-40%
`poly_use_hint_c`	✅	3s	5s	-40%
`polyt1_pack`	✅	3s	2s	+50%
`polyt1_unpack`	✅	3s	3s	+0%
`polyveck_pack_eta`	✅	3s	3s	+0%
`polyveck_pack_t0`	✅	3s	2s	+50%
`polyveck_pack_w1`	✅	3s	5s	-40%
`polyvecl_pack_eta`	✅	3s	3s	+0%
`polyvecl_uniform_gamma1`	✅	3s	3s	+0%
`polyw1_pack`	✅	3s	4s	-25%
`power2round`	✅	3s	3s	+0%
`reduce32`	✅	3s	2s	+50%
`rej_eta`	✅	3s	4s	-25%
`rej_eta_c`	✅	3s	4s	-25%
`shake128_squeeze`	✅	3s	3s	+0%
`shake128x4_squeezeblocks`	✅	3s	2s	+50%
`shake256`	✅	3s	3s	+0%
`shake256_squeeze`	✅	3s	4s	-25%
`sign_keypair`	✅	3s	4s	-25%
`sign_signature_pre_hash_shake256`	✅	3s	3s	+0%
`unpack_pk`	✅	3s	2s	+50%
`unpack_sig`	✅	3s	3s	+0%
`use_hint`	✅	3s	5s	-40%
`fqscale`	✅	2s	3s	-33%
`keccak_f1600_x1_native_aarch64_v84a`	✅	2s	2s	+0%
`keccak_f1600_x4_native_aarch64_v84a`	✅	2s	3s	-33%
`keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid`	✅	2s	2s	+0%
`keccak_init`	✅	2s	2s	+0%
`keccakf1600_extract_bytes (big endian)`	✅	2s	3s	-33%
`keccakf1600_xor_bytes`	✅	2s	2s	+0%
`keccakf1600_xor_bytes (big endian)`	✅	2s	2s	+0%
`make_hint`	✅	2s	6s	-67%
`mld_ct_cmask_nonzero_u8`	✅	2s	4s	-50%
`mld_ct_get_optblocker_u8`	✅	2s	3s	-33%
`mld_ct_sel_int32`	✅	2s	1s	+100%
`mld_value_barrier_u8`	✅	2s	4s	-50%
`ntt_native_x86_64`	✅	2s	4s	-50%
`pack_sig_h_poly`	✅	2s	5s	-60%
`pack_sk`	✅	2s	2s	+0%
`pointwise_native_aarch64`	✅	2s	3s	-33%
`pointwise_native_x86_64`	✅	2s	5s	-60%
`poly_caddq_native_aarch64`	✅	2s	3s	-33%
`poly_decompose`	✅	2s	2s	+0%
`poly_invntt_tomont_native`	✅	2s	3s	-33%
`poly_pointwise_montgomery_native`	✅	2s	4s	-50%
`polyvecl_permute_bitrev_to_custom`	✅	2s	3s	-33%
`polyz_unpack`	✅	2s	2s	+0%
`shake128_absorb`	✅	2s	2s	+0%
`shake128_finalize`	✅	2s	2s	+0%
`shake128_init`	✅	2s	3s	-33%
`shake128_release`	✅	2s	4s	-50%
`shake256_finalize`	✅	2s	2s	+0%
`shake256_release`	✅	2s	3s	-33%
`shake256x4_squeezeblocks`	✅	2s	2s	+0%
`sign_verify_pre_hash_internal`	✅	2s	4s	-50%
`sys_check_capability`	✅	2s	4s	-50%
`mld_value_barrier_i64`	✅	1s	3s	-67%
`poly_chknorm_native`	✅	1s	4s	-75%
`poly_pointwise_montgomery`	✅	1s	2s	-50%
`polyvecl_pointwise_acc_montgomery`	✅	1s	3s	-67%
`shake128x4_absorb_once`	✅	1s	3s	-67%
`shake256_absorb`	✅	1s	3s	-67%
`shake256x4_absorb_once`	✅	1s	2s	-50%

oqs-bot · 2026-04-05T07:28:33Z

CBMC Results (ML-DSA-44)

Full Results (184 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	1814s	2061s	-12.0%
`sign_verify_internal`	✅	203s	211s	-4%
`poly_pointwise_montgomery_c`	✅	149s	148s	+1%
`rej_uniform_native`	✅	140s	142s	-1%
`polyvecl_pointwise_acc_montgomery_c`	✅	122s	350s	-65%
`mld_attempt_signature_generation`	✅	88s	105s	-16%
`mld_invntt_layer`	✅	84s	88s	-5%
`mld_ct_memcmp`	✅	75s	75s	+0%
`mld_ntt_layer`	✅	54s	54s	+0%
`sign_signature_internal`	✅	26s	20s	+30%
`polyvec_matrix_expand_eager`	✅	23s	-	new
`rej_uniform`	✅	22s	23s	-4%
`poly_chknorm_c`	✅	20s	19s	+5%
`mld_compute_t0_t1_tr_from_sk_components`	✅	19s	12s	+58%
`poly_uniform_4x`	✅	18s	14s	+29%
`poly_uniform_eta_4x`	✅	18s	16s	+12%
`polyeta_unpack`	✅	18s	16s	+12%
`fqmul`	✅	17s	20s	-15%
`polyt0_unpack`	✅	15s	13s	+15%
`rej_uniform_c`	✅	14s	15s	-7%
`keccakf1600x4_permute_native`	✅	13s	11s	+18%
`mld_check_pct`	✅	13s	12s	+8%
`mld_ntt_butterfly_block`	✅	13s	13s	+0%
`polyvec_matrix_expand_eager_serial`	✅	13s	-	new
`poly_add`	✅	12s	11s	+9%
`polyz_unpack_c`	✅	12s	12s	+0%
`keccak_absorb_once_x4`	✅	11s	11s	+0%
`polymat_permute_bitrev_to_custom`	✅	10s	28s	-64%
`polyvec_matrix_pointwise_montgomery_eager`	✅	10s	-	new
`polyveck_power2round`	✅	10s	6s	+67%
`keccak_absorb`	✅	9s	7s	+29%
`unpack_sk`	✅	9s	9s	+0%
`keccakf1600_permute_native`	✅	8s	7s	+14%
`mld_sample_s1_s2`	✅	8s	5s	+60%
`polyveck_chknorm`	✅	8s	4s	+100%
`polyveck_decompose`	✅	8s	7s	+14%
`sign_pk_from_sk`	✅	8s	6s	+33%
`keccakf1600_permute`	✅	7s	11s	-36%
`mld_polyvecl_permute_bitrev_to_custom_native`	✅	7s	7s	+0%
`poly_caddq_c`	✅	7s	7s	+0%
`polyveck_shiftl`	✅	7s	5s	+40%
`sign`	✅	7s	6s	+17%
`keccak_squeezeblocks_x4`	✅	6s	7s	-14%
`poly_invntt_tomont_c`	✅	6s	7s	-14%
`poly_uniform_gamma1_4x`	✅	6s	3s	+100%
`polyveck_add`	✅	6s	6s	+0%
`polyveck_ntt`	✅	6s	5s	+20%
`polyveck_reduce`	✅	6s	4s	+50%
`polyvecl_ntt`	✅	6s	4s	+50%
`rej_eta_c`	✅	6s	5s	+20%
`shake128_absorb`	✅	6s	2s	+200%
`sign_open`	✅	6s	2s	+200%
`unpack_hints`	✅	6s	5s	+20%
`mld_compute_pack_z`	✅	5s	5s	+0%
`mld_ct_cmask_nonzero_u8`	✅	5s	2s	+150%
`mld_sample_s1_s2_serial`	✅	5s	8s	-38%
`pointwise_native_x86_64`	✅	5s	3s	+67%
`poly_caddq`	✅	5s	3s	+67%
`polyveck_pack_eta`	✅	5s	2s	+150%
`polyveck_pointwise_poly_montgomery`	✅	5s	5s	+0%
`polyveck_sub`	✅	5s	5s	+0%
`polyveck_use_hint`	✅	5s	5s	+0%
`polyvecl_chknorm`	✅	5s	3s	+67%
`polyvecl_pack_eta`	✅	5s	2s	+150%
`polyvecl_uniform_gamma1_serial`	✅	5s	3s	+67%
`polyz_unpack`	✅	5s	2s	+150%
`rej_eta_native`	✅	5s	5s	+0%
`sign_signature`	✅	5s	4s	+25%
`sign_signature_pre_hash_internal`	✅	5s	6s	-17%
`sign_signature_pre_hash_shake256`	✅	5s	6s	-17%
`unpack_pk`	✅	5s	3s	+67%
`intt_native_x86_64`	✅	4s	2s	+100%
`keccak_f1600_x1_native_aarch64_v84a`	✅	4s	2s	+100%
`keccak_finalize`	✅	4s	4s	+0%
`mld_prepare_domain_separation_prefix`	✅	4s	4s	+0%
`montgomery_reduce`	✅	4s	3s	+33%
`pack_pk`	✅	4s	3s	+33%
`poly_challenge`	✅	4s	4s	+0%
`poly_chknorm`	✅	4s	5s	-20%
`poly_chknorm_native_aarch64`	✅	4s	2s	+100%
`poly_decompose_c`	✅	4s	3s	+33%
`poly_decompose_native`	✅	4s	4s	+0%
`poly_invntt_tomont_native`	✅	4s	4s	+0%
`poly_ntt_native`	✅	4s	4s	+0%
`poly_pointwise_montgomery`	✅	4s	6s	-33%
`poly_uniform`	✅	4s	4s	+0%
`poly_uniform_eta`	✅	4s	4s	+0%
`poly_use_hint_c`	✅	4s	4s	+0%
`polyt0_pack`	✅	4s	7s	-43%
`polyveck_caddq`	✅	4s	5s	-20%
`polyveck_unpack_t0`	✅	4s	2s	+100%
`polyvecl_permute_bitrev_to_custom`	✅	4s	2s	+100%
`polyvecl_pointwise_acc_montgomery_native`	✅	4s	4s	+0%
`rej_eta`	✅	4s	5s	-20%
`shake256_finalize`	✅	4s	2s	+100%
`shake256_init`	✅	4s	3s	+33%
`shake256_release`	✅	4s	2s	+100%
`shake256x4_absorb_once`	✅	4s	3s	+33%
`sign_keypair`	✅	4s	3s	+33%
`sign_keypair_internal`	✅	4s	4s	+0%
`sign_signature_extmu`	✅	4s	3s	+33%
`sign_verify`	✅	4s	5s	-20%
`sign_verify_pre_hash_internal`	✅	4s	6s	-33%
`sign_verify_pre_hash_shake256`	✅	4s	6s	-33%
`caddq`	✅	3s	4s	-25%
`fqscale`	✅	3s	5s	-40%
`keccak_f1600_x4_native_aarch64_v84a`	✅	3s	2s	+50%
`keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid`	✅	3s	3s	+0%
`keccak_init`	✅	3s	2s	+50%
`keccakf1600_xor_bytes`	✅	3s	1s	+200%
`keccakf1600x4_extract_bytes`	✅	3s	4s	-25%
`keccakf1600x4_permute`	✅	3s	4s	-25%
`keccakf1600x4_xor_bytes`	✅	3s	2s	+50%
`make_hint`	✅	3s	2s	+50%
`mld_ct_cmask_nonzero_u32`	✅	3s	3s	+0%
`mld_ct_get_optblocker_u32`	✅	3s	3s	+0%
`mld_ct_sel_int32`	✅	3s	2s	+50%
`mld_h`	✅	3s	4s	-25%
`ntt_native_aarch64`	✅	3s	2s	+50%
`ntt_native_x86_64`	✅	3s	3s	+0%
`pack_sig_z`	✅	3s	3s	+0%
`pack_sk`	✅	3s	4s	-25%
`poly_caddq_native_aarch64`	✅	3s	3s	+0%
`poly_chknorm_native`	✅	3s	4s	-25%
`poly_ntt`	✅	3s	4s	-25%
`poly_power2round`	✅	3s	5s	-40%
`poly_reduce`	✅	3s	3s	+0%
`poly_sub`	✅	3s	3s	+0%
`poly_uniform_gamma1`	✅	3s	3s	+0%
`poly_use_hint_native`	✅	3s	3s	+0%
`polyveck_invntt_tomont`	✅	3s	6s	-50%
`polyveck_pack_w1`	✅	3s	3s	+0%
`polyvecl_uniform_gamma1`	✅	3s	4s	-25%
`polyz_pack`	✅	3s	2s	+50%
`shake128_release`	✅	3s	4s	-25%
`shake128x4_squeezeblocks`	✅	3s	4s	-25%
`shake256_squeeze`	✅	3s	2s	+50%
`shake256x4_squeezeblocks`	✅	3s	3s	+0%
`sign_verify_extmu`	✅	3s	4s	-25%
`unpack_sig`	✅	3s	2s	+50%
`use_hint`	✅	3s	4s	-25%
`decompose`	✅	2s	3s	-33%
`keccak_f1600_x1_native_aarch64`	✅	2s	2s	+0%
`keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid`	✅	2s	2s	+0%
`keccakf1600_xor_bytes (big endian)`	✅	2s	3s	-33%
`mld_ct_get_optblocker_i64`	✅	2s	2s	+0%
`mld_ct_get_optblocker_u8`	✅	2s	2s	+0%
`mld_keccakf1600_extract_bytes`	✅	2s	3s	-33%
`mld_value_barrier_i64`	✅	2s	3s	-33%
`mld_value_barrier_u32`	✅	2s	2s	+0%
`mld_value_barrier_u8`	✅	2s	3s	-33%
`pack_sig_c`	✅	2s	5s	-60%
`pack_sig_h_poly`	✅	2s	2s	+0%
`pointwise_native_aarch64`	✅	2s	3s	-33%
`poly_caddq_native`	✅	2s	3s	-33%
`poly_decompose`	✅	2s	2s	+0%
`poly_invntt_tomont`	✅	2s	3s	-33%
`poly_make_hint`	✅	2s	3s	-33%
`poly_ntt_c`	✅	2s	2s	+0%
`poly_pointwise_montgomery_native`	✅	2s	2s	+0%
`poly_use_hint`	✅	2s	2s	+0%
`polyeta_pack`	✅	2s	2s	+0%
`polyt1_pack`	✅	2s	3s	-33%
`polyt1_unpack`	✅	2s	3s	-33%
`polyveck_pack_t0`	✅	2s	3s	-33%
`polyvecl_pointwise_acc_montgomery`	✅	2s	3s	-33%
`polyvecl_unpack_eta`	✅	2s	3s	-33%
`polyvecl_unpack_z`	✅	2s	5s	-60%
`polyw1_pack`	✅	2s	3s	-33%
`polyz_unpack_native`	✅	2s	3s	-33%
`power2round`	✅	2s	4s	-50%
`reduce32`	✅	2s	1s	+100%
`shake128_finalize`	✅	2s	2s	+0%
`shake128_squeeze`	✅	2s	2s	+0%
`shake128x4_absorb_once`	✅	2s	3s	-33%
`shake256`	✅	2s	2s	+0%
`shake256_absorb`	✅	2s	4s	-50%
`keccak_squeeze`	✅	1s	2s	-50%
`keccakf1600_extract_bytes (big endian)`	✅	1s	2s	-50%
`mld_ct_abs_i32`	✅	1s	1s	+0%
`mld_ct_cmask_neg_i32`	✅	1s	2s	-50%
`poly_shiftl`	✅	1s	3s	-67%
`polyveck_unpack_eta`	✅	1s	3s	-67%
`shake128_init`	✅	1s	2s	-50%
`sys_check_capability`	✅	1s	3s	-67%

oqs-bot · 2026-04-05T07:33:03Z

CBMC Results (ML-DSA-87)

⚠️ Attention Required

Proof	Status	Current	Previous	Change
`polyvecl_pointwise_acc_montgomery_c`	⚠️	569s	194s	+193%

Full Results (184 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	2573s	2234s	+15.2%
`polyvecl_pointwise_acc_montgomery_c`	⚠️	569s	194s	+193%
`polyvec_matrix_expand_eager`	✅	166s	-	new
`poly_pointwise_montgomery_c`	✅	146s	155s	-6%
`rej_uniform_native`	✅	145s	144s	+1%
`mld_attempt_signature_generation`	✅	125s	97s	+29%
`polyvec_matrix_expand_eager_serial`	✅	115s	-	new
`sign_verify_internal`	✅	110s	144s	-24%
`mld_invntt_layer`	✅	92s	95s	-3%
`mld_ct_memcmp`	✅	74s	73s	+1%
`mld_ntt_layer`	✅	53s	54s	-2%
`sign_signature_internal`	✅	48s	40s	+20%
`polymat_permute_bitrev_to_custom`	✅	29s	25s	+16%
`mld_compute_t0_t1_tr_from_sk_components`	✅	28s	23s	+22%
`rej_uniform`	✅	24s	19s	+26%
`poly_chknorm_c`	✅	21s	20s	+5%
`fqmul`	✅	20s	19s	+5%
`poly_uniform_4x`	✅	16s	16s	+0%
`polyeta_unpack`	✅	16s	18s	-11%
`poly_uniform_eta_4x`	✅	15s	15s	+0%
`rej_uniform_c`	✅	15s	13s	+15%
`polyt0_unpack`	✅	14s	15s	-7%
`keccakf1600x4_permute_native`	✅	13s	13s	+0%
`mld_check_pct`	✅	13s	13s	+0%
`mld_ntt_butterfly_block`	✅	13s	15s	-13%
`keccak_absorb_once_x4`	✅	12s	11s	+9%
`keccakf1600_permute_native`	✅	12s	8s	+50%
`polyvec_matrix_pointwise_montgomery_eager`	✅	12s	-	new
`polyveck_add`	✅	12s	9s	+33%
`polyveck_use_hint`	✅	12s	12s	+0%
`mld_polyvecl_permute_bitrev_to_custom_native`	✅	11s	13s	-15%
`polyveck_decompose`	✅	11s	13s	-15%
`poly_add`	✅	10s	13s	-23%
`polyveck_caddq`	✅	9s	5s	+80%
`polyveck_shiftl`	✅	9s	9s	+0%
`polyvecl_ntt`	✅	9s	7s	+29%
`keccakf1600_permute`	✅	8s	9s	-11%
`mld_compute_pack_z`	✅	8s	7s	+14%
`mld_sample_s1_s2`	✅	8s	5s	+60%
`polyveck_invntt_tomont`	✅	8s	9s	-11%
`polyveck_power2round`	✅	8s	16s	-50%
`polyveck_sub`	✅	8s	7s	+14%
`polyz_unpack_c`	✅	8s	10s	-20%
`rej_eta_native`	✅	8s	4s	+100%
`unpack_sk`	✅	8s	9s	-11%
`keccak_absorb`	✅	7s	7s	+0%
`keccak_squeezeblocks_x4`	✅	7s	7s	+0%
`poly_decompose_c`	✅	7s	9s	-22%
`polyveck_ntt`	✅	7s	9s	-22%
`polyveck_reduce`	✅	7s	7s	+0%
`sign`	✅	7s	7s	+0%
`sign_pk_from_sk`	✅	7s	8s	-12%
`mld_sample_s1_s2_serial`	✅	6s	5s	+20%
`poly_ntt_c`	✅	6s	3s	+100%
`poly_power2round`	✅	6s	4s	+50%
`rej_eta`	✅	6s	5s	+20%
`sign_signature_extmu`	✅	6s	4s	+50%
`sign_signature_pre_hash_internal`	✅	6s	4s	+50%
`sign_signature_pre_hash_shake256`	✅	6s	5s	+20%
`keccakf1600_xor_bytes (big endian)`	✅	5s	4s	+25%
`keccakf1600x4_xor_bytes`	✅	5s	3s	+67%
`mld_value_barrier_i64`	✅	5s	3s	+67%
`ntt_native_x86_64`	✅	5s	3s	+67%
`poly_caddq`	✅	5s	4s	+25%
`poly_invntt_tomont_c`	✅	5s	6s	-17%
`poly_pointwise_montgomery_native`	✅	5s	3s	+67%
`poly_uniform_gamma1_4x`	✅	5s	5s	+0%
`poly_use_hint_native`	✅	5s	4s	+25%
`polyt0_pack`	✅	5s	2s	+150%
`polyveck_chknorm`	✅	5s	7s	-29%
`polyveck_pointwise_poly_montgomery`	✅	5s	5s	+0%
`polyveck_unpack_t0`	✅	5s	6s	-17%
`polyvecl_chknorm`	✅	5s	9s	-44%
`polyvecl_pointwise_acc_montgomery_native`	✅	5s	2s	+150%
`sign_signature`	✅	5s	4s	+25%
`sign_verify_pre_hash_internal`	✅	5s	5s	+0%
`sys_check_capability`	✅	5s	3s	+67%
`unpack_hints`	✅	5s	6s	-17%
`unpack_sig`	✅	5s	5s	+0%
`decompose`	✅	4s	4s	+0%
`intt_native_x86_64`	✅	4s	3s	+33%
`keccak_squeeze`	✅	4s	3s	+33%
`mld_ct_abs_i32`	✅	4s	2s	+100%
`mld_ct_cmask_nonzero_u32`	✅	4s	2s	+100%
`mld_ct_get_optblocker_u32`	✅	4s	3s	+33%
`mld_h`	✅	4s	5s	-20%
`mld_keccakf1600_extract_bytes`	✅	4s	5s	-20%
`mld_value_barrier_u8`	✅	4s	1s	+300%
`pack_sig_c`	✅	4s	7s	-43%
`pointwise_native_aarch64`	✅	4s	2s	+100%
`poly_caddq_c`	✅	4s	6s	-33%
`poly_challenge`	✅	4s	5s	-20%
`poly_chknorm`	✅	4s	2s	+100%
`poly_chknorm_native`	✅	4s	4s	+0%
`poly_decompose`	✅	4s	2s	+100%
`poly_decompose_native`	✅	4s	3s	+33%
`poly_shiftl`	✅	4s	3s	+33%
`poly_sub`	✅	4s	4s	+0%
`poly_uniform`	✅	4s	4s	+0%
`poly_uniform_eta`	✅	4s	6s	-33%
`poly_uniform_gamma1`	✅	4s	3s	+33%
`polyeta_pack`	✅	4s	3s	+33%
`polyt1_unpack`	✅	4s	3s	+33%
`polyveck_unpack_eta`	✅	4s	3s	+33%
`polyvecl_pack_eta`	✅	4s	6s	-33%
`polyvecl_unpack_eta`	✅	4s	3s	+33%
`polyz_unpack`	✅	4s	1s	+300%
`polyz_unpack_native`	✅	4s	3s	+33%
`rej_eta_c`	✅	4s	5s	-20%
`shake256_finalize`	✅	4s	3s	+33%
`shake256_squeeze`	✅	4s	3s	+33%
`sign_keypair_internal`	✅	4s	4s	+0%
`unpack_pk`	✅	4s	4s	+0%
`caddq`	✅	3s	3s	+0%
`fqscale`	✅	3s	1s	+200%
`keccak_f1600_x1_native_aarch64`	✅	3s	2s	+50%
`keccak_f1600_x1_native_aarch64_v84a`	✅	3s	1s	+200%
`keccak_finalize`	✅	3s	3s	+0%
`keccak_init`	✅	3s	2s	+50%
`keccakf1600x4_extract_bytes`	✅	3s	3s	+0%
`mld_ct_sel_int32`	✅	3s	1s	+200%
`mld_prepare_domain_separation_prefix`	✅	3s	2s	+50%
`mld_value_barrier_u32`	✅	3s	1s	+200%
`ntt_native_aarch64`	✅	3s	4s	-25%
`pack_sig_h_poly`	✅	3s	2s	+50%
`pack_sig_z`	✅	3s	2s	+50%
`pack_sk`	✅	3s	3s	+0%
`pointwise_native_x86_64`	✅	3s	3s	+0%
`poly_caddq_native`	✅	3s	5s	-40%
`poly_chknorm_native_aarch64`	✅	3s	6s	-50%
`poly_invntt_tomont`	✅	3s	3s	+0%
`poly_ntt_native`	✅	3s	1s	+200%
`polyt1_pack`	✅	3s	2s	+50%
`polyveck_pack_eta`	✅	3s	4s	-25%
`polyveck_pack_t0`	✅	3s	2s	+50%
`polyveck_pack_w1`	✅	3s	5s	-40%
`polyvecl_pointwise_acc_montgomery`	✅	3s	4s	-25%
`polyvecl_uniform_gamma1`	✅	3s	6s	-50%
`polyz_pack`	✅	3s	3s	+0%
`power2round`	✅	3s	5s	-40%
`reduce32`	✅	3s	4s	-25%
`shake128_init`	✅	3s	2s	+50%
`shake128_release`	✅	3s	2s	+50%
`shake128x4_absorb_once`	✅	3s	3s	+0%
`shake256_init`	✅	3s	4s	-25%
`sign_keypair`	✅	3s	2s	+50%
`sign_open`	✅	3s	7s	-57%
`sign_verify_extmu`	✅	3s	2s	+50%
`sign_verify_pre_hash_shake256`	✅	3s	3s	+0%
`use_hint`	✅	3s	2s	+50%
`keccak_f1600_x4_native_aarch64_v8a_scalar_hybrid`	✅	2s	3s	-33%
`keccak_f1600_x4_native_aarch64_v8a_v84a_scalar_hybrid`	✅	2s	3s	-33%
`keccakf1600_extract_bytes (big endian)`	✅	2s	8s	-75%
`keccakf1600_xor_bytes`	✅	2s	3s	-33%
`make_hint`	✅	2s	4s	-50%
`mld_ct_cmask_nonzero_u8`	✅	2s	2s	+0%
`mld_ct_get_optblocker_u8`	✅	2s	2s	+0%
`montgomery_reduce`	✅	2s	4s	-50%
`pack_pk`	✅	2s	5s	-60%
`poly_caddq_native_aarch64`	✅	2s	4s	-50%
`poly_invntt_tomont_native`	✅	2s	5s	-60%
`poly_make_hint`	✅	2s	4s	-50%
`poly_ntt`	✅	2s	4s	-50%
`poly_pointwise_montgomery`	✅	2s	3s	-33%
`poly_reduce`	✅	2s	2s	+0%
`poly_use_hint`	✅	2s	4s	-50%
`poly_use_hint_c`	✅	2s	5s	-60%
`polyvecl_permute_bitrev_to_custom`	✅	2s	3s	-33%
`polyvecl_uniform_gamma1_serial`	✅	2s	3s	-33%
`polyvecl_unpack_z`	✅	2s	3s	-33%
`polyw1_pack`	✅	2s	2s	+0%
`shake128_squeeze`	✅	2s	3s	-33%
`shake128x4_squeezeblocks`	✅	2s	2s	+0%
`shake256`	✅	2s	2s	+0%
`shake256_absorb`	✅	2s	3s	-33%
`shake256_release`	✅	2s	2s	+0%
`shake256x4_absorb_once`	✅	2s	2s	+0%
`shake256x4_squeezeblocks`	✅	2s	3s	-33%
`sign_verify`	✅	2s	3s	-33%
`keccak_f1600_x4_native_aarch64_v84a`	✅	1s	4s	-75%
`keccakf1600x4_permute`	✅	1s	4s	-75%
`mld_ct_cmask_neg_i32`	✅	1s	2s	-50%
`mld_ct_get_optblocker_i64`	✅	1s	3s	-67%
`shake128_absorb`	✅	1s	4s	-75%
`shake128_finalize`	✅	1s	2s	-50%

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>

hanno-becker · 2026-04-14T03:56:47Z

+    const mld_poly *a_kl = mld_polymat_get_poly_lazy(mat, i, 0);
+    mld_poly_pointwise_montgomery(&t->vec[i], a_kl, &v->vec[0]);


The CBMC spec+proof for mld_polymat_get_poly_lazy may be tricky because we return part of mat. It may be easier to remove the return value and have mld_polymat_get_poly_lazy write the desired element to mat->tmp (perhaps renamed to cur or out)?

You already have a separate pointer for the output poly, so my suggestion just reduces to using this directly rather than returning it.
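
A minimal sketch of that suggestion, with assumed names: the loader returns nothing and writes the element into a field of the struct (here called `cur`), and the call site reads that field directly. `toy_sample` is a hypothetical stand-in for the real sampling; none of the `toy_` identifiers exist in mldsa-native.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_N 4
#define TOY_K 2
#define TOY_L 3

typedef struct { int32_t coeffs[TOY_N]; } toy_poly;

typedef struct {
  uint8_t rho[4];
  toy_poly cur; /* hypothetical name for the single element buffer */
} toy_polymat_lazy;

/* Hypothetical stand-in for sampling A[k][l] from rho; not a real XOF. */
static void toy_sample(toy_poly *a, const uint8_t rho[4], unsigned k, unsigned l)
{
  unsigned i;
  for (i = 0; i < TOY_N; i++)
  {
    a->coeffs[i] = (int32_t)((rho[i % 4] + 7u * k + 13u * l + i) % 17u);
  }
}

/*
 * Loads A[k][l] into mat->cur. Nothing is returned, so a contract only
 * has to describe what is written to mat->cur, not a pointer into mat.
 */
static void toy_polymat_load_lazy(toy_polymat_lazy *mat, unsigned k, unsigned l)
{
  assert(k < TOY_K && l < TOY_L);
  toy_sample(&mat->cur, mat->rho, k, l);
}

/* Call site: load, then read the element straight from the struct. */
static int32_t toy_first_coeff(toy_polymat_lazy *mat, unsigned k, unsigned l)
{
  toy_polymat_load_lazy(mat, k, l);
  return mat->cur.coeffs[0];
}
```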

hanno-becker · 2026-04-14T04:09:06Z

 void mld_polyvecl_pointwise_acc_montgomery(mld_poly *w, const mld_polyvecl *u,
                                           const mld_polyvecl *v)


This also means that the corresponding backend functionality can be dropped in the reduced-RAM build?

hanno-becker · 2026-04-14T04:12:21Z

+  /*
+   * We generate four separate seed arrays rather than a single one to work
+   * around limitations in CBMC function contracts dealing with disjoint slices
+   * of the same parent object.
+   */


This comment appears outdated and can be removed on this occasion?

hanno-becker · 2026-04-14T04:13:40Z

+    mld_memcpy(seed_ext[j], rho, MLDSA_SEEDBYTES);
+  }
+
+#if !defined(MLD_CONFIG_SERIAL_FIPS202_ONLY) && !defined(MLD_CONFIG_REDUCE_RAM)


Now we only use this for !MLD_CONFIG_REDUCE_RAM, so it can be simplified to #if !defined(MLD_CONFIG_SERIAL_FIPS202_ONLY)?

hanno-becker · 2026-04-14T04:15:15Z

+    decreases(MLDSA_K - i)
+  )
+  {
+    const mld_polyvecl *row = mld_polymat_get_row_eager(mat, i);


There's little/no value in keeping this as a separate function now that we distinguish between full sampling and element-by-element sampling.

hanno-becker · 2026-04-14T04:20:16Z

+      mld_poly_pointwise_montgomery(&mat->tmp, a_kl, &v->vec[l]);
+      mld_poly_add(&t->vec[i], &mat->tmp);


Maybe leave a TODO note that we could get rid of mat->tmp here by multiplying in-place, assuming we strengthen the corresponding CBMC and HOL Light specs.
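
What the TODO would amount to can be sketched as follows, under the assumption that the pointwise multiplication is allowed to write its result over its first input. Whether the real `mld_poly_pointwise_montgomery` contracts permit such aliasing is exactly the open question in the comment; the `toy_` functions below use plain modular arithmetic and are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_N 4
#define TOY_Q 8380417 /* ML-DSA modulus */

typedef struct { int32_t coeffs[TOY_N]; } toy_poly;

/* a = a * b coefficient-wise mod q; the output aliases the first input. */
static void toy_pointwise_inplace(toy_poly *a, const toy_poly *b)
{
  unsigned i;
  assert(a != b);
  for (i = 0; i < TOY_N; i++)
  {
    a->coeffs[i] = (int32_t)(((int64_t)a->coeffs[i] * b->coeffs[i]) % TOY_Q);
  }
}

/* acc += b coefficient-wise, reduced mod q. */
static void toy_add(toy_poly *acc, const toy_poly *b)
{
  unsigned i;
  for (i = 0; i < TOY_N; i++)
  {
    acc->coeffs[i] = (int32_t)(((int64_t)acc->coeffs[i] + b->coeffs[i]) % TOY_Q);
  }
}

/*
 * One accumulation step, acc += a_kl * v_l, without a scratch polynomial.
 * a_kl is clobbered, which is harmless when the element buffer is
 * resampled before every use anyway.
 */
static void toy_acc_step(toy_poly *acc, toy_poly *a_kl, const toy_poly *v_l)
{
  toy_pointwise_inplace(a_kl, v_l);
  toy_add(acc, a_kl);
}
```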

hanno-becker · 2026-04-14T04:30:47Z

+static MLD_INLINE const mld_poly *mld_polymat_get_poly_lazy(
+    mld_polymat_lazy *mat, unsigned int k, unsigned int l)
+{
+  MLD_ALIGN uint8_t seed_ext[MLD_ALIGN_UP(MLDSA_SEEDBYTES + 2)];
+  mld_memcpy(seed_ext, mat->rho, MLDSA_SEEDBYTES);
+  seed_ext[MLDSA_SEEDBYTES + 0] = (uint8_t)l;
+  seed_ext[MLDSA_SEEDBYTES + 1] = (uint8_t)k;
+  mld_poly_uniform(&mat->poly_buffer, seed_ext);
+  mld_poly_permute_bitrev_to_custom_optional(&mat->poly_buffer);
+  /* @[FIPS204, Section 3.6.3] Destruction of intermediate values. */
+  mld_zeroize(seed_ext, sizeof(seed_ext));
+  return &mat->poly_buffer;
+}


Do we need this function? Can we inline it into its call-site and share its logic with that of eager matrix expansion?
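
One possible way to share logic, sketched under assumptions: a single helper builds the extended seed rho || l || k exactly as the quoted diff does, and both the element-by-element path and a row-wise expansion call it. The helper names, `TOY_L`, and the value 32 for the seed length are assumptions for illustration; the byte order of the two indices is taken from the diff above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SEEDBYTES 32 /* assumed to match MLDSA_SEEDBYTES */
#define TOY_L 3          /* toy number of matrix columns */

/* Builds rho || l || k, the per-element seed used in the quoted diff. */
static void toy_matrix_seed(uint8_t seed_ext[TOY_SEEDBYTES + 2],
                            const uint8_t rho[TOY_SEEDBYTES], unsigned k,
                            unsigned l)
{
  assert(k <= UINT8_MAX && l <= UINT8_MAX);
  memcpy(seed_ext, rho, TOY_SEEDBYTES);
  seed_ext[TOY_SEEDBYTES + 0] = (uint8_t)l;
  seed_ext[TOY_SEEDBYTES + 1] = (uint8_t)k;
}

/* Row-wise use: prepare all seeds of row k with the same helper. */
static void toy_row_seeds(uint8_t seeds[TOY_L][TOY_SEEDBYTES + 2],
                          const uint8_t rho[TOY_SEEDBYTES], unsigned k)
{
  unsigned l;
  for (l = 0; l < TOY_L; l++)
  {
    toy_matrix_seed(seeds[l], rho, k, l);
  }
}
```

The lazy path would call `toy_matrix_seed` once per element at its call site; in real code the seed buffer would still need to be zeroized afterwards, as the diff does.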

hanno-becker

I'm worried about the complexity we're building up here. There are (too) many small functions which make the code very hard to follow and which, I believe, aren't all necessary -- see comments. Let's see if we can clean this up a bit more before merging.

In principle, I'm OK with the optimization, though I don't think it's necessary to meet a 32K RAM target -- assuming all the other optimizations get merged, it seems like the row-by-row expansion is already enough? The latter is less intrusive and more performant since it allows you to still use the faster vector-vector scalar product.

mkannwischer · 2026-04-15T14:00:56Z

> In principle, I'm OK with the optimization, though I don't think it's necessary to meet a 32K RAM target -- assuming all the other optimizations get merged, it seems like the row-by-row expansion is already enough? The latter is less intrusive and more performant since it allows you to still use the faster vector-vector scalar product.

If speed is a goal, the first REDUCE_RAM optimization to drop is the recomputation of y (#1031). Keeping y in memory costs one L-polyvec of RAM but saves a lot of Keccak invocations inside the main signing loop. I'm fairly sure this outweighs the gains made by vector-vector polymul on most platforms.
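
For a rough sense of the buffer sizes being traded off in this thread, the sketch below computes them from the parameter sets. It assumes 256 coefficients stored as 32-bit integers per polynomial and (K, L) = (4, 4), (6, 5), (8, 7) for ML-DSA-44/65/87; the helper names are hypothetical and the figures ignore the seed, scratch polynomials and alignment padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_N 256 /* coefficients per polynomial in ML-DSA */

/* Size of one polynomial, assuming int32_t coefficients. */
static size_t toy_poly_bytes(void)
{
  return TOY_N * sizeof(int32_t);
}

/* Full K x L matrix, as stored by an eager expansion. */
static size_t toy_full_matrix_bytes(unsigned k, unsigned l)
{
  assert(k > 0 && l > 0);
  return (size_t)k * l * toy_poly_bytes();
}

/* One matrix row, or equally one L-polyvec such as y. */
static size_t toy_row_bytes(unsigned l)
{
  assert(l > 0);
  return (size_t)l * toy_poly_bytes();
}
```

Under these assumptions a single element buffer is 1 KiB, a row or L-polyvec is 4, 5 or 7 KiB, and the full matrix is 16, 30 or 56 KiB for ML-DSA-44, 65 and 87 respectively.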

mkannwischer added the low-ram label Apr 5, 2026

mkannwischer force-pushed the lowram-stream-a branch 2 times, most recently from 7013003 to 5ccd3c2 Compare April 5, 2026 07:10

mkannwischer force-pushed the lowram-stream-a branch 4 times, most recently from 3acbf60 to 9ba57ff Compare April 5, 2026 12:59

mkannwischer marked this pull request as ready for review April 6, 2026 00:48

mkannwischer requested a review from a team as a code owner April 6, 2026 00:48

mkannwischer assigned hanno-becker Apr 6, 2026

mkannwischer mentioned this pull request Apr 7, 2026

lowmem: Unpack z lazily in verification #1025

Open

mkannwischer force-pushed the lowram-stream-a branch from 9ba57ff to bb6bd8b Compare April 8, 2026 07:09

This was referenced Apr 8, 2026

lowram: Per-row t0/t1 computation in keygen #1030

Open

lowram: Eliminate y vector in REDUCE_RAM mode in sign #1031

Open

mkannwischer commented Apr 9, 2026

Comment thread mldsa/src/polyvec_lazy.h

hanno-becker reviewed Apr 14, 2026

hanno-becker requested changes Apr 14, 2026
