lowram: Stream matrix A element-by-element to reduce memory#1019
lowram: Stream matrix A element-by-element to reduce memory#1019mkannwischer wants to merge 1 commit intomainfrom
Conversation
7013003 to
5ccd3c2
Compare
CBMC Results (ML-DSA-65)
Full Results (184 proofs)
|
CBMC Results (ML-DSA-44)Full Results (184 proofs)
|
CBMC Results (ML-DSA-87)
Full Results (184 proofs)
|
3acbf60 to
9ba57ff
Compare
Replace the row-level matrix buffer (mld_polyvecl) with a single-poly buffer in REDUCE_RAM mode. In the lazy path, matrix elements A[k][l] are sampled on demand one at a time, and the matrix-vector product accumulates element-by-element instead of row-by-row. Restructure polymat into eager/lazy variants following the same pattern as s1hat/s2hat/t0hat: - mld_polymat_eager: stores full K x L matrix - mld_polymat_lazy: stores rho + single poly_buffer + tmp - mld_polyvec_matrix_expand_eager/_lazy: separate implementations - mld_polyvec_matrix_pointwise_montgomery_eager/_lazy: separate implementations with CBMC contracts only on the eager variants Move all polymat-related code from polyvec.h/polyvec.c into polyvec_lazy.h/polyvec_lazy.c. Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
9ba57ff to
bb6bd8b
Compare
| const mld_poly *a_kl = mld_polymat_get_poly_lazy(mat, i, 0); | ||
| mld_poly_pointwise_montgomery(&t->vec[i], a_kl, &v->vec[0]); |
There was a problem hiding this comment.
The CBMC spec+proof for mld_polymat_get_poly_lazy may be tricky because we return part of mat. It may be easier to remove the return value and have mld_polymat_get_poly_lazy write the desired element to mat->tmp (perhaps renamed to cur or out)?
There was a problem hiding this comment.
You already have a separate pointer for the output poly, so my suggestion just reduces to using this directly rather than returning it.
| void mld_polyvecl_pointwise_acc_montgomery(mld_poly *w, const mld_polyvecl *u, | ||
| const mld_polyvecl *v) |
There was a problem hiding this comment.
This also means that the corresponding backend functionality can be dropped in the reduced-RAM build?
| /* | ||
| * We generate four separate seed arrays rather than a single one to work | ||
| * around limitations in CBMC function contracts dealing with disjoint slices | ||
| * of the same parent object. | ||
| */ |
There was a problem hiding this comment.
This comment appears outdated and can be removed on this occasion?
| mld_memcpy(seed_ext[j], rho, MLDSA_SEEDBYTES); | ||
| } | ||
|
|
||
| #if !defined(MLD_CONFIG_SERIAL_FIPS202_ONLY) && !defined(MLD_CONFIG_REDUCE_RAM) |
There was a problem hiding this comment.
Now we only use this for !MLD_CONFIG_REDUCE_RAM, so it can be simplified to #if !defined(MLD_CONFIG_SERIAL_FIPS202_ONLY)?
| decreases(MLDSA_K - i) | ||
| ) | ||
| { | ||
| const mld_polyvecl *row = mld_polymat_get_row_eager(mat, i); |
There was a problem hiding this comment.
There's little/no value in keeping this as a separate function now that we distinguish between full sampling and element-by-element
| mld_poly_pointwise_montgomery(&mat->tmp, a_kl, &v->vec[l]); | ||
| mld_poly_add(&t->vec[i], &mat->tmp); |
There was a problem hiding this comment.
Maybe leave a TODO note that we could get rid of mat->tmp here by multiplying in-place, assuming we strengthen the corresponding CBMC and HOL Light specs.
| static MLD_INLINE const mld_poly *mld_polymat_get_poly_lazy( | ||
| mld_polymat_lazy *mat, unsigned int k, unsigned int l) | ||
| { | ||
| MLD_ALIGN uint8_t seed_ext[MLD_ALIGN_UP(MLDSA_SEEDBYTES + 2)]; | ||
| mld_memcpy(seed_ext, mat->rho, MLDSA_SEEDBYTES); | ||
| seed_ext[MLDSA_SEEDBYTES + 0] = (uint8_t)l; | ||
| seed_ext[MLDSA_SEEDBYTES + 1] = (uint8_t)k; | ||
| mld_poly_uniform(&mat->poly_buffer, seed_ext); | ||
| mld_poly_permute_bitrev_to_custom_optional(&mat->poly_buffer); | ||
| /* @[FIPS204, Section 3.6.3] Destruction of intermediate values. */ | ||
| mld_zeroize(seed_ext, sizeof(seed_ext)); | ||
| return &mat->poly_buffer; | ||
| } |
There was a problem hiding this comment.
Do we need this function? Can we inline it into its call-site and share its logic with that of eager matrix expansion?
hanno-becker
left a comment
There was a problem hiding this comment.
I'm worried about the complexity we're building up here. There are (too) many small functions which make the code very difficult to oversee and which, I believe, aren't all necessary -- see comments. Let's see if we can clean this up a bit more before merging.
In principle, I'm OK with the optimization, though I don't think it's necessary to meet a 32K RAM target -- assuming all the other optimizations get merged, it seems like the row-by-row expansion is already enough? The latter is less intrusive and more performant since it allows you to still use the faster vector-vector scalar product.
If speed is a goal, the first optimization in |
Replace the row-level matrix buffer (mld_polyvecl) with a single-poly
buffer in REDUCE_RAM mode. In the lazy path, matrix elements A[k][l]
are sampled on demand one at a time, and the matrix-vector product
accumulates element-by-element instead of row-by-row.
Restructure polymat into eager/lazy variants following the same pattern
as s1hat/s2hat/t0hat:
implementations with CBMC contracts only on the eager variants
Move all polymat-related code from polyvec.h/polyvec.c into
polyvec_lazy.h/polyvec_lazy.c.