Improve SGEMM, DGEMM, CGEMM and ZGEMM kernels for RISC-V ZVL128B and ZVL256B#1
Open
Improve SGEMM, DGEMM, CGEMM and ZGEMM kernels for RISC-V ZVL128B and ZVL256B#1
Conversation
added 6 commits
December 5, 2025 20:21
These kernels use the same algorithm as the original ones, but improve on them in several respects: - Common code is shared between the two kernel types and architectures, reducing the overall lines of code. - Instead of using auto-generated code with each vector operation repeated multiple times (necessitated in part due to the vector types not being storable in arrays), macros and forced inlining are used to achieve the same effect directly for better clarity and a major reduction in lines of code. - The tiling sizes can be modified within a certain range by modifying the unroll parameters. - The RVV extension is not restricted to having the number of elements in a vector register being a power of two, so the tails of the matrix involving vector operations can be dealt with in a single operation rather than in decreasing powers of two, thereby improving performance. The tcopy operations need to be modified to take this into account.
…6B architectures This applies the changes made to the SGEMM/DGEMM kernels to their complex CGEMM/ZGEMM equivalents.
The original kernels have been deleted with this commit.
…sizes for RISC-V The unroll factor of the TRMM kernels are currently set to those of the equivalent GEMM kernels. As we are not dealing with the TRMM kernels for now, I have added extra xTRMM_UNROLL_(M|N) parameters to allow them to be set independently of the xGEMM_UNROLL_(M|N) parameters. If the new TRMM parameter is not defined, then it defaults back to the original behaviour of using the GEMM parameters.
…tecture Testing has shown that 4x8 or 4x4 performs better than the original 8x8 tiling size for this kernel. As we do not wish to perturb the behaviour of the DTRMM kernel at this point, the DTRMM tiling size is explicitly set to the original 8x8.
If DYNAMIC_ARCH is enabled, then the various xGEMM_UNROLL_(M|N) macros expand to 'gotoblas-><parameter>', which prevent their use in macro conditionals as the value is only defined at runtime. To get around this, we use the xGEMM_UNROLL_(M|N)_DEFAULT macros instead, which should expand to a compile-time constant. As the tiling factors used in the tcopy/ncopy functions are selected by these at compile time as well, it does not result in reduced functionality as changing the tiling at runtime without changing tcopy/ncopy would result in incorrect results anyway.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
These patches improve the performance of the kernels by making use of variable-length vectors in the RISC-V vector extensions.
OpenBLAS generally assumes that SIMD instructions work only with a number of elements in powers of 2, so for example when dealing with a tail of length 7, it is expected that we will do a vector operation with 4 elements, then with 2, and finally 1. On RISC-V this is wasteful as this could be done in just one operation instead of three.
OpenBLAS reorders the matrix layout before it reaches the kernel in order to optimise cache usage, so this needs to be modified first. As only the LHS matrix is read into vector registers, only the tcopy operation needs to be modified to present the tail in one chunk instead of breaking it down into powers of 2.
In the kernels, we can then simply set the vector length on each instruction (which is presented as the last argument in the intrinsics) to the tile size for most of the computation and the remainder in the tail.
There is a complication in that vector types like vfloat32m1_t cannot be placed into arrays by the compiler (presumably because they represent actual registers, not areas of memory), so if we want the same vector operation done N times to N different vectors, it needs to be written out N times instead of using a for loop. This was originally done using a Python code generator to autogenerate some very repetitive code, but for ease of experimentation I moved to using macros instead.
A RISCV_REPEAT macro is used to repeat a vector instruction N times. However, as no loops are allowed, this would necessitate at least one conditional jump (if we use a switch on N with fallthrough reverse-ordered cases) which causes a significant performance penalty on such a tight loop. Instead, I ensured that N is set to a compile-time constant at the call site and force inline it, relying on the compiler's constant propagation and dead-code elimination to remove any extraneous calls.
With this approach, the lines of code in the kernel are dramatically decreased, and I believe is much more readable. I further decreased the numbers of code by factoring out the common code in SGEMM & DGEMM, CGEMM & ZGEMM, gemm_tcopy_* and zgemm_tcopy_* into separate files and including it as required.