(Performance) Optimized x86 and generic q1_0(_g128) dot#10

Open
pl752 wants to merge 7 commits into PrismML-Eng:prism from pl752:perf/q1_0_g128_no_nofma

Conversation

@pl752 pl752 commented Apr 3, 2026

Hello!
This is yet another PR fixing the float-to-int truncation bug and optimizing CPU inference.

In this case I have:

  • Replaced a ton of bit-masking operations, removed a redundant float multiplication, and unrolled the hot inner loop with constant masks for sign-aware accumulation in the arch-agnostic fallback
  • Introduced paths that fill the gap between the default fallback and AVX-512-capable CPUs
  • Ran tests to make sure the optimizations have no effect on precision/correctness
  • Performed various experiments (most yielded worse performance), including:
      • a branchless variant of the unroll
      • various register and superscalar pipeline pressure options (the AVX2 path uses a doubled accumulation flow)
      • AVX-512 VNNI
      • explicitly precomputed masks for SIMD
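The constant-mask unroll in the fallback can be sketched roughly like this. Note this is a hypothetical illustration, not the actual prism kernel: the function name, the packing layout (8 sign bits per byte), and the per-block scale are all assumptions. The point is that unrolling with compile-time-constant masks removes variable shifts from the hot loop and lets the compiler fold each mask test into a conditional add/subtract:

```c
#include <stdint.h>

// Hypothetical sketch of the sign-unroll idea (illustrative, not the
// actual prism q1_0 kernel): each packed byte holds 8 sign bits; the
// inner loop is unrolled so every bit position uses a constant mask.
static float dot_sign_block(const uint8_t *signs, const float *x,
                            int n /* multiple of 8 */, float scale) {
    float acc = 0.0f;
    for (int i = 0; i < n; i += 8) {
        const uint8_t s = signs[i / 8];
        // Unrolled with compile-time-constant masks: no variable shifts.
        acc += (s & 0x01) ? -x[i + 0] : x[i + 0];
        acc += (s & 0x02) ? -x[i + 1] : x[i + 1];
        acc += (s & 0x04) ? -x[i + 2] : x[i + 2];
        acc += (s & 0x08) ? -x[i + 3] : x[i + 3];
        acc += (s & 0x10) ? -x[i + 4] : x[i + 4];
        acc += (s & 0x20) ? -x[i + 5] : x[i + 5];
        acc += (s & 0x40) ? -x[i + 6] : x[i + 6];
        acc += (s & 0x80) ? -x[i + 7] : x[i + 7];
    }
    return scale * acc;  // one scale multiply per call, not per element
}
```

Hoisting the scale multiply out of the loop is also what "removed redundant float multiplication" refers to in spirit: the per-element work stays in pure adds/subtracts.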

Note that this PR is built on top of #3 by @jordankzf, who implemented the AVX-512 workflow.

Benchmarks were performed with:

  • CPU: AMD Ryzen 5 7640HS (at 65 W)
  • WSL VM
  • LPDDR5 @ 6400 MT/s (JEDEC timings)
  • Model: Bonsai-1.7B.gguf (Q1_0_g128)
  • Threads: 6

| Flow | pp 512 t/s | tg 128 t/s | Speedup | Notes |
|---|---|---|---|---|
| Initial* | 1.59 | 0.85 | 1.0x / 1.0x | Slow |
| Scalar | 9.57 | 7.06 | 6.0x / 8.3x | Explicit byte-oriented unroll |
| SSSE3 | 26.13 | 19.51 | 16.5x / 22.9x | 128-bit specialization |
| AVX | 34.99 | 27.31 | 22.1x / 32.1x | Mixed-width specialization |
| AVX2 + FMA | 80.02 | 51.46 | 50.4x / 60.5x | 256-bit specialization |
| AVX512BW | 97.16 | 60.88 | 61.3x / 71.5x | Leverages new SIMD extensions** |

  • \* Extrapolated from pp 32 / tg 16 (1.659 t/s pp, 0.862 t/s tg), as I was impatient.
  • \*\* The new SIMD instruction kinds improve performance even on AMD Zen 4's AVX-512 implementation, which double-pumps a 256-bit pipeline instead of implementing a full 512-bit one.
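To give a feel for how the 128-bit paths avoid per-element branches, here is a rough sketch of applying packed sign bits by expanding them into full-width float sign masks and XOR-ing. This is hypothetical SSE2 code for illustration only (it is not the SSSE3 kernel from this PR, and the function name and bit layout are assumptions):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical SSE2 sketch (not the PR's SSSE3 kernel): expand 4 packed
// sign bits into per-lane 0x80000000 masks, then negate via one XOR.
static float dot_sign4_sse2(uint8_t signs4 /* low 4 bits used */,
                            const float *x /* 4 floats */) {
    const __m128i bits = _mm_set_epi32(8, 4, 2, 1);   // bit i -> lane i
    __m128i b    = _mm_set1_epi32(signs4);
    __m128i hit  = _mm_cmpeq_epi32(_mm_and_si128(b, bits), bits);
    __m128i mask = _mm_and_si128(hit, _mm_set1_epi32((int)0x80000000u));
    // XOR with the sign mask flips the float sign bit where selected.
    __m128 v = _mm_xor_ps(_mm_loadu_ps(x), _mm_castsi128_ps(mask));
    // Horizontal sum of the 4 lanes.
    __m128 sh = _mm_add_ps(v, _mm_movehl_ps(v, v));
    sh = _mm_add_ss(sh, _mm_shuffle_ps(sh, sh, 0x55));
    return _mm_cvtss_f32(sh);
}
```

The real kernels batch far more lanes and fuse the scale with FMA, but the branchless sign application is the same basic trick.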

I would appreciate your feedback

jordankzf and others added 7 commits April 3, 2026 13:07
The Q1_0_g128 vec_dot kernel had a bug where `sumi` was declared as
`int` but accumulated `float` partial products (`d1 * sumi_block`),
causing float-to-int truncation that destroyed dot product results
and produced gibberish output on CPU.
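The bug reduces to a minimal sketch (illustrative names, not the actual prism source; the cast below is implicit in the original code):

```c
// Minimal reproduction of the float-to-int truncation bug: an int
// accumulator truncates each float partial product toward zero.
float vec_dot_fixed(int nblocks, const float *d, const int *sumi_block) {
    int   sumi = 0;     // BUG: truncates d[i] * sumi_block[i] per step
    float sumf = 0.0f;  // FIX: accumulate in float
    for (int i = 0; i < nblocks; ++i) {
        sumi += (int)(d[i] * sumi_block[i]); // cast is implicit upstream
        sumf += d[i] * sumi_block[i];
    }
    (void)sumi; // with scales d[i] < 1 the buggy sum collapses to ~0
    return sumf;
}
```

With typical block scales well below 1.0, every partial product truncates to 0, which is why the output degraded to gibberish rather than merely losing precision.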

Additionally, the x86 kernel was purely scalar (one bit at a time).
This adds an AVX-512BW path that processes 32 elements per iteration
using mask_sub + madd + fma, with a single horizontal reduction at
the end.

Benchmarks (Bonsai-8B, CPU-only, AVX-512):
  Before:  0.73 t/s prompt, 0.65 t/s generation (gibberish output)
  After:  23.2 t/s prompt, 13.5 t/s generation (coherent output)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the ggml label Apr 3, 2026