(Performance) Optimized x86 and generic q1_0(_g128) dot#10

Open
pl752 wants to merge 7 commits into PrismML-Eng:prism from pl752:perf/q1_0_g128_no_nofma

Conversation

@pl752 pl752 commented Apr 3, 2026

Hello!
This is yet another PR fixing the float-to-int truncation bug and optimizing CPU inference.

In this case I have:

  • Replaced a ton of bit-masking operations, removed a redundant float multiplication, and unrolled the hot inner loop with constant masks for sign-aware accumulation in the arch-agnostic fallback
  • Introduced paths that fill the gap between the default fallback and AVX-512-capable CPUs
  • Ran tests to make sure the optimizations have no effect on precision/correctness
  • Performed various experiments (most yielded worse performance), including:
      • a branchless variant of the unroll
      • various register and superscalar pipeline pressure options (the AVX2 path uses a doubled accumulation flow)
      • AVX-512 VNNI
      • explicitly precomputed masks for SIMD
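The constant-mask unroll in the fallback can be sketched roughly like this. Note this is a hypothetical illustration, not the actual prism kernel: the function name, the packing layout (8 sign bits per byte), and the per-block scale are all assumptions. The point is that unrolling with compile-time-constant masks removes variable shifts from the hot loop and lets the compiler fold each mask test into a conditional add/subtract:

```c
#include <stdint.h>

// Hypothetical sketch of the sign-unroll idea (illustrative, not the
// actual prism q1_0 kernel): each packed byte holds 8 sign bits; the
// inner loop is unrolled so every bit position uses a constant mask.
static float dot_sign_block(const uint8_t *signs, const float *x,
                            int n /* multiple of 8 */, float scale) {
    float acc = 0.0f;
    for (int i = 0; i < n; i += 8) {
        const uint8_t s = signs[i / 8];
        // Unrolled with compile-time-constant masks: no variable shifts.
        acc += (s & 0x01) ? -x[i + 0] : x[i + 0];
        acc += (s & 0x02) ? -x[i + 1] : x[i + 1];
        acc += (s & 0x04) ? -x[i + 2] : x[i + 2];
        acc += (s & 0x08) ? -x[i + 3] : x[i + 3];
        acc += (s & 0x10) ? -x[i + 4] : x[i + 4];
        acc += (s & 0x20) ? -x[i + 5] : x[i + 5];
        acc += (s & 0x40) ? -x[i + 6] : x[i + 6];
        acc += (s & 0x80) ? -x[i + 7] : x[i + 7];
    }
    return scale * acc;  // one scale multiply per call, not per element
}
```

Hoisting the scale multiply out of the loop is also what "removed redundant float multiplication" refers to in spirit: the per-element work stays in pure adds/subtracts.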

Note that this PR is built on top of #3 by @jordankzf, who implemented the AVX-512 workflow.

Benchmarks were performed with:

  • CPU: AMD Ryzen 5 7640HS (at 65 W)
  • WSL VM
  • LPDDR5 @ 6400 MT/s (JEDEC timings)
  • Model: Bonsai-1.7B.gguf (Q1_0_g128)
  • Threads: 6

| Flow | pp 512 t/s | tg 128 t/s | Speedup | Notes |
|---|---|---|---|---|
| Initial* | 1.59 | 0.85 | 1.0x / 1.0x | Slow |
| Scalar | 9.57 | 7.06 | 6.0x / 8.3x | Explicit byte-oriented unroll |
| SSSE3 | 26.13 | 19.51 | 16.5x / 22.9x | 128-bit specialization |
| AVX | 34.99 | 27.31 | 22.1x / 32.1x | Mixed-width specialization |
| AVX2 + FMA | 80.02 | 51.46 | 50.4x / 60.5x | 256-bit specialization |
| AVX512BW | 97.16 | 60.88 | 61.3x / 71.5x | Leverages new SIMD extensions** |

  • \* Extrapolated from pp 32 / tg 16 (1.659 t/s pp, 0.862 t/s tg), as I was impatient.
  • \*\* The new SIMD instruction kinds improve performance even on AMD Zen 4's AVX-512 implementation, which double-pumps a 256-bit pipeline instead of implementing a full 512-bit one.
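To give a feel for how the 128-bit paths avoid per-element branches, here is a rough sketch of applying packed sign bits by expanding them into full-width float sign masks and XOR-ing. This is hypothetical SSE2 code for illustration only (it is not the SSSE3 kernel from this PR, and the function name and bit layout are assumptions):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical SSE2 sketch (not the PR's SSSE3 kernel): expand 4 packed
// sign bits into per-lane 0x80000000 masks, then negate via one XOR.
static float dot_sign4_sse2(uint8_t signs4 /* low 4 bits used */,
                            const float *x /* 4 floats */) {
    const __m128i bits = _mm_set_epi32(8, 4, 2, 1);   // bit i -> lane i
    __m128i b    = _mm_set1_epi32(signs4);
    __m128i hit  = _mm_cmpeq_epi32(_mm_and_si128(b, bits), bits);
    __m128i mask = _mm_and_si128(hit, _mm_set1_epi32((int)0x80000000u));
    // XOR with the sign mask flips the float sign bit where selected.
    __m128 v = _mm_xor_ps(_mm_loadu_ps(x), _mm_castsi128_ps(mask));
    // Horizontal sum of the 4 lanes.
    __m128 sh = _mm_add_ps(v, _mm_movehl_ps(v, v));
    sh = _mm_add_ss(sh, _mm_shuffle_ps(sh, sh, 0x55));
    return _mm_cvtss_f32(sh);
}
```

The real kernels batch far more lanes and fuse the scale with FMA, but the branchless sign application is the same basic trick.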

I would appreciate your feedback

jordankzf and others added 7 commits April 3, 2026 13:07
The Q1_0_g128 vec_dot kernel had a bug where `sumi` was declared as
`int` but accumulated `float` partial products (`d1 * sumi_block`),
causing float-to-int truncation that destroyed dot product results
and produced gibberish output on CPU.
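The bug reduces to a minimal sketch (illustrative names, not the actual prism source; the cast below is implicit in the original code):

```c
// Minimal reproduction of the float-to-int truncation bug: an int
// accumulator truncates each float partial product toward zero.
float vec_dot_fixed(int nblocks, const float *d, const int *sumi_block) {
    int   sumi = 0;     // BUG: truncates d[i] * sumi_block[i] per step
    float sumf = 0.0f;  // FIX: accumulate in float
    for (int i = 0; i < nblocks; ++i) {
        sumi += (int)(d[i] * sumi_block[i]); // cast is implicit upstream
        sumf += d[i] * sumi_block[i];
    }
    (void)sumi; // with scales d[i] < 1 the buggy sum collapses to ~0
    return sumf;
}
```

With typical block scales well below 1.0, every partial product truncates to 0, which is why the output degraded to gibberish rather than merely losing precision.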

Additionally, the x86 kernel was purely scalar (one bit at a time).
This adds an AVX-512BW path that processes 32 elements per iteration
using mask_sub + madd + fma, with a single horizontal reduction at
the end.

Benchmarks (Bonsai-8B, CPU-only, AVX-512):
  Before:  0.73 t/s prompt, 0.65 t/s generation (gibberish output)
  After:  23.2 t/s prompt, 13.5 t/s generation (coherent output)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the ggml label Apr 3, 2026