fix: Q1_0_g128 x86 CPU kernel - correct output + AVX2/AVX-512 VNNI #6
stfurkan wants to merge 2 commits into PrismML-Eng:prism
Conversation
Pull request overview
Fixes incorrect Q1_0_g128 × Q8_0 dot-product results on x86 by correcting float/int accumulation and introducing optimized x86 implementations (AVX-512 VNNI, AVX2, scalar fallback) while keeping behavior consistent with the working ARM path.
Changes:
- Fix scalar generic kernel accumulation to avoid float-to-int truncation.
- Replace x86 scalar-only kernel with AVX-512 VNNI and AVX2 implementations plus corrected scalar fallback.
- Simplify bit extraction logic in scalar paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| ggml/src/ggml-cpu/quants.c | Fixes generic scalar vec_dot accumulation for correct numerical results. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Adds AVX-512 VNNI + AVX2 kernels and fixes scalar fallback accumulation for x86. |
This looks great, thanks. There were a few CPU kernel fixes I did not see until I pushed my changes. For now I removed the buggy x86 kernel and will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8
The Q1_0_g128 vec_dot kernel for x86 produces garbage output due to a
float-to-int truncation bug: `sumi += d1 * sumi_block` accumulates a
float product into an int, silently truncating the result to zero for
small scale factors. This affects both the generic scalar fallback and
the x86 arch-specific implementation.
The ARM NEON implementation was correct and unaffected.
Changes:
- Fix generic scalar kernel (quants.c): accumulate `d0 * d1 * sumi`
into float, matching the working ARM scalar fallback pattern
- Replace x86 scalar-only kernel with three-tier implementation:
1. AVX-512 VNNI (BW+VL+VNNI): uses mask registers for single-
instruction bit expansion + VPDPBUSD for dot product
2. AVX2: shuffle-based bit expansion + sign_epi8 multiply
3. Scalar fallback: corrected accumulation
Benchmarks on AMD EPYC (Zen 4, 12 vCPU shared):
Before (broken): garbage output at ~0.5 tok/s
Scalar fix: correct output at ~3 tok/s
AVX2: correct output at ~28 tok/s
AVX-512 VNNI: correct output at ~50 tok/s (1.7B model)
Force-pushed ba0e521 to 0b7a2dd
@khosravipasha Thanks! I rebased the branch on top of your cpu-fixes merge. The KL divergence results are below; the AVX-512 VNNI kernel matches F16 almost exactly (99.949% same top p, near-zero KL divergence).

KL Divergence Results (Q1_0_g128 vs F16)
AMD EPYC Zen 4 (AVX-512 VNNI kernel), Bonsai-1.7B, wikitext-2-raw, 100 chunks, ctx 512
Q1_0 (non-g128) GGUF doesn't appear to be published on HuggingFace, only the g128 variant. Full log
Thanks, looks good, it's close to 0. It's okay for Q1_0; we won't be using it. Also it seems the llama.cpp people don't like the Q1_0_g128 naming, so most likely we will rename Q1_0_g128 => Q1_0 and remove the old Q1_0 in llama.cpp's main repo in the future.
Here are the pp512/tg128 benchmarks for all three models:

Benchmarks (pp512 / tg128)
AMD EPYC Zen 4 (12 vCPU shared), AVX-512 VNNI kernel, 12 threads, BLAS backend, shared vCPU (Hetzner CPX52).

Note these are with all threads on a single model; in production I run all 3 simultaneously with 4 threads each, which gives roughly half these numbers. Good to know about the Q1_0_g128 → Q1_0 rename. Thanks @khosravipasha
This is slower than PR #7 on my i5 box with AVX2 only. llama.cpp builds with PR #6 and PR #7:
- AVX2: replace manual int8→int16→int32 reduction with mul_sum_i8_pairs_float() (auto-selects AVXVNNI dpbssd on supported CPUs)
- Both paths: accumulate into __m256 float via fmadd_ps, single hsum_float_8 at end (eliminates per-block horizontal int32 sum)
- Remove unused variables and constants
@zcattacz Thanks for testing! I've updated the AVX2 path: replaced the manual int8→int16→int32 reduction chain with mul_sum_i8_pairs_float().

Updated benchmarks (AMD EPYC Zen 4, 12 vCPU shared):
tg128 improved ~20-30% over the previous version. Would be great to see your i5 numbers with this update if you get a chance.
Builds on the scalar fix from #8 (cpu-fixes), which corrected the float-to-int truncation bug by changing `int sumi` to `float sumi`. That fix produces correct output but falls back to scalar code on x86 (~3 tok/s). This PR adds SIMD-optimized x86 kernels for Q1_0_g128 to bring x86 CPU performance closer to what ARM NEON achieves.
Changes
- arch/x86/quants.c: replace the generic scalar delegation with a three-tier SIMD implementation:
  1. AVX-512 VNNI: `maskz_mov_epi8` for single-instruction bit expansion + `VPDPBUSD` for dot product accumulation
  2. AVX2: shuffle-based bit expansion + `sign_epi8` multiply
  3. Scalar fallback: corrected accumulation (replaces the delegation to `ggml_vec_dot_q1_0_g128_q8_0_generic`)
- quants.c: minor cleanup: simplified inner loop using direct bit extraction (`bits[j >> 3] >> (j & 7)`) and single-level float accumulation; `memcpy` for strict-aliasing and alignment safety

ARM NEON and CUDA/Metal paths are untouched.
Benchmarks
Hetzner CPX52 (12 vCPU AMD EPYC Zen 4, shared, 24GB RAM)
All models produce correct output. Prompt processing sees similar gains (44.8 / 23.6 / 12.8 tok/s respectively).
Live demo: https://ai.sft.best (temporary, may be taken down)