fix: Q1_0_g128 x86 CPU kernel — float truncation + AVX2 vectorization #7
wildcattrio wants to merge 1 commit into PrismML-Eng:prism from
Conversation
The Q1_0_g128 x86 kernel has two bugs causing gibberish output at 0.25 tok/s on Intel CPUs:

1. Float-to-int truncation: the per-block accumulator was `int`, truncating `d1 * sumi_block` (float * int → float → int). Each Q8_0 block's scale factor was rounded to 0 or ±1, destroying the output. Fix: a `float block_sum` accumulator.
2. No SIMD: the x86 path was scalar-only while ARM NEON had full vectorization. Added AVX2 using the same broadcast/shuffle/cmpeq pattern from the existing Q1_0 kernel plus `mul_sum_i8_pairs_float`.

Results on i5-1135G7 with Bonsai 8B:

- Before (MSVC): 0.25 tok/s, gibberish output
- Bug fix only: 3.7 tok/s, correct output
- Bug fix + AVX2: 6.9 tok/s, correct output

Both the x86-specific kernel (`arch/x86/quants.c`) and the generic fallback (`quants.c`) are fixed.
This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I have removed the buggy x86 path and will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8
Running 8B on an i5 box with this PR, I get consistent
After swapping it out, I get consistent
The alternative code below, on i5 Broadwell, gives:
The SSE path takes 0.1 tps → 0.7~0.9 tps on an N2840 Atom. AI suggested that meaningful acceleration for 1bitnet on CPUs lacking AVX instructions could only be achieved by implementing the dot product as
KL Divergence Results — x86 AVX2 (PR #7)

Ran the KL divergence tests from PR #8 on the AVX2 kernel fix from this PR.

Hardware: Intel i5-1135G7 (Tiger Lake), 32GB RAM, Windows 11.
Build:
System info:

Test setup
x86 AVX2 Divergences
Comparison with PR #8 reference (ARM NEON / generic scalar)
The AVX2 kernel shows measurably higher divergence compared to the NEON/scalar reference. The likely cause is floating-point operation ordering: our AVX2 path pre-multiplies

Note: @zcattacz's XOR+SUB approach posted above uses the two-level accumulation pattern (

Output quality is still good despite the divergence — text generation is coherent, and the PPL difference is only 0.057 (24.09 vs 24.04).
Summary
The Q1_0_g128 x86 CPU kernel produces gibberish output at 0.25 tok/s on Intel CPUs. Two bugs:
Bug 1: Float-to-int truncation (causes gibberish)
The per-block accumulator is `int`, but `d1 * sumi_block` produces a `float`. The implicit cast truncates every Q8_0 block's scale factor to 0 or ±1, destroying the output.

Bug 2: No SIMD (causes 0.25 tok/s)
The x86 kernel is scalar-only while the ARM NEON version has full vectorization. Added AVX2 using the same broadcast → shuffle → cmpeq → `mul_sum_i8_pairs_float` pattern from the existing `ggml_vec_dot_q1_0_q8_0` kernel.

Results (i5-1135G7, 32GB, Bonsai 8B)
| Build | tok/s | Output |
| --- | --- | --- |
| Before (MSVC) | 0.25 | gibberish: ", with is it. and the. and the.... in........." |
| Bug fix only | 3.7 | correct |
| Bug fix + AVX2 | 6.9 | correct |

Files changed
- `ggml/src/ggml-cpu/arch/x86/quants.c` — AVX2 kernel + scalar fix
- `ggml/src/ggml-cpu/quants.c` — generic scalar fallback fix

Test plan