fix: Q1_0_g128 x86 CPU kernel — float truncation + AVX2 vectorization#7

Open

wildcattrio wants to merge 1 commit into PrismML-Eng:prism from wildcattrio:fix/x86-q1_0_g128-kernel

Conversation

@wildcattrio

Summary

The Q1_0_g128 x86 CPU kernel produces gibberish output at 0.25 tok/s on Intel CPUs. Two bugs:

Bug 1: Float-to-int truncation (causes gibberish)

The per-block accumulator is `int`, but `d1 * sumi_block` produces a `float`. The implicit cast truncates each Q8_0 block's scaled contribution toward zero (to 0 or ±1 in practice), destroying the output.

// Before (broken):
int sumi = 0;
sumi += d1 * sumi_block;  // float truncated to int

// After (fixed):
float block_sum = 0.0f;
block_sum += d1 * (float)sumi_block;

Bug 2: No SIMD (causes 0.25 tok/s)

The x86 kernel is scalar-only while the ARM NEON version has full vectorization. Added AVX2 using the same broadcast → shuffle → cmpeq → mul_sum_i8_pairs_float pattern from the existing ggml_vec_dot_q1_0_q8_0 kernel.

Results (i5-1135G7, 32GB, Bonsai 8B)

| Version | tok/s | Output |
|---|---|---|
| Before (MSVC, shipped) | 0.25 | `, with is it. and the. and the.... in.........` |
| Bug fix only (scalar) | 3.7 | Correct |
| Bug fix + AVX2 | 6.9 | Correct |

Files changed

  • ggml/src/ggml-cpu/arch/x86/quants.c — AVX2 kernel + scalar fix
  • ggml/src/ggml-cpu/quants.c — generic scalar fallback fix

Test plan

  • Bonsai 8B: coherent output on Q&A, reasoning, and JSON extraction prompts
  • Bonsai 4B: coherent output (also tested)
  • Standard GGUF (Qwen 3.5 4B Q4_K_M): no regression, loads and runs correctly
  • Benchmark: 3 prompts × 8 configurations, all results in JSON

The Q1_0_g128 x86 kernel has two bugs causing gibberish output at
0.25 tok/s on Intel CPUs:

1. Float-to-int truncation: the per-block accumulator was `int`,
   truncating `d1 * sumi_block` (float * int → float → int). Each
   Q8_0 block's scale factor was rounded to 0 or ±1, destroying
   the output. Fix: `float block_sum` accumulator.

2. No SIMD: the x86 path was scalar-only while ARM NEON had full
   vectorization. Added AVX2 using the same broadcast/shuffle/cmpeq
   pattern from the existing Q1_0 kernel + mul_sum_i8_pairs_float.

Results on i5-1135G7 with Bonsai 8B:
  Before (MSVC):      0.25 tok/s, gibberish output
  Bug fix only:       3.7  tok/s, correct output
  Bug fix + AVX2:     6.9  tok/s, correct output

Both the x86-specific kernel (arch/x86/quants.c) and the generic
fallback (quants.c) are fixed.
@github-actions github-actions bot added the ggml label Apr 2, 2026
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 path and will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

@zcattacz

zcattacz commented Apr 3, 2026

Running 8B on an i5 box with this PR, I get consistent
[ Prompt: 1.8 t/s | Generation: 2.2~2.4 t/s ] performance.

After swapping out the defined(__AVX2__) logic for PR #4's xor + sub logic (PR #4 is based on an old commit, difficult to tinker with), I get consistent
[ Prompt: 2.7 t/s | Generation: 2.2~2.4 t/s ] performance.

The alternative code below, on an i5 (Broadwell), gives:
~8 tps prompt processing and ~6 tps generation for 4B
~4 tps prompt processing and ~3 tps generation for 8B

#if defined(__AVX2__)

    // Gemini generated
    // Optimized AVX2 path for Q1_0_g128 · Q8_0
    const __m256i ones_8    = _mm256_set1_epi8(1);
    const __m256i ones_16   = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(
        1,2,4,8,16,32,64,(char)-128,
        1,2,4,8,16,32,64,(char)-128,
        1,2,4,8,16,32,64,(char)-128,
        1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();

    // Use TWO independent accumulators to break FMA dependency chains
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();

    int ib = 0;
    // Unroll outer loop by 2
    for (; ib <= nb - 2; ib += 2) {
        const float d0_0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const float d0_1 = GGML_CPU_FP16_TO_FP32(x[ib+1].d);

        const uint32_t * qs32_0 = (const uint32_t *)x[ib].qs;
        const uint32_t * qs32_1 = (const uint32_t *)x[ib+1].qs;

        const block_q8_0 * y_ptr_0 = &y[ib * 4];
        const block_q8_0 * y_ptr_1 = &y[(ib+1) * 4];

        // Two independent block accumulators
        __m256 acc_block0 = _mm256_setzero_ps();
        __m256 acc_block1 = _mm256_setzero_ps();

        // The compiler will unroll this completely
        for (int k = 0; k < 4; ++k) {
            // Process Block 0
            const float dk_0     = GGML_CPU_FP16_TO_FP32(y_ptr_0[k].d);
            const __m256i qy_0   = _mm256_loadu_si256((const __m256i *)y_ptr_0[k].qs);
            const __m256i mask_0 = _mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32_0[k]), byte_shuf);
            const __m256i sm_0   = _mm256_cmpeq_epi8(_mm256_and_si256(mask_0, bit_masks), zero);
            const __m256i sy_0   = _mm256_sub_epi8(_mm256_xor_si256(qy_0, sm_0), sm_0);
            const __m256i s32_0  = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy_0), ones_16);
            acc_block0 = _mm256_fmadd_ps(_mm256_set1_ps(dk_0), _mm256_cvtepi32_ps(s32_0), acc_block0);

            // Process Block 1 (CPU executes this simultaneously with Block 0)
            const float dk_1     = GGML_CPU_FP16_TO_FP32(y_ptr_1[k].d);
            const __m256i qy_1   = _mm256_loadu_si256((const __m256i *)y_ptr_1[k].qs);
            const __m256i mask_1 = _mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32_1[k]), byte_shuf);
            const __m256i sm_1   = _mm256_cmpeq_epi8(_mm256_and_si256(mask_1, bit_masks), zero);
            const __m256i sy_1   = _mm256_sub_epi8(_mm256_xor_si256(qy_1, sm_1), sm_1);
            const __m256i s32_1  = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy_1), ones_16);
            acc_block1 = _mm256_fmadd_ps(_mm256_set1_ps(dk_1), _mm256_cvtepi32_ps(s32_1), acc_block1);
        }

        // Apply final block scales independently
        acc0 = _mm256_fmadd_ps(_mm256_set1_ps(d0_0), acc_block0, acc0);
        acc1 = _mm256_fmadd_ps(_mm256_set1_ps(d0_1), acc_block1, acc1);
    }

    // Combine accumulators
    __m256 acc = _mm256_add_ps(acc0, acc1);

    // Handle remainder if nb is odd
    for (; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t * qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 * y_ptr = &y[ib * 4];

        __m256 acc_block = _mm256_setzero_ps();

        for (int k = 0; k < 4; ++k) {
            const float dk     = GGML_CPU_FP16_TO_FP32(y_ptr[k].d);
            const __m256i qy   = _mm256_loadu_si256((const __m256i *)y_ptr[k].qs);
            const __m256i mask = _mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[k]), byte_shuf);
            const __m256i sm   = _mm256_cmpeq_epi8(_mm256_and_si256(mask, bit_masks), zero);
            const __m256i sy   = _mm256_sub_epi8(_mm256_xor_si256(qy, sm), sm);
            const __m256i s32  = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16);
            
            acc_block = _mm256_fmadd_ps(_mm256_set1_ps(dk), _mm256_cvtepi32_ps(s32), acc_block);
        }
        acc = _mm256_fmadd_ps(_mm256_set1_ps(d0), acc_block, acc);
    }

    // Horizontal reduction
    {
        const __m128 h = _mm_add_ps(_mm256_extractf128_ps(acc, 0),
                                    _mm256_extractf128_ps(acc, 1));
        const __m128 q = _mm_add_ps(h, _mm_movehl_ps(h, h));
        *s = _mm_cvtss_f32(_mm_add_ss(q, _mm_movehdup_ps(q)));
    }

#else
...
#endif

@zcattacz

zcattacz commented Apr 3, 2026

The SSE path takes performance from 0.1 tps to 0.7~0.9 tps on an N2840 Atom.

#elif defined(__SSE4_2__) || defined(__SSSE3__)
    // Optimized SSE4.2/SSSE3 path for Q1_0_g128 · Q8_0
    // This uses 128-bit registers to process 16 elements at a time.
    const __m128i ones_8    = _mm_set1_epi8(1);
    const __m128i ones_16   = _mm_set1_epi16(1);
    
    // This shuffle mask spreads the 1st byte of a register to the first 8 slots,
    // and the 2nd byte to the next 8 slots.
    const __m128i byte_shuf = _mm_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1);
        
    const __m128i bit_masks = _mm_setr_epi8(
        1,2,4,8,16,32,64,(char)-128,   // (char)-128 is the portable spelling of the 0x80 byte;
        1,2,4,8,16,32,64,(char)-128);  // (char)128 is implementation-defined in C
    const __m128i zero = _mm_setzero_si128();

    __m128 acc = _mm_setzero_ps();

    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t * qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 * y_ptr = &y[ib * 4];

        __m128 acc_block = _mm_setzero_ps();

        for (int k = 0; k < 4; ++k) {
            const float dk = GGML_CPU_FP16_TO_FP32(y_ptr[k].d);
            const __m128 vdk = _mm_set1_ps(dk);
            
            // Load 32 bytes of Q8_0 weights into two 16-byte SSE registers
            const __m128i qy_l = _mm_loadu_si128((const __m128i *)(y_ptr[k].qs));
            const __m128i qy_h = _mm_loadu_si128((const __m128i *)(y_ptr[k].qs + 16));

            const uint32_t bits = qs32[k];

            // Process Lower 16 elements (using lower 16 bits of the mask)
            const __m128i mask_l = _mm_shuffle_epi8(_mm_set1_epi16((short)(bits & 0xFFFF)), byte_shuf);
            const __m128i sm_l   = _mm_cmpeq_epi8(_mm_and_si128(mask_l, bit_masks), zero);
            const __m128i sy_l   = _mm_sub_epi8(_mm_xor_si128(qy_l, sm_l), sm_l);
            const __m128i s32_l  = _mm_madd_epi16(_mm_maddubs_epi16(ones_8, sy_l), ones_16);

            // Process Upper 16 elements (using upper 16 bits of the mask)
            const __m128i mask_h = _mm_shuffle_epi8(_mm_set1_epi16((short)(bits >> 16)), byte_shuf);
            const __m128i sm_h   = _mm_cmpeq_epi8(_mm_and_si128(mask_h, bit_masks), zero);
            const __m128i sy_h   = _mm_sub_epi8(_mm_xor_si128(qy_h, sm_h), sm_h);
            const __m128i s32_h  = _mm_madd_epi16(_mm_maddubs_epi16(ones_8, sy_h), ones_16);

            // Convert integer sums to float, scale by dk, and add to block accumulator
            const __m128i s32_total = _mm_add_epi32(s32_l, s32_h);
            acc_block = _mm_add_ps(acc_block, _mm_mul_ps(vdk, _mm_cvtepi32_ps(s32_total)));
        }

        // Final scale by d0 (block scale) and add to global accumulator
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(d0), acc_block));
    }

    // Horizontal reduction of the 4 float lanes in the SSE register
    {
        acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
        acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
        *s = _mm_cvtss_f32(acc);
    }

AI suggested that meaningful acceleration for 1-bit nets on CPUs lacking AVX could only be achieved by implementing the dot product as dot = 2 × popcount(A XNOR B) − total bits, using _mm_popcnt_u64 ...
@khosravipasha, I guess these devices need a different quantization format :-D ?

@wildcattrio
Author

KL Divergence Results — x86 AVX2 (PR #7)

Ran the KL divergence tests from PR #8 on the AVX2 kernel fix from this PR. Hardware: Intel i5-1135G7 (Tiger Lake), 32GB RAM, Windows 11. Build: 8194 (1179bfc82) with Clang 22.1.2.

System info: CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1

Test setup

  • F16 reference: dequantized from Bonsai-1.7B.gguf (Q1_0_g128) using llama-quantize --allow-requantize
  • Dataset: wikitext-2-raw, -c 512 --chunks 100
  • F16 perplexity: PPL = 24.04, 41.5 tok/s prompt processing
  • Q1_0_g128 perplexity: PPL = 24.09, 27.0 tok/s prompt processing

x86 AVX2 Divergences

| Metric | Q1_0_g128 (1.13 BPW) |
|---|---|
| Same top p | 99.220 ± 0.055 % |
| Mean KLD | 0.000224 ± 0.000002 |
| Maximum KLD | 0.013303 |
| 99.9% KLD | 0.002923 |
| 99.0% KLD | 0.001273 |
| Median KLD | 0.000150 |
| 1.0% KLD | -0.000008 |
| Minimum KLD | -0.000083 |
| Mean Δp | -0.005 ± 0.002 % |
| Maximum Δp | 4.779 % |
| 99.9% Δp | 2.198 % |
| 99.0% Δp | 1.122 % |
| 95.0% Δp | 0.511 % |
| Median Δp | -0.000 % |
| 5.0% Δp | -0.534 % |
| 1.0% Δp | -1.120 % |
| Minimum Δp | -3.440 % |
| RMS Δp | 0.352 ± 0.005 % |

Comparison with PR #8 reference (ARM NEON / generic scalar)

| Metric | ARM NEON | x86 AVX2 (this PR) |
|---|---|---|
| Same top p | 99.965 % | 99.220 % |
| Mean KLD | 0.000000 | 0.000224 |
| Maximum KLD | 0.000065 | 0.013303 |
| RMS Δp | 0.006 % | 0.352 % |

The AVX2 kernel shows measurably higher divergence compared to the NEON/scalar reference. The likely cause is floating-point operation ordering: our AVX2 path pre-multiplies d0 * d1 as a combined scale per sub-block, while the scalar reference accumulates d1 * sumi_block per sub-block then multiplies by d0 at the end. This difference in FP associativity accumulates across the 28 layers.

Note: @zcattacz's XOR+SUB approach posted above uses the two-level accumulation pattern (acc_block += dk * dot, then acc += d0 * acc_block) which more closely matches the scalar reference FP ordering — it may produce better KLD numbers. Worth testing.

Output quality is still good despite the divergence — text generation is coherent and the PPL difference is only 0.057 (24.09 vs 24.04).
