fix: Q1_0_g128 x86 CPU kernel — float truncation + AVX2 vectorization#7

Open

wildcattrio wants to merge 1 commit into PrismML-Eng:prism from wildcattrio:fix/x86-q1_0_g128-kernel

Conversation

@wildcattrio

Summary

The Q1_0_g128 x86 CPU kernel produces gibberish output at 0.25 tok/s on Intel CPUs. Two bugs:

Bug 1: Float-to-int truncation (causes gibberish)

The per-block accumulator is `int`, but `d1 * sumi_block` produces a `float`. The implicit cast truncates each Q8_0 block's scaled contribution toward zero (to 0 or ±1 in practice), destroying the output.

// Before (broken):
int sumi = 0;
sumi += d1 * sumi_block;  // float truncated to int

// After (fixed):
float block_sum = 0.0f;
block_sum += d1 * (float)sumi_block;

Bug 2: No SIMD (causes 0.25 tok/s)

The x86 kernel is scalar-only while the ARM NEON version has full vectorization. Added AVX2 using the same broadcast → shuffle → cmpeq → mul_sum_i8_pairs_float pattern from the existing ggml_vec_dot_q1_0_q8_0 kernel.

Results (i5-1135G7, 32GB, Bonsai 8B)

| Version | tok/s | Output |
|---|---|---|
| Before (MSVC, shipped) | 0.25 | `, with is it. and the. and the.... in.........` |
| Bug fix only (scalar) | 3.7 | Correct |
| Bug fix + AVX2 | 6.9 | Correct |

Files changed

  • ggml/src/ggml-cpu/arch/x86/quants.c — AVX2 kernel + scalar fix
  • ggml/src/ggml-cpu/quants.c — generic scalar fallback fix

Test plan

  • Bonsai 8B: coherent output on Q&A, reasoning, and JSON extraction prompts
  • Bonsai 4B: coherent output (also tested)
  • Standard GGUF (Qwen 3.5 4B Q4_K_M): no regression, loads and runs correctly
  • Benchmark: 3 prompts × 8 configurations, all results in JSON

The Q1_0_g128 x86 kernel has two bugs causing gibberish output at
0.25 tok/s on Intel CPUs:

1. Float-to-int truncation: the per-block accumulator was `int`,
   truncating `d1 * sumi_block` (float * int → float → int). Each
   Q8_0 block's scale factor was rounded to 0 or ±1, destroying
   the output. Fix: `float block_sum` accumulator.

2. No SIMD: the x86 path was scalar-only while ARM NEON had full
   vectorization. Added AVX2 using the same broadcast/shuffle/cmpeq
   pattern from the existing Q1_0 kernel + mul_sum_i8_pairs_float.

Results on i5-1135G7 with Bonsai 8B:
  Before (MSVC):      0.25 tok/s, gibberish output
  Bug fix only:       3.7  tok/s, correct output
  Bug fix + AVX2:     6.9  tok/s, correct output

Both the x86-specific kernel (arch/x86/quants.c) and the generic
fallback (quants.c) are fixed.
@github-actions github-actions bot added the ggml label Apr 2, 2026
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 path and will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

@zcattacz

zcattacz commented Apr 3, 2026

Running 8B on an i5 box with this PR, I get consistent
[ Prompt: 1.8 t/s | Generation: 2.2~2.4 t/s ] performance.

After swapping out the defined(__AVX2__) logic for PR #4's xor + sub logic (PR #4 is based on an old commit, difficult to tinker with), I get consistent
[ Prompt: 2.7 t/s | Generation: 2.2~2.4 t/s ] performance.

The alternative code below, on an i5 (Broadwell), gives:
~8 tps prompt processing and ~6 tps generation for 4B
~4 tps prompt processing and ~3 tps generation for 8B

#if defined(__AVX2__)

    // Gemini generated
    // Optimized AVX2 path for Q1_0_g128 · Q8_0
    const __m256i ones_8    = _mm256_set1_epi8(1);
    const __m256i ones_16   = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(
        1,2,4,8,16,32,64,(char)-128,
        1,2,4,8,16,32,64,(char)-128,
        1,2,4,8,16,32,64,(char)-128,
        1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();

    // Use TWO independent accumulators to break FMA dependency chains
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();

    int ib = 0;
    // Unroll outer loop by 2
    for (; ib <= nb - 2; ib += 2) {
        const float d0_0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const float d0_1 = GGML_CPU_FP16_TO_FP32(x[ib+1].d);

        const uint32_t * qs32_0 = (const uint32_t *)x[ib].qs;
        const uint32_t * qs32_1 = (const uint32_t *)x[ib+1].qs;

        const block_q8_0 * y_ptr_0 = &y[ib * 4];
        const block_q8_0 * y_ptr_1 = &y[(ib+1) * 4];

        // Two independent block accumulators
        __m256 acc_block0 = _mm256_setzero_ps();
        __m256 acc_block1 = _mm256_setzero_ps();

        // The compiler will unroll this completely
        for (int k = 0; k < 4; ++k) {
            // Process Block 0
            const float dk_0     = GGML_CPU_FP16_TO_FP32(y_ptr_0[k].d);
            const __m256i qy_0   = _mm256_loadu_si256((const __m256i *)y_ptr_0[k].qs);
            const __m256i mask_0 = _mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32_0[k]), byte_shuf);
            const __m256i sm_0   = _mm256_cmpeq_epi8(_mm256_and_si256(mask_0, bit_masks), zero);
            const __m256i sy_0   = _mm256_sub_epi8(_mm256_xor_si256(qy_0, sm_0), sm_0);
            const __m256i s32_0  = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy_0), ones_16);
            acc_block0 = _mm256_fmadd_ps(_mm256_set1_ps(dk_0), _mm256_cvtepi32_ps(s32_0), acc_block0);

            // Process Block 1 (CPU executes this simultaneously with Block 0)
            const float dk_1     = GGML_CPU_FP16_TO_FP32(y_ptr_1[k].d);
            const __m256i qy_1   = _mm256_loadu_si256((const __m256i *)y_ptr_1[k].qs);
            const __m256i mask_1 = _mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32_1[k]), byte_shuf);
            const __m256i sm_1   = _mm256_cmpeq_epi8(_mm256_and_si256(mask_1, bit_masks), zero);
            const __m256i sy_1   = _mm256_sub_epi8(_mm256_xor_si256(qy_1, sm_1), sm_1);
            const __m256i s32_1  = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy_1), ones_16);
            acc_block1 = _mm256_fmadd_ps(_mm256_set1_ps(dk_1), _mm256_cvtepi32_ps(s32_1), acc_block1);
        }

        // Apply final block scales independently
        acc0 = _mm256_fmadd_ps(_mm256_set1_ps(d0_0), acc_block0, acc0);
        acc1 = _mm256_fmadd_ps(_mm256_set1_ps(d0_1), acc_block1, acc1);
    }

    // Combine accumulators
    __m256 acc = _mm256_add_ps(acc0, acc1);

    // Handle remainder if nb is odd
    for (; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t * qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 * y_ptr = &y[ib * 4];

        __m256 acc_block = _mm256_setzero_ps();

        for (int k = 0; k < 4; ++k) {
            const float dk     = GGML_CPU_FP16_TO_FP32(y_ptr[k].d);
            const __m256i qy   = _mm256_loadu_si256((const __m256i *)y_ptr[k].qs);
            const __m256i mask = _mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[k]), byte_shuf);
            const __m256i sm   = _mm256_cmpeq_epi8(_mm256_and_si256(mask, bit_masks), zero);
            const __m256i sy   = _mm256_sub_epi8(_mm256_xor_si256(qy, sm), sm);
            const __m256i s32  = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16);
            
            acc_block = _mm256_fmadd_ps(_mm256_set1_ps(dk), _mm256_cvtepi32_ps(s32), acc_block);
        }
        acc = _mm256_fmadd_ps(_mm256_set1_ps(d0), acc_block, acc);
    }

    // Horizontal reduction
    {
        const __m128 h = _mm_add_ps(_mm256_extractf128_ps(acc, 0),
                                    _mm256_extractf128_ps(acc, 1));
        const __m128 q = _mm_add_ps(h, _mm_movehl_ps(h, h));
        *s = _mm_cvtss_f32(_mm_add_ss(q, _mm_movehdup_ps(q)));
    }

#else
...
#endif

@zcattacz

zcattacz commented Apr 3, 2026

The SSE path takes performance from 0.1 tps to 0.7~0.9 tps on an N2840 Atom.

#elif defined(__SSE4_2__) || defined(__SSSE3__)
    // Optimized SSE4.2/SSSE3 path for Q1_0_g128 · Q8_0
    // This uses 128-bit registers to process 16 elements at a time.
    const __m128i ones_8    = _mm_set1_epi8(1);
    const __m128i ones_16   = _mm_set1_epi16(1);
    
    // This shuffle mask spreads the 1st byte of a register to the first 8 slots,
    // and the 2nd byte to the next 8 slots.
    const __m128i byte_shuf = _mm_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1);
        
    const __m128i bit_masks = _mm_setr_epi8(
        1,2,4,8,16,32,64,(char)-128,   // (char)-128 is the portable spelling of the 0x80 byte;
        1,2,4,8,16,32,64,(char)-128);  // (char)128 is implementation-defined in C
    const __m128i zero = _mm_setzero_si128();

    __m128 acc = _mm_setzero_ps();

    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t * qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 * y_ptr = &y[ib * 4];

        __m128 acc_block = _mm_setzero_ps();

        for (int k = 0; k < 4; ++k) {
            const float dk = GGML_CPU_FP16_TO_FP32(y_ptr[k].d);
            const __m128 vdk = _mm_set1_ps(dk);
            
            // Load 32 bytes of Q8_0 weights into two 16-byte SSE registers
            const __m128i qy_l = _mm_loadu_si128((const __m128i *)(y_ptr[k].qs));
            const __m128i qy_h = _mm_loadu_si128((const __m128i *)(y_ptr[k].qs + 16));

            const uint32_t bits = qs32[k];

            // Process Lower 16 elements (using lower 16 bits of the mask)
            const __m128i mask_l = _mm_shuffle_epi8(_mm_set1_epi16((short)(bits & 0xFFFF)), byte_shuf);
            const __m128i sm_l   = _mm_cmpeq_epi8(_mm_and_si128(mask_l, bit_masks), zero);
            const __m128i sy_l   = _mm_sub_epi8(_mm_xor_si128(qy_l, sm_l), sm_l);
            const __m128i s32_l  = _mm_madd_epi16(_mm_maddubs_epi16(ones_8, sy_l), ones_16);

            // Process Upper 16 elements (using upper 16 bits of the mask)
            const __m128i mask_h = _mm_shuffle_epi8(_mm_set1_epi16((short)(bits >> 16)), byte_shuf);
            const __m128i sm_h   = _mm_cmpeq_epi8(_mm_and_si128(mask_h, bit_masks), zero);
            const __m128i sy_h   = _mm_sub_epi8(_mm_xor_si128(qy_h, sm_h), sm_h);
            const __m128i s32_h  = _mm_madd_epi16(_mm_maddubs_epi16(ones_8, sy_h), ones_16);

            // Convert integer sums to float, scale by dk, and add to block accumulator
            const __m128i s32_total = _mm_add_epi32(s32_l, s32_h);
            acc_block = _mm_add_ps(acc_block, _mm_mul_ps(vdk, _mm_cvtepi32_ps(s32_total)));
        }

        // Final scale by d0 (block scale) and add to global accumulator
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(d0), acc_block));
    }

    // Horizontal reduction of the 4 float lanes in the SSE register
    {
        acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
        acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
        *s = _mm_cvtss_f32(acc);
    }

AI suggested that meaningful acceleration for 1-bit nets on CPUs lacking AVX could only be achieved by implementing the dot product as dot = 2 × popcount(A XNOR B) − total bits, using _mm_popcnt_u64 ...
@khosravipasha, I guess these devices need a different quantization format :-D ?

@wildcattrio
Author

KL Divergence Results — x86 AVX2 (PR #7)

Ran the KL divergence tests from PR #8 on the AVX2 kernel fix from this PR. Hardware: Intel i5-1135G7 (Tiger Lake), 32GB RAM, Windows 11. Build: 8194 (1179bfc82) with Clang 22.1.2.

System info: CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1

Test setup

  • F16 reference: dequantized from Bonsai-1.7B.gguf (Q1_0_g128) using llama-quantize --allow-requantize
  • Dataset: wikitext-2-raw, -c 512 --chunks 100
  • F16 perplexity: PPL = 24.04, 41.5 tok/s prompt processing
  • Q1_0_g128 perplexity: PPL = 24.09, 27.0 tok/s prompt processing

x86 AVX2 Divergences

| Metric | Q1_0_g128 (1.13 BPW) |
|---|---|
| Same top p | 99.220 ± 0.055 % |
| Mean KLD | 0.000224 ± 0.000002 |
| Maximum KLD | 0.013303 |
| 99.9% KLD | 0.002923 |
| 99.0% KLD | 0.001273 |
| Median KLD | 0.000150 |
| 1.0% KLD | -0.000008 |
| Minimum KLD | -0.000083 |
| Mean Δp | -0.005 ± 0.002 % |
| Maximum Δp | 4.779 % |
| 99.9% Δp | 2.198 % |
| 99.0% Δp | 1.122 % |
| 95.0% Δp | 0.511 % |
| Median Δp | -0.000 % |
| 5.0% Δp | -0.534 % |
| 1.0% Δp | -1.120 % |
| Minimum Δp | -3.440 % |
| RMS Δp | 0.352 ± 0.005 % |

Comparison with PR #8 reference (ARM NEON / generic scalar)

| Metric | ARM NEON | x86 AVX2 (this PR) |
|---|---|---|
| Same top p | 99.965 % | 99.220 % |
| Mean KLD | 0.000000 | 0.000224 |
| Maximum KLD | 0.000065 | 0.013303 |
| RMS Δp | 0.006 % | 0.352 % |

The AVX2 kernel shows measurably higher divergence compared to the NEON/scalar reference. The likely cause is floating-point operation ordering: our AVX2 path pre-multiplies d0 * d1 as a combined scale per sub-block, while the scalar reference accumulates d1 * sumi_block per sub-block then multiplies by d0 at the end. This difference in FP associativity accumulates across the 28 layers.

Note: @zcattacz's XOR+SUB approach posted above uses the two-level accumulation pattern (acc_block += dk * dot, then acc += d0 * acc_block) which more closely matches the scalar reference FP ordering — it may produce better KLD numbers. Worth testing.

Output quality is still good despite the divergence — text generation is coherent and the PPL difference is only 0.057 (24.09 vs 24.04).
