fix: Q1_0_g128 x86 CPU kernel - correct output + AVX2/AVX-512 VNNI #6
stfurkan wants to merge 2 commits into PrismML-Eng:prism
Conversation
Pull request overview
Fixes incorrect Q1_0_g128 × Q8_0 dot-product results on x86 by correcting float/int accumulation and introducing optimized x86 implementations (AVX-512 VNNI, AVX2, scalar fallback) while keeping behavior consistent with the working ARM path.
Changes:
- Fix scalar generic kernel accumulation to avoid float-to-int truncation.
- Replace x86 scalar-only kernel with AVX-512 VNNI and AVX2 implementations plus corrected scalar fallback.
- Simplify bit extraction logic in scalar paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| ggml/src/ggml-cpu/quants.c | Fixes generic scalar vec_dot accumulation for correct numerical results. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Adds AVX-512 VNNI + AVX2 kernels and fixes scalar fallback accumulation for x86. |
This looks great, thanks. There were a few CPU kernel fixes I did not see until I pushed my changes. For now I removed the buggy x86 kernel and will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8
The Q1_0_g128 vec_dot kernel for x86 produces garbage output due to a
float-to-int truncation bug: `sumi += d1 * sumi_block` accumulates a
float product into an int, silently truncating the result to zero for
small scale factors. This affects both the generic scalar fallback and
the x86 arch-specific implementation.
The ARM NEON implementation was correct and unaffected.
Changes:
- Fix generic scalar kernel (quants.c): accumulate `d0 * d1 * sumi`
into float, matching the working ARM scalar fallback pattern
- Replace x86 scalar-only kernel with three-tier implementation:
1. AVX-512 VNNI (BW+VL+VNNI): uses mask registers for single-
instruction bit expansion + VPDPBUSD for dot product
2. AVX2: shuffle-based bit expansion + sign_epi8 multiply
3. Scalar fallback: corrected accumulation
Benchmarks on AMD EPYC (Zen 4, 12 vCPU shared):
Before (broken): garbage output at ~0.5 tok/s
Scalar fix: correct output at ~3 tok/s
AVX2: correct output at ~28 tok/s
AVX-512 VNNI: correct output at ~50 tok/s (1.7B model)
Force-pushed ba0e521 to 0b7a2dd
@khosravipasha Thanks! I rebased the branch on top of your cpu-fixes merge. The KL divergence results are below; the AVX-512 VNNI kernel matches F16 almost exactly (99.949% same top p, near-zero KL divergence).

KL Divergence Results (Q1_0_g128 vs F16)
AMD EPYC Zen 4 (AVX-512 VNNI kernel), Bonsai-1.7B, wikitext-2-raw, 100 chunks, ctx 512
Q1_0 (non-g128) GGUF doesn't appear to be published on HuggingFace, only the g128 variant. Full log
Thanks, looks good, it's close to 0. It's okay for Q1_0; we won't be using it. Also it seems the llama.cpp people don't like the Q1_0_g128 naming, so most likely we will rename Q1_0_g128 => Q1_0 and remove the old Q1_0 in llama.cpp's main repo in the future.
Here are the pp512/tg128 benchmarks for all three models:

Benchmarks (pp512 / tg128)
AMD EPYC Zen 4 (12 vCPU shared), AVX-512 VNNI kernel, 12 threads, BLAS backend, shared vCPU (Hetzner CPX52).

Note these are with all threads on a single model; in production I run all 3 simultaneously with 4 threads each, which gives roughly half these numbers. Good to know about the Q1_0_g128 → Q1_0 rename. Thanks @khosravipasha
This is slower than PR #7 on my i5 box with AVX2 only. llama.cpp builds with PR #6 and PR #7:
- AVX2: replace manual int8→int16→int32 reduction with mul_sum_i8_pairs_float() (auto-selects AVXVNNI dpbssd on supported CPUs)
- Both paths: accumulate into __m256 float via fmadd_ps, single hsum_float_8 at end (eliminates per-block horizontal int32 sum)
- Remove unused variables and constants
@zcattacz Thanks for testing! I've updated the AVX2 path: replaced the manual int8→int16→int32 reduction chain with mul_sum_i8_pairs_float().

Updated benchmarks (AMD EPYC Zen 4, 12 vCPU shared):
tg128 improved ~20-30% over the previous version. Would be great to see your i5 numbers with this update if you get a chance.
Builds on the scalar fix from #8 (cpu-fixes), which corrected the float-to-int truncation bug by changing `int sumi` to `float sumi`. That fix produces correct output but falls back to scalar code on x86 (~3 tok/s). This PR adds SIMD-optimized x86 kernels for Q1_0_g128 to bring x86 CPU performance closer to what ARM NEON achieves.
Changes
- arch/x86/quants.c: replace the generic scalar delegation with a three-tier SIMD implementation:
  1. AVX-512 VNNI: `maskz_mov_epi8` for single-instruction bit expansion + `VPDPBUSD` for dot product accumulation
  2. AVX2: shuffle-based bit expansion + `sign_epi8` multiply
  3. Scalar fallback: corrected accumulation (replaces the delegation to `ggml_vec_dot_q1_0_g128_q8_0_generic`)
- quants.c: minor cleanup: simplified inner loop using direct bit extraction (`bits[j >> 3] >> (j & 7)`) and single-level float accumulation; `memcpy` for strict-aliasing and alignment safety

ARM NEON and CUDA/Metal paths are untouched.
Benchmarks
Hetzner CPX52 (12 vCPU AMD EPYC Zen 4, shared, 24GB RAM)
All models produce correct output. Prompt processing sees similar gains (44.8 / 23.6 / 12.8 tok/s respectively).
Live demo: https://ai.sft.best (temporary, may be taken down)