
fix: Q1_0_g128 x86 CPU kernel - correct output + AVX2/AVX-512 VNNI #6

Open
stfurkan wants to merge 2 commits into PrismML-Eng:prism from stfurkan:fix/q1_0_g128-x86-cpu-kernel

Conversation


@stfurkan stfurkan commented Apr 2, 2026

Builds on the scalar fix from #8 (cpu-fixes) which corrected the float-to-int truncation bug by changing int sumi to float sumi. That fix produces correct output but falls back to scalar code on x86 (~3 tok/s).

This PR adds SIMD-optimized x86 kernels for Q1_0_g128 to bring x86 CPU performance closer to what ARM NEON achieves.

Changes

  • x86 arch/x86/quants.c: replace generic scalar delegation with three-tier SIMD implementation:
    1. AVX-512 VNNI (BW+VL+VNNI): maskz_mov_epi8 for single-instruction bit expansion + VPDPBUSD for dot product accumulation
    2. AVX2: shuffle-based bit-to-byte expansion + sign_epi8 multiply
    3. Scalar fallback: delegates to ggml_vec_dot_q1_0_g128_q8_0_generic
  • Generic scalar kernel (quants.c): minor cleanup — simplified inner loop using direct bit extraction (bits[j >> 3] >> (j & 7)) and single-level float accumulation
  • All loads use memcpy for strict-aliasing and alignment safety

ARM NEON and CUDA/Metal paths are untouched.

Benchmarks

Hetzner CPX52 (12 vCPU AMD EPYC Zen 4, shared, 24GB RAM)

Model Scalar (before) AVX-512 VNNI (after) Speedup
1.7B ~3 tok/s 51.1 tok/s ~17x
4B ~3 tok/s 19.8 tok/s ~7x
8B ~3 tok/s 11.1 tok/s ~4x

All models produce correct output. Prompt processing sees similar gains (44.8 / 23.6 / 12.8 tok/s respectively).

Live demo: https://ai.sft.best (temporary, may be taken down)

Copilot AI review requested due to automatic review settings April 2, 2026 17:12
@github-actions github-actions bot added the ggml label Apr 2, 2026

Copilot AI left a comment


Pull request overview

Fixes incorrect Q1_0_g128 × Q8_0 dot-product results on x86 by correcting float/int accumulation and introducing optimized x86 implementations (AVX-512 VNNI, AVX2, scalar fallback) while keeping behavior consistent with the working ARM path.

Changes:

  • Fix scalar generic kernel accumulation to avoid float-to-int truncation.
  • Replace x86 scalar-only kernel with AVX-512 VNNI and AVX2 implementations plus corrected scalar fallback.
  • Simplify bit extraction logic in scalar paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
ggml/src/ggml-cpu/quants.c Fixes generic scalar vec_dot accumulation for correct numerical results.
ggml/src/ggml-cpu/arch/x86/quants.c Adds AVX-512 VNNI + AVX2 kernels and fixes scalar fallback accumulation for x86.


@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes that I didn't see until I pushed my changes. For now I've removed the buggy x86 kernel and will merge one of the correct AVX implementations.

Could you run the KL divergence tests described here: #8

The Q1_0_g128 vec_dot kernel for x86 produces garbage output due to a
float-to-int truncation bug: `sumi += d1 * sumi_block` accumulates a
float product into an int, silently truncating the result to zero for
small scale factors. This affects both the generic scalar fallback and
the x86 arch-specific implementation.

The ARM NEON implementation was correct and unaffected.

Changes:
- Fix generic scalar kernel (quants.c): accumulate `d0 * d1 * sumi`
  into float, matching the working ARM scalar fallback pattern
- Replace x86 scalar-only kernel with three-tier implementation:
  1. AVX-512 VNNI (BW+VL+VNNI): uses mask registers for single-
     instruction bit expansion + VPDPBUSD for dot product
  2. AVX2: shuffle-based bit expansion + sign_epi8 multiply
  3. Scalar fallback: corrected accumulation

Benchmarks on AMD EPYC (Zen 4, 12 vCPU shared):
  Before (broken): garbage output at ~0.5 tok/s
  Scalar fix:      correct output at ~3 tok/s
  AVX2:            correct output at ~28 tok/s
  AVX-512 VNNI:    correct output at ~50 tok/s (1.7B model)
@stfurkan stfurkan force-pushed the fix/q1_0_g128-x86-cpu-kernel branch from ba0e521 to 0b7a2dd on April 2, 2026 18:52
@stfurkan
Author

stfurkan commented Apr 2, 2026

This looks great, thanks. There were a few CPU kernel fixes that I didn't see until I pushed my changes. For now I've removed the buggy x86 kernel and will merge one of the correct AVX implementations.

Could you run the KL divergence tests described here: #8

@khosravipasha Thanks! I rebased the branch on top of your cpu-fixes merge. The KL divergence results are below; the AVX-512 VNNI kernel matches F16 almost exactly (99.949% same top p, near-zero KL divergence).

KL Divergence Results (Q1_0_g128 vs F16)

AMD EPYC Zen 4 (AVX-512 VNNI kernel), Bonsai-1.7B, wikitext-2-raw, 100 chunks, ctx 512

Note: The wikitext-2-raw dataset had some invalid codepoints that caused the tokenizer to crash. Cleaned by removing non-BMP characters before running.

Metric Value
Same top p 99.949 ± 0.014 %
Mean KL divergence -0.000000 ± 0.000000
Max KL divergence 0.000031
Mean PPL(Q)/PPL(base) 1.000012 ± 0.000019
RMS Δp 0.000 ± 0.000 %
Prompt processing 50.29 tok/s

Q1_0 (non-g128) GGUF doesn't appear to be published on HuggingFace, only Bonsai-1.7B.gguf (Q1_0_g128) is available at prism-ml/Bonsai-1.7B-gguf. Happy to run it if you can share the file.

Full log

system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 33.76 seconds per pass - ETA 14.07 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1     798.3831 ±  121.7392      -0.00001 ±       -nan       0.00000 ±    0.00000     0.000 ±  0.000 %    100.000 ±  0.000 %
   2     620.0014 ±   65.3600       0.00001 ±       -nan       0.00000 ±    0.00000     0.001 ±  0.000 %    100.000 ±  0.000 %
   3     645.9309 ±   57.4854       0.00001 ±       -nan       0.00000 ±    0.00000     0.001 ±  0.000 %    100.000 ±  0.000 %
   4     674.1265 ±   52.2820       0.00000 ±       -nan      -0.00000 ±    0.00000     0.001 ±  0.000 %    99.902 ±  0.098 %
   5     695.0380 ±   49.2197      -0.00000 ±       -nan       0.00000 ±    0.00000     0.000 ±  0.000 %    99.922 ±  0.078 %
   6     705.0255 ±   46.4772       0.00000 ±       -nan       0.00000 ±    0.00000     0.000 ±  0.000 %    99.935 ±  0.065 %
   7     698.1185 ±   42.5761       0.00000 ±       -nan       0.00000 ±    0.00000     0.001 ±  0.000 %    99.944 ±  0.056 %
   8     704.5098 ±   40.2837       0.00000 ±       -nan       0.00000 ±    0.00000     0.001 ±  0.000 %    99.951 ±  0.049 %
   9     708.6964 ±   38.1745       0.00000 ±       -nan       0.00000 ±    0.00000     0.001 ±  0.000 %    99.956 ±  0.044 %
  10     712.0216 ±   36.1241       0.00000 ±       -nan       0.00000 ±    0.00000     0.001 ±  0.000 %    99.961 ±  0.039 %
  11     723.7918 ±   35.1560       0.00000 ±       -nan       0.00000 ±    0.00000     0.001 ±  0.000 %    99.964 ±  0.036 %
  12     722.6843 ±   33.4567       0.00000 ±       -nan       0.00000 ±    0.00000     0.000 ±  0.000 %    99.967 ±  0.033 %
  13     722.3550 ±   32.1003       0.00000 ±       -nan       0.00000 ±    0.00000     0.001 ±  0.000 %    99.970 ±  0.030 %
  14     721.8709 ±   31.0136       0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.972 ±  0.028 %
  15     724.1042 ±   29.9789      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.974 ±  0.026 %
  16     718.8967 ±   28.6933      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.975 ±  0.025 %
  17     713.8766 ±   27.5841      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.977 ±  0.023 %
  18     716.6543 ±   26.9642      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.956 ±  0.031 %
  19     714.1373 ±   26.1189      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.959 ±  0.029 %
  20     716.0243 ±   25.5827      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.961 ±  0.028 %
  21     720.8451 ±   25.2289      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.944 ±  0.032 %
  22     709.7127 ±   24.1366      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.947 ±  0.031 %
  23     722.8257 ±   24.1865      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.030 %
  24     731.1260 ±   23.9849      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.951 ±  0.028 %
  25     733.6260 ±   23.5931      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.953 ±  0.027 %
  26     729.0816 ±   22.9661      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.955 ±  0.026 %
  27     732.0525 ±   22.6660      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.956 ±  0.025 %
  28     732.0982 ±   22.2974      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.958 ±  0.024 %
  29     731.2949 ±   21.8910      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.959 ±  0.023 %
  30     728.4758 ±   21.3988      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.961 ±  0.023 %
  31     731.1864 ±   21.1542      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.962 ±  0.022 %
  32     730.5836 ±   20.7615      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.963 ±  0.021 %
  33     729.5381 ±   20.4158      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.964 ±  0.021 %
  34     733.9744 ±   20.2788      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.965 ±  0.020 %
  35     730.8512 ±   19.8602      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.966 ±  0.019 %
  36     732.7395 ±   19.6591      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.967 ±  0.019 %
  37     731.6309 ±   19.3785      -0.00000 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.968 ±  0.018 %
  38     726.9186 ±   18.9506      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.969 ±  0.018 %
  39     725.5439 ±   18.6636      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.970 ±  0.017 %
  40     725.8810 ±   18.4446      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.971 ±  0.017 %
  41     727.3044 ±   18.2676      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.962 ±  0.019 %
  42     727.9197 ±   18.0797      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.963 ±  0.019 %
  43     724.1192 ±   17.7521      -0.00001 ±    0.00000       0.00000 ±    0.00000     0.000 ±  0.000 %    99.964 ±  0.018 %
  44     728.2769 ±   17.6831       0.00004 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.964 ±  0.018 %
  45     727.4466 ±   17.4680       0.00004 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.956 ±  0.019 %
  46     728.4052 ±   17.3137       0.00004 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.957 ±  0.019 %
  47     726.8054 ±   17.0814       0.00004 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.958 ±  0.019 %
  48     725.2216 ±   16.8666       0.00003 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.959 ±  0.018 %
  49     723.2643 ±   16.6521       0.00003 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.952 ±  0.020 %
  50     723.6195 ±   16.4879       0.00003 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.953 ±  0.019 %
  51     724.2560 ±   16.3305       0.00003 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.954 ±  0.019 %
  52     723.1021 ±   16.1320       0.00003 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.955 ±  0.018 %
  53     721.3222 ±   15.9305       0.00003 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.956 ±  0.018 %
  54     720.8257 ±   15.7539       0.00003 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.956 ±  0.018 %
  55     720.6670 ±   15.6011       0.00003 ±    0.00004       0.00000 ±    0.00000     0.000 ±  0.000 %    99.957 ±  0.017 %
  56     720.2911 ±   15.4446       0.00003 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.958 ±  0.017 %
  57     719.3650 ±   15.2757       0.00003 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.952 ±  0.018 %
  58     719.3343 ±   15.1429       0.00003 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.953 ±  0.018 %
  59     716.6217 ±   14.9456       0.00003 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.953 ±  0.018 %
  60     719.6623 ±   14.9223       0.00003 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.948 ±  0.018 %
  61     718.3877 ±   14.7454       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.018 %
  62     721.5333 ±   14.7063       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.018 %
  63     721.6858 ±   14.5951       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.950 ±  0.018 %
  64     721.6114 ±   14.4782       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.951 ±  0.017 %
  65     721.4266 ±   14.3558       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.952 ±  0.017 %
  66     718.7859 ±   14.1840       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.952 ±  0.017 %
  67     719.6319 ±   14.1144       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.953 ±  0.017 %
  68     719.4142 ±   13.9946       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.948 ±  0.017 %
  69     718.6948 ±   13.8686       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.017 %
  70     720.3316 ±   13.8043       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.950 ±  0.017 %
  71     720.1921 ±   13.6989       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.950 ±  0.017 %
  72     721.1309 ±   13.6219       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.951 ±  0.016 %
  73     721.9325 ±   13.5344       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.946 ±  0.017 %
  74     722.5413 ±   13.4512       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.947 ±  0.017 %
  75     722.3066 ±   13.3536       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.948 ±  0.017 %
  76     724.6891 ±   13.3120       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.948 ±  0.016 %
  77     723.4411 ±   13.2000       0.00002 ±    0.00003       0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.016 %
  78     724.1289 ±   13.1312       0.00002 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.950 ±  0.016 %
  79     723.7522 ±   13.0401       0.00002 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.950 ±  0.016 %
  80     723.1670 ±   12.9406       0.00002 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.951 ±  0.015 %
  81     722.7882 ±   12.8539       0.00002 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.952 ±  0.015 %
  82     721.8618 ±   12.7553       0.00002 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.952 ±  0.015 %
  83     721.8330 ±   12.6808       0.00002 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.953 ±  0.015 %
  84     724.5431 ±   12.6621       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.953 ±  0.015 %
  85     726.1077 ±   12.6264       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.954 ±  0.015 %
  86     728.1747 ±   12.5917       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.954 ±  0.014 %
  87     727.5947 ±   12.4961       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.955 ±  0.014 %
  88     728.7140 ±   12.4468       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.947 ±  0.015 %
  89     726.3536 ±   12.3251       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.947 ±  0.015 %
  90     725.7515 ±   12.2391       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.948 ±  0.015 %
  91     725.6283 ±   12.1728       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.948 ±  0.015 %
  92     725.0724 ±   12.1026       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.015 %
  93     725.5553 ±   12.0495       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.015 %
  94     727.2092 ±   12.0117       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.950 ±  0.014 %
  95     726.7849 ±   11.9400       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.950 ±  0.014 %
  96     726.2184 ±   11.8665       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.951 ±  0.014 %
  97     723.7844 ±   11.7554       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.951 ±  0.014 %
  98     725.5034 ±   11.7319       0.00001 ±    0.00002       0.00000 ±    0.00000     0.000 ±  0.000 %    99.952 ±  0.014 %
  99     724.5530 ±   11.6529       0.00001 ±    0.00002      -0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.014 %
 100     724.6320 ±   11.5917       0.00001 ±    0.00002      -0.00000 ±    0.00000     0.000 ±  0.000 %    99.949 ±  0.014 %

====== Perplexity statistics ======
Mean PPL(Q)                   : 724.631977 ±  11.591689
Mean PPL(base)                : 724.623018 ±  11.591121
Cor(ln(PPL(Q)), ln(PPL(base))): 100.00%
Mean ln(PPL(Q)/PPL(base))     :   0.000012 ±   0.000019
Mean PPL(Q)/PPL(base)         :   1.000012 ±   0.000019
Mean PPL(Q)-PPL(base)         :   0.008959 ±   0.014099

====== KL divergence statistics ======
Mean    KLD:  -0.000000 ±   0.000000
Maximum KLD:   0.000031
99.9%   KLD:   0.000023
99.0%   KLD:   0.000016
95.0%   KLD:   0.000011
90.0%   KLD:   0.000008
Median  KLD:  -0.000000
10.0%   KLD:  -0.000008
 5.0%   KLD:  -0.000011
 1.0%   KLD:  -0.000016
 0.1%   KLD:  -0.000022
Minimum KLD:  -0.000028

====== Token probability statistics ======
Mean    Δp:  0.000 ± 0.000 %
Maximum Δp:  0.010%
99.9%   Δp:  0.002%
99.0%   Δp:  0.001%
95.0%   Δp:  0.000%
90.0%   Δp:  0.000%
75.0%   Δp:  0.000%
Median  Δp:  0.000%
25.0%   Δp: -0.000%
10.0%   Δp: -0.000%
 5.0%   Δp: -0.000%
 1.0%   Δp: -0.001%
 0.1%   Δp: -0.003%
Minimum Δp: -0.018%
RMS Δp    :  0.000 ± 0.000 %
Same top p: 99.949 ± 0.014 %

@khosravipasha
Collaborator

Thanks, looks good; it's close to 0. Do you have tg128 and pp512 speeds as well?

It's okay for Q1_0; we won't be using it. Also, it seems the llama.cpp folks don't like the Q1_0_g128 naming, so most likely we will rename Q1_0_g128 => Q1_0 and remove the old Q1_0 in llama.cpp's main repo in the future.

@stfurkan
Author

stfurkan commented Apr 2, 2026

Here are the pp512/tg128 benchmarks for all three models:

Benchmarks (pp512 / tg128)

AMD EPYC Zen 4 (12 vCPU shared), AVX-512 VNNI kernel, -r 3

Model Size pp512 (tok/s) tg128 (tok/s)
Bonsai 1.7B 231 MiB 46.92 ± 2.02 98.16 ± 1.02
Bonsai 4B 540 MiB 30.35 ± 0.21 45.96 ± 0.04
Bonsai 8B 1.07 GiB 23.96 ± 0.03 27.58 ± 0.31

12 threads, BLAS backend, shared vCPU (Hetzner CPX52). Note that these numbers are with all threads on a single model; in production I run all 3 models simultaneously with 4 threads each, which gives roughly half these numbers.

Good to know about the Q1_0_g128 → Q1_0 rename. Thanks @khosravipasha

@zcattacz

zcattacz commented Apr 3, 2026

This is slower than #7 on my i5 box with AVX2 only. llama.cpp built with:

rm -rf build
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j 4

pr6

llm/llama.cpp-prism$ cd bin && rm cpu && ln -sf pr6 cpu && cd ..
llm/llama.cpp-prism$ BONSAI_MODEL=4B ./scripts/run_llama.sh -c 8192 -p "Hi"
build      : b0-unknown
model      : Bonsai-4B.gguf
modalities : text
available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file
> Hi                                                                                                                                       
Hello! I'm Bonsai, a 1-bit AI assistant developed by PrismML. I'm here to help with any questions or tasks you might have. How can I assist you today?
[ Prompt: 3.2 t/s | Generation: 2.9 t/s ]                                                                                                  
>                                                                                                                                          
Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 2003 =   540 +    1152 +     311                |

pr7

llm/llama.cpp-prism$ cd bin && rm cpu && ln -sf pr7 cpu && cd ..
llm/llama.cpp-prism$ BONSAI_MODEL=4B ./scripts/run_llama.sh -c 8192 -p "Hi"
build      : b0-unknown
model      : Bonsai-4B.gguf
modalities : text
available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file
> Hi                                                                                                                                       
Hello! I'm Bonsai, an AI assistant developed by PrismML. I'm here to help with any questions or tasks you might have. How can I assist you today?
[ Prompt: 5.2 t/s | Generation: 4.5 t/s ]                                                                                                  
>                                                                                                                                          
Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 2003 =   540 +    1152 +     311                |

llm/llama.cpp-prism$ BONSAI_MODEL=8B ./scripts/run_llama.sh -c 8192 -p "Hi"
build      : b0-unknown
model      : Bonsai-8B.gguf
modalities : text
available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file
> Hi                                                                                                                                       
Hello! I'm Bonsai, an AI assistant developed by PrismML. How can I assist you today?
[ Prompt: 2.7 t/s | Generation: 2.5 t/s ]                                                                                                  
>
Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 2571 =  1099 +    1152 +     320                |

- AVX2: replace manual int8→int16→int32 reduction with mul_sum_i8_pairs_float()
  (auto-selects AVXVNNI dpbssd on supported CPUs)
- Both paths: accumulate into __m256 float via fmadd_ps, single hsum_float_8 at end
  (eliminates per-block horizontal int32 sum)
- Remove unused variables and constants
@stfurkan
Author

stfurkan commented Apr 3, 2026

@zcattacz Thanks for testing! I've updated the AVX2 path: I replaced the manual int8→int16→int32 reduction chain with mul_sum_i8_pairs_float() (which auto-selects dpbssd on AVXVNNI CPUs) and switched to a float accumulator with fmadd_ps plus a single hsum_float_8() at the end. This should be significantly faster on AVX2-only hardware.

Updated benchmarks (AMD EPYC Zen 4, 12 vCPU shared):

Model pp512 (tok/s) tg128 (tok/s)
1.7B 46.87 ± 3.08 118.65 ± 0.14
4B 30.87 ± 1.08 59.79 ± 0.39
8B 23.92 ± 0.42 35.03 ± 0.11

tg128 improved by ~20-30% over the previous version. It would be great to see your i5 numbers with this update if you get a chance.

@zcattacz

zcattacz commented Apr 3, 2026

Hi @stfurkan, I don't know which one is likely to get worked on, so I commented on both PRs. Kinda awkward... The optimization that later gave me real speed is documented in a later reply in #7. It's not #7's approach but comes partly from #4, with an extra 40%~60% boost in tps.



4 participants