fix: Q1_0_g128 CPU kernel - correct output and AVX-512 SIMD #3

Open
jordankzf wants to merge 1 commit into PrismML-Eng:prism from jordankzf:fix/q1_0_g128-cpu-kernel

Conversation

@jordankzf

Summary

  • Fix float-to-int truncation bug in Q1_0_g128 vec_dot that produced gibberish on CPU
  • Add AVX-512BW SIMD path (was purely scalar before)

Bug

sumi was declared int but accumulated float partial products (d1 * sumi_block), silently truncating to zero for small scale values. Affects both the x86 and generic fallback kernels.
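A minimal C sketch of the failure mode, with hypothetical values rather than the kernel's actual data: adding a float partial product into an `int` accumulator truncates it on every block, so any per-block contribution below 1.0 vanishes.

```c
#include <math.h>

/* Sketch of the truncation bug (illustrative names and values, not the
 * actual kernel code): each block contributes d1 * sumi_block, where d1
 * is a small dequantization scale. */

/* Buggy variant: the float partial product is truncated into an int,
 * so anything below 1.0 is lost on every block. */
static int dot_buggy(float d1, int sumi_block, int nblocks) {
    int sumi = 0;
    for (int b = 0; b < nblocks; ++b) {
        sumi += d1 * sumi_block;  /* e.g. 0.3f truncates to 0 each time */
    }
    return sumi;
}

/* Fixed variant: accumulate in float; nothing is truncated. */
static float dot_fixed(float d1, int sumi_block, int nblocks) {
    float sumi = 0.0f;
    for (int b = 0; b < nblocks; ++b) {
        sumi += d1 * sumi_block;
    }
    return sumi;
}
```

With `d1 = 0.001` and `sumi_block = 300`, the buggy variant returns 0 regardless of the number of blocks, while the fixed one returns the expected sum; this is why small scale values turned the whole dot product into zeros (and the output into gibberish).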

Changes

ggml/src/ggml-cpu/arch/x86/quants.c

  • int sumi -> float sumi in scalar fallback
  • New #if defined(__AVX512BW__) path: sign-extend int8->int16, mask-negate via _mm512_mask_sub_epi16, pairwise reduce via _mm512_madd_epi16, float accumulate via _mm512_fmadd_ps, single horizontal sum at the end
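A scalar C mirror of what the vectorized path computes, under the assumption that each Q1_0 weight bit selects the sign of the corresponding int8 activation; the function and parameter names are illustrative, not the kernel's, and the actual bit layout of Q1_0_g128 may differ.

```c
#include <math.h>
#include <stdint.h>

/* Scalar reference for the vectorized inner loop: negate the int8 value
 * where the 1-bit weight is set (the _mm512_mask_sub_epi16 step), sum
 * the group in integer (the _mm512_madd_epi16 pairwise reduce collapses
 * to this), then scale and accumulate in float (the _mm512_fmadd_ps
 * step). Illustrative signature, not the kernel's. */
static float vec_dot_scalar_ref(int ngroups, int group_size,
                                const uint8_t *bits,  /* 1 bit per weight */
                                const int8_t  *q,     /* int8 values      */
                                const float   *d)     /* per-group scales */
{
    float sumf = 0.0f;
    for (int g = 0; g < ngroups; ++g) {
        int32_t sumi = 0;  /* integer accumulator, one per group */
        for (int i = 0; i < group_size; ++i) {
            const int idx = g * group_size + i;
            const int bit = (bits[idx / 8] >> (idx % 8)) & 1;
            const int16_t x = bit ? (int16_t)-q[idx] : (int16_t)q[idx];
            sumi += x;
        }
        sumf += d[g] * (float)sumi;  /* float accumulate, no truncation */
    }
    return sumf;
}
```

The AVX-512BW path does the same work 32 elements at a time and defers the horizontal reduction to a single pass at the end, which is where the speedup over the scalar loop comes from.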

ggml/src/ggml-cpu/quants.c

  • int sumi -> float sumi in generic fallback

Benchmarks (Bonsai-8B, CPU-only, Intel Ice Lake AVX-512)

|        | Prompt    | Generation | Output    |
| ------ | --------- | ---------- | --------- |
| Before | 0.73 t/s  | 0.65 t/s   | Gibberish |
| After  | 23.2 t/s  | 13.5 t/s   | Coherent  |

@github-actions github-actions bot added the ggml label Apr 1, 2026
@requeijaum

Worked like a charm on my Ryzen 5700U.

rafaelfrequiao@ideapad:~$ echo "=== 1. Preparando o repositório com o PR #3 ==="
cd ~/ai-lab
rm -rf llama.cpp-bonsai
git clone https://github.com/PrismML-Eng/llama.cpp.git llama.cpp-bonsai
cd llama.cpp-bonsai
# Aqui está a mágica: baixando a correção exata do Pull Request 3
git fetch origin pull/3/head:correcao-cpu
git checkout correcao-cpu
echo "=== 2. Compilando com a correção ==="
cmake -B build
cmake --build build -j$(nproc) --target llama-cli llama-server
echo "=== 3. Verificando o modelo 8B ==="
if [ ! -f ~/ai-lab/Bonsai-demo/models/gguf/8B/Bonsai-8B.gguf ]; then
    echo "Modelo não encontrado. Baixando o Bonsai 8B..."
    mkdir -p ~/ai-lab/Bonsai-demo/models/gguf/8B
    curl -L -o ~/ai-lab/Bonsai-demo/models/gguf/8B/Bonsai-8B.gguf "https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf"
fi
echo "=== 4. O Teste de Fogo ==="
./build/bin/llama-cli \
  -m ~/ai-lab/Bonsai-demo/models/gguf/8B/*.gguf \
  -p "A capital do Brasil é " \
  -n 50 \
  -t 8
=== 1. Preparando o repositório com o PR #3 ===
Clonando en 'llama.cpp-bonsai'...
remote: Enumerating objects: 66862, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 66862 (delta 8), reused 2 (delta 2), pack-reused 66823 (from 2)
Recibiendo objetos: 100% (66862/66862), 307.07 MiB | 3.32 MiB/s, listo.
Resolviendo deltas: 100% (47439/47439), listo.
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 0), reused 0 (delta 0), pack-reused 2 (from 1)
Desempaquetando objetos: 100% (9/9), 31.66 KiB | 810.00 KiB/s, listo.
Desde https://github.com/PrismML-Eng/llama.cpp
 * [nueva referencia]    refs/pull/3/head -> correcao-cpu
Cambiado a rama 'correcao-cpu'
=== 2. Compilando com a correção ===
-- The C compiler identification is GNU 14.2.0
-- The CXX compiler identification is GNU 14.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMAKE_BUILD_TYPE=Release
-- Found Git: /usr/bin/git (found version "2.47.3")
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- ggml version: 0.9.7
-- ggml commit:  aec184c6b
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libcrypto.so (found version "3.5.4")
-- Performing Test OPENSSL_VERSION_SUPPORTED
-- Performing Test OPENSSL_VERSION_SUPPORTED - Success
-- OpenSSL found: 3.5.4
-- Generating embedded license file for target: common
-- Configuring done (3.9s)
-- Generating done (0.3s)
-- Build files have been written to: /home/rafaelfrequiao/ai-lab/llama.cpp-bonsai/build
[  0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o
[  1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o
[  1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[  1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[  1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
[  3%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o
[  3%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[  3%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o
[  3%] Building CXX object vendor/cpp-httplib/CMakeFiles/cpp-httplib.dir/httplib.cpp.o
[  3%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[  3%] Built target build_info
[  3%] Linking CXX static library libcpp-httplib.a
[  3%] Built target cpp-httplib
[  3%] Linking CXX shared library ../../bin/libggml-base.so
[  3%] Built target ggml-base
[  5%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/unary-ops.cpp.o
[  5%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/mmq.cpp.o
[  5%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/quants.c.o
[  7%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o
[  7%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/repack.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/vec.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/traits.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ops.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/amx.cpp.o
[  9%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/arch/x86/quants.c.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/binary-ops.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/hbm.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/arch/x86/repack.cpp.o
[  9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/llamafile/sgemm.cpp.o
[  9%] Linking CXX shared library ../../bin/libggml-cpu.so
[  9%] Built target ggml-cpu
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-backend-reg.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-backend-dl.cpp.o
[ 11%] Linking CXX shared library ../../bin/libggml.so
[ 11%] Built target ggml
[ 13%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 13%] Building CXX object src/CMakeFiles/llama.dir/llama-arch.cpp.o
[ 15%] Building CXX object src/CMakeFiles/llama.dir/llama-chat.cpp.o
[ 15%] Building CXX object src/CMakeFiles/llama.dir/llama-batch.cpp.o
[ 15%] Building CXX object src/CMakeFiles/llama.dir/llama-cparams.cpp.o
[ 15%] Building CXX object src/CMakeFiles/llama.dir/llama-adapter.cpp.o
[ 15%] Building CXX object src/CMakeFiles/llama.dir/llama-context.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-kv-cache.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-io.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-hparams.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-graph.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-impl.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-kv-cache-iswa.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-grammar.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-memory.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/llama-memory-hybrid.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/llama-quant.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/llama-mmap.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/llama-memory-hybrid-iswa.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/llama-model.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/llama-model-saver.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/llama-memory-recurrent.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/llama-sampler.cpp.o
[ 23%] Building CXX object src/CMakeFiles/llama.dir/llama-vocab.cpp.o
[ 23%] Building CXX object src/CMakeFiles/llama.dir/llama-model-loader.cpp.o
[ 23%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 25%] Building CXX object src/CMakeFiles/llama.dir/models/arcee.cpp.o
[ 25%] Building CXX object src/CMakeFiles/llama.dir/models/apertus.cpp.o
[ 25%] Building CXX object src/CMakeFiles/llama.dir/unicode-data.cpp.o
[ 25%] Building CXX object src/CMakeFiles/llama.dir/models/arctic.cpp.o
[ 25%] Building CXX object src/CMakeFiles/llama.dir/models/afmoe.cpp.o
[ 25%] Building CXX object src/CMakeFiles/llama.dir/models/arwkv7.cpp.o
[ 26%] Building CXX object src/CMakeFiles/llama.dir/models/baichuan.cpp.o
[ 26%] Building CXX object src/CMakeFiles/llama.dir/models/bert.cpp.o
[ 26%] Building CXX object src/CMakeFiles/llama.dir/models/bailingmoe2.cpp.o
[ 26%] Building CXX object src/CMakeFiles/llama.dir/models/bailingmoe.cpp.o
[ 26%] Building CXX object src/CMakeFiles/llama.dir/models/chameleon.cpp.o
[ 26%] Building CXX object src/CMakeFiles/llama.dir/models/chatglm.cpp.o
[ 26%] Building CXX object src/CMakeFiles/llama.dir/models/bitnet.cpp.o
[ 26%] Building CXX object src/CMakeFiles/llama.dir/models/codeshell.cpp.o
[ 28%] Building CXX object src/CMakeFiles/llama.dir/models/bloom.cpp.o
[ 30%] Building CXX object src/CMakeFiles/llama.dir/models/cohere2-iswa.cpp.o
[ 30%] Building CXX object src/CMakeFiles/llama.dir/models/cogvlm.cpp.o
[ 30%] Building CXX object src/CMakeFiles/llama.dir/models/dbrx.cpp.o
[ 30%] Building CXX object src/CMakeFiles/llama.dir/models/command-r.cpp.o
[ 32%] Building CXX object src/CMakeFiles/llama.dir/models/deci.cpp.o
[ 32%] Building CXX object src/CMakeFiles/llama.dir/models/deepseek.cpp.o
[ 32%] Building CXX object src/CMakeFiles/llama.dir/models/deepseek2.cpp.o
[ 32%] Building CXX object src/CMakeFiles/llama.dir/models/delta-net-base.cpp.o
[ 32%] Building CXX object src/CMakeFiles/llama.dir/models/dots1.cpp.o
[ 34%] Building CXX object src/CMakeFiles/llama.dir/models/dream.cpp.o
[ 34%] Building CXX object src/CMakeFiles/llama.dir/models/ernie4-5-moe.cpp.o
[ 34%] Building CXX object src/CMakeFiles/llama.dir/models/ernie4-5.cpp.o
[ 34%] Building CXX object src/CMakeFiles/llama.dir/models/exaone4.cpp.o
[ 34%] Building CXX object src/CMakeFiles/llama.dir/models/eurobert.cpp.o
[ 36%] Building CXX object src/CMakeFiles/llama.dir/models/exaone-moe.cpp.o
[ 36%] Building CXX object src/CMakeFiles/llama.dir/models/falcon-h1.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/gemma-embedding.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/exaone.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/gemma2-iswa.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/gemma3.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/gemma.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/falcon.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/gpt2.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/glm4.cpp.o
[ 38%] Building CXX object src/CMakeFiles/llama.dir/models/glm4-moe.cpp.o
[ 40%] Building CXX object src/CMakeFiles/llama.dir/models/gemma3n-iswa.cpp.o
[ 40%] Building CXX object src/CMakeFiles/llama.dir/models/grok.cpp.o
[ 40%] Building CXX object src/CMakeFiles/llama.dir/models/grovemoe.cpp.o
[ 42%] Building CXX object src/CMakeFiles/llama.dir/models/gptneox.cpp.o
[ 44%] Building CXX object src/CMakeFiles/llama.dir/models/granite.cpp.o
[ 44%] Building CXX object src/CMakeFiles/llama.dir/models/hunyuan-dense.cpp.o
[ 44%] Building CXX object src/CMakeFiles/llama.dir/models/granite-hybrid.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/models/jais.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/models/jais2.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/models/internlm2.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/models/kimi-linear.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/models/jamba.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/models/hunyuan-moe.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/models/lfm2.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/models/llada-moe.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/models/llada.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/models/maincoder.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/models/llama.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/models/llama-iswa.cpp.o
[ 50%] Building CXX object src/CMakeFiles/llama.dir/models/mamba-base.cpp.o
[ 50%] Building CXX object src/CMakeFiles/llama.dir/models/mamba.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/mimo2-iswa.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/minicpm3.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/minimax-m2.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/mpt.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/modern-bert.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/nemotron.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/mistral3.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/nemotron-h.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/neo-bert.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/olmo2.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/olmo.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/olmoe.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/openelm.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/openai-moe-iswa.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/pangu-embedded.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/orion.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/plamo.cpp.o
[ 57%] Building CXX object src/CMakeFiles/llama.dir/models/paddleocr.cpp.o
[ 57%] Building CXX object src/CMakeFiles/llama.dir/models/phi2.cpp.o
[ 57%] Building CXX object src/CMakeFiles/llama.dir/models/phi3.cpp.o
[ 59%] Building CXX object src/CMakeFiles/llama.dir/models/plamo2.cpp.o
[ 59%] Building CXX object src/CMakeFiles/llama.dir/models/plm.cpp.o
[ 59%] Building CXX object src/CMakeFiles/llama.dir/models/qwen.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/qwen2moe.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/qwen2.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/qwen2vl.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/plamo3.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/qwen35.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/qwen35moe.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3moe.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3next.cpp.o
[ 65%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3vl-moe.cpp.o
[ 65%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3vl.cpp.o
[ 65%] Building CXX object src/CMakeFiles/llama.dir/models/refact.cpp.o
[ 65%] Building CXX object src/CMakeFiles/llama.dir/models/rnd1.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv6.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv6-base.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv6qwen2.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv7.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv7-base.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/seed-oss.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/smallthinker.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/stablelm.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/step35-iswa.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/starcoder.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/smollm3.cpp.o
[ 71%] Building CXX object src/CMakeFiles/llama.dir/models/starcoder2.cpp.o
[ 71%] Building CXX object src/CMakeFiles/llama.dir/models/t5-dec.cpp.o
[ 71%] Building CXX object src/CMakeFiles/llama.dir/models/t5-enc.cpp.o
[ 73%] Building CXX object src/CMakeFiles/llama.dir/models/xverse.cpp.o
[ 73%] Building CXX object src/CMakeFiles/llama.dir/models/wavtokenizer-dec.cpp.o
[ 73%] Linking CXX shared library ../bin/libllama.so
[ 73%] Built target llama
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/glm4v.cpp.o
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/clip.cpp.o
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/conformer.cpp.o
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/mtmd.cpp.o
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/nemotron-v2-vl.cpp.o
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/mtmd-audio.cpp.o
[ 75%] Building CXX object common/CMakeFiles/common.dir/arg.cpp.o
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/cogvlm.cpp.o
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/kimik25.cpp.o
[ 75%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/internvl.cpp.o
[ 76%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/llava.cpp.o
[ 76%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/mtmd-helper.cpp.o
[ 76%] Building CXX object common/CMakeFiles/common.dir/chat-parser-xml-toolcall.cpp.o
[ 76%] Building CXX object common/CMakeFiles/common.dir/chat-parser.cpp.o
[ 78%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/kimivl.cpp.o
[ 78%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/llama4.cpp.o
[ 80%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/minicpmv.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/chat.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/chat-peg-parser.cpp.o
[ 80%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/paddleocr.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 80%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/pixtral.cpp.o
[ 80%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/qwen3vl.cpp.o
[ 80%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/siglip.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 80%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/whisper-enc.cpp.o
[ 80%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/mobilenetv5.cpp.o
[ 82%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/youtuvl.cpp.o
[ 82%] Building CXX object common/CMakeFiles/common.dir/json-partial.cpp.o
[ 82%] Building CXX object common/CMakeFiles/common.dir/download.cpp.o
[ 84%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/models/qwen2vl.cpp.o
[ 86%] Building CXX object common/CMakeFiles/common.dir/debug.cpp.o
[ 86%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 86%] Building CXX object common/CMakeFiles/common.dir/llguidance.cpp.o
[ 86%] Building CXX object common/CMakeFiles/common.dir/ngram-map.cpp.o
[ 88%] Building CXX object common/CMakeFiles/common.dir/log.cpp.o
[ 88%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 88%] Building CXX object common/CMakeFiles/common.dir/preset.cpp.o
[ 88%] Building CXX object common/CMakeFiles/common.dir/ngram-mod.cpp.o
[ 90%] Building CXX object common/CMakeFiles/common.dir/peg-parser.cpp.o
[ 92%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o
[ 92%] Building CXX object common/CMakeFiles/common.dir/unicode.cpp.o
[ 92%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 92%] Linking CXX shared library ../../bin/libmtmd.so
[ 92%] Building CXX object common/CMakeFiles/common.dir/jinja/lexer.cpp.o
[ 92%] Building CXX object common/CMakeFiles/common.dir/regex-partial.cpp.o
[ 92%] Building CXX object common/CMakeFiles/common.dir/jinja/runtime.cpp.o
[ 92%] Building CXX object common/CMakeFiles/common.dir/jinja/parser.cpp.o
[ 92%] Building CXX object common/CMakeFiles/common.dir/jinja/caps.cpp.o
[ 94%] Building CXX object common/CMakeFiles/common.dir/jinja/value.cpp.o
[ 94%] Building CXX object common/CMakeFiles/common.dir/__/license.cpp.o
[ 94%] Building CXX object common/CMakeFiles/common.dir/jinja/string.cpp.o
[ 96%] Linking CXX static library libcommon.a
[ 96%] Built target mtmd
[ 96%] Built target common
[ 96%] Building CXX object tools/server/CMakeFiles/server-context.dir/server-queue.cpp.o
[ 96%] Building CXX object tools/server/CMakeFiles/server-context.dir/server-task.cpp.o
[ 96%] Building CXX object tools/server/CMakeFiles/server-context.dir/server-context.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/server-context.dir/server-common.cpp.o
[ 98%] Linking CXX static library libserver-context.a
[ 98%] Built target server-context
[100%] Building CXX object tools/cli/CMakeFiles/llama-cli.dir/cli.cpp.o
[100%] Linking CXX executable ../../bin/llama-cli
[100%] Built target llama-cli
[  0%] Built target build_info
[  0%] Built target cpp-httplib
[  3%] Built target ggml-base
[  9%] Built target ggml-cpu
[ 11%] Built target ggml
[ 73%] Built target llama
[ 82%] Built target mtmd
[ 96%] Built target common
[ 98%] Built target server-context
[ 98%] Generating index.html.gz.hpp
[ 98%] Generating loading.html.hpp
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-http.cpp.o
[100%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server.cpp.o
[100%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-models.cpp.o
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server
=== 3. Verificando o modelo 8B ===
=== 4. O Teste de Fogo ===

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8195-aec184c6b
model      : Bonsai-8B.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> A capital do Brasil é 

A capital do Brasil é **Brasília**.

[ Prompt: 0,2 t/s | Generation: 0,2 t/s ]

> 

Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 10627 =  1099 +    9216 +     312                |
rafaelfrequiao@ideapad:~/ai-lab/llama.cpp-bonsai$ 

@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I have removed the buggy x86 path and will merge one of the correct AVX ones.

Could you run the KL divergence tests described here: #8

@jordankzf
Author

@khosravipasha I can't run the KL divergence tests.

I have llama-perplexity built and the wikitext-2-raw dataset ready.

The Bonsai-1.7B-f16.gguf and Bonsai-1.7B-Q1_0.gguf models look like they're in private repos. Could you make them public?

Note that Bonsai-1.7B-Q1_0_g128.gguf is publicly available.

The Q1_0_g128 vec_dot kernel had a bug where `sumi` was declared as
`int` but accumulated `float` partial products (`d1 * sumi_block`),
causing float-to-int truncation that destroyed dot product results
and produced gibberish output on CPU.

Additionally, the x86 kernel was purely scalar (one bit at a time).
This adds an AVX-512BW path that processes 32 elements per iteration
using mask_sub + madd + fma, with a single horizontal reduction at
the end.

Benchmarks (Bonsai-8B, CPU-only, AVX-512):
  Before:  0.73 t/s prompt, 0.65 t/s generation (gibberish output)
  After:  23.2 t/s prompt, 13.5 t/s generation (coherent output)
jordankzf force-pushed the fix/q1_0_g128-cpu-kernel branch from aec184c to 082e830 on April 3, 2026 09:51
@jordankzf
Author

jordankzf commented Apr 3, 2026

The f16 GGUF isn't available on HuggingFace so I converted it from prism-ml/Bonsai-1.7B-unpacked (safetensors) using convert_hf_to_gguf.py --outtype f16.

Setup:

  • f16 reference: converted from prism-ml/Bonsai-1.7B-unpacked safetensors
  • Q1_0_g128: prism-ml/Bonsai-1.7B-gguf (public GGUF)
  • Dataset: wikitext-2-raw, 100 chunks, ctx=512
  • Flags: -ngl 0 (CPU-only)

Results (Q1_0_g128 vs f16):

  • Same top p: 0.075 +/- 0.017 %
  • RMS delta p: 51.124 +/- 0.224 %

This is a really bad result. I'm guessing that the unpacked safetensors on HF is a different checkpoint from what was quantized into the Q1_0_g128 GGUF. Neither Bonsai-1.7B-f16 nor Bonsai-1.7B-Q1_0 exists as a public repo on HF, or am I missing something?

@khosravipasha
Collaborator

khosravipasha commented Apr 3, 2026

@jordankzf
Oh, my bad, I just made those locally and was using the local paths. For the fp16 GGUF you should be able to download prism-ml/Bonsai-8B-unpacked and convert it to an fp16 GGUF using llama.cpp's GGUF converter tool, then run llama-quantize to get Q1_0 or Q1_0_g128. You can potentially ignore the Q1_0 one; we are not using it in any of the models (but it should still work, since a group size of 128 can be packed into groups of 32 because 128 mod 32 == 0).

This might mean some issue with the kernels; can you run the same command without your changes? In the meantime I will check the 1.7B unpacked weights to see if they are good.

Also, you might not need 100 chunks for this test; a few chunks are okay (at least until you get close to 0).

Is the output from the model coherent? Try a few complicated prompts to see if the kernels are working.

Two options to get the fp16 GGUF:

  1. First convert the Bonsai-1.7B-unpacked to Q1_0_g128 using llama-quantize and --pure:
./build/bin/llama-quantize --pure models/Bonsai-1.7B.gguf models/Bonsai-1.7B-Q1_0_g128.gguf Q1_0_g128
  2. Dequantize the Q1_0_g128 using llama-quantize, something like this:
./build/bin/llama-quantize --allow-requantize models/Bonsai-1.7B.gguf models/Bonsai-1.7B-f16.gguf F16

@jordankzf
Author

Made the changes @khosravipasha

Please have a look!

KL divergence and coherence test results.

Setup:

  • f16 reference: dequantized from Bonsai-1.7B.gguf via llama-quantize --allow-requantize ... F16
  • Q1_0_g128: prism-ml/Bonsai-1.7B-gguf
  • Dataset: wikitext-2-raw, 10 chunks, ctx=512
  • CPU-only build (no Vulkan), AVX-512 kernel

KL Divergence (Q1_0_g128 vs f16 dequantized from same model):

  • Same top p: 97.843 +/- 0.288 %
  • Maximum KLD: 0.021331
  • RMS delta p: 0.952 +/- 0.030 %

(Earlier results showing 0.075% were from a stale binary that was not rebuilt after source changes.)

Coherence test (Bonsai-1.7B, complex prompts):

Q: Explain the difference between TCP and UDP in networking
A: TCP and UDP are both protocols used for data transmission in networking, but they have fundamental differences in how they handle data, reliabi...

Q: Write a haiku about programming
A: Code flows like rivers, / Through lines of logic, it flows. / A world of possibilities.

Q: What causes ocean tides? Explain briefly.
A: Ocean tides are caused by the gravitational pull of the Moon and the Sun on the Earth oceans...

All responses are coherent and factually correct. The 1.7B model runs at 33-40 t/s on CPU (AVX-512).
