vulkan: add Q1_0_g128 (1-bit ternary) shader support by claudlos · Pull Request #9 · PrismML-Eng/llama.cpp

claudlos · 2026-04-03T00:43:44Z

Add Vulkan shader support for Q1_0_g128 quantization

Summary

This PR adds Vulkan shader support for the GGML_TYPE_Q1_0_g128 (1-bit sign / binary quantization, group size 128) format. The primary validated paths today are dequantization, get_rows, and fused mul_mat_vec. Without these shaders, Q1_0_g128 models fall back to CPU dequantization on Vulkan devices, resulting in ~291 graph splits and extremely poor performance. With this PR, the tested inference path runs almost entirely on GPU with only 2 graph splits.

Performance Results

Metric	Before (CPU fallback)	After (Vulkan shaders)	Change
Eval (token gen)	0.28 t/s	23.4 t/s	84x
Prompt eval (135 tokens)	0.31 t/s	38.3 t/s	124x
Graph splits	291	2	-99.3%

Comparison: Qwen2.5-3B Q4_K_M achieves 27.8 t/s on the same hardware. Bonsai 8B Q1_0_g128 reaches 84% of that speed (23.4 / 27.8) with roughly 2.7x the parameter count, which supports the efficiency of 1-bit inference.

Files Changed

New Files

ggml/src/ggml-vulkan/vulkan-shaders/dequant_q1_0_g128.comp — Standalone dequantization compute shader. Each thread processes one 128-element block (16 bytes of packed sign bits + fp16 scale), used for get_rows and general dequantization pipelines.
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q1_0_g128.comp — Custom fused matrix-vector multiply shader. Uses 4 threads per 128-element block (32 elements/thread = one uint32 of packed sign bits). Maps each bit to ±1.0, which compiles to efficient v_cndmask-style code on RDNA GPUs. Accumulates via 8 dot(vec4, vec4) operations per thread with fma for the final scale multiply.
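As a rough CPU-side illustration of what the dequantization shader computes for one block, the following sketch expands 128 packed sign bits into ±d values. It is not the shader code from this PR. The names `BlockQ1Sketch`, `sign_bit`, `dequantize_block`, and `bits_per_weight` are hypothetical, the scale is held as a `float` rather than fp16 to keep the example dependency-free, and the bit order (LSB first within each byte, set bit = +d) is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical mirror of one Q1_0_g128 block. The real block stores the
// scale as fp16; a float is used here only to avoid an fp16 conversion helper.
struct BlockQ1Sketch {
    float   d;       // per-block scale
    uint8_t qs[16];  // 128 packed sign bits
};

// Read the sign bit of element i.
// Assumption: element i lives in byte i/8 at bit position i%8 (LSB first).
inline int sign_bit(const BlockQ1Sketch & b, int i) {
    assert(i >= 0 && i < 128);
    return (b.qs[i / 8] >> (i % 8)) & 1;
}

// Expand one block to 128 floats.
// Assumption: a set bit maps to +d and a clear bit maps to -d.
inline std::array<float, 128> dequantize_block(const BlockQ1Sketch & b) {
    std::array<float, 128> out{};
    for (int i = 0; i < 128; ++i) {
        out[i] = sign_bit(b, i) ? b.d : -b.d;
    }
    return out;
}

// Storage cost of the described layout, assuming no padding:
// a 2-byte fp16 scale plus 16 bytes of sign bits per 128 weights.
constexpr double bits_per_weight() {
    return (2 + 16) * 8.0 / 128.0;
}
```

Under the no-padding assumption, the described layout costs 18 bytes per 128 weights, which is what `bits_per_weight()` evaluates.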

Modified Files

ggml/src/ggml-vulkan/vulkan-shaders/types.glsl — Added block_q1_0_g128 struct definition (fp16 scale d + 16-byte qs array) and DATA_A_Q1_0_G128 preprocessor configuration with QUANT_K=128, QUANT_R=1, QUANT_AUXF=1.
ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs.glsl — Added dequantize() and dequantize4() functions for Q1_0_g128 (bit extraction → sign mapping). Added Q1_0_g128 to the single-scale get_dm() path (returns vec2(d, 0)).
ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_funcs.glsl — Added Q1_0_g128 matrix-matrix multiply load path in load_a_to_shmem. Uses branchless FMA dequantization: d*(2*bit - 1) = fma(2d, bit_float, -d) for efficient SIMD utilization.
ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp — Registered q1_0_g128 in the type list, marked as legacy quant with LOAD_VEC_A=4. Excluded from coopmat2 flash attention paths (no dequantFunc defined), excluded from integer dot product q8_1 MMQ paths (no Q8_1 quantization mapping exists for 1-bit types).
ggml/src/ggml-vulkan/ggml-vulkan.cpp — Registered Q1_0_g128 pipelines across all shader pipeline arrays: dequant, mul_mat_vec (f32 and f16 B-type variants), mul_mat_vec_id, get_rows, get_rows_f32. Added Q1_0_g128 to supports_op switch cases for GGML_OP_MUL_MAT and GGML_OP_GET_ROWS. Uses rm_iq row multiplier and subgroup16 configuration (matching IQ-type pipeline parameters).
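The branchless identity quoted for mul_mm_funcs.glsl, d*(2*bit - 1) = fma(2d, bit_float, -d), can be checked on the CPU. The sketch below compares it with the conditional form; `dequant_select` and `dequant_fma` are hypothetical helper names, not functions from the PR.

```cpp
#include <cassert>
#include <cmath>

// Conditional form: choose +d or -d from the sign bit.
inline float dequant_select(float d, int bit) {
    assert(bit == 0 || bit == 1);
    return bit ? d : -d;
}

// Branchless form from the PR text: d * (2*bit - 1) == fma(2d, bit, -d).
// bit = 0 gives 0 - d = -d, and bit = 1 gives 2d - d = d.
inline float dequant_fma(float d, int bit) {
    assert(bit == 0 || bit == 1);
    return std::fma(2.0f * d, static_cast<float>(bit), -d);
}
```

For finite scales where 2d does not overflow, both forms return the same value, so the choice between them is about instruction selection rather than numerics.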

Technical Design Decisions

Custom fused mul_mat_vec shader rather than generic path: The 1-bit format has a unique structure (128 elements packed as 16 bytes of sign bits + single fp16 scale) that doesn't fit the standard mul_mat_vec.comp template. The fused shader avoids intermediate dequantization and directly computes dot products from packed bits.
4 threads per block: Each thread handles 32 elements (one uint32 worth of bits), loading 4 bytes and expanding to 8 vec4 dot products. This maps well to GPU wavefronts.
Excluded from coopmat2/flash attention: Q1_0_g128 is a weights-only quantization (KV cache uses f16). No dequantFunc symbol, which the cooperative matrix path requires, is defined for it, and flash attention operates on KV cache types, not weight types.
Excluded from integer dot MMQ: The Q8_1 quantized matmul path requires a compatible requantization scheme that doesn't exist for 1-bit types.
Branchless FMA in mul_mm_funcs: The matrix-matrix path uses fma(2d, bit_float, -d) instead of conditional selection, which is more efficient for the wider SIMD paths used in batched matmul.
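The fused matrix-vector design above can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the shader from this PR: the four GPU threads per block are emulated by a loop, each group of four elements stands in for one dot(vec4, vec4), and the bit order (bit k of each uint32 is the sign of element k, set = +1) is assumed. The names `lane_partial` and `block_dot_accumulate` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// One "thread": 32 sign bits against 32 activations, as 8 groups of 4.
// Assumption: bit k of `bits` is the sign of element k (set = +1, clear = -1).
inline float lane_partial(uint32_t bits, const float * x) {
    assert(x != nullptr);
    float acc = 0.0f;
    for (int g = 0; g < 8; ++g) {       // stands in for 8 dot(vec4, vec4) calls
        float dot4 = 0.0f;
        for (int j = 0; j < 4; ++j) {
            const int k = 4 * g + j;
            const float s = ((bits >> k) & 1u) ? 1.0f : -1.0f;
            dot4 += s * x[k];
        }
        acc += dot4;
    }
    return acc;
}

// Dot product of one 128-element block with 128 activations, added to `acc`.
// The four lanes run sequentially here; on the GPU they would be four threads
// whose partial sums are combined. The scale is applied once, via fma.
inline float block_dot_accumulate(float acc, float d,
                                  const uint32_t bits[4], const float * x) {
    float sum = 0.0f;
    for (int lane = 0; lane < 4; ++lane) {
        sum += lane_partial(bits[lane], x + 32 * lane);
    }
    return std::fma(d, sum, acc);
}
```

Because the weights are only ±1 before scaling, the inner loop never materializes dequantized values, and the per-block scale is applied to the summed result instead of to each element.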

Testing

Model: Bonsai-8B Q1_0_g128 GGUF (1-bit sign-quantized 8B parameter model)
Hardware: AMD Radeon 680M (RDNA2 integrated GPU, 12 Compute Units, 18GB shared VRAM)
Driver: AMD Adrenalin on Windows 11
Test methodology: Interactive text generation via llama-cli with Vulkan backend (-ngl 99)
Correctness validation: Output text is coherent and contextually appropriate
12 shader variants were tested during optimization to arrive at the final design
Prompt evaluation: 135-token prompt, measured throughput
Token generation: Measured sustained generation rate

Known Limitations

No cooperative matrix (coopmat2) support: Q1_0_g128 does not participate in cooperative matrix matmul or flash attention paths. This is by design — these paths require a dequantFunc symbol and Q1_0_g128 is weights-only.
No integer dot product (MMQ) support: The q8_1 integer dot product optimization path is excluded for Q1_0_g128 since no compatible requantization scheme exists.
No F16 B-type mul_mat support: Q1_0_g128 with F16 input tensors is explicitly blocked in supports_op. Only F32 B-type is supported. This avoids the complexity of F16 pipeline wiring for a weights-only quantization format (KV cache uses f16, but input activations are f32).
Tested on RDNA2 only: While the shaders use standard Vulkan compute (no vendor-specific extensions), they have only been validated on AMD RDNA2. Testing on NVIDIA and Intel GPUs is recommended.
Flash Attention: N/A for weight quantization types. Q1_0_g128 models use f16 KV cache, which already has full Vulkan support.

Hardware Tested

GPU: AMD Radeon 680M (RDNA2 iGPU)
Compute Units: 12
VRAM: 18GB shared (system memory)
OS: Windows 11
Vulkan API: 1.3

Add Vulkan compute shader support for the GGML_TYPE_Q1_0_g128 quantization format (1-bit sign / binary quantization, group size 128).

New files:
- dequant_q1_0_g128.comp: standalone dequantization shader
- mul_mat_vec_q1_0_g128.comp: fused matrix-vector multiply shader (4 threads/block, 32 elements/thread, 8x dot(vec4,vec4))

Modified files:
- types.glsl: block_q1_0_g128 struct, QUANT_K=128, QUANT_R=1
- dequant_funcs.glsl: dequantize/dequantize4 + single-scale get_dm
- mul_mm_funcs.glsl: branchless FMA load path for batched matmul
- vulkan-shaders-gen.cpp: type registration, LOAD_VEC_A=4, excluded from coopmat2 flash attention and integer dot product Q8_1 paths
- ggml-vulkan.cpp: pipeline registration for dequant, get_rows, mul_mat_vec (f32/f16/id), mul_mat_mat, mul_mat_mat_id, supports_op
- test-backend-ops.cpp: Q1_0_g128 test cases for get_rows, mul_mat, mul_mat_id

Performance on AMD Radeon 680M (RDNA2 iGPU):
eval: 0.28 -> 23.4 t/s (84x), prompt: 0.31 -> 38.3 t/s (124x)
graph splits: 291 -> 2

soyelmismo · 2026-04-03T02:15:49Z

Fully working on an RX 470, thanks so much!

khosravipasha · 2026-04-03T05:14:58Z

Nice, thanks, this is very good. We had Vulkan and OpenCL backends too but did not have time to test and benchmark them properly, so we did not release them. I will try to put them in a branch for people to try. This looks great too.

Curious, which phone was this?

Also, the x86 CPU giving wrong output should be fixed now (still not optimized to be fast, but it's correct now). Please check PR #8.

claudlos · 2026-04-04T09:29:50Z

Thanks, khosravipasha, I'm happy to help.

This was done on a mini PC, not a phone: a Ryzen 9 6900HX with an AMD Radeon 680M (RDNA2 iGPU).

I edited the PR and removed the notes about x86 CPU dequantization.

harish2704 · 2026-04-04T12:42:42Z

It is working on my Vega 56 GPU. Thank you very much, @claudlos.

github-actions bot added ggml Vulkan testing labels Apr 3, 2026

claudlos wants to merge 1 commit into PrismML-Eng:prism from claudlos:prism
