feat: port TQ3_0 KV cache from llama-turboquant#2
carlosfundora wants to merge 1 commit into PrismML-Eng:prism
Conversation
TurboQuant 3-bit (3.5 bpw) KV cache compression:

- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables 1-bit weights + 3-bit KV cache on a single build.
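The per-block pipeline described above can be sketched roughly as follows. This is a loose NumPy illustration under stated assumptions, not the actual TQ3_0 kernels: the function names (`wht`, `quantize_block`, `dequantize_block`) are hypothetical, the 4-centroid codebook is fit with plain 1-D k-means on magnitudes, the sign bit stands in for the QJL residual-sign idea, and the per-block scale and bit-packing that bring storage to 3.5 bpw are omitted.

```python
import numpy as np

def wht(x):
    # Fast Walsh-Hadamard transform via the butterfly recurrence.
    # Block length must be a power of two; orthonormal scaling makes
    # the transform its own inverse.
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_block(x, iters=10):
    """Rotate a block with the WHT, fit a 4-centroid (2-bit) codebook to
    the magnitudes by Lloyd iterations (MSE), and keep each value's sign
    as a separate 1-bit residual -> 3 bits per element plus codebook."""
    r = wht(x)
    signs = np.sign(r)
    mag = np.abs(r)
    centroids = np.quantile(mag, [0.125, 0.375, 0.625, 0.875])
    for _ in range(iters):
        idx = np.argmin(np.abs(mag[:, None] - centroids[None, :]), axis=1)
        for k in range(4):
            if np.any(idx == k):
                centroids[k] = mag[idx == k].mean()
    idx = np.argmin(np.abs(mag[:, None] - centroids[None, :]), axis=1)
    return idx, signs, centroids

def dequantize_block(idx, signs, centroids):
    # Reconstruct in the rotated domain, then undo the rotation
    # (the orthonormal WHT is self-inverse).
    return wht(signs * centroids[idx])
```

The WHT rotation spreads outliers across the block before quantization, which is what lets a tiny 4-entry codebook plus a sign bit recover most of the signal.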
Thanks, this is pretty cool. How does it work? Is it good? Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but I'd love to see the speed/quality output if you have tried it.
VRAM usage was reduced by roughly 35%.
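For intuition on where a figure like that can come from, here is a back-of-the-envelope sketch. The 45% KV-cache share of VRAM is an assumed number for illustration only, not something measured in this PR:

```python
# Illustrative arithmetic, not measured data: if the fp16 KV cache
# accounted for ~45% of VRAM (assumed), compressing it from 16 bpw
# to 3.5 bpw would cut overall usage by about 35%.
kv_share = 0.45                  # assumed fraction of VRAM held by the fp16 KV cache
bpw_before, bpw_after = 16.0, 3.5
overall_saving = kv_share * (1 - bpw_after / bpw_before)
print(f"{overall_saving:.0%}")   # → 35%
```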
Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?
So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed and slow pace during smoke tests. It ran well for chat in SGLang; I benchmarked it and then moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't yet tried it on any coding tasks, but I'm excited to see how they do. If you guys would be kind enough to release a 1/2 B or 0.3-0.6 B range 1-bit quant, that would be amazing and make it much easier for me to rapidly work on creating PRs for advanced speculative decoding architectures. 🧠🤌
@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)

TurboQuant 3-bit (3.5 bpw) KV cache compression combined with PrismML's Q1_0 GPU inference. Ported the TQ3_0 implementation from llama-turboquant, tested on ROCm gfx1030.