feat: port TQ3_0 KV cache from llama-turboquant#2
carlosfundora wants to merge 1 commit into PrismML-Eng:prism
Conversation
TurboQuant 3-bit (3.5 bpw) KV cache compression:

- Per-block WHT rotation with 4-centroid MSE codebook
- QJL residual signs for error correction
- GPU kernels: vec_dot, MMVQ, convert, set-rows, cpy
- CPU: quantize/dequantize with WHT butterfly transform
- Flash attention auto-disabled for TQ3_0 K cache

Combined with PrismML's Q1_0 GPU inference, this enables 1-bit weights + 3-bit KV cache on a single build.
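The per-block pipeline described above can be sketched roughly as follows. This is a loose NumPy illustration under stated assumptions, not the actual TQ3_0 kernels: the function names (`wht`, `quantize_block`, `dequantize_block`) are hypothetical, the 4-centroid codebook is fit with plain 1-D k-means on magnitudes, the sign bit stands in for the QJL residual-sign idea, and the per-block scale and bit-packing that bring storage to 3.5 bpw are omitted.

```python
import numpy as np

def wht(x):
    # Fast Walsh-Hadamard transform via the butterfly recurrence.
    # Block length must be a power of two; orthonormal scaling makes
    # the transform its own inverse.
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_block(x, iters=10):
    """Rotate a block with the WHT, fit a 4-centroid (2-bit) codebook to
    the magnitudes by Lloyd iterations (MSE), and keep each value's sign
    as a separate 1-bit residual -> 3 bits per element plus codebook."""
    r = wht(x)
    signs = np.sign(r)
    mag = np.abs(r)
    centroids = np.quantile(mag, [0.125, 0.375, 0.625, 0.875])
    for _ in range(iters):
        idx = np.argmin(np.abs(mag[:, None] - centroids[None, :]), axis=1)
        for k in range(4):
            if np.any(idx == k):
                centroids[k] = mag[idx == k].mean()
    idx = np.argmin(np.abs(mag[:, None] - centroids[None, :]), axis=1)
    return idx, signs, centroids

def dequantize_block(idx, signs, centroids):
    # Reconstruct in the rotated domain, then undo the rotation
    # (the orthonormal WHT is self-inverse).
    return wht(signs * centroids[idx])
```

The WHT rotation spreads outliers across the block before quantization, which is what lets a tiny 4-entry codebook plus a sign bit recover most of the signal.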
Thanks, this is pretty cool. How does it work? Is it good? Our main focus right now is getting our changes into llama.cpp, so we might not have time to look into the details yet, but I'd love to see the speed/quality output if you have tried it.
VRAM usage was reduced by roughly 35%.
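For intuition on where a figure like that can come from, here is a back-of-the-envelope sketch. The 45% KV-cache share of VRAM is an assumed number for illustration only, not something measured in this PR:

```python
# Illustrative arithmetic, not measured data: if the fp16 KV cache
# accounted for ~45% of VRAM (assumed), compressing it from 16 bpw
# to 3.5 bpw would cut overall usage by about 35%.
kv_share = 0.45                  # assumed fraction of VRAM held by the fp16 KV cache
bpw_before, bpw_after = 16.0, 3.5
overall_saving = kv_share * (1 - bpw_after / bpw_before)
print(f"{overall_saving:.0%}")   # → 35%
```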
Oh, how does it work with SGLang for 1-bit? Was it easy to add support there?
So-so; patience and preparation were key. I also crafted a few agents to run methodical research and debugging at a very detailed and slow pace during smoke tests. It ran well for chat in SGLang; I benchmarked it and then moved on to implementing P-EAGLE, so I've been training heads for the models all day. I haven't yet tried it on any coding tasks, but I'm excited to see how they do. If you guys would be kind enough to release a 1/2 B or 0.3-0.6 B range 1-bit quant, that would be amazing and make it much easier for me to rapidly work on creating PRs for advanced speculative decoding architectures. 🧠🤌
@carlosfundora Sounds exciting, yeah, good ideas. Let's chat more on the Discord server next week (I think you were there, right?)

TurboQuant 3-bit (3.5 bpw) KV cache compression combined with PrismML's Q1_0 GPU inference. Ported the TQ3_0 implementation from llama-turboquant, tested on ROCm gfx1030.