From 65416cda051d0579cb1622c1dc91a8e74e6d138d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kacper=20=C5=81ukawski?= Date: Mon, 30 Mar 2026 17:53:35 +0200 Subject: [PATCH 1/4] Add tutorial for KV cache compression with TurboQuant --- index.toml | 12 ++ ...oQuant_Quantization_with_HuggingFace.ipynb | 199 ++++++++++++++++++ 2 files changed, 211 insertions(+) create mode 100644 tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb diff --git a/index.toml b/index.toml index 2b92622..b8310be 100644 --- a/index.toml +++ b/index.toml @@ -249,3 +249,15 @@ completion_time = "20 min" created_at = 2025-12-30 dependencies = ["haystack-experimental>=0.16.0", "datasets"] featured = true + +[[tutorial]] +title = "Compress the KV Cache with TurboQuant and Haystack" +description = "Use TurboQuant KV cache compression to run large LLMs on consumer GPUs with significant memory reduction" +level = "advanced" +weight = 12 +notebook = "49_TurboQuant_Quantization_with_HuggingFace.ipynb" +aliases = [] +completion_time = "20 min" +created_at = 2026-03-30 +dependencies = ["haystack-ai", "turboquant-vllm", "transformers"] +featured = false diff --git a/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb b/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb new file mode 100644 index 0000000..292789a --- /dev/null +++ b/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb @@ -0,0 +1,199 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": "# Compress the KV Cache with TurboQuant and Haystack\n\n- **Level**: Advanced\n- **Time to complete**: 20 min\n- **Nodes Used**: [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator)\n- **Goal**: Apply TurboQuant KV cache compression to a local LLM and measure its memory and throughput impact with Haystack." 
+ }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Overview\n\nEvery time an LLM generates a token, it reads and writes a **key-value (KV) cache** - a growing table of intermediate activations that lets the model attend to previous tokens without recomputing them. On long contexts or large models, this cache becomes the dominant consumer of GPU memory.\n\n[TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) is a KV cache compression algorithm from Google Research (ICLR 2026) that shrinks those vectors to 3–4 bits per coordinate without any retraining. It works in two stages:\n\n1. **PolarQuant** - a random orthogonal rotation maps cache vectors to a more uniform distribution, then quantizes them in polar coordinates using Lloyd-Max optimal centroids.\n2. **QJL** (Quantized Johnson-Lindenstrauss) - a single extra bit per vector corrects residual errors in attention score computation, preserving accuracy at extreme compression ratios.\n\nThe result: KV memory can drop from 1,639 MiB to 435 MiB (3.76x) on an RTX 4090, with ≥6x reduction validated on server hardware, and near-identical output quality.\n\nIn this tutorial you will wire TurboQuant's `CompressedDynamicCache` into Haystack's [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator), run a generation, and measure time-to-first-token, throughput, and live VRAM usage." + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Installing Haystack and TurboQuant\n\nFirst, let's install `haystack-ai` and `turboquant-vllm`, which provides the `CompressedDynamicCache` wrapper." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "pip install -q haystack-ai turboquant-vllm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Setting Up a Streaming Callback\n\nTo measure **time-to-first-token (TTFT)** and throughput, we pass a streaming callback that timestamps each arriving token. The first call marks TTFT; the last marks the end of generation." + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "\n", + "first_token_time = None\n", + "last_token_time = None\n", + "\n", + "def timing_callback(chunk):\n", + " global first_token_time, last_token_time\n", + " now = time.perf_counter()\n", + " if first_token_time is None:\n", + " first_token_time = now\n", + " last_token_time = now" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Compressing the KV Cache\n", + "\n", + "Next, let's create the compressed cache. We start with HuggingFace's standard `DynamicCache` and wrap it with `CompressedDynamicCache`, which intercepts cache writes and applies TurboQuant compression in place.\n", + "\n", + "Two parameters control the compression:\n", + "- `head_dim` - the dimensionality of each attention head's key/value vectors\n", + "- `bits` - the target bit-width per coordinate\n", + "\n", + "> **Note**: Pass the original `cache` object to the generator - not `compressed`. `CompressedDynamicCache` modifies `cache` internally, so both variables point to the same compressed state." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import DynamicCache\n",
+ "from turboquant_vllm import CompressedDynamicCache\n",
+ "\n",
+ "# CompressedDynamicCache modifies the wrapped DynamicCache in place,\n",
+ "# so we later pass the original `cache` instance to the generator,\n",
+ "# not `compressed` directly.\n",
+ "cache = DynamicCache()\n",
+ "compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Initializing the Generator\n\nNow let's set up [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator) with a selected model, like `Qwen/Qwen3-4B-Thinking-2507`. We pass the compressed `cache` via `generation_kwargs` so that every decoding step writes through TurboQuant."
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from haystack.components.generators.chat import HuggingFaceLocalChatGenerator\n",
+ "from haystack.utils import Secret\n",
+ "\n",
+ "generator = HuggingFaceLocalChatGenerator(\n",
+ " model=\"Qwen/Qwen3-4B-Thinking-2507\",\n",
+ " task=\"text-generation\",\n",
+ " token=Secret.from_env_var(\"HF_TOKEN\"),\n",
+ " generation_kwargs={\n",
+ " \"past_key_values\": cache,\n",
+ " \"use_cache\": True,\n",
+ " },\n",
+ " streaming_callback=timing_callback,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Running the Generator\n",
+ "\n",
+ "Let's run a generation and record the total wall time."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from haystack.dataclasses import ChatMessage\n", + "\n", + "start = time.perf_counter()\n", + "output = generator.run(messages=[\n", + " ChatMessage.from_user(\"What is the capital of France?\"),\n", + "])\n", + "total_time = time.perf_counter() - start" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "reply = output[\"replies\"][0]\n", + "print(reply.text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Reading the Metrics\n\nThree metrics to check:\n\n- **TTFT** (time-to-first-token) - latency to the first output token; a proxy for perceived responsiveness.\n- **Throughput** (tok/s) - tokens decoded per second. TurboQuant's memory savings reduce cache read pressure, which can improve this on memory-bandwidth-bound hardware.\n- **Total time** - end-to-end wall time including model loading overhead." + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tokens = reply.meta[\"usage\"][\"completion_tokens\"]\n", + "if first_token_time is not None and last_token_time is not None:\n", + " generation_time = last_token_time - first_token_time\n", + " print(f\"TTFT: {first_token_time - start:.3f}s\")\n", + " print(f\"Tokens: {tokens}\")\n", + " print(f\"Speed: {tokens / generation_time:.1f} tok/s\")\n", + "print(f\"Total time: {total_time:.3f}s\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Checking VRAM Usage\n\n`vram_bytes()` returns the byte footprint of all compressed KV tensors. Compare it against an uncompressed `DynamicCache` to verify the reduction reported in the TurboQuant paper." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "compressed.vram_bytes()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "🎉 Congratulations! You've successfully run a local LLM with TurboQuant KV cache compression through Haystack and measured its real-world memory and throughput impact." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file From 0b86083b7d9dd3fa7bbc4260c096d7325fa5bb42 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kacper=20=C5=81ukawski?= Date: Mon, 30 Mar 2026 17:57:25 +0200 Subject: [PATCH 2/4] Make it clear that we use unofficial turboquant implementation --- ...oQuant_Quantization_with_HuggingFace.ipynb | 60 ++++++++++++++++--- 1 file changed, 52 insertions(+), 8 deletions(-) diff --git a/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb b/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb index 292789a..7000487 100644 --- a/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb +++ b/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb @@ -3,17 +3,41 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Compress the KV Cache with TurboQuant and Haystack\n\n- **Level**: Advanced\n- **Time to complete**: 20 min\n- **Nodes Used**: [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator)\n- **Goal**: Apply TurboQuant KV cache compression to a local LLM and measure its memory and throughput impact with Haystack." 
+ "source": [ + "# Compress the KV Cache with TurboQuant and Haystack\n", + "\n", + "- **Level**: Advanced\n", + "- **Time to complete**: 20 min\n", + "- **Nodes Used**: [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator)\n", + "- **Goal**: Apply TurboQuant KV cache compression to a local LLM and measure its memory and throughput impact with Haystack." + ] }, { "cell_type": "markdown", "metadata": {}, - "source": "## Overview\n\nEvery time an LLM generates a token, it reads and writes a **key-value (KV) cache** - a growing table of intermediate activations that lets the model attend to previous tokens without recomputing them. On long contexts or large models, this cache becomes the dominant consumer of GPU memory.\n\n[TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) is a KV cache compression algorithm from Google Research (ICLR 2026) that shrinks those vectors to 3–4 bits per coordinate without any retraining. It works in two stages:\n\n1. **PolarQuant** - a random orthogonal rotation maps cache vectors to a more uniform distribution, then quantizes them in polar coordinates using Lloyd-Max optimal centroids.\n2. **QJL** (Quantized Johnson-Lindenstrauss) - a single extra bit per vector corrects residual errors in attention score computation, preserving accuracy at extreme compression ratios.\n\nThe result: KV memory can drop from 1,639 MiB to 435 MiB (3.76x) on an RTX 4090, with ≥6x reduction validated on server hardware, and near-identical output quality.\n\nIn this tutorial you will wire TurboQuant's `CompressedDynamicCache` into Haystack's [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator), run a generation, and measure time-to-first-token, throughput, and live VRAM usage." 
+ "source": [ + "## Overview\n", + "\n", + "Every time an LLM generates a token, it reads and writes a **key-value (KV) cache** - a growing table of intermediate activations that lets the model attend to previous tokens without recomputing them. On long contexts or large models, this cache becomes the dominant consumer of GPU memory.\n", + "\n", + "[TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) is a KV cache compression algorithm from Google Research (ICLR 2026) that shrinks those vectors to 3–4 bits per coordinate without any retraining. It works in two stages:\n", + "\n", + "1. **PolarQuant** - a random orthogonal rotation maps cache vectors to a more uniform distribution, then quantizes them in polar coordinates using Lloyd-Max optimal centroids.\n", + "2. **QJL** (Quantized Johnson-Lindenstrauss) - a single extra bit per vector corrects residual errors in attention score computation, preserving accuracy at extreme compression ratios.\n", + "\n", + "The result: KV memory can drop from 1,639 MiB to 435 MiB (3.76x) on an RTX 4090, with ≥6x reduction validated on server hardware, and near-identical output quality.\n", + "\n", + "In this tutorial you will use [`turboquant-vllm`](https://github.com/Alberto-Codes/turboquant-vllm), a community implementation of the TurboQuant algorithm, to wire `CompressedDynamicCache` into Haystack's [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator), run a generation, and measure time-to-first-token, throughput, and live VRAM usage." + ] }, { "cell_type": "markdown", "metadata": {}, - "source": "## Installing Haystack and TurboQuant\n\nFirst, let's install `haystack-ai` and `turboquant-vllm`, which provides the `CompressedDynamicCache` wrapper." 
+ "source": [ + "## Installing Haystack and TurboQuant\n", + "\n", + "First, let's install `haystack-ai` and [`turboquant-vllm`](https://github.com/Alberto-Codes/turboquant-vllm), a community implementation of the TurboQuant algorithm that provides the `CompressedDynamicCache` wrapper." + ] }, { "cell_type": "code", @@ -29,7 +53,11 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Setting Up a Streaming Callback\n\nTo measure **time-to-first-token (TTFT)** and throughput, we pass a streaming callback that timestamps each arriving token. The first call marks TTFT; the last marks the end of generation." + "source": [ + "## Setting Up a Streaming Callback\n", + "\n", + "To measure **time-to-first-token (TTFT)** and throughput, we pass a streaming callback that timestamps each arriving token. The first call marks TTFT, while the last marks the end of generation." + ] }, { "cell_type": "code", @@ -84,7 +112,11 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Initializing the Generator\n\nNow let's set up [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator) with a selected model, like `Qwen/Qwen3-4B-Thinking-2507`. We pass the compressed `cache` via `generation_kwargs` so that every decoding step writes through TurboQuant." + "source": [ + "## Initializing the Generator\n", + "\n", + "Now let's set up [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator) with a selected model, like `Qwen/Qwen3-4B-Thinking-2507`. We pass the compressed `cache` via `generation_kwargs` so that every decoding step writes through TurboQuant." + ] }, { "cell_type": "code", @@ -144,7 +176,15 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Reading the Metrics\n\nThree metrics to check:\n\n- **TTFT** (time-to-first-token) - latency to the first output token; a proxy for perceived responsiveness.\n- **Throughput** (tok/s) - tokens decoded per second. 
TurboQuant's memory savings reduce cache read pressure, which can improve this on memory-bandwidth-bound hardware.\n- **Total time** - end-to-end wall time including model loading overhead." + "source": [ + "## Reading the Metrics\n", + "\n", + "Three metrics to check:\n", + "\n", + "- **TTFT** (time-to-first-token) - latency to the first output token - a proxy for perceived responsiveness.\n", + "- **Throughput** (tok/s) - tokens decoded per second. TurboQuant's memory savings reduce cache read pressure, which can improve this on memory-bandwidth-bound hardware.\n", + "- **Total time** - end-to-end wall time including model loading overhead." + ] }, { "cell_type": "code", @@ -164,7 +204,11 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Checking VRAM Usage\n\n`vram_bytes()` returns the byte footprint of all compressed KV tensors. Compare it against an uncompressed `DynamicCache` to verify the reduction reported in the TurboQuant paper." + "source": [ + "## Checking VRAM Usage\n", + "\n", + "`vram_bytes()` returns the byte footprint of all compressed KV tensors. Compare it against an uncompressed `DynamicCache` to verify the reduction reported in the TurboQuant paper." 
+ ]
 },
 {
 "cell_type": "code",
@@ -196,4 +240,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 4
-}
\ No newline at end of file
+}

From abc01d4105e7559a26d34794af28c35adc973911 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kacper=20=C5=81ukawski?=
Date: Tue, 31 Mar 2026 11:57:03 +0200
Subject: [PATCH 3/4] Specify Python version in tutorial configuration

---
 index.toml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/index.toml b/index.toml
index b8310be..3555649 100644
--- a/index.toml
+++ b/index.toml
@@ -261,3 +261,4 @@ completion_time = "20 min"
 created_at = 2026-03-30
 dependencies = ["haystack-ai", "turboquant-vllm", "transformers"]
 featured = false
+python_version = "3.12"
\ No newline at end of file

From 47a33b0b448b2df94d6719266ed57ffa7584bc97 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kacper=20=C5=82ukawski?=
Date: Tue, 31 Mar 2026 12:00:08 +0200
Subject: [PATCH 4/4] Remove HF_TOKEN ref

---
 tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb b/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb
index 7000487..8a286cd 100644
--- a/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb
+++ b/tutorials/49_TurboQuant_Quantization_with_HuggingFace.ipynb
@@ -130,7 +130,6 @@
 "generator = HuggingFaceLocalChatGenerator(\n",
 " model=\"Qwen/Qwen3-4B-Thinking-2507\",\n",
 " task=\"text-generation\",\n",
-" token=Secret.from_env_var(\"HF_TOKEN\"),\n",
 " generation_kwargs={\n",
 " \"past_key_values\": cache,\n",
 " \"use_cache\": True,\n",
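
The compression figures the tutorial quotes (1,639 MiB down to 435 MiB, i.e. 3.76x at 4 bits) can be sanity-checked with a back-of-envelope KV-cache size estimate. The sketch below assumes standard transformer cache shapes; the layer, head, and sequence-length values are illustrative and are not taken from `Qwen/Qwen3-4B-Thinking-2507`. The ideal fp16-to-4-bit ratio is exactly 4x; a measured ratio slightly below that is expected, since a real quantized cache also stores metadata such as per-block scales and the QJL correction bit.

```python
# Back-of-envelope KV-cache size for a decoder-only transformer.
# Shape values are illustrative, not tied to any specific model.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_coord: float) -> float:
    """Bytes needed to store keys and values for `seq_len` cached tokens."""
    # Factor of 2 covers both the key and the value tensors.
    coords = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return coords * bits_per_coord / 8

# Hypothetical 4B-class model: 36 layers, 8 KV heads, head_dim 128, 8k context.
fp16 = kv_cache_bytes(36, 8, 128, seq_len=8192, bits_per_coord=16)
q4 = kv_cache_bytes(36, 8, 128, seq_len=8192, bits_per_coord=4)

print(f"fp16 cache:  {fp16 / 2**20:.0f} MiB")  # → fp16 cache:  1152 MiB
print(f"4-bit cache: {q4 / 2**20:.0f} MiB")    # → 4-bit cache: 288 MiB
print(f"ratio:       {fp16 / q4:.1f}x")        # → ratio:       4.0x
```

Comparing this ideal estimate against `compressed.vram_bytes()` shows how much of the budget goes to quantization metadata rather than the coordinates themselves.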