From 5d66c423fed001d8cc63deeb9760a97f37405c04 Mon Sep 17 00:00:00 2001 From: Snehal Verma Date: Thu, 2 Apr 2026 21:43:56 +0000 Subject: [PATCH 1/2] DeepSeek V3.2 user guide update --- tests/end_to_end/tpu/deepseek/Run_DeepSeek.md | 83 ++++++++++++++++++- 1 file changed, 82 insertions(+), 1 deletion(-) diff --git a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md b/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md index 10651f7186..8795ec6c26 100644 --- a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md +++ b/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md @@ -20,7 +20,9 @@ DeepSeek is a novel family of open-weights sparse MoE models by DeepSeek AI. The * DeepSeek-V3 features advanced techniques, including Multi-Head Latent Attention (MLA), finer-grained and shared experts, Multi-Token Prediction (MTP), and FP8 mixed precision designed for enhanced efficiency and performance. -* DeepSeek V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency. +* DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency. + +* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens). * DeepSeek R1 also uses V3 architecture. It utilizes cold-start data and large-scale reinforcement learning to incentivize chain-of-thought reasoning without relying solely on supervised fine-tuning. @@ -54,12 +56,91 @@ python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ dataset_type=synthetic ``` +## Indexer training +DeepSeek-V3.2 introduces deepseek sparse attention. Training the lightning indexer to achieve sparsity is a 2 stage process. + +1. 
**Dense Warmup Stage** +```sh +python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + run_name=matmul_pre_training \ + per_device_batch_size=4 \ + enable_checkpointing=false \ + model_name=deepseek3-671b \ + ici_fsdp_parallelism=128 \ + steps=5 \ + max_target_length=1024 \ + async_checkpointing=false \ + tokenizer_type=huggingface \ + tokenizer_path=deepseek-ai/DeepSeek-V3 \ + attention=flash \ + dtype=bfloat16 \ + weight_dtype=bfloat16 \ + megablox=False \ + sparse_matmul=False \ + dataset_type=synthetic \ + indexer_sparse_training=False \ + indexer_loss_scaling_factor={some non-zero value} \ + trainable_parameters_mask=['.*indexer.*'] +``` +2. **Sparse Training Stage** +```sh +python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + run_name=matmul_pre_training \ + per_device_batch_size=4 \ + enable_checkpointing=false \ + model_name=deepseek3-671b \ + ici_fsdp_parallelism=128 \ + steps=5 \ + max_target_length=1024 \ + async_checkpointing=false \ + tokenizer_type=huggingface \ + tokenizer_path=deepseek-ai/DeepSeek-V3 \ + attention=flash \ + dtype=bfloat16 \ + weight_dtype=bfloat16 \ + megablox=False \ + sparse_matmul=False \ + dataset_type=synthetic \ + indexer_sparse_training=True \ + indexer_loss_scaling_factor={some non-zero value} \ +``` ## Checkpoint conversion To get started, follow the instructions at HuggingFace ([V3](https://huggingface.co/deepseek-ai/DeepSeek-V3), [V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)) to download the model. Currently for V3, V3.1, and R1, it uses mixed precision fp8 & bf16 weights. To convert all FP8 weights to BF16, use the script [here](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/utils/ckpt_scripts/deepseek_fp8_to_bf16.py). 
Once downloaded and converted to BF16: * run [convert_deepseek_family_ckpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/convert_deepseek_family_ckpt.py) to convert the checkpoint for MaxText compatibility in [Orbax](https://orbax.readthedocs.io/en/latest/guides/checkpoint/orbax_checkpoint_101.html) for training and fine-tuning. When converting a checkpoint with MTP layers (like DeepSeek-V3), be sure to add the `--enable_mtp` flag to process them correctly. * run [convert_deepseek_family_unscanned_ckpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/convert_deepseek_family_unscanned_ckpt.py) to convert the checkpoint to unscanned version in Orbax for decoding. +## Checkpoint conversion for V3.2 +> **Note:** These steps are required because Transformers code for V3.2 is not yet available. + +### 1. Download Model Weights +Download the Hugging Face weights from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) to your local environment. +* **Target Directory:** `LOCAL_WEIGHTS` + +### 2. Dequantize Weights +Convert the weights from FP8 to BF16 using the official DeepSeek script. +* **Script:** [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) +* **Output Directory:** `DEQUANTIZED_LOCAL_WEIGHTS` + +### 3. Convert to MaxText +Execute the following command to finalize the conversion. Ensure your environment variables (`$BASE_OUTPUT_PATH`, `$HF_TOKEN`, and `$DEQUANTIZED_LOCAL_WEIGHTS`) are exported before running. 
+
+```bash
+python3 -m maxtext.checkpoint_conversion.to_maxtext \
+    src/maxtext/configs/base.yml \
+    model_name=deepseek3.2-671b \
+    scan_layers=true \
+    attention=dot_product \
+    base_output_directory=$BASE_OUTPUT_PATH \
+    hf_access_token=$HF_TOKEN \
+    hardware=cpu \
+    skip_jax_distributed_system=True \
+    --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
+    --eager_load_method=safetensors \
+    --save_dtype=bfloat16
+```

## Fine-tuning

From d00c55e0121876d7a2dc2b1138bf730dcfafb480 Mon Sep 17 00:00:00 2001
From: Snehal Verma
Date: Wed, 8 Apr 2026 21:47:16 +0000
Subject: [PATCH 2/2] resolve 1st batch of comments on DeepSeek-V3.2 user guide

---
 tests/end_to_end/tpu/deepseek/Run_DeepSeek.md | 171 +++++++++---------
 1 file changed, 89 insertions(+), 82 deletions(-)

diff --git a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md b/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md
index 8795ec6c26..d60a8c2cdd 100644
--- a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md
+++ b/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md
@@ -22,7 +22,7 @@ DeepSeek is a novel family of open-weights sparse MoE models by DeepSeek AI. The

* DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.

-* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).
+* DeepSeek-V3.2 introduces [DeepSeek Sparse Attention](https://arxiv.org/pdf/2512.02556) (DSA), which reduces computational complexity while preserving model performance in long-context scenarios.

* DeepSeek R1 also uses V3 architecture. It utilizes cold-start data and large-scale reinforcement learning to incentivize chain-of-thought reasoning without relying solely on supervised fine-tuning.
@@ -56,92 +56,11 @@ python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ dataset_type=synthetic ``` -## Indexer training -DeepSeek-V3.2 introduces deepseek sparse attention. Training the lightning indexer to achieve sparsity is a 2 stage process. - -1. **Dense Warmup Stage** -```sh -python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ - base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ - run_name=matmul_pre_training \ - per_device_batch_size=4 \ - enable_checkpointing=false \ - model_name=deepseek3-671b \ - ici_fsdp_parallelism=128 \ - steps=5 \ - max_target_length=1024 \ - async_checkpointing=false \ - tokenizer_type=huggingface \ - tokenizer_path=deepseek-ai/DeepSeek-V3 \ - attention=flash \ - dtype=bfloat16 \ - weight_dtype=bfloat16 \ - megablox=False \ - sparse_matmul=False \ - dataset_type=synthetic \ - indexer_sparse_training=False \ - indexer_loss_scaling_factor={some non-zero value} \ - trainable_parameters_mask=['.*indexer.*'] -``` -2. **Sparse Training Stage** -```sh -python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ - base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ - run_name=matmul_pre_training \ - per_device_batch_size=4 \ - enable_checkpointing=false \ - model_name=deepseek3-671b \ - ici_fsdp_parallelism=128 \ - steps=5 \ - max_target_length=1024 \ - async_checkpointing=false \ - tokenizer_type=huggingface \ - tokenizer_path=deepseek-ai/DeepSeek-V3 \ - attention=flash \ - dtype=bfloat16 \ - weight_dtype=bfloat16 \ - megablox=False \ - sparse_matmul=False \ - dataset_type=synthetic \ - indexer_sparse_training=True \ - indexer_loss_scaling_factor={some non-zero value} \ -``` - ## Checkpoint conversion To get started, follow the instructions at HuggingFace ([V3](https://huggingface.co/deepseek-ai/DeepSeek-V3), [V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)) to download the model. Currently for V3, V3.1, and R1, it uses mixed precision fp8 & bf16 weights. 
To convert all FP8 weights to BF16, use the script [here](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/utils/ckpt_scripts/deepseek_fp8_to_bf16.py). Once downloaded and converted to BF16: * run [convert_deepseek_family_ckpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/convert_deepseek_family_ckpt.py) to convert the checkpoint for MaxText compatibility in [Orbax](https://orbax.readthedocs.io/en/latest/guides/checkpoint/orbax_checkpoint_101.html) for training and fine-tuning. When converting a checkpoint with MTP layers (like DeepSeek-V3), be sure to add the `--enable_mtp` flag to process them correctly. * run [convert_deepseek_family_unscanned_ckpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/convert_deepseek_family_unscanned_ckpt.py) to convert the checkpoint to unscanned version in Orbax for decoding. -## Checkpoint conversion for V3.2 -> **Note:** These steps are required because Transformers code for V3.2 is not yet available. - -### 1. Download Model Weights -Download the Hugging Face weights from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) to your local environment. -* **Target Directory:** `LOCAL_WEIGHTS` - -### 2. Dequantize Weights -Convert the weights from FP8 to BF16 using the official DeepSeek script. -* **Script:** [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) -* **Output Directory:** `DEQUANTIZED_LOCAL_WEIGHTS` - -### 3. Convert to MaxText -Execute the following command to finalize the conversion. Ensure your environment variables (`$BASE_OUTPUT_PATH`, `$HF_TOKEN`, and `$DEQUANTIZED_LOCAL_WEIGHTS`) are exported before running. 
- -```bash -python3 -m maxtext.checkpoint_conversion.to_maxtext \ - src/maxtext/configs/base.yml \ - model_name=deepseek3.2-671b \ - scan_layers=true \ - attention=dot_product \ - base_output_directory=$BASE_OUTPUT_PATH \ - hf_access_token=$HF_TOKEN \ - hardware=cpu \ - skip_jax_distributed_system=True \ - --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \ - --eager_load_method=safetensors \ - --save_dtype=bfloat16 -``` - ## Fine-tuning After you have a MaxText compatible checkpoint, you could fine-tune it with different datasets. @@ -297,3 +216,91 @@ To run MMLU benchmarks and validate the model's performance, follow the instruct * Dropping implementation with flag `sparse_matmul=False` and reasonable `capacity_factor`, commonly used from 1 to 1.25. See more examples in scripts for [V3](v3-671b/test_deepseek.sh) and [V2-Lite](v2-16b/test_deepseek.sh). + +## DeepSeek-V3.2 + +### Continued pre-training for V3.2 Sparse Attention +**DeepSeek Sparse Attention (DSA)** enhances the Multi-Head Latent Attention (MLA) architecture by introducing a **Lightning Indexer**, which selects the top-$k$ tokens for attention. DeepSeek-V3.2 is instantiated from DeepSeek-V3.1 and undergoes continued pre-training to adapt this indexer via a two-stage strategy: **Dense Warm-up** and **Sparse Training**. + +1. **Dense Warmup Stage** +The indexer is trained exclusively using dense indexer loss while all other model parameters remain frozen. 
+```sh
+# indexer_loss_scaling_factor must be non-zero to activate indexer training; the default in base.yml is 0.
+python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+    base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
+    model_name=deepseek3.2-671b \
+    run_name=matmul_pre_training \
+    per_device_batch_size=4 \
+    enable_checkpointing=false \
+    ici_fsdp_parallelism=128 \
+    steps=5 \
+    max_target_length=1024 \
+    async_checkpointing=false \
+    tokenizer_type=huggingface \
+    tokenizer_path=deepseek-ai/DeepSeek-V3.2 \
+    attention=flash \
+    dtype=bfloat16 \
+    weight_dtype=bfloat16 \
+    megablox=True \
+    sparse_matmul=True \
+    dataset_type=synthetic \
+    indexer_sparse_training=False \
+    indexer_loss_scaling_factor=0.01 \
+    trainable_parameters_mask=['.*indexer.*']
+```
+2. **Sparse Training Stage**
+The indexer is trained with the sparse indexer loss, while the remaining model parameters are unfrozen and updated using the standard language modeling loss.
+```sh
+# indexer_loss_scaling_factor must be non-zero to activate indexer training; the default in base.yml is 0.
+python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+    base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
+    model_name=deepseek3.2-671b \
+    per_device_batch_size=4 \
+    enable_checkpointing=false \
+    ici_fsdp_parallelism=128 \
+    steps=5 \
+    max_target_length=1024 \
+    async_checkpointing=false \
+    tokenizer_type=huggingface \
+    tokenizer_path=deepseek-ai/DeepSeek-V3.2 \
+    attention=flash \
+    dtype=bfloat16 \
+    weight_dtype=bfloat16 \
+    megablox=True \
+    sparse_matmul=True \
+    dataset_type=synthetic \
+    indexer_sparse_training=True \
+    indexer_loss_scaling_factor=0.01
+```
+
+### Checkpoint conversion for V3.2
+> **Note:** These steps are required because Transformers code for V3.2 is not yet available.
+
+#### 1. Download Model Weights
+Download the Hugging Face weights from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) to your local environment. Weights are provided in FP8.
+```sh
+hf download deepseek-ai/DeepSeek-V3.2 --local-dir $LOCAL_WEIGHTS
+```
+Here `$LOCAL_WEIGHTS` is whatever local directory you choose to download into.
+
+#### 2. Dequantize Weights
+Convert the weights from FP8 to BF16 on CPU using the script [deepseek_fp8_to_bf16.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/deepseek_fp8_to_bf16.py), writing the output to `$DEQUANTIZED_LOCAL_WEIGHTS`:
+
+```sh
+python3 -m maxtext.checkpoint_conversion.standalone_scripts.deepseek_fp8_to_bf16 \
+    --input-fp8-hf-path=$LOCAL_WEIGHTS \
+    --output-bf16-hf-path=$DEQUANTIZED_LOCAL_WEIGHTS
+```
+
+Alternatively, you can use the official DeepSeek script [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) to convert on GPU.
+
+#### 3. Convert to MaxText-compatible Orbax format
+Execute the following command to finalize the conversion. Ensure your environment variables (`$BASE_OUTPUT_PATH`, `$HF_TOKEN`, and `$DEQUANTIZED_LOCAL_WEIGHTS`) are exported before running.
+Setting `scan_layers=true` generates the scanned Orbax format for training and fine-tuning; setting `scan_layers=false` generates the unscanned format for decoding.
+```bash
+python3 -m maxtext.checkpoint_conversion.to_maxtext \
+    src/maxtext/configs/base.yml \
+    model_name=deepseek3.2-671b \
+    scan_layers=true \
+    attention=dot_product \
+    base_output_directory=$BASE_OUTPUT_PATH \
+    hf_access_token=$HF_TOKEN \
+    hardware=cpu \
+    skip_jax_distributed_system=True \
+    --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
+    --eager_load_method=safetensors \
+    --save_dtype=bfloat16
+```
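As background for reviewing this patch: the guide describes DeepSeek Sparse Attention as a lightning indexer that selects the top-k tokens, after which ordinary softmax attention runs over only those tokens. The sketch below illustrates that idea in plain Python with invented names (`sparse_attention`, `indexer_scores`); it is an illustration of the mechanism only, not MaxText's or DeepSeek's actual implementation.

```python
import math

def sparse_attention(q, keys, values, indexer_scores, k):
    """Attend from one query to only the top-k tokens chosen by an indexer.

    q, keys[i], and values[i] are equal-length lists of floats;
    indexer_scores[i] is the lightweight relevance score a lightning-indexer-style
    component assigns to token i. Per query, the cost is an O(L) scoring pass
    plus O(k) attention, rather than O(L) full attention over every token.
    """
    d = len(q)
    # Select the k token positions with the highest indexer scores.
    topk = sorted(range(len(keys)), key=lambda i: indexer_scores[i], reverse=True)[:k]
    # Standard scaled-dot-product softmax attention, restricted to those tokens.
    logits = [sum(qi * kij for qi, kij in zip(q, keys[i])) / math.sqrt(d) for i in topk]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, topk):
        for j, vij in enumerate(values[i]):
            out[j] += (w / z) * vij
    return out
```

Across L queries this gives roughly O(L * k) attention work plus the cheap indexer pass, versus O(L^2) for full attention, which is the complexity reduction the guide attributes to DSA.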