Conversation
🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
This pull request updates the DeepSeek user guide to include instructions for the new DeepSeek-V3.2 model, specifically focusing on indexer training and checkpoint conversion. The updates are timely and provide clear steps for users to leverage the latest sparse attention features.
🔍 General Feedback
- **Consistency:** Ensure that the model names (`deepseek3.2-671b`) and tokenizer paths (`deepseek-ai/DeepSeek-V3.2`) are consistent across all stages of the guide.
- **Syntax:** Be careful with trailing backslashes in shell command examples, as they can cause errors if users copy-paste the last line.
- **Clarity:** Using concrete example values (like `0.1` for scaling factors) is generally more user-friendly than placeholders in curly braces.
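The trailing-backslash point can be illustrated directly: in POSIX shells, a backslash at the end of a line escapes the newline, joining the next line into the same command. A minimal, self-contained example:

```shell
# The backslash-newline below is removed by the shell, so this runs as a
# single command: `echo one two`.
echo one \
  two
# prints: one two
#
# If the *last* line of a copied snippet ends in "\", the shell keeps
# waiting for more input, and whatever comes next is joined into the
# command -- the copy-paste hazard noted above.
```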
Reviewed lines:

    * DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.
    * DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).

Suggested change (use standard big-O parentheses):

    * DeepSeek-V3.2 replaces vanilla attention (O(L^2) where L is number of tokens) with DeepSeek Sparse Attention (O(L * k) where k is some number of sparsely selected tokens).
Reviewed snippet:

```sh
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  run_name=matmul_pre_training \
  ici_fsdp_parallelism=128 \
  steps=5 \
  max_target_length=1024 \
  async_checkpointing=false \
  tokenizer_type=huggingface \
  tokenizer_path=deepseek-ai/DeepSeek-V3 \
  attention=flash \
  dtype=bfloat16 \
```

Suggested changes:
- After `run_name=matmul_pre_training \`, add `model_name=deepseek3.2-671b \`.
- Use `tokenizer_path=deepseek-ai/DeepSeek-V3.2 \` instead of the V3 path.
- After `dtype=bfloat16 \`, add `indexer_loss_scaling_factor=0.1 \`.
Reviewed snippet:

    python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
      base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
      run_name=matmul_pre_training \
      per_device_batch_size=4 \
      steps=5 \
      max_target_length=1024 \
      async_checkpointing=false \
      tokenizer_type=huggingface \
      dtype=bfloat16 \
      weight_dtype=bfloat16 \
      megablox=False \
      sparse_matmul=False \

Suggested changes:
- After `per_device_batch_size=4 \`, add `model_name=deepseek3.2-671b \`.
- After `tokenizer_type=huggingface \`, add `tokenizer_path=deepseek-ai/DeepSeek-V3.2 \`.
- After `sparse_matmul=False \`, add `indexer_loss_scaling_factor=0.1 \` and `trainable_parameters_mask=['.*indexer.*']`.
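Taken together, the suggestions on this command amount to something like the following sketch (flag names come from the PR diff; `0.1` and the indexer regex are the reviewers' suggested values, not verified defaults, and the last line deliberately has no trailing backslash):

```shell
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  run_name=matmul_pre_training \
  model_name=deepseek3.2-671b \
  per_device_batch_size=4 \
  steps=5 \
  max_target_length=1024 \
  async_checkpointing=false \
  tokenizer_type=huggingface \
  tokenizer_path=deepseek-ai/DeepSeek-V3.2 \
  dtype=bfloat16 \
  weight_dtype=bfloat16 \
  megablox=False \
  sparse_matmul=False \
  indexer_loss_scaling_factor=0.1 \
  trainable_parameters_mask=['.*indexer.*']
```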
On the DeepSeek Sparse Attention bullet:

Instead of "replaces vanilla attention", it would be better to say "improves MLA attention". The complexity is still O(L^2), but the indexer is added on top of the MLA attention that DeepSeek uses from V3 onwards.

+1. Let's mention something like the following, and please feel free to modify:

> DeepSeek-V3.2 introduces DeepSeek Sparse Attention (DSA), which reduces computational complexity while preserving model performance in long-context scenarios.

Let's remove the complexity claim to avoid any confusion, as the indexer also has O(L^2) selection. We could direct readers to the paper.

With a hyperlink to the paper: https://arxiv.org/pdf/2512.02556
On the heading `## Indexer training`:

Highlight that this is only for V3.2 Sparse Attention in the heading itself.
On the snippet:

    sparse_matmul=False \
    dataset_type=synthetic \
    indexer_sparse_training=False \
    indexer_loss_scaling_factor={some non-zero value} \

Replace the placeholder with the default value from base.yml, and add a comment saying it can be replaced with a non-zero value.

Or we could put a small value, like 0.01.
On the sparse-training snippet:

    sparse_matmul=False \
    dataset_type=synthetic \
    indexer_sparse_training=True \
    indexer_loss_scaling_factor={some non-zero value} \

Same as the comment above.
On the snippet:

    megablox=False \
    sparse_matmul=False \

We should probably have these set to True in the sparse training stage. These flags control which MoE strategy to use.
On the snippet:

    max_target_length=1024 \
    async_checkpointing=false \
    tokenizer_type=huggingface \
    tokenizer_path=deepseek-ai/DeepSeek-V3 \

Is there a difference between the V3 and V3.2 tokenizer paths on HF? If not, then this is fine.

No difference, but let's update to V3.2 to avoid confusion.

Should use `tokenizer_path=deepseek-ai/DeepSeek-V3.2`.
On the section:

    ## Indexer training
    DeepSeek-V3.2 introduces DeepSeek Sparse Attention. Training the lightning indexer to achieve sparsity is a two-stage process.

    1. **Dense Warmup Stage**

Can you include a note that in the dense warmup stage, all model weights are frozen except the indexer weights?
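The freeze described here is what the `trainable_parameters_mask` flag suggested elsewhere in this review expresses. A minimal sketch of the dense-warmup flags, assuming the flag names from this PR's diff (the `0.1` value is a reviewer suggestion, not a verified default):

```shell
# Dense warmup: only parameters whose names match the indexer regex are
# updated; all other model weights stay frozen.
indexer_sparse_training=False \
indexer_loss_scaling_factor=0.1 \
trainable_parameters_mask=['.*indexer.*']
```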
On the same `tokenizer_path=deepseek-ai/DeepSeek-V3 \` line in another command: let's use the V3.2 tokenizer path.
On the snippet:

    attention=flash \
    dtype=bfloat16 \
    weight_dtype=bfloat16 \
    megablox=False \

Let's use `sparse_matmul=True` and `megablox=True`.
On the section:

    ### 2. Dequantize Weights
    Convert the weights from FP8 to BF16 using the official DeepSeek script.

Could we also add a section on decoding for V3.2?
On the document structure: I would prefer that we organize all DeepSeek-V3.2 commands under one section for clarity, like:

    ## DeepSeek V3.2
    ### Checkpoint conversion
    ### Indexer training
    ### Decode
On the section:

    ### 1. Download Model Weights
    Download the Hugging Face weights from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) to your local environment.
    * **Target Directory:** `LOCAL_WEIGHTS`

It would be better to be specific. The model weights are quantized in FP8:

    hf download deepseek-ai/DeepSeek-V3.2 --local-dir <local_fp8_path>
On the section:

    ### 2. Dequantize Weights
    Convert the weights from FP8 to BF16 using the official DeepSeek script.
    * **Script:** [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py)
    * **Output Directory:** `DEQUANTIZED_LOCAL_WEIGHTS`

Convert the weights from FP8 to BF16 using the script [deepseek_fp8_to_bf16.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/deepseek_fp8_to_bf16.py) on CPU:

    python3 -m maxtext.checkpoint_conversion.standalone_scripts.deepseek_fp8_to_bf16 --input-fp8-hf-path=<local_fp8_path> --output-bf16-hf-path=<local_bf16_path>

Alternatively, we can use the official DeepSeek script [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) to convert on GPU.
On the heading `### 3. Convert to MaxText`: rename to "Convert to MaxText-compatible Orbax format".
On the snippet:

    --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
    --eager_load_method=safetensors \
    --save_dtype=bfloat16

Might be good to add: setting `scan_layers=true` generates the scanned Orbax format for training and fine-tuning; setting `scan_layers=false` generates the unscanned Orbax format for decoding.
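To make that comment concrete, the two output formats might be produced by two invocations differing only in that flag. This is a sketch only: `<conversion_script>` is a placeholder for the checkpoint-conversion entry point, which is elided in the snippet above, and the `scan_layers` spelling follows the comment.

```shell
# Hypothetical sketch: <conversion_script> stands in for the MaxText
# checkpoint-conversion entry point (not shown in the reviewed snippet).

# Scanned Orbax checkpoint, for training and fine-tuning:
python3 -m <conversion_script> \
  --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
  --eager_load_method=safetensors \
  --save_dtype=bfloat16 \
  scan_layers=true

# Unscanned Orbax checkpoint, for decoding:
python3 -m <conversion_script> \
  --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
  --eager_load_method=safetensors \
  --save_dtype=bfloat16 \
  scan_layers=false
```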
On the `## Indexer training` section, it might be good to elaborate on the training (ref.). Here are some suggestions:

1. Rename "Indexer training" to "Continued pre-training".
2. Intro: **DeepSeek Sparse Attention (DSA)** enhances the Multi-Head Latent Attention (MLA) architecture by introducing a **Lightning Indexer**, which selects the top-$k$ tokens for attention. DeepSeek-V3.2 is instantiated from DeepSeek-V3.1 and undergoes continued pre-training to adapt this indexer via a two-stage strategy: **Dense Warm-up** and **Sparse Training**.
3. Dense Warm-up Stage: the indexer is trained exclusively using the dense indexer loss while all other model parameters remain frozen.
4. Sparse Training Stage: the indexer is trained with the sparse indexer loss, while the remaining model parameters are unfrozen and updated using the standard language-modeling loss.
Updates the user guide for DeepSeek-V3.2: explains the new features and adds instructions for multi-stage Lightning Indexer training and checkpoint conversion.