Conversation
🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
This pull request updates the DeepSeek user guide to include instructions for the new DeepSeek-V3.2 model, specifically focusing on indexer training and checkpoint conversion. The updates are timely and provide clear steps for users to leverage the latest sparse attention features.
🔍 General Feedback
- **Consistency:** Ensure that the model names (`deepseek3.2-671b`) and tokenizer paths (`deepseek-ai/DeepSeek-V3.2`) are consistent across all stages of the guide.
- **Syntax:** Be careful with trailing backslashes in shell command examples, as they can cause errors if users copy-paste the last line.
- **Clarity:** Using concrete example values (like `0.1` for scaling factors) is generally more user-friendly than placeholders in curly braces.
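The trailing-backslash point can be illustrated directly: in POSIX shells, a backslash at the end of a line escapes the newline, joining the next line into the same command. A minimal, self-contained example:

```shell
# The backslash-newline below is removed by the shell, so this runs as a
# single command: `echo one two`.
echo one \
  two
# prints: one two
#
# If the *last* line of a copied snippet ends in "\", the shell keeps
# waiting for more input, and whatever comes next is joined into the
# command -- the copy-paste hazard noted above.
```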
Reviewed lines:

    * DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.
    * DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).

Suggested change (use standard big-O parentheses):

    * DeepSeek-V3.2 replaces vanilla attention (O(L^2) where L is number of tokens) with DeepSeek Sparse Attention (O(L * k) where k is some number of sparsely selected tokens).
Reviewed snippet:

```sh
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  run_name=matmul_pre_training \
  ici_fsdp_parallelism=128 \
  steps=5 \
  max_target_length=1024 \
  async_checkpointing=false \
  tokenizer_type=huggingface \
  tokenizer_path=deepseek-ai/DeepSeek-V3 \
  attention=flash \
  dtype=bfloat16 \
```

Suggested changes:
- After `run_name=matmul_pre_training \`, add `model_name=deepseek3.2-671b \`.
- Use `tokenizer_path=deepseek-ai/DeepSeek-V3.2 \` instead of the V3 path.
- After `dtype=bfloat16 \`, add `indexer_loss_scaling_factor=0.1 \`.
Reviewed snippet:

    python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
      base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
      run_name=matmul_pre_training \
      per_device_batch_size=4 \
      steps=5 \
      max_target_length=1024 \
      async_checkpointing=false \
      tokenizer_type=huggingface \
      dtype=bfloat16 \
      weight_dtype=bfloat16 \
      megablox=False \
      sparse_matmul=False \

Suggested changes:
- After `per_device_batch_size=4 \`, add `model_name=deepseek3.2-671b \`.
- After `tokenizer_type=huggingface \`, add `tokenizer_path=deepseek-ai/DeepSeek-V3.2 \`.
- After `sparse_matmul=False \`, add `indexer_loss_scaling_factor=0.1 \` and `trainable_parameters_mask=['.*indexer.*']`.
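Taken together, the suggestions on this command amount to something like the following sketch (flag names come from the PR diff; `0.1` and the indexer regex are the reviewers' suggested values, not verified defaults, and the last line deliberately has no trailing backslash):

```shell
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  run_name=matmul_pre_training \
  model_name=deepseek3.2-671b \
  per_device_batch_size=4 \
  steps=5 \
  max_target_length=1024 \
  async_checkpointing=false \
  tokenizer_type=huggingface \
  tokenizer_path=deepseek-ai/DeepSeek-V3.2 \
  dtype=bfloat16 \
  weight_dtype=bfloat16 \
  megablox=False \
  sparse_matmul=False \
  indexer_loss_scaling_factor=0.1 \
  trainable_parameters_mask=['.*indexer.*']
```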
On the DeepSeek Sparse Attention bullet:

Instead of "replaces vanilla attention", it would be better to say "improves MLA attention". The complexity is still O(L^2), but the indexer is added on top of the MLA attention that DeepSeek uses from V3 onwards.

+1. Let's mention something like the following, and please feel free to modify:

> DeepSeek-V3.2 introduces DeepSeek Sparse Attention (DSA), which reduces computational complexity while preserving model performance in long-context scenarios.

Let's remove the complexity claim to avoid any confusion, as the indexer also has O(L^2) selection. We could direct readers to the paper.

With a hyperlink to the paper: https://arxiv.org/pdf/2512.02556
On the heading `## Indexer training`:

Highlight that this is only for V3.2 Sparse Attention in the heading itself.
On the snippet:

    sparse_matmul=False \
    dataset_type=synthetic \
    indexer_sparse_training=False \
    indexer_loss_scaling_factor={some non-zero value} \

Replace the placeholder with the default value from base.yml, and add a comment saying it can be replaced with a non-zero value.

Or we could put a small value, like 0.01.
On the sparse-training snippet:

    sparse_matmul=False \
    dataset_type=synthetic \
    indexer_sparse_training=True \
    indexer_loss_scaling_factor={some non-zero value} \

Same as the comment above.
On the snippet:

    megablox=False \
    sparse_matmul=False \

We should probably have these set to True in the sparse training stage. These flags control which MoE strategy to use.
On the snippet:

    max_target_length=1024 \
    async_checkpointing=false \
    tokenizer_type=huggingface \
    tokenizer_path=deepseek-ai/DeepSeek-V3 \

Is there a difference between the V3 and V3.2 tokenizer paths on HF? If not, then this is fine.

No difference, but let's update to V3.2 to avoid confusion.

Should use `tokenizer_path=deepseek-ai/DeepSeek-V3.2`.
On the section:

    ## Indexer training
    DeepSeek-V3.2 introduces DeepSeek Sparse Attention. Training the lightning indexer to achieve sparsity is a two-stage process.

    1. **Dense Warmup Stage**

Can you include a note that in the dense warmup stage, all model weights are frozen except the indexer weights?
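The freeze described here is what the `trainable_parameters_mask` flag suggested elsewhere in this review expresses. A minimal sketch of the dense-warmup flags, assuming the flag names from this PR's diff (the `0.1` value is a reviewer suggestion, not a verified default):

```shell
# Dense warmup: only parameters whose names match the indexer regex are
# updated; all other model weights stay frozen.
indexer_sparse_training=False \
indexer_loss_scaling_factor=0.1 \
trainable_parameters_mask=['.*indexer.*']
```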
On the same `tokenizer_path=deepseek-ai/DeepSeek-V3 \` line in another command: let's use the V3.2 tokenizer path.
On the snippet:

    attention=flash \
    dtype=bfloat16 \
    weight_dtype=bfloat16 \
    megablox=False \

Let's use `sparse_matmul=True` and `megablox=True`.
On the section:

    ### 2. Dequantize Weights
    Convert the weights from FP8 to BF16 using the official DeepSeek script.

Could we also add a section on decoding for V3.2?
On the document structure: I would prefer that we organize all DeepSeek-V3.2 commands under one section for clarity, like:

    ## DeepSeek V3.2
    ### Checkpoint conversion
    ### Indexer training
    ### Decode
On the section:

    ### 1. Download Model Weights
    Download the Hugging Face weights from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) to your local environment.
    * **Target Directory:** `LOCAL_WEIGHTS`

It would be better to be specific. The model weights are quantized in FP8:

    hf download deepseek-ai/DeepSeek-V3.2 --local-dir <local_fp8_path>
On the section:

    ### 2. Dequantize Weights
    Convert the weights from FP8 to BF16 using the official DeepSeek script.
    * **Script:** [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py)
    * **Output Directory:** `DEQUANTIZED_LOCAL_WEIGHTS`

Convert the weights from FP8 to BF16 using the script [deepseek_fp8_to_bf16.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/deepseek_fp8_to_bf16.py) on CPU:

    python3 -m maxtext.checkpoint_conversion.standalone_scripts.deepseek_fp8_to_bf16 --input-fp8-hf-path=<local_fp8_path> --output-bf16-hf-path=<local_bf16_path>

Alternatively, we can use the official DeepSeek script [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) to convert on GPU.
On the heading `### 3. Convert to MaxText`: rename to "Convert to MaxText-compatible Orbax format".
On the snippet:

    --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
    --eager_load_method=safetensors \
    --save_dtype=bfloat16

Might be good to add: setting `scan_layers=true` generates the scanned Orbax format for training and fine-tuning; setting `scan_layers=false` generates the unscanned Orbax format for decoding.
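To make that comment concrete, the two output formats might be produced by two invocations differing only in that flag. This is a sketch only: `<conversion_script>` is a placeholder for the checkpoint-conversion entry point, which is elided in the snippet above, and the `scan_layers` spelling follows the comment.

```shell
# Hypothetical sketch: <conversion_script> stands in for the MaxText
# checkpoint-conversion entry point (not shown in the reviewed snippet).

# Scanned Orbax checkpoint, for training and fine-tuning:
python3 -m <conversion_script> \
  --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
  --eager_load_method=safetensors \
  --save_dtype=bfloat16 \
  scan_layers=true

# Unscanned Orbax checkpoint, for decoding:
python3 -m <conversion_script> \
  --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
  --eager_load_method=safetensors \
  --save_dtype=bfloat16 \
  scan_layers=false
```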
On the `## Indexer training` section, it might be good to elaborate on the training (ref.). Here are some suggestions:

1. Rename "Indexer training" to "Continued pre-training".
2. Intro: **DeepSeek Sparse Attention (DSA)** enhances the Multi-Head Latent Attention (MLA) architecture by introducing a **Lightning Indexer**, which selects the top-$k$ tokens for attention. DeepSeek-V3.2 is instantiated from DeepSeek-V3.1 and undergoes continued pre-training to adapt this indexer via a two-stage strategy: **Dense Warm-up** and **Sparse Training**.
3. Dense Warm-up Stage: the indexer is trained exclusively using the dense indexer loss while all other model parameters remain frozen.
4. Sparse Training Stage: the indexer is trained with the sparse indexer loss, while the remaining model parameters are unfrozen and updated using the standard language-modeling loss.
Updates the user guide for DeepSeek-V3.2: explains the new features and adds instructions for multi-stage Lightning Indexer training and checkpoint conversion.