From 5d66c423fed001d8cc63deeb9760a97f37405c04 Mon Sep 17 00:00:00 2001 From: Snehal Verma Date: Thu, 2 Apr 2026 21:43:56 +0000 Subject: [PATCH 1/2] DeepSeek V3.2 user guide update --- tests/end_to_end/tpu/deepseek/Run_DeepSeek.md | 83 ++++++++++++++++++- 1 file changed, 82 insertions(+), 1 deletion(-) diff --git a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md b/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md index 10651f7186..8795ec6c26 100644 --- a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md +++ b/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md @@ -20,7 +20,9 @@ DeepSeek is a novel family of open-weights sparse MoE models by DeepSeek AI. The * DeepSeek-V3 features advanced techniques, including Multi-Head Latent Attention (MLA), finer-grained and shared experts, Multi-Token Prediction (MTP), and FP8 mixed precision designed for enhanced efficiency and performance. -* DeepSeek V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency. +* DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency. + +* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens). * DeepSeek R1 also uses V3 architecture. It utilizes cold-start data and large-scale reinforcement learning to incentivize chain-of-thought reasoning without relying solely on supervised fine-tuning. @@ -54,12 +56,91 @@ python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ dataset_type=synthetic ``` +## Indexer training +DeepSeek-V3.2 introduces deepseek sparse attention. Training the lightning indexer to achieve sparsity is a 2 stage process. + +1. 
**Dense Warmup Stage** +```sh +python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + run_name=matmul_pre_training \ + per_device_batch_size=4 \ + enable_checkpointing=false \ + model_name=deepseek3-671b \ + ici_fsdp_parallelism=128 \ + steps=5 \ + max_target_length=1024 \ + async_checkpointing=false \ + tokenizer_type=huggingface \ + tokenizer_path=deepseek-ai/DeepSeek-V3 \ + attention=flash \ + dtype=bfloat16 \ + weight_dtype=bfloat16 \ + megablox=False \ + sparse_matmul=False \ + dataset_type=synthetic \ + indexer_sparse_training=False \ + indexer_loss_scaling_factor={some non-zero value} \ + trainable_parameters_mask=['.*indexer.*'] +``` +2. **Sparse Training Stage** +```sh +python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + run_name=matmul_pre_training \ + per_device_batch_size=4 \ + enable_checkpointing=false \ + model_name=deepseek3-671b \ + ici_fsdp_parallelism=128 \ + steps=5 \ + max_target_length=1024 \ + async_checkpointing=false \ + tokenizer_type=huggingface \ + tokenizer_path=deepseek-ai/DeepSeek-V3 \ + attention=flash \ + dtype=bfloat16 \ + weight_dtype=bfloat16 \ + megablox=False \ + sparse_matmul=False \ + dataset_type=synthetic \ + indexer_sparse_training=True \ + indexer_loss_scaling_factor={some non-zero value} \ +``` ## Checkpoint conversion To get started, follow the instructions at HuggingFace ([V3](https://huggingface.co/deepseek-ai/DeepSeek-V3), [V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)) to download the model. Currently for V3, V3.1, and R1, it uses mixed precision fp8 & bf16 weights. To convert all FP8 weights to BF16, use the script [here](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/utils/ckpt_scripts/deepseek_fp8_to_bf16.py). 
Once downloaded and converted to BF16: * run [convert_deepseek_family_ckpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/convert_deepseek_family_ckpt.py) to convert the checkpoint for MaxText compatibility in [Orbax](https://orbax.readthedocs.io/en/latest/guides/checkpoint/orbax_checkpoint_101.html) for training and fine-tuning. When converting a checkpoint with MTP layers (like DeepSeek-V3), be sure to add the `--enable_mtp` flag to process them correctly. * run [convert_deepseek_family_unscanned_ckpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/convert_deepseek_family_unscanned_ckpt.py) to convert the checkpoint to unscanned version in Orbax for decoding. +## Checkpoint conversion for V3.2 +> **Note:** These steps are required because Transformers code for V3.2 is not yet available. + +### 1. Download Model Weights +Download the Hugging Face weights from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) to your local environment. +* **Target Directory:** `LOCAL_WEIGHTS` + +### 2. Dequantize Weights +Convert the weights from FP8 to BF16 using the official DeepSeek script. +* **Script:** [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) +* **Output Directory:** `DEQUANTIZED_LOCAL_WEIGHTS` + +### 3. Convert to MaxText +Execute the following command to finalize the conversion. Ensure your environment variables (`$BASE_OUTPUT_PATH`, `$HF_TOKEN`, and `$DEQUANTIZED_LOCAL_WEIGHTS`) are exported before running. 
+
+```bash
+python3 -m maxtext.checkpoint_conversion.to_maxtext \
+    src/maxtext/configs/base.yml \
+    model_name=deepseek3.2-671b \
+    scan_layers=true \
+    attention=dot_product \
+    base_output_directory=$BASE_OUTPUT_PATH \
+    hf_access_token=$HF_TOKEN \
+    hardware=cpu \
+    skip_jax_distributed_system=True \
+    --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
+    --eager_load_method=safetensors \
+    --save_dtype=bfloat16
+```

## Fine-tuning

From d00c55e0121876d7a2dc2b1138bf730dcfafb480 Mon Sep 17 00:00:00 2001
From: Snehal Verma
Date: Wed, 8 Apr 2026 21:47:16 +0000
Subject: [PATCH 2/2] resolve 1st batch of comments on DeepSeek-V3.2 user guide

---
 tests/end_to_end/tpu/deepseek/Run_DeepSeek.md | 171 +++++++++---------
 1 file changed, 89 insertions(+), 82 deletions(-)

diff --git a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md b/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md
index 8795ec6c26..d60a8c2cdd 100644
--- a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md
+++ b/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md
@@ -22,7 +22,7 @@ DeepSeek is a novel family of open-weights sparse MoE models by DeepSeek AI. The

* DeepSeek-V3.1 shares the same architecture as V3, but features an improved checkpoint that supports hybrid thinking modes, improved performance in agentic tasks, and higher thinking efficiency.

-* DeepSeek-V3.2 replaces vanilla attention (O[L^2] where L is number of tokens) with DeepSeek Sparse Attention (O[L * k] where k is some number of sparsely selected tokens).
+* DeepSeek-V3.2 introduces [DeepSeek Sparse Attention](https://arxiv.org/pdf/2512.02556) (DSA), which reduces computational complexity while preserving model performance in long-context scenarios.

* DeepSeek R1 also uses V3 architecture. It utilizes cold-start data and large-scale reinforcement learning to incentivize chain-of-thought reasoning without relying solely on supervised fine-tuning.
@@ -56,92 +56,11 @@ python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ dataset_type=synthetic ``` -## Indexer training -DeepSeek-V3.2 introduces deepseek sparse attention. Training the lightning indexer to achieve sparsity is a 2 stage process. - -1. **Dense Warmup Stage** -```sh -python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ - base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ - run_name=matmul_pre_training \ - per_device_batch_size=4 \ - enable_checkpointing=false \ - model_name=deepseek3-671b \ - ici_fsdp_parallelism=128 \ - steps=5 \ - max_target_length=1024 \ - async_checkpointing=false \ - tokenizer_type=huggingface \ - tokenizer_path=deepseek-ai/DeepSeek-V3 \ - attention=flash \ - dtype=bfloat16 \ - weight_dtype=bfloat16 \ - megablox=False \ - sparse_matmul=False \ - dataset_type=synthetic \ - indexer_sparse_training=False \ - indexer_loss_scaling_factor={some non-zero value} \ - trainable_parameters_mask=['.*indexer.*'] -``` -2. **Sparse Training Stage** -```sh -python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \ - base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ - run_name=matmul_pre_training \ - per_device_batch_size=4 \ - enable_checkpointing=false \ - model_name=deepseek3-671b \ - ici_fsdp_parallelism=128 \ - steps=5 \ - max_target_length=1024 \ - async_checkpointing=false \ - tokenizer_type=huggingface \ - tokenizer_path=deepseek-ai/DeepSeek-V3 \ - attention=flash \ - dtype=bfloat16 \ - weight_dtype=bfloat16 \ - megablox=False \ - sparse_matmul=False \ - dataset_type=synthetic \ - indexer_sparse_training=True \ - indexer_loss_scaling_factor={some non-zero value} \ -``` - ## Checkpoint conversion To get started, follow the instructions at HuggingFace ([V3](https://huggingface.co/deepseek-ai/DeepSeek-V3), [V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)) to download the model. Currently for V3, V3.1, and R1, it uses mixed precision fp8 & bf16 weights. 
To convert all FP8 weights to BF16, use the script [here](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/utils/ckpt_scripts/deepseek_fp8_to_bf16.py). Once downloaded and converted to BF16: * run [convert_deepseek_family_ckpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/convert_deepseek_family_ckpt.py) to convert the checkpoint for MaxText compatibility in [Orbax](https://orbax.readthedocs.io/en/latest/guides/checkpoint/orbax_checkpoint_101.html) for training and fine-tuning. When converting a checkpoint with MTP layers (like DeepSeek-V3), be sure to add the `--enable_mtp` flag to process them correctly. * run [convert_deepseek_family_unscanned_ckpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/convert_deepseek_family_unscanned_ckpt.py) to convert the checkpoint to unscanned version in Orbax for decoding. -## Checkpoint conversion for V3.2 -> **Note:** These steps are required because Transformers code for V3.2 is not yet available. - -### 1. Download Model Weights -Download the Hugging Face weights from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) to your local environment. -* **Target Directory:** `LOCAL_WEIGHTS` - -### 2. Dequantize Weights -Convert the weights from FP8 to BF16 using the official DeepSeek script. -* **Script:** [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) -* **Output Directory:** `DEQUANTIZED_LOCAL_WEIGHTS` - -### 3. Convert to MaxText -Execute the following command to finalize the conversion. Ensure your environment variables (`$BASE_OUTPUT_PATH`, `$HF_TOKEN`, and `$DEQUANTIZED_LOCAL_WEIGHTS`) are exported before running. 
- -```bash -python3 -m maxtext.checkpoint_conversion.to_maxtext \ - src/maxtext/configs/base.yml \ - model_name=deepseek3.2-671b \ - scan_layers=true \ - attention=dot_product \ - base_output_directory=$BASE_OUTPUT_PATH \ - hf_access_token=$HF_TOKEN \ - hardware=cpu \ - skip_jax_distributed_system=True \ - --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \ - --eager_load_method=safetensors \ - --save_dtype=bfloat16 -``` - ## Fine-tuning After you have a MaxText compatible checkpoint, you could fine-tune it with different datasets. @@ -297,3 +216,91 @@ To run MMLU benchmarks and validate the model's performance, follow the instruct * Dropping implementation with flag `sparse_matmul=False` and reasonable `capacity_factor`, commonly used from 1 to 1.25. See more examples in scripts for [V3](v3-671b/test_deepseek.sh) and [V2-Lite](v2-16b/test_deepseek.sh). + +## DeepSeek-V3.2 + +### Continued pre-training for V3.2 Sparse Attention +**DeepSeek Sparse Attention (DSA)** enhances the Multi-Head Latent Attention (MLA) architecture by introducing a **Lightning Indexer**, which selects the top-$k$ tokens for attention. DeepSeek-V3.2 is instantiated from DeepSeek-V3.1 and undergoes continued pre-training to adapt this indexer via a two-stage strategy: **Dense Warm-up** and **Sparse Training**. + +1. **Dense Warmup Stage** +The indexer is trained exclusively using dense indexer loss while all other model parameters remain frozen. 
+```sh
+# indexer_loss_scaling_factor must be non-zero to activate indexer training; the default in base.yml is 0.
+python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+    base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
+    model_name=deepseek3.2-671b \
+    run_name=matmul_pre_training \
+    per_device_batch_size=4 \
+    enable_checkpointing=false \
+    ici_fsdp_parallelism=128 \
+    steps=5 \
+    max_target_length=1024 \
+    async_checkpointing=false \
+    tokenizer_type=huggingface \
+    tokenizer_path=deepseek-ai/DeepSeek-V3.2 \
+    attention=flash \
+    dtype=bfloat16 \
+    weight_dtype=bfloat16 \
+    megablox=True \
+    sparse_matmul=True \
+    dataset_type=synthetic \
+    indexer_sparse_training=False \
+    indexer_loss_scaling_factor=0.01 \
+    trainable_parameters_mask=['.*indexer.*']
+```
+2. **Sparse Training Stage**
+The indexer is trained with the sparse indexer loss, while the remaining model parameters are unfrozen and updated using the standard language modeling loss.
+```sh
+# indexer_loss_scaling_factor must be non-zero to activate indexer training; the default in base.yml is 0.
+python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+    base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
+    model_name=deepseek3.2-671b \
+    per_device_batch_size=4 \
+    enable_checkpointing=false \
+    ici_fsdp_parallelism=128 \
+    steps=5 \
+    max_target_length=1024 \
+    async_checkpointing=false \
+    tokenizer_type=huggingface \
+    tokenizer_path=deepseek-ai/DeepSeek-V3.2 \
+    attention=flash \
+    dtype=bfloat16 \
+    weight_dtype=bfloat16 \
+    megablox=True \
+    sparse_matmul=True \
+    dataset_type=synthetic \
+    indexer_sparse_training=True \
+    indexer_loss_scaling_factor=0.01
+```
+
+### Checkpoint conversion for V3.2
+> **Note:** These steps are required because Transformers code for V3.2 is not yet available.
+
+#### 1. Download Model Weights
+Download the Hugging Face weights from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) to your local environment. Weights are provided in FP8.
+```sh
+hf download deepseek-ai/DeepSeek-V3.2 --local-dir $LOCAL_WEIGHTS
+```
+Here `$LOCAL_WEIGHTS` is whatever local directory you choose to download into.
+
+#### 2. Dequantize Weights
+Convert the weights from FP8 to BF16 on CPU using the script [deepseek_fp8_to_bf16.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/standalone_scripts/deepseek_fp8_to_bf16.py), writing the output to `$DEQUANTIZED_LOCAL_WEIGHTS`:
+
+```sh
+python3 -m maxtext.checkpoint_conversion.standalone_scripts.deepseek_fp8_to_bf16 \
+    --input-fp8-hf-path=$LOCAL_WEIGHTS \
+    --output-bf16-hf-path=$DEQUANTIZED_LOCAL_WEIGHTS
+```
+
+Alternatively, you can use the official DeepSeek script [fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) to convert on GPU.
+
+#### 3. Convert to MaxText-compatible Orbax format
+Execute the following command to finalize the conversion. Ensure your environment variables (`$BASE_OUTPUT_PATH`, `$HF_TOKEN`, and `$DEQUANTIZED_LOCAL_WEIGHTS`) are exported before running.
+Setting `scan_layers=true` generates the scanned Orbax format for training and fine-tuning; setting `scan_layers=false` generates the unscanned format for decoding.
+```bash
+python3 -m maxtext.checkpoint_conversion.to_maxtext \
+    src/maxtext/configs/base.yml \
+    model_name=deepseek3.2-671b \
+    scan_layers=true \
+    attention=dot_product \
+    base_output_directory=$BASE_OUTPUT_PATH \
+    hf_access_token=$HF_TOKEN \
+    hardware=cpu \
+    skip_jax_distributed_system=True \
+    --hf_model_path=$DEQUANTIZED_LOCAL_WEIGHTS \
+    --eager_load_method=safetensors \
+    --save_dtype=bfloat16
+```
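As background for reviewing this patch: the guide describes DeepSeek Sparse Attention as a lightning indexer that selects the top-k tokens, after which ordinary softmax attention runs over only those tokens. The sketch below illustrates that idea in plain Python with invented names (`sparse_attention`, `indexer_scores`); it is an illustration of the mechanism only, not MaxText's or DeepSeek's actual implementation.

```python
import math

def sparse_attention(q, keys, values, indexer_scores, k):
    """Attend from one query to only the top-k tokens chosen by an indexer.

    q, keys[i], and values[i] are equal-length lists of floats;
    indexer_scores[i] is the lightweight relevance score a lightning-indexer-style
    component assigns to token i. Per query, the cost is an O(L) scoring pass
    plus O(k) attention, rather than O(L) full attention over every token.
    """
    d = len(q)
    # Select the k token positions with the highest indexer scores.
    topk = sorted(range(len(keys)), key=lambda i: indexer_scores[i], reverse=True)[:k]
    # Standard scaled-dot-product softmax attention, restricted to those tokens.
    logits = [sum(qi * kij for qi, kij in zip(q, keys[i])) / math.sqrt(d) for i in topk]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, topk):
        for j, vij in enumerate(values[i]):
            out[j] += (w / z) * vij
    return out
```

Across L queries this gives roughly O(L * k) attention work plus the cheap indexer pass, versus O(L^2) for full attention, which is the complexity reduction the guide attributes to DSA.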