Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the system's support for the DeepSeek-V3.2 model by integrating its specialized attention and KV cache mechanisms. The core changes adopt FlashMLA for sparse attention and FP8 quantization for the KV cache, aiming to reduce memory usage and improve inference performance. The updates span Docker build configuration, core attention logic, and KV cache management, ensuring the new model runs end to end.
Code Review
This pull request introduces support for the DeepSeek-V3.2 DSA-specific FlashMLA FP8 sparse KV cache. Key changes include a new fp8kv_dsa option for the KV cache type, a dedicated memory manager (FP8PerTokenGroupQuantDeepseek3_2MemoryManager), and new Triton kernels for FP8 KV cache operations. The Dockerfile is updated to install FlashMLA, and the attention and transformer-layer inference logic are modified to use the new FP8 sparse attention backend. Feedback includes suggestions to:

- optimize the Docker image size by cleaning up build artifacts,
- remove a duplicate fp8.py file,
- replace magic numbers with named constants or configuration values in the new memory manager and Triton kernel, and
- refine the att_state type hint in transformer_layer_infer.py for better type safety.
cd /root/FlashMLA && \
    git checkout ${FLASH_MLA_REF} && \
    git submodule update --init --recursive && \
    FLASH_MLA_DISABLE_SM100=1 pip install --no-cache-dir .
To reduce the final Docker image size, it's a good practice to clean up build-time dependencies and source files within the same RUN layer. After installing FlashMLA, the cloned repository at /root/FlashMLA is no longer needed and can be removed.
FLASH_MLA_DISABLE_SM100=1 pip install --no-cache-dir . && rm -rf /root/FlashMLA
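A sketch of what the consolidated RUN layer could look like with the cleanup applied. This is illustrative only: the clone URL and the presence of a git clone step in the same layer are assumptions, not taken from the PR's Dockerfile.

```dockerfile
# Hypothetical consolidated layer: clone, build, install, then remove the
# source tree within the same RUN so the clone never persists in any layer.
RUN git clone https://github.com/deepseek-ai/FlashMLA.git /root/FlashMLA && \
    cd /root/FlashMLA && \
    git checkout ${FLASH_MLA_REF} && \
    git submodule update --init --recursive && \
    FLASH_MLA_DISABLE_SM100=1 pip install --no-cache-dir . && \
    cd / && rm -rf /root/FlashMLA
```

Splitting the cleanup into a later RUN would not help, since each RUN produces its own immutable layer and the earlier layer would still contain the full source tree.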
@@ -0,0 +1,187 @@
import dataclasses
flashmla_bytes_per_token = 656
indexer_bytes_per_token = 132
kv_head_dim = 576
This class uses several magic numbers (e.g., 656, 132, 576). These numbers seem to be related to the model architecture but are hardcoded. It would improve maintainability and readability to define them as named constants at the top of the file or, even better, pass them in from the model configuration during initialization. This would make the code more flexible for future model variations.
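One way to lift these values out, sketched below. The values 656, 132, and 576 come from the diff; the class name, field names, and comments are illustrative, not taken from the PR.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class DeepseekV32CacheLayout:
    """Per-token KV cache layout constants for DeepSeek-V3.2 (illustrative)."""

    flashmla_bytes_per_token: int = 656  # FP8 latent KV page bytes per token
    indexer_bytes_per_token: int = 132   # sparse-indexer side cache bytes per token
    kv_head_dim: int = 576               # MLA latent head dim

    @property
    def total_bytes_per_token(self) -> int:
        # Combined cache footprint per token across both structures.
        return self.flashmla_bytes_per_token + self.indexer_bytes_per_token
```

Instantiating the layout once from the model config and passing it into the memory manager would keep these numbers out of the kernel-adjacent code entirely.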
  q_lora: torch.Tensor,
  infer_state: Deepseek2InferStateInfo,
- att_state: Union[NsaFlashMlaSparsePrefillAttState, NsaFlashMlaSparseDecodeAttState],
+ att_state: Any,
Using Any for the type hint of att_state loses type information, which can make the code harder to understand and maintain. It would be better to use a more specific type, like a Union of the possible state types, or define a common base class for all attention states and use that as the type hint. This improves code clarity and allows static analysis tools to catch potential errors.
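A minimal sketch of the Union-alias approach, with the state classes stubbed out (the real dataclasses live in the PR's attention backend; only the two class names are taken from the diff):

```python
from typing import Union


# Stubs standing in for the PR's real attention-state dataclasses.
class NsaFlashMlaSparsePrefillAttState:
    pass


class NsaFlashMlaSparseDecodeAttState:
    pass


# A named alias keeps signatures readable and lets type checkers verify
# that callers only pass known state types, unlike Any.
NsaFlashMlaAttState = Union[
    NsaFlashMlaSparsePrefillAttState,
    NsaFlashMlaSparseDecodeAttState,
]


def dispatch(att_state: NsaFlashMlaAttState) -> str:
    # isinstance narrowing works on a Union; with Any the checker
    # cannot flag an unexpected state type at all.
    if isinstance(att_state, NsaFlashMlaSparsePrefillAttState):
        return "prefill"
    return "decode"
```

If more state variants are expected later, a common base class (or a typing.Protocol) may scale better than growing the Union.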
start = tile_idx * 128
end = start + 128
tile = kv_nope[:, start:end]
scale = torch.pow(2, torch.clamp_min(tile.abs().amax(dim=-1).float() / 448.0, 1e-4).log2().ceil())
The magic number 448.0 is used here, which corresponds to the maximum value of float8_e4m3fn. It's better to use torch.finfo(torch.float8_e4m3fn).max to avoid magic numbers and improve code clarity and robustness against future changes in the data type.
- scale = torch.pow(2, torch.clamp_min(tile.abs().amax(dim=-1).float() / 448.0, 1e-4).log2().ceil())
+ scale = torch.pow(2, torch.clamp_min(tile.abs().amax(dim=-1).float() / torch.finfo(torch.float8_e4m3fn).max, 1e-4).log2().ceil())
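For reference, the quantization line computes a per-tile scale rounded up to the next power of two. A pure-Python sketch of that scalar computation (448.0 kept as a literal here, standing in for torch.finfo(torch.float8_e4m3fn).max; the function name is illustrative):

```python
import math

FP8_E4M3_MAX = 448.0  # == torch.finfo(torch.float8_e4m3fn).max


def pow2_tile_scale(amax: float, eps: float = 1e-4) -> float:
    """Scale for one 128-wide tile: amax / 448 is clamped below at eps,
    then rounded up to the next power of two, mirroring the kernel's
    clamp_min(...).log2().ceil() chain."""
    return 2.0 ** math.ceil(math.log2(max(amax / FP8_E4M3_MAX, eps)))
```

Power-of-two scales are exactly representable in the FP8 exponent, so dividing by the scale before casting introduces no extra rounding error from the scale itself.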
Force-pushed from 9ec2bd2 to b303281.
No description provided.