
Deepseekv3.2 #1246

Open
blueswhen wants to merge 4 commits into main from deepseekv3.2

Conversation

@blueswhen (Collaborator)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's capability to support the DeepSeek-V3.2 model by integrating its specialized attention and KV cache mechanisms. The core of these changes revolves around the adoption of FlashMLA for sparse attention and FP8 quantization for the KV cache, aiming to optimize memory usage and improve inference performance. The updates span from Docker build configurations to core attention logic and KV cache management, ensuring seamless operation with the new model.

Highlights

  • FlashMLA Integration for DeepSeek-V3.2: Incorporated the FlashMLA library, a key dependency for DeepSeek-V3.2's attention mechanism, by adding it to the Docker build process.
  • FP8 Sparse KV Cache Support: Introduced a new KV cache manager and attention backend specifically designed for DeepSeek-V3.2's DSA (DeepSeek Attention) using FP8 quantization and sparse attention with FlashMLA.
  • Dynamic KV Cache Manager Selection: Enhanced the system to dynamically select the appropriate KV cache manager, including the new FP8 sparse manager, based on the llm_kv_type configuration.
  • Triton Kernel for FP8 KV Copy: Added a specialized Triton kernel to efficiently handle the copying and quantization of KV cache data into the FlashMLA FP8 format.
  • API and Configuration Updates: Extended the command-line interface and internal configuration to support fp8kv_dsa as a new option for KV cache types.

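The "Dynamic KV Cache Manager Selection" highlight describes dispatching on the llm_kv_type configuration value. A minimal sketch of how such a registry might look is below; only the option name fp8kv_dsa and the manager class name FP8PerTokenGroupQuantDeepseek3_2MemoryManager come from this PR, everything else (the registry, the base class, the selector function) is illustrative and not lightllm's actual API.

```python
# Hypothetical llm_kv_type-based dispatch; placeholder classes stand in
# for the real lightllm memory managers.
class MemoryManager:
    """Default (non-quantized) KV cache manager (placeholder)."""

class FP8PerTokenGroupQuantDeepseek3_2MemoryManager(MemoryManager):
    """FP8 sparse KV cache manager for DeepSeek-V3.2 DSA (placeholder)."""

_KV_MANAGERS = {
    "none": MemoryManager,
    "fp8kv_dsa": FP8PerTokenGroupQuantDeepseek3_2MemoryManager,
}

def select_kv_manager(llm_kv_type: str) -> type:
    """Look up the manager class for a configured KV cache type."""
    try:
        return _KV_MANAGERS[llm_kv_type]
    except KeyError:
        raise ValueError(f"unsupported llm_kv_type: {llm_kv_type!r}")

assert select_kv_manager("fp8kv_dsa") is FP8PerTokenGroupQuantDeepseek3_2MemoryManager
```

A table-driven registry like this keeps the CLI option list and the manager classes in one place, so adding a new KV cache type is a one-line change.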

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Supported commands:

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for DeepSeek-V3.2 DSA-specific FlashMLA FP8 sparse KV cache. Key changes include adding a new fp8kv_dsa option for KV cache type, implementing a dedicated memory manager (FP8PerTokenGroupQuantDeepseek3_2MemoryManager), and integrating new Triton kernels for FP8 KV cache operations. The Dockerfile is updated to install FlashMLA, and attention and transformer layer inference logic are modified to utilize the new FP8 sparse attention backend. Feedback includes suggestions to optimize the Docker image size by cleaning up build artifacts, remove a duplicate fp8.py file, replace magic numbers with named constants or configuration in the new memory manager and Triton kernel, and refine the att_state type hint in transformer_layer_infer.py for better type safety.

cd /root/FlashMLA && \
git checkout ${FLASH_MLA_REF} && \
git submodule update --init --recursive && \
FLASH_MLA_DISABLE_SM100=1 pip install --no-cache-dir .


Severity: medium

To reduce the final Docker image size, it's a good practice to clean up build-time dependencies and source files within the same RUN layer. After installing FlashMLA, the cloned repository at /root/FlashMLA is no longer needed and can be removed.

    FLASH_MLA_DISABLE_SM100=1 pip install --no-cache-dir . && rm -rf /root/FlashMLA

@@ -0,0 +1,187 @@
import dataclasses


Severity: medium

This file appears to be a duplicate of lightllm/common/basemodel/attention/nsa/fp8_flashmla_sparse.py. The rest of the codebase imports from fp8_flashmla_sparse.py, so this file seems to be unused and can be removed to avoid code duplication and potential confusion.

Comment on lines +13 to +15
flashmla_bytes_per_token = 656
indexer_bytes_per_token = 132
kv_head_dim = 576


Severity: medium

This class uses several magic numbers (e.g., 656, 132, 576). These numbers seem to be related to the model architecture but are hardcoded. It would improve maintainability and readability to define them as named constants at the top of the file or, even better, pass them in from the model configuration during initialization. This would make the code more flexible for future model variations.
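Following the suggestion above, the hardcoded values could be grouped into a named layout config. A sketch, where the three values come from the diff but the class, field names, and derived property are illustrative (the comments on what each value covers are assumptions, not confirmed by the PR):

```python
# Hypothetical layout config replacing the hardcoded 656/132/576 values.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fp8SparseKvLayout:
    flashmla_bytes_per_token: int = 656  # FP8 KV payload plus scale/rope bytes (value from the diff)
    indexer_bytes_per_token: int = 132   # DSA indexer metadata per token (value from the diff)
    kv_head_dim: int = 576               # likely 512 latent dims + 64 rope dims

    @property
    def bytes_per_token(self) -> int:
        """Total per-token KV cache footprint."""
        return self.flashmla_bytes_per_token + self.indexer_bytes_per_token

layout = Fp8SparseKvLayout()
assert layout.bytes_per_token == 788
```

Passing such an object in from the model configuration, as the review suggests, would also let future DeepSeek variants override the values without touching the manager code.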

    q_lora: torch.Tensor,
    infer_state: Deepseek2InferStateInfo,
-   att_state: Union[NsaFlashMlaSparsePrefillAttState, NsaFlashMlaSparseDecodeAttState],
+   att_state: Any,


Severity: medium

Using Any for the type hint of att_state loses type information, which can make the code harder to understand and maintain. It would be better to use a more specific type, like a Union of the possible state types, or define a common base class for all attention states and use that as the type hint. This improves code clarity and allows static analysis tools to catch potential errors.
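The base-class option mentioned above could look like the sketch below. The two state class names mirror those in the snippet, but the base class, fields, and function are illustrative stand-ins, not the PR's actual code:

```python
# Hypothetical common base class so att_state keeps a precise type hint.
from dataclasses import dataclass

class NsaAttState:
    """Common base for NSA FlashMLA sparse attention states."""

@dataclass
class NsaFlashMlaSparsePrefillAttState(NsaAttState):
    max_q_seq_len: int = 0  # illustrative field

@dataclass
class NsaFlashMlaSparseDecodeAttState(NsaAttState):
    batch_size: int = 0  # illustrative field

def run_attention(att_state: NsaAttState) -> str:
    # With a shared base, static checkers can verify callers pass a
    # known state type instead of accepting Any.
    return type(att_state).__name__

assert run_attention(NsaFlashMlaSparseDecodeAttState()) == "NsaFlashMlaSparseDecodeAttState"
```

Compared to a Union, a base class also avoids updating every call-site annotation when a third state type is added later.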

start = tile_idx * 128
end = start + 128
tile = kv_nope[:, start:end]
scale = torch.pow(2, torch.clamp_min(tile.abs().amax(dim=-1).float() / 448.0, 1e-4).log2().ceil())


Severity: medium

The magic number 448.0 is used here, which corresponds to the maximum value of float8_e4m3fn. It's better to use torch.finfo(torch.float8_e4m3fn).max to avoid magic numbers and improve code clarity and robustness against future changes in the data type.

Suggested change:

    - scale = torch.pow(2, torch.clamp_min(tile.abs().amax(dim=-1).float() / 448.0, 1e-4).log2().ceil())
    + scale = torch.pow(2, torch.clamp_min(tile.abs().amax(dim=-1).float() / torch.finfo(torch.float8_e4m3fn).max, 1e-4).log2().ceil())
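The expression under review computes a power-of-two scale that maps the tile's absolute maximum into the float8_e4m3fn range. A pure-Python mirror of the same arithmetic (the function name is hypothetical; 448.0 is the float8_e4m3fn maximum, which torch.finfo(torch.float8_e4m3fn).max also returns):

```python
# Pure-Python mirror of the reviewed torch expression.
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def power_of_two_scale(amax: float, eps: float = 1e-4) -> float:
    """Smallest power-of-two scale s (with eps clamp) such that amax / s <= FP8_E4M3_MAX."""
    return 2.0 ** math.ceil(math.log2(max(amax / FP8_E4M3_MAX, eps)))

s = power_of_two_scale(100.0)        # 100 / 448 ~ 0.223 -> ceil(log2) = -2 -> s = 0.25
assert s == 0.25 and 100.0 / s <= FP8_E4M3_MAX
```

Rounding the scale up to a power of two keeps the division exact in binary floating point, so dequantization multiplies by an exactly representable factor.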

@blueswhen force-pushed the deepseekv3.2 branch 13 times, most recently from 9ec2bd2 to b303281 on March 27, 2026 at 12:07.