[Cherry-pick][Optimization] enable trtllm_all_reduce fusion kernel in glm model #7219
BingooYang wants to merge 18 commits into PaddlePaddle:release/2.5
Conversation
Thanks for your contribution! |
fastdeploy-bot left a comment
📋 Review Summary
PR overview: enables the trtllm_all_reduce fusion kernel for the GLM-4.5-Air model, improving performance through newly added flashinfer fused operators.
Scope of changes: model_executor/layers/, model_executor/models/glm4_moe.py, config/, engine/
Impact tags: [Optimization] [Models] [OP]
📝 PR Convention Check
The PR complies with the conventions:
- The title contains a valid tag ([Optimization])
- Motivation and Modifications are clearly described
- Checklist items are complete
- Usage and Accuracy Tests are provided
Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | normalization.py:249-252 | AttributeError when the fusion call returns None |
| 🟡 Suggestion | normalization.py:249, linear.py:941, flashinfer_comm_fusion.py:87,118 | max_token_num is hardcoded to 2048; read it from the config instead |
Overall Assessment
The PR wires in the trtllm_all_reduce fusion kernel, and the tests cover the main paths. However, there is one critical bug: when flashinfer is unavailable, the fusion function returns (None, None), but the caller does not handle this case correctly, which causes a runtime error. Recommend merging after the fix.
```python
# enable trtllm all reduce fusion
elif self.enable_all_reduce_fusion and x.shape[0] <= 2048:
    norm_out = flashinfer_allreduce_residual_rmsnorm(
        fd_config=self.fd_config, input_tensor=x, residual=residual_input, weight=self.weight, eps=self.eps
```
🔴 Bug: when flashinfer_allreduce_residual_rmsnorm returns (None, None) (flashinfer unavailable or workspace not initialized), the code goes on to call norm_out[0].astype(), raising AttributeError: 'NoneType' object has no attribute 'astype'.
Suggested fix:
```python
# enable trtllm all reduce fusion
elif self.enable_all_reduce_fusion and x.shape[0] <= 2048:
    norm_out = flashinfer_allreduce_residual_rmsnorm(
        fd_config=self.fd_config, input_tensor=x, residual=residual_input, weight=self.weight, eps=self.eps
    )
    # Check if fusion succeeded, fall back to the normal path if not
    if norm_out[0] is None or norm_out[1] is None:
        norm_out = self.norm_func(
            x,
            norm_weight=self.weight,
            norm_bias=None,
            epsilon=self.eps,
            begin_norm_axis=self.begin_norm_axis,
            bias=self.bias,
            residual=residual_input,
            quant_scale=(-1 if self.quant_scale is None else self.quant_scale),
            quant_round_type=self.quant_round_type,
            quant_max_bound=self.quant_max_bound,
            quant_min_bound=self.quant_min_bound,
        )
```

```python
    residual: paddle.Tensor,
    weight: paddle.Tensor,
    eps: float = 1e-6,
    max_token_num: int = 2048,
```
🟡 Suggestion: max_token_num is hardcoded to 2048 in several places, limiting configurability. This parameter should be read from FDConfig instead.
Affected locations:
- linear.py:941 - `out.shape[0] <= 2048`
- normalization.py:249 - `x.shape[0] <= 2048`
- flashinfer_comm_fusion.py:87 - `max_token_num: int = 2048` (default argument)
- flashinfer_comm_fusion.py:118 - `max_token_num: int = 2048` (default argument)
Suggest adding a flashinfer_allreduce_max_token_num field to FDConfig so the limit is configured in one place.
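A possible shape for the config-driven limit is sketched below. The field name `flashinfer_allreduce_max_token_num` is the one proposed above; the `FDConfig` class and the helper function here are simplified illustrations, not the actual FastDeploy definitions.

```python
from dataclasses import dataclass


@dataclass
class FDConfig:
    # Proposed field; 2048 kept as the default to preserve current behavior.
    flashinfer_allreduce_max_token_num: int = 2048


def should_use_allreduce_fusion(fd_config: FDConfig, num_tokens: int, fusion_enabled: bool) -> bool:
    """Gate the fused path on the configured token limit rather than a literal 2048."""
    return fusion_enabled and num_tokens <= fd_config.flashinfer_allreduce_max_token_num


cfg = FDConfig()
print(should_use_allreduce_fusion(cfg, 1024, True))   # small batch: fused path
print(should_use_allreduce_fusion(cfg, 4096, True))   # exceeds limit: fall back
```

The three call sites listed above would then share the same check instead of each repeating the `<= 2048` comparison.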
Motivation
Integrate the trtllm_allreduce_fusion operator into FastDeploy.
Modifications
Usage or Command
Local tests passed on both H-series and B-series GPUs.
python -m fastdeploy.entrypoints.openai.api_server --model /root/paddlejob/workspace/bingoo/model/GLM-4.5-Air --tensor-parallel-size 4 --port 8185 --max-num-batched-tokens 2048 --enable-flashinfer-allreduce-fusion
Accuracy Tests
python -m paddle.distributed.launch --gpus=0,1 ./FastDeploy/tests/layers/test_rms_allreduce_fusion.py
Checklist
- The PR title includes at least one tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR targeting a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.