Add perf model coverage for DeepEP EP communication ops #520
Conversation
Pull request overview
Adds performance-model coverage and categorization for DeepEP expert-parallel token-routing communication ops so they're no longer reported as `other` in TraceLens' perf reports.
Changes:
- Introduces a new `EPComm` perf-model base class plus four DeepEP-specific subclasses with `flops() = 0` and a bytes-moved model derived from `(num_tokens, hidden_dim, dtype)`.
- Registers `DeepEPDispatch`/`Combine` (and their backward variants) in `op_to_perf_model_class_map` and categorizes them under the new `EP_Communication` category.
- Adds a new pytest module with mapping, categorization, and `bytes()`/dtype coverage tests for the DeepEP ops.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `TraceLens/PerfModel/perf_model.py` | Adds `EPComm` + DeepEP subclasses and implements a bytes-moved model for EP communication ops. |
| `TraceLens/PerfModel/torch_op_mapping.py` | Maps DeepEP op names to the new perf-model classes and introduces the `EP_Communication` category. |
| `tests/test_deepep_ops.py` | New unit tests validating mapping, categorization, and `bytes()`/`flops()` behavior for DeepEP ops. |
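The overall shape of such a model can be sketched as follows. This is a hypothetical skeleton, not the actual TraceLens code: the class and method names mirror the PR description, but the dtype table and constructor arguments are assumptions (the real classes parse shapes and dtype out of trace event args).

```python
class EPComm:
    """Sketch of a perf model for EP all-to-all communication ops.

    Pure communication: there is no matrix multiply, so flops() is 0 and
    the cost model reduces to bytes moved = num_tokens * hidden_dim * bpe.
    """

    # Hypothetical dtype -> bytes-per-element table; the real class parses
    # the dtype string from the trace event args.
    DTYPE_BPE = {"BFloat16": 2, "Half": 2, "Float": 4}

    def __init__(self, num_tokens, hidden_dim, dtype):
        self.num_tokens = num_tokens
        self.hidden_dim = hidden_dim
        self.bpe = self.DTYPE_BPE.get(dtype)  # None when dtype is unknown

    def flops(self):
        return 0  # all-to-all token routing does no math

    def bytes(self):
        if self.bpe is None:
            return None  # propagate "unknown" rather than guess
        return self.num_tokens * self.hidden_dim * self.bpe


class deepep_dispatch(EPComm):
    # Forward dispatch: Input Dims[0] = (num_tokens_local, hidden_dim)
    pass
```

With this skeleton, a BF16 dispatch of 4096 tokens at hidden dim 2048 reports 16 MiB moved and zero FLOPs.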
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Closes #540, closes #541, closes #542, closes #543

- Map ck_grouped_gemm and ck_grouped_gemm_variable_k to existing primus_turbo perf models (reuses same Input Dims layout) (#540)
- Categorize SSM/Mamba ops (MambaSplitConv1dScanCombinedFn, DaoAILab::_causal_conv1d_*_cpp) as SSM_fwd/SSM_bwd (#541)
- Categorize MoE dispatch/combine ops (MoEDispatch, MoECombine, TokenPermuteMaskMap, _OperationFuserAutogradFunction) as MoE_comm_fwd/MoE_comm_bwd (#542, reuses EPComm perf model from #518/#520)
- Categorize FusedRoPEFunc as RoPE_fwd/RoPE_bwd (#543)
- Categorize CrossEntropyFunction as CrossEntropy_fwd/CrossEntropy_bwd (#543)

Made-with: Cursor
Force-pushed from 03cf456 to a8fdca8
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Force-pushed from 9d3a26e to ed7bd83
Add EPComm base class and four concrete classes (deepep_dispatch, deepep_combine, deepep_dispatch_backward, deepep_combine_backward) to perf_model.py. Register all four ops in torch_op_mapping.py under the new EP_Communication category so they are no longer counted as 'other' in perf reports. bytes() = token-tensor volume (num_tokens * hidden_dim * bpe); flops() = 0 (pure all-to-all communication, no matrix multiply). Closes #518
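The volume formula in the commit message can be worked through numerically. A minimal sketch, with illustrative sizes that are not taken from the traced workloads:

```python
# Dispatch of 4096 local tokens at hidden_dim 2048 (illustrative sizes)
num_tokens, hidden_dim = 4096, 2048

# bytes() = num_tokens * hidden_dim * bpe
bytes_bf16 = num_tokens * hidden_dim * 2  # BF16: 2 bytes per element
bytes_fp32 = num_tokens * hidden_dim * 4  # FP32: 4 bytes per element

flops = 0  # pure all-to-all communication, no matrix multiply
```

So the same token tensor costs 16,777,216 bytes in BF16 and exactly twice that in FP32, while contributing nothing to the FLOP count.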
- Remove unused `import pytest` from test_deepep_ops.py.
- Update `categorize_torch_op` docstring to list EP_Communication and record_param_comms as valid return values.
- Remove silent BF16 fallback in `EPComm._parse_token_tensor`: bpe is now kept as None when the dtype string is absent or unrecognised, and `bytes()` propagates None rather than producing a silently wrong bandwidth estimate. Add `test_deepep_unknown_dtype_returns_none` to cover this path.
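The fallback removal can be sketched in isolation. The helper names and dtype strings here are assumptions for illustration, not the actual `_parse_token_tensor` implementation:

```python
DTYPE_BPE = {"BFloat16": 2, "Half": 2, "Float": 4}  # assumed dtype strings

def parse_bpe(dtype_str):
    # Old behavior: DTYPE_BPE.get(dtype_str, 2) silently fell back to BF16.
    # New behavior: an absent or unrecognised dtype yields None.
    if not dtype_str:
        return None
    return DTYPE_BPE.get(dtype_str)

def bytes_moved(num_tokens, hidden_dim, dtype_str):
    bpe = parse_bpe(dtype_str)
    if bpe is None:
        return None  # no silently wrong bandwidth estimate downstream
    return num_tokens * hidden_dim * bpe
```

The point of propagating None is that a report showing "no estimate" is more trustworthy than a bandwidth number computed from a guessed element size.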
Copilot review: EPComm lacked flops_bwd()/bytes_bwd() methods, which would cause an AttributeError when TreePerfAnalyzer computes backward perf metrics. Added both (delegate to flops()/bytes(), symmetric for comm ops). Also corrected docstring: TFLOPS/s is 0, not nan, when flops()=0 and kernel time > 0.
Force-pushed from ed7bd83 to b0eafa2
DeepEP ops (DeepEPDispatch, DeepEPCombine, and backward variants) are Primus/Megatron-specific and don't belong in core TraceLens alongside universal `aten::` ops.

- Keep EPComm base class in perf_model.py (reusable abstraction)
- Move four concrete subclasses to example_megatron_extension.py
- Register via perf_model_extension and dict_cat2names_extension
- Remove DeepEP entries from core op_to_perf_model_class_map
- Update tests to import from extension, verify not in core mapping
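The extension layout in that commit can be sketched as follows. The hook names (`perf_model_extension`, `dict_cat2names_extension`) come from the commit message, but the merge mechanics and class bodies here are assumptions:

```python
# Sketch of example_megatron_extension.py: framework-specific subclasses
# live outside core and register themselves through extension dicts that
# core merges into its tables at load time.

class EPComm:  # stand-in for the reusable base class kept in core
    def flops(self):
        return 0

class deepep_dispatch(EPComm): pass
class deepep_combine(EPComm): pass

perf_model_extension = {
    "DeepEPDispatch": deepep_dispatch,
    "DeepEPCombine": deepep_combine,
}
dict_cat2names_extension = {
    "EP_Communication": ["DeepEPDispatch", "DeepEPCombine"],
}

# Core-side merge (sketch): DeepEP entries are gone from the core map
# and only appear once the extension is loaded.
op_to_perf_model_class_map = {}
op_to_perf_model_class_map.update(perf_model_extension)
```

This keeps the core mapping limited to universal ops while letting a Megatron/Primus deployment opt in to the DeepEP models.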
…extension" This reverts commit 69d8505.
Fixes #533. The performance model is currently bare-bones, based on O(n) operations being strictly necessary for these reductions. It could be tuned further, but that would depend heavily on the implementation; the theoretical limit is roughly n/2 + log(n) at a minimum, which is unlikely to be significantly more accurate.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
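The gap between the bare-bones model and the quoted floor can be illustrated numerically. A sketch, where the n/2 + log(n) floor is taken directly from the commit message above and log is read as log2:

```python
import math

def simple_model_ops(n):
    # bare-bones model: O(n) operations to reduce n elements
    return n

def theoretical_floor(n):
    # rough lower bound quoted in the commit: about n/2 + log(n)
    return n / 2 + math.log2(n)

# The two stay within a factor of ~2 of each other, so tuning the model
# toward the floor would not change estimates dramatically.
ratio = simple_model_ops(1 << 20) / theoretical_floor(1 << 20)
```

For n = 2^20, the simple model gives 1,048,576 operations against a floor of 524,308, a ratio just under 2.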
Add string-typed groupby columns as tie-breakers so rows with the same duration sum have reproducible ordering. Regenerate all reference xlsx.
Force-pushed from e509dff to 9791cbb
…cation Made-with: Cursor
Made-with: Cursor
Review Notes

Looks solid overall — clean model, good test coverage. A few known limitations to document as TODOs for follow-up:

Known Limitations
Please add a note in the docstring.

Request

Could you DM @ajassani an example DeepEP trace? Want to verify what metadata is available in the trace event args.
Address review feedback from @ajassani:

- Effective vs true interconnect BW (includes straggler sync/wait)
- Label as interconnect BW, not HBM (need Effective Interconnect TB/s column)
- bytes() may be an upper bound (local-expert tokens don't cross interconnect)
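The third point can be quantified under a uniform-routing assumption. This is a hypothetical refinement, not code from the PR; `ep_size` and the uniformity assumption are mine:

```python
def bytes_upper_bound(num_tokens, hidden_dim, bpe):
    # bytes() as implemented: counts every routed token
    return num_tokens * hidden_dim * bpe

def interconnect_bytes_uniform(num_tokens, hidden_dim, bpe, ep_size):
    # Under uniform routing, roughly 1/ep_size of tokens land on local
    # experts and never cross the interconnect, so the true volume is
    # lower than the bytes() upper bound.
    return bytes_upper_bound(num_tokens, hidden_dim, bpe) * (ep_size - 1) / ep_size
```

At ep_size = 8, the adjustment shaves 12.5% off the upper bound; real routing is skewed, so the correction would need per-rank token counts from the trace to be exact.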
Thanks for the review @ajassani — all three known limitations are now documented in the EPComm docstring (commit 8a5f105):
Will DM you an example DeepEP trace.
Closes #518
Summary
`DeepEPDispatch`, `DeepEPCombine`, `DeepEPDispatchBackward`, and `DeepEPCombineBackward` are all-to-all token-routing ops used by Primus Turbo's Deep Expert Parallelism (EP) in MoE models such as DeepSeek V2 Lite and GPT-OSS-20B. They accounted for ~23.8% and ~7.9% of total GPU time respectively, but had zero perf model coverage and were counted as `other`. This PR adds an `EP_Communication` category and bandwidth model for all four ops.

Changes

- `perf_model.py`: new `EPComm` base class and four concrete subclasses:
  - `EPComm`: base for EP communication ops; `flops() = 0`, `bytes() = num_tokens × hidden_dim × bpe` (pure all-to-all, no matrix multiply).
  - `deepep_dispatch`: parses `DeepEPDispatch`; `Input Dims[0] = (num_tokens_local, hidden_dim)`.
  - `deepep_combine`: parses `DeepEPCombine`; `Input Dims[0] = (num_tokens_dispatched, hidden_dim)`.
  - `deepep_dispatch_backward` / `deepep_combine_backward`: backward passes; same shape conventions as their forward counterparts in reverse.
- `torch_op_mapping.py`: registers all four ops in `op_to_perf_model_class_map`, adds `EPComm: "EP_Communication"` to `dict_base_class2category`, and adds an `EP_Communication` branch in `categorize_torch_op` so these ops are no longer counted as `other`.

Tests

`tests/test_deepep_ops.py`: 19 tests covering op mapping, `EP_Communication` categorization, `bytes()` values for all four ops (BF16 and FP32), `flops() = 0`, relative magnitude (combine > dispatch), unknown dtype handling, and inheritance from `EPComm`.

Note
The EPComm base class currently sits in perf_model.py. The docstring explains the placement rationale: