
Add perf model coverage for DeepEP EP communication ops#520

Open
gphuang wants to merge 19 commits into main from
518-add-perf-model-coverage-for-deepep-ep-ops-communication

Conversation

@gphuang
Contributor

@gphuang gphuang commented Mar 10, 2026

Closes #518

Summary

DeepEPDispatch, DeepEPCombine, DeepEPDispatchBackward, and DeepEPCombineBackward are all-to-all token-routing ops used by Primus Turbo's Deep Expert Parallelism (EP) in MoE models such as DeepSeek V2 Lite and GPT-OSS-20B. They accounted for ~23.8% of total GPU time in DeepSeek V2 Lite and ~7.9% in GPT-OSS-20B, but had zero perf model coverage and were counted as other. This PR adds an EP_Communication category and a bandwidth model for all four ops.

Changes

perf_model.py — new EPComm base class and four concrete subclasses:

  • EPComm — base for EP communication ops; flops() = 0, bytes() = num_tokens × hidden_dim × bpe (pure all-to-all, no matrix multiply).
  • deepep_dispatch — parses DeepEPDispatch; Input Dims[0] = (num_tokens_local, hidden_dim).
  • deepep_combine — parses DeepEPCombine; Input Dims[0] = (num_tokens_dispatched, hidden_dim).
  • deepep_dispatch_backward / deepep_combine_backward — backward passes; same shape conventions as their forward counterparts in reverse.
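The perf model described above can be sketched as follows. This is a minimal stand-in based only on the PR summary — class names and the dtype table are assumptions, not the actual perf_model.py implementation:

```python
# Sketch of the EPComm perf model summarized above. Class/method names
# follow the PR description; the dtype->bpe table is an assumption.
DTYPE_BYTES = {"BFloat16": 2, "Float": 4}  # bytes per element (bpe)

class EPComm:
    """Base perf model for DeepEP all-to-all communication ops."""

    def __init__(self, num_tokens, hidden_dim, dtype):
        self.num_tokens = num_tokens
        self.hidden_dim = hidden_dim
        # Unrecognized dtypes leave bpe as None so bytes() propagates
        # None instead of a silently wrong bandwidth estimate.
        self.bpe = DTYPE_BYTES.get(dtype)

    def flops(self):
        return 0  # pure all-to-all routing, no matrix multiply

    def bytes(self):
        if self.bpe is None:
            return None
        return self.num_tokens * self.hidden_dim * self.bpe

class deepep_dispatch(EPComm):
    """Input Dims[0] = (num_tokens_local, hidden_dim)."""

op = deepep_dispatch(num_tokens=4096, hidden_dim=2048, dtype="BFloat16")
print(op.flops(), op.bytes())  # 0 16777216
```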

torch_op_mapping.py — registers all four ops in op_to_perf_model_class_map, adds EPComm: "EP_Communication" to dict_base_class2category, and adds an EP_Communication branch in categorize_torch_op so these ops are no longer counted as other.
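The registration and categorization flow might look roughly like this. Dict and function names are taken from the PR text, but the bodies here are illustrative stand-ins, not the real torch_op_mapping.py:

```python
# Hedged sketch of the mapping/categorization described above; the op
# classes are empty placeholders for the real perf models.
class EPComm: ...
class deepep_dispatch(EPComm): ...
class deepep_combine(EPComm): ...
class deepep_dispatch_backward(EPComm): ...
class deepep_combine_backward(EPComm): ...

op_to_perf_model_class_map = {
    "DeepEPDispatch": deepep_dispatch,
    "DeepEPCombine": deepep_combine,
    "DeepEPDispatchBackward": deepep_dispatch_backward,
    "DeepEPCombineBackward": deepep_combine_backward,
}
dict_base_class2category = {EPComm: "EP_Communication"}

def categorize_torch_op(op_name):
    """Return the report category for an op name ('other' if unmapped)."""
    cls = op_to_perf_model_class_map.get(op_name)
    if cls is not None and issubclass(cls, EPComm):
        return dict_base_class2category[EPComm]
    return "other"

print(categorize_torch_op("DeepEPDispatch"))  # EP_Communication
print(categorize_torch_op("aten::mm"))        # other
```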

Tests

tests/test_deepep_ops.py — 19 tests covering op mapping, EP_Communication categorization, bytes() values for all four ops (BF16 and FP32), flops() = 0, relative magnitude (combine > dispatch), unknown dtype handling, and inheritance from EPComm.
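The flavor of those checks can be sketched with a minimal stand-in for the bytes model (the real tests import the actual classes; the top-k replication argument for combine > dispatch is an assumption, not stated in the PR):

```python
# Stand-in for bytes() used to illustrate the BF16/FP32 and
# combine-vs-dispatch checks; not the actual test module.
def ep_bytes(num_tokens, hidden_dim, bpe):
    return num_tokens * hidden_dim * bpe

# bytes() for BF16 (2 B/elem) vs FP32 (4 B/elem)
assert ep_bytes(1024, 4096, 2) == 8_388_608
assert ep_bytes(1024, 4096, 4) == 16_777_216

# Relative magnitude: combine operates on dispatched tokens, which
# (assuming top-k > 1 routing replicates tokens) exceed local tokens.
num_local, topk = 1024, 6
assert ep_bytes(num_local * topk, 4096, 2) > ep_bytes(num_local, 4096, 2)
print("ok")
```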

Note

The EPComm base class currently sits in perf_model.py. The docstring explains the placement rationale:

  • EPComm ops are bandwidth-bound (TB/s), not compute-bound — flops() = 0 is intentional
  • The natural alternative would be alongside NcclAnalyser, but DeepEP uses intra-node shared memory, not NCCL
  • There's no existing EP-specific comm analysis module to extend
  • Placing it in perf_model.py gives it a named category (EP_Communication) and bandwidth metric in the report
  • This is consistent with other Primus-specific ops (MoEDispatch, causal_conv1d, FusedRoPE, CrossEntropy) which also live in core as simple perf models

@gphuang gphuang linked an issue Mar 10, 2026 that may be closed by this pull request
3 tasks
@gphuang gphuang marked this pull request as ready for review March 10, 2026 10:23
Copilot AI review requested due to automatic review settings March 10, 2026 10:23
@gphuang gphuang self-assigned this Mar 10, 2026
@gphuang gphuang added the perf_model Add performance model for calculating TFLOPS/s and TB/s label Mar 10, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds performance-model coverage and categorization for DeepEP expert-parallel token-routing communication ops so they’re no longer reported as other in TraceLens’ perf reports.

Changes:

  • Introduces a new EPComm perf-model base class plus four DeepEP-specific subclasses with flops() = 0 and a bytes-moved model derived from (num_tokens, hidden_dim, dtype).
  • Registers DeepEPDispatch/Combine (and their backward variants) in op_to_perf_model_class_map and categorizes them under the new EP_Communication category.
  • Adds a new pytest module with mapping, categorization, and bytes()/dtype coverage tests for the DeepEP ops.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
TraceLens/PerfModel/perf_model.py Adds EPComm + DeepEP subclasses and implements a bytes-moved model for EP communication ops.
TraceLens/PerfModel/torch_op_mapping.py Maps DeepEP op names to the new perf-model classes and introduces the EP_Communication category.
tests/test_deepep_ops.py New unit tests validating mapping, categorization, and bytes()/flops() behavior for DeepEP ops.


Comment thread tests/test_deepep_ops.py Outdated
Comment thread TraceLens/PerfModel/torch_op_mapping.py
Comment thread TraceLens/PerfModel/perf_model.py
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



Comment thread TraceLens/PerfModel/torch_op_mapping.py
@gphuang gphuang changed the title feat: add EP_Communication perf model for DeepEP dispatch/combine ops feat: add coverage for Deep EP Communication ops Mar 10, 2026
@gphuang gphuang requested review from ajassani and olehtika March 11, 2026 10:23
@gphuang gphuang marked this pull request as draft March 11, 2026 11:46
@gphuang gphuang changed the title feat: add coverage for Deep EP Communication ops feat: add coverage for Deep EP ops (Communication ) Mar 11, 2026
@gphuang gphuang changed the title feat: add coverage for Deep EP ops (Communication ) feat: add coverage for Deep EP ops Mar 11, 2026
@gphuang gphuang changed the title feat: add coverage for Deep EP ops feat: add coverage for DeepEP ops Mar 11, 2026
gphuang added a commit that referenced this pull request Mar 19, 2026
Closes #540, closes #541, closes #542, closes #543

- Map ck_grouped_gemm and ck_grouped_gemm_variable_k to existing
  primus_turbo perf models (reuses same Input Dims layout) (#540)
- Categorize SSM/Mamba ops (MambaSplitConv1dScanCombinedFn,
  DaoAILab::_causal_conv1d_*_cpp) as SSM_fwd/SSM_bwd (#541)
- Categorize MoE dispatch/combine ops (MoEDispatch, MoECombine,
  TokenPermuteMaskMap, _OperationFuserAutogradFunction) as
  MoE_comm_fwd/MoE_comm_bwd (#542, reuses EPComm perf model from #518/#520)
- Categorize FusedRoPEFunc as RoPE_fwd/RoPE_bwd (#543)
- Categorize CrossEntropyFunction as CrossEntropy_fwd/CrossEntropy_bwd (#543)

Made-with: Cursor
@gphuang gphuang force-pushed the 518-add-perf-model-coverage-for-deepep-ep-ops-communication branch from 03cf456 to a8fdca8 Compare March 19, 2026 11:56
@gphuang gphuang changed the title feat: add coverage for DeepEP ops Add perf model coverage for DeepEP EP communication ops Mar 19, 2026
gphuang added a commit that referenced this pull request Mar 19, 2026
@gphuang gphuang marked this pull request as ready for review March 19, 2026 13:36
gphuang added a commit that referenced this pull request Mar 19, 2026
@gphuang gphuang requested a review from Copilot March 19, 2026 15:00
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



Comment thread TraceLens/PerfModel/perf_model.py
Comment thread TraceLens/PerfModel/perf_model.py
gphuang added a commit that referenced this pull request Mar 24, 2026
@gphuang gphuang force-pushed the 518-add-perf-model-coverage-for-deepep-ep-ops-communication branch from 9d3a26e to ed7bd83 Compare March 24, 2026 08:47
gphuang added 4 commits March 24, 2026 09:00
Add EPComm base class and four concrete classes (deepep_dispatch,
deepep_combine, deepep_dispatch_backward, deepep_combine_backward) to
perf_model.py.  Register all four ops in torch_op_mapping.py under the
new EP_Communication category so they are no longer counted as 'other'
in perf reports.

bytes() = token-tensor volume (num_tokens * hidden_dim * bpe);
flops() = 0 (pure all-to-all communication, no matrix multiply).

Closes #518
- Remove unused `import pytest` from test_deepep_ops.py.
- Update categorize_torch_op docstring to list EP_Communication and
  record_param_comms as valid return values.
- Remove silent BF16 fallback in EPComm._parse_token_tensor: bpe is
  now kept as None when the dtype string is absent or unrecognised,
  and bytes() propagates None rather than producing a silently wrong
  bandwidth estimate. Add test_deepep_unknown_dtype_returns_none to
  cover this path.
Copilot review: EPComm lacked flops_bwd()/bytes_bwd() methods which
would cause AttributeError when TreePerfAnalyzer computes backward
perf metrics. Added both (delegate to flops()/bytes() — symmetric
for comm ops). Also corrected docstring: TFLOPS/s is 0, not nan,
when flops()=0 and kernel time > 0.
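The backward delegation described in that commit can be sketched as below. Method names follow the commit message; the constructor and bodies are illustrative stand-ins:

```python
# Sketch: for symmetric all-to-all comm ops, backward traffic equals
# forward traffic, so the backward methods simply delegate.
class EPComm:
    def __init__(self, nbytes):
        self._nbytes = nbytes  # stand-in for the parsed token volume

    def flops(self):
        return 0

    def bytes(self):
        return self._nbytes

    # Without these, backward perf-metric computation (as in
    # TreePerfAnalyzer) would hit AttributeError on EP comm ops.
    def flops_bwd(self):
        return self.flops()

    def bytes_bwd(self):
        return self.bytes()

op = EPComm(nbytes=16_777_216)
print(op.bytes_bwd() == op.bytes())  # True
```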
@gphuang gphuang force-pushed the 518-add-perf-model-coverage-for-deepep-ep-ops-communication branch from ed7bd83 to b0eafa2 Compare March 24, 2026 09:00
DeepEP ops (DeepEPDispatch, DeepEPCombine, and backward variants) are
Primus/Megatron-specific and don't belong in core TraceLens alongside
universal aten:: ops.

- Keep EPComm base class in perf_model.py (reusable abstraction)
- Move four concrete subclasses to example_megatron_extension.py
- Register via perf_model_extension and dict_cat2names_extension
- Remove DeepEP entries from core op_to_perf_model_class_map
- Update tests to import from extension, verify not in core mapping
@gphuang gphuang marked this pull request as draft March 25, 2026 07:02
@gphuang gphuang marked this pull request as ready for review March 25, 2026 07:21
gabeweisz and others added 3 commits March 25, 2026 15:17
Fixes #533

The performance model is currently bare-bones, based on O(n) operations
being strictly necessary for these reductions.

It could be more tuned but that would depend significantly on the
implementation and the theoretical limit is something like n/2 + log(n)
at a minimum which is not likely significantly more accurate

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Add string-typed groupby columns as tie-breakers so rows with the same
duration sum have reproducible ordering. Regenerate all reference xlsx.
@gphuang gphuang force-pushed the 518-add-perf-model-coverage-for-deepep-ep-ops-communication branch from e509dff to 9791cbb Compare March 26, 2026 12:40
@gphuang gphuang closed this Mar 26, 2026
@gphuang gphuang reopened this Mar 26, 2026
@ajassani
Collaborator

ajassani commented Apr 1, 2026

Review Notes

Looks solid overall — clean model, good test coverage. A few known limitations to document as TODOs for follow-up:

Known Limitations

  1. Effective vs true interconnect BW — The current TB/s is effective throughput (bytes / total kernel time), which includes straggler sync/wait time. True interconnect BW (excluding sync overhead) requires tree perf / NCCL analyzer integration — planned for later.

  2. Label as interconnect BW, not HBM — These ops move data over the interconnect, not HBM. Please add a column named Effective Interconnect TB/s in the report to distinguish from HBM-based TB/s used by other ops. "Effective" signals that the link isn't active for the full duration (straggler wait, sync overhead).

  3. Bytes may be an upper bound — bytes() assumes every token crosses the interconnect, but tokens routed to a local expert stay on-rank. Need to check what routing metadata is actually available in real traces.

Please add a note in the EPComm docstring flagging these so they are tracked.
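Limitations 1 and 3 above can be made concrete with a small sketch. Both function names and the `local_fraction` parameter are hypothetical — real traces may not expose routing metadata at all:

```python
# Hedged sketch of the review's limitations. 'local_fraction' is a
# hypothetical parameter, not metadata known to exist in trace args.
def effective_interconnect_tbps(nbytes, kernel_time_s):
    """Effective BW: bytes / total kernel time. Includes straggler
    sync/wait, so it underestimates true link bandwidth."""
    return nbytes / kernel_time_s / 1e12

def ep_bytes_crossing(num_tokens, hidden_dim, bpe, local_fraction=0.0):
    """Upper bound when local_fraction=0; tokens routed to a local
    expert stay on-rank and never cross the interconnect."""
    crossing = round(num_tokens * (1.0 - local_fraction))
    return crossing * hidden_dim * bpe

full = ep_bytes_crossing(8192, 4096, 2)                      # upper bound
adj = ep_bytes_crossing(8192, 4096, 2, local_fraction=0.125)
print(full, adj, adj < full)  # 67108864 58720256 True
```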

Request

Could you DM @ajassani an example DeepEP trace? Want to verify what metadata is available in the trace event args.

Address review feedback from @ajassani:
- Effective vs true interconnect BW (includes straggler sync/wait)
- Label as interconnect BW, not HBM (need Effective Interconnect TB/s column)
- bytes() may be an upper bound (local-expert tokens don't cross interconnect)
@gphuang
Contributor Author

gphuang commented Apr 2, 2026

Thanks for the review @ajassani — all three known limitations are now documented in the EPComm docstring (commit 8a5f105):

  1. Effective vs true interconnect BW — noted that current TB/s includes straggler sync/wait; true interconnect BW requires tree-perf / NCCL-analyzer integration (planned follow-up).
  2. Label as interconnect BW, not HBM — documented the need for an "Effective Interconnect TB/s" column to distinguish from HBM-based TB/s.
  3. bytes() upper bound — noted that tokens routed to local experts stay on-rank; need to check what routing metadata is available in real trace args.

Will DM you an example DeepEP trace.



Development

Successfully merging this pull request may close these issues.

Add coverage for DeepEP* expert-parallel ops (Communication / EP)

4 participants