
Add perf model coverage for DeepEP EP communication ops#520

Open
gphuang wants to merge 19 commits into main from
518-add-perf-model-coverage-for-deepep-ep-ops-communication

Conversation

@gphuang
Contributor

@gphuang gphuang commented Mar 10, 2026

Closes #518

Summary

DeepEPDispatch, DeepEPCombine, DeepEPDispatchBackward, and DeepEPCombineBackward are all-to-all token-routing ops used by Primus Turbo's Deep Expert Parallelism (EP) in MoE models such as DeepSeek V2 Lite and GPT-OSS-20B. They accounted for ~23.8% of total GPU time in DeepSeek V2 Lite and ~7.9% in GPT-OSS-20B, but had zero perf model coverage and were counted as other. This PR adds an EP_Communication category and a bandwidth model for all four ops.

Changes

perf_model.py — new EPComm base class and four concrete subclasses:

  • EPComm — base for EP communication ops; flops() = 0, bytes() = num_tokens × hidden_dim × bpe (pure all-to-all, no matrix multiply).
  • deepep_dispatch — parses DeepEPDispatch; Input Dims[0] = (num_tokens_local, hidden_dim).
  • deepep_combine — parses DeepEPCombine; Input Dims[0] = (num_tokens_dispatched, hidden_dim).
  • deepep_dispatch_backward / deepep_combine_backward — backward passes; same shape conventions as their forward counterparts in reverse.
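The perf model described above can be sketched as follows. This is a minimal stand-in based only on the PR summary — class names and the dtype table are assumptions, not the actual perf_model.py implementation:

```python
# Sketch of the EPComm perf model summarized above. Class/method names
# follow the PR description; the dtype->bpe table is an assumption.
DTYPE_BYTES = {"BFloat16": 2, "Float": 4}  # bytes per element (bpe)

class EPComm:
    """Base perf model for DeepEP all-to-all communication ops."""

    def __init__(self, num_tokens, hidden_dim, dtype):
        self.num_tokens = num_tokens
        self.hidden_dim = hidden_dim
        # Unrecognized dtypes leave bpe as None so bytes() propagates
        # None instead of a silently wrong bandwidth estimate.
        self.bpe = DTYPE_BYTES.get(dtype)

    def flops(self):
        return 0  # pure all-to-all routing, no matrix multiply

    def bytes(self):
        if self.bpe is None:
            return None
        return self.num_tokens * self.hidden_dim * self.bpe

class deepep_dispatch(EPComm):
    """Input Dims[0] = (num_tokens_local, hidden_dim)."""

op = deepep_dispatch(num_tokens=4096, hidden_dim=2048, dtype="BFloat16")
print(op.flops(), op.bytes())  # 0 16777216
```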

torch_op_mapping.py — registers all four ops in op_to_perf_model_class_map, adds EPComm: "EP_Communication" to dict_base_class2category, and adds an EP_Communication branch in categorize_torch_op so these ops are no longer counted as other.
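The registration and categorization flow might look roughly like this. Dict and function names are taken from the PR text, but the bodies here are illustrative stand-ins, not the real torch_op_mapping.py:

```python
# Hedged sketch of the mapping/categorization described above; the op
# classes are empty placeholders for the real perf models.
class EPComm: ...
class deepep_dispatch(EPComm): ...
class deepep_combine(EPComm): ...
class deepep_dispatch_backward(EPComm): ...
class deepep_combine_backward(EPComm): ...

op_to_perf_model_class_map = {
    "DeepEPDispatch": deepep_dispatch,
    "DeepEPCombine": deepep_combine,
    "DeepEPDispatchBackward": deepep_dispatch_backward,
    "DeepEPCombineBackward": deepep_combine_backward,
}
dict_base_class2category = {EPComm: "EP_Communication"}

def categorize_torch_op(op_name):
    """Return the report category for an op name ('other' if unmapped)."""
    cls = op_to_perf_model_class_map.get(op_name)
    if cls is not None and issubclass(cls, EPComm):
        return dict_base_class2category[EPComm]
    return "other"

print(categorize_torch_op("DeepEPDispatch"))  # EP_Communication
print(categorize_torch_op("aten::mm"))        # other
```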

Tests

tests/test_deepep_ops.py — 19 tests covering op mapping, EP_Communication categorization, bytes() values for all four ops (BF16 and FP32), flops() = 0, relative magnitude (combine > dispatch), unknown dtype handling, and inheritance from EPComm.
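The flavor of those checks can be sketched with a minimal stand-in for the bytes model (the real tests import the actual classes; the top-k replication argument for combine > dispatch is an assumption, not stated in the PR):

```python
# Stand-in for bytes() used to illustrate the BF16/FP32 and
# combine-vs-dispatch checks; not the actual test module.
def ep_bytes(num_tokens, hidden_dim, bpe):
    return num_tokens * hidden_dim * bpe

# bytes() for BF16 (2 B/elem) vs FP32 (4 B/elem)
assert ep_bytes(1024, 4096, 2) == 8_388_608
assert ep_bytes(1024, 4096, 4) == 16_777_216

# Relative magnitude: combine operates on dispatched tokens, which
# (assuming top-k > 1 routing replicates tokens) exceed local tokens.
num_local, topk = 1024, 6
assert ep_bytes(num_local * topk, 4096, 2) > ep_bytes(num_local, 4096, 2)
print("ok")
```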

Note

The EPComm base class currently sits in perf_model.py. The docstring explains the placement rationale:

  • EPComm ops are bandwidth-bound (TB/s), not compute-bound — flops() = 0 is intentional
  • The natural alternative would be alongside NcclAnalyser, but DeepEP uses intra-node shared memory, not NCCL
  • There's no existing EP-specific comm analysis module to extend
  • Placing it in perf_model.py gives it a named category (EP_Communication) and bandwidth metric in the report
  • This is consistent with other Primus-specific ops (MoEDispatch, causal_conv1d, FusedRoPE, CrossEntropy) which also live in core as simple perf models

@gphuang gphuang linked an issue Mar 10, 2026 that may be closed by this pull request
3 tasks
@gphuang gphuang marked this pull request as ready for review March 10, 2026 10:23
Copilot AI review requested due to automatic review settings March 10, 2026 10:23
@gphuang gphuang self-assigned this Mar 10, 2026
@gphuang gphuang added the perf_model Add performance model for calculating TFLOPS/s and TB/s label Mar 10, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds performance-model coverage and categorization for DeepEP expert-parallel token-routing communication ops so they’re no longer reported as other in TraceLens’ perf reports.

Changes:

  • Introduces a new EPComm perf-model base class plus four DeepEP-specific subclasses with flops() = 0 and a bytes-moved model derived from (num_tokens, hidden_dim, dtype).
  • Registers DeepEPDispatch/Combine (and their backward variants) in op_to_perf_model_class_map and categorizes them under the new EP_Communication category.
  • Adds a new pytest module with mapping, categorization, and bytes()/dtype coverage tests for the DeepEP ops.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
TraceLens/PerfModel/perf_model.py Adds EPComm + DeepEP subclasses and implements a bytes-moved model for EP communication ops.
TraceLens/PerfModel/torch_op_mapping.py Maps DeepEP op names to the new perf-model classes and introduces the EP_Communication category.
tests/test_deepep_ops.py New unit tests validating mapping, categorization, and bytes()/flops() behavior for DeepEP ops.


Comment thread tests/test_deepep_ops.py Outdated
Comment thread TraceLens/PerfModel/torch_op_mapping.py
Comment thread TraceLens/PerfModel/perf_model.py
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.



Comment thread TraceLens/PerfModel/torch_op_mapping.py
@gphuang gphuang changed the title feat: add EP_Communication perf model for DeepEP dispatch/combine ops feat: add coverage for Deep EP Communication ops Mar 10, 2026
@gphuang gphuang requested review from ajassani and olehtika March 11, 2026 10:23
@gphuang gphuang marked this pull request as draft March 11, 2026 11:46
@gphuang gphuang changed the title feat: add coverage for Deep EP Communication ops feat: add coverage for Deep EP ops (Communication ) Mar 11, 2026
@gphuang gphuang changed the title feat: add coverage for Deep EP ops (Communication ) feat: add coverage for Deep EP ops Mar 11, 2026
@gphuang gphuang changed the title feat: add coverage for Deep EP ops feat: add coverage for DeepEP ops Mar 11, 2026
gphuang added a commit that referenced this pull request Mar 19, 2026
Closes #540, closes #541, closes #542, closes #543

- Map ck_grouped_gemm and ck_grouped_gemm_variable_k to existing
  primus_turbo perf models (reuses same Input Dims layout) (#540)
- Categorize SSM/Mamba ops (MambaSplitConv1dScanCombinedFn,
  DaoAILab::_causal_conv1d_*_cpp) as SSM_fwd/SSM_bwd (#541)
- Categorize MoE dispatch/combine ops (MoEDispatch, MoECombine,
  TokenPermuteMaskMap, _OperationFuserAutogradFunction) as
  MoE_comm_fwd/MoE_comm_bwd (#542, reuses EPComm perf model from #518/#520)
- Categorize FusedRoPEFunc as RoPE_fwd/RoPE_bwd (#543)
- Categorize CrossEntropyFunction as CrossEntropy_fwd/CrossEntropy_bwd (#543)

Made-with: Cursor
@gphuang gphuang force-pushed the 518-add-perf-model-coverage-for-deepep-ep-ops-communication branch from 03cf456 to a8fdca8 Compare March 19, 2026 11:56
@gphuang gphuang changed the title feat: add coverage for DeepEP ops Add perf model coverage for DeepEP EP communication ops Mar 19, 2026
gphuang added a commit that referenced this pull request Mar 19, 2026
@gphuang gphuang marked this pull request as ready for review March 19, 2026 13:36
gphuang added a commit that referenced this pull request Mar 19, 2026
@gphuang gphuang requested a review from Copilot March 19, 2026 15:00
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



Comment thread TraceLens/PerfModel/perf_model.py
Comment thread TraceLens/PerfModel/perf_model.py
gphuang added a commit that referenced this pull request Mar 24, 2026
@gphuang gphuang force-pushed the 518-add-perf-model-coverage-for-deepep-ep-ops-communication branch from 9d3a26e to ed7bd83 Compare March 24, 2026 08:47
gphuang added 4 commits March 24, 2026 09:00
Add EPComm base class and four concrete classes (deepep_dispatch,
deepep_combine, deepep_dispatch_backward, deepep_combine_backward) to
perf_model.py.  Register all four ops in torch_op_mapping.py under the
new EP_Communication category so they are no longer counted as 'other'
in perf reports.

bytes() = token-tensor volume (num_tokens * hidden_dim * bpe);
flops() = 0 (pure all-to-all communication, no matrix multiply).

Closes #518
- Remove unused `import pytest` from test_deepep_ops.py.
- Update categorize_torch_op docstring to list EP_Communication and
  record_param_comms as valid return values.
- Remove silent BF16 fallback in EPComm._parse_token_tensor: bpe is
  now kept as None when the dtype string is absent or unrecognised,
  and bytes() propagates None rather than producing a silently wrong
  bandwidth estimate. Add test_deepep_unknown_dtype_returns_none to
  cover this path.
Copilot review: EPComm lacked flops_bwd()/bytes_bwd() methods which
would cause AttributeError when TreePerfAnalyzer computes backward
perf metrics. Added both (delegate to flops()/bytes() — symmetric
for comm ops). Also corrected docstring: TFLOPS/s is 0, not nan,
when flops()=0 and kernel time > 0.
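The backward delegation described in that commit can be sketched as below. Method names follow the commit message; the constructor and bodies are illustrative stand-ins:

```python
# Sketch: for symmetric all-to-all comm ops, backward traffic equals
# forward traffic, so the backward methods simply delegate.
class EPComm:
    def __init__(self, nbytes):
        self._nbytes = nbytes  # stand-in for the parsed token volume

    def flops(self):
        return 0

    def bytes(self):
        return self._nbytes

    # Without these, backward perf-metric computation (as in
    # TreePerfAnalyzer) would hit AttributeError on EP comm ops.
    def flops_bwd(self):
        return self.flops()

    def bytes_bwd(self):
        return self.bytes()

op = EPComm(nbytes=16_777_216)
print(op.bytes_bwd() == op.bytes())  # True
```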
@gphuang gphuang force-pushed the 518-add-perf-model-coverage-for-deepep-ep-ops-communication branch from ed7bd83 to b0eafa2 Compare March 24, 2026 09:00
DeepEP ops (DeepEPDispatch, DeepEPCombine, and backward variants) are
Primus/Megatron-specific and don't belong in core TraceLens alongside
universal aten:: ops.

- Keep EPComm base class in perf_model.py (reusable abstraction)
- Move four concrete subclasses to example_megatron_extension.py
- Register via perf_model_extension and dict_cat2names_extension
- Remove DeepEP entries from core op_to_perf_model_class_map
- Update tests to import from extension, verify not in core mapping
@gphuang gphuang marked this pull request as draft March 25, 2026 07:02
@gphuang gphuang marked this pull request as ready for review March 25, 2026 07:21
gabeweisz and others added 3 commits March 25, 2026 15:17
Fixes #533

The performance model is currently bare-bones, based on O(n) operations
being strictly necessary for these reductions.

It could be more tuned but that would depend significantly on the
implementation and the theoretical limit is something like n/2 + log(n)
at a minimum which is not likely significantly more accurate

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Add string-typed groupby columns as tie-breakers so rows with the same
duration sum have reproducible ordering. Regenerate all reference xlsx.
@gphuang gphuang force-pushed the 518-add-perf-model-coverage-for-deepep-ep-ops-communication branch from e509dff to 9791cbb Compare March 26, 2026 12:40
@gphuang gphuang closed this Mar 26, 2026
@gphuang gphuang reopened this Mar 26, 2026
@ajassani
Collaborator

ajassani commented Apr 1, 2026

Review Notes

Looks solid overall — clean model, good test coverage. A few known limitations to document as TODOs for follow-up:

Known Limitations

  1. Effective vs true interconnect BW — The current TB/s is effective throughput (bytes / total kernel time), which includes straggler sync/wait time. True interconnect BW (excluding sync overhead) requires tree perf / NCCL analyzer integration — planned for later.

  2. Label as interconnect BW, not HBM — These ops move data over the interconnect, not HBM. Please add a column named Effective Interconnect TB/s in the report to distinguish from HBM-based TB/s used by other ops. "Effective" signals that the link isn't active for the full duration (straggler wait, sync overhead).

  3. Bytes may be an upper bound — bytes() assumes every token crosses the interconnect, but tokens routed to a local expert stay on-rank. Need to check what routing metadata is actually available in real traces.

Please add a note in the EPComm docstring flagging these so they are tracked.
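Limitations 1 and 3 above can be made concrete with a small sketch. Both function names and the `local_fraction` parameter are hypothetical — real traces may not expose routing metadata at all:

```python
# Hedged sketch of the review's limitations. 'local_fraction' is a
# hypothetical parameter, not metadata known to exist in trace args.
def effective_interconnect_tbps(nbytes, kernel_time_s):
    """Effective BW: bytes / total kernel time. Includes straggler
    sync/wait, so it underestimates true link bandwidth."""
    return nbytes / kernel_time_s / 1e12

def ep_bytes_crossing(num_tokens, hidden_dim, bpe, local_fraction=0.0):
    """Upper bound when local_fraction=0; tokens routed to a local
    expert stay on-rank and never cross the interconnect."""
    crossing = round(num_tokens * (1.0 - local_fraction))
    return crossing * hidden_dim * bpe

full = ep_bytes_crossing(8192, 4096, 2)                      # upper bound
adj = ep_bytes_crossing(8192, 4096, 2, local_fraction=0.125)
print(full, adj, adj < full)  # 67108864 58720256 True
```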

Request

Could you DM @ajassani an example DeepEP trace? Want to verify what metadata is available in the trace event args.

Address review feedback from @ajassani:
- Effective vs true interconnect BW (includes straggler sync/wait)
- Label as interconnect BW, not HBM (need Effective Interconnect TB/s column)
- bytes() may be an upper bound (local-expert tokens don't cross interconnect)
@gphuang
Contributor Author

gphuang commented Apr 2, 2026

Thanks for the review @ajassani — all three known limitations are now documented in the EPComm docstring (commit 8a5f105):

  1. Effective vs true interconnect BW — noted that current TB/s includes straggler sync/wait; true interconnect BW requires tree-perf / NCCL-analyzer integration (planned follow-up).
  2. Label as interconnect BW, not HBM — documented the need for an "Effective Interconnect TB/s" column to distinguish from HBM-based TB/s.
  3. bytes() upper bound — noted that tokens routed to local experts stay on-rank; need to check what routing metadata is available in real trace args.

Will DM you an example DeepEP trace.



Development

Successfully merging this pull request may close these issues.

Add coverage for DeepEP* expert-parallel ops (Communication / EP)

4 participants