docs: expand multi-GPU DTensor limitations #6043

Open

wujingyue wants to merge 10 commits into main from wjy/dtensor

Conversation

@wujingyue
Collaborator

@wujingyue wujingyue commented Mar 30, 2026

Summary

  • expand the multi-GPU reading doc with a detailed section on overlapping communication with GEMM
  • reorganize and extend the DTensor limitations discussion, including non-outermost and non-uniform sharding
  • add the non-outermost sharding and overlap-iterations figures, and update the expert-parallelism figure

Testing

  • not run (docs only)

@greptile-apps
Contributor

greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR expands the multi-GPU reading documentation by splitting the existing API example into "Raw API" and "DTensor API" subsections, adding a new "DTensor Representation Limitations" section (covering non-outermost sharding, non-uniform sharding, and ring-based decomposition), and introducing two new illustrative figures (overlap_iterations.png, nonoutermost_sharding.png).

Key changes:

  • Raw API example is now a self-contained, runnable snippet inside a with FusionDefinition() block, including mesh annotation and sharding schedule.
  • DTensor API section clarifies that DTensors are used only as annotations and links to the new limitations section.
  • DTensor Representation Limitations documents three concrete gaps between DTensor semantics and nvFuser's SPMD model (non-outermost sharding, non-uniform/jagged sharding, ring-based decomposition), with the AlphaFold 3 backprop as a practical example for the first gap.
  • Broken markdown link at line 113: `[cannot yet represent]((#dtensor-representation-limitations)` has a double `((` and a missing closing `)`, which will render as a broken anchor in most Markdown processors.
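The non-uniform sharding gap mentioned above can be made concrete with a short sketch. DTensor's `Shard(dim)` placement follows `torch.chunk`-style even chunking, so per-device extents are fully determined by the global extent, whereas expert parallelism routes a data-dependent number of tokens to each device. The following is a minimal pure-Python illustration, not nvFuser or DTensor API; the function name is made up:

```python
def dtensor_even_shards(dim_size: int, num_devices: int) -> list[int]:
    """DTensor-style Shard(dim) placement (torch.chunk semantics):
    every shard gets ceil(dim_size / num_devices) elements except
    possibly the trailing ones, so shard sizes are fully determined
    by dim_size and num_devices alone."""
    chunk = -(-dim_size // num_devices)  # ceil division
    sizes, remaining = [], dim_size
    for _ in range(num_devices):
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes

# Expert parallelism produces data-dependent ("jagged") per-GPU extents
# after token routing, e.g.:
jagged = [1000, 37, 2048, 11]
# The only layout DTensor can express for the same total is uniform:
even = dtensor_even_shards(sum(jagged), 4)  # -> [774, 774, 774, 774]
# Since the jagged extents cannot be recovered from the global extent,
# this sharding has no DTensor representation.
```

This is why the limitations section falls back to raw nvFuser sharding annotations for such cases.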

Confidence Score: 4/5

Safe to merge after fixing the broken markdown link; all other changes are documentation improvements.

One P1 syntax defect (double opening parenthesis in the anchor link) would produce a broken hyperlink in the rendered documentation. Everything else is well-written content and image additions.

doc/reading/multigpu.md — fix the malformed link at line 113.

Important Files Changed

| Filename | Overview |
| --- | --- |
| doc/reading/multigpu.md | Documentation expanded with Raw API example, DTensor API clarification, overlap-iterations figure, and a new "DTensor Representation Limitations" section; contains a broken markdown link at line 113 (double opening parenthesis). |
| doc/reading/multigpu/nonoutermost_sharding.png | New figure illustrating the non-outermost sharding concept for the AlphaFold 3 backprop example; binary asset, no issues. |
| doc/reading/multigpu/overlap_iterations.png | New figure illustrating ring-based all-gather / linear decomposition across three GPUs; binary asset, no issues. |
| doc/reading/multigpu/expert_parallelism.png | Updated expert-parallelism figure; binary asset update, no issues. |
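The ring-based all-gather + GEMM decomposition that overlap_iterations.png depicts can be sketched in a few lines. This is a single-process NumPy simulation of the iteration structure only (which shard each rank computes on per step, while the next shard travels around the ring), not nvFuser's actual decomposition; the function name and ring bookkeeping are illustrative assumptions:

```python
import numpy as np

def ring_allgather_matmul(shards, w):
    """Simulate one rank's view of a ring-decomposed allgather+matmul:
    the d-iteration loop computes the matmul on the shard held this
    step, while (on real hardware) the next shard is sent/received on
    the ring concurrently, overlapping communication with GEMM."""
    d = len(shards)
    out = [None] * d
    have = 0  # index of the shard this rank currently holds
    for _ in range(d):
        # computation on the shard received in the previous step ...
        out[have] = shards[have] @ w
        # ... overlapped with passing the next shard around the ring
        have = (have - 1) % d
    return np.concatenate(out)

x = np.random.randn(6, 4)
w = np.random.randn(4, 5)
shards = np.split(x, 3)  # d = 3 row-shards, one per GPU
assert np.allclose(ring_allgather_matmul(shards, w), x @ w)
```

After d iterations every output shard has been produced, matching the unsharded `x @ w`; the benefit on real hardware is that each step's communication hides behind the previous step's GEMM.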

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User provides DTensors / raw sharding annotations] --> B[nvFuser derives multi-GPU schedule]
    B --> C{DTensor can represent sharding?}
    C -- Yes --> D[Use DTensor annotation directly]
    C -- No --> E[Fall back to raw nvFuser schedule\ne.g. non-outermost / non-uniform sharding]
    D --> F[Sharding Propagation]
    E --> F
    F --> G[Communication-Computation Decomposition]
    G --> H[Segmentation + Intra-GPU Scheduling]
    H --> I[Device Lowering + Host IR Lowering]
    I --> J[CUDA Kernels + NCCL Collectives]

Reviews (6): Last reviewed commit: "Minor"

Comment on lines +163 to +165
#### Non-uniform sharding

#### Computation

P2 Empty placeholder sections published

`#### Non-uniform sharding` (line 163) and `#### Computation` (line 165) are section headers with no body text. Readers who reach these headings will encounter them as dead ends with no explanation. If content is planned for a follow-up PR, consider adding a brief sentence (e.g., "TODO: to be documented") or deferring the header entirely until the content is ready, to avoid confusion.

Comment on lines +53 to +54
# annotate intermediate and output tensors if/when they needs more control.
# It's not necessary for this particular example.

P2 Subject-verb agreement typo

`they needs` should be `they need`.

Suggested change
# annotate intermediate and output tensors if/when they needs more control.
# It's not necessary for this particular example.
# annotate intermediate and output tensors if/when they need more control.

Comment on lines +67 to +71
inp = torch.randn(b * s, h, device="cuda")
up_w = torch.randn(h * 4 // d, h, device="cuda")
down_w = torch.randn(h, h * 4 // d, device="cuda")
(out,) = fd.execute([inp, up_w, down_w])
# `out` is a torch.Tensor of shape [b * s, h].

P2 Variables b and s used but never defined

h and d are defined at the top of the snippet (h = 12288, d = dist.get_world_size()), but b (batch size) and s (sequence length) appear on line 67 without being introduced. For consistency, consider adding their definitions alongside h and d, e.g.:

h = 12288
b, s = 4, 2048
d = dist.get_world_size()
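For completeness, here is a hedged shape-check of the snippet with the reviewer's suggested definitions filled in. It is plain Python: `d` is hard-coded as a stand-in for `dist.get_world_size()`, the nvFuser `fd.execute` call is omitted, and the MLP contraction (`inp @ up_w.T` then `@ down_w.T`) is assumed from the weight shapes:

```python
def matmul_shape(a_shape, b_shape):
    """Result shape of a 2-D matmul a @ b; raises on inner-dim mismatch."""
    assert a_shape[1] == b_shape[0], f"inner dims differ: {a_shape} @ {b_shape}"
    return (a_shape[0], b_shape[1])

h = 12288
b, s = 4, 2048            # the definitions the review suggests adding
d = 8                     # stand-in for dist.get_world_size()

inp = (b * s, h)          # [tokens, hidden]
up_w = (h * 4 // d, h)    # per-GPU shard of the up-projection weight
down_w = (h, h * 4 // d)  # per-GPU shard of the down-projection weight

hidden = matmul_shape(inp, (up_w[1], up_w[0]))      # inp @ up_w.T
out = matmul_shape(hidden, (down_w[1], down_w[0]))  # hidden @ down_w.T
assert out == (b * s, h)  # consistent with "`out` is a torch.Tensor of shape [b * s, h]"
```

With `b` and `s` defined, every shape in the snippet is well-formed and the output shape matches the comment in the original code.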

@wujingyue wujingyue changed the base branch from main to wjy/doc March 30, 2026 22:09
Base automatically changed from wjy/doc to main March 30, 2026 22:15