**Greptile Summary**

This PR expands the multi-GPU reading documentation by splitting the existing API example into a "Raw API" and a "DTensor API" subsection, adding a new "DTensor Representation Limitations" section (covering non-outermost sharding, non-uniform sharding, and ring-based decomposition), and introducing two new illustrative figures.

**Confidence Score: 4/5** — Safe to merge after fixing the broken markdown link; all other changes are documentation improvements. One P1 syntax defect (a double opening parenthesis in the anchor link) would produce a broken hyperlink in the rendered documentation. Everything else is well-written content and image additions.

**Important Files Changed:** `doc/reading/multigpu.md` — fix the malformed link at line 113.
**Flowchart**

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User provides DTensors / raw sharding annotations] --> B[nvFuser derives multi-GPU schedule]
    B --> C{DTensor can represent sharding?}
    C -- Yes --> D[Use DTensor annotation directly]
    C -- No --> E[Fall back to raw nvFuser schedule\ne.g. non-outermost / non-uniform sharding]
    D --> F[Sharding Propagation]
    E --> F
    F --> G[Communication-Computation Decomposition]
    G --> H[Segmentation + Intra-GPU Scheduling]
    H --> I[Device Lowering + Host IR Lowering]
    I --> J[CUDA Kernels + NCCL Collectives]
```
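The flowchart's decision node ("DTensor can represent sharding?") can be illustrated with a small predicate. This is a hypothetical sketch, not nvFuser's actual API: the function name `dtensor_representable` and its two rules are invented here, based only on the limitations the summary names (non-outermost and non-uniform sharding fall back to the raw schedule).

```python
def dtensor_representable(shard_dim: int, extent: int, num_devices: int) -> bool:
    """Hypothetical check mirroring the flowchart's decision node.

    Assumes (per the summary's "DTensor Representation Limitations") that a
    DTensor-style annotation requires outermost, uniform sharding; anything
    else falls back to a raw nvFuser schedule.
    """
    outermost = shard_dim == 0                 # non-outermost sharding falls back
    uniform = extent % num_devices == 0        # non-uniform sharding falls back
    return outermost and uniform


print(dtensor_representable(0, 8192, 8))   # True: outermost, divides evenly
print(dtensor_representable(1, 8192, 8))   # False: non-outermost sharding
print(dtensor_representable(0, 8193, 8))   # False: non-uniform sharding
```

The real decision in nvFuser is made during schedule derivation and is more involved; the point here is only the shape of the fallback rule.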
**Reviews (6)** — last reviewed commit: "Minor"
`doc/reading/multigpu.md` (Outdated)
```markdown
#### Non-uniform sharding


#### Computation
```
**Empty placeholder sections published**

`#### Non-uniform sharding` (line 163) and `#### Computation` (line 165) are section headers with no body text. Readers who reach these headings will hit dead ends with no explanation. If content is planned for a follow-up PR, consider adding a brief sentence (e.g., "TODO: to be documented") or deferring the headers entirely until the content is ready, to avoid confusion.
```python
# annotate intermediate and output tensors if/when they need more control.
# It's not necessary for this particular example.
```
```python
inp = torch.randn(b * s, h, device="cuda")
up_w = torch.randn(h * 4 // d, h, device="cuda")
down_w = torch.randn(h, h * 4 // d, device="cuda")
(out,) = fd.execute([inp, up_w, down_w])
# `out` is a torch.Tensor of shape [b * s, h].
```
**Variables `b` and `s` used but never defined**

`h` and `d` are defined at the top of the snippet (`h = 12288`, `d = dist.get_world_size()`), but `b` (batch size) and `s` (sequence length) appear on line 67 without being introduced. For consistency, consider adding their definitions alongside `h` and `d`, e.g.:
```python
h = 12288
b, s = 4, 2048
d = dist.get_world_size()
```
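With those definitions in place, the shapes in the quoted snippet can be checked with plain arithmetic. The sketch below uses the reviewer's suggested values for `h`, `b`, and `s`; `d = 8` is an assumed stand-in for `dist.get_world_size()`, since the actual world size depends on the launch configuration.

```python
# Shape walk-through for the sharded MLP example (values per the review
# suggestion; d = 8 is an assumption standing in for dist.get_world_size()).
h = 12288
b, s = 4, 2048
d = 8

inp_shape = (b * s, h)            # input activations: [b*s, h]
up_w_shape = (h * 4 // d, h)      # per-device shard of the up-projection
down_w_shape = (h, h * 4 // d)    # per-device shard of the down-projection

# inp @ up_w.T -> [b*s, 4h/d], then @ down_w.T -> [b*s, h]
hidden_shape = (inp_shape[0], up_w_shape[0])
out_shape = (hidden_shape[0], down_w_shape[0])
print(out_shape)  # (8192, 12288), matching "`out` is ... shape [b * s, h]"
```

This confirms the comment in the diff: the output has shape `[b * s, h]`, the same as the input.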