**Greptile Summary**

This PR expands the multi-GPU reading documentation by splitting the existing API example into a "Raw API" and a "DTensor API" subsection, adding a new "DTensor Representation Limitations" section (covering non-outermost sharding, non-uniform sharding, and ring-based decomposition), and introducing two new illustrative figures.

**Confidence Score: 4/5** — Safe to merge after fixing the broken markdown link; all other changes are documentation improvements. One P1 syntax defect (a double opening parenthesis in the anchor link) would produce a broken hyperlink in the rendered documentation. Everything else is well-written content and image additions.

**Important Files Changed:** `doc/reading/multigpu.md` — fix the malformed link at line 113.
**Flowchart**

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User provides DTensors / raw sharding annotations] --> B[nvFuser derives multi-GPU schedule]
    B --> C{DTensor can represent sharding?}
    C -- Yes --> D[Use DTensor annotation directly]
    C -- No --> E[Fall back to raw nvFuser schedule\ne.g. non-outermost / non-uniform sharding]
    D --> F[Sharding Propagation]
    E --> F
    F --> G[Communication-Computation Decomposition]
    G --> H[Segmentation + Intra-GPU Scheduling]
    H --> I[Device Lowering + Host IR Lowering]
    I --> J[CUDA Kernels + NCCL Collectives]
```
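The flowchart's decision node ("DTensor can represent sharding?") can be illustrated with a small predicate. This is a hypothetical sketch, not nvFuser's actual API: the function name `dtensor_representable` and its two rules are invented here, based only on the limitations the summary names (non-outermost and non-uniform sharding fall back to the raw schedule).

```python
def dtensor_representable(shard_dim: int, extent: int, num_devices: int) -> bool:
    """Hypothetical check mirroring the flowchart's decision node.

    Assumes (per the summary's "DTensor Representation Limitations") that a
    DTensor-style annotation requires outermost, uniform sharding; anything
    else falls back to a raw nvFuser schedule.
    """
    outermost = shard_dim == 0                 # non-outermost sharding falls back
    uniform = extent % num_devices == 0        # non-uniform sharding falls back
    return outermost and uniform


print(dtensor_representable(0, 8192, 8))   # True: outermost, divides evenly
print(dtensor_representable(1, 8192, 8))   # False: non-outermost sharding
print(dtensor_representable(0, 8193, 8))   # False: non-uniform sharding
```

The real decision in nvFuser is made during schedule derivation and is more involved; the point here is only the shape of the fallback rule.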
**Reviews (6)** — last reviewed commit: "Minor"
`doc/reading/multigpu.md` (Outdated)
```markdown
#### Non-uniform sharding


#### Computation
```
**Empty placeholder sections published**

`#### Non-uniform sharding` (line 163) and `#### Computation` (line 165) are section headers with no body text. Readers who reach these headings will hit dead ends with no explanation. If content is planned for a follow-up PR, consider adding a brief sentence (e.g., "TODO: to be documented") or deferring the headers entirely until the content is ready, to avoid confusion.
```python
# annotate intermediate and output tensors if/when they need more control.
# It's not necessary for this particular example.
```
```python
inp = torch.randn(b * s, h, device="cuda")
up_w = torch.randn(h * 4 // d, h, device="cuda")
down_w = torch.randn(h, h * 4 // d, device="cuda")
(out,) = fd.execute([inp, up_w, down_w])
# `out` is a torch.Tensor of shape [b * s, h].
```
**Variables `b` and `s` used but never defined**

`h` and `d` are defined at the top of the snippet (`h = 12288`, `d = dist.get_world_size()`), but `b` (batch size) and `s` (sequence length) appear on line 67 without being introduced. For consistency, consider adding their definitions alongside `h` and `d`, e.g.:
```python
h = 12288
b, s = 4, 2048
d = dist.get_world_size()
```
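With those definitions in place, the shapes in the quoted snippet can be checked with plain arithmetic. The sketch below uses the reviewer's suggested values for `h`, `b`, and `s`; `d = 8` is an assumed stand-in for `dist.get_world_size()`, since the actual world size depends on the launch configuration.

```python
# Shape walk-through for the sharded MLP example (values per the review
# suggestion; d = 8 is an assumption standing in for dist.get_world_size()).
h = 12288
b, s = 4, 2048
d = 8

inp_shape = (b * s, h)            # input activations: [b*s, h]
up_w_shape = (h * 4 // d, h)      # per-device shard of the up-projection
down_w_shape = (h, h * 4 // d)    # per-device shard of the down-projection

# inp @ up_w.T -> [b*s, 4h/d], then @ down_w.T -> [b*s, h]
hidden_shape = (inp_shape[0], up_w_shape[0])
out_shape = (hidden_shape[0], down_w_shape[0])
print(out_shape)  # (8192, 12288), matching "`out` is ... shape [b * s, h]"
```

This confirms the comment in the diff: the output has shape `[b * s, h]`, the same as the input.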