Greptile Summary
This PR fixes two distinct bugs in the Key changes:
Confidence Score: 4/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[mayUseTmaOuter] -->|n_inputs, dtype| B{Min tiles fit in smem?}
B -->|No| REJECT[Reject: fall back to non-TMA]
B -->|Yes| C{Welford op?}
C -->|Yes| REJECT
C -->|No| ACCEPT[Accept]
ACCEPT --> D[getReductionHeuristics]
D --> E[Compute smem_per_input = smem_bytes / n_inputs]
E --> F{tma_tile_r * tma_tile_i * dtype > budget?}
F -->|Yes, shrink r| F
F -->|Yes, shrink i| F
F -->|No| G[Compute iter_unroll_factor = tma_tile_i / bdimx]
G --> H[scheduleReduction]
H --> H1[cacheInputs → TMA TVs for each input]
H1 --> H2[canonicalizeReduction → R,I form on reduction_tv]
H2 --> H3[Propagate R,I to ALL TVs incl. all TMA TVs]
H3 --> H4[Apply TMA tiling splits to tma_tvs 0]
H4 --> H5[Propagate tiling from tma_tvs 0 → all TVs]
H5 --> H6[Parallelize all TMA TVs with parallelizeAllLike]
H6 --> H7[Sub-split redu_tv into thread dims]
H7 --> H8{iter_unroll_factor > 1?}
H8 -->|Yes| H9[axis 6 = Vectorize → Group for iter-grouped reduction]
H8 -->|No| H10[axis 6 = Serial → regular grid reduction]
H9 --> H11[rFactorHelper for grid reduction]
H10 --> H11
H11 --> H12[Propagate to non-TMA TVs]
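
The tile-sizing steps in the flowchart (nodes E through G) can be illustrated with a small sketch. This is not the scheduler's actual code: the function name `pick_tma_tile`, the `min_tile_r` parameter, the halving step, and the choice to shrink the reduction tile before the iteration tile are all assumptions made for illustration. Only the per-input budget `smem_bytes / n_inputs`, the fit test `tma_tile_r * tma_tile_i * dtype > budget`, and `iter_unroll_factor = tma_tile_i / bdimx` come from the flowchart.

```python
def pick_tma_tile(smem_bytes, n_inputs, dtype_bytes, tile_r, tile_i, bdimx,
                  min_tile_r=1):
    """Shrink a (tile_r, tile_i) TMA tile until it fits the per-input budget.

    Hypothetical helper: the halving policy and the r-before-i ordering are
    assumptions, not taken from the PR.
    """
    # Each input gets an equal share of shared memory (flowchart node E).
    budget = smem_bytes // n_inputs

    # Flowchart node F: keep shrinking while the tile exceeds the budget.
    while tile_r * tile_i * dtype_bytes > budget:
        if tile_r > min_tile_r:
            tile_r //= 2          # assumed: shrink the reduction tile first
        elif tile_i > bdimx:
            tile_i //= 2          # then shrink the iteration tile
        else:
            raise ValueError("minimum tile does not fit in the per-input budget")

    # Flowchart node G: elements of the iteration tile handled per thread.
    iter_unroll_factor = tile_i // bdimx
    return tile_r, tile_i, iter_unroll_factor


if __name__ == "__main__":
    # Two fp32 inputs sharing 96 KiB of shared memory.
    print(pick_tma_tile(96 * 1024, 2, 4, 128, 128, 32))
```

With two inputs the per-input budget halves, so a tile that fits for a single input may need to shrink. A resulting `iter_unroll_factor` of 1 corresponds to the "Serial → regular grid reduction" branch at node H8.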
Reviews (2): Last reviewed commit: "Disable vectorization when its factor is..."
Fix a canonicalization issue with reduction_outer_tma, and support fusions with multiple inputs. Update TmaOuterReductionTest to cover these new cases. Also disable Welford ops for inner+outer TMA, since supporting them is complicated.
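
As a rough illustration of the eligibility check described above and in the flowchart (nodes A through C), the sketch below rejects a fusion when the minimum tiles for all inputs do not fit in shared memory, or when a Welford op is present. The function name `may_use_tma_outer` here is a Python stand-in, and the `min_tile_r` and `min_tile_i` defaults are assumed values, not figures from the PR.

```python
def may_use_tma_outer(smem_bytes, n_inputs, dtype_bytes, has_welford,
                      min_tile_r=1, min_tile_i=32):
    """Return True if the outer-TMA path looks usable (illustrative only).

    min_tile_r and min_tile_i are hypothetical minimum tile extents.
    """
    # Node B: every input needs at least one minimum-sized tile in smem.
    min_bytes = min_tile_r * min_tile_i * dtype_bytes * n_inputs
    if min_bytes > smem_bytes:
        return False  # reject: fall back to the non-TMA scheduler

    # Node C: Welford ops are not supported on this path.
    if has_welford:
        return False

    return True


if __name__ == "__main__":
    print(may_use_tma_outer(96 * 1024, 2, 4, has_welford=False))
```

Both rejections lead to the same fallback, so the order of the two checks only matters for cost; the sketch follows the order drawn in the flowchart.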