fix(kernel): fix NaN in use_gate_in_kernel=True path (cache_key, tail mask)#43
meinie0826 wants to merge 3 commits into inclusionAI:main
Conversation
@icavanyu Could you take a look at this PR when you have time and give some feedback?
Hi, thank you for your contribution! As the filename suggests, cula/ops/kda_fully_fused_wip.py is still a work in progress, and there may be bugs, accuracy issues, or performance issues in it. The KDA fully fused version requires TF32 sub-chunk MMA and TF32 inverse. Would you be interested in tackling these two issues?
Thanks for the heads-up! I'm definitely interested in tackling both the TF32 sub-chunk MMA and the TF32 inverse, and happy to follow up on those as next steps. That said, this PR fixes two independent correctness bugs (cache_key missing g_dtype, and tail-chunk boundary masking) that were causing NaN outputs across all use_gate_in_kernel=True test cases. All 19 tests now pass on B300, which is a meaningful step forward for the WIP kernel. Would it be possible to merge this NaN-fix PR first? I'll open a follow-up PR to address the TF32 sub-chunk MMA and TF32 inverse improvements.
- add g_dtype to cache_key to prevent reusing the wrong compiled kernel when the g dtype differs between the use_gate_in_kernel=True/False paths
- apply a boundary mask for the non-varlen tail chunk in the subchunk epilogue (valid_len_chunk was hardcoded to C, so the tail's padding was never masked)
- remove xfail markers from the use_gate_in_kernel=True test cases
📌 Description
Fix NaN outputs in the use_gate_in_kernel=True code path of the Blackwell KDA fused forward kernel (KDAChunkwise).
Root causes identified and fixed:
cache_key missing g_dtype (blackwell_fused_fwd.py): When use_gate_in_kernel=False, g is float32; when use_gate_in_kernel=True, kda_gate_chunk_cumsum may return a different dtype. Without g_dtype in the cache key, the first compiled kernel (built for float32) was silently reused for the other dtype, so TMA read SMEM with the wrong element size → NaN.
Missing boundary mask for the non-varlen tail chunk in the subchunk epilogue: valid_len_chunk was hardcoded to C for non-varlen sequences, so the last partial chunk (e.g. T=1500 leaves a 28-token tail) was never masked in apply_qk_kk_mask, leaving invalid MMA accumulator values → NaN.
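The tail-chunk length computation can be sketched as below; `C` and `chunk_idx` are illustrative names, and a chunk size of 64 is assumed only to reproduce the 28-token tail from the description:

```python
# Hypothetical sketch of the boundary-mask fix. For non-varlen sequences
# valid_len_chunk was hardcoded to the chunk size C; the last partial
# chunk must instead be clipped to the tokens that actually remain, so
# padded positions can be masked out of the MMA accumulator.
def valid_len_chunk(T, C, chunk_idx):
    """Number of valid tokens in chunk `chunk_idx` of a length-T sequence."""
    return min(C, T - chunk_idx * C)
```

Assuming C=64, a T=1500 sequence has 23 full chunks and a final chunk of only 28 valid tokens, which is exactly the tail case that was going unmasked.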
All 19 tests passed.
🔍 Related Issues
#16
🚀 Pull Request Checklist
Thank you for contributing to cuLA! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed pre-commit by running pip install pre-commit (or used your preferred method).
- Ran pre-commit install.
- Ran pre-commit run --all-files and fixed any reported issues.
🧪 Tests
⚡ Performance
Reviewer Notes