
fix(kernel): fix NaN in use_gate_in_kernel=True path (cache_key, tail mask) #43

Open

meinie0826 wants to merge 3 commits into inclusionAI:main from meinie0826:main

Conversation

@meinie0826 commented Apr 8, 2026

📌 Description

Fix NaN outputs in the use_gate_in_kernel=True code path of the Blackwell KDA fused forward kernel (KDAChunkwise).

Root causes identified and fixed:

cache_key missing g_dtype (blackwell_fused_fwd.py): when use_gate_in_kernel=False, g is float32, but when use_gate_in_kernel=True, kda_gate_chunk_cumsum may return a different dtype. Because g_dtype was absent from the cache key, the kernel compiled first (for float32 g) was reused for the other dtype, so TMA read SMEM with the wrong element size → NaN.
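The cache-key bug can be illustrated with a minimal sketch, assuming a compile cache keyed by a tuple of dtypes (the names `_compile_cache` and `get_compiled_kernel` are illustrative, not cuLA's actual API):

```python
# Hypothetical sketch of the cache-key bug and its fix: a kernel compile
# cache whose key omits the gate dtype reuses the kernel compiled for a
# float32 `g` even when `g` arrives in a different dtype, so the compiled
# TMA descriptors assume the wrong SMEM element size. Including g_dtype in
# the key forces a distinct compile per gate dtype.

_compile_cache = {}

def get_compiled_kernel(q_dtype, g_dtype, use_gate_in_kernel):
    # Buggy key: cache_key = (q_dtype, use_gate_in_kernel)
    # Fixed key includes g_dtype, so e.g. float32 and bfloat16 gates
    # map to separate compiled kernels.
    cache_key = (q_dtype, g_dtype, use_gate_in_kernel)
    if cache_key not in _compile_cache:
        # Stand-in for the real compilation step.
        _compile_cache[cache_key] = f"kernel<{q_dtype},{g_dtype},gate={use_gate_in_kernel}>"
    return _compile_cache[cache_key]
```

With the buggy key, the second call below would silently return the float32-gate kernel; with g_dtype in the key, each gate dtype gets its own compile.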

Missing boundary mask for the non-varlen tail chunk in the subchunk epilogue: valid_len_chunk was hardcoded to C for non-varlen sequences, so the last partial chunk (e.g. T=1500, tail=28 tokens) was never masked in apply_qk_kk_mask, letting invalid MMA accumulator values propagate → NaN.
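The tail-mask fix can be sketched as follows, assuming a chunk size C=64 (consistent with the 28-token tail for T=1500); the helper names are illustrative stand-ins for the kernel's epilogue logic, not cuLA's code:

```python
import numpy as np

C = 64    # illustrative chunk size
T = 1500  # 23 full chunks of 64 = 1472 tokens; last chunk holds 28

def valid_len_for_chunk(chunk_idx, T, C):
    # Buggy non-varlen path: return C unconditionally.
    # Fix: clamp to the number of tokens actually present in this chunk.
    return min(C, T - chunk_idx * C)

def apply_tail_mask(scores, chunk_idx, T, C):
    # Zero accumulator columns beyond the valid length, mirroring the role
    # of apply_qk_kk_mask in the subchunk epilogue: without this, the
    # padded positions carry garbage MMA accumulator values.
    valid = valid_len_for_chunk(chunk_idx, T, C)
    mask = np.arange(C) < valid
    return np.where(mask[None, :], scores, 0.0)
```

For the T=1500 example, chunk 23 has valid_len 28, so columns 28..63 of its accumulator tile are zeroed instead of contributing invalid values downstream.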

All 19 tests passed.

🔍 Related Issues

#16

🚀 Pull Request Checklist

Thank you for contributing to cuLA! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing.

⚡ Performance

Reviewer Notes

Copilot AI review requested due to automatic review settings April 8, 2026 09:49
@gemini-code-assist
Contributor

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.


Copilot AI left a comment


Copilot wasn't able to review any files in this pull request.



@meinie0826
Author

@icavanyu Could you take a look at this PR when you have time and give some feedback?

@icavan
Collaborator

icavan commented Apr 8, 2026

> @icavanyu Could you take a look at this PR when you have time and give some feedback?

Hi, thank you for your contribution! As the filename suggests, cula/ops/kda_fully_fused_wip.py is still a work-in-progress, and there may be bugs, accuracy, or performance issues in it.

The KDA fully fused version requires TF32 sub-chunk MMA and TF32 inverse. Would you be interested in tackling these two issues?

@meinie0826
Author

meinie0826 commented Apr 8, 2026

> > @icavanyu Could you take a look at this PR when you have time and give some feedback?
>
> Hi, thank you for your contribution! As the filename suggests, cula/ops/kda_fully_fused_wip.py is still a work-in-progress, and there may be bugs, accuracy, or performance issues in it.
>
> The KDA fully fused version requires TF32 sub-chunk MMA and TF32 inverse. Would you be interested in tackling these two issues?

Thanks for the heads-up! I'm definitely interested in tackling both the TF32 sub-chunk MMA and TF32 inverse — happy to follow up on those as next steps.

That said, this PR fixes two independent correctness bugs (cache_key missing g_dtype, and the missing tail-chunk boundary mask) that were causing NaN outputs across all use_gate_in_kernel=True test cases. All 19 tests now pass on B300, which is a meaningful step forward for the WIP kernel.

Would it be possible to merge this NaN-fix PR first, and I'll open a follow-up PR to address the TF32 sub-chunk MMA and TF32 inverse improvements?

@meinie0826 changed the title from "fix(kernel): fix NaN in use_gate_in_kernel=True path (cache_key, copy atom, pipeline race, tail mask)" to "fix(kernel): fix NaN in use_gate_in_kernel=True path (cache_key, tail mask)" on Apr 8, 2026
- add g_dtype to cache_key to prevent reusing wrong compiled kernel
  when g dtype differs between use_gate_in_kernel=True/False paths
- apply boundary mask for non-varlen tail chunk in subchunk epilogue
  (valid_len_chunk was hardcoded to C, missing padding mask for tail)
- remove xfail markers from use_gate_in_kernel=True test cases