kernel-skills is an open source library of high-quality skill files for AI coding agents working on compute kernels.
Each skill is a SKILL.md file: a structured engineering playbook that an AI coding agent can follow when writing, optimizing, debugging, or porting compute kernels.
This repository is not a runtime, framework, package manager, benchmark harness, or MCP server. It is a library of Markdown files. No installation required. No tooling required. Pick a skill, read it or paste it into your agent workflow, and get better kernel output.
AI coding agents produce substantially worse kernel code when given vague prompts. They skip constraint gathering, choose incorrect tile strategies, ignore boundary conditions, make unsupported performance claims, and produce code that looks plausible but fails on real hardware.
Structured skill files change this. A well-authored skill forces the agent to:
- gather the right constraints before writing a single line of code
- choose the correct algorithm and memory strategy for the workload
- reason explicitly about correctness risks and edge cases
- avoid cargo-cult optimization and fake performance claims
- explain tradeoffs with technical precision
- know when a custom kernel is not the right answer
This repository exists to provide those skill files at expert quality, openly, for any agent and any workflow.
Same model. Same prompt. One difference: a kernel skill file. The naive softmax kernel fails on overflow and large shapes. The skill-guided version stays correct and bandwidth-competitive.
| Shape N | Naive · normal | Stable · normal | Naive · adversarial | Stable · adversarial |
|---|---|---|---|---|
| 64 | ✅ | ✅ | ❌ | ✅ |
| 128 | ✅ | ✅ | ❌ | ✅ |
| 256 | ✅ | ✅ | ❌ | ✅ |
| 257 | ❌ | ✅ | ❌ | ✅ |
| 512 | ❌ | ✅ | ❌ | ✅ |
| 1024 | ❌ | ✅ | ❌ | ✅ |
| 2048 | ❌ | ✅ | ❌ | ✅ |
| 4096 | ❌ | ✅ | ❌ | ✅ |
Naive adversarial: 8/8 shapes fail — NaN/Inf output, no max subtraction.
Naive normal for N > 256: 5/5 shapes fail — silent wrong output, no strided loop.
Stable after skill: 0/16 failures. Bandwidth within 1.2% of torch.softmax.
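The overflow failure above does not need a GPU to reproduce. A minimal NumPy sketch (an illustration of the numeric trick, not the kernel itself) shows why the skill's required max subtraction is the difference between NaN and a correct answer:

```python
import numpy as np

def naive_softmax(x):
    # exp() of large inputs overflows float32 to inf, then inf/inf -> NaN
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # subtracting the row max keeps every exponent <= 0, so exp() never overflows
    e = np.exp(x - x.max())
    return e / e.sum()

adversarial = np.array([1000.0, 1000.5, 999.0], dtype=np.float32)
print(np.isnan(naive_softmax(adversarial)).any())  # True: overflow -> NaN
print(stable_softmax(adversarial))                 # finite, sums to 1
```

The same subtraction carries over unchanged to the CUDA and Triton kernels; it costs one extra reduction pass (or an online max update) and removes the entire adversarial failure column.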
→ Full proof page with root-cause analysis and all charts
This repository is for engineers who use AI coding agents to work on:
- CUDA kernel development
- Triton kernel development
- Quantized kernels (int8, fp8)
- High performance numerics and AI workloads
- Kernel optimization, debugging, and porting
It is also useful for engineers who want a technical reference for how to approach these problems systematically, independent of any agent.
kernel-skills/
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── ROADMAP.md
├── CLAUDE.md
├── .gitignore
├── skills/
│   ├── cuda/
│   │   ├── write-cuda-gemm-kernel/
│   │   ├── write-cuda-reduction-kernel/
│   │   ├── write-cuda-softmax-kernel/
│   │   ├── write-cuda-layernorm-kernel/
│   │   ├── optimize-global-memory-access/
│   │   ├── optimize-shared-memory-tiling/
│   │   ├── avoid-warp-divergence/
│   │   ├── choose-launch-configuration/
│   │   └── debug-cuda-kernel-correctness/
│   ├── triton/
│   │   ├── write-triton-gemm-kernel/
│   │   ├── write-triton-softmax-kernel/
│   │   ├── write-triton-layernorm-kernel/
│   │   ├── write-triton-attention-kernel/
│   │   └── optimize-triton-block-parameters/
│   ├── patterns/
│   │   ├── fuse-elementwise-ops/
│   │   ├── write-numerically-stable-kernel/
│   │   ├── handle-boundary-conditions/
│   │   ├── choose-tile-size-and-work-partitioning/
│   │   └── write-kernel-test-plan/
│   ├── quantization/
│   │   ├── write-int8-quantized-kernel/
│   │   ├── write-fp8-kernel/
│   │   └── debug-quantized-kernel-accuracy/
│   └── portability/
│       ├── port-cuda-kernel-to-triton/
│       ├── port-cuda-kernel-to-hip/
│       └── write-backend-agnostic-kernel-plan/
├── proof/
│   ├── README.md
│   ├── cuda/
│   │   └── softmax/
│   │       ├── softmax-correctness.md
│   │       ├── hero-proof.png
│   │       ├── error-cliff.png
│   │       └── code-diff.png
│   ├── triton/
│   ├── patterns/
│   ├── quantization/
│   └── portability/
└── examples/
    ├── how-to-use-with-claude-code.md
    ├── how-to-use-with-chatgpt.md
    ├── how-to-use-with-cursor.md
    └── how-to-use-with-gemini-cli.md
More skills are being added. See ROADMAP.md for what is coming next.
| Skill | Description |
|---|---|
| write-cuda-gemm-kernel | Design and implement a tiled CUDA GEMM kernel — shared memory strategy, tensor core eligibility, accumulation precision, and when to use cuBLAS/CUTLASS instead |
| write-cuda-reduction-kernel | Write a correct parallel reduction with warp shuffle tree, multi-block strategy, and correct handling of partial tiles |
| write-cuda-softmax-kernel | Implement online or two-pass softmax with numerically stable max subtraction and correct warp-level reduction |
| write-cuda-layernorm-kernel | Implement layer normalization with Welford online variance, fused mean/variance computation, and fp32 accumulation in fp16 kernels |
| optimize-global-memory-access | Analyze and fix coalescing, alignment, and vectorized load/store patterns using Nsight Compute metrics |
| optimize-shared-memory-tiling | Apply shared memory tiling with bank conflict analysis, padding strategies, and double buffering |
| avoid-warp-divergence | Classify avoidable vs unavoidable divergence, apply ballot/shuffle fast paths and stream compaction, estimate the real cost before restructuring |
| choose-launch-configuration | Select block size, grid size, and shared memory from occupancy analysis, register budget, and workload shape |
| debug-cuda-kernel-correctness | Systematic workflow for isolating indexing bugs, race conditions, reduction errors, dtype issues, and out-of-bounds accesses in CUDA kernels |
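The reduction pattern several of these skills build on can be previewed without a GPU. The sketch below is a plain-Python model of the shuffle-based tree reduction (the algorithm, not CUDA code): each step, the active half of the lanes adds in the other half, so 32 values collapse in log2(32) = 5 steps instead of 31 serial additions.

```python
import numpy as np

def tree_reduce_sum(vals):
    """Pairwise tree reduction mirroring a __shfl_down_sync butterfly:
    lane i adds lane i + offset, and offset halves until one value remains."""
    v = np.array(vals, dtype=np.float64)
    n = len(v)
    assert n & (n - 1) == 0, "this model assumes a power-of-two lane count"
    offset = n // 2
    while offset:
        v[:offset] += v[offset:2 * offset]
        offset //= 2
    return v[0]

data = np.arange(32, dtype=np.float32)   # one "warp" of 32 lanes
print(tree_reduce_sum(data))             # 496.0
```

The real skill adds what this model omits: partial tiles when the lane count is not a power of two, the multi-block strategy for inputs larger than one block, and the accumulation dtype.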
| Skill | Description |
|---|---|
| write-triton-gemm-kernel | Write a Triton GEMM kernel with correct block tiling, tl.dot accumulation, row/col-major loading, and when CUTLASS is preferable |
| write-triton-softmax-kernel | Implement numerically stable softmax in Triton with block size selection for the reduction axis and masking for variable sequence lengths |
| write-triton-layernorm-kernel | Implement LayerNorm in Triton with Welford online variance, persistent kernel pattern, and backward pass accumulation strategy |
| write-triton-attention-kernel | Implement Flash Attention in Triton — causal mask handling, kv-block loop structure, online softmax scaling, and fp16/bf16 accumulation decisions |
| optimize-triton-block-parameters | Select BLOCK_M/N/K, num_warps, and num_stages; reason about register pressure, occupancy, and autotuning config design |
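The online softmax scaling that the attention skill covers is worth seeing in isolation. Here is a hedged NumPy model of the streaming update (the numeric recurrence only, not a Triton kernel): process the row in blocks, keep a running max m and running sum s, and rescale s whenever the max grows.

```python
import numpy as np

def online_softmax_stats(x, block=4):
    """Single-pass softmax statistics: the core recurrence behind
    Flash Attention's block-wise softmax."""
    m, s = -np.inf, 0.0
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        m_new = max(m, blk.max())
        # rescale the old partial sum into the new max's frame, then add the block
        s = s * np.exp(m - m_new) + np.exp(blk - m_new).sum()
        m = m_new
    return m, s

x = np.random.default_rng(0).normal(size=1000)
m, s = online_softmax_stats(x)
ref = np.exp(x - x.max()).sum()   # two-pass reference denominator
print(np.isclose(s, ref))         # True: one pass matches two passes
```

Because the statistics are exact after every block, the kernel never needs the full row in registers, which is what makes block-wise attention over long sequences possible.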
| Skill | Description |
|---|---|
| fuse-elementwise-ops | Decide when and how to fuse elementwise operations — memory bandwidth arithmetic, producer-consumer fusion, and epilogue fusion patterns |
| write-numerically-stable-kernel | Apply Kahan summation, log-sum-exp trick, compensated accumulation, and dtype selection for stable intermediate values |
| handle-boundary-conditions | Handle partial tiles, misaligned sizes, and out-of-bounds accesses correctly — masked loads, predicated stores, and tail handling strategies |
| choose-tile-size-and-work-partitioning | Reason about arithmetic intensity, shared memory budget, occupancy tradeoffs, and work partitioning for irregular shapes |
| write-kernel-test-plan | Design a correctness and numerical test plan — reference comparison strategy, input shape sweep, dtype coverage, tolerance reasoning, and CI integration |
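As a taste of the compensated-accumulation material, here is a minimal Kahan summation sketch in NumPy (a scalar model of the technique; in a kernel the same recurrence runs per accumulator). The compensation term recovers the low-order bits each float32 addition discards.

```python
import numpy as np

def kahan_sum(xs):
    """Compensated (Kahan) summation in float32: carry each addition's
    rounding error in a separate compensation term c."""
    total = np.float32(0.0)
    c = np.float32(0.0)
    for x in xs:
        y = np.float32(x) - c
        t = np.float32(total + y)
        c = np.float32((t - total) - y)   # the error just introduced by t
        total = t
    return total

xs = np.full(100_000, 0.1, dtype=np.float32)
naive = np.float32(0.0)
for x in xs:
    naive = np.float32(naive + x)        # plain float32 accumulation drifts
print(float(naive), float(kahan_sum(xs)))  # Kahan stays near 10000.0
```

The skill itself goes further: when a wider accumulator dtype is cheaper than compensation, and how to pick tolerances so the test plan can tell drift from acceptable rounding.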
| Skill | Description |
|---|---|
| write-int8-quantized-kernel | Implement INT8 quantized matrix operations — dp4a instruction, symmetric vs asymmetric quantization, INT32 accumulation, per-channel scale epilogue, cuBLAS vs CUTLASS vs custom decision |
| write-fp8-kernel | Design FP8 compute kernels for Hopper/Ada — E4M3/E5M2 format selection, satfinite conversion, delayed scaling, WGMMA on H100, and hipBLASLt on MI300X |
| debug-quantized-kernel-accuracy | Diagnose accuracy regressions in quantized kernels — scale validation, overflow detection, per-element error attribution, and calibration diagnostics |
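The INT8 skill's core pipeline can be modeled in a few lines of NumPy (an illustrative sketch of symmetric quantization with INT32 accumulation, not the dp4a kernel): quantize both operands, multiply-accumulate in int32 exactly as the hardware does, and apply the scales once in the epilogue.

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Symmetric per-tensor quantization: one scale maps max|x| to the int range."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 16)).astype(np.float32)
b = rng.normal(size=(16, 4)).astype(np.float32)

qa, sa = quantize_symmetric(a)
qb, sb = quantize_symmetric(b)

# accumulate in int32 (as dp4a does), dequantize once in the epilogue
acc = qa.astype(np.int32) @ qb.astype(np.int32)
approx = acc.astype(np.float32) * (sa * sb)

err = np.abs(approx - a @ b).max()
print(float(err))  # worst-case element error vs the fp32 reference
```

The int32 accumulation is the part people get wrong: accumulating in int8 or int16 overflows on realistic reduction lengths, which is exactly the class of bug debug-quantized-kernel-accuracy is built to catch.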
| Skill | Description |
|---|---|
| port-cuda-kernel-to-triton | Systematically translate a CUDA kernel to Triton — execution model mapping, warp primitives to tl.reduce, shared memory to block-scoped accumulators |
| port-cuda-kernel-to-hip | Port CUDA to HIP/ROCm — wavefront width differences, 64-bit ballot masks, WMMA to rocWMMA, hipify audit checklist for MI250/MI300X targets |
| write-backend-agnostic-kernel-plan | Plan a kernel that must run on NVIDIA and AMD — abstraction strategy, portability risk register, per-backend tile sizing, and CI matrix |
1. Find the skill that matches your task in `skills/`.
2. Open the `SKILL.md` file and paste its full contents into your agent's context.
3. Ask the agent to perform the task.
The skill does not replace your prompt — it forces the agent to reason correctly before writing a single line of code.
> `<paste contents of skills/cuda/write-cuda-reduction-kernel/SKILL.md>`
>
> Write a warp-shuffle reduction kernel for float32 inputs on an H100.
> Input shape: [B=32, N=65536]. Output: [B] row-wise sums.
The skill works the same way with ChatGPT, Cursor, Gemini CLI, and any other agent that accepts context.
| Agent | Guide |
|---|---|
| Claude Code | examples/how-to-use-with-claude-code.md |
| ChatGPT | examples/how-to-use-with-chatgpt.md |
| Cursor | examples/how-to-use-with-cursor.md |
| Gemini CLI | examples/how-to-use-with-gemini-cli.md |
Contributions are welcome. Before opening a pull request, read CONTRIBUTING.md.
The short version: open an issue first to propose the skill scope, follow the required 11-section SKILL.md template, meet the quality bar, and keep naming conventions consistent.
Low-quality, vague, or out-of-scope skill files will not be merged regardless of technical domain.
More skills are being added across CUDA, Triton, quantization, and portability. Following the quality-first principle: each skill ships only when it is genuinely better than a generic prompt.
See ROADMAP.md for the full plan.
MIT. See LICENSE.

