kernel-skills is an open source library of high-quality skill files for AI coding agents working on compute kernels.
Each skill is a SKILL.md file: a structured engineering playbook that an AI coding agent can follow when writing, optimizing, debugging, or porting compute kernels.
This repository is not a runtime, framework, package manager, benchmark harness, or MCP server. It is a library of Markdown files. No installation required. No tooling required. Pick a skill, read it or paste it into your agent workflow, and get better kernel output.
AI coding agents produce substantially worse kernel code when given vague prompts. They skip constraint gathering, choose incorrect tile strategies, ignore boundary conditions, make unsupported performance claims, and produce code that looks plausible but fails on real hardware.
Structured skill files change this. A well-authored skill forces the agent to:
- gather the right constraints before writing a single line of code
- choose the correct algorithm and memory strategy for the workload
- reason explicitly about correctness risks and edge cases
- avoid cargo-cult optimization and fake performance claims
- explain tradeoffs with technical precision
- know when a custom kernel is not the right answer
This repository exists to provide those skill files at expert quality, openly, for any agent and any workflow.
Same model. Same prompt. One difference: a kernel skill file. The naive softmax kernel fails on overflow and large shapes. The skill-guided version stays correct and bandwidth-competitive.
| Shape N | Naive · normal | Stable · normal | Naive · adversarial | Stable · adversarial |
|---|---|---|---|---|
| 64 | ✅ | ✅ | ❌ | ✅ |
| 128 | ✅ | ✅ | ❌ | ✅ |
| 256 | ✅ | ✅ | ❌ | ✅ |
| 257 | ❌ | ✅ | ❌ | ✅ |
| 512 | ❌ | ✅ | ❌ | ✅ |
| 1024 | ❌ | ✅ | ❌ | ✅ |
| 2048 | ❌ | ✅ | ❌ | ✅ |
| 4096 | ❌ | ✅ | ❌ | ✅ |
Naive adversarial: 8/8 shapes fail — NaN/Inf output, no max subtraction.
Naive normal for N > 256: 5/5 shapes fail — silent wrong output, no strided loop.
Stable after skill: 0/16 failures. Bandwidth within 1.2% of torch.softmax.
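The overflow failure above does not need a GPU to reproduce. A minimal NumPy sketch (an illustration of the numeric trick, not the kernel itself) shows why the skill's required max subtraction is the difference between NaN and a correct answer:

```python
import numpy as np

def naive_softmax(x):
    # exp() of large inputs overflows float32 to inf, then inf/inf -> NaN
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # subtracting the row max keeps every exponent <= 0, so exp() never overflows
    e = np.exp(x - x.max())
    return e / e.sum()

adversarial = np.array([1000.0, 1000.5, 999.0], dtype=np.float32)
print(np.isnan(naive_softmax(adversarial)).any())  # True: overflow -> NaN
print(stable_softmax(adversarial))                 # finite, sums to 1
```

The same subtraction carries over unchanged to the CUDA and Triton kernels; it costs one extra reduction pass (or an online max update) and removes the entire adversarial failure column.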
→ Full proof page with root-cause analysis and all charts
This repository is for engineers who use AI coding agents to work on:
- CUDA kernel development
- Triton kernel development
- Quantized kernels (int8, fp8)
- High performance numerics and AI workloads
- Kernel optimization, debugging, and porting
It is also useful for engineers who want a technical reference for how to approach these problems systematically, independent of any agent.
kernel-skills/
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── ROADMAP.md
├── CLAUDE.md
├── .gitignore
├── skills/
│   ├── cuda/
│   │   ├── write-cuda-gemm-kernel/
│   │   ├── write-cuda-reduction-kernel/
│   │   ├── write-cuda-softmax-kernel/
│   │   ├── write-cuda-layernorm-kernel/
│   │   ├── optimize-global-memory-access/
│   │   ├── optimize-shared-memory-tiling/
│   │   ├── avoid-warp-divergence/
│   │   ├── choose-launch-configuration/
│   │   └── debug-cuda-kernel-correctness/
│   ├── triton/
│   │   ├── write-triton-gemm-kernel/
│   │   ├── write-triton-softmax-kernel/
│   │   ├── write-triton-layernorm-kernel/
│   │   ├── write-triton-attention-kernel/
│   │   └── optimize-triton-block-parameters/
│   ├── patterns/
│   │   ├── fuse-elementwise-ops/
│   │   ├── write-numerically-stable-kernel/
│   │   ├── handle-boundary-conditions/
│   │   ├── choose-tile-size-and-work-partitioning/
│   │   └── write-kernel-test-plan/
│   ├── quantization/
│   │   ├── write-int8-quantized-kernel/
│   │   ├── write-fp8-kernel/
│   │   └── debug-quantized-kernel-accuracy/
│   └── portability/
│       ├── port-cuda-kernel-to-triton/
│       ├── port-cuda-kernel-to-hip/
│       └── write-backend-agnostic-kernel-plan/
├── proof/
│   ├── README.md
│   ├── cuda/
│   │   └── softmax/
│   │       ├── softmax-correctness.md
│   │       ├── hero-proof.png
│   │       ├── error-cliff.png
│   │       └── code-diff.png
│   ├── triton/
│   ├── patterns/
│   ├── quantization/
│   └── portability/
└── examples/
    ├── how-to-use-with-claude-code.md
    ├── how-to-use-with-chatgpt.md
    ├── how-to-use-with-cursor.md
    └── how-to-use-with-gemini-cli.md
More skills are being added. See ROADMAP.md for what is coming next.
| Skill | Description |
|---|---|
| write-cuda-gemm-kernel | Design and implement a tiled CUDA GEMM kernel — shared memory strategy, tensor core eligibility, accumulation precision, and when to use cuBLAS/CUTLASS instead |
| write-cuda-reduction-kernel | Write a correct parallel reduction with warp shuffle tree, multi-block strategy, and correct handling of partial tiles |
| write-cuda-softmax-kernel | Implement online or two-pass softmax with numerically stable max subtraction and correct warp-level reduction |
| write-cuda-layernorm-kernel | Implement layer normalization with Welford online variance, fused mean/variance computation, and fp32 accumulation in fp16 kernels |
| optimize-global-memory-access | Analyze and fix coalescing, alignment, and vectorized load/store patterns using Nsight Compute metrics |
| optimize-shared-memory-tiling | Apply shared memory tiling with bank conflict analysis, padding strategies, and double buffering |
| avoid-warp-divergence | Classify avoidable vs unavoidable divergence, apply ballot/shuffle fast paths and stream compaction, estimate the real cost before restructuring |
| choose-launch-configuration | Select block size, grid size, and shared memory from occupancy analysis, register budget, and workload shape |
| debug-cuda-kernel-correctness | Systematic workflow for isolating indexing bugs, race conditions, reduction errors, dtype issues, and out-of-bounds accesses in CUDA kernels |
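The reduction pattern several of these skills build on can be previewed without a GPU. The sketch below is a plain-Python model of the shuffle-based tree reduction (the algorithm, not CUDA code): each step, the active half of the lanes adds in the other half, so 32 values collapse in log2(32) = 5 steps instead of 31 serial additions.

```python
import numpy as np

def tree_reduce_sum(vals):
    """Pairwise tree reduction mirroring a __shfl_down_sync butterfly:
    lane i adds lane i + offset, and offset halves until one value remains."""
    v = np.array(vals, dtype=np.float64)
    n = len(v)
    assert n & (n - 1) == 0, "this model assumes a power-of-two lane count"
    offset = n // 2
    while offset:
        v[:offset] += v[offset:2 * offset]
        offset //= 2
    return v[0]

data = np.arange(32, dtype=np.float32)   # one "warp" of 32 lanes
print(tree_reduce_sum(data))             # 496.0
```

The real skill adds what this model omits: partial tiles when the lane count is not a power of two, the multi-block strategy for inputs larger than one block, and the accumulation dtype.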
| Skill | Description |
|---|---|
| write-triton-gemm-kernel | Write a Triton GEMM kernel with correct block tiling, tl.dot accumulation, row/col-major loading, and when CUTLASS is preferable |
| write-triton-softmax-kernel | Implement numerically stable softmax in Triton with block size selection for the reduction axis and masking for variable sequence lengths |
| write-triton-layernorm-kernel | Implement LayerNorm in Triton with Welford online variance, persistent kernel pattern, and backward pass accumulation strategy |
| write-triton-attention-kernel | Implement Flash Attention in Triton — causal mask handling, kv-block loop structure, online softmax scaling, and fp16/bf16 accumulation decisions |
| optimize-triton-block-parameters | Select BLOCK_M/N/K, num_warps, and num_stages; reason about register pressure, occupancy, and autotuning config design |
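The online softmax scaling that the attention skill covers is worth seeing in isolation. Here is a hedged NumPy model of the streaming update (the numeric recurrence only, not a Triton kernel): process the row in blocks, keep a running max m and running sum s, and rescale s whenever the max grows.

```python
import numpy as np

def online_softmax_stats(x, block=4):
    """Single-pass softmax statistics: the core recurrence behind
    Flash Attention's block-wise softmax."""
    m, s = -np.inf, 0.0
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        m_new = max(m, blk.max())
        # rescale the old partial sum into the new max's frame, then add the block
        s = s * np.exp(m - m_new) + np.exp(blk - m_new).sum()
        m = m_new
    return m, s

x = np.random.default_rng(0).normal(size=1000)
m, s = online_softmax_stats(x)
ref = np.exp(x - x.max()).sum()   # two-pass reference denominator
print(np.isclose(s, ref))         # True: one pass matches two passes
```

Because the statistics are exact after every block, the kernel never needs the full row in registers, which is what makes block-wise attention over long sequences possible.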
| Skill | Description |
|---|---|
| fuse-elementwise-ops | Decide when and how to fuse elementwise operations — memory bandwidth arithmetic, producer-consumer fusion, and epilogue fusion patterns |
| write-numerically-stable-kernel | Apply Kahan summation, log-sum-exp trick, compensated accumulation, and dtype selection for stable intermediate values |
| handle-boundary-conditions | Handle partial tiles, misaligned sizes, and out-of-bounds accesses correctly — masked loads, predicated stores, and tail handling strategies |
| choose-tile-size-and-work-partitioning | Reason about arithmetic intensity, shared memory budget, occupancy tradeoffs, and work partitioning for irregular shapes |
| write-kernel-test-plan | Design a correctness and numerical test plan — reference comparison strategy, input shape sweep, dtype coverage, tolerance reasoning, and CI integration |
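As a taste of the compensated-accumulation material, here is a minimal Kahan summation sketch in NumPy (a scalar model of the technique; in a kernel the same recurrence runs per accumulator). The compensation term recovers the low-order bits each float32 addition discards.

```python
import numpy as np

def kahan_sum(xs):
    """Compensated (Kahan) summation in float32: carry each addition's
    rounding error in a separate compensation term c."""
    total = np.float32(0.0)
    c = np.float32(0.0)
    for x in xs:
        y = np.float32(x) - c
        t = np.float32(total + y)
        c = np.float32((t - total) - y)   # the error just introduced by t
        total = t
    return total

xs = np.full(100_000, 0.1, dtype=np.float32)
naive = np.float32(0.0)
for x in xs:
    naive = np.float32(naive + x)        # plain float32 accumulation drifts
print(float(naive), float(kahan_sum(xs)))  # Kahan stays near 10000.0
```

The skill itself goes further: when a wider accumulator dtype is cheaper than compensation, and how to pick tolerances so the test plan can tell drift from acceptable rounding.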
| Skill | Description |
|---|---|
| write-int8-quantized-kernel | Implement INT8 quantized matrix operations — dp4a instruction, symmetric vs asymmetric quantization, INT32 accumulation, per-channel scale epilogue, cuBLAS vs CUTLASS vs custom decision |
| write-fp8-kernel | Design FP8 compute kernels for Hopper/Ada — E4M3/E5M2 format selection, satfinite conversion, delayed scaling, WGMMA on H100, and hipBLASLt on MI300X |
| debug-quantized-kernel-accuracy | Diagnose accuracy regressions in quantized kernels — scale validation, overflow detection, per-element error attribution, and calibration diagnostics |
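The INT8 skill's core pipeline can be modeled in a few lines of NumPy (an illustrative sketch of symmetric quantization with INT32 accumulation, not the dp4a kernel): quantize both operands, multiply-accumulate in int32 exactly as the hardware does, and apply the scales once in the epilogue.

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Symmetric per-tensor quantization: one scale maps max|x| to the int range."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 16)).astype(np.float32)
b = rng.normal(size=(16, 4)).astype(np.float32)

qa, sa = quantize_symmetric(a)
qb, sb = quantize_symmetric(b)

# accumulate in int32 (as dp4a does), dequantize once in the epilogue
acc = qa.astype(np.int32) @ qb.astype(np.int32)
approx = acc.astype(np.float32) * (sa * sb)

err = np.abs(approx - a @ b).max()
print(float(err))  # worst-case element error vs the fp32 reference
```

The int32 accumulation is the part people get wrong: accumulating in int8 or int16 overflows on realistic reduction lengths, which is exactly the class of bug debug-quantized-kernel-accuracy is built to catch.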
| Skill | Description |
|---|---|
| port-cuda-kernel-to-triton | Systematically translate a CUDA kernel to Triton — execution model mapping, warp primitives to tl.reduce, shared memory to block-scoped accumulators |
| port-cuda-kernel-to-hip | Port CUDA to HIP/ROCm — wavefront width differences, 64-bit ballot masks, WMMA to rocWMMA, hipify audit checklist for MI250/MI300X targets |
| write-backend-agnostic-kernel-plan | Plan a kernel that must run on NVIDIA and AMD — abstraction strategy, portability risk register, per-backend tile sizing, and CI matrix |
1. Find the skill that matches your task in `skills/`.
2. Open the `SKILL.md` file and paste its full contents into your agent's context.
3. Ask the agent to perform the task.
The skill does not replace your prompt — it forces the agent to reason correctly before writing a single line of code.
> `<paste contents of skills/cuda/write-cuda-reduction-kernel/SKILL.md>`
>
> Write a warp-shuffle reduction kernel for float32 inputs on an H100.
> Input shape: [B=32, N=65536]. Output: [B] row-wise sums.
The skill works the same way with ChatGPT, Cursor, Gemini CLI, and any other agent that accepts context.
| Agent | Guide |
|---|---|
| Claude Code | examples/how-to-use-with-claude-code.md |
| ChatGPT | examples/how-to-use-with-chatgpt.md |
| Cursor | examples/how-to-use-with-cursor.md |
| Gemini CLI | examples/how-to-use-with-gemini-cli.md |
Contributions are welcome. Before opening a pull request, read CONTRIBUTING.md.
The short version: open an issue first to propose the skill scope, follow the required 11-section SKILL.md template, meet the quality bar, and keep naming conventions consistent.
Low-quality, vague, or out-of-scope skill files will not be merged regardless of technical domain.
More skills are being added across CUDA, Triton, quantization, and portability. Following the quality-first principle: each skill ships only when it is genuinely better than a generic prompt.
See ROADMAP.md for the full plan.
MIT. See LICENSE.

