kernel-skills

kernel-skills is an open-source library of high-quality skill files for AI coding agents that work on compute kernels.


What it is

kernel-skills is a curated collection of SKILL.md files. Each file is a structured engineering playbook that an AI coding agent can follow when writing, optimizing, debugging, or porting compute kernels.

This repository is not a runtime, framework, package manager, benchmark harness, or MCP server. It is a library of Markdown files. No installation required. No tooling required. Pick a skill, read it or paste it into your agent workflow, and get better kernel output.


Why it exists

AI coding agents produce substantially worse kernel code when given vague prompts. They skip constraint gathering, choose incorrect tile strategies, ignore boundary conditions, make unsupported performance claims, and produce code that looks plausible but fails on real hardware.

Structured skill files change this. A well-authored skill forces the agent to:

  • gather the right constraints before writing a single line of code
  • choose the correct algorithm and memory strategy for the workload
  • reason explicitly about correctness risks and edge cases
  • avoid cargo-cult optimization and fake performance claims
  • explain tradeoffs with technical precision
  • know when a custom kernel is not the right answer

This repository exists to provide those skill files at expert quality, openly, for any agent and any workflow.


Measured impact

Same model. Same prompt. One difference: a kernel skill file. The naive softmax kernel fails on overflow and large shapes. The skill-guided version stays correct and bandwidth-competitive.
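The naive failure mode is easy to reproduce outside CUDA. A minimal NumPy sketch (illustrative only, not the benchmarked kernel) of the two formulations:

```python
import numpy as np

def softmax_naive(x):
    # exp() overflows for large inputs: exp(1000) -> inf, and inf/inf -> nan
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Subtracting the row max keeps every exponent <= 0, so exp() cannot overflow
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))   # nan everywhere: exp overflowed
print(softmax_stable(x))  # well-defined probabilities summing to 1
```

The skill-guided kernel applies the same max subtraction (plus a strided loop for N larger than the block), which is exactly what the proof run measures.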

Proof of impact — pass/fail heatmap, stat cards, bandwidth chart

Correctness: pass / fail matrix

| Shape N | Naive · normal | Stable · normal | Naive · adversarial | Stable · adversarial |
| ------: | :------------: | :-------------: | :-----------------: | :------------------: |
| 64      | pass           | pass            | fail                | pass                 |
| 128     | pass           | pass            | fail                | pass                 |
| 256     | pass           | pass            | fail                | pass                 |
| 257     | fail           | pass            | fail                | pass                 |
| 512     | fail           | pass            | fail                | pass                 |
| 1024    | fail           | pass            | fail                | pass                 |
| 2048    | fail           | pass            | fail                | pass                 |
| 4096    | fail           | pass            | fail                | pass                 |
Naive adversarial: 8/8 shapes fail — NaN/Inf output, no max subtraction.
Naive normal for N > 256: 5/5 shapes fail — silent wrong output, no strided loop.
Stable after skill: 0/16 failures. Bandwidth within 1.2% of torch.softmax.

The two changes the skill directed

Code diff — before and after skill guidance

Full proof page with root-cause analysis and all charts


Who it is for

This repository is for engineers who use AI coding agents to work on:

  • CUDA kernel development
  • Triton kernel development
  • Quantized kernels (int8, fp8)
  • High performance numerics and AI workloads
  • Kernel optimization, debugging, and porting

It is also useful for engineers who want a technical reference for how to approach these problems systematically, independent of any agent.


Repository structure

kernel-skills/
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── ROADMAP.md
├── CLAUDE.md
├── .gitignore
├── skills/
│   ├── cuda/
│   │   ├── write-cuda-gemm-kernel/
│   │   ├── write-cuda-reduction-kernel/
│   │   ├── write-cuda-softmax-kernel/
│   │   ├── write-cuda-layernorm-kernel/
│   │   ├── optimize-global-memory-access/
│   │   ├── optimize-shared-memory-tiling/
│   │   ├── avoid-warp-divergence/
│   │   ├── choose-launch-configuration/
│   │   └── debug-cuda-kernel-correctness/
│   ├── triton/
│   │   ├── write-triton-gemm-kernel/
│   │   ├── write-triton-softmax-kernel/
│   │   ├── write-triton-layernorm-kernel/
│   │   ├── write-triton-attention-kernel/
│   │   └── optimize-triton-block-parameters/
│   ├── patterns/
│   │   ├── fuse-elementwise-ops/
│   │   ├── write-numerically-stable-kernel/
│   │   ├── handle-boundary-conditions/
│   │   ├── choose-tile-size-and-work-partitioning/
│   │   └── write-kernel-test-plan/
│   ├── quantization/
│   │   ├── write-int8-quantized-kernel/
│   │   ├── write-fp8-kernel/
│   │   └── debug-quantized-kernel-accuracy/
│   └── portability/
│       ├── port-cuda-kernel-to-triton/
│       ├── port-cuda-kernel-to-hip/
│       └── write-backend-agnostic-kernel-plan/
├── proof/
│   ├── README.md
│   ├── cuda/
│   │   └── softmax/
│   │       ├── softmax-correctness.md
│   │       ├── hero-proof.png
│   │       ├── error-cliff.png
│   │       └── code-diff.png
│   ├── triton/
│   ├── patterns/
│   ├── quantization/
│   └── portability/
└── examples/
    ├── how-to-use-with-claude-code.md
    ├── how-to-use-with-chatgpt.md
    ├── how-to-use-with-cursor.md
    └── how-to-use-with-gemini-cli.md

More skills are being added. See ROADMAP.md for what is coming next.


Skills

CUDA

| Skill | Description |
| --- | --- |
| write-cuda-gemm-kernel | Design and implement a tiled CUDA GEMM kernel — shared memory strategy, tensor core eligibility, accumulation precision, and when to use cuBLAS/CUTLASS instead |
| write-cuda-reduction-kernel | Write a correct parallel reduction with warp shuffle tree, multi-block strategy, and correct handling of partial tiles |
| write-cuda-softmax-kernel | Implement online or two-pass softmax with numerically stable max subtraction and correct warp-level reduction |
| write-cuda-layernorm-kernel | Implement layer normalization with Welford online variance, fused mean/variance computation, and fp32 accumulation in fp16 kernels |
| optimize-global-memory-access | Analyze and fix coalescing, alignment, and vectorized load/store patterns using Nsight Compute metrics |
| optimize-shared-memory-tiling | Apply shared memory tiling with bank conflict analysis, padding strategies, and double buffering |
| avoid-warp-divergence | Classify avoidable vs unavoidable divergence, apply ballot/shuffle fast paths and stream compaction, estimate the real cost before restructuring |
| choose-launch-configuration | Select block size, grid size, and shared memory from occupancy analysis, register budget, and workload shape |
| debug-cuda-kernel-correctness | Systematic workflow for isolating indexing bugs, race conditions, reduction errors, dtype issues, and out-of-bounds accesses in CUDA kernels |
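To make the warp-shuffle tree concrete, here is a pure-Python model of the pattern write-cuda-reduction-kernel covers (illustrative, not from the skill itself): a list stands in for the 32 warp lanes, and reading `vals[lane + offset]` plays the role of `__shfl_down_sync`.

```python
WARP_SIZE = 32

def warp_reduce_sum(lanes):
    # Model of: for (offset = 16; offset > 0; offset /= 2)
    #               val += __shfl_down_sync(FULL_MASK, val, offset);
    vals = list(lanes)
    assert len(vals) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(WARP_SIZE):
            src = lane + offset
            # In hardware, an out-of-range source returns the lane's own
            # value, making the add a no-op; we simply skip those lanes.
            if src < WARP_SIZE:
                vals[lane] += vals[src]
        offset //= 2
    return vals[0]  # after log2(32) = 5 steps, lane 0 holds the full sum

print(warp_reduce_sum(range(32)))  # 496
```

Processing lanes in ascending order preserves the parallel semantics here, because each lane only reads values at higher lane indices that have not yet been updated in the current round.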

Triton

| Skill | Description |
| --- | --- |
| write-triton-gemm-kernel | Write a Triton GEMM kernel with correct block tiling, tl.dot accumulation, row/col-major loading, and when CUTLASS is preferable |
| write-triton-softmax-kernel | Implement numerically stable softmax in Triton with block size selection for the reduction axis and masking for variable sequence lengths |
| write-triton-layernorm-kernel | Implement LayerNorm in Triton with Welford online variance, persistent kernel pattern, and backward pass accumulation strategy |
| write-triton-attention-kernel | Implement Flash Attention in Triton — causal mask handling, kv-block loop structure, online softmax scaling, and fp16/bf16 accumulation decisions |
| optimize-triton-block-parameters | Select BLOCK_M/N/K, num_warps, and num_stages; reason about register pressure, occupancy, and autotuning config design |
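The "online softmax scaling" that write-triton-attention-kernel refers to is a one-pass recurrence: process the row block by block, track the running max m and running denominator d, and rescale d by exp(m_old - m_new) whenever a new max appears. A NumPy sketch of the recurrence (illustrative, not the skill's Triton code):

```python
import numpy as np

def softmax_online(x, block=4):
    # Single streaming pass over blocks: m is the running max, d the
    # running denominator, rescaled whenever the max grows.
    m, d = -np.inf, 0.0
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        m_new = max(m, blk.max())
        d = d * np.exp(m - m_new) + np.exp(blk - m_new).sum()
        m = m_new
    # Cheap second pass to materialize the normalized outputs
    return np.exp(x - m) / d

x = np.linspace(-3.0, 3.0, 10)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(softmax_online(x), ref)
```

In Flash Attention the same rescaling factor is also applied to the partial output accumulator, which is what lets the kernel avoid materializing the full attention matrix.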

Patterns

| Skill | Description |
| --- | --- |
| fuse-elementwise-ops | Decide when and how to fuse elementwise operations — memory bandwidth arithmetic, producer-consumer fusion, and epilogue fusion patterns |
| write-numerically-stable-kernel | Apply Kahan summation, log-sum-exp trick, compensated accumulation, and dtype selection for stable intermediate values |
| handle-boundary-conditions | Handle partial tiles, misaligned sizes, and out-of-bounds accesses correctly — masked loads, predicated stores, and tail handling strategies |
| choose-tile-size-and-work-partitioning | Reason about arithmetic intensity, shared memory budget, occupancy tradeoffs, and work partitioning for irregular shapes |
| write-kernel-test-plan | Design a correctness and numerical test plan — reference comparison strategy, input shape sweep, dtype coverage, tolerance reasoning, and CI integration |
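As a taste of what write-numerically-stable-kernel covers, here is Kahan (compensated) summation in plain Python (an illustrative sketch, not the skill's kernel code):

```python
def kahan_sum(values):
    # Compensated summation: c carries the low-order bits a plain float
    # add would drop, keeping the error O(1) ulp instead of O(n) ulps.
    total, c = 0.0, 0.0
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y  # the part of y that was lost in the add
        total = t
    return total

vals = [0.1] * 10
print(sum(vals))        # 0.9999999999999999 (rounding error accumulates)
print(kahan_sum(vals))  # 1.0
```

In a kernel the same idea appears as a (value, compensation) pair per accumulator; it trades one extra register and three extra FLOPs per add for a dramatically tighter error bound on long reductions.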

Quantization

| Skill | Description |
| --- | --- |
| write-int8-quantized-kernel | Implement INT8 quantized matrix operations — dp4a instruction, symmetric vs asymmetric quantization, INT32 accumulation, per-channel scale epilogue, cuBLAS vs CUTLASS vs custom decision |
| write-fp8-kernel | Design FP8 compute kernels for Hopper/Ada — E4M3/E5M2 format selection, satfinite conversion, delayed scaling, WGMMA on H100, and hipBLASLt on MI300X |
| debug-quantized-kernel-accuracy | Diagnose accuracy regressions in quantized kernels — scale validation, overflow detection, per-element error attribution, and calibration diagnostics |
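The core pipeline these skills reason about — symmetric quantization, INT32 accumulation (the role dp4a plays in hardware), and a scale epilogue — can be sketched in NumPy (illustrative only; names and tolerances here are not from the skills):

```python
import numpy as np

def quantize_symmetric_int8(x):
    # Symmetric per-tensor quantization: map max |x| to 127, no zero point.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    qa, sa = quantize_symmetric_int8(a)
    qb, sb = quantize_symmetric_int8(b)
    # Accumulate in int32 (as dp4a does) so int8 products cannot overflow,
    # then apply the combined scale once in the epilogue.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a, b = rng.standard_normal((8, 16)), rng.standard_normal((16, 4))
err = np.abs(int8_matmul(a, b) - a @ b).max()  # small quantization error
```

debug-quantized-kernel-accuracy is largely about what to check when `err` is not small: are the scales right, did the accumulator overflow, and which elements contribute the error.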

Portability

| Skill | Description |
| --- | --- |
| port-cuda-kernel-to-triton | Systematically translate a CUDA kernel to Triton — execution model mapping, warp primitives to tl.reduce, shared memory to block-scoped accumulators |
| port-cuda-kernel-to-hip | Port CUDA to HIP/ROCm — wavefront width differences, 64-bit ballot masks, WMMA to rocWMMA, hipify audit checklist for MI250/MI300X targets |
| write-backend-agnostic-kernel-plan | Plan a kernel that must run on NVIDIA and AMD — abstraction strategy, portability risk register, per-backend tile sizing, and CI matrix |

How to use a skill

  1. Find the skill that matches your task in skills/.
  2. Open the SKILL.md file and paste its full contents into your agent's context.
  3. Ask the agent to perform the task.

The skill does not replace your prompt — it forces the agent to reason correctly before writing a single line of code.

Example (Claude Code)

<paste contents of skills/cuda/write-cuda-reduction-kernel/SKILL.md>

Write a warp-shuffle reduction kernel for float32 inputs on an H100.
Input shape: [B=32, N=65536]. Output: [B] row-wise sums.

The skill works the same way with ChatGPT, Cursor, Gemini CLI, and any other agent that accepts context.

| Agent | Guide |
| --- | --- |
| Claude Code | examples/how-to-use-with-claude-code.md |
| ChatGPT | examples/how-to-use-with-chatgpt.md |
| Cursor | examples/how-to-use-with-cursor.md |
| Gemini CLI | examples/how-to-use-with-gemini-cli.md |

Contributing

Contributions are welcome. Before opening a pull request, read CONTRIBUTING.md.

The short version: open an issue first to propose the skill scope, follow the required 11-section SKILL.md template, meet the quality bar, and keep naming conventions consistent.

Low-quality, vague, or out-of-scope skill files will not be merged regardless of technical domain.


Roadmap

More skills are being added across CUDA, Triton, quantization, and portability, following a quality-first principle: each skill ships only when it is genuinely better than a generic prompt.

See ROADMAP.md for the full plan.


License

MIT. See LICENSE.
