The following acronyms are used in the Composable Kernel codebase:
| Acronym | Expansion | Explanation |
|---|---|---|
| BF16 | Brain Floating Point 16 | 1 sign bit, 8 exponent bits, 7 significand bits |
| BF8 | 8-bit Brain Floating Point | 1 sign bit, 5 exponent bits, 2 significand bits (E5M2) |
| DLA | Deep Learning Accelerator | Specialized hardware for deep learning workloads |
| DRAM | Dynamic Random-Access Memory | Main memory; corresponds to global memory on a GPU |
| E2E | End-to-End | Complete pipeline or process from input to output |
| ELU | Exponential Linear Unit | Activation function: ELU(x) = x if x > 0, else α(e^x − 1) |
| FMHA | Fused Multi-Head Attention | Efficient transformer attention kernel, fusing softmax, masking, and matmul |
| FP16 | Half-Precision Floating Point | 16-bit IEEE floating point format |
| FP32 | Single-Precision Floating Point | 32-bit IEEE floating point format |
| FP64 | Double-Precision Floating Point | 64-bit IEEE floating point format |
| FP8 | 8-bit Floating Point | 8-bit floating point format, typically E4M3 (1 sign bit, 4 exponent bits, 3 significand bits); used for low-precision inference |
| GEMM | General Matrix Multiply | Matrix multiplication operation: C = αAB + βC |
| GELU | Gaussian Error Linear Unit | Activation function: GELU(x) = x·Φ(x), where Φ is the standard normal CDF |
| GQA | Grouped Query Attention | Variant of multi-head attention with grouped queries/keys/values |
| HBM | High Bandwidth Memory | Fast memory used in modern GPUs |
| HIP | Heterogeneous-Compute Interface for Portability | AMD's CUDA-like GPU programming API |
| INT8 | 8-bit Integer | Quantized integer format for inference |
| KVS | Key-Value Store | Data structure mapping keys to values; in transformer kernels often refers to the cached K/V tensors |
| L2/L1 | Level 2/Level 1 Cache | On-chip memory hierarchy in CPUs/GPUs |
| LDS | Local Data Share | Shared memory on AMD GPUs (equivalent to CUDA's shared memory) |
| LLM | Large Language Model | Transformer-based model for NLP tasks |
| LSE | Log-Sum-Exp | LSE(x) = log Σᵢ exp(xᵢ); used for numerically stable softmax computation |
| MHA | Multi-Head Attention | Attention mechanism with multiple heads in transformers |
| MFMA | Matrix Fused Multiply-Add | AMD GPU hardware instruction for matrix-matrix multiplication |
| MoE | Mixture of Experts | Neural network architecture with multiple expert subnetworks |
| MQA | Multi-Query Attention | Variant of multi-head attention with shared keys/values across heads |
| RCCL | ROCm Collective Communications Library | AMD's library for multi-GPU communication (counterpart to NVIDIA's NCCL) |
| NCHW | Batch, Channel, Height, Width | Tensor layout: batch-major, channels-first |
| NHWC | Batch, Height, Width, Channel | Tensor layout: batch-major, channels-last |
| OOM | Out Of Memory | Error when memory allocation fails |
| QAT | Quantization Aware Training | Training technique for quantized inference |
| QKV | Query, Key, Value | Components of transformer attention mechanism |
| RDMA | Remote Direct Memory Access | High-speed network memory access |
| RDQuant | Rowwise Dynamic Quantization | Quantization technique with per-row scaling for int8 inference |
| ReLU | Rectified Linear Unit | Activation function: ReLU(x) = max(0, x) |
| ROCm | Radeon Open Compute | AMD's open GPU computing stack |
| SGD | Stochastic Gradient Descent | Optimization algorithm for training neural networks |
| SM | Streaming Multiprocessor | GPU compute unit (NVIDIA terminology; the AMD equivalent is the Compute Unit, CU) |
| SWA | Sliding Window Attention | Attention mechanism with a limited window for each token |
| TLB | Translation Lookaside Buffer | Memory management unit cache for virtual-to-physical address translation |
| VGPR | Vector General Purpose Register | GPU register for vector operations |
| WARP | Group of Threads | Smallest scheduling unit on NVIDIA GPUs (32 threads); the AMD counterpart is the wavefront (32 or 64 threads) |
| WMMA | Wave/Warp Matrix Multiply-Accumulate | Matrix-multiply hardware instruction on AMD RDNA GPUs; NVIDIA uses the same acronym for its warp-level primitive |
| XLA | Accelerated Linear Algebra | Compiler for optimizing ML computations (Google) |
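The formulas in the ELU, GELU, ReLU, and LSE rows above can be sketched in a few lines of plain Python. The function names here are illustrative only, not Composable Kernel APIs:

```python
import math

def relu(x):
    # ReLU(x) = max(0, x)
    return max(0.0, x)

def elu(x, alpha=1.0):
    # ELU(x) = x for x > 0, alpha * (e^x - 1) otherwise
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x):
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_sum_exp(xs):
    # LSE(x) = m + log(sum(exp(x_i - m))) with m = max(x);
    # subtracting the max avoids overflow in exp() for large inputs
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    # Numerically stable softmax via LSE: softmax(x)_i = exp(x_i - LSE(x))
    lse = log_sum_exp(xs)
    return [math.exp(x - lse) for x in xs]
```

The max-subtraction trick in `log_sum_exp` is the reason LSE shows up in attention kernels such as FMHA: it keeps softmax well-behaved even when logits are large.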

The following single-letter symbols commonly appear in kernel code and documentation:

| Symbol | Meaning | Context |
|---|---|---|
| M, N, K | Matrix dimensions | GEMM: C (M×N) = A (M×K) × B (K×N) |
| Q, K, V | Query, Key, Value | Transformer attention |
| S | Sequence length | NLP, transformers |
| D | Dimension | Hidden size, feature dim |
| B | Batch size | ML batch processing |
| H | Head count | Multi-head attention |
| C | Channel | CNNs, tensor layouts |
| T | Token | NLP, sequence models |
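The GEMM dimension symbols (M, N, K) and the NCHW/NHWC layouts can be illustrated with a short plain-Python sketch; the names and shapes are illustrative only:

```python
# GEMM shapes: C (M x N) = A (M x K) x B (K x N); plain Python for clarity
M, N, K = 4, 5, 3
A = [[1.0] * K for _ in range(M)]
B = [[1.0] * N for _ in range(K)]
C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
     for i in range(M)]
# C has M rows and N columns; with all-ones inputs, every entry equals K

def nchw_to_nhwc(t):
    # Reorder a nested-list NCHW tensor (batch, channel, height, width)
    # into NHWC (batch, height, width, channel)
    n, c, h, w = len(t), len(t[0]), len(t[0][0]), len(t[0][0][0])
    return [[[[t[b][ci][hi][wi] for ci in range(c)]
              for wi in range(w)]
             for hi in range(h)]
            for b in range(n)]
```

Channels-first (NCHW) and channels-last (NHWC) hold the same values; only the memory ordering differs, which is why kernel libraries specialize for each layout.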
If you find an acronym not listed here, please submit a pull request or issue!