High-performance CUDA/C++ implementation of matrix-vector and matrix-matrix operations with a focus on memory access patterns, shared-memory optimization, and GPU vs CPU benchmarking.
| Component | Description |
|---|---|
| `src/matvec_kernels.cu` | Three MatVec kernels: naive, shared-memory, warp-coalesced |
| `src/matmul_kernels.cu` | Three MatMul kernels: naive, tiled (TILE=16), variable-tile sweep |
| `src/cpu_ops.cpp` | Scalar CPU reference (ikj loop order) |
| `src/main.cu` | Benchmark driver with GFLOPS, speedup, and correctness checks |
| `include/matrix_ops.cuh` | Kernel declarations + GPU timer + CUDA error macro |
| `include/cpu_ops.h` | CPU operation declarations |
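As a rough illustration of the helpers `include/matrix_ops.cuh` is described as bundling (a GPU timer built on CUDA events plus an error-checking macro), here is a minimal sketch. The names `CUDA_CHECK` and `GpuTimer` are illustrative assumptions, not necessarily the identifiers used in the repo:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line context if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Event-based timer; returns elapsed milliseconds of GPU work.
struct GpuTimer {
    cudaEvent_t start_, stop_;
    GpuTimer()  { CUDA_CHECK(cudaEventCreate(&start_));
                  CUDA_CHECK(cudaEventCreate(&stop_)); }
    ~GpuTimer() { cudaEventDestroy(start_); cudaEventDestroy(stop_); }
    void start() { CUDA_CHECK(cudaEventRecord(start_)); }
    float stop() {                       // blocks until the stream is done
        CUDA_CHECK(cudaEventRecord(stop_));
        CUDA_CHECK(cudaEventSynchronize(stop_));
        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start_, stop_));
        return ms;
    }
};
```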
- Naive global-memory access (baseline)
- Coalesced access with 2-D warp layout for contiguous reads
- Shared-memory tiling to reduce redundant global reads
- 1-D blocks (256 threads) for MatVec
- 2-D blocks (TILE x TILE) for MatMul
- Warp-shuffle (`__shfl_down_sync`) for reduction without shared memory
- `#pragma unroll` in tile inner loops
- Tile-size sweep (8x8, 16x16, 32x32)
- CUDA events for precise GPU timing
- Warm-up runs + averaged timed runs
- GFLOPS formulas for MatVec/MatMul
- `max_abs_diff` correctness checks versus CPU reference
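The warp-shuffle reduction listed above can be sketched as follows. This is an illustrative kernel, not the repo's actual code (the name `matvec_warp_shuffle` and the one-warp-per-row mapping are assumptions), assuming row-major `A` of size `n x n`:

```cuda
#include <cuda_runtime.h>

// y = A * x, one warp per output row; the per-lane partial sums are
// combined with __shfl_down_sync, so no shared memory is needed.
__global__ void matvec_warp_shuffle(const float* A, const float* x,
                                    float* y, int n)
{
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32; // global warp id
    int lane = threadIdx.x % 32;
    if (row >= n) return;

    // Each lane strides across the row: columns lane, lane+32, lane+64, ...
    float sum = 0.0f;
    for (int col = lane; col < n; col += 32)
        sum += A[row * n + col] * x[col];

    // Tree reduction within the warp: lane i accumulates lane i+offset.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[row] = sum;
}
```

Because consecutive lanes read consecutive columns, the global loads from `A` are coalesced, which is the same property the coalesced-access bullet above relies on.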
| Tool | Min version |
|---|---|
| CUDA Toolkit | 11.0 |
| NVCC | 11.0 |
| GCC / Clang | 9 / 10 |
| CMake (optional) | 3.18 |
Find your GPU's SM version:

```
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```

If nvidia-smi does not expose compute_cap on your system, look up your GPU model's compute capability in NVIDIA's documentation.
```
nvcc --version
cmake --version
gh --version
```

If any command is not recognized, add the corresponding bin directory to your PATH and restart your terminal.
Typical install paths:

```
# CUDA (adjust vX.Y to your installed version, e.g. v12.6)
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.Y\bin

# CMake
C:\Program Files\CMake\bin

# GitHub CLI
C:\Program Files\GitHub CLI
```

Temporary for the current terminal session:

```
$env:Path += ";C:\Program Files\CMake\bin;C:\Program Files\GitHub CLI;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin"
```

Persist for your user account:
```
[Environment]::SetEnvironmentVariable(
    "Path",
    $env:Path + ";C:\Program Files\CMake\bin;C:\Program Files\GitHub CLI;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin",
    "User"
)
```

Then open a new terminal and verify:

```
nvcc --version
cmake --version
gh --version
```

Build with Make:

```
# Replace 86 with your GPU's SM version (for example 75, 86, or 89)
make SM=86
make run SM=86

# Debug build
make DEBUG=1 SM=86

# Profiling build
make profile SM=86
ncu --set full ./bin/gpu_matrix_ops
```

Build with CMake:

```
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH="86"
cmake --build . --config Release -j
./bin/gpu_matrix_ops
```

When reporting performance results, include:
- GPU model and VRAM
- CUDA toolkit and driver versions
- target SM architecture (`SM=...` or `-DCUDA_ARCH=...`)
- matrix dimensions and iteration counts
- whether fast-math was enabled
Save logs under results/ (see results/README.md) for auditability.
Sample output:

```
GPU-Based Matrix Operations - Benchmark Suite
GPU  : NVIDIA GeForce RTX 3080
SMs  : 68 | Clock : 1.71 GHz | Mem BW: ~760 GB/s
VRAM : 10240 MB | Shared mem/block: 48 KB

MATRIX-VECTOR (y = A * x)
Matrix-Vector (8192 x 8192) x (8192 x 1)
  CPU (scalar)        47.30 ms      1.16 GFLOPS
  GPU naive            0.09 ms    611.00 GFLOPS   speedup 525.6x    err 0.00e+00
  GPU shared-mem       0.08 ms    702.10 GFLOPS   speedup 591.2x    err 0.00e+00
  GPU coalesced        0.06 ms    890.30 GFLOPS   speedup 788.5x    err 0.00e+00

MATRIX-MATRIX (C = A * B)
Matrix-Matrix (2048 x 2048) * (2048 x 2048)
  CPU (scalar ikj)  4821.00 ms      3.56 GFLOPS
  GPU naive           22.40 ms    765.10 GFLOPS   speedup 215.2x
  GPU tiled (16x16)    4.10 ms   4183.20 GFLOPS   speedup 1175.9x

Tile-size sweep:
  GPU tiled (8x8)      6.90 ms   2488.50 GFLOPS
  GPU tiled (16x16)    4.10 ms   4183.20 GFLOPS   <- sweet spot
  GPU tiled (32x32)    5.60 ms   3064.40 GFLOPS   (register pressure up)
```

Actual numbers vary by GPU model, driver version, and power limits.
```
gpu-matrix-ops/
|-- README.md
|-- LICENSE
|-- CONTRIBUTING.md
|-- CITATION.cff
|-- .gitignore
|-- .gitattributes
|-- CMakeLists.txt
|-- Makefile
|-- include/
|   |-- matrix_ops.cuh
|   `-- cpu_ops.h
|-- src/
|   |-- main.cu
|   |-- matvec_kernels.cu
|   |-- matmul_kernels.cu
|   `-- cpu_ops.cpp
|-- docs/
|   `-- OPTIMIZATION_NOTES.md
`-- results/
    `-- README.md
```
```
# Nsight Compute (CUDA 11+)
ncu --set full --target-processes all ./bin/gpu_matrix_ops

# Legacy nvprof
nvprof --print-gpu-trace ./bin/gpu_matrix_ops

# Example metrics
ncu --metrics \
  l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second,\
  smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,\
  l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum \
  ./bin/gpu_matrix_ops
```

MIT - free to use, modify, and redistribute. See LICENSE.