High-performance CUDA/C++ implementation of matrix-vector and matrix-matrix operations with a focus on memory access patterns, shared-memory optimization, and GPU vs CPU benchmarking.
| Component | Description |
|---|---|
| `src/matvec_kernels.cu` | Three MatVec kernels: naive, shared-memory, warp-coalesced |
| `src/matmul_kernels.cu` | Three MatMul kernels: naive, tiled (TILE=16), variable-tile sweep |
| `src/cpu_ops.cpp` | Scalar CPU reference (ikj loop order) |
| `src/main.cu` | Benchmark driver with GFLOPS, speedup, and correctness checks |
| `include/matrix_ops.cuh` | Kernel declarations + GPU timer + CUDA error macro |
| `include/cpu_ops.h` | CPU operation declarations |
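As a rough illustration of the helpers `include/matrix_ops.cuh` is described as bundling (a GPU timer built on CUDA events plus an error-checking macro), here is a minimal sketch. The names `CUDA_CHECK` and `GpuTimer` are illustrative assumptions, not necessarily the identifiers used in the repo:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line context if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Event-based timer; returns elapsed milliseconds of GPU work.
struct GpuTimer {
    cudaEvent_t start_, stop_;
    GpuTimer()  { CUDA_CHECK(cudaEventCreate(&start_));
                  CUDA_CHECK(cudaEventCreate(&stop_)); }
    ~GpuTimer() { cudaEventDestroy(start_); cudaEventDestroy(stop_); }
    void start() { CUDA_CHECK(cudaEventRecord(start_)); }
    float stop() {                       // blocks until the stream is done
        CUDA_CHECK(cudaEventRecord(stop_));
        CUDA_CHECK(cudaEventSynchronize(stop_));
        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start_, stop_));
        return ms;
    }
};
```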
- Naive global-memory access (baseline)
- Coalesced access with 2-D warp layout for contiguous reads
- Shared-memory tiling to reduce redundant global reads
- 1-D blocks (256 threads) for MatVec
- 2-D blocks (TILE x TILE) for MatMul
- Warp-shuffle (`__shfl_down_sync`) for reduction without shared memory
- `#pragma unroll` in tile inner loops
- Tile-size sweep (8x8, 16x16, 32x32)
- CUDA events for precise GPU timing
- Warm-up runs + averaged timed runs
- GFLOPS formulas for MatVec/MatMul
- `max_abs_diff` correctness checks versus CPU reference
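The warp-shuffle reduction listed above can be sketched as follows. This is an illustrative kernel, not the repo's actual code (the name `matvec_warp_shuffle` and the one-warp-per-row mapping are assumptions), assuming row-major `A` of size `n x n`:

```cuda
#include <cuda_runtime.h>

// y = A * x, one warp per output row; the per-lane partial sums are
// combined with __shfl_down_sync, so no shared memory is needed.
__global__ void matvec_warp_shuffle(const float* A, const float* x,
                                    float* y, int n)
{
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32; // global warp id
    int lane = threadIdx.x % 32;
    if (row >= n) return;

    // Each lane strides across the row: columns lane, lane+32, lane+64, ...
    float sum = 0.0f;
    for (int col = lane; col < n; col += 32)
        sum += A[row * n + col] * x[col];

    // Tree reduction within the warp: lane i accumulates lane i+offset.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[row] = sum;
}
```

Because consecutive lanes read consecutive columns, the global loads from `A` are coalesced, which is the same property the coalesced-access bullet above relies on.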
| Tool | Min version |
|---|---|
| CUDA Toolkit | 11.0 |
| NVCC | 11.0 |
| GCC / Clang | 9 / 10 |
| CMake (optional) | 3.18 |
Find your GPU's SM version:

```
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```

If nvidia-smi does not expose compute_cap on your system, look up your GPU model's compute capability in NVIDIA's documentation.
```
nvcc --version
cmake --version
gh --version
```

If any command is not recognized, add the corresponding bin directory to your PATH and restart your terminal.
Typical install paths:

```
# CUDA (adjust vX.Y to your installed version, e.g. v12.6)
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.Y\bin

# CMake
C:\Program Files\CMake\bin

# GitHub CLI
C:\Program Files\GitHub CLI
```

Temporary for the current terminal session:

```
$env:Path += ";C:\Program Files\CMake\bin;C:\Program Files\GitHub CLI;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin"
```

Persist for your user account:
```
[Environment]::SetEnvironmentVariable(
    "Path",
    $env:Path + ";C:\Program Files\CMake\bin;C:\Program Files\GitHub CLI;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin",
    "User"
)
```

Then open a new terminal and verify:

```
nvcc --version
cmake --version
gh --version
```

Build with Make:

```
# Replace 86 with your GPU's SM version (for example 75, 86, or 89)
make SM=86
make run SM=86

# Debug build
make DEBUG=1 SM=86

# Profiling build
make profile SM=86
ncu --set full ./bin/gpu_matrix_ops
```

Build with CMake:

```
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH="86"
cmake --build . --config Release -j
./bin/gpu_matrix_ops
```

When reporting performance results, include:
- GPU model and VRAM
- CUDA toolkit and driver versions
- target SM architecture (`SM=...` or `-DCUDA_ARCH=...`)
- matrix dimensions and iteration counts
- whether fast-math was enabled
Save logs under results/ (see results/README.md) for auditability.
Sample output:

```
GPU-Based Matrix Operations - Benchmark Suite
GPU  : NVIDIA GeForce RTX 3080
SMs  : 68 | Clock : 1.71 GHz | Mem BW: ~760 GB/s
VRAM : 10240 MB | Shared mem/block: 48 KB

MATRIX-VECTOR (y = A * x)
Matrix-Vector (8192 x 8192) x (8192 x 1)
  CPU (scalar)        47.30 ms      1.16 GFLOPS
  GPU naive            0.09 ms    611.00 GFLOPS   speedup 525.6x    err 0.00e+00
  GPU shared-mem       0.08 ms    702.10 GFLOPS   speedup 591.2x    err 0.00e+00
  GPU coalesced        0.06 ms    890.30 GFLOPS   speedup 788.5x    err 0.00e+00

MATRIX-MATRIX (C = A * B)
Matrix-Matrix (2048 x 2048) * (2048 x 2048)
  CPU (scalar ikj)  4821.00 ms      3.56 GFLOPS
  GPU naive           22.40 ms    765.10 GFLOPS   speedup 215.2x
  GPU tiled (16x16)    4.10 ms   4183.20 GFLOPS   speedup 1175.9x

Tile-size sweep:
  GPU tiled (8x8)      6.90 ms   2488.50 GFLOPS
  GPU tiled (16x16)    4.10 ms   4183.20 GFLOPS   <- sweet spot
  GPU tiled (32x32)    5.60 ms   3064.40 GFLOPS   (register pressure up)
```

Actual numbers vary by GPU model, driver version, and power limits.
```
gpu-matrix-ops/
|-- README.md
|-- LICENSE
|-- CONTRIBUTING.md
|-- CITATION.cff
|-- .gitignore
|-- .gitattributes
|-- CMakeLists.txt
|-- Makefile
|-- include/
|   |-- matrix_ops.cuh
|   `-- cpu_ops.h
|-- src/
|   |-- main.cu
|   |-- matvec_kernels.cu
|   |-- matmul_kernels.cu
|   `-- cpu_ops.cpp
|-- docs/
|   `-- OPTIMIZATION_NOTES.md
`-- results/
    `-- README.md
```
```
# Nsight Compute (CUDA 11+)
ncu --set full --target-processes all ./bin/gpu_matrix_ops

# Legacy nvprof
nvprof --print-gpu-trace ./bin/gpu_matrix_ops

# Example metrics
ncu --metrics \
  l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second,\
  smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,\
  l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum \
  ./bin/gpu_matrix_ops
```

MIT - free to use, modify, and redistribute. See LICENSE.