
Olajide-Badejo/GPU-Based-Matrix-Operations

GPU-Based Matrix Operations

High-performance CUDA/C++ implementation of matrix-vector and matrix-matrix operations with a focus on memory access patterns, shared-memory optimization, and GPU vs CPU benchmarking.


Project Overview

| Component | Description |
|-----------|-------------|
| `src/matvec_kernels.cu` | Three MatVec kernels: naive, shared-memory, warp-coalesced |
| `src/matmul_kernels.cu` | Three MatMul kernels: naive, tiled (TILE=16), variable-tile sweep |
| `src/cpu_ops.cpp` | Scalar CPU reference (ikj loop order) |
| `src/main.cu` | Benchmark driver with GFLOPS, speedup, and correctness checks |
| `include/matrix_ops.cuh` | Kernel declarations + GPU timer + CUDA error macro |
| `include/cpu_ops.h` | CPU operation declarations |
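The ikj loop order used by the CPU reference is worth spelling out: hoisting `A[i][k]` out of the inner loop turns the innermost traversal into sequential streaming over rows of `B` and `C`, which is much more cache-friendly than the textbook ijk order. A minimal sketch (illustrative names, not the repo's exact signatures):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Scalar ikj-order matmul (C = A * B) over row-major N x N matrices.
// The k loop sits in the middle so the innermost j loop streams
// contiguously through one row of B and one row of C.
void matmul_ikj(const std::vector<float>& A,
                const std::vector<float>& B,
                std::vector<float>& C, std::size_t N) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const float a = A[i * N + k];   // hoisted: constant for inner loop
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}
```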

Key Techniques Demonstrated

Memory Access Patterns

  • Naive global-memory access (baseline)
  • Coalesced access with 2-D warp layout for contiguous reads
  • Shared-memory tiling to reduce redundant global reads
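The tiling idea above can be sketched as follows. This is an illustrative kernel, not the repo's exact implementation, and it assumes `N` is a multiple of `TILE` for brevity: each block stages a TILE x TILE sub-tile of `A` and `B` into shared memory, so every global element is read once per tile instead of TILE times.

```cuda
#define TILE 16

// Sketch of a shared-memory tiled matmul (C = A * B, N x N, row-major).
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile; loads are coalesced
        // because threadIdx.x indexes consecutive columns.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // don't overwrite tiles while others still read
    }
    C[row * N + col] = acc;
}
```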

Thread-Block Optimizations

  • 1-D blocks (256 threads) for MatVec
  • 2-D blocks (TILE x TILE) for MatMul
  • Warp-shuffle (__shfl_down_sync) for reduction without shared memory
  • #pragma unroll in tile inner loops
  • Tile-size sweep (8x8, 16x16, 32x32)
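The warp-shuffle reduction mentioned above can be sketched for MatVec like this (again an illustrative kernel, not the repo's exact code): one warp handles one output row, lanes read the row with stride 32 so consecutive lanes hit consecutive addresses, then `__shfl_down_sync` folds the 32 partial sums without touching shared memory.

```cuda
// Sketch: warp-per-row matvec (y = A * x), A row-major, rows x cols.
__global__ void matvec_warp(const float* A, const float* x, float* y,
                            int rows, int cols) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= rows) return;

    float sum = 0.0f;
    for (int c = lane; c < cols; c += 32)   // coalesced per-warp reads
        sum += A[row * cols + c] * x[c];

    // Tree reduction across the warp: 16, 8, 4, 2, 1.
    for (int off = 16; off > 0; off >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, off);

    if (lane == 0) y[row] = sum;            // lane 0 holds the full dot product
}
```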

Benchmarking

  • CUDA events for precise GPU timing
  • Warm-up runs + averaged timed runs
  • GFLOPS formulas for MatVec/MatMul
  • max_abs_diff correctness checks versus CPU reference
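The GFLOPS and correctness bullets boil down to a few lines of host code. A sketch with illustrative names (the repo's actual helpers may differ): MatVec performs 2·M·N flops (one multiply plus one add per element of A), MatMul performs 2·M·N·K.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// GFLOPS from flop count and elapsed milliseconds:
// flops / (ms * 1e-3 s/ms * 1e9 flops/GFLOP) = flops / (ms * 1e6).
double gflops_matvec(std::size_t M, std::size_t N, double ms) {
    return (2.0 * M * N) / (ms * 1e6);
}
double gflops_matmul(std::size_t M, std::size_t N, std::size_t K, double ms) {
    return (2.0 * M * N * K) / (ms * 1e6);
}

// Correctness check: largest absolute difference between the GPU
// result and the CPU reference, accumulated in double.
double max_abs_diff(const float* a, const float* b, std::size_t n) {
    double m = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        m = std::max(m, static_cast<double>(std::fabs(a[i] - b[i])));
    return m;
}
```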

Prerequisites

| Tool | Min version |
|------|-------------|
| CUDA Toolkit | 11.0 |
| NVCC | 11.0 |
| GCC / Clang | 9 / 10 |
| CMake (optional) | 3.18 |

Find your GPU's SM version:

nvidia-smi --query-gpu=compute_cap --format=csv,noheader

If nvidia-smi does not expose compute_cap on your system, look up your GPU model's compute capability in NVIDIA's CUDA GPU documentation.

Quick Toolchain Verification

nvcc --version
cmake --version
gh --version

If any command is not recognized, add the corresponding bin directory to your PATH and restart your terminal.

Windows PATH Setup (PowerShell)

Typical install paths:

# CUDA (adjust vX.Y to your installed version, e.g. v12.6)
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.Y\bin

# CMake
C:\Program Files\CMake\bin

# GitHub CLI
C:\Program Files\GitHub CLI

Temporary for current terminal session:

$env:Path += ";C:\Program Files\CMake\bin;C:\Program Files\GitHub CLI;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin"

Persist for your user account:

[Environment]::SetEnvironmentVariable(
  "Path",
  $env:Path + ";C:\Program Files\CMake\bin;C:\Program Files\GitHub CLI;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin",
  "User"
)

Then open a new terminal and verify:

nvcc --version
cmake --version
gh --version

Build and Run

Option A: Makefile

# Replace 86 with your GPU's SM version (for example 75, 86, or 89)
make SM=86
make run SM=86

# Debug build
make DEBUG=1 SM=86

# Profiling build
make profile SM=86
ncu --set full ./bin/gpu_matrix_ops

Option B: CMake

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH="86"
cmake --build . --config Release -j
./bin/gpu_matrix_ops

Reproducibility Checklist

When reporting performance results, include:

  • GPU model and VRAM
  • CUDA toolkit and driver versions
  • target SM architecture (SM=... or -DCUDA_ARCH=...)
  • matrix dimensions and iteration counts
  • whether fast-math was enabled

Save logs under results/ (see results/README.md) for auditability.


Expected Output (RTX 3080, SM86)

GPU-Based Matrix Operations - Benchmark Suite

GPU  : NVIDIA GeForce RTX 3080
SMs  : 68  |  Clock : 1.71 GHz  |  Mem BW: ~760 GB/s
VRAM : 10240 MB  |  Shared mem/block: 48 KB

MATRIX-VECTOR  (y = A * x)
Matrix-Vector  (8192 x 8192)  x (8192 x 1)
CPU (scalar)               47.30 ms    1.16 GFLOPS
GPU naive                   0.09 ms  611.00 GFLOPS  speedup 525.6x  err 0.00e+00
GPU shared-mem              0.08 ms  702.10 GFLOPS  speedup 591.2x  err 0.00e+00
GPU coalesced               0.06 ms  890.30 GFLOPS  speedup 788.5x  err 0.00e+00

MATRIX-MATRIX  (C = A * B)
Matrix-Matrix  (2048 x 2048) * (2048 x 2048)
CPU (scalar ikj)         4821.00 ms    3.56 GFLOPS
GPU naive                  22.40 ms  765.10 GFLOPS  speedup 215.2x
GPU tiled (16x16)           4.10 ms 4183.20 GFLOPS  speedup 1175.9x
GPU tiled (8x8)             6.90 ms 2488.50 GFLOPS
GPU tiled (16x16)           4.10 ms 4183.20 GFLOPS  <- sweet spot
GPU tiled (32x32)           5.60 ms 3064.40 GFLOPS  (register pressure up)

Actual numbers vary by GPU model, driver version, and power limits.


Project Structure

gpu-matrix-ops/
|-- README.md
|-- LICENSE
|-- CONTRIBUTING.md
|-- CITATION.cff
|-- .gitignore
|-- .gitattributes
|-- CMakeLists.txt
|-- Makefile
|-- include/
|   |-- matrix_ops.cuh
|   `-- cpu_ops.h
|-- src/
|   |-- main.cu
|   |-- matvec_kernels.cu
|   |-- matmul_kernels.cu
|   `-- cpu_ops.cpp
|-- docs/
|   `-- OPTIMIZATION_NOTES.md
`-- results/
    `-- README.md

Profiling

# Nsight Compute (CUDA 11+)
ncu --set full --target-processes all ./bin/gpu_matrix_ops

# Legacy nvprof
nvprof --print-gpu-trace ./bin/gpu_matrix_ops

# Example metrics
ncu --metrics \
  l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second,\
  smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,\
  l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum \
  ./bin/gpu_matrix_ops

License

MIT - free to use, modify, and redistribute. See LICENSE.
