CppNet is a high-performance C++17 deep learning library for building and training neural networks from scratch. It is built on Eigen for fast tensor operations, OpenMP for CPU parallelism, and CUDA for GPU acceleration.
- Features
- Installation
- Quick Start
- API Overview
- Examples
- GPU Acceleration
- Benchmarks
- Testing
- Project Structure
- Roadmap
- Contributing
- License
## Features

- High Performance — Vectorized tensor operations via Eigen, multi-threaded with OpenMP, full CUDA GPU backend for all layers, activations, losses, and optimizers.
- Rich Layer Library — Linear, Conv2D, MaxPool2D, RNN, LSTM, GRU, Multi-Head Attention, Dropout, BatchNorm, Embedding, Residual, GlobalPool, MeanPool1D, Flatten.
- Multiple Backends — Per-layer compute backend selection: `"cpu-eigen"` (Eigen contractions), `"cpu"` (OpenMP loops), `"gpu"` (CUDA kernels).
- Complete CUDA Coverage — 41 CUDA kernel files covering all layers, activations, losses, and optimizers for end-to-end GPU training.
- Modular Architecture — Clean separation of layers, activations, losses, optimizers, metrics, regularizations, and utilities.
- Training Utilities — DataLoader with batching & shuffling, learning rate schedulers, early stopping callbacks, gradient clipping, model serialization.
- Visualization — Built-in `TrainingLogger` for tracking metrics and exporting training history to CSV.
- Extensible — Abstract base classes for layers, losses, and optimizers make it straightforward to add custom components.
- Single-Header Access — `#include <CppNet/CppNet.hpp>` brings in the entire library.
## Installation

| Dependency | Version | Required |
|---|---|---|
| C++ compiler (GCC, Clang, MSVC) | C++17 support | Yes |
| CMake | ≥ 3.18 | Yes |
| Eigen3 | ≥ 3.3 | Yes |
| OpenMP | any | Optional (CPU parallelism) |
| CUDA Toolkit | any | Optional (GPU acceleration) |
```bash
git clone https://github.com/LoqmanSamani/CppNet.git
cd CppNet
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
sudo make install
```

This installs headers to `/usr/local/include/CppNet/` and the static library to `/usr/local/lib/`.
To use CppNet from your own CMake project:

```cmake
find_package(CppNet REQUIRED)
target_link_libraries(your_target PRIVATE CppNet::CppNet)
```

## Quick Start

A minimal binary classification example:
```cpp
#include <CppNet/CppNet.hpp>
#include <iostream>

int main() {
    // X_train, Y_train: feature/label tensors prepared beforehand (omitted here)

    // Define layers
    CppNet::Layers::Linear layer1(30, 64, "fc1", true, true, "cpu-eigen", "xavier");
    CppNet::Layers::Linear layer2(64, 1, "fc2", true, true, "cpu-eigen", "xavier");
    CppNet::Activations::ReLU relu("cpu-eigen");
    CppNet::Activations::Sigmoid sigmoid;

    // Loss & optimizer
    CppNet::Losses::BinaryCrossEntropy loss_fn("mean");
    CppNet::Optimizers::Adam optimizer;
    float lr = 0.001f;

    // Training loop
    for (int epoch = 0; epoch < 100; ++epoch) {
        auto h = relu.forward(layer1.forward(X_train));
        auto pred = sigmoid.forward(layer2.forward(h));

        float loss = loss_fn.forward(pred, Y_train);
        auto grad = loss_fn.backward(pred, Y_train);

        grad = layer2.backward(sigmoid.backward(grad));
        layer1.backward(relu.backward(grad));

        layer2.step(optimizer, lr);
        layer1.step(optimizer, lr);

        std::cout << "Epoch " << epoch << " — Loss: " << loss << std::endl;
    }
    return 0;
}
```

## API Overview

### Layers

All layers inherit from `CppNet::Layers::Layer` and implement `forward()`, `backward()`, `step()`, `freeze()`, `unfreeze()`, and `print_layer_info()`.
| Layer | Description | Key Parameters |
|---|---|---|
| `Linear` | Fully connected layer | `in_size`, `out_size`, `bias`, `device`, `weight_init` |
| `Conv2D` | 2D convolution | `in_channels`, `out_channels`, `kernel_size`, `stride`, `padding` |
| `MaxPool2D` | 2D max pooling | `kernel_size`, `stride` |
| `Flatten` | Reshape to 2D | — |
| `RNN` | Vanilla recurrent layer | `input_size`, `hidden_size` |
| `LSTM` | Long Short-Term Memory | `input_size`, `hidden_size` |
| `GRU` | Gated Recurrent Unit | `input_size`, `hidden_size` |
| `MultiHeadAttention` | Scaled dot-product multi-head attention | `embed_dim`, `num_heads` |
| `Dropout` | Dropout regularization | `drop_rate` |
| `BatchNorm` | Batch normalization | `num_features` |
| `Embedding` | Embedding lookup table | `vocab_size`, `embed_dim` |
| `Residual` | Residual (skip) connection wrapper | — |
| `GlobalPool` | Global average/max pooling | — |
| `MeanPool1D` | Mean pooling over sequence dimension | — |
### Activations

| Activation | Function |
|---|---|
| `ReLU` | max(0, x) |
| `LeakyReLU` | x if x > 0, else αx |
| `Sigmoid` | 1 / (1 + e⁻ˣ) |
| `Tanh` | tanh(x) |
| `Softmax` | exp(xᵢ) / Σⱼ exp(xⱼ) |

All activations support both 2D and 4D tensor inputs and run on all three backends (`cpu-eigen`, `cpu`, `gpu`).
### Losses

| Loss | Typical Use |
|---|---|
| `MSE` | Regression |
| `MAE` | Regression |
| `Huber` | Robust regression |
| `BinaryCrossEntropy` | Binary classification |
| `CategoricalCrossEntropy` | Multi-class classification |
| `SoftmaxCrossEntropy` | Multi-class (fused softmax + CE) |
All support configurable reduction modes (`"mean"`, `"sum"`) and CUDA GPU acceleration.
### Optimizers

| Optimizer | Description |
|---|---|
| `SGD` | Stochastic Gradient Descent |
| `Adam` | Adaptive Moment Estimation |
| `Adagrad` | Adaptive gradient accumulation |
| `Momentum` | SGD with momentum |
| `RMSProp` | Root Mean Square Propagation |
All optimizers have dedicated CUDA kernels for GPU-side weight updates.
### Metrics

```cpp
CppNet::Metrics::accuracy(predictions, targets);
CppNet::Metrics::binary_accuracy(predictions, targets, 0.5);
CppNet::Metrics::precision(predictions, targets, 0.5);
CppNet::Metrics::recall(predictions, targets, 0.5);
CppNet::Metrics::f1_score(predictions, targets, 0.5);
```

### Regularizations

```cpp
CppNet::Regularizations::l1_penalty(weights, lambda);
CppNet::Regularizations::l2_penalty(weights, lambda);
CppNet::Regularizations::elastic_net_penalty(weights, lambda, l1_ratio);
// Corresponding gradient functions: l1_gradient, l2_gradient, elastic_net_gradient
```

### Utilities

| Utility | Description |
|---|---|
| DataLoader | Batched iteration with shuffling. Supports range-based for loops. |
| Weight Init | Xavier (uniform/normal), He (uniform/normal), constant, custom. |
| Gradient Clipping | `clip_by_value()` and `clip_by_norm()`. |
| Serialization | `save_model()` / `load_model()` for full model persistence; tensor-level binary I/O. |
| LR Schedulers | `StepLR`, `ExponentialLR`, `CosineAnnealingLR`. |
| Callbacks | `EarlyStopping` with configurable patience, delta, and mode. |
| Elapsed Time | Training duration measurement. |
`DataLoader` example:

```cpp
CppNet::Utils::DataLoader loader(X, Y, /*batch_size=*/32, /*shuffle=*/true);
for (auto& [x_batch, y_batch] : loader) {
    // forward / backward / step
}
loader.reset(); // re-shuffle for the next epoch
```

Learning rate scheduler example:
```cpp
CppNet::Schedulers::CosineAnnealingLR scheduler(/*initial_lr=*/0.01, /*T_max=*/100);
for (int epoch = 0; epoch < 100; ++epoch) {
    float lr = scheduler.step();
    // ... train with lr
}
```

`TrainingLogger` example:

```cpp
CppNet::Visualizations::TrainingLogger logger;

// Inside training loop:
logger.log("train_loss", loss);
logger.log("val_accuracy", val_acc);
logger.next_epoch();

// After training:
logger.print_epoch_summary();
logger.export_csv("training_history.csv");
```

## Examples

The `examples/` directory contains complete, self-contained deep learning programs that train on synthetic data — no downloads required. Each example generates its own dataset, trains a model, and reports final metrics.
| Example | Architecture | Dataset | Key Components | Result |
|---|---|---|---|---|
| `mlp_classification.cpp` | Linear→ReLU→Linear→ReLU→Linear | 3-class spiral (600 samples, 2D) | ReLU, SoftmaxCrossEntropy, Adam | ~75% accuracy |
| `cnn_image_classification.cpp` | Conv2D→ReLU→MaxPool2D→Flatten→Linear | 8×8 stripe images (400 samples) | Conv2D, MaxPool2D, SoftmaxCrossEntropy, Adam | 100% accuracy |
| `rnn_sequence_prediction.cpp` | LSTM(1,16)→Linear(16,1) | Sine-wave sequences (400 samples) | LSTM, MSE, Adam | MSE ≈ 0.00001 |
| `gru_sequence_prediction.cpp` | GRU(1,16)→Linear(16,1) | Sine-wave sequences (400 samples) | GRU, MAE, Momentum | MAE ≈ 0.010 |
| `transformer_classifier.cpp` | Embedding→Attention+skip→ReLU→Linear | Token sequences (400 samples) | Embedding, MultiHeadAttention, MeanPool1D | 100% accuracy |
| `resnet_classifier.cpp` | Linear→ReLU→ResBlock(32)→Linear→Sigmoid | Concentric circles (600 samples) | Residual, GradientClip, He init | ~99% accuracy |
| `regularized_cnn.cpp` | Conv2D→LeakyReLU→Pool→BN→Dropout→FC | 8×8 pattern images (600 samples, 3 classes) | BatchNorm, Dropout, LeakyReLU, CategoricalCrossEntropy, Adagrad | 100% accuracy |
| `optimizer_comparison.cpp` | Linear→Tanh→Linear→Tanh→Linear | Regression: y = sin(x₀)·cos(x₁) (500 samples) | SGD, Momentum, Adagrad, RMSProp, Adam, Tanh, Huber | loss ≈ 0.002 |
Build and run:
```bash
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=ON
make -j$(nproc)

./examples/mlp_classification
./examples/cnn_image_classification
./examples/rnn_sequence_prediction
./examples/gru_sequence_prediction
./examples/transformer_classifier
./examples/resnet_classifier
./examples/regularized_cnn
./examples/optimizer_comparison
```

## GPU Acceleration

CppNet provides full CUDA GPU support across all layers, activations, losses, and optimizers. When CUDA is detected at build time, layers can target the GPU backend:
```cpp
CppNet::Layers::Linear layer(784, 256, "fc1", true, true, "gpu", "xavier");
```

| Category | CUDA Kernels |
|---|---|
| Linear algebra | matmul, matmul_grad_input, matmul_grad_weight, add_bias, bias_grad, elementwise |
| Convolution | conv2d_forward, conv2d_backward, maxpool2d_forward, maxpool2d_backward |
| Recurrent | rnn_cell, lstm_cell, gru_cell |
| Attention | attention_scores (scale, softmax, backward), embedding_forward, embedding_backward |
| Normalization | batch_norm_forward, batch_norm_backward, dropout |
| Pooling | global_avg_pool2d, global_max_pool2d, mean_pool1d |
| Activations | relu, relu_grad, leaky_relu, leaky_relu_grad, sigmoid, sigmoid_grad, tanh_activation, tanh_activation_grad |
| Losses | mse, mae, huber, bce, categorical_ce, softmax_ce |
| Optimizers | sgd_step, momentum_step, adagrad_step, rmsprop_step, adam_step |
To force a CPU-only build even when CUDA is present:

```bash
cmake .. -DCUDAToolkit_ROOT=/nonexistent
```

## Benchmarks

Five benchmarks compare three compute backends — `cpu-eigen` (Eigen SIMD contractions), `cpu` (OpenMP loops), and `gpu` (CUDA kernels) — across different architectures and model sizes. All benchmarks are reproducible via the scripts in the `benchmarks/` directory.
| Architecture | Model Size | GPU Speedup vs `cpu-eigen` | Key Observation |
|---|---|---|---|
| MLP | Small (4.5K params) | 2.0x | GPU overhead limits gains for small matmuls |
| | Medium (66K params) | 6.9x | — |
| | Large (660K params) | 14.3x | — |
| | XLarge (2.6M params) | 25.3x | Sub-linear GPU time scaling with params |
| CNN | Small (Conv16→32) | 28.8x | Convolution is highly GPU-parallel |
| | Medium (Conv32→64→FC128) | 42.0x | Highest CNN speedup |
| RNN/LSTM/GRU | Small (H=64) | 2.2–5.2x | GRU benefits most from GPU |
| | Medium (H=128) | 4.7–15.5x | — |
| | Large (H=256) | 12.2–56.4x | GRU Large achieves 56.4x — highest overall |
| Transformer | Small (d=32, h=2) | 0.5x (slower) | GPU overhead dominates at small scale |
| | Medium (d=64, h=4) | 1.0x (break-even) | — |
| | Large (d=128, h=8) | 1.2x | Modest gain; hybrid CPU/GPU attention |
| ResNet | Small (W=64, D=2) | 1.6x | Depth amplifies GPU advantage |
| | Medium (W=128, D=4) | 6.7x | — |
| | Large (W=256, D=6) | 9.0x | Skip connections add negligible overhead |
- GPU advantage grows with model size. Across all architectures, larger models see dramatically higher GPU speedups as matrix sizes better saturate GPU cores.
- CNNs and recurrent layers benefit most from GPU. Convolution achieves up to 42x speedup; GRU achieves up to 56.4x — the highest across all benchmarks.
- Transformers show modest GPU gains at tested scales due to mixed operations (embedding lookups, attention softmax, multiple small projections) and a hybrid CPU/GPU attention path.
- Eigen (`cpu-eigen`) consistently outperforms OpenMP (`cpu`) for all architectures, leveraging SIMD vectorization and cache-optimal memory layouts.
- Numerical consistency is verified across all backends — all devices converge to equivalent loss and accuracy values.
| Architecture | Avg GPU Speedup | Best GPU Speedup | Best Config |
|---|---|---|---|
| MLP | 12.1x | 25.3x | XLarge (2.6M params) |
| CNN | 35.4x | 42.0x | Medium (Conv32→64→FC128) |
| Sequence (RNN/LSTM/GRU) | 13.7x | 56.4x | GRU Large (H=256, seq=50) |
| Transformer | 0.9x | 1.2x | Large (d=128, h=8) |
| ResNet | 5.8x | 9.0x | Large (W=256, D=6) |
```bash
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_BENCHMARKS=ON
make -j$(nproc)

./benchmarks/mlp_benchmark
./benchmarks/cnn_benchmark
./benchmarks/sequence_benchmark
./benchmarks/transformer_benchmark
./benchmarks/residual_benchmark
```

See `benchmarks/benchmarks.md` for full per-epoch results, detailed speedup analysis, and methodology.
## Testing

CppNet has 41 unit tests with 377 test cases covering every module:
```bash
cd build
cmake .. -DBUILD_TESTS=ON
make -j$(nproc)
ctest --output-on-failure
```

| Category | Tests | Test Cases |
|---|---|---|
| Layers (13) | Linear, Conv2D, Flatten, MaxPool2D, RNN, Attention, BatchNorm, Dropout, Embedding, GlobalPool, GRU, LSTM, Residual | 123 |
| Activations (5) | ReLU, Sigmoid, Softmax, Tanh, LeakyReLU | 55 |
| Losses (6) | BinaryCrossEntropy, CategoricalCrossEntropy, MSE, MAE, Huber, SoftmaxCrossEntropy | 52 |
| Optimizers (5) | SGD, Adam, Momentum, Adagrad, RMSProp | 34 |
| Utilities (7) | Metrics, Regularizations, Callbacks, DataLoader, ElapsedTime, GradientClip, Init | 65 |
| GPU Kernels (1) | GPU matmul via Linear layer (forward, backward, step, CPU/GPU comparison) | 7 |
| Other (4) | Schedulers, Utils, Models, Visualizations | 41 |
Each test validates forward pass, backward pass (gradient shapes & values), parameter updates, and GPU/CPU numerical consistency where applicable.
## Project Structure

```
CppNet/
├── CMakeLists.txt        # Top-level build configuration
├── cmake/                # CMake package config templates
├── include/CppNet/       # Public headers
│   ├── CppNet.hpp        # Single-include entry point
│   ├── activations/      # ReLU, Sigmoid, Softmax, Tanh, LeakyReLU
│   ├── layers/           # Linear, Conv2D, RNN, LSTM, GRU, Attention, ...
│   ├── losses/           # MSE, MAE, Huber, BCE, CCE, SoftmaxCE
│   ├── optimizers/       # SGD, Adam, Adagrad, Momentum, RMSProp
│   ├── models/           # SequentialModel
│   ├── metrics/          # Accuracy, Precision, Recall, F1
│   ├── regularizations/  # L1, L2, Elastic Net
│   ├── kernels/gpu/      # CUDA kernel declarations
│   ├── utils/            # DataLoader, Init, Schedulers, Serialization, ...
│   └── visualizations/   # TrainingLogger
├── src/CppNet/           # Implementation files (.cpp / .cu)
│   └── kernels/gpu/      # 41 CUDA kernel implementations
├── tests/                # 41 unit tests (377 test cases)
├── examples/             # 8 deep learning examples
├── benchmarks/           # 5 device benchmarks (CPU vs GPU)
└── docs/                 # Additional documentation
```
## Roadmap

- Core layer library (Linear, Conv2D, Pooling, RNN, LSTM, GRU, Attention, BatchNorm, Dropout, Embedding, Residual)
- Activation functions (ReLU, Sigmoid, Tanh, Softmax, LeakyReLU)
- Loss functions (MSE, MAE, Huber, BCE, CCE, SoftmaxCE)
- Optimizers (SGD, Adam, Adagrad, Momentum, RMSProp)
- DataLoader, LR schedulers, early stopping, gradient clipping
- Model serialization (save/load)
- Full CUDA GPU backend — 41 kernels covering all layers, activations, losses, and optimizers
- OpenMP CPU parallelism
- Comprehensive test suite (41 tests, 377 test cases)
- Deep learning examples (MLP, CNN, RNN/LSTM, GRU, Transformer, ResNet, Regularized CNN, Optimizer Comparison)
- Device benchmarks (MLP, CNN, Sequence, Transformer, ResNet)
- Add Trainer abstraction with built-in training loop
- Additional examples (GANs, Reinforcement Learning, NLP pipelines)
- Python bindings (pybind11)
- Comprehensive API reference documentation
## Contributing

Contributions are welcome! To get started:

- Fork the repository and create a feature branch.
- Follow the existing coding style — headers in `include/CppNet/`, implementations in `src/CppNet/`.
- Add tests for new functionality in `tests/`.
- Make sure all tests pass: `cd build && ctest --output-on-failure`.
- Open a pull request with a clear description of your changes.
## License

CppNet is released under the MIT License.

Copyright © 2025 Loghman Samani