CppNet is a high-performance C++17 deep learning library for building and training neural networks from scratch. It is built on Eigen for fast tensor operations, OpenMP for CPU parallelism, and CUDA for GPU acceleration.
- Features
- Installation
- Quick Start
- API Overview
- Examples
- GPU Acceleration
- Benchmarks
- Testing
- Project Structure
- Roadmap
- Contributing
- License
## Features

- High Performance — Vectorized tensor operations via Eigen, multi-threaded with OpenMP, full CUDA GPU backend for all layers, activations, losses, and optimizers.
- Rich Layer Library — Linear, Conv2D, MaxPool2D, RNN, LSTM, GRU, Multi-Head Attention, Dropout, BatchNorm, Embedding, Residual, GlobalPool, MeanPool1D, Flatten.
- Multiple Backends — Per-layer compute backend selection: `"cpu-eigen"` (Eigen contractions), `"cpu"` (OpenMP loops), `"gpu"` (CUDA kernels).
- Complete CUDA Coverage — 41 CUDA kernel files covering all layers, activations, losses, and optimizers for end-to-end GPU training.
- Modular Architecture — Clean separation of layers, activations, losses, optimizers, metrics, regularizations, and utilities.
- Training Utilities — DataLoader with batching & shuffling, learning rate schedulers, early stopping callbacks, gradient clipping, model serialization.
- Visualization — Built-in `TrainingLogger` for tracking metrics and exporting training history to CSV.
- Extensible — Abstract base classes for layers, losses, and optimizers make it straightforward to add custom components.
- Single-Header Access — `#include <CppNet/CppNet.hpp>` brings in the entire library.
## Installation

| Dependency | Version | Required |
|---|---|---|
| C++ compiler (GCC, Clang, MSVC) | C++17 support | Yes |
| CMake | ≥ 3.18 | Yes |
| Eigen3 | ≥ 3.3 | Yes |
| OpenMP | any | Optional (CPU parallelism) |
| CUDA Toolkit | any | Optional (GPU acceleration) |
```bash
git clone https://github.com/LoqmanSamani/CppNet.git
cd CppNet
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
sudo make install
```

This installs headers to `/usr/local/include/CppNet/` and the static library to `/usr/local/lib/`.
To use CppNet from your own CMake project:

```cmake
find_package(CppNet REQUIRED)
target_link_libraries(your_target PRIVATE CppNet::CppNet)
```

## Quick Start

A minimal binary classification example:
```cpp
#include <CppNet/CppNet.hpp>
#include <iostream>

int main() {
    // X_train, Y_train: feature/label tensors prepared beforehand (omitted here)

    // Define layers
    CppNet::Layers::Linear layer1(30, 64, "fc1", true, true, "cpu-eigen", "xavier");
    CppNet::Layers::Linear layer2(64, 1, "fc2", true, true, "cpu-eigen", "xavier");
    CppNet::Activations::ReLU relu("cpu-eigen");
    CppNet::Activations::Sigmoid sigmoid;

    // Loss & optimizer
    CppNet::Losses::BinaryCrossEntropy loss_fn("mean");
    CppNet::Optimizers::Adam optimizer;
    float lr = 0.001f;

    // Training loop
    for (int epoch = 0; epoch < 100; ++epoch) {
        auto h = relu.forward(layer1.forward(X_train));
        auto pred = sigmoid.forward(layer2.forward(h));

        float loss = loss_fn.forward(pred, Y_train);
        auto grad = loss_fn.backward(pred, Y_train);

        grad = layer2.backward(sigmoid.backward(grad));
        layer1.backward(relu.backward(grad));

        layer2.step(optimizer, lr);
        layer1.step(optimizer, lr);

        std::cout << "Epoch " << epoch << " — Loss: " << loss << std::endl;
    }
    return 0;
}
```

## API Overview

### Layers

All layers inherit from `CppNet::Layers::Layer` and implement `forward()`, `backward()`, `step()`, `freeze()`, `unfreeze()`, and `print_layer_info()`.
| Layer | Description | Key Parameters |
|---|---|---|
| `Linear` | Fully connected layer | `in_size`, `out_size`, `bias`, `device`, `weight_init` |
| `Conv2D` | 2D convolution | `in_channels`, `out_channels`, `kernel_size`, `stride`, `padding` |
| `MaxPool2D` | 2D max pooling | `kernel_size`, `stride` |
| `Flatten` | Reshape to 2D | — |
| `RNN` | Vanilla recurrent layer | `input_size`, `hidden_size` |
| `LSTM` | Long Short-Term Memory | `input_size`, `hidden_size` |
| `GRU` | Gated Recurrent Unit | `input_size`, `hidden_size` |
| `MultiHeadAttention` | Scaled dot-product multi-head attention | `embed_dim`, `num_heads` |
| `Dropout` | Dropout regularization | `drop_rate` |
| `BatchNorm` | Batch normalization | `num_features` |
| `Embedding` | Embedding lookup table | `vocab_size`, `embed_dim` |
| `Residual` | Residual (skip) connection wrapper | — |
| `GlobalPool` | Global average/max pooling | — |
| `MeanPool1D` | Mean pooling over sequence dimension | — |
### Activations

| Activation | Function |
|---|---|
| `ReLU` | max(0, x) |
| `LeakyReLU` | x if x > 0, else αx |
| `Sigmoid` | 1 / (1 + e⁻ˣ) |
| `Tanh` | tanh(x) |
| `Softmax` | exp(xᵢ) / Σⱼ exp(xⱼ) |

All activations support both 2D and 4D tensor inputs and run on all three backends (`cpu-eigen`, `cpu`, `gpu`).
### Losses

| Loss | Typical Use |
|---|---|
| `MSE` | Regression |
| `MAE` | Regression |
| `Huber` | Robust regression |
| `BinaryCrossEntropy` | Binary classification |
| `CategoricalCrossEntropy` | Multi-class classification |
| `SoftmaxCrossEntropy` | Multi-class (fused softmax + CE) |
All support configurable reduction modes (`"mean"`, `"sum"`) and CUDA GPU acceleration.
### Optimizers

| Optimizer | Description |
|---|---|
| `SGD` | Stochastic Gradient Descent |
| `Adam` | Adaptive Moment Estimation |
| `Adagrad` | Adaptive gradient accumulation |
| `Momentum` | SGD with momentum |
| `RMSProp` | Root Mean Square Propagation |
All optimizers have dedicated CUDA kernels for GPU-side weight updates.
### Metrics

```cpp
CppNet::Metrics::accuracy(predictions, targets);
CppNet::Metrics::binary_accuracy(predictions, targets, 0.5);
CppNet::Metrics::precision(predictions, targets, 0.5);
CppNet::Metrics::recall(predictions, targets, 0.5);
CppNet::Metrics::f1_score(predictions, targets, 0.5);
```

### Regularizations

```cpp
CppNet::Regularizations::l1_penalty(weights, lambda);
CppNet::Regularizations::l2_penalty(weights, lambda);
CppNet::Regularizations::elastic_net_penalty(weights, lambda, l1_ratio);
// Corresponding gradient functions: l1_gradient, l2_gradient, elastic_net_gradient
```

### Utilities

| Utility | Description |
|---|---|
| DataLoader | Batched iteration with shuffling. Supports range-based for loops. |
| Weight Init | Xavier (uniform/normal), He (uniform/normal), constant, custom. |
| Gradient Clipping | `clip_by_value()` and `clip_by_norm()`. |
| Serialization | `save_model()` / `load_model()` for full model persistence; tensor-level binary I/O. |
| LR Schedulers | `StepLR`, `ExponentialLR`, `CosineAnnealingLR`. |
| Callbacks | `EarlyStopping` with configurable patience, delta, and mode. |
| Elapsed Time | Training duration measurement. |
`DataLoader` example:

```cpp
CppNet::Utils::DataLoader loader(X, Y, /*batch_size=*/32, /*shuffle=*/true);
for (auto& [x_batch, y_batch] : loader) {
    // forward / backward / step
}
loader.reset(); // re-shuffle for the next epoch
```

Learning rate scheduler example:
```cpp
CppNet::Schedulers::CosineAnnealingLR scheduler(/*initial_lr=*/0.01, /*T_max=*/100);
for (int epoch = 0; epoch < 100; ++epoch) {
    float lr = scheduler.step();
    // ... train with lr
}
```

`TrainingLogger` example:

```cpp
CppNet::Visualizations::TrainingLogger logger;

// Inside training loop:
logger.log("train_loss", loss);
logger.log("val_accuracy", val_acc);
logger.next_epoch();

// After training:
logger.print_epoch_summary();
logger.export_csv("training_history.csv");
```

## Examples

The `examples/` directory contains complete, self-contained deep learning programs that train on synthetic data — no downloads required. Each example generates its own dataset, trains a model, and reports final metrics.
| Example | Architecture | Dataset | Key Components | Result |
|---|---|---|---|---|
| `mlp_classification.cpp` | Linear→ReLU→Linear→ReLU→Linear | 3-class spiral (600 samples, 2D) | ReLU, SoftmaxCrossEntropy, Adam | ~75% accuracy |
| `cnn_image_classification.cpp` | Conv2D→ReLU→MaxPool2D→Flatten→Linear | 8×8 stripe images (400 samples) | Conv2D, MaxPool2D, SoftmaxCrossEntropy, Adam | 100% accuracy |
| `rnn_sequence_prediction.cpp` | LSTM(1,16)→Linear(16,1) | Sine-wave sequences (400 samples) | LSTM, MSE, Adam | MSE ≈ 0.00001 |
| `gru_sequence_prediction.cpp` | GRU(1,16)→Linear(16,1) | Sine-wave sequences (400 samples) | GRU, MAE, Momentum | MAE ≈ 0.010 |
| `transformer_classifier.cpp` | Embedding→Attention+skip→ReLU→Linear | Token sequences (400 samples) | Embedding, MultiHeadAttention, MeanPool1D | 100% accuracy |
| `resnet_classifier.cpp` | Linear→ReLU→ResBlock(32)→Linear→Sigmoid | Concentric circles (600 samples) | Residual, GradientClip, He init | ~99% accuracy |
| `regularized_cnn.cpp` | Conv2D→LeakyReLU→Pool→BN→Dropout→FC | 8×8 pattern images (600 samples, 3 classes) | BatchNorm, Dropout, LeakyReLU, CategoricalCrossEntropy, Adagrad | 100% accuracy |
| `optimizer_comparison.cpp` | Linear→Tanh→Linear→Tanh→Linear | Regression: y = sin(x₀)·cos(x₁) (500 samples) | SGD, Momentum, Adagrad, RMSProp, Adam, Tanh, Huber | loss ≈ 0.002 |
Build and run:
```bash
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=ON
make -j$(nproc)

./examples/mlp_classification
./examples/cnn_image_classification
./examples/rnn_sequence_prediction
./examples/gru_sequence_prediction
./examples/transformer_classifier
./examples/resnet_classifier
./examples/regularized_cnn
./examples/optimizer_comparison
```

## GPU Acceleration

CppNet provides full CUDA GPU support across all layers, activations, losses, and optimizers. When CUDA is detected at build time, layers can target the GPU backend:
```cpp
CppNet::Layers::Linear layer(784, 256, "fc1", true, true, "gpu", "xavier");
```

| Category | CUDA Kernels |
|---|---|
| Linear algebra | matmul, matmul_grad_input, matmul_grad_weight, add_bias, bias_grad, elementwise |
| Convolution | conv2d_forward, conv2d_backward, maxpool2d_forward, maxpool2d_backward |
| Recurrent | rnn_cell, lstm_cell, gru_cell |
| Attention | attention_scores (scale, softmax, backward), embedding_forward, embedding_backward |
| Normalization | batch_norm_forward, batch_norm_backward, dropout |
| Pooling | global_avg_pool2d, global_max_pool2d, mean_pool1d |
| Activations | relu, relu_grad, leaky_relu, leaky_relu_grad, sigmoid, sigmoid_grad, tanh_activation, tanh_activation_grad |
| Losses | mse, mae, huber, bce, categorical_ce, softmax_ce |
| Optimizers | sgd_step, momentum_step, adagrad_step, rmsprop_step, adam_step |
To force a CPU-only build even when CUDA is present:

```bash
cmake .. -DCUDAToolkit_ROOT=/nonexistent
```

## Benchmarks

Five benchmarks compare three compute backends — `cpu-eigen` (Eigen SIMD contractions), `cpu` (OpenMP loops), and `gpu` (CUDA kernels) — across different architectures and model sizes. All benchmarks are reproducible via the scripts in the `benchmarks/` directory.
| Architecture | Model Size | GPU Speedup vs `cpu-eigen` | Key Observation |
|---|---|---|---|
| MLP | Small (4.5K params) | 2.0x | GPU overhead limits gains for small matmuls |
| | Medium (66K params) | 6.9x | — |
| | Large (660K params) | 14.3x | — |
| | XLarge (2.6M params) | 25.3x | Sub-linear GPU time scaling with params |
| CNN | Small (Conv16→32) | 28.8x | Convolution is highly GPU-parallel |
| | Medium (Conv32→64→FC128) | 42.0x | Highest CNN speedup |
| RNN/LSTM/GRU | Small (H=64) | 2.2–5.2x | GRU benefits most from GPU |
| | Medium (H=128) | 4.7–15.5x | — |
| | Large (H=256) | 12.2–56.4x | GRU Large achieves 56.4x — highest overall |
| Transformer | Small (d=32, h=2) | 0.5x (slower) | GPU overhead dominates at small scale |
| | Medium (d=64, h=4) | 1.0x (break-even) | — |
| | Large (d=128, h=8) | 1.2x | Modest gain; hybrid CPU/GPU attention |
| ResNet | Small (W=64, D=2) | 1.6x | Depth amplifies GPU advantage |
| | Medium (W=128, D=4) | 6.7x | — |
| | Large (W=256, D=6) | 9.0x | Skip connections add negligible overhead |
- GPU advantage grows with model size. Across all architectures, larger models see dramatically higher GPU speedups as matrix sizes better saturate GPU cores.
- CNNs and recurrent layers benefit most from GPU. Convolution achieves up to 42x speedup; GRU achieves up to 56.4x — the highest across all benchmarks.
- Transformers show modest GPU gains at tested scales due to mixed operations (embedding lookups, attention softmax, multiple small projections) and a hybrid CPU/GPU attention path.
- Eigen (`cpu-eigen`) consistently outperforms OpenMP (`cpu`) for all architectures, leveraging SIMD vectorization and cache-optimal memory layouts.
- Numerical consistency is verified across all backends — all devices converge to equivalent loss and accuracy values.
| Architecture | Avg GPU Speedup | Best GPU Speedup | Best Config |
|---|---|---|---|
| MLP | 12.1x | 25.3x | XLarge (2.6M params) |
| CNN | 35.4x | 42.0x | Medium (Conv32→64→FC128) |
| Sequence (RNN/LSTM/GRU) | 13.7x | 56.4x | GRU Large (H=256, seq=50) |
| Transformer | 0.9x | 1.2x | Large (d=128, h=8) |
| ResNet | 5.8x | 9.0x | Large (W=256, D=6) |
```bash
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_BENCHMARKS=ON
make -j$(nproc)

./benchmarks/mlp_benchmark
./benchmarks/cnn_benchmark
./benchmarks/sequence_benchmark
./benchmarks/transformer_benchmark
./benchmarks/residual_benchmark
```

See `benchmarks/benchmarks.md` for full per-epoch results, detailed speedup analysis, and methodology.
## Testing

CppNet has 41 unit tests with 377 test cases covering every module:
```bash
cd build
cmake .. -DBUILD_TESTS=ON
make -j$(nproc)
ctest --output-on-failure
```

| Category | Tests | Test Cases |
|---|---|---|
| Layers (13) | Linear, Conv2D, Flatten, MaxPool2D, RNN, Attention, BatchNorm, Dropout, Embedding, GlobalPool, GRU, LSTM, Residual | 123 |
| Activations (5) | ReLU, Sigmoid, Softmax, Tanh, LeakyReLU | 55 |
| Losses (6) | BinaryCrossEntropy, CategoricalCrossEntropy, MSE, MAE, Huber, SoftmaxCrossEntropy | 52 |
| Optimizers (5) | SGD, Adam, Momentum, Adagrad, RMSProp | 34 |
| Utilities (7) | Metrics, Regularizations, Callbacks, DataLoader, ElapsedTime, GradientClip, Init | 65 |
| GPU Kernels (1) | GPU matmul via Linear layer (forward, backward, step, CPU/GPU comparison) | 7 |
| Other (4) | Schedulers, Utils, Models, Visualizations | 41 |
Each test validates forward pass, backward pass (gradient shapes & values), parameter updates, and GPU/CPU numerical consistency where applicable.
## Project Structure

```
CppNet/
├── CMakeLists.txt        # Top-level build configuration
├── cmake/                # CMake package config templates
├── include/CppNet/       # Public headers
│   ├── CppNet.hpp        # Single-include entry point
│   ├── activations/      # ReLU, Sigmoid, Softmax, Tanh, LeakyReLU
│   ├── layers/           # Linear, Conv2D, RNN, LSTM, GRU, Attention, ...
│   ├── losses/           # MSE, MAE, Huber, BCE, CCE, SoftmaxCE
│   ├── optimizers/       # SGD, Adam, Adagrad, Momentum, RMSProp
│   ├── models/           # SequentialModel
│   ├── metrics/          # Accuracy, Precision, Recall, F1
│   ├── regularizations/  # L1, L2, Elastic Net
│   ├── kernels/gpu/      # CUDA kernel declarations
│   ├── utils/            # DataLoader, Init, Schedulers, Serialization, ...
│   └── visualizations/   # TrainingLogger
├── src/CppNet/           # Implementation files (.cpp / .cu)
│   └── kernels/gpu/      # 41 CUDA kernel implementations
├── tests/                # 41 unit tests (377 test cases)
├── examples/             # 8 deep learning examples
├── benchmarks/           # 5 device benchmarks (CPU vs GPU)
└── docs/                 # Additional documentation
```
## Roadmap

- Core layer library (Linear, Conv2D, Pooling, RNN, LSTM, GRU, Attention, BatchNorm, Dropout, Embedding, Residual)
- Activation functions (ReLU, Sigmoid, Tanh, Softmax, LeakyReLU)
- Loss functions (MSE, MAE, Huber, BCE, CCE, SoftmaxCE)
- Optimizers (SGD, Adam, Adagrad, Momentum, RMSProp)
- DataLoader, LR schedulers, early stopping, gradient clipping
- Model serialization (save/load)
- Full CUDA GPU backend — 41 kernels covering all layers, activations, losses, and optimizers
- OpenMP CPU parallelism
- Comprehensive test suite (41 tests, 377 test cases)
- Deep learning examples (MLP, CNN, RNN/LSTM, GRU, Transformer, ResNet, Regularized CNN, Optimizer Comparison)
- Device benchmarks (MLP, CNN, Sequence, Transformer, ResNet)
- Add Trainer abstraction with built-in training loop
- Additional examples (GANs, Reinforcement Learning, NLP pipelines)
- Python bindings (pybind11)
- Comprehensive API reference documentation
## Contributing

Contributions are welcome! To get started:

- Fork the repository and create a feature branch.
- Follow the existing coding style — headers in `include/CppNet/`, implementations in `src/CppNet/`.
- Add tests for new functionality in `tests/`.
- Make sure all tests pass: `cd build && ctest --output-on-failure`.
- Open a pull request with a clear description of your changes.
## License

CppNet is released under the MIT License.

Copyright © 2025 Loghman Samani