Thank you for your interest in contributing to ztensor, the GPU-accelerated tensor, compute engine, and computation graph library for the Zerfoo ML ecosystem. This guide will help you get started.
- Development Setup
- Building from Source
- Running Tests
- Code Style
- Commit Conventions
- Pull Request Process
- Issue Reporting
- Good First Issues
- Key Conventions
- Go 1.25+ (generics with the tensor.Numeric constraint)
- Git
- CUDA Toolkit (optional, for GPU tests — CUDA 13.0 recommended)
- ROCm (optional, for AMD GPU tests)
- OpenCL (optional, for CLBlast backend tests)
git clone https://github.com/zerfoo/ztensor.git
cd ztensor
go mod tidy
go test ./...
ztensor depends on:
- github.com/zerfoo/float16: IEEE 754 half-precision arithmetic
- github.com/zerfoo/float8: FP8 E4M3FN arithmetic
- gonum.org/v1/gonum: numerical routines
These are fetched automatically by go mod tidy.
go build ./...
No CGo is required for CPU-only builds. GPU support is loaded dynamically at runtime via purego/dlopen, so go build works on any platform without a CUDA toolkit installed.
# Run all CPU tests (no GPU required)
go test ./...
# Run tests with race detector
go test -race ./...
# Run GPU tests — CUDA backend (requires CUDA toolkit and a GPU)
go test -tags cuda ./...
# Run GPU tests — ROCm backend (requires ROCm and an AMD GPU)
go test -tags rocm ./...
# Run GPU tests — OpenCL backend (requires OpenCL runtime)
go test -tags opencl ./...
# Run tests with coverage
go test -cover ./...
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html
GPU tests are skipped automatically when no GPU is available. All new code must have tests. Aim for at least 80% coverage on new packages.
- gofmt: all code must be formatted with gofmt
- goimports: imports must be organized (stdlib, external, internal)
- golangci-lint: run golangci-lint run before submitting
- Follow standard Go naming: PascalCase for exported symbols, camelCase for unexported
- Use table-driven tests with t.Run subtests
- Write documentation comments for all exported functions, types, and methods
- Use generics with [T tensor.Numeric] constraints; avoid type-specific code where generics work
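As a rough illustration of the last two points, the sketch below pairs one generic function with a table of cases. `Numeric` is a local stand-in for `tensor.Numeric` (the real constraint may admit a different set of types), and `Sum` is a hypothetical helper, not part of ztensor.

```go
package main

import "fmt"

// Numeric is a local stand-in for tensor.Numeric so that this sketch compiles
// on its own; the real constraint may admit a different set of types.
type Numeric interface {
	~int | ~int32 | ~int64 | ~float32 | ~float64
}

// Sum adds every element of xs. One generic body serves all Numeric types,
// so no float32-only copy is needed.
func Sum[T Numeric](xs []T) T {
	var total T
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	// Table of cases. In a _test.go file each entry would run inside
	// t.Run(tc.name, func(t *testing.T) { ... }) instead of this loop.
	cases := []struct {
		name string
		in   []float32
		want float32
	}{
		{"empty", nil, 0},
		{"single", []float32{2.5}, 2.5},
		{"several", []float32{1, 2, 3}, 6},
	}
	for _, tc := range cases {
		got := Sum(tc.in)
		fmt.Printf("%s: got=%v want=%v ok=%t\n", tc.name, got, tc.want, got == tc.want)
	}

	// The same function works for integers without any extra code.
	fmt.Println(Sum([]int{1, 2, 3}))
}
```

Adding a case is a one-line change to the table, which is the main reason the table-driven layout scales well as edge cases accumulate.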
We use Conventional Commits for automated versioning with release-please.
<type>(<scope>): <description>
[optional body]
[optional footer(s)]
| Type | Description |
|---|---|
| feat | A new feature |
| fix | A bug fix |
| perf | A performance improvement |
| docs | Documentation only changes |
| test | Adding or correcting tests |
| chore | Maintenance tasks, CI, dependencies |
| refactor | Code change that neither fixes a bug nor adds a feature |
feat(compute): add ROCm backend for Engine[T]
fix(graph): correct topological sort for diamond dependencies
perf(cuda): fuse elementwise ops in CUDA graph capture
docs(tensor): document memory layout for quantized types
test(device): add multi-GPU allocation tests
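If you want to check a commit header locally before pushing, a small validator along these lines may help. It is a sketch, not a tool shipped with the repository: `validHeader` and the scope character set are assumptions, while the optional scope and the optional `!` marker follow the Conventional Commits specification.

```go
package main

import (
	"fmt"
	"regexp"
)

// headerRE matches "<type>(<scope>): <description>" for the commit types
// listed in this guide. The scope and the "!" breaking-change marker are
// optional. The characters allowed inside the scope are an assumption.
var headerRE = regexp.MustCompile(
	`^(feat|fix|perf|docs|test|chore|refactor)(\([a-z0-9_/-]+\))?!?: \S.*$`)

// validHeader reports whether the first line of a commit message
// follows the expected format.
func validHeader(line string) bool {
	return headerRE.MatchString(line)
}

func main() {
	headers := []string{
		"feat(compute): add ROCm backend for Engine[T]",
		"fix(graph): correct topological sort for diamond dependencies",
		"docs: fix typo",
		"Added ROCm backend",
		"feat(compute) add ROCm backend",
	}
	for _, h := range headers {
		fmt.Printf("%t  %s\n", validHeader(h), h)
	}
}
```

The first three headers are accepted and the last two are rejected: one has no type prefix and the other is missing the colon.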
- One logical change per PR — keep PRs focused and reviewable
- Branch from main and keep your branch up to date with rebase
- All CI checks must pass: tests, linting, formatting
- Rebase and merge — we do not use squash merges or merge commits
- Reference related issues: use Fixes #123 or Closes #123 in the PR description
- Respond to review feedback promptly
go test ./...
go test -race ./...
go vet ./...
golangci-lint run
For bug reports, please include:
- Description: Clear summary of the bug
- Steps to reproduce: Minimal code or commands to trigger the issue
- Expected behavior: What should happen
- Actual behavior: What happens instead
- Environment: Go version, OS, architecture, GPU model and driver version (if GPU-related)
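The Go version, OS, and architecture can be read from the `go version` output or printed by a short program like the one below. `envSummary` is a hypothetical helper for illustration; GPU model and driver version are not available from the Go standard library and have to come from your vendor's tools.

```go
package main

import (
	"fmt"
	"runtime"
)

// envSummary returns the Go version, operating system, and architecture
// of the running binary in a form that can be pasted into a bug report.
func envSummary() string {
	return fmt.Sprintf("go=%s os=%s arch=%s",
		runtime.Version(), runtime.GOOS, runtime.GOARCH)
}

func main() {
	fmt.Println(envSummary())
}
```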
For feature requests, please include:
- Problem statement: What problem does this solve?
- Proposed solution: How should it work?
- Alternatives considered: Other approaches you thought about
- Use case: How would you use this feature in practice?
Look for issues labeled good first issue on GitHub. These are scoped, well-defined tasks suitable for new contributors.
Good areas for first contributions:
- Adding test coverage for existing packages
- Documentation improvements
- CPU compute engine optimizations
- New numeric type support in tensor/
These conventions are critical to maintaining consistency across the codebase:
All tensor arithmetic must flow through compute.Engine[T]. Never operate on raw slices outside the engine — this enables transparent CPU/GPU switching and CUDA graph capture.
// Good
engine.MatMul(ctx, out, a, b)
// Bad — bypasses the engine, breaks GPU support
for i := range out.Data() {
out.Data()[i] = a.Data()[i] * b.Data()[i]
}
GPU bindings use purego/dlopen. A plain go build ./... must compile on any platform without a C compiler. Build tags (cuda, rocm, opencl) are optional and only used for CGo-based alternative paths.
All GPU runtime calls (CUDA, ROCm, OpenCL) are loaded dynamically via purego. This means:
- Function signatures are declared as Go types and resolved at runtime
- No #cgo directives or C header includes
- GPU availability is detected at runtime, not compile time
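The pattern in the list above can be sketched without any GPU or shared library. In the code below a plain map plays the role of a dlopen'ed library handle, so nothing here is purego's API: `symbolTable`, `deviceCountFn`, `loadDeviceCount`, and `backendName` are all hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// deviceCountFn is the Go-side declaration of a driver entry point.
// In real code the value would be bound to a symbol from a shared library;
// here it is bound from a map so the sketch runs anywhere.
type deviceCountFn func() int

// symbolTable is a hypothetical stand-in for a loaded library handle.
type symbolTable map[string]any

var errUnavailable = errors.New("gpu runtime not available")

// loadDeviceCount resolves "deviceCount" at runtime. A nil table models a
// machine where the shared library could not be opened.
func loadDeviceCount(lib symbolTable) (deviceCountFn, error) {
	if lib == nil {
		return nil, errUnavailable
	}
	fn, ok := lib["deviceCount"].(deviceCountFn)
	if !ok {
		return nil, fmt.Errorf("symbol deviceCount: %w", errUnavailable)
	}
	return fn, nil
}

// backendName picks a backend from what resolved at runtime,
// falling back to the CPU when nothing usable was found.
func backendName(lib symbolTable) string {
	count, err := loadDeviceCount(lib)
	if err != nil || count() == 0 {
		return "cpu"
	}
	return "gpu"
}

func main() {
	withGPU := symbolTable{"deviceCount": deviceCountFn(func() int { return 1 })}
	fmt.Println(backendName(nil))     // library missing
	fmt.Println(backendName(withGPU)) // library present, one device
}
```

Because the decision is made from a returned error rather than a build tag, the same binary can run on machines with and without a GPU runtime installed.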
ARM NEON and x86 AVX2 assembly lives in internal/xblas/. When writing or modifying assembly:
- Keep Go fallback implementations in sync
- Test both assembly and pure-Go paths
- Use go test -tags noasm to verify the Go fallback
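One way to keep the two paths in sync is to compare the optimized kernel against the fallback over a range of lengths, including sizes that do not fill a whole vector. The sketch below assumes nothing about internal/xblas: `dotRef` plays the role of the pure-Go fallback, `dotUnrolled` stands in for an assembly kernel, and the tolerance is an arbitrary choice for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// dotRef is the straightforward reference, playing the role of the
// pure-Go fallback.
func dotRef(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// dotUnrolled stands in for an optimized kernel. It accumulates in four
// lanes, which changes the summation order the way a SIMD kernel would.
func dotUnrolled(a, b []float32) float32 {
	var s0, s1, s2, s3 float32
	i := 0
	for ; i+4 <= len(a); i += 4 {
		s0 += a[i] * b[i]
		s1 += a[i+1] * b[i+1]
		s2 += a[i+2] * b[i+2]
		s3 += a[i+3] * b[i+3]
	}
	for ; i < len(a); i++ { // tail elements that do not fill a group of four
		s0 += a[i] * b[i]
	}
	return (s0 + s1) + (s2 + s3)
}

// closeEnough compares with a relative tolerance because the two paths
// may round differently.
func closeEnough(x, y float32) bool {
	diff := math.Abs(float64(x - y))
	scale := math.Max(1, math.Max(math.Abs(float64(x)), math.Abs(float64(y))))
	return diff <= 1e-5*scale
}

func main() {
	// Lengths chosen to cover empty input, short tails, and larger sizes.
	for _, n := range []int{0, 1, 3, 4, 7, 64, 1001} {
		a, b := make([]float32, n), make([]float32, n)
		for i := range a {
			a[i] = float32(i%7) * 0.25
			b[i] = float32(i%5) - 2
		}
		fmt.Printf("n=%d match=%t\n", n, closeEnough(dotRef(a, b), dotUnrolled(a, b)))
	}
}
```

Exact equality is usually too strict for floating-point kernels that reorder additions, which is why the comparison uses a tolerance.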
Use [T tensor.Numeric] constraints. Do not write float32-specific code where generics work.
The internal/gpuapi/ package provides a unified interface across CUDA, ROCm, and OpenCL. New GPU features should be added through this abstraction layer (GRAL), not directly to a single backend.
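To show why adding features at the interface level matters, here is a much-reduced sketch of a unified backend interface. The `Backend` method set, `fakeBackend`, and `pick` are assumptions made for this example and do not reflect the actual API of internal/gpuapi.

```go
package main

import "fmt"

// Backend is a hypothetical, much-reduced unified GPU interface.
// The method set is an assumption for this sketch only.
type Backend interface {
	Name() string
	Available() bool
	// Scale multiplies every element of x by k in place.
	Scale(x []float32, k float32)
}

// fakeBackend simulates a backend so the sketch runs without a GPU.
type fakeBackend struct {
	name      string
	available bool
}

func (b fakeBackend) Name() string    { return b.name }
func (b fakeBackend) Available() bool { return b.available }
func (b fakeBackend) Scale(x []float32, k float32) {
	for i := range x {
		x[i] *= k
	}
}

// pick returns the first available backend, or nil if none is usable.
func pick(backends ...Backend) Backend {
	for _, b := range backends {
		if b.Available() {
			return b
		}
	}
	return nil
}

func main() {
	b := pick(
		fakeBackend{"cuda", false},
		fakeBackend{"rocm", false},
		fakeBackend{"opencl", true},
	)
	x := []float32{1, 2, 3}
	b.Scale(x, 2) // callers never name a concrete backend
	fmt.Println(b.Name(), x)
}
```

A method added to the interface forces every backend to implement it, whereas a feature added to one backend alone would silently be missing on the others.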