Acceleration Backend Error Codes

Comprehensive reference for error codes used across all GPU acceleration backends.

Overview

ThemisDB uses structured error codes to provide clear, actionable diagnostic information when GPU operations fail. Each error code is:

Categorized by type (initialization, resource, runtime, etc.)
Unique across all backends
Documented with troubleshooting steps
Actionable with specific resolution hints

Error Code Categories

Range	Category	Description
0	Success	Operation completed successfully
100-199	Initialization	Backend/device initialization failures
200-299	Resource Management	Memory allocation and resource errors
300-399	Runtime/Execution	Kernel execution and synchronization errors
400-499	Configuration	Configuration and parameter errors
500-599	Kernel/Shader	Compilation and linking errors
900-999	Generic	Unknown or internal errors
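
The range table above can be turned into a small lookup, which is handy when logging or grouping errors. The following is a minimal sketch; `errorCategory` is a hypothetical helper name, not part of the documented API, and the "Unassigned" label for codes outside the listed ranges is an assumption.

```cpp
#include <string>

// Hypothetical helper: maps a numeric error code to the category
// ranges listed in the table above.
std::string errorCategory(int code) {
    if (code == 0)                  return "Success";
    if (code >= 100 && code <= 199) return "Initialization";
    if (code >= 200 && code <= 299) return "Resource Management";
    if (code >= 300 && code <= 399) return "Runtime/Execution";
    if (code >= 400 && code <= 499) return "Configuration";
    if (code >= 500 && code <= 599) return "Kernel/Shader";
    if (code >= 900 && code <= 999) return "Generic";
    return "Unassigned";  // assumption: ranges the table does not define
}
```

For instance, `errorCategory(204)` yields "Resource Management", matching the 200-299 row.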

Initialization Errors (100-199)

101: No Devices Found

Description: No compatible GPU devices were found on the system.

Common Causes:

No GPU physically installed
GPU not detected by system (BIOS/UEFI issue)
Driver not installed or loaded
GPU disabled in device manager

Troubleshooting:

Verify GPU is physically installed: lspci | grep -iE 'vga|3d|display' (Linux; compute-only GPUs are listed as "3D controller") or Device Manager (Windows)
Check driver installation: nvidia-smi (NVIDIA), rocm-smi (AMD)
Ensure GPU is enabled in BIOS/UEFI
Check for hardware failures

Backends: All (CUDA, HIP, OpenCL, Metal, Vulkan)

102: Driver Not Installed

Description: Required GPU driver not installed or not accessible.

Common Causes:

Driver not installed
Driver version too old
Driver service not running
Permission issues

Troubleshooting:

CUDA/NVIDIA:

# Check driver
nvidia-smi

# Install driver (Ubuntu)
sudo apt install nvidia-driver-535

HIP/AMD:

# Check driver
rocm-smi

# Install ROCm (Ubuntu)
sudo apt install rocm-dkms

OpenCL:

# Check OpenCL platforms
clinfo

# Install OpenCL ICD loader
sudo apt install ocl-icd-libopencl1

Metal: Update to macOS 10.11+ or iOS 8+

Backends: All

103: Device Not Supported

Description: Device exists but is not supported (too old, wrong architecture).

Common Causes:

GPU architecture too old (compute capability < 3.0 for CUDA)
Missing required features
Unsupported vendor (e.g., NVIDIA GPU with AMD driver)

Troubleshooting:

Check GPU specifications against backend requirements
CUDA: Requires compute capability 3.0+ (Kepler or newer)
HIP: Requires GCN 3.0+ architecture (Fiji or newer)
Metal: Requires a Metal-capable GPU on macOS 10.11+ (Macs from 2012 or later)

Backends: All

104: Context Creation Failed

Description: Failed to create GPU context/environment.

Common Causes:

GPU already in exclusive use
Insufficient system resources
Driver error
Conflicting applications

Troubleshooting:

Close other GPU-using applications
Restart the NVIDIA persistence daemon, which keeps the driver state loaded: sudo systemctl restart nvidia-persistenced (Linux)
Check for competing CUDA/OpenCL applications
Reboot system if persistent

Backends: All

105: Queue Creation Failed

Description: Failed to create command queue or stream.

Common Causes:

Context not properly initialized
Out of GPU resources
Driver bug

Troubleshooting:

Ensure device context was created successfully
Check GPU memory availability
Update driver to latest version
Reduce concurrent stream/queue count

Backends: All

Resource Management Errors (200-299)

201: Out of Device Memory

Description: GPU ran out of memory during allocation.

Common Causes:

Requested allocation too large
Memory fragmentation
Other applications using GPU memory
Memory leaks in application

Troubleshooting:

Check GPU memory usage: nvidia-smi or rocm-smi
Reduce batch size or data dimensions
Close other GPU applications
Restart application to clear leaks
Consider using a GPU with more VRAM

Calculation:

Required VRAM ≈ (num_vectors × dimensions × 4 bytes) + overhead
Example: 1M vectors × 768 dims × 4 bytes ≈ 3.07 GB; with ~20% overhead ≈ 3.7 GB
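
The estimate above can be computed before allocating, so an oversized request is caught early. This is a minimal sketch assuming 4-byte float32 components; `estimateVramBytes` and its `overheadPercent` parameter are hypothetical names, and the default of 20 only mirrors the example figure.

```cpp
#include <cstdint>

// Estimated VRAM in bytes for float32 vectors, following the formula above.
// overheadPercent is an assumed tuning knob; 20 mirrors the ~20% example.
std::uint64_t estimateVramBytes(std::uint64_t numVectors,
                                std::uint64_t dimensions,
                                std::uint64_t overheadPercent = 20) {
    const std::uint64_t kBytesPerFloat32 = 4;
    const std::uint64_t raw = numVectors * dimensions * kBytesPerFloat32;
    return raw + raw * overheadPercent / 100;
}
```

Comparing the result against the free memory reported by nvidia-smi or rocm-smi gives a rough go/no-go check; the real overhead depends on the backend and index layout.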

Backends: All

202: Out of Host Memory

Description: System RAM exhausted during operation.

Common Causes:

Large dataset doesn't fit in RAM
Memory leak
Insufficient system memory

Troubleshooting:

Check system memory: free -h (Linux) or Task Manager (Windows)
Reduce dataset size
Use streaming/batching to process data in chunks
Close other applications
Add more RAM if persistent
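
The streaming/batching suggestion above amounts to splitting the dataset into index ranges and processing one range at a time. A minimal sketch follows; `planChunks` is a hypothetical helper, and how each chunk is loaded and handed to the backend is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, totalItems) into consecutive [begin, end) ranges that each
// hold at most chunkSize items.
std::vector<std::pair<std::size_t, std::size_t>>
planChunks(std::size_t totalItems, std::size_t chunkSize) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (chunkSize == 0) return chunks;  // nothing sensible to plan
    for (std::size_t begin = 0; begin < totalItems;) {
        const std::size_t len = std::min(chunkSize, totalItems - begin);
        chunks.emplace_back(begin, begin + len);
        begin += len;
    }
    return chunks;
}
```

With 10 items and a chunk size of 4 this produces the ranges [0, 4), [4, 8), and [8, 10), so only one chunk needs to be resident in host memory at a time.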

Backends: All

204: Memory Copy Failed

Description: Failed to copy data between host and device.

Common Causes:

Invalid pointer
Size mismatch
Device disconnected
PCIe error

Troubleshooting:

Verify pointers are valid and allocated
Check copy size matches allocated size
Ensure GPU is still responsive: nvidia-smi
Check PCIe connection (reseat GPU if physical issue)
Update motherboard firmware if PCIe errors persist
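
The first two checks above (valid pointers, matching sizes) can be done on the host before a copy is issued. The sketch below is an assumption-laden illustration: `CopyCheck` and `validateCopy` are hypothetical names, and a non-null check cannot prove that a device pointer is still valid.

```cpp
#include <cstddef>

enum class CopyCheck { Ok, NullPointer, SizeExceedsAllocation };

// Host-side sanity check before a host/device copy. It only catches null
// pointers and copies larger than either allocation.
CopyCheck validateCopy(const void* src, const void* dst,
                       std::size_t copyBytes,
                       std::size_t srcAllocBytes,
                       std::size_t dstAllocBytes) {
    if (src == nullptr || dst == nullptr) return CopyCheck::NullPointer;
    if (copyBytes > srcAllocBytes || copyBytes > dstAllocBytes) {
        return CopyCheck::SizeExceedsAllocation;
    }
    return CopyCheck::Ok;
}
```

A caller would typically map any result other than `CopyCheck::Ok` to error 204 instead of attempting the copy.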

Backends: All

Runtime/Execution Errors (300-399)

301: Kernel Launch Failed

Description: Failed to launch GPU kernel/shader.

Common Causes:

Invalid kernel arguments
Work group size too large
Kernel not compiled
Resource limits exceeded

Troubleshooting:

Validate kernel arguments match kernel signature

Check work group size is within device limits:

// CUDA: query the maximum threads per block for device 0
cudaDeviceProp prop;
if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
    // Upper limit for the block size: prop.maxThreadsPerBlock
}

Ensure kernel was successfully compiled
Reduce work group size or grid dimensions
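
Once the device limit is known, the block size can be clamped to it and the grid size derived from the item count. The following is a minimal, backend-independent sketch; `LaunchConfig` and `makeLaunchConfig` are hypothetical names, and limits on the grid dimensions themselves are not checked here.

```cpp
#include <algorithm>
#include <cstdint>

struct LaunchConfig {
    std::uint32_t blockSize;  // threads per block (work group size)
    std::uint32_t gridSize;   // number of blocks (work groups)
};

// Clamps the requested block size to the device limit and derives the
// number of blocks needed to cover numItems (ceiling division).
LaunchConfig makeLaunchConfig(std::uint64_t numItems,
                              std::uint32_t requestedBlockSize,
                              std::uint32_t maxThreadsPerBlock) {
    const std::uint32_t block = std::max<std::uint32_t>(
        1, std::min(requestedBlockSize, maxThreadsPerBlock));
    const std::uint64_t grid = (numItems + block - 1) / block;
    return {block, static_cast<std::uint32_t>(grid)};
}
```

For 1000 items with a requested block size of 2048 on a device that allows 1024 threads per block, this yields a block size of 1024 and a grid size of 1.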

Backends: CUDA, HIP, OpenCL, Metal, Vulkan

302: Kernel Execution Failed

Description: Kernel launched but failed during execution.

Common Causes:

Out-of-bounds memory access
Division by zero
Infinite loop in kernel
Stack overflow in kernel

Troubleshooting:

Run the application under compute-sanitizer (CUDA) or a comparable memory-checking tool for your backend
Add bounds checking to kernel code
Review kernel logic for edge cases
Reduce problem size to isolate issue
Check for race conditions in kernel

Backends: CUDA, HIP, OpenCL, Metal, Vulkan

303: Synchronization Failed

Description: Failed to synchronize with GPU (wait for completion).

Common Causes:

Kernel timeout (Windows TDR)
Device hang
Driver crash

Troubleshooting:

Windows TDR: Increase timeout in registry:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
TdrDelay = 60 (seconds)

Check for infinite loops in kernels
Reduce kernel complexity or work size
Update GPU driver
Check GPU temperature (overheating can cause hangs)

Backends: CUDA, HIP, OpenCL, Metal

Kernel Compilation Errors (500-599)

501: Kernel Compilation Failed

Description: GPU kernel/shader failed to compile.

Common Causes:

Syntax errors in kernel code
Unsupported language features
Compiler version mismatch
Missing includes or definitions

Troubleshooting:

Check compilation log for specific errors
Verify kernel syntax matches backend requirements
Ensure all required extensions/features are supported
Test kernel with simple example to isolate issue
Check compiler version compatibility

Example (OpenCL):

// Query the build log after clBuildProgram reports an error
size_t logSize = 0;
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                      0, nullptr, &logSize);
std::vector<char> log(logSize + 1, '\0');  // extra byte guarantees termination
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                      logSize, log.data(), nullptr);
std::cerr << "Build log:\n" << log.data() << std::endl;

Backends: CUDA, HIP, OpenCL, Metal, Vulkan

Configuration Errors (400-499)

401: Invalid Configuration

Description: Backend configuration is invalid or inconsistent.

Common Causes:

Conflicting configuration options
Invalid parameter values
Missing required settings

Troubleshooting:

Review configuration for invalid combinations
Check parameter value ranges
Consult backend documentation for valid configs
Use default configuration to verify basic operation

Backends: All

404: Backend Not Initialized

Description: Attempted operation on uninitialized backend.

Common Causes:

initialize() was never called
Initialization failed but the error was not checked
Backend was already shut down

Troubleshooting:

Ensure backend->initialize() is called before use
Check return value of initialize() for success
Add initialization checks: if (!initialized_) return error;
Don't use backend after calling shutdown()
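
The guard described above can be sketched as follows. `GuardedBackend` and `compute()` are hypothetical and only illustrate the `initialized_` check; a real `initialize()` would probe the device and could fail.

```cpp
// Hypothetical sketch of an initialization guard; not the actual backend class.
class GuardedBackend {
public:
    // Stand-in for real device setup, which could return false on failure.
    bool initialize() {
        initialized_ = true;
        return true;
    }

    void shutdown() { initialized_ = false; }

    // Returns 0 on success, or 404 (Backend Not Initialized) when called
    // before initialize() or after shutdown().
    int compute() const {
        if (!initialized_) return 404;
        return 0;
    }

private:
    bool initialized_ = false;
};
```

Placing the check at the top of every public operation means a missed `initialize()` call surfaces as error 404 instead of undefined behavior deeper in the backend.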

Backends: All

Best Practices

For Users

Check Return Values: Always check if operations succeeded

if (!backend->initialize()) {
    // Handle error
}

Log Error Context: Use error context for debugging

ErrorContext error = backend->getLastError();
std::cerr << error.format() << std::endl;

Follow Hints: Error hints provide actionable next steps

For Developers

Use Structured Errors: Always use error codes, not just strings

return ErrorContext(
    AccelerationErrorCode::NoDevicesFound,
    "CUDA",
    "No CUDA devices found",
    "Install NVIDIA GPU and driver"
);

Provide Context: Include relevant system information
Be Specific: Detailed messages help users resolve issues
Test Error Paths: Validate error handling works correctly
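
The snippets in this section use `ErrorContext`, `AccelerationErrorCode`, and `format()` without showing their definitions. The sketch below is one shape such a type could take, inferred only from the four-argument constructor call above; the field names, the enum subset, and the output layout of `format()` are assumptions, not the project's actual definitions.

```cpp
#include <string>
#include <utility>

// Hypothetical subset of the codes documented in this reference.
enum class AccelerationErrorCode : int {
    Success = 0,
    NoDevicesFound = 101,
    DriverNotInstalled = 102,
    OutOfDeviceMemory = 201,
};

// Assumed layout: code, backend name, message, and resolution hint.
struct ErrorContext {
    AccelerationErrorCode code;
    std::string backend;
    std::string message;
    std::string hint;

    ErrorContext(AccelerationErrorCode c, std::string b,
                 std::string m, std::string h)
        : code(c), backend(std::move(b)),
          message(std::move(m)), hint(std::move(h)) {}

    // Assumed output format: "[backend] error <code>: <message> (hint: <hint>)"
    std::string format() const {
        return "[" + backend + "] error " +
               std::to_string(static_cast<int>(code)) +
               ": " + message + " (hint: " + hint + ")";
    }
};
```

Keeping the numeric code, backend name, message, and hint as separate fields lets callers branch on the code while still logging a readable line.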

Common Error Patterns

Initialization Sequence

Typical initialization error sequence:

1. Check if backend is available (isAvailable())
   ↓ No → Error 102 (Driver Not Installed)
   ↓ Yes
2. Initialize backend (initialize())
   ↓ Fail → Error 101 (No Devices)
   ↓      → Error 104 (Context Failed)
   ↓      → Error 105 (Queue Failed)
   ↓ Success
3. Ready to use
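
The sequence above maps each failed step to one error code. A minimal sketch of that mapping, with the probe results passed in as plain flags; `InitProbe` and `initializationResult` are hypothetical names used only for illustration.

```cpp
// Outcome of each initialization step, in the order shown above.
struct InitProbe {
    bool driverAvailable;  // result of isAvailable()
    bool deviceFound;
    bool contextCreated;
    bool queueCreated;
};

// Returns the first error code in the sequence, or 0 when every step passed.
int initializationResult(const InitProbe& probe) {
    if (!probe.driverAvailable) return 102;  // Driver Not Installed
    if (!probe.deviceFound)     return 101;  // No Devices Found
    if (!probe.contextCreated)  return 104;  // Context Creation Failed
    if (!probe.queueCreated)    return 105;  // Queue Creation Failed
    return 0;                                // Success
}
```

Because the checks run in order, a missing driver is reported as 102 even if later steps would also have failed.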

Memory Allocation Pattern

1. Allocate device memory
   ↓ Fail → Error 201 (Out of Device Memory)
   ↓ Success
2. Copy data to device
   ↓ Fail → Error 204 (Memory Copy Failed)
   ↓ Success
3. Execute kernel
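
The pattern above stops at the first failing step. The sketch below expresses that with caller-supplied callbacks; `runAllocationPattern` is a hypothetical helper, and returning 301 when the kernel step fails is an assumption, since the diagram does not name a code for step 3.

```cpp
#include <functional>

// Runs the three steps in order and returns the code of the first failure.
// Each callback returns true on success; later steps are skipped on failure.
int runAllocationPattern(const std::function<bool()>& allocate,
                         const std::function<bool()>& copyToDevice,
                         const std::function<bool()>& executeKernel) {
    if (!allocate())      return 201;  // Out of Device Memory
    if (!copyToDevice())  return 204;  // Memory Copy Failed
    if (!executeKernel()) return 301;  // assumed: Kernel Launch Failed
    return 0;                          // Success
}
```

Passing the steps as callbacks keeps the control flow testable without a GPU, since each step can be replaced by a stub.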

Quick Reference

Error	Code	Category	Common Fix
No devices	101	Init	Install GPU/driver
No driver	102	Init	Install driver
No memory	201	Resource	Reduce batch size
Kernel failed	301	Runtime	Check arguments
Won't compile	501	Kernel	Fix syntax
