Comprehensive reference for error codes used across all GPU acceleration backends.
ThemisDB uses structured error codes to provide clear, actionable diagnostic information when GPU operations fail. Each error code is:
- Categorized by type (initialization, resource, runtime, etc.)
- Unique across all backends
- Documented with troubleshooting steps
- Actionable with specific resolution hints
| Range | Category | Description |
|---|---|---|
| 0 | Success | Operation completed successfully |
| 100-199 | Initialization | Backend/device initialization failures |
| 200-299 | Resource Management | Memory allocation and resource errors |
| 300-399 | Runtime/Execution | Kernel execution and synchronization errors |
| 400-499 | Configuration | Configuration and parameter errors |
| 500-599 | Kernel/Shader | Compilation and linking errors |
| 900-999 | Generic | Unknown or internal errors |
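The range table can be turned into a small lookup. The following is a minimal sketch, not ThemisDB's actual API: `ErrorCategory`, `categoryOf`, and the `Unassigned` value are hypothetical names introduced here for illustration.

```cpp
// Categories from the range table above. `Unassigned` is an addition of this
// sketch for codes that fall outside every documented range (e.g. 600-899).
enum class ErrorCategory {
    Success,
    Initialization,
    ResourceManagement,
    RuntimeExecution,
    Configuration,
    KernelShader,
    Generic,
    Unassigned
};

// Hypothetical helper: classify a numeric error code by its range.
constexpr ErrorCategory categoryOf(int code) {
    if (code == 0) return ErrorCategory::Success;
    if (code >= 100 && code <= 199) return ErrorCategory::Initialization;
    if (code >= 200 && code <= 299) return ErrorCategory::ResourceManagement;
    if (code >= 300 && code <= 399) return ErrorCategory::RuntimeExecution;
    if (code >= 400 && code <= 499) return ErrorCategory::Configuration;
    if (code >= 500 && code <= 599) return ErrorCategory::KernelShader;
    if (code >= 900 && code <= 999) return ErrorCategory::Generic;
    return ErrorCategory::Unassigned;
}

// Compile-time spot checks against the table.
static_assert(categoryOf(101) == ErrorCategory::Initialization,
              "101 is an initialization error");
static_assert(categoryOf(501) == ErrorCategory::KernelShader,
              "501 is a kernel/shader error");
```

Classifying by range lets callers choose a coarse recovery strategy (retry, reconfigure, abort) without listing every individual code.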
Description: No compatible GPU devices were found on the system.
Common Causes:
- No GPU physically installed
- GPU not detected by system (BIOS/UEFI issue)
- Driver not installed or loaded
- GPU disabled in device manager
Troubleshooting:
- Verify GPU is physically installed: `lspci | grep -i vga` (Linux) or Device Manager (Windows)
- Check driver installation: `nvidia-smi` (NVIDIA), `rocm-smi` (AMD)
- Ensure GPU is enabled in BIOS/UEFI
- Check for hardware failures
Backends: All (CUDA, HIP, OpenCL, Metal, Vulkan)
Description: Required GPU driver not installed or not accessible.
Common Causes:
- Driver not installed
- Driver version too old
- Driver service not running
- Permission issues
Troubleshooting:
- CUDA/NVIDIA:
  ```bash
  # Check driver
  nvidia-smi
  # Install driver (Ubuntu)
  sudo apt install nvidia-driver-535
  ```
- HIP/AMD:
  ```bash
  # Check driver
  rocm-smi
  # Install ROCm (Ubuntu)
  sudo apt install rocm-dkms
  ```
- OpenCL:
  ```bash
  # Check OpenCL platforms
  clinfo
  # Install OpenCL ICD loader
  sudo apt install ocl-icd-libopencl1
  ```
- Metal: Update to macOS 10.11+ or iOS 8+
Backends: All
Description: Device exists but is not supported (too old, wrong architecture).
Common Causes:
- GPU architecture too old (compute capability < 3.0 for CUDA)
- Missing required features
- Unsupported vendor (e.g., NVIDIA GPU with AMD driver)
Troubleshooting:
- Check GPU specifications against backend requirements
- CUDA: Requires compute capability 3.0+ (Kepler or newer)
- HIP: Requires GCN 3.0+ architecture (Fiji or newer)
- Metal: Requires macOS 10.11+ GPU (2012+ Macs)
Backends: All
Description: Failed to create GPU context/environment.
Common Causes:
- GPU already in exclusive use
- Insufficient system resources
- Driver error
- Conflicting applications
Troubleshooting:
- Close other GPU-using applications
- Restart the NVIDIA persistence daemon (Linux): `sudo systemctl restart nvidia-persistenced`
- Check for competing CUDA/OpenCL applications
- Reboot the system if the error persists
Backends: All
Description: Failed to create command queue or stream.
Common Causes:
- Context not properly initialized
- Out of GPU resources
- Driver bug
Troubleshooting:
- Ensure device context was created successfully
- Check GPU memory availability
- Update driver to latest version
- Reduce concurrent stream/queue count
Backends: All
Description: GPU ran out of memory during allocation.
Common Causes:
- Requested allocation too large
- Memory fragmentation
- Other applications using GPU memory
- Memory leaks in application
Troubleshooting:
- Check GPU memory usage: `nvidia-smi` or `rocm-smi`
- Reduce batch size or data dimensions
- Close other GPU applications
- Restart application to clear leaks
- Consider using a GPU with more VRAM
Calculation:
Required VRAM ≈ (num_vectors × dimensions × 4 bytes) + overhead
Example: 1M vectors × 768 dims × 4 bytes ≈ 3.07 GB; adding ~20% overhead gives ≈ 3.7 GB (the 4 bytes assume 32-bit float components)
Backends: All
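The VRAM estimate above is simple enough to compute before attempting an allocation. This is a minimal sketch under stated assumptions: `estimateVramBytes` is a hypothetical helper, components are 4-byte floats, and the overhead is a parameter because the ~20% figure is only a rule of thumb.

```cpp
#include <cassert>
#include <cstdint>

// Rough estimate following the formula above:
//   (num_vectors x dimensions x 4 bytes) + overhead
// Hypothetical helper; the real memory footprint depends on the backend.
std::uint64_t estimateVramBytes(std::uint64_t numVectors,
                                std::uint64_t dimensions,
                                std::uint64_t overheadPercent = 20) {
    // Guard against 64-bit overflow in the raw-size multiplication.
    assert(numVectors == 0 || dimensions <= UINT64_MAX / 4 / numVectors);
    const std::uint64_t rawBytes = numVectors * dimensions * 4;
    // Overflow of the overhead term is not checked in this sketch.
    return rawBytes + rawBytes * overheadPercent / 100;
}
```

For 1,000,000 vectors of 768 dimensions, the function returns 3,686,400,000 bytes, which can be compared against the free memory reported by `nvidia-smi` or `rocm-smi`.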
Description: System RAM exhausted during operation.
Common Causes:
- Large dataset doesn't fit in RAM
- Memory leak
- Insufficient system memory
Troubleshooting:
- Check system memory: `free -h` (Linux) or Task Manager (Windows)
- Reduce dataset size
- Use streaming/batching to process data in chunks
- Close other applications
- Add more RAM if persistent
Backends: All
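The streaming/batching suggestion above can be sketched as a loop that hands fixed-size slices to a callback, so only one chunk needs to be resident at a time. `forEachChunk` and its callback signature are hypothetical and are not part of any ThemisDB interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Calls processChunk(offset, count) for consecutive slices of at most
// `chunkSize` elements until `total` elements are covered.
// Returns the number of chunks processed.
std::size_t forEachChunk(
    std::size_t total, std::size_t chunkSize,
    const std::function<void(std::size_t, std::size_t)>& processChunk) {
    assert(chunkSize > 0);  // a zero chunk size would never make progress
    std::size_t chunks = 0;
    for (std::size_t offset = 0; offset < total; offset += chunkSize) {
        // The last chunk may be shorter than chunkSize.
        const std::size_t count = std::min(chunkSize, total - offset);
        processChunk(offset, count);
        ++chunks;
    }
    return chunks;
}
```

A caller would load, upload, and process `count` elements starting at `offset` inside the callback, then release that chunk's host buffer before the next iteration.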
Description: Failed to copy data between host and device.
Common Causes:
- Invalid pointer
- Size mismatch
- Device disconnected
- PCIe error
Troubleshooting:
- Verify pointers are valid and allocated
- Check copy size matches allocated size
- Ensure GPU is still responsive: `nvidia-smi`
- Check PCIe connection (reseat GPU if physical issue)
- Update motherboard firmware if PCIe errors persist
Backends: All
Description: Failed to launch GPU kernel/shader.
Common Causes:
- Invalid kernel arguments
- Work group size too large
- Kernel not compiled
- Resource limits exceeded
Troubleshooting:
- Validate kernel arguments match kernel signature
- Check work group size is within device limits:
  ```cpp
  // CUDA: query max threads per block
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  // Max threads: prop.maxThreadsPerBlock
  ```
- Ensure kernel was successfully compiled
- Reduce work group size or grid dimensions
Backends: CUDA, HIP, OpenCL, Metal, Vulkan
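Once the device limit is known, the launch dimensions can be derived on the host before the kernel is launched. The sketch below uses plain C++ with hypothetical helpers `clampBlockSize` and `gridSizeFor`; the limit passed in is assumed to come from a query such as `prop.maxThreadsPerBlock`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Keep the requested block (work group) size within the device limit.
std::size_t clampBlockSize(std::size_t requested,
                           std::size_t maxThreadsPerBlock) {
    return std::min(requested, maxThreadsPerBlock);
}

// Number of blocks needed so that gridSize * blockSize >= numElements
// (ceiling division).
std::size_t gridSizeFor(std::size_t numElements, std::size_t blockSize) {
    assert(blockSize > 0);
    return (numElements + blockSize - 1) / blockSize;
}
```

Because the grid is rounded up, the last block may contain threads past the end of the data, so the kernel itself still needs a bounds check on its global index.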
Description: Kernel launched but failed during execution.
Common Causes:
- Out-of-bounds memory access
- Division by zero
- Infinite loop in kernel
- Stack overflow in kernel
Troubleshooting:
- Run the application under `compute-sanitizer` (CUDA) or a similar memory-checking tool
- Add bounds checking to kernel code
- Review kernel logic for edge cases
- Reduce problem size to isolate issue
- Check for race conditions in kernel
Backends: CUDA, HIP, OpenCL, Metal, Vulkan
Description: Failed to synchronize with GPU (wait for completion).
Common Causes:
- Kernel timeout (Windows TDR)
- Device hang
- Driver crash
Troubleshooting:
- Windows TDR: Increase timeout in registry:
  ```
  HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
  TdrDelay = 60 (seconds)
  ```
- Check for infinite loops in kernels
- Reduce kernel complexity or work size
- Update GPU driver
- Check GPU temperature (overheating can cause hangs)
Backends: CUDA, HIP, OpenCL, Metal
Description: GPU kernel/shader failed to compile.
Common Causes:
- Syntax errors in kernel code
- Unsupported language features
- Compiler version mismatch
- Missing includes or definitions
Troubleshooting:
- Check compilation log for specific errors
- Verify kernel syntax matches backend requirements
- Ensure all required extensions/features are supported
- Test kernel with simple example to isolate issue
- Check compiler version compatibility
Example (OpenCL):
```cpp
// Check build log on error
size_t logSize;
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                      0, nullptr, &logSize);
std::vector<char> log(logSize);
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                      logSize, log.data(), nullptr);
std::cerr << "Build log:\n" << log.data() << std::endl;
```
Backends: CUDA, HIP, OpenCL, Metal, Vulkan
Description: Backend configuration is invalid or inconsistent.
Common Causes:
- Conflicting configuration options
- Invalid parameter values
- Missing required settings
Troubleshooting:
- Review configuration for invalid combinations
- Check parameter value ranges
- Consult backend documentation for valid configs
- Use default configuration to verify basic operation
Backends: All
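Checking parameter ranges up front turns a vague configuration failure into a list of concrete problems. The following is only a sketch: `BackendConfig`, its fields, and `validateConfig` are hypothetical and do not reflect ThemisDB's real configuration options.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical configuration; field names are illustrative only.
struct BackendConfig {
    int deviceIndex = 0;
    std::size_t batchSize = 1024;
    std::size_t maxStreams = 1;
};

// Returns one message per invalid setting; an empty result means the
// configuration passed every check in this sketch.
std::vector<std::string> validateConfig(const BackendConfig& cfg,
                                        int deviceCount) {
    assert(deviceCount >= 0);
    std::vector<std::string> problems;
    if (cfg.deviceIndex < 0 || cfg.deviceIndex >= deviceCount) {
        problems.push_back("deviceIndex is out of range");
    }
    if (cfg.batchSize == 0) {
        problems.push_back("batchSize must be greater than zero");
    }
    if (cfg.maxStreams == 0) {
        problems.push_back("maxStreams must be greater than zero");
    }
    return problems;
}
```

Collecting every problem rather than stopping at the first one lets a user fix the whole configuration in a single pass.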
Description: Attempted operation on uninitialized backend.
Common Causes:
- Forgot to call `initialize()`
- Initialization failed but error not checked
- Backend was shut down
Troubleshooting:
- Ensure `backend->initialize()` is called before use
- Check return value of `initialize()` for success
- Add initialization checks: `if (!initialized_) return error;`
- Don't use backend after calling `shutdown()`
Backends: All
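The initialization check from the list above can be illustrated with a small mock. `MockBackend` and `compute()` are hypothetical stand-ins, not the real backend interface; only the `initialized_` guard mirrors the text.

```cpp
#include <cassert>

// Hypothetical mock backend that only tracks its lifecycle state.
class MockBackend {
public:
    bool initialize() {
        initialized_ = true;  // a real backend would create device state here
        return true;
    }

    void shutdown() { initialized_ = false; }

    // Fails cleanly instead of touching the device when not initialized.
    bool compute() {
        if (!initialized_) return false;  // guard from the checklist above
        return true;
    }

private:
    bool initialized_ = false;
};

// Usage: operations are rejected before initialize() and after shutdown().
inline void guardDemo() {
    MockBackend backend;
    assert(!backend.compute());
    assert(backend.initialize());
    assert(backend.compute());
    backend.shutdown();
    assert(!backend.compute());
}
```

A production backend would return a structured error from the guard rather than a bare `false`, so the caller can tell this failure apart from others.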
- Check Return Values: Always check if operations succeeded
  ```cpp
  if (!backend->initialize()) {
      // Handle error
  }
  ```
- Log Error Context: Use error context for debugging
  ```cpp
  ErrorContext error = backend->getLastError();
  std::cerr << error.format() << std::endl;
  ```
- Follow Hints: Error hints provide actionable next steps
- Use Structured Errors: Always use error codes, not just strings
  ```cpp
  return ErrorContext(
      AccelerationErrorCode::NoDevicesFound,
      "CUDA",
      "No CUDA devices found",
      "Install NVIDIA GPU and driver"
  );
  ```
- Provide Context: Include relevant system information
- Be Specific: Detailed messages help users resolve issues
- Test Error Paths: Validate error handling works correctly
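To make the practices above concrete, here is one way a structured error type could look. This is a sketch under assumptions: the constructor argument order follows the `ErrorContext(...)` call shown above, but the member layout, the numeric value 101 for `NoDevicesFound` (taken from the error tables in this document), and the output format of `format()` are all guesses for illustration, not ThemisDB's actual definition.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>

// Assumed subset of the error codes.
enum class AccelerationErrorCode : int {
    Success = 0,
    NoDevicesFound = 101
};

// Hypothetical structured error: code, backend name, message, and hint.
class ErrorContext {
public:
    ErrorContext(AccelerationErrorCode code, std::string backend,
                 std::string message, std::string hint)
        : code_(code),
          backend_(std::move(backend)),
          message_(std::move(message)),
          hint_(std::move(hint)) {
        assert(!message_.empty());  // "Be Specific": always carry a message
    }

    AccelerationErrorCode code() const { return code_; }

    // Assumed format: "[backend] Error <code>: <message> (hint: <hint>)"
    std::string format() const {
        std::ostringstream out;
        out << '[' << backend_ << "] Error " << static_cast<int>(code_)
            << ": " << message_;
        if (!hint_.empty()) out << " (hint: " << hint_ << ')';
        return out.str();
    }

private:
    AccelerationErrorCode code_;
    std::string backend_;
    std::string message_;
    std::string hint_;
};
```

Keeping the code, backend, message, and hint as separate fields lets callers branch on the code while logs still get a readable line.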
Typical initialization error sequence:
```
1. Check if backend is available (isAvailable())
   ↓ No  → Error 102 (Driver Not Installed)
   ↓ Yes
2. Initialize backend (initialize())
   ↓ Fail → Error 101 (No Devices)
   ↓      → Error 104 (Context Failed)
   ↓      → Error 105 (Queue Failed)
   ↓ Success
3. Ready to use
```
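The initialization sequence above maps directly onto a chain of checks. In this sketch, `InitProbe` and `initializationResult` are hypothetical; a real backend would query the driver instead of reading precomputed flags.

```cpp
// Hypothetical probe results gathered during startup.
struct InitProbe {
    bool driverAvailable;  // outcome of isAvailable()
    int deviceCount;       // number of usable devices found
    bool contextCreated;   // device context creation succeeded
    bool queueCreated;     // command queue/stream creation succeeded
};

// Walks the sequence above and returns the first applicable error code,
// or 0 when the backend is ready to use.
constexpr int initializationResult(const InitProbe& p) {
    if (!p.driverAvailable) return 102;  // Driver Not Installed
    if (p.deviceCount <= 0) return 101;  // No Devices
    if (!p.contextCreated) return 104;   // Context Failed
    if (!p.queueCreated) return 105;     // Queue Failed
    return 0;                            // Ready to use
}

// Compile-time spot check: a missing driver is reported before anything else.
static_assert(initializationResult(InitProbe{false, 0, false, false}) == 102,
              "driver check comes first");
```

The order of the checks matters: each later step depends on the earlier one, so the first failing check identifies the root cause.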
Typical memory operation error sequence:
```
1. Allocate device memory
   ↓ Fail → Error 201 (Out of Device Memory)
   ↓ Success
2. Copy data to device
   ↓ Fail → Error 204 (Memory Copy Failed)
   ↓ Success
3. Execute kernel
```
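When the allocation step fails with Error 201, a common reaction is to retry with a smaller batch. The sketch below shows one possible strategy, halving the batch until an allocation succeeds. `findWorkingBatchSize` and the `tryAllocate` callback are hypothetical, and halving is an illustrative choice rather than documented ThemisDB behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Tries tryAllocate(batch), halving the batch after each failure.
// Returns the first batch size that fits, or 0 if nothing down to
// minBatch could be allocated.
std::size_t findWorkingBatchSize(
    std::size_t initialBatch, std::size_t minBatch,
    const std::function<bool(std::size_t)>& tryAllocate) {
    assert(minBatch > 0);  // guarantees the loop terminates
    for (std::size_t batch = initialBatch; batch >= minBatch; batch /= 2) {
        if (tryAllocate(batch)) return batch;
    }
    return 0;
}
```

In practice `tryAllocate` would wrap the backend's device allocation call and return `false` on an out-of-memory error. A return value of 0 means the caller should report the original error instead of retrying further.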
| Error | Code | Category | Common Fix |
|---|---|---|---|
| No devices | 101 | Init | Install GPU/driver |
| No driver | 102 | Init | Install driver |
| No memory | 201 | Resource | Reduce batch size |
| Kernel failed | 301 | Runtime | Check arguments |
| Won't compile | 501 | Kernel | Fix syntax |