Module: src/acceleration
Version: 2026 Q1
Status: Production
- Quick Diagnostics Checklist
- Backend Not Selected / Always Falls Back to CPU
- Initialization Failures
- GPU Memory Errors
- Kernel Execution Failures
- Performance Degradation
- Backend Health Issues
- Plugin Loading Failures
- Platform-Specific Issues
- Logging and Diagnostics
- Environment Variables Reference
- See Also
## Quick Diagnostics Checklist

Run through this checklist whenever acceleration behaves unexpectedly:
[ ] 1. Check which backend is actually selected:
auto* vb = BackendRegistry::instance().getSelectedVectorBackend();
// nullptr → no backend matched requirements
[ ] 2. Check initialization error code:
ErrorContext err = backend->getLastError();
std::cerr << err.format() << std::endl;
[ ] 3. Verify GPU is visible to the OS:
Linux: lspci | grep -iE "vga|3d|nvidia|amd|intel"
Windows: Device Manager → Display Adapters
macOS: system_profiler SPDisplaysDataType
[ ] 4. Verify driver is installed and loaded:
NVIDIA: nvidia-smi
AMD: rocm-smi
OpenCL: clinfo
Vulkan: vulkaninfo --summary
[ ] 5. Confirm build flags match the installed SDK:
strings themisdb | grep -E "THEMIS_ENABLE_(CUDA|VULKAN|HIP)"
[ ] 6. Check available GPU memory:
nvidia-smi --query-gpu=memory.free,memory.total --format=csv
rocm-smi --showmeminfo vram
[ ] 7. Check backend health status:
BackendHealthStatus h = backend->getHealthStatus();
// h.status == "healthy" | "degraded" | "unhealthy"
## Backend Not Selected / Always Falls Back to CPU

Symptom: getSelectedVectorBackend() returns nullptr, or all operations run on CPU even though a GPU is present.
Diagnosis:
strings ./themisdb | grep -E "THEMIS_ENABLE_(CUDA|VULKAN|HIP)"
# No output → the backend was compiled out
Fix: Rebuild with the appropriate flag:
cmake -DTHEMIS_ENABLE_CUDA=ON ..
cmake -DTHEMIS_ENABLE_VULKAN=ON ..
Diagnosis:
# NVIDIA
nvidia-smi
# Expected: table showing driver version, GPU name, memory
# If missing: driver not installed
# AMD
rocm-smi
Fix: Install the driver for your platform. See Initialization Failures for per-backend commands.
Diagnosis:
// Check whether default requirements would select a backend:
auto* fallback = BackendRegistry::instance().selectVectorBackendFor(
BackendRegistry::defaultVectorRequirements());
// If non-null, your custom requirements are stricter than the backend supports.
Fix: Relax requirements, or remove the requirement for async/FP16 if not needed:
BackendRegistry::CapabilityRequirements req;
req.needsVectorOps = true;
// Do NOT set req.needsAsync = true unless you require it.
BackendRegistry::instance().initializeRuntime(req, ...);
Diagnosis: isRuntimeInitialized() returns false.
Fix: Call BackendRegistry::instance().initializeRuntime() once during single-threaded startup, before spawning worker threads.
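A minimal sketch of that startup ordering, assuming the compute_backend.h header listed under See Also and the zero-argument initializeRuntime() overload used elsewhere in this guide:

```cpp
#include <iostream>
#include <thread>
#include <vector>
#include "acceleration/compute_backend.h"  // BackendRegistry (path assumed from See Also)

int main() {
    // 1. Initialize acceleration exactly once, while still single-threaded.
    auto& registry = BackendRegistry::instance();
    registry.initializeRuntime();
    if (!registry.isRuntimeInitialized()) {
        std::cerr << "Acceleration runtime not initialized; continuing CPU-only\n";
    }

    // 2. Only now spawn worker threads; they may safely query the registry.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([] {
            auto* vb = BackendRegistry::instance().getSelectedVectorBackend();
            (void)vb;  // nullptr → use the CPU code path in this worker
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}
```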
## Initialization Failures

Symptom: Backend returns error code 101 on initialize().
Steps:
- Verify GPU is physically installed and seated in the PCIe slot.
- Check OS visibility: lspci | grep -i vga (Linux) or Device Manager (Windows).
- Check BIOS/UEFI — ensure the GPU is not disabled.
- Check for hardware failures (try a different slot, reseat GPU).
Symptom: isAvailable() returns false; error code 102.
NVIDIA / CUDA:
# Check if driver is present
nvidia-smi
# Ubuntu — install driver
sudo apt install nvidia-driver-535
# Verify CUDA toolkit
nvcc --version
AMD / HIP (ROCm):
# Check ROCm
rocm-smi
# Ubuntu — install ROCm
sudo apt install rocm-dkms
# Check HIP
hipcc --version
OpenCL:
# List OpenCL platforms
clinfo
# Install ICD loader
sudo apt install ocl-icd-libopencl1 ocl-icd-opencl-dev
Metal (macOS): Requires macOS 10.11 (El Capitan) or later. No separate driver installation needed.
Vulkan:
# Check Vulkan ICD
vulkaninfo --summary
# AMD Mesa (Linux)
sudo apt install mesa-vulkan-drivers
# NVIDIA (Linux)
sudo apt install libvulkan1
# macOS — install MoltenVK
brew install molten-vk
Symptom: Driver is present but initialize() fails with code 104.
Steps:
- Close other GPU-intensive applications (rendering, ML training, other database instances).
- Restart the GPU driver service:
# Linux — NVIDIA persistence daemon
sudo systemctl restart nvidia-persistenced
- Check for exclusive-mode conflicts: another process may hold the GPU context.
- Reboot if the driver is in an inconsistent state.
Symptom: Context created but command queue creation fails (code 105).
Steps:
- Ensure context was successfully created (check code 104 first).
- Check available GPU memory (nvidia-smi); low memory can prevent stream allocation.
- Reduce the number of concurrent streams/queues your application creates.
- Update the GPU driver to the latest stable version.
## GPU Memory Errors

Symptom: Operations fail with code 201; GPU memory is exhausted.
Diagnosis:
# Check current GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv,noheader
# AMD
rocm-smi --showmeminfo vram
Estimate required VRAM:
Required VRAM ≈ (num_vectors × dimensions × 4 bytes) + ~20% overhead
Example: 1 000 000 vectors × 768 dimensions × 4 bytes = 3.0 GB + 600 MB = 3.6 GB
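As a quick sanity check, that estimate can be compared against what the driver reports as free. A minimal C++ sketch, assuming the CUDA runtime API (cudaMemGetInfo) and the ~20% overhead factor from the formula above:

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>   // cudaMemGetInfo

// Estimated VRAM for num_vectors float32 vectors of the given dimensionality.
std::size_t estimateVramBytes(std::size_t num_vectors, std::size_t dimensions) {
    const std::size_t raw = num_vectors * dimensions * sizeof(float);
    return raw + raw / 5;   // + ~20% overhead
}

int main() {
    const std::size_t need = estimateVramBytes(1'000'000, 768);  // ≈ 3.6 GB, as in the example above

    std::size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
        std::printf("need %.2f GB, free %.2f GB of %.2f GB\n",
                    need / 1e9, free_bytes / 1e9, total_bytes / 1e9);
        if (need > free_bytes) {
            std::printf("Expect error 201: reduce the batch size or free GPU memory first.\n");
        }
    }
    return 0;
}
```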
Fix options:
- Reduce batch size: process vectors in smaller chunks (see the sketch after this list).
- Close other GPU applications.
- Use a GPU with more VRAM.
- Switch to CPU backend for large batches; see Backend Health Issues.
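For the first option, chunking is usually just a host-side loop. A sketch; processBatch is a hypothetical placeholder for whatever GPU-backed call your application makes per slice, not a ThemisDB API:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Process `total` vectors in chunks of at most `batch` vectors so that each
// chunk's device allocation stays within the VRAM budget estimated above.
void processInChunks(std::size_t total, std::size_t batch,
                     const std::function<void(std::size_t, std::size_t)>& processBatch) {
    for (std::size_t offset = 0; offset < total; offset += batch) {
        const std::size_t count = std::min(batch, total - offset);
        processBatch(offset, count);   // e.g. upload + search only this slice
    }
}
```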
Symptom: Data transfer between host and device fails (code 204).
Steps:
- Verify all host-side pointers are valid and properly aligned.
- Confirm the copy size matches the allocated buffer size.
- Verify GPU is still responsive: nvidia-smi should show the device.
- Check for PCIe errors in the kernel log:
dmesg | grep -iE "pcie|aer|error"
- If PCIe errors appear, reseat the GPU or test with another slot.
## Kernel Execution Failures

Symptom: Kernel fails to launch (code 301).
Steps:
- Validate kernel arguments — check for null pointers, zero dimensions, or negative sizes.
- Ensure work group size is within device limits (see the sketch after these steps):
// CUDA: query max threads per block
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// prop.maxThreadsPerBlock is the upper limit
- Confirm the kernel was compiled successfully (no error 501 earlier in the sequence).
- Reduce grid/block dimensions if the launch configuration is too large.
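To make the launch-configuration steps concrete, here is a small sketch (assuming a 1-D launch over n elements and the CUDA runtime API) that derives block and grid sizes from the queried device limit instead of hard-coding them:

```cpp
#include <algorithm>
#include <cstddef>
#include <cuda_runtime.h>

// Derive a 1-D launch configuration from the device limit.
// Returns false if the device properties cannot be queried.
bool computeLaunchConfig(std::size_t n, int device, dim3& grid, dim3& block) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    // A block size above prop.maxThreadsPerBlock triggers launch failures (code 301).
    const int threads = std::min(prop.maxThreadsPerBlock, 256);
    const std::size_t blocks = (n + threads - 1) / threads;   // ceil(n / threads)

    block = dim3(static_cast<unsigned>(threads), 1, 1);
    grid  = dim3(static_cast<unsigned>(blocks), 1, 1);
    return true;
}
```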
Symptom: Kernel launched but failed during GPU execution (code 302).
Steps:
- Enable CUDA compute sanitizer for detailed diagnostics:
compute-sanitizer --tool memcheck ./themisdb
- Add bounds checking to custom kernel code.
- Test with a minimal, reduced-size input to isolate the failure.
- Check for race conditions if multiple streams access shared memory.
Errors 303-305 are transient; the kernel dispatcher retries them automatically before falling back to CPU.
| Code | Name | Common Cause |
|---|---|---|
| 303 | SynchronizationFailed | Windows TDR timeout, GPU hang |
| 304 | OperationTimeout | Kernel exceeded deadline |
| 305 | DeviceLost | GPU reset or physically disconnected |
If retries do not resolve the issue:
- Windows TDR timeout (303): Extend the timeout:
Registry: HKLM\System\CurrentControlSet\Control\GraphicsDrivers
Value: TdrDelay = 60 (seconds)
- GPU overheating (303/304): Check GPU temperature:
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
# Values above 85°C indicate thermal throttling
Improve case airflow or clean GPU heatsink.
- Device lost (305): Check for PCIe slot issues or driver crash:
dmesg | grep -iE "GPU|NVRM|amdgpu|device lost"
- Increase retry count in RetryPolicy for unstable hardware:
RetryPolicy policy;
policy.maxAttempts = 5;
policy.initialDelayMs = 10;
policy.maxDelayMs = 500;
ANNKernelFallbackDispatcher dispatcher(gpuTable, cpuTable, policy);
## Performance Degradation

Symptom: A GPU backend is selected, but queries are no faster than on the CPU.
Diagnosis:
- Check health status: backend->getHealthStatus() — a degraded backend may be falling back.
- Check if kernels are actually dispatching to GPU:
# Monitor GPU utilization while running queries
watch -n 1 nvidia-smi
# Should show >0% GPU utilization during vector operations
- Check whether the kernel dispatcher is silently falling back:
ErrorContext err = backend->getLastError();
// code != 0 → dispatcher fell back to CPU after repeated failures
Symptom: GPU utilization is high but throughput is lower than expected.
Steps:
- Profile PCIe transfer overhead:
nvidia-smi dmon -s u  # utilization monitor
- Batch multiple small queries into a single large kernel launch to amortize PCIe overhead.
- Use pinned (page-locked) host memory when allocating input buffers to improve transfer speed (see the sketch after this list).
- Verify GPU is in PCIe Gen 3/4 × 16 mode, not x1/x4:
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
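For the pinned-memory step, a minimal sketch using the CUDA runtime API; pinned (page-locked) buffers allow DMA transfers and avoid an extra staging copy:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Allocate a pinned (page-locked) host buffer for `count` float32 values.
float* allocPinned(std::size_t count) {
    void* ptr = nullptr;
    if (cudaMallocHost(&ptr, count * sizeof(float)) != cudaSuccess) {
        return nullptr;   // fall back to regular pageable memory if pinning fails
    }
    return static_cast<float*>(ptr);
}

void releasePinned(float* ptr) {
    cudaFreeHost(ptr);
}
```

Pinned memory is a limited resource, so allocate it for long-lived transfer buffers rather than per query.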
After a GPU driver update:
- Pin the driver version in your deployment manifest.
- Run the backend consistency tests to verify L2 distance values are unchanged:
ctest -R test_backend_consistency --output-on-failure
- Roll back the driver if parity tests fail.
## Backend Health Issues

Symptom: status.status == "degraded" — driver is reachable but last operation failed.
Steps:
- Read status.issues for specific, actionable descriptions:
auto h = backend->getHealthStatus();
for (const auto& issue : h.issues) {
    std::cerr << "Issue: " << issue << std::endl;
}
- Check the structured error context:
ErrorContext err = backend->getLastError();
std::cerr << err.format() << std::endl;
// Includes: backend name, error code, description, troubleshooting hint
- Check GPU driver logs:
journalctl -k | grep -i nvidia   # Linux — NVIDIA
journalctl -k | grep -i amdgpu   # Linux — AMD
- Re-run backend detection to promote a healthy alternative:
BackendRegistry::instance().initializeRuntime();
- If the issue is transient (temperature spike, PCIe event), increase RetryPolicy::maxAttempts.
Symptom: Backend completely unavailable; healthy == false, alive == false.
Cause: Driver is unreachable or device is lost.
Steps:
- Verify the GPU is still visible to the OS (see Quick Diagnostics Checklist).
- Restart the driver service or reboot.
- After recovery, call initializeRuntime() again to re-detect backends, as sketched below.
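A brief sketch of that recovery path, assuming the healthy and alive fields named above and that BackendHealthStatus is declared in compute_backend.h:

```cpp
#include <iostream>
#include "acceleration/compute_backend.h"  // BackendRegistry, BackendHealthStatus (assumed)

// Re-run detection once the driver has been restarted or the machine rebooted.
template <typename Backend>
void recoverAcceleration(Backend* backend) {
    BackendHealthStatus h = backend->getHealthStatus();
    if (!h.healthy && !h.alive) {
        BackendRegistry::instance().initializeRuntime();   // re-detect backends
        auto* vb = BackendRegistry::instance().getSelectedVectorBackend();
        std::cerr << (vb ? "Acceleration restored" : "Still CPU-only") << std::endl;
    }
}
```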
## Plugin Loading Failures

Symptom: Expected backend from a dynamically loaded plugin is not available.
Steps:
- Verify the plugin .so/.dll file is in the plugin search path (check THEMIS_PLUGIN_PATH); a quick check is sketched after these steps.
- Check that the plugin was signed or is on the allow-list in plugin_security.cpp:
# Verify plugin SHA-256 hash matches the allow-list entry
sha256sum /path/to/backend_plugin.so
- Check the plugin loader log for the rejection reason:
[WARN] Plugin rejected: signature mismatch — /path/to/backend_plugin.so
- Ensure the plugin ABI version matches the ThemisDB version (no breaking changes before v2.0).
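A small standalone check for the first step, listing what a process would actually see in the plugin directory; it assumes only the THEMIS_PLUGIN_PATH variable and its ./plugins default from the Environment Variables Reference below:

```cpp
#include <cstdlib>
#include <filesystem>
#include <iostream>

int main() {
    // Resolve the plugin directory the same way you expect the server to.
    const char* env = std::getenv("THEMIS_PLUGIN_PATH");
    const std::filesystem::path dir = env ? env : "./plugins";

    if (!std::filesystem::exists(dir)) {
        std::cerr << "Plugin path does not exist: " << dir << std::endl;
        return 1;
    }
    for (const auto& entry : std::filesystem::directory_iterator(dir)) {
        std::cout << entry.path() << std::endl;   // expect your backend_plugin.so / .dll here
    }
    return 0;
}
```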
Symptom: Plugin loads but crashes or destabilizes the server.
Steps:
- Load the plugin in isolation using a test harness before integrating.
- Check that THEMIS_PLUGIN_SANDBOX=strict is set to prevent unsafe operations.
- Contact the plugin vendor with the crash log from journalctl or the Windows event log.
## Platform-Specific Issues

Kernel module not loaded (NVIDIA):
lsmod | grep nvidia
# If missing:
sudo modprobe nvidia
Permission denied on GPU device node:
ls -la /dev/nvidia*
# If group is 'video', add your user:
sudo usermod -aG video $USER
# Then log out and back in
ROCm device permissions:
ls -la /dev/kfd /dev/dri/renderD*
sudo usermod -aG render,video $USER
GPU not detected in a VM / container:
- Ensure GPU passthrough (VFIO / SR-IOV) is configured.
- CUDA requires WDDM 2.x drivers for GPU virtualization.
TDR timeout (error 303): See Error 303 — Transient Errors above.
DirectX Compute not available:
- Requires Windows 10 version 1607 or later.
- Verify DirectX feature level 11_0 is supported: dxdiag.
Metal not available:
- Metal requires macOS 10.11 (El Capitan) or newer.
- GPU must be supported: run system_profiler SPDisplaysDataType | grep Metal.
MoltenVK (Vulkan on macOS):
brew install molten-vk
# Set the ICD path
export VK_ICD_FILENAMES=/usr/local/share/vulkan/icd.d/MoltenVK_icd.json
NVIDIA GPU not visible in container:
# Requires nvidia-container-toolkit on the host
docker run --gpus all themisdb nvidia-smi
Enable GPU in Kubernetes:
resources:
  limits:
    nvidia.com/gpu: 1
Check GPU sharing policy:
nvidia-smi --query-gpu=compute-mode --format=csv
# "Exclusive_Process" → only one process allowed at a time
# Change to "Default" for shared access:
sudo nvidia-smi -c 0

## Logging and Diagnostics

Every error from a backend carries a structured ErrorContext object:
ErrorContext err = backend->getLastError();
// Fields:
// err.code — AccelerationErrorCode enum value
// err.backendName — "CUDA", "HIP", "OpenCL", etc.
// err.message — human-readable description
// err.troubleshootingHint — actionable next step
std::cerr << err.format() << std::endl;
The acceleration metrics system exports Prometheus-compatible data for operational monitoring.
Key metrics to watch:
| Metric | Alert Threshold | Meaning |
|---|---|---|
| acceleration_errors_total{code="201"} | > 0 sustained | Out of device memory |
| acceleration_errors_total{code="303"} | > 5/min | GPU synchronization failures |
| acceleration_fallback_total | Increasing | Kernels falling back to CPU |
| acceleration_latency_p99_seconds | > 2× baseline | Performance regression |
| acceleration_memory_used_bytes | > 80% GPU VRAM | Memory pressure |
Scrape endpoint: http://localhost:<port>/metrics
See metrics.md for the full metrics reference.
Set the THEMIS_LOG_LEVEL environment variable before starting the server:
# Show all acceleration-related debug messages
export THEMIS_LOG_LEVEL=DEBUG
./themisdbLook for log lines tagged [acceleration] or [backend_registry].
## Environment Variables Reference

| Variable | Default | Description |
|---|---|---|
| THEMIS_LOG_LEVEL | INFO | Log verbosity: DEBUG, INFO, WARN, ERROR |
| THEMIS_PLUGIN_PATH | ./plugins | Directory to search for backend plugins |
| THEMIS_PLUGIN_SANDBOX | strict | Plugin security mode: strict or permissive |
| THEMIS_GPU_DEVICE_INDEX | 0 | Index of the preferred GPU device (0-based) |
| THEMIS_MAX_RETRY_ATTEMPTS | 3 | Default RetryPolicy::maxAttempts for all dispatchers |
| THEMIS_DISABLE_GPU | unset | If set to 1, forces CPU-only mode regardless of hardware |
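When debugging configuration drift between environments, it can help to dump the effective values at startup. A standalone sketch using std::getenv and the defaults from the table above; how the server itself consumes these variables is not shown here:

```cpp
#include <cstdlib>
#include <iostream>

// Print the effective value of one acceleration-related environment variable.
static void show(const char* name, const char* defaultValue) {
    const char* v = std::getenv(name);
    std::cout << name << " = " << (v ? v : defaultValue)
              << (v ? "" : "   (default)") << std::endl;
}

int main() {
    show("THEMIS_LOG_LEVEL", "INFO");
    show("THEMIS_PLUGIN_PATH", "./plugins");
    show("THEMIS_PLUGIN_SANDBOX", "strict");
    show("THEMIS_GPU_DEVICE_INDEX", "0");
    show("THEMIS_MAX_RETRY_ATTEMPTS", "3");
    show("THEMIS_DISABLE_GPU", "unset");
    return 0;
}
```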
## See Also

- error_codes.md — full error code reference with per-code resolution steps
- capability_negotiation.md — backend selection, fallback chain, and retry policy configuration
- production_readiness.md — production deployment checklist
- metrics.md — Prometheus metrics reference
- src/acceleration/README.md — module overview, build flags, and directory layout
- include/acceleration/compute_backend.h — BackendCapabilities, CapabilityRequirements, BackendRegistry API
- include/acceleration/kernel_fallback_dispatcher.h — RetryPolicy, ANNKernelFallbackDispatcher, GeoKernelFallbackDispatcher