
Acceleration Module — Operational Troubleshooting Guide

Module: src/acceleration
Version: 2026 Q1
Status: Production


Table of Contents

  1. Quick Diagnostics Checklist
  2. Backend Not Selected / Always Falls Back to CPU
  3. Initialization Failures
  4. GPU Memory Errors
  5. Kernel Execution Failures
  6. Performance Degradation
  7. Backend Health Issues
  8. Plugin Loading Failures
  9. Platform-Specific Issues
  10. Logging and Diagnostics
  11. Environment Variables Reference
  12. See Also

Quick Diagnostics Checklist

Run through this checklist whenever acceleration behaves unexpectedly:

[ ] 1. Check which backend is actually selected:
        auto* vb = BackendRegistry::instance().getSelectedVectorBackend();
        // nullptr → no backend matched requirements

[ ] 2. Check initialization error code:
        ErrorContext err = backend->getLastError();
        std::cerr << err.format() << std::endl;

[ ] 3. Verify GPU is visible to the OS:
        Linux:   lspci | grep -iE "vga|3d|nvidia|amd|intel"
        Windows: Device Manager → Display Adapters
        macOS:   system_profiler SPDisplaysDataType

[ ] 4. Verify driver is installed and loaded:
        NVIDIA:  nvidia-smi
        AMD:     rocm-smi
        OpenCL:  clinfo
        Vulkan:  vulkaninfo --summary

[ ] 5. Confirm build flags match the installed SDK:
        strings themisdb | grep -E "THEMIS_ENABLE_(CUDA|VULKAN|HIP)"

[ ] 6. Check available GPU memory:
        nvidia-smi --query-gpu=memory.free,memory.total --format=csv
        rocm-smi --showmeminfo vram

[ ] 7. Check backend health status:
        BackendHealthStatus h = backend->getHealthStatus();
        // h.status == "healthy" | "degraded" | "unhealthy"

Backend Not Selected / Always Falls Back to CPU

Symptom

getSelectedVectorBackend() returns nullptr, or all operations run on CPU even though a GPU is present.

Cause A — Build flag not set

Diagnosis:

strings ./themisdb | grep -E "THEMIS_ENABLE_(CUDA|VULKAN|HIP)"
# No output → the backend was compiled out

Fix: Rebuild with the appropriate flag:

cmake -DTHEMIS_ENABLE_CUDA=ON ..
cmake -DTHEMIS_ENABLE_VULKAN=ON ..

Cause B — GPU driver not installed or not loaded

Diagnosis:

# NVIDIA
nvidia-smi
# Expected: table showing driver version, GPU name, memory
# If missing: driver not installed

# AMD
rocm-smi

Fix: Install the driver for your platform. See Initialization Failures for per-backend commands.

Cause C — Capability requirements too strict

Diagnosis:

// Check whether default requirements would select a backend:
auto* fallback = BackendRegistry::instance().selectVectorBackendFor(
    BackendRegistry::defaultVectorRequirements());
// If this returns non-null, your custom requirements are stricter than what any available backend supports.

Fix: Relax requirements, or remove the requirement for async/FP16 if not needed:

BackendRegistry::CapabilityRequirements req;
req.needsVectorOps = true;
// Do NOT set req.needsAsync = true unless you require it.
BackendRegistry::instance().initializeRuntime(req, ...);

Cause D — initializeRuntime() not called

Diagnosis: isRuntimeInitialized() returns false.

Fix: Call BackendRegistry::instance().initializeRuntime() once during single-threaded startup, before spawning worker threads.
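
A minimal startup sketch, assuming the no-argument initializeRuntime() overload shown later in this guide and an entry point you control (the header path is taken from the See Also section):

#include "acceleration/compute_backend.h"   // BackendRegistry (path assumed from See Also)
#include <iostream>

int main() {
    auto& registry = BackendRegistry::instance();
    registry.initializeRuntime();                        // once, before any worker threads exist
    if (!registry.isRuntimeInitialized() ||
        registry.getSelectedVectorBackend() == nullptr) {
        std::cerr << "No accelerated backend selected; continuing on CPU" << std::endl;
    }
    // ... spawn worker threads and start serving only after this point ...
    return 0;
}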


Initialization Failures

Error 101 — No Devices Found

Symptom: Backend returns error code 101 on initialize().

Steps:

  1. Verify GPU is physically installed and seated in the PCIe slot.
  2. Check OS visibility: lspci | grep -i vga (Linux) or Device Manager (Windows).
  3. Check BIOS/UEFI — ensure the GPU is not disabled.
  4. Check for hardware failures (try a different slot, reseat GPU).

Error 102 — Driver Not Installed

Symptom: isAvailable() returns false; error code 102.

NVIDIA / CUDA:

# Check if driver is present
nvidia-smi
# Ubuntu — install driver
sudo apt install nvidia-driver-535
# Verify CUDA toolkit
nvcc --version

AMD / HIP (ROCm):

# Check ROCm
rocm-smi
# Ubuntu — install ROCm
sudo apt install rocm-dkms
# Check HIP
hipcc --version

OpenCL:

# List OpenCL platforms
clinfo
# Install ICD loader
sudo apt install ocl-icd-libopencl1 ocl-icd-opencl-dev

Metal (macOS): Requires macOS 10.11 (El Capitan) or later. No separate driver installation needed.

Vulkan:

# Check Vulkan ICD
vulkaninfo --summary
# AMD Mesa (Linux)
sudo apt install mesa-vulkan-drivers
# NVIDIA (Linux)
sudo apt install libvulkan1
# macOS — install MoltenVK
brew install molten-vk

Error 104 — Context Creation Failed

Symptom: Driver is present but initialize() fails with code 104.

Steps:

  1. Close other GPU-intensive applications (rendering, ML training, other database instances).
  2. Restart the GPU driver service:
    # Linux — NVIDIA persistence daemon
    sudo systemctl restart nvidia-persistenced
  3. Check for exclusive-mode conflicts: another process may hold the GPU context.
  4. Reboot if the driver is in an inconsistent state.

Error 105 — Queue / Stream Creation Failed

Symptom: Context created but command queue creation fails (code 105).

Steps:

  1. Ensure context was successfully created (check code 104 first).
  2. Check available GPU memory (nvidia-smi); low memory can prevent stream allocation.
  3. Reduce the number of concurrent streams/queues your application creates (see the sketch after this list).
  4. Update the GPU driver to the latest stable version.
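
For step 3, a sketch of creating a bounded pool of CUDA streams and tolerating a refusal from the driver; the pool itself and its size limit are illustrative, not part of the module API:

#include <cuda_runtime.h>
#include <vector>

// Stop creating streams as soon as the driver refuses, instead of failing hard.
std::vector<cudaStream_t> createStreamPool(int maxStreams) {
    std::vector<cudaStream_t> streams;
    for (int i = 0; i < maxStreams; ++i) {
        cudaStream_t s;
        if (cudaStreamCreate(&s) != cudaSuccess) {
            break;                       // work with however many streams we already have
        }
        streams.push_back(s);
    }
    return streams;
}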

GPU Memory Errors

Error 201 — Out of Device Memory

Symptom: Operations fail with code 201; GPU memory is exhausted.

Diagnosis:

# Check current GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv,noheader
# AMD
rocm-smi --showmeminfo vram

Estimate required VRAM:

Required VRAM ≈ (num_vectors × dimensions × 4 bytes) + ~20% overhead
Example:  1 000 000 vectors × 768 dimensions × 4 bytes ≈ 3.1 GB, plus ~20% overhead ≈ 3.7 GB
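
The same estimate as a small helper, useful for capacity checks before loading a dataset (illustrative only; the module does not expose this function):

#include <cstddef>

// Raw float32 vector data plus roughly 20% index/staging overhead.
std::size_t estimateVramBytes(std::size_t numVectors, std::size_t dims) {
    std::size_t data = numVectors * dims * sizeof(float);
    return data + data / 5;
}
// estimateVramBytes(1'000'000, 768) ≈ 3.7e9 bytes, matching the example above.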

Fix options:

  • Reduce batch size: process vectors in smaller chunks (see the chunking sketch below).
  • Close other GPU applications.
  • Use a GPU with more VRAM.
  • Switch to CPU backend for large batches; see Backend Health Issues.
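
A chunking sketch for the first option; batchSearch(), kMaxBatch, and the flat float32 layout are illustrative assumptions, not part of the module API:

#include <algorithm>
#include <cstddef>

void batchSearch(const float* vectors, std::size_t count, std::size_t dims);  // hypothetical GPU entry point

constexpr std::size_t kMaxBatch = 100'000;                   // vectors per launch; tune to free VRAM

void processInChunks(const float* vectors, std::size_t numVectors, std::size_t dims) {
    for (std::size_t off = 0; off < numVectors; off += kMaxBatch) {
        std::size_t n = std::min(kMaxBatch, numVectors - off);
        batchSearch(vectors + off * dims, n, dims);          // each chunk needs ~ n * dims * 4 bytes of VRAM
    }
}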

Error 204 — Memory Copy Failed

Symptom: Data transfer between host and device fails (code 204).

Steps:

  1. Verify all host-side pointers are valid and properly aligned.
  2. Confirm the copy size matches the allocated buffer size (see the sketch after this list).
  3. Verify GPU is still responsive: nvidia-smi should show the device.
  4. Check for PCIe errors in the kernel log:
    dmesg | grep -iE "pcie|aer|error"
  5. If PCIe errors appear, reseat the GPU or test with another slot.
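
A defensive copy sketch covering steps 1 and 2, using the CUDA runtime directly (the function name and parameters are placeholders):

#include <cuda_runtime.h>
#include <cstddef>
#include <iostream>

// Copy exactly the bytes that were allocated and surface the driver's own error text.
bool copyVectorsToDevice(float* devicePtr, const float* hostPtr,
                         std::size_t numVectors, std::size_t dims) {
    if (devicePtr == nullptr || hostPtr == nullptr) return false;
    std::size_t bytes = numVectors * dims * sizeof(float);
    cudaError_t rc = cudaMemcpy(devicePtr, hostPtr, bytes, cudaMemcpyHostToDevice);
    if (rc != cudaSuccess) {
        std::cerr << "cudaMemcpy failed: " << cudaGetErrorString(rc) << std::endl;
        return false;
    }
    return true;
}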

Kernel Execution Failures

Error 301 — Kernel Launch Failed

Symptom: Kernel fails to launch (code 301).

Steps:

  1. Validate kernel arguments — check for null pointers, zero dimensions, or negative sizes.
  2. Ensure work group size is within device limits:
    // CUDA: query max threads per block
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // prop.maxThreadsPerBlock is the upper limit
  3. Confirm the kernel was compiled successfully (no error 501 earlier in the sequence).
  4. Reduce grid/block dimensions if the launch configuration is too large.
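
Continuing the device query from step 2, a sketch of clamping the launch configuration for step 4 (desiredBlockSize, numElements, and myKernel are illustrative names):

#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Never exceed the device's maxThreadsPerBlock; size the grid to cover all elements.
void launchClamped(std::size_t numElements, int desiredBlockSize) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int block = std::min(desiredBlockSize, prop.maxThreadsPerBlock);
    int grid  = static_cast<int>((numElements + block - 1) / block);
    // myKernel<<<grid, block>>>(...);   // illustrative kernel launch
    (void)grid;                          // silence unused warning while the launch is commented out
}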

Error 302 — Kernel Execution Failed

Symptom: Kernel launched but failed during GPU execution (code 302).

Steps:

  1. Enable CUDA compute sanitizer for detailed diagnostics:
    compute-sanitizer --tool memcheck ./themisdb
  2. Add bounds checking to custom kernel code (see the sketch after this list).
  3. Test with a minimal, reduced-size input to isolate the failure.
  4. Check for race conditions if multiple streams access shared memory.
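
A minimal bounds-checked kernel body for step 2 (the kernel itself is illustrative, not part of the module):

#include <cuda_runtime.h>
#include <cstddef>

// Out-of-range threads do nothing, so over-provisioned grids cannot write past the buffer.
__global__ void scaleKernel(float* data, size_t n, float factor) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}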

Error 303 / 304 / 305 — Transient Errors (Sync / Timeout / Device Lost)

These are transient — the kernel dispatcher retries them automatically before falling back to CPU.

  Code   Name                    Common Cause
  303    SynchronizationFailed   Windows TDR timeout, GPU hang
  304    OperationTimeout        Kernel exceeded deadline
  305    DeviceLost              GPU reset or physically disconnected

If retries do not resolve the issue:

  • Windows TDR timeout (303): Extend the timeout:
    Registry: HKLM\System\CurrentControlSet\Control\GraphicsDrivers
    Value: TdrDelay = 60 (seconds)
    
  • GPU overheating (303/304): Check GPU temperature:
    nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
    # Values above 85°C indicate thermal throttling
    Improve case airflow or clean GPU heatsink.
  • Device lost (305): Check for PCIe slot issues or driver crash:
    dmesg | grep -iE "GPU|NVRM|amdgpu|device lost"
  • Increase retry count in RetryPolicy for unstable hardware:
    RetryPolicy policy;
    policy.maxAttempts    = 5;
    policy.initialDelayMs = 10;
    policy.maxDelayMs     = 500;
    ANNKernelFallbackDispatcher dispatcher(gpuTable, cpuTable, policy);

Performance Degradation

GPU Backend Selected but Performance Is CPU-Like

Diagnosis:

  1. Check health status: backend->getHealthStatus() — a degraded backend may be falling back.
  2. Check if kernels are actually dispatching to GPU:
    # Monitor GPU utilization while running queries
    watch -n 1 nvidia-smi
    # Should show >0% GPU utilization during vector operations
  3. Check whether the kernel dispatcher is silently falling back:
    ErrorContext err = backend->getLastError();
    // code != 0 → dispatcher fell back to CPU after repeated failures

High Memory Bandwidth / PCIe Bottleneck

Symptom: GPU utilization is high but throughput is lower than expected.

Steps:

  1. Profile PCIe transfer overhead:
    nvidia-smi dmon -s u   # utilization monitor
  2. Batch multiple small queries into a single large kernel launch to amortize PCIe overhead.
  3. Use pinned (page-locked) host memory when allocating input buffers to improve transfer speed (see the sketch after this list).
  4. Verify the GPU is running at PCIe Gen 3/4 x16, not x1/x4:
    nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
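
A pinned-buffer sketch for step 3; the sizing and the pageable fallback are illustrative:

#include <cuda_runtime.h>
#include <cstddef>

// Page-locked (pinned) host memory lets the driver use faster asynchronous DMA transfers.
float* allocPinnedBuffer(std::size_t numVectors, std::size_t dims) {
    float* hostBuf = nullptr;
    std::size_t bytes = numVectors * dims * sizeof(float);
    if (cudaMallocHost(reinterpret_cast<void**>(&hostBuf), bytes) != cudaSuccess) {
        return nullptr;              // caller falls back to ordinary pageable memory
    }
    return hostBuf;                  // release with cudaFreeHost() when done
}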

Unexpected Throughput Regression After Driver Update

  1. Pin the driver version in your deployment manifest.
  2. Run the backend consistency tests to verify L2 distance values are unchanged:
    ctest -R test_backend_consistency --output-on-failure
  3. Roll back the driver if parity tests fail.

Backend Health Issues

getHealthStatus() Returns degraded

Symptom: status.status == "degraded" — the driver is reachable but the most recent operation failed.

Steps:

  1. Read status.issues for specific, actionable descriptions:
    auto h = backend->getHealthStatus();
    for (const auto& issue : h.issues) {
        std::cerr << "Issue: " << issue << std::endl;
    }
  2. Check the structured error context:
    ErrorContext err = backend->getLastError();
    std::cerr << err.format() << std::endl;
    // Includes: backend name, error code, description, troubleshooting hint
  3. Check GPU driver logs:
    journalctl -k | grep -i nvidia   # Linux — NVIDIA
    journalctl -k | grep -i amdgpu   # Linux — AMD
  4. Re-run backend detection to promote a healthy alternative:
    BackendRegistry::instance().initializeRuntime();
  5. If the issue is transient (temperature spike, PCIe event), increase RetryPolicy::maxAttempts.

getHealthStatus() Returns unhealthy

Symptom: Backend completely unavailable; healthy == false, alive == false.

Cause: Driver is unreachable or device is lost.

Steps:

  1. Verify the GPU is still visible to the OS (see Quick Diagnostics Checklist).
  2. Restart the driver service or reboot.
  3. After recovery, call initializeRuntime() again to re-detect backends.
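
A recovery sketch for step 3, using only calls shown elsewhere in this guide (the status string mirrors the values listed in the Quick Diagnostics Checklist):

// After the driver is back, re-run detection and confirm a backend was promoted.
auto h = backend->getHealthStatus();
if (h.status == "unhealthy") {
    BackendRegistry::instance().initializeRuntime();     // re-detect devices
    auto* vb = BackendRegistry::instance().getSelectedVectorBackend();
    if (vb == nullptr) {
        std::cerr << "Still no accelerated backend available; staying on CPU" << std::endl;
    }
}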

Plugin Loading Failures

Plugin Not Loaded

Symptom: Expected backend from a dynamically loaded plugin is not available.

Steps:

  1. Verify the plugin .so / .dll file is in the plugin search path (check THEMIS_PLUGIN_PATH).
  2. Check that the plugin was signed or is on the allow-list in plugin_security.cpp:
    # Verify plugin SHA-256 hash matches the allow-list entry
    sha256sum /path/to/backend_plugin.so
  3. Check the plugin loader log for rejection reason:
    [WARN] Plugin rejected: signature mismatch — /path/to/backend_plugin.so
    
  4. Ensure the plugin ABI version matches the ThemisDB version (no breaking changes before v2.0).

Plugin Causes Crash on Load

Steps:

  1. Load the plugin in isolation using a test harness before integrating.
  2. Check that THEMIS_PLUGIN_SANDBOX=strict is set to prevent unsafe operations.
  3. Contact the plugin vendor with the crash log from journalctl or the Windows event log.

Platform-Specific Issues

Linux

Kernel module not loaded (NVIDIA):

lsmod | grep nvidia
# If missing:
sudo modprobe nvidia

Permission denied on GPU device node:

ls -la /dev/nvidia*
# If group is 'video', add your user:
sudo usermod -aG video $USER
# Then log out and back in

ROCm device permissions:

ls -la /dev/kfd /dev/dri/renderD*
sudo usermod -aG render,video $USER

Windows

GPU not detected in a VM / container:

  • Ensure GPU passthrough (VFIO / SR-IOV) is configured.
  • CUDA requires WDDM 2.x drivers for GPU virtualization.

TDR timeout (error 303): See Error 303 — Transient Errors above.

DirectX Compute not available:

  • Requires Windows 10 version 1607 or later.
  • Verify DirectX feature level 11_0 is supported: dxdiag.

macOS

Metal not available:

  • Metal requires macOS 10.11 (El Capitan) or newer.
  • GPU must be supported: run system_profiler SPDisplaysDataType | grep Metal.

MoltenVK (Vulkan on macOS):

brew install molten-vk
# Set the ICD path
export VK_ICD_FILENAMES=/usr/local/share/vulkan/icd.d/MoltenVK_icd.json

Docker / Kubernetes

NVIDIA GPU not visible in container:

# Requires nvidia-container-toolkit on the host
docker run --gpus all themisdb nvidia-smi

Enable GPU in Kubernetes:

resources:
  limits:
    nvidia.com/gpu: 1

Check GPU sharing policy:

nvidia-smi --query-gpu=compute_mode --format=csv
# "Exclusive_Process" → only one process allowed at a time
# Change to "Default" for shared access:
sudo nvidia-smi -c 0

Logging and Diagnostics

Structured Error Context

Every error from a backend carries a structured ErrorContext object:

ErrorContext err = backend->getLastError();
// Fields:
//   err.code               — AccelerationErrorCode enum value
//   err.backendName        — "CUDA", "HIP", "OpenCL", etc.
//   err.message            — human-readable description
//   err.troubleshootingHint — actionable next step
std::cerr << err.format() << std::endl;

Prometheus Metrics

The acceleration metrics system exports Prometheus-compatible data for operational monitoring.

Key metrics to watch:

  Metric                                   Alert Threshold     Meaning
  acceleration_errors_total{code="201"}    > 0 sustained       Out of device memory
  acceleration_errors_total{code="303"}    > 5/min             GPU synchronization failures
  acceleration_fallback_total              Increasing          Kernels falling back to CPU
  acceleration_latency_p99_seconds         > 2× baseline       Performance regression
  acceleration_memory_used_bytes           > 80% of GPU VRAM   Memory pressure

Scrape endpoint: http://localhost:<port>/metrics

See metrics.md for the full metrics reference.

Enabling Verbose Logging

Set the THEMIS_LOG_LEVEL environment variable before starting the server:

# Show all acceleration-related debug messages
export THEMIS_LOG_LEVEL=DEBUG
./themisdb

Look for log lines tagged [acceleration] or [backend_registry].


Environment Variables Reference

  Variable                     Default     Description
  THEMIS_LOG_LEVEL             INFO        Log verbosity: DEBUG, INFO, WARN, ERROR
  THEMIS_PLUGIN_PATH           ./plugins   Directory to search for backend plugins
  THEMIS_PLUGIN_SANDBOX        strict      Plugin security mode: strict or permissive
  THEMIS_GPU_DEVICE_INDEX      0           Index of the preferred GPU device (0-based)
  THEMIS_MAX_RETRY_ATTEMPTS    3           Default RetryPolicy::maxAttempts for all dispatchers
  THEMIS_DISABLE_GPU           unset       If set to 1, forces CPU-only mode regardless of hardware

See Also

  • error_codes.md — full error code reference with per-code resolution steps
  • capability_negotiation.md — backend selection, fallback chain, and retry policy configuration
  • production_readiness.md — production deployment checklist
  • metrics.md — Prometheus metrics reference
  • src/acceleration/README.md — module overview, build flags, and directory layout
  • include/acceleration/compute_backend.h — BackendCapabilities, CapabilityRequirements, BackendRegistry API
  • include/acceleration/kernel_fallback_dispatcher.h — RetryPolicy, ANNKernelFallbackDispatcher, GeoKernelFallbackDispatcher