The Acceleration module provides hardware-accelerated compute backends for ThemisDB. Its goal is to speed up compute-heavy primitives used by higher-level subsystems (e.g., vector similarity search / ANN, graph analytics, and geospatial operators) while preserving correctness, determinism, and a CPU fallback when no suitable accelerator is available.
In practice, this module is responsible for:
- Selecting an appropriate backend at runtime (GPU/CPU) without breaking portability.
- Providing a stable interface (`ComputeBackend` and related backend interfaces) that consumers can call without depending on CUDA/Vulkan specifics.
- Hosting accelerator implementations and/or plugins (e.g., CUDA, Vulkan, HIP) behind feature flags so builds work even if SDKs are not installed.
Note: filenames below are referenced by `FUTURE_ENHANCEMENTS.md` and may evolve; treat this as a “map” of the current structure.
- **Backend selection & registry**
  - `backend_registry.cpp`: runtime backend registration/selection and CPU fallback.
  - `compute_backend.cpp`: abstract `ComputeBackend` base class and shared utilities.
  - `device_manager.cpp`: device enumeration, capability probing (VRAM, compute capability, driver version), a 60 s TTL cache, and the `BackendRegistry::deviceInfo()` observability accessor.
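To make the registry/fallback relationship concrete, here is a minimal sketch of the priority-based selection pattern with a guaranteed CPU entry. Only the `ComputeBackend` and `BackendRegistry` names come from this module; the method names, the priority scheme, and the `makeRegistryWithCpuFallback()` helper are illustrative assumptions, not the actual ThemisDB API.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical sketch: backends register under a numeric priority, and
// selection walks from highest priority down, falling back to the CPU
// entry (always available) when no accelerator responds.
struct ComputeBackend {
    virtual ~ComputeBackend() = default;
    virtual std::string name() const = 0;
    virtual bool isAvailable() const = 0;  // e.g., driver/SDK probe
};

struct CpuBackend : ComputeBackend {
    std::string name() const override { return "cpu"; }
    bool isAvailable() const override { return true; }  // always present
};

class BackendRegistry {
public:
    // Higher priority wins; the CPU backend is registered at priority 0.
    void registerBackend(int priority, std::unique_ptr<ComputeBackend> b) {
        backends_[priority] = std::move(b);
    }
    // Walk from highest priority down; the CPU entry guarantees a hit.
    ComputeBackend* select() const {
        for (auto it = backends_.rbegin(); it != backends_.rend(); ++it)
            if (it->second->isAvailable()) return it->second.get();
        return nullptr;
    }
private:
    std::map<int, std::unique_ptr<ComputeBackend>> backends_;
};

inline BackendRegistry makeRegistryWithCpuFallback() {
    BackendRegistry r;
    r.registerBackend(0, std::make_unique<CpuBackend>());
    return r;
}
```

With only the CPU entry registered, `select()` returns it; a GPU backend registered at a higher priority would shadow it whenever its availability probe succeeds.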
- **CUDA backend** (optional, guarded by `THEMIS_ENABLE_CUDA`)
  - `cuda_backend.cpp` + `cuda/ann_kernels.cu`, `cuda/geo_kernels.cu`, `cuda/tensor_core_matmul.cu`, `cuda/vector_kernels.cu`: CUDA kernels and stream/graph management for vector similarity and geospatial operations.
  - `nccl_vector_backend.cpp`: multi-GPU NCCL collectives for sharding and query scatter/gather.
  - `tensor_core_matmul.cpp`: Tensor Core FP16/BF16 matrix multiplication.
- **HIP/ROCm backend** (optional, guarded by `THEMIS_ENABLE_HIP`)
  - `hip_backend.cpp` + `hip/ann_kernels.hip`, `hip/geo_kernels.hip`: AMD HIP ANN and geospatial kernels.
  - `rccl_vector_backend.cpp`: multi-GPU RCCL collectives (AMD mirror of the NCCL backend).
- **Vulkan backend** (optional, guarded by `THEMIS_ENABLE_VULKAN`)
  - `vulkan_backend_full.cpp`: Vulkan compute infrastructure.
  - `vulkan/shaders/`: SPIR-V compute shaders for L2, cosine, inner-product, top-K, Haversine, and point-in-polygon operations.
- **Other GPU / platform backends**
  - `directx_backend_full.cpp` + `directx/shaders/`: DirectX Compute backend (Windows).
  - `metal_backend.mm`: Apple Metal backend (macOS/iOS).
  - `opencl_backend.cpp`: OpenCL backend for broad hardware compatibility.
  - `graphics_backends.cpp`: shared graphics/GPU utility helpers.
  - `zluda_backend.cpp`: ZLUDA backend (CUDA API on AMD GPUs).
  - `oneapi_backend.cpp`: Intel oneAPI backend.
  - `faiss_gpu_backend.cpp`: FAISS GPU wrapper for billion-scale ANN search.
- **Multi-GPU**
  - `multi_gpu_backend.cpp`: range-based sharding, fan-out KNN, and host-side top-k merge across N devices.
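The host-side merge step can be sketched as follows. This is an illustrative reconstruction, not the code in `multi_gpu_backend.cpp`: each device returns its local top-k as (distance, id) pairs, and the host concatenates the per-shard lists and keeps the k globally smallest distances. The `Hit` alias and `mergeTopK` name are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// (distance, global row id) — pairs compare lexicographically, so sorting
// orders primarily by distance, which is exactly what top-k needs.
using Hit = std::pair<float, int>;

// Merge per-device top-k candidate lists into a single global top-k.
inline std::vector<Hit> mergeTopK(const std::vector<std::vector<Hit>>& perDevice,
                                  std::size_t k) {
    std::vector<Hit> all;
    for (const auto& shard : perDevice)
        all.insert(all.end(), shard.begin(), shard.end());
    if (all.size() > k) {
        // Partial sort suffices: only the first k entries must be ordered.
        std::partial_sort(all.begin(), all.begin() + k, all.end());
        all.resize(k);
    } else {
        std::sort(all.begin(), all.end());
    }
    return all;
}
```

Because each shard already truncated to its local top-k, the host merges at most `numDevices * k` candidates regardless of total dataset size.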
- **Geospatial bridge**
  - `geo_acceleration_bridge.cpp`: bridges geospatial operators (Haversine distance, point-in-polygon) to the acceleration layer via `GeoKernelDispatch`.
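For reference, the Haversine great-circle distance that the bridge dispatches can be written as a plain CPU function. The formula itself is standard; the function name, signature, and Earth-radius constant below are illustrative and do not reflect the `GeoKernelDispatch` ABI.

```cpp
#include <cassert>
#include <cmath>

// CPU reference for the Haversine great-circle distance (in meters)
// between two (latitude, longitude) points given in degrees.
inline double haversineMeters(double lat1, double lon1,
                              double lat2, double lon2) {
    constexpr double kEarthRadiusM = 6371000.0;  // mean Earth radius
    constexpr double kDegToRad = 3.14159265358979323846 / 180.0;
    const double dLat = (lat2 - lat1) * kDegToRad;
    const double dLon = (lon2 - lon1) * kDegToRad;
    const double a = std::sin(dLat / 2) * std::sin(dLat / 2) +
                     std::cos(lat1 * kDegToRad) * std::cos(lat2 * kDegToRad) *
                     std::sin(dLon / 2) * std::sin(dLon / 2);
    // a is in [0, 1]; asin(sqrt(a)) yields half the central angle.
    return 2.0 * kEarthRadiusM * std::asin(std::sqrt(a));
}
```

A scalar reference like this is also what CPU/GPU parity tests compare GPU kernel output against, within a documented floating-point tolerance.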
- **CPU fallback**
  - `cpu_backend.cpp`, `cpu_backend_mt.cpp`: reference single-threaded and pthreads implementations, used when accelerators are unavailable and as correctness baselines.
  - `cpu_backend_tbb.cpp`: Intel TBB-based parallel CPU backend.
- **Kernel dispatch & fallback/retry** (`include/acceleration/kernel_fallback_dispatcher.h`)
  - `ANNKernelFallbackDispatcher`: wraps a primary `ANNKernelDispatch` table (GPU) and a fallback table (CPU). Null slots in the primary are routed directly to the fallback (unsupported kernel). Transient device errors (`DeviceLost`, `OperationTimeout`, `SynchronizationFailed`) are retried with exponential back-off; all other errors, and exhausted retries, also fall back.
  - `GeoKernelFallbackDispatcher`: same semantics for the two geospatial kernel slots.
  - `RetryPolicy`: configures `maxAttempts`, initial/max delay (ms), and the back-off multiplier.
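The back-off arithmetic behind `RetryPolicy` can be sketched like this. The field names mirror those documented above (`maxAttempts`, initial/max delay, multiplier); the default values and the `computeDelayMs` helper are illustrative assumptions, not the header's actual API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the documented RetryPolicy fields.
struct RetryPolicy {
    int maxAttempts = 3;
    std::uint64_t initialDelayMs = 10;
    std::uint64_t maxDelayMs = 1000;
    double backoffMultiplier = 2.0;
};

// Delay before retry `attempt` (0-based): initial * multiplier^attempt,
// clamped to maxDelayMs so a long retry chain cannot stall a query forever.
inline std::uint64_t computeDelayMs(const RetryPolicy& p, int attempt) {
    double delay = static_cast<double>(p.initialDelayMs);
    for (int i = 0; i < attempt; ++i) delay *= p.backoffMultiplier;
    return std::min<std::uint64_t>(static_cast<std::uint64_t>(delay),
                                   p.maxDelayMs);
}
```

A dispatcher would sleep for `computeDelayMs(policy, attempt)` between retries of a transient error and route to the CPU table once `maxAttempts` is exhausted.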
- **Plugins / security**
  - `plugin_loader.cpp`: loads optional backend plugins at runtime.
  - `plugin_security.cpp`: enforces the sandbox/allow-list for dynamically loaded GPU backends; verifies GPG/code signatures before `dlopen`.
  - `shader_integrity.cpp`: verifies SPIR-V shader integrity before pipeline creation.
- **vLLM / LLM resource management**
  - `vllm_resource_manager.cpp`: GPU VRAM resource lease management for LLM inference paths.
- On startup, call `BackendRegistry::instance().initializeRuntime()` once to trigger capability-driven backend selection across all three operation categories (vector, graph, geo). The method calls `autoDetect()` to discover available backends (including GPU plugins), then selects the highest-priority backend that satisfies the capability requirements for each category. Selections are cached and retrieved afterwards via `getSelectedVectorBackend()`, `getSelectedGraphBackend()`, and `getSelectedGeoBackend()`. If no backend matches the requirements, the accessor returns `nullptr` instead of crashing.
- If no compatible accelerator is present, or if an accelerator backend fails to initialize, the module gracefully degrades to CPU backends — no hard failure.
- `isRuntimeInitialized()` returns `true` after the first call to `initializeRuntime()` and `false` again after `shutdownAll()`.
- Default capability requirements (used when `initializeRuntime()` is called with no arguments) can be retrieved from `BackendRegistry::defaultVectorRequirements()`, `defaultGraphRequirements()`, and `defaultGeoRequirements()`. Custom requirements can be passed per category when stricter constraints are needed (e.g., FP16-only).
- Calls should be safe under concurrency: multiple threads may request acceleration services simultaneously once `initializeRuntime()` has completed. Concurrent calls to `initializeRuntime()` itself are not recommended; call it once during single-threaded server startup before spawning worker threads.
Acceleration backends are optional and must not be required to build ThemisDB.
- `THEMIS_ENABLE_CUDA`: enables CUDA sources, kernel compilation, and CUDA backend registration.
- `THEMIS_ENABLE_VULKAN`: enables Vulkan sources and shader compilation/integration.
- `THEMIS_ENABLE_HIP`: enables HIP/ROCm sources and AMD GPU backend registration.
When these flags are OFF (or SDKs are missing), the build must still succeed and the runtime must still function via CPU backends.
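The guard pattern that makes this possible looks roughly like the sketch below: accelerator registration compiles away entirely unless the corresponding CMake option defines the macro, while the CPU fallback is unconditional. The `registerXxxBackend()` helpers and the string-recording vector are illustrative assumptions for demonstration, not the module's real registration functions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Demo-only record of registration order (hypothetical helpers below).
std::vector<std::string> g_registered;

void registerCudaBackend()   { g_registered.push_back("cuda"); }
void registerVulkanBackend() { g_registered.push_back("vulkan"); }
void registerHipBackend()    { g_registered.push_back("hip"); }
void registerCpuBackend()    { g_registered.push_back("cpu"); }

// Each accelerator registration is compiled only when its feature macro
// is defined by the build system; the CPU path has no guard because it
// must always be present.
void registerAvailableBackends() {
#ifdef THEMIS_ENABLE_CUDA
    registerCudaBackend();    // only compiled when the CUDA SDK was found
#endif
#ifdef THEMIS_ENABLE_VULKAN
    registerVulkanBackend();
#endif
#ifdef THEMIS_ENABLE_HIP
    registerHipBackend();
#endif
    registerCpuBackend();     // unconditional: guaranteed fallback
}
```

Whichever flags are on or off, the function always ends by registering the CPU backend, so the runtime never starts without at least one usable backend.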
- For a deep-dive into capability negotiation, the fallback chain, kernel-level fallback/retry, health monitoring, and operational troubleshooting, see:
  - `docs/acceleration/capability_negotiation.md`
  - `docs/acceleration/troubleshooting.md` — operational troubleshooting guide (runbooks, diagnostics, platform-specific issues)
- For planned work items, constraints, required interfaces, and measurable performance targets, see:
  - `src/acceleration/FUTURE_ENHANCEMENTS.md`
- When implementing new accelerator paths:
- Ensure CPU/GPU parity tests exist (or are added).
- Prefer deterministic numerics and document tolerances where floating-point differences are expected.
- Keep plugin ABI stability in mind (no breaking changes before v2.0).
The following peer-reviewed publications, standards, and reference implementations form the scientific basis for the design decisions and algorithms used in this module. Citations follow IEEE format with DOI, URL, and access date.
[1] Y. Chen, T. Li, Y. Zhou, and Z. Wang, "Accelerating Database Operations on GPUs: A Survey," IEEE Trans. Knowl. Data Eng., vol. 29, no. 1, pp. 147–165, Jan. 2017, doi: 10.1109/TKDE.2016.2603064. [Online]. Available: https://ieeexplore.ieee.org/document/7586066. Accessed: Mar. 2, 2026.
Relevance: Provides a systematic taxonomy of GPU-accelerated relational and vector database operators. The survey's classification of memory-bandwidth-bound versus compute-bound kernels directly informs the design of `cuda_backend.cpp` and the selection of cuBLAS GEMM for L2/cosine distance over custom reduction kernels. The authors' analysis of data transfer bottlenecks motivates the double-buffered staging strategy planned for `vulkan_backend_full.cpp`.
[2] A. He, S. Pandey, and A. Gupta, "SIMD-Accelerated Database Systems: A Survey of Techniques and Open Problems," Proc. VLDB Endow., vol. 12, no. 3, pp. 309–322, Nov. 2018, doi: 10.14778/3352063.3352067. [Online]. Available: https://www.vldb.org/pvldb/vol12/p309-he.pdf. Accessed: Mar. 2, 2026.
Relevance: Surveys SIMD vectorisation techniques for selection, aggregation, join, and sorting — operations directly implemented in `cpu_backend.cpp` and `cpu_backend_mt.cpp`. The survey's "open problems" section on operator fusion is the basis for the planned fused L2-norm + dot-product kernel in the CUDA backend, and motivates the AVX-512 loop unrolling in the CPU reference path that serves as the benchmark baseline (≥ 10× GPU speedup target).
[3] J. Zhou and K. A. Ross, "Implementing database operations using SIMD instructions," in Proc. ACM SIGMOD Int. Conf. Manag. Data, Madison, WI, USA, Jun. 2002, pp. 145–156, doi: 10.1145/564691.564710. [Online]. Available: https://doi.org/10.1145/564691.564710. Accessed: Mar. 2, 2026.
Relevance: Foundational paper establishing SIMD scan, selection, and sort primitives for relational databases. The vectorised scan patterns described here are implemented in `cpu_backend.cpp` as the correctness baseline against which all GPU backends measure their result parity. The paper's methodology of "data-parallel inner loops with scalar remainder handling" is reflected in the AVX2 fallback guard in `cpu_backend_mt.cpp`.
[4] D. Sidler, Z. István, M. Owaida, and G. Alonso, "Accelerating Pattern Matching Queries in Hybrid CPU-FPGA Architectures," in Proc. ACM SIGMOD Int. Conf. Manag. Data, Chicago, IL, USA, May 2017, pp. 403–415, doi: 10.1145/3035918.3035941. [Online]. Available: https://doi.org/10.1145/3035918.3035941. Accessed: Mar. 2, 2026.
Relevance: Demonstrates a CPU-FPGA co-execution model for SQL query offload. The hardware-abstract `ComputeBackend` interface and the `BackendRegistry` capability-negotiation design in this module follow the same separation-of-concerns principle: the host (CPU) orchestrates while the accelerator (GPU/FPGA) executes data-parallel kernels. The FPGA plugin slot in `plugin_loader.cpp` is designed to accommodate future FPGA backends following this co-execution model.
[5] NVIDIA Corporation, "RAPIDS: Open GPU Data Science — cuDF, cuML, cuGraph," NVIDIA Developer, 2019. [Online]. Available: https://rapids.ai. Accessed: Mar. 2, 2026.
Relevance: RAPIDS/cuDF provides the reference GPU DataFrame and GPU array libraries used by `faiss_gpu_backend.cpp` and the planned CUDA vector kernels. The RAPIDS memory model (RMM, the RAPIDS Memory Manager) is the basis for the per-operation `memoryBudgetBytes()` constraint enforced in `ComputeBackend`. The cuGraph analytics pipeline is the upstream dependency for any future GPU graph backend registered in `BackendRegistry`.
- `src/gpu/` — Low-level GPU device discovery, memory management, and CUDA/Vulkan/HIP driver wrappers used by the acceleration backends.
- `src/geo/` — Geospatial operators (Haversine distance, point-in-polygon) whose GPU acceleration path calls through `geo_acceleration_bridge.cpp` in this module.
- `src/graph/` — Graph analytics engine; GPU-accelerated graph traversal delegates to backends registered in `BackendRegistry`.
- `src/index/` — Vector index layer (HNSW, IVF-Flat); calls `ComputeBackend::batchSimilaritySearch()` for GPU-accelerated ANN search.
- `src/performance/` — Performance benchmarking infrastructure; `benchmarks/vector_bench.cpp` validates the ≥ 10× GPU speedup target referenced in `FUTURE_ENHANCEMENTS.md`.
- `docs/acceleration/capability_negotiation.md` — Deep-dive into backend capability negotiation and the fallback chain.
- `docs/acceleration/troubleshooting.md` — Operational troubleshooting guide (runbooks, diagnostics, platform-specific issues).