On-call procedures for common GPU module incidents. Each runbook follows a Detect → Diagnose → Mitigate → Verify → Follow-up structure.
## Runbook 1: GPU OOM (Global VRAM Limit)

**Detect**

- `TryAllocateGPU` returns `false`; `GPUSafeFail` routes to CPU fallback.
- `themis_gpu_alloc_total{result="fail_global_limit"}` counter climbing.
- `VRAM_HIGH` alert firing (`themis_gpu_vram_allocated_bytes / limit ≥ 0.80`).
**Diagnose**

```
GET /admin/gpu/stats
# Look for: allocated_bytes / max_bytes > 0.80, alloc_fail_global_limit > 0
```

Or query metrics:

```
themis_gpu_vram_allocated_bytes
themis_gpu_alloc_total{result="fail_global_limit"}
```
- Call `GET /admin/gpu/tenants` to find which tenant(s) hold the most VRAM.
- Use `GPUMemoryManager::GetActiveAllocations()` to list live allocation tags.
- Check the audit log (`GPUAuditLog::snapshot()`) for `ALLOC_FAIL_GLOBAL_LIMIT` events and their `tag`/`tenant_id`.
**Mitigate**

- Short-term: CPU fallback is already active. Monitor latency impact; alert if p99 exceeds budget.
- Reduce tenant usage: call `SetTenantQuota(tenant_id, lower_limit)` to cap the largest consumer temporarily.
- Force deallocation: if a specific allocation is known to be leaked, call `DeallocateGPU(size, tenant_id)` against the owning tenant.
- Upgrade edition: if the OOM is legitimate load growth, plan an edition upgrade to raise the VRAM cap.
**Verify**

```
GET /admin/gpu/stats       → allocated_bytes should decrease
themis_gpu_fallback_total  → should stop growing
```
**Follow-up**

- Add or tighten per-tenant quotas with `SetTenantQuota`.
- Review the memory-pool preallocation hint item in `FUTURE_ENHANCEMENTS.md`.
- File a capacity-planning ticket if load is genuinely growing.
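The allocation gate this runbook describes (global limit checked first, then tenant quota, with `false` meaning CPU fallback) can be sketched as a minimal model. Only the method names `TryAllocateGPU` / `SetTenantQuota` / `DeallocateGPU` come from the runbook; the `VramModel` class and its fields are illustrative, and the real `GPUMemoryManager` also tracks allocation tags and emits audit events.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative model of the two-level allocation gate. Not the real
// GPUMemoryManager: no tags, no audit log, no thread safety.
class VramModel {
public:
    explicit VramModel(uint64_t global_limit) : global_limit_(global_limit) {}

    void SetTenantQuota(const std::string& tenant, uint64_t quota) {
        quota_[tenant] = quota;
    }

    // Returns false (caller routes to CPU fallback) when either the
    // global limit or the tenant's quota would be exceeded.
    bool TryAllocateGPU(uint64_t size, const std::string& tenant) {
        if (allocated_ + size > global_limit_) return false;  // fail_global_limit
        uint64_t used = used_[tenant];
        auto it = quota_.find(tenant);
        if (it != quota_.end() && used + size > it->second)
            return false;                                     // fail_tenant_quota
        allocated_ += size;
        used_[tenant] = used + size;
        return true;
    }

    void DeallocateGPU(uint64_t size, const std::string& tenant) {
        allocated_ -= size;
        used_[tenant] -= size;
    }

    uint64_t allocated_bytes() const { return allocated_; }

private:
    uint64_t global_limit_;
    uint64_t allocated_ = 0;
    std::map<std::string, uint64_t> quota_;
    std::map<std::string, uint64_t> used_;
};
```

Note how the global check runs before the quota check, matching the metric labels above: a tenant under quota can still fail with `fail_global_limit` when the device as a whole is full.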
## Runbook 2: Device Unavailable

**Detect**

- `DeviceDiscovery::Enumerate()` returns only `CPU_FALLBACK`.
- `DEVICE_UNAVAILABLE` alert firing.
- `GPUSafeFail` state transitions to `FAILED` or `CIRCUIT_OPEN`.
- `CIRCUIT_OPEN` alert firing.
**Diagnose**

```
GET /admin/gpu/stats           → backend == "CPU_FALLBACK"
themis_gpu_circuit_open_total  → non-zero
```
- Check host GPU health: `nvidia-smi` (CUDA) or `rocm-smi` (ROCm).
- Inspect `GPUSafeFail::getStatus().last_error` for the driver error message.
- Check the audit log for `DEVICE_UNAVAILABLE` and `CIRCUIT_OPENED` events.
- If `circuit_opened_at` is recent, the circuit-breaker timeout has not yet elapsed: no GPU retries will happen until `circuit_reset_timeout_secs` passes.
**Mitigate**

- Wait for automatic reset: after `circuit_reset_timeout_secs` the circuit transitions to `DEGRADED` and probes resume.
- Manual reset: call `GPUSafeFail::forceHealthy()` after the hardware is confirmed operational.
- Persistent failure: call `GPUSafeFail::forceFailed("maintenance")` to keep CPU fallback active while hardware is replaced; re-enable with `forceHealthy()` once resolved.
- Re-enumerate: call `DeviceDiscovery::Enumerate()` and pass the result to `GPULoadBalancer::updateDevices()` to refresh the device list.
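The circuit-breaker lifecycle above can be sketched as a toy state machine. The state names and the `circuit_reset_timeout_secs` semantics come from the runbook; everything else (thresholds, the clock-as-integer interface) is illustrative, and the real `GPUSafeFail` also tracks error rates and emits audit events.

```cpp
#include <cstdint>

// Toy circuit breaker: HEALTHY → FAILED → CIRCUIT_OPEN, then after the
// reset timeout the breaker moves to DEGRADED and probes resume.
enum class State { HEALTHY, DEGRADED, FAILED, CIRCUIT_OPEN };

class CircuitBreaker {
public:
    CircuitBreaker(int fail_threshold, int64_t reset_timeout_secs)
        : fail_threshold_(fail_threshold), reset_timeout_(reset_timeout_secs) {}

    void recordFailure(int64_t now_secs) {
        if (++failures_ >= fail_threshold_) {
            state_ = State::CIRCUIT_OPEN;
            opened_at_ = now_secs;  // analogous to circuit_opened_at
        } else {
            state_ = State::FAILED;
        }
    }

    // While the circuit is open and the timeout has not elapsed,
    // no GPU retries happen; afterwards probes resume in DEGRADED.
    bool mayRetry(int64_t now_secs) {
        if (state_ == State::CIRCUIT_OPEN) {
            if (now_secs - opened_at_ < reset_timeout_) return false;
            state_ = State::DEGRADED;  // automatic reset
        }
        return true;
    }

    void forceHealthy() { state_ = State::HEALTHY; failures_ = 0; }
    State state() const { return state_; }

private:
    int fail_threshold_;
    int64_t reset_timeout_;
    int failures_ = 0;
    int64_t opened_at_ = 0;
    State state_ = State::HEALTHY;
};
```

This is why "wait for automatic reset" and "manual reset" are distinct mitigations: the former lets `mayRetry` flip the state when the timeout elapses, the latter bypasses the timer entirely.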
**Verify**

```
GPUSafeFail::getStatus().state == HEALTHY
DeviceDiscovery::HasGPU() == true
DEVICE_UNAVAILABLE alert resolved
```

**Follow-up**
- Root-cause the driver crash (kernel panic, power event, ECC error).
- Add this device-loss pattern to CI fault-injection harness.
## Runbook 3: Tenant Quota Exhaustion

**Detect**

- `TryAllocateGPU(size, tag, tenant_id)` returns `false` only for one tenant.
- `themis_gpu_alloc_total{result="fail_tenant_quota", tenant=...}` climbing.
- `ALLOC_FAIL_TENANT_QUOTA` events in the audit log.
**Diagnose**

```
GET /admin/gpu/tenants  → find tenant where allocated_bytes ≈ quota_bytes
themis_gpu_alloc_total{result="fail_tenant_quota"}
```
- `GPUMemoryManager::GetTenantStats(tenant_id)`: compare `allocated_bytes` vs `quota_bytes`.
- `GPUMemoryManager::GetActiveAllocations()`: filter by `tenant_id` to see which tags are consuming VRAM.
- `GPUMemoryManager::GetTenantHeadroom(tenant_id)`: how much is left.
**Mitigate**

- Temporary increase: `SetTenantQuota(tenant_id, new_higher_quota)`. Validate first with `GPUConfig::simulateAllocation`.
- Ask the tenant to free: if the tenant owns stale allocations, trigger their cleanup path so `DeallocateGPU(size, tenant_id)` is called.
- Reduce other tenants: if global VRAM is also tight, lower quotas for lower-priority tenants first.
**Verify**

```
GetTenantHeadroom(tenant_id) > 0
TryAllocateGPU(..., tenant_id) == true
```

**Follow-up**
- Review tenant growth trend; adjust default quota policy.
- Consider per-tenant audit log alerts at 90% quota utilisation.
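The headroom arithmetic this runbook leans on can be written out as a sketch, assuming headroom is simply quota minus allocated bytes (clamped at zero) and that a `simulateAllocation`-style pre-check answers "would this quota change leave room for a pending allocation". The `TenantStats` struct and both free functions are hypothetical stand-ins for the `GPUMemoryManager` / `GPUConfig` calls named above.

```cpp
#include <cstdint>

// Hypothetical per-tenant view; the real GetTenantStats reads live state.
struct TenantStats {
    uint64_t allocated_bytes;
    uint64_t quota_bytes;
};

// headroom = quota - allocated, clamped at zero (quota may have been
// lowered below current usage by a mitigation in another runbook).
uint64_t GetTenantHeadroom(const TenantStats& s) {
    return s.allocated_bytes >= s.quota_bytes
               ? 0
               : s.quota_bytes - s.allocated_bytes;
}

// Pre-check in the spirit of GPUConfig::simulateAllocation: does the
// proposed quota leave enough headroom for the pending allocation?
bool simulateAllocation(TenantStats s, uint64_t new_quota, uint64_t pending) {
    s.quota_bytes = new_quota;
    return GetTenantHeadroom(s) >= pending;
}
```

Running the pre-check before `SetTenantQuota` avoids a quota bump that still fails the very allocation it was meant to unblock.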
## Runbook 4: Kernel Validation Failure

**Detect**

- `GPUKernelValidator::validate()` returns `CHECKSUM_MISMATCH` or `UNKNOWN_KERNEL`.
- Kernel launches refused; workload fails or falls back to CPU.
- `themis_gpu_alloc_total{result="fail_global_limit"}` may not be involved; this is a security gate, not an OOM.
**Diagnose**

```
GPUKernelValidator::getStats()
unknown_kernel_count > 0     → unregistered kernel submitted
checksum_mismatch_count > 0  → tampered or stale kernel blob
```
- Identify which `kernel_id` failed via the validation result or audit log.
- If `CHECKSUM_MISMATCH`: the blob on disk differs from the registered checksum. Possible causes: deployment race (new binary, old checksum), accidental overwrite, or active tampering.
- If `UNKNOWN_KERNEL`: a new kernel was deployed without registering it via `GPUKernelValidator::registerKernel()`.
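Both failure modes can be reproduced with a toy validator. FNV-1a serves here as a stand-in checksum (the real validator's digest algorithm is not specified in this runbook); only the `registerKernel` / `validate` names and the result values come from the text.

```cpp
#include <cstdint>
#include <map>
#include <string>

enum class ValidationResult { OK, UNKNOWN_KERNEL, CHECKSUM_MISMATCH };

// FNV-1a 64-bit hash as a stand-in checksum; the production validator
// would use a cryptographic digest (or HMAC, per the follow-up item).
uint64_t fnv1a(const std::string& blob) {
    uint64_t h = 1469598103934665603ull;
    for (unsigned char c : blob) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

class KernelValidator {
public:
    // Register the canonical blob's checksum at deployment time.
    void registerKernel(const std::string& id, const std::string& blob) {
        checksums_[id] = fnv1a(blob);
    }

    // UNKNOWN_KERNEL: never registered. CHECKSUM_MISMATCH: blob on
    // disk no longer matches the registered checksum.
    ValidationResult validate(const std::string& id,
                              const std::string& blob) const {
        auto it = checksums_.find(id);
        if (it == checksums_.end()) return ValidationResult::UNKNOWN_KERNEL;
        if (it->second != fnv1a(blob)) return ValidationResult::CHECKSUM_MISMATCH;
        return ValidationResult::OK;
    }

private:
    std::map<std::string, uint64_t> checksums_;
};
```

The sketch makes the deployment-race cause concrete: shipping a new blob without re-running `registerKernel` leaves the old checksum in the map, and every launch of the new blob reports `CHECKSUM_MISMATCH`.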
**Mitigate**

- `UNKNOWN_KERNEL` (legitimate deployment): register the new kernel:

  ```cpp
  // Load canonical blob from trusted build artifact.
  GPUKernelValidator::GetInstance().registerKernel(
      "kernel_id", canonical_blob);
  ```

- `CHECKSUM_MISMATCH` (deployment race): re-register with the new blob after verifying its provenance.
- `CHECKSUM_MISMATCH` (suspected tampering): do not re-register. Quarantine the host, rotate secrets, and escalate to the security team.
**Verify**

```
GPUKernelValidator::isValid(kernel_id, blob) == true
No further checksum_mismatch_count increments
```
**Follow-up**

- Add a pre-deployment CI/CD step that auto-registers all kernel blobs and their checksums.
- Integrate HMAC/signature verification (see `FUTURE_ENHANCEMENTS.md`).
## Runbook 5: High CPU Fallback Rate

**Detect**

- `FALLBACK_RATE_HIGH` alert firing.
- p99 latency above SLO; CPU capacity spike.
- `themis_gpu_fallback_total` growing rapidly.
**Diagnose**

```
themis_gpu_fallback_total / themis_gpu_alloc_total > 0.20   (alert threshold)
p99 query latency metric
```
- Check whether the `GPUSafeFail` circuit is open (`CIRCUIT_OPEN` alert). If it is, see Runbook 2 (Device Unavailable).
- If the circuit is closed but the fallback rate is high, VRAM exhaustion is likely; see Runbook 1 (OOM).
- Check `GPUSafeFail::getStatus().error_rate` for sustained failures below the circuit-breaker threshold.
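The alert condition above reduces to one ratio test. As a sketch: the `AlertConfig` struct below is a hypothetical stand-in for `GPUAlerts::Config`; only the field name `fallback_rate_threshold` and the 0.20 default come from this runbook.

```cpp
#include <cstdint>

// Stand-in for GPUAlerts::Config; only fallback_rate_threshold is
// taken from the runbook, the struct itself is illustrative.
struct AlertConfig {
    double fallback_rate_threshold = 0.20;
};

// Fire FALLBACK_RATE_HIGH when the share of allocations that fell
// back to CPU exceeds the configured threshold.
bool fallbackRateHigh(uint64_t fallback_total, uint64_t alloc_total,
                      const AlertConfig& cfg) {
    if (alloc_total == 0) return false;  // no traffic, nothing to alert on
    return static_cast<double>(fallback_total) / alloc_total >
           cfg.fallback_rate_threshold;
}
```

Raising `fallback_rate_threshold` during scheduled maintenance, as the mitigation below suggests, suppresses the alert without touching the counters themselves.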
**Mitigate**

- Address the root cause (OOM or device issue) per the relevant runbook.
- If CPU fallback is intentional (scheduled maintenance), suppress the alert by temporarily raising `fallback_rate_threshold` in `GPUAlerts::Config`.
- Scale out CPU capacity to absorb fallback load while the GPU is unavailable.
**Verify**

```
themis_gpu_fallback_total stops growing
FALLBACK_RATE_HIGH alert resolves
p99 latency returns to SLO
```
**Follow-up**

- Review the "CPU fallback performance budgets" item in `gpu_roadmap.md`.
- Consider pre-warming the CPU vector index to reduce cold-start latency when fallback is triggered.
## Runbook 6: Geo Batch Backend (GpuBatchBackend)

The `GpuBatchBackend` (`src/geo/gpu_backend_stub.cpp`) handles spatial intersection queries for the geo module. In CI and on machines without a GPU it runs entirely on CPU via the circuit-breaker fallback path.
**Detect**

- Geo intersection queries return all-zero results or empty masks.
- `themis_gpu_fallback_total{reason="batch_cpu_fallback"}` or `{reason="device_unavailable"}` counters growing.
- `GpuBatchBackend::getStats().batch_avg_latency_us` unusually high.
- Geo index queries are slower than expected (the backend is falling back).
**Diagnose**

```
GET /admin/gpu/stats
# Look for: circuit_open == true, exact_errors > 0, batch_fallbacks climbing

# Or via GPUMetrics snapshot:
themis_gpu_fallback_total{reason="batch_cpu_fallback"}
themis_gpu_fallback_total{reason="device_unavailable"}
```
Check the audit log for FALLBACK_TO_CPU and DEVICE_UNAVAILABLE events
tagged "geo_backend_init" or "batchIntersects: cpu fallback".
- Call `getGpuSpatialBackend()->isAvailable()`. If `false`:
  - Check `DeviceDiscovery::HasGPU()`: no GPU present means CPU fallback is expected and correct.
  - Check `GPUSafeFail::getStatus().state`: if `CIRCUIT_OPEN`, the device has experienced repeated failures.
- Inspect `GpuBatchBackend::getStats()` for non-zero `exact_errors`.
- Check `SpatialBatchInputs` population: if `geoms_a`/`geoms_b` are empty and `count > 0`, the caller is not populating geometry pairs, so all mask entries will be 0. This is a caller bug, not a backend bug.
- If `batch_avg_latency_us` is high: large polygons or many pairs per call increase the CPU fallback cost. Review the caller's batch size against the configured `gpu_batch_threshold` (default 64).
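The all-zero-mask symptom follows directly from how a CPU fallback loop would size its output. Here is a sketch with geometry reduced to 1-D intervals; `Interval` and `BatchInputs` are simplified stand-ins for the real geometry types and `SpatialBatchInputs`, and the loop is illustrative rather than the actual `batchIntersects` implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Geometry reduced to 1-D intervals for illustration.
struct Interval { double lo, hi; };

// Simplified stand-in for SpatialBatchInputs.
struct BatchInputs {
    size_t count = 0;
    std::vector<Interval> geoms_a;
    std::vector<Interval> geoms_b;
};

// CPU-fallback shape: the mask is sized from `count`, so if the
// caller leaves geoms_a/geoms_b empty the loop never runs and every
// entry stays 0, which is the caller bug described above.
std::vector<unsigned> batchIntersects(const BatchInputs& in) {
    std::vector<unsigned> mask(in.count, 0);
    size_t pairs = std::min(in.geoms_a.size(), in.geoms_b.size());
    for (size_t i = 0; i < std::min(in.count, pairs); ++i) {
        const Interval& a = in.geoms_a[i];
        const Interval& b = in.geoms_b[i];
        mask[i] = (a.lo <= b.hi && b.lo <= a.hi) ? 1u : 0u;
    }
    return mask;
}
```

This is why an empty-geometry batch is indistinguishable from "no intersections" at the output: diagnosing it requires looking at the inputs, not the mask.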
**Mitigate**

- No GPU / CPU-only environment (expected): CPU fallback is intentional. No action required; the latency SLO must account for CPU cost.
- Circuit breaker open: follow Runbook 2 (Device Unavailable) to reset the circuit after hardware recovery.
- Caller not populating geometry vectors: fix the call site to set `SpatialBatchInputs::geoms_a` and `geoms_b` alongside `count`.
- High latency on large batches: reduce `count` per call, or split into smaller sub-batches to stay within `fallback_budget_ms`.
**Verify**

```cpp
auto* backend = getGpuSpatialBackend();
backend->isAvailable();              // true when GPU present and healthy
backend->exactIntersects(pt, poly);  // returns correct result

SpatialBatchInputs in;
in.count = 1;
in.geoms_a = {pt};
in.geoms_b = {poly};
auto r = backend->batchIntersects(in);
assert(r.mask[0] == 1u);  // hit confirmed
```

**Follow-up**

- Once CUDA/ROCm kernels are available for spatial ops, replace the CPU compute loop in `batchIntersects` with a real kernel launch (see `FUTURE_ENHANCEMENTS` §GPU_SPATIAL_KERNELS).
- Add a per-call latency histogram to `GPUMetrics` when a latency-histogram API is introduced.