HTTP_API_SPECIFICATION
ThemisDB provides a comprehensive RESTful HTTP API for LLM operations, enabling inference, model management, LoRA operations, and statistics retrieval.
Base URL: /api/v1/llm
Authentication: Bearer token or API key (configured in llm_config.yaml)
Content-Type: application/json
POST /api/v1/llm/inference
Execute LLM inference with a prompt.
Request Body:
{
"prompt": "What is ThemisDB?",
"model": "mistral-7b",
"lora_adapter": "general-qa",
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"stop_sequences": ["\n\n", "END"],
"stream": false
}
Response:
{
"text": "ThemisDB is a distributed graph database...",
"tokens_generated": 45,
"inference_time_ms": 150,
"model_used": "mistral-7b",
"lora_used": "general-qa",
"cache_hit": false,
"finish_reason": "stop"
}
Status Codes:
- 200 OK: Successful inference
- 400 Bad Request: Invalid parameters
- 404 Not Found: Model or LoRA not found
- 429 Too Many Requests: Queue full (backpressure)
- 500 Internal Server Error: Inference failure
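For reference, the request above can be assembled and sent from a small Python client. `build_inference_request` and `post_inference` are hypothetical helper names, and the base URL, token, and temperature bounds are assumptions; only the payload fields come from this spec:

```python
import json
import urllib.request

def build_inference_request(prompt, model, max_tokens=512,
                            temperature=0.7, stream=False, **extra):
    """Assemble a body for POST /api/v1/llm/inference.

    Optional fields (lora_adapter, top_p, top_k, stop_sequences, ...) are
    passed through `extra` so the payload stays minimal when unused.
    """
    # The [0.0, 2.0] range is an assumption; the spec does not state bounds.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be within [0.0, 2.0]")
    body = {"prompt": prompt, "model": model, "max_tokens": max_tokens,
            "temperature": temperature, "stream": stream}
    body.update(extra)
    return body

def post_inference(base_url, token, body, timeout=30):
    """POST with Bearer auth and a 30 s timeout (per the best-practices note)."""
    req = urllib.request.Request(
        base_url + "/api/v1/llm/inference",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + token},
        method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

body = build_inference_request("What is ThemisDB?", "mistral-7b",
                               lora_adapter="general-qa")
# post_inference("http://localhost:8080", "<token>", body) returns the JSON
# response shown above once a ThemisDB server is reachable.
```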
POST /api/v1/llm/rag
Execute RAG (Retrieval-Augmented Generation) inference with vector search.
Request Body:
{
"query": "What are the main provisions in contract clause 3.4?",
"collection": "legal_documents",
"top_k": 5,
"similarity_threshold": 0.8,
"model": "mistral-7b",
"lora_adapter": "legal-qa",
"max_tokens": 512,
"temperature": 0.7,
"context_assembly": "concat"
}
Response:
{
"text": "Contract clause 3.4 contains the following provisions...",
"tokens_generated": 87,
"inference_time_ms": 210,
"documents_retrieved": 5,
"documents_used": 3,
"retrieval_time_ms": 45,
"model_used": "mistral-7b",
"lora_used": "legal-qa",
"cache_hit": false,
"finish_reason": "stop"
}
POST /api/v1/llm/inference (with stream: true)
Stream tokens as they are generated.
Request Body:
{
"prompt": "Write a story about...",
"model": "mistral-7b",
"stream": true,
"max_tokens": 1024
}
Response (Server-Sent Events):
data: {"token": "Once", "index": 0}
data: {"token": " upon", "index": 1}
data: {"token": " a", "index": 2}
data: {"token": " time", "index": 3}
...
data: {"done": true, "tokens_generated": 245, "inference_time_ms": 2150}
Headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
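Consuming the event stream amounts to parsing `data:` lines until the `done` event arrives. A minimal parser sketch; the `(tokens, summary)` return shape is illustrative, not part of the API:

```python
import json

def parse_sse_stream(lines):
    """Collect tokens from the streaming endpoint's SSE `data:` lines.

    Returns (tokens, summary): the token strings in order, plus the final
    event carrying "done": true (or None if the stream was cut short).
    """
    tokens, summary = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comment lines
        event = json.loads(line[len("data:"):].strip())
        if event.get("done"):
            summary = event
            break
        tokens.append(event["token"])
    return tokens, summary

stream = [
    'data: {"token": "Once", "index": 0}',
    'data: {"token": " upon", "index": 1}',
    'data: {"done": true, "tokens_generated": 2, "inference_time_ms": 12}',
]
tokens, summary = parse_sse_stream(stream)
text = "".join(tokens)  # reassembles the generated text: "Once upon"
```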
POST /api/v1/llm/embed
Generate embeddings for text.
Request Body:
{
"text": "Sample text for embedding generation",
"model": "mistral-7b",
"normalize": true
}
Response:
{
"embedding": [0.123, -0.456, 0.789, ...],
"dimension": 4096,
"model_used": "mistral-7b",
"inference_time_ms": 25
}
GET /api/v1/llm/models
List all available models.
Response:
{
"models": [
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"status": "loaded",
"size_bytes": 6400000000,
"format": "GGUF",
"n_layers": 32,
"loaded_timestamp": "2024-01-15T10:30:00Z",
"last_used": "2024-01-15T12:45:30Z",
"usage_count": 1247
},
{
"model_id": "llama-3-8b",
"status": "available",
"size_bytes": 8500000000,
"format": "GGUF"
}
]
}
POST /api/v1/llm/models/load
Load a model into memory.
Request Body:
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"options": {
"n_gpu_layers": 32,
"n_ctx": 4096,
"n_batch": 512,
"n_threads": 8,
"use_mmap": true,
"use_mlock": false
},
"pin": false
}
Response:
{
"model_id": "mistral-7b",
"status": "loaded",
"load_time_ms": 2850,
"memory_used_mb": 6200
}
POST /api/v1/llm/models/unload
Unload a model from memory.
Request Body:
{
"model_id": "mistral-7b"
}
Response:
{
"model_id": "mistral-7b",
"status": "unloaded",
"memory_freed_mb": 6200
}
GET /api/v1/llm/models/{model_id}
Get detailed information about a specific model.
Response:
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"status": "loaded",
"size_bytes": 6400000000,
"format": "GGUF",
"version": "v0.3",
"architecture": "llama",
"n_layers": 32,
"n_heads": 32,
"n_embd": 4096,
"n_vocab": 32000,
"context_length": 8192,
"loaded_timestamp": "2024-01-15T10:30:00Z",
"last_used": "2024-01-15T12:45:30Z",
"usage_count": 1247,
"memory_usage_mb": 6200,
"gpu_layers": 32,
"pinned": false
}
POST /api/v1/llm/models/ingest
Upload and ingest a model into ThemisDB blob storage.
Request (multipart/form-data):
POST /api/v1/llm/models/ingest
Content-Type: multipart/form-data
--boundary
Content-Disposition: form-data; name="model_id"
llama-3-8b
--boundary
Content-Disposition: form-data; name="file"; filename="llama-3-8b.gguf"
Content-Type: application/octet-stream
[binary data]
--boundary
Content-Disposition: form-data; name="metadata"
Content-Type: application/json
{
"version": "v1.0",
"description": "Llama 3 8B quantized Q4",
"shard_affinity": "legal",
"replicate": true
}
--boundary--
Response:
{
"model_id": "llama-3-8b",
"version": "v1.0",
"urn": "urn:themis:model:llama-3-8b:v1",
"size_bytes": 8500000000,
"checksum": "sha256:abc123...",
"upload_time_ms": 45000,
"replication_status": "pending",
"shards_replicated": 0,
"total_shards": 4
}
Note: For large models, use chunked upload with Content-Range headers.
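The client-side arithmetic for such a chunked upload can be sketched as follows. The 64 MiB chunk size and the server's handling of partial uploads are assumptions; only the inclusive `bytes start-end/total` header format is standard HTTP:

```python
def content_range_headers(total_size, chunk_size):
    """Yield (offset, length, header) for each chunk of an upload.

    Content-Range uses inclusive byte ranges: "bytes start-end/total",
    so a chunk of `length` bytes at `offset` ends at offset + length - 1.
    """
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        header = f"bytes {offset}-{offset + length - 1}/{total_size}"
        yield offset, length, header
        offset += length

# The 8.5 GB model from the example above, in 64 MiB chunks:
chunks = list(content_range_headers(8_500_000_000, 64 * 1024 * 1024))
```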
GET /api/v1/llm/loras
List all available LoRA adapters.
Query Parameters:
- model: Filter by base model
- status: Filter by status (loaded, available)
Response:
{
"loras": [
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"status": "loaded",
"size_bytes": 20971520,
"rank": 8,
"alpha": 16,
"loaded_timestamp": "2024-01-15T11:00:00Z",
"usage_count": 523
},
{
"lora_id": "medical-qa",
"base_model": "mistral-7b",
"status": "available",
"size_bytes": 20971520
}
]
}
POST /api/v1/llm/loras/load
Load a LoRA adapter.
Request Body:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"scale": 1.0
}
Response:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"status": "loaded",
"load_time_ms": 150,
"slot": 3
}
POST /api/v1/llm/loras/unload
Unload a LoRA adapter.
Request Body:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b"
}
Response:
{
"lora_id": "legal-qa",
"status": "unloaded",
"slot_freed": 3
}
GET /api/v1/llm/loras/{lora_id}
Get detailed information about a specific LoRA.
Query Parameters:
- base_model: Base model ID
Response:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"status": "loaded",
"size_bytes": 20971520,
"rank": 8,
"alpha": 16,
"target_modules": ["q_proj", "v_proj"],
"loaded_timestamp": "2024-01-15T11:00:00Z",
"last_used": "2024-01-15T12:40:15Z",
"usage_count": 523,
"slot": 3
}
GET /api/v1/llm/stats
Get comprehensive LLM system statistics.
Response:
{
"uptime_seconds": 86400,
"total_requests": 15234,
"successful_requests": 15102,
"failed_requests": 132,
"active_requests": 8,
"queued_requests": 23,
"throughput": {
"requests_per_second": 128.5,
"tokens_per_second": 3456.2
},
"latency": {
"p50_ms": 24,
"p95_ms": 65,
"p99_ms": 180,
"avg_ms": 28
},
"models": {
"loaded": 2,
"total_available": 5,
"memory_used_mb": 12400
},
"loras": {
"loaded": 8,
"total_available": 24,
"memory_used_mb": 160
},
"workers": {
"total": 4,
"busy": 3,
"idle": 1,
"utilization": 0.75
},
"gpu": {
"utilization": 0.89,
"memory_used_mb": 18456,
"memory_total_mb": 24576
}
}
GET /api/v1/llm/cache/stats
Get cache performance statistics.
Response:
{
"response_cache": {
"hits": 12456,
"misses": 2778,
"hit_rate": 0.818,
"total_entries": 5432,
"memory_used_mb": 890,
"avg_lookup_time_ms": 1.8
},
"prefix_cache": {
"hits": 8934,
"misses": 4823,
"hit_rate": 0.649,
"total_entries": 2145,
"memory_used_mb": 125,
"avg_tokens_saved": 45.3
},
"model_metadata_cache": {
"hits": 45678,
"misses": 123,
"hit_rate": 0.997,
"total_entries": 5
},
"lora_metadata_cache": {
"hits": 23456,
"misses": 245,
"hit_rate": 0.990,
"total_entries": 24
},
"kv_cache_buffer_pool": {
"total_buffers": 8,
"active_buffers": 4,
"buffer_reuse_count": 12456
}
}
GET /api/v1/llm/workers
Get per-worker statistics.
Response:
{
"workers": [
{
"worker_id": 0,
"status": "busy",
"current_request_id": "req_abc123",
"requests_processed": 3821,
"total_processing_time_ms": 456782,
"avg_processing_time_ms": 119.5,
"utilization": 0.92
},
{
"worker_id": 1,
"status": "idle",
"requests_processed": 3756,
"total_processing_time_ms": 441234,
"avg_processing_time_ms": 117.5,
"utilization": 0.88
}
]
}
GET /api/v1/llm/health
Check LLM service health.
Response:
{
"status": "healthy",
"timestamp": "2024-01-15T12:45:30Z",
"checks": {
"models_loaded": true,
"workers_active": true,
"gpu_available": true,
"queue_ok": true
}
}
Status Codes:
- 200 OK: Healthy
- 503 Service Unavailable: Unhealthy
POST /api/v1/llm/cache/clear
Clear caches (response, prefix, or all).
Request Body:
{
"cache_type": "response"
}
Cache Types:
- response: Response cache only
- prefix: Prefix cache only
- all: All caches
Response:
{
"cleared": "response",
"entries_removed": 5432,
"memory_freed_mb": 890
}
All error responses follow this format:
{
"error": {
"code": "MODEL_NOT_FOUND",
"message": "Model 'invalid-model' not found",
"details": {
"available_models": ["mistral-7b", "llama-3-8b"]
}
}
}
Error Codes:
- INVALID_REQUEST: Malformed request
- MODEL_NOT_FOUND: Requested model not found
- LORA_NOT_FOUND: Requested LoRA not found
- MODEL_LOAD_FAILED: Failed to load model
- INFERENCE_FAILED: Inference error
- QUEUE_FULL: Request queue full
- INSUFFICIENT_MEMORY: Not enough memory
- INVALID_PARAMETERS: Invalid inference parameters
API endpoints are rate-limited per API key.
Headers:
- X-RateLimit-Limit: Maximum requests per minute
- X-RateLimit-Remaining: Remaining requests in the current window
- X-RateLimit-Reset: Timestamp when the limit resets
Response (429 Too Many Requests):
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit exceeded. Retry after 42 seconds.",
"retry_after_seconds": 42
}
}
All API requests require Bearer Token authentication.
Header:
Authorization: Bearer <token>
Example:
curl -X POST http://localhost:8080/api/v1/llm/inference \
-H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." \
-H "Content-Type: application/json" \
-d '{"prompt": "What is ThemisDB?", "model": "mistral-7b"}'
Token Format: JWT (JSON Web Token)
Token Acquisition: Obtain from ThemisDB authentication endpoint:
curl -X POST http://localhost:8080/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"username": "user", "password": "pass"}' \
| jq -r '.token'
Token Expiration: Configurable (default: 24 hours)
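Since tokens expire, a client can decode the (unverified) JWT payload locally to decide whether to re-authenticate before issuing a request. A sketch, assuming a standard `exp` claim; the server still performs the real signature validation:

```python
import base64
import json
import time

def jwt_expired(token, leeway_seconds=30):
    """Return True if the JWT's `exp` claim is past (with some leeway).

    This only decodes the payload, without verifying the signature; it is a
    local pre-check to avoid a guaranteed 401, not a security measure.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    exp = claims.get("exp")
    if exp is None:
        return False  # no expiry claim; defer to the server
    return time.time() > exp - leeway_seconds
```

A client would call `jwt_expired(cached_token)` before each request and re-run the login flow above when it returns True.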
Unauthorized Response (401):
{
"error": {
"code": "UNAUTHORIZED",
"message": "Invalid or expired token"
}
}
API version is included in the URL: /api/v1/llm/*
Future versions will use /api/v2/llm/*, etc.
Simple Inference:
curl -X POST http://localhost:8080/api/v1/llm/inference \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"prompt": "What is ThemisDB?",
"model": "mistral-7b",
"max_tokens": 100
}'
RAG Query:
curl -X POST http://localhost:8080/api/v1/llm/rag \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"query": "Contract provisions in clause 3.4",
"collection": "legal_docs",
"top_k": 5,
"lora_adapter": "legal-qa"
}'
Model Upload:
curl -X POST http://localhost:8080/api/v1/llm/models/ingest \
-H "Authorization: Bearer <token>" \
-F "model_id=llama-3-8b" \
-F "file=@/path/to/llama-3-8b.gguf" \
-F 'metadata={"version":"v1.0","replicate":true}'
Best Practices:
- Use streaming for long responses to improve perceived latency
- Leverage caching by structuring similar prompts consistently
- Pre-load frequently used models and LoRAs to avoid cold starts
- Monitor cache hit rates and adjust similarity thresholds
- Use batch inference via multiple concurrent requests for throughput
- Set appropriate timeouts (recommend 30s for inference, 5min for model loading)
- Handle 429 errors with exponential backoff
- Use RAG endpoint instead of manual vector search + inference
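The backoff recommendation can be sketched as a small retry wrapper. The `RateLimited` exception and its `retry_after` field are illustrative stand-ins for however your client surfaces a 429 and its `retry_after_seconds` value:

```python
import random
import time

class RateLimited(Exception):
    """Illustrative stand-in for a 429 error (not part of the API above)."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call` on 429, honoring retry_after when the server sends one.

    Without a server hint, the delay doubles each attempt (capped), with a
    little jitter so concurrent clients do not retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited as err:
            delay = err.retry_after or min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, 0.1 * delay))
    return call()  # final attempt: let any error propagate to the caller
```

Usage: `with_backoff(lambda: post_inference(...))`, where the inner call raises `RateLimited` on a 429 response.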
Performance Characteristics:
- Response caching: 75x speedup for cache hits
- Prefix caching: 65% hit rate, ~45 tokens saved per hit
- Concurrent requests: 128 req/s with 4 workers
- Model loading: ~3s cold start, ~0ms cached
- LoRA switching: ~5ms per switch
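These figures compose into an expected per-request latency. A back-of-the-envelope sketch, using the 150 ms uncached time from the inference example and the 0.818 response-cache hit rate (real averages also include queueing and model-switch overhead):

```python
def effective_latency_ms(hit_rate, hit_ms, miss_ms):
    """Expected per-request latency for a given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

miss_ms = 150.0           # uncached inference time from the example response
hit_ms = miss_ms / 75.0   # "75x speedup for cache hits" -> 2 ms
avg = effective_latency_ms(0.818, hit_ms, miss_ms)  # ~28.9 ms
```

This lines up with the `avg_ms: 28` reported by the stats endpoint, which is why raising the hit rate matters more than shaving miss-path latency.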
ThemisDB v1.3.4
Last synced: January 02, 2026 | Commit: 6add659
Version: 1.3.0 | As of: December 2025
Full documentation: https://makr-code.github.io/ThemisDB/