HTTP_API_SPECIFICATION
ThemisDB provides a comprehensive RESTful HTTP API for LLM operations, enabling inference, model management, LoRA operations, and statistics retrieval.
Base URL: /api/v1/llm
Authentication: Bearer token or API key (configured in llm_config.yaml)
Content-Type: application/json
POST /api/v1/llm/inference
Execute LLM inference with a prompt.
Request Body:
{
"prompt": "What is ThemisDB?",
"model": "mistral-7b",
"lora_adapter": "general-qa",
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"stop_sequences": ["\n\n", "END"],
"stream": false
}
Response:
{
"text": "ThemisDB is a distributed graph database...",
"tokens_generated": 45,
"inference_time_ms": 150,
"model_used": "mistral-7b",
"lora_used": "general-qa",
"cache_hit": false,
"finish_reason": "stop"
}
Status Codes:
- 200 OK: Successful inference
- 400 Bad Request: Invalid parameters
- 404 Not Found: Model or LoRA not found
- 429 Too Many Requests: Queue full (backpressure)
- 500 Internal Server Error: Inference failure
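For reference, the request above can be assembled and sent from a small Python client. `build_inference_request` and `post_inference` are hypothetical helper names, and the base URL, token, and temperature bounds are assumptions; only the payload fields come from this spec:

```python
import json
import urllib.request

def build_inference_request(prompt, model, max_tokens=512,
                            temperature=0.7, stream=False, **extra):
    """Assemble a body for POST /api/v1/llm/inference.

    Optional fields (lora_adapter, top_p, top_k, stop_sequences, ...) are
    passed through `extra` so the payload stays minimal when unused.
    """
    # The [0.0, 2.0] range is an assumption; the spec does not state bounds.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be within [0.0, 2.0]")
    body = {"prompt": prompt, "model": model, "max_tokens": max_tokens,
            "temperature": temperature, "stream": stream}
    body.update(extra)
    return body

def post_inference(base_url, token, body, timeout=30):
    """POST with Bearer auth and a 30 s timeout (per the best-practices note)."""
    req = urllib.request.Request(
        base_url + "/api/v1/llm/inference",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + token},
        method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

body = build_inference_request("What is ThemisDB?", "mistral-7b",
                               lora_adapter="general-qa")
# post_inference("http://localhost:8080", "<token>", body) returns the JSON
# response shown above once a ThemisDB server is reachable.
```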
POST /api/v1/llm/rag
Execute RAG (Retrieval-Augmented Generation) inference with vector search.
Request Body:
{
"query": "What are the main provisions in contract clause 3.4?",
"collection": "legal_documents",
"top_k": 5,
"similarity_threshold": 0.8,
"model": "mistral-7b",
"lora_adapter": "legal-qa",
"max_tokens": 512,
"temperature": 0.7,
"context_assembly": "concat"
}
Response:
{
"text": "Contract clause 3.4 contains the following provisions...",
"tokens_generated": 87,
"inference_time_ms": 210,
"documents_retrieved": 5,
"documents_used": 3,
"retrieval_time_ms": 45,
"model_used": "mistral-7b",
"lora_used": "legal-qa",
"cache_hit": false,
"finish_reason": "stop"
}
POST /api/v1/llm/inference (with stream: true)
Stream tokens as they are generated.
Request Body:
{
"prompt": "Write a story about...",
"model": "mistral-7b",
"stream": true,
"max_tokens": 1024
}
Response (Server-Sent Events):
data: {"token": "Once", "index": 0}
data: {"token": " upon", "index": 1}
data: {"token": " a", "index": 2}
data: {"token": " time", "index": 3}
...
data: {"done": true, "tokens_generated": 245, "inference_time_ms": 2150}
Headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
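Consuming the event stream amounts to parsing `data:` lines until the `done` event arrives. A minimal parser sketch; the `(tokens, summary)` return shape is illustrative, not part of the API:

```python
import json

def parse_sse_stream(lines):
    """Collect tokens from the streaming endpoint's SSE `data:` lines.

    Returns (tokens, summary): the token strings in order, plus the final
    event carrying "done": true (or None if the stream was cut short).
    """
    tokens, summary = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comment lines
        event = json.loads(line[len("data:"):].strip())
        if event.get("done"):
            summary = event
            break
        tokens.append(event["token"])
    return tokens, summary

stream = [
    'data: {"token": "Once", "index": 0}',
    'data: {"token": " upon", "index": 1}',
    'data: {"done": true, "tokens_generated": 2, "inference_time_ms": 12}',
]
tokens, summary = parse_sse_stream(stream)
text = "".join(tokens)  # reassembles the generated text: "Once upon"
```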
POST /api/v1/llm/embed
Generate embeddings for text.
Request Body:
{
"text": "Sample text for embedding generation",
"model": "mistral-7b",
"normalize": true
}
Response:
{
"embedding": [0.123, -0.456, 0.789, ...],
"dimension": 4096,
"model_used": "mistral-7b",
"inference_time_ms": 25
}
GET /api/v1/llm/models
List all available models.
Response:
{
"models": [
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"status": "loaded",
"size_bytes": 6400000000,
"format": "GGUF",
"n_layers": 32,
"loaded_timestamp": "2024-01-15T10:30:00Z",
"last_used": "2024-01-15T12:45:30Z",
"usage_count": 1247
},
{
"model_id": "llama-3-8b",
"status": "available",
"size_bytes": 8500000000,
"format": "GGUF"
}
]
}
POST /api/v1/llm/models/load
Load a model into memory.
Request Body:
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"options": {
"n_gpu_layers": 32,
"n_ctx": 4096,
"n_batch": 512,
"n_threads": 8,
"use_mmap": true,
"use_mlock": false
},
"pin": false
}
Response:
{
"model_id": "mistral-7b",
"status": "loaded",
"load_time_ms": 2850,
"memory_used_mb": 6200
}
POST /api/v1/llm/models/unload
Unload a model from memory.
Request Body:
{
"model_id": "mistral-7b"
}
Response:
{
"model_id": "mistral-7b",
"status": "unloaded",
"memory_freed_mb": 6200
}
GET /api/v1/llm/models/{model_id}
Get detailed information about a specific model.
Response:
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"status": "loaded",
"size_bytes": 6400000000,
"format": "GGUF",
"version": "v0.3",
"architecture": "llama",
"n_layers": 32,
"n_heads": 32,
"n_embd": 4096,
"n_vocab": 32000,
"context_length": 8192,
"loaded_timestamp": "2024-01-15T10:30:00Z",
"last_used": "2024-01-15T12:45:30Z",
"usage_count": 1247,
"memory_usage_mb": 6200,
"gpu_layers": 32,
"pinned": false
}
POST /api/v1/llm/models/ingest
Upload and ingest a model into ThemisDB blob storage.
Request (multipart/form-data):
POST /api/v1/llm/models/ingest
Content-Type: multipart/form-data
--boundary
Content-Disposition: form-data; name="model_id"
llama-3-8b
--boundary
Content-Disposition: form-data; name="file"; filename="llama-3-8b.gguf"
Content-Type: application/octet-stream
[binary data]
--boundary
Content-Disposition: form-data; name="metadata"
Content-Type: application/json
{
"version": "v1.0",
"description": "Llama 3 8B quantized Q4",
"shard_affinity": "legal",
"replicate": true
}
--boundary--
Response:
{
"model_id": "llama-3-8b",
"version": "v1.0",
"urn": "urn:themis:model:llama-3-8b:v1",
"size_bytes": 8500000000,
"checksum": "sha256:abc123...",
"upload_time_ms": 45000,
"replication_status": "pending",
"shards_replicated": 0,
"total_shards": 4
}
Note: For large models, use chunked upload with Content-Range headers.
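The client-side arithmetic for such a chunked upload can be sketched as follows. The 64 MiB chunk size and the server's handling of partial uploads are assumptions; only the inclusive `bytes start-end/total` header format is standard HTTP:

```python
def content_range_headers(total_size, chunk_size):
    """Yield (offset, length, header) for each chunk of an upload.

    Content-Range uses inclusive byte ranges: "bytes start-end/total",
    so a chunk of `length` bytes at `offset` ends at offset + length - 1.
    """
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        header = f"bytes {offset}-{offset + length - 1}/{total_size}"
        yield offset, length, header
        offset += length

# The 8.5 GB model from the example above, in 64 MiB chunks:
chunks = list(content_range_headers(8_500_000_000, 64 * 1024 * 1024))
```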
GET /api/v1/llm/loras
List all available LoRA adapters.
Query Parameters:
- model: Filter by base model
- status: Filter by status (loaded, available)
Response:
{
"loras": [
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"status": "loaded",
"size_bytes": 20971520,
"rank": 8,
"alpha": 16,
"loaded_timestamp": "2024-01-15T11:00:00Z",
"usage_count": 523
},
{
"lora_id": "medical-qa",
"base_model": "mistral-7b",
"status": "available",
"size_bytes": 20971520
}
]
}
POST /api/v1/llm/loras/load
Load a LoRA adapter.
Request Body:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"scale": 1.0
}
Response:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"status": "loaded",
"load_time_ms": 150,
"slot": 3
}
POST /api/v1/llm/loras/unload
Unload a LoRA adapter.
Request Body:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b"
}
Response:
{
"lora_id": "legal-qa",
"status": "unloaded",
"slot_freed": 3
}
GET /api/v1/llm/loras/{lora_id}
Get detailed information about a specific LoRA.
Query Parameters:
- base_model: Base model ID
Response:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"status": "loaded",
"size_bytes": 20971520,
"rank": 8,
"alpha": 16,
"target_modules": ["q_proj", "v_proj"],
"loaded_timestamp": "2024-01-15T11:00:00Z",
"last_used": "2024-01-15T12:40:15Z",
"usage_count": 523,
"slot": 3
}
GET /api/v1/llm/stats
Get comprehensive LLM system statistics.
Response:
{
"uptime_seconds": 86400,
"total_requests": 15234,
"successful_requests": 15102,
"failed_requests": 132,
"active_requests": 8,
"queued_requests": 23,
"throughput": {
"requests_per_second": 128.5,
"tokens_per_second": 3456.2
},
"latency": {
"p50_ms": 24,
"p95_ms": 65,
"p99_ms": 180,
"avg_ms": 28
},
"models": {
"loaded": 2,
"total_available": 5,
"memory_used_mb": 12400
},
"loras": {
"loaded": 8,
"total_available": 24,
"memory_used_mb": 160
},
"workers": {
"total": 4,
"busy": 3,
"idle": 1,
"utilization": 0.75
},
"gpu": {
"utilization": 0.89,
"memory_used_mb": 18456,
"memory_total_mb": 24576
}
}
GET /api/v1/llm/cache/stats
Get cache performance statistics.
Response:
{
"response_cache": {
"hits": 12456,
"misses": 2778,
"hit_rate": 0.818,
"total_entries": 5432,
"memory_used_mb": 890,
"avg_lookup_time_ms": 1.8
},
"prefix_cache": {
"hits": 8934,
"misses": 4823,
"hit_rate": 0.649,
"total_entries": 2145,
"memory_used_mb": 125,
"avg_tokens_saved": 45.3
},
"model_metadata_cache": {
"hits": 45678,
"misses": 123,
"hit_rate": 0.997,
"total_entries": 5
},
"lora_metadata_cache": {
"hits": 23456,
"misses": 245,
"hit_rate": 0.990,
"total_entries": 24
},
"kv_cache_buffer_pool": {
"total_buffers": 8,
"active_buffers": 4,
"buffer_reuse_count": 12456
}
}
GET /api/v1/llm/workers
Get per-worker statistics.
Response:
{
"workers": [
{
"worker_id": 0,
"status": "busy",
"current_request_id": "req_abc123",
"requests_processed": 3821,
"total_processing_time_ms": 456782,
"avg_processing_time_ms": 119.5,
"utilization": 0.92
},
{
"worker_id": 1,
"status": "idle",
"requests_processed": 3756,
"total_processing_time_ms": 441234,
"avg_processing_time_ms": 117.5,
"utilization": 0.88
}
]
}
GET /api/v1/llm/health
Check LLM service health.
Response:
{
"status": "healthy",
"timestamp": "2024-01-15T12:45:30Z",
"checks": {
"models_loaded": true,
"workers_active": true,
"gpu_available": true,
"queue_ok": true
}
}
Status Codes:
- 200 OK: Healthy
- 503 Service Unavailable: Unhealthy
POST /api/v1/llm/cache/clear
Clear caches (response, prefix, or all).
Request Body:
{
"cache_type": "response"
}
Cache Types:
- response: Response cache only
- prefix: Prefix cache only
- all: All caches
Response:
{
"cleared": "response",
"entries_removed": 5432,
"memory_freed_mb": 890
}
All error responses follow this format:
{
"error": {
"code": "MODEL_NOT_FOUND",
"message": "Model 'invalid-model' not found",
"details": {
"available_models": ["mistral-7b", "llama-3-8b"]
}
}
}
Error Codes:
- INVALID_REQUEST: Malformed request
- MODEL_NOT_FOUND: Requested model not found
- LORA_NOT_FOUND: Requested LoRA not found
- MODEL_LOAD_FAILED: Failed to load model
- INFERENCE_FAILED: Inference error
- QUEUE_FULL: Request queue full
- INSUFFICIENT_MEMORY: Not enough memory
- INVALID_PARAMETERS: Invalid inference parameters
API endpoints are rate-limited per API key.
Headers:
- X-RateLimit-Limit: Maximum requests per minute
- X-RateLimit-Remaining: Remaining requests in the current window
- X-RateLimit-Reset: Timestamp when the limit resets
Response (429 Too Many Requests):
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit exceeded. Retry after 42 seconds.",
"retry_after_seconds": 42
}
}
All API requests require Bearer Token authentication.
Header:
Authorization: Bearer <token>
Example:
curl -X POST http://localhost:8080/api/v1/llm/inference \
-H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." \
-H "Content-Type: application/json" \
-d '{"prompt": "What is ThemisDB?", "model": "mistral-7b"}'
Token Format: JWT (JSON Web Token)
Token Acquisition: Obtain from ThemisDB authentication endpoint:
curl -X POST http://localhost:8080/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"username": "user", "password": "pass"}' \
| jq -r '.token'
Token Expiration: Configurable (default: 24 hours)
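Since tokens expire, a client can decode the (unverified) JWT payload locally to decide whether to re-authenticate before issuing a request. A sketch, assuming a standard `exp` claim; the server still performs the real signature validation:

```python
import base64
import json
import time

def jwt_expired(token, leeway_seconds=30):
    """Return True if the JWT's `exp` claim is past (with some leeway).

    This only decodes the payload, without verifying the signature; it is a
    local pre-check to avoid a guaranteed 401, not a security measure.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    exp = claims.get("exp")
    if exp is None:
        return False  # no expiry claim; defer to the server
    return time.time() > exp - leeway_seconds
```

A client would call `jwt_expired(cached_token)` before each request and re-run the login flow above when it returns True.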
Unauthorized Response (401):
{
"error": {
"code": "UNAUTHORIZED",
"message": "Invalid or expired token"
}
}
API version is included in the URL: /api/v1/llm/*
Future versions will use /api/v2/llm/*, etc.
Simple Inference:
curl -X POST http://localhost:8080/api/v1/llm/inference \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"prompt": "What is ThemisDB?",
"model": "mistral-7b",
"max_tokens": 100
}'
RAG Query:
curl -X POST http://localhost:8080/api/v1/llm/rag \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"query": "Contract provisions in clause 3.4",
"collection": "legal_docs",
"top_k": 5,
"lora_adapter": "legal-qa"
}'
Model Upload:
curl -X POST http://localhost:8080/api/v1/llm/models/ingest \
-H "Authorization: Bearer <token>" \
-F "model_id=llama-3-8b" \
-F "file=@/path/to/llama-3-8b.gguf" \
-F 'metadata={"version":"v1.0","replicate":true}'
Best Practices:
- Use streaming for long responses to improve perceived latency
- Leverage caching by structuring similar prompts consistently
- Pre-load frequently used models and LoRAs to avoid cold starts
- Monitor cache hit rates and adjust similarity thresholds
- Use batch inference via multiple concurrent requests for throughput
- Set appropriate timeouts (recommend 30s for inference, 5min for model loading)
- Handle 429 errors with exponential backoff
- Use RAG endpoint instead of manual vector search + inference
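The backoff recommendation can be sketched as a small retry wrapper. The `RateLimited` exception and its `retry_after` field are illustrative stand-ins for however your client surfaces a 429 and its `retry_after_seconds` value:

```python
import random
import time

class RateLimited(Exception):
    """Illustrative stand-in for a 429 error (not part of the API above)."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call` on 429, honoring retry_after when the server sends one.

    Without a server hint, the delay doubles each attempt (capped), with a
    little jitter so concurrent clients do not retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited as err:
            delay = err.retry_after or min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, 0.1 * delay))
    return call()  # final attempt: let any error propagate to the caller
```

Usage: `with_backoff(lambda: post_inference(...))`, where the inner call raises `RateLimited` on a 429 response.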
Performance Characteristics:
- Response caching: 75x speedup for cache hits
- Prefix caching: 65% hit rate, ~45 tokens saved per hit
- Concurrent requests: 128 req/s with 4 workers
- Model loading: ~3s cold start, ~0ms cached
- LoRA switching: ~5ms per switch
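These figures compose into an expected per-request latency. A back-of-the-envelope sketch, using the 150 ms uncached time from the inference example and the 0.818 response-cache hit rate (real averages also include queueing and model-switch overhead):

```python
def effective_latency_ms(hit_rate, hit_ms, miss_ms):
    """Expected per-request latency for a given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

miss_ms = 150.0           # uncached inference time from the example response
hit_ms = miss_ms / 75.0   # "75x speedup for cache hits" -> 2 ms
avg = effective_latency_ms(0.818, hit_ms, miss_ms)  # ~28.9 ms
```

This lines up with the `avg_ms: 28` reported by the stats endpoint, which is why raising the hit rate matters more than shaving miss-path latency.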
ThemisDB v1.3.4
Last synced: January 02, 2026 | Commit: 6add659
Version: 1.3.0 | As of: December 2025
Full documentation: https://makr-code.github.io/ThemisDB/