Serverless Proxy - Universal LLM Gateway

A universal OpenAI-compatible API proxy that bridges standard API requests to multiple backend providers (RunPod, Ollama, OpenAI-compatible APIs, Together AI, etc.). Configure endpoints through a web admin UI and map virtual model names to actual backend models.

Overview

Client (OpenAI format) → Serverless Proxy (port 8002) → Configured Backends
  • Universal: Connect to any LLM backend (RunPod, Ollama, OpenAI, OAuth-based OpenAI-compatible providers, Together AI, etc.)
  • Virtual Models: Map user-facing model names to actual backend models
  • Admin UI: Configure endpoints and virtual models via web interface
  • Tool-Call Compatibility: Normalize misformatted model tool calls with DB-driven regex patterns
  • OpenAI-compatible: Works with any OpenAI client library

Quick Start

This guide walks you through getting the Serverless Proxy up and running in just a few minutes.

Prerequisites

Install Docker (with the Docker Compose plugin) if you don't already have it.

Step 1: Clone and Setup

# Clone the repository
git clone https://github.com/TyRoden/serverless_proxy.git
cd serverless_proxy

# Copy the example environment file
cp .env.example .env

Step 2: Configure Your Environment

Open the .env file in a text editor and check these settings:

# Required: Set AUTH_ENABLED to false for first-time setup (no auth service needed yet)
AUTH_ENABLED=false

# Optional: If using Ollama locally, it should work out of the box

Step 3: Start the Proxy

# Build and start the container
docker compose up -d --build

# Verify Uvicorn is serving the FastAPI app (required for API routes).
# The [u]vicorn bracket trick stops grep from matching its own process.
docker compose exec serverless-proxy sh -c "ps aux | grep '[u]vicorn'" \
  || echo "WARNING: Uvicorn not running. Ensure the serverless-proxy service runs 'uvicorn simple_bridge:app' in docker-compose.yml. See docs for details."

Step 4: Configure in the Admin UI

  1. Open your browser and go to: http://localhost:5001/proxy-dashboard
  2. You'll see the admin dashboard (no login needed since AUTH_ENABLED=false)

Add an Endpoint

  1. Click + Add Endpoint under Endpoints
  2. Fill in:
    • Name: Something like "My Ollama" or "RunPod Production"
    • URL: Your backend URL (e.g., http://localhost:11434 for local Ollama, or your RunPod endpoint URL)
    • API Key: Your API key if required (leave blank for local Ollama)
    • Type: Select the type (openwebui, openai, openai_oauth, ollama, runpod, anthropic, deepinfra, etc.)
    • If you choose OpenAI OAuth, OAuth fields are shown and prefilled with OpenAI defaults
    • Click Save

Add a Virtual Model

  1. Click + Add Virtual Model under Virtual Models
  2. Fill in:
    • Name: What you want to call it (e.g., gpt-4, llama-production)
    • Endpoint: Select the endpoint you just created
    • Actual Model: The actual model name on the backend (e.g., gpt-4o, llama3:70b)
    • Click Save

Step 5: Use the Proxy

Your AI tools can now connect to the proxy:

  • API Endpoint: http://localhost:8002
  • Admin UI: http://localhost:5001/proxy-dashboard

Example - Using with OpenWebUI or any OpenAI-compatible client:

Base URL: http://localhost:8002/v1
API Key: any-key-works (or your endpoint's key)
Model: the-virtual-model-name-you-created

Example - Test with curl:

curl http://localhost:8002/v1/models

curl -X POST http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-virtual-model-name",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
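The same call can be made from Python with only the standard library. build_chat_request and chat are illustrative helper names, and your-virtual-model-name is a placeholder for a virtual model you created:

```python
# Sketch: calling the proxy with only the Python standard library.
import json
import urllib.request

def build_chat_request(model, content):
    """Minimal OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": content}]}

def chat(base_url, model, content):
    """POST the request to the proxy and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, content)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the proxy running:
#   reply = chat("http://localhost:8002", "your-virtual-model-name", "Hello!")
#   print(reply["choices"][0]["message"]["content"])
```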

Troubleshooting

# Check if the proxy is running
curl http://localhost:8002/health

# View container logs
docker logs serverless-proxy

# Restart the container
docker restart serverless-proxy

How to Update Safely

If you already have endpoints and virtual models configured, use this upgrade flow to avoid breaking your setup.

1) Back up the database first (required)

# Adjust the paths to match your data volume
sqlite3 /mnt/ai/serverless-proxy/data/proxy.db \
  ".backup /mnt/ai/serverless-proxy/data/proxy-pre-upgrade-$(date +%Y%m%d-%H%M%S).db"

2) Stop the running container (recommended)

docker compose down

3) Pull the latest code

git pull

4) Rebuild and restart

docker compose up -d --build

5) Let migrations run automatically

On startup, the proxy runs additive schema migrations (new columns only). Existing endpoint rows are preserved.

6) Verify existing setup still works

# API health
curl http://localhost:8002/health

# List routed models
curl http://localhost:8002/v1/models

# Optional: test an existing virtual model
curl -X POST http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-existing-virtual-model",
    "messages": [{"role": "user", "content": "ping"}]
  }'

7) (Optional) Start using OAuth endpoints

Existing endpoints continue to work unchanged. To use OAuth, add a new endpoint with type openai_oauth and configure OAuth fields.

Rollback (if needed)

If something fails after upgrade:

  1. Stop the container
  2. Restore the DB backup
  3. Restart with your prior image/commit
# Stop current container
docker compose down

# Example DB restore
cp /mnt/ai/serverless-proxy/data/proxy-pre-upgrade-YYYYMMDD-HHMMSS.db \
   /mnt/ai/serverless-proxy/data/proxy.db
docker compose up -d

Configuration

Health, Failover, and Cache

The proxy supports optional endpoint health polling, per-virtual-model failover, and a non-streaming response cache.

For full operational details (runtime flow, strategy behavior, and every related setting), see:

  • docs/failover-cache-operations.md

Endpoint Health Polling

  • Health polling runs only when at least one virtual model has failover configured.
  • Polling is per endpoint and optional. Leave Health Check URL blank to disable active polling for that endpoint.
  • Polling interval is configured in Settings (health_check_interval).

Accepted healthy/unhealthy responses for health_check_url:

  • HTTP 2xx with no JSON body → healthy
  • HTTP 2xx with JSON {"healthy": true} or {"status": "ok"} → healthy
  • HTTP 2xx with JSON {"healthy": false} or {"status": "down"|"error"|"unhealthy"} → unhealthy
  • non-2xx response or network timeout/error → unhealthy/failure increment
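These acceptance rules can be sketched in Python (interpret_health is an illustrative name, not the proxy's actual function):

```python
# Sketch of the health-check interpretation rules above.
def interpret_health(status_code, body):
    """body is the parsed JSON dict, or None if the response had no JSON."""
    if not 200 <= status_code < 300:
        return False  # non-2xx (network errors map here too) -> unhealthy
    if body is None:
        return True   # 2xx with no JSON body -> healthy
    if body.get("healthy") is True or body.get("status") == "ok":
        return True
    if body.get("healthy") is False or body.get("status") in ("down", "error", "unhealthy"):
        return False
    return True       # assumption: 2xx with unrecognized JSON counts as healthy
```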

Example health endpoints:

# simple
https://api.example.com/health

# Kubernetes style
https://api.example.com/healthz

# custom app route
https://api.example.com/internal/status

Failover Configuration

Failover is disabled by default and only applies to virtual models with explicit failover settings.

  • backup: try primary, then target list in order
  • rotational: rotate through targets
  • duplicate: automatically try other enabled virtual models with the same actual_model

Retry behavior and circuit controls:

  • Failover retries only retryable upstream failures (429, 500, 502, 503, 504).
  • Endpoints with open circuits are skipped.
  • Circuit state is tracked per endpoint and updated by failure thresholds/cooldown.

Where failover and circuit settings come from:

  • Global defaults from Settings:
    • circuit_failure_threshold
    • circuit_failure_window
    • circuit_cooldown_seconds
  • Per-virtual-model overrides (optional) in the failover form:
    • max attempts
    • failure threshold
    • cooldown seconds
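A minimal sketch of the circuit bookkeeping these settings drive, assuming the semantics described above (illustrative only, not the proxy's implementation):

```python
# Illustrative circuit-breaker bookkeeping for circuit_failure_threshold,
# circuit_failure_window, and circuit_cooldown_seconds.
import time

class Circuit:
    def __init__(self, threshold=3, window=60.0, cooldown=30.0):
        self.threshold = threshold   # retryable failures before the circuit opens
        self.window = window         # seconds over which failures are counted
        self.cooldown = cooldown     # seconds the circuit stays open
        self.failures = []           # timestamps of recent retryable failures
        self.opened_at = None

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        # Keep only failures inside the sliding window, then add this one.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def is_open(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None and now - self.opened_at < self.cooldown:
            return True              # still cooling down: skip this endpoint
        self.opened_at = None        # cooldown elapsed: allow a retry
        return False
```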

Non-Streaming Cache

  • Cache is applied only to non-stream requests (chat + embeddings).
  • Per virtual model, Enable Non-Stream Cache controls cache participation (cache_enabled).
  • Cache is bypassed when request header includes Cache-Control: no-store.
  • Tool-call responses are not cached.
  • Error responses are not cached.
  • TTLs are configured in Settings:
    • cache_ttl_chat
    • cache_ttl_embeddings
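A sketch of the request-side participation rules above (cache_key and cacheable are illustrative names; a real header lookup should be case-insensitive):

```python
# Request-side cache participation per the rules above (sketch only).
# Response-side rules (tool-call responses and errors are never cached)
# would be applied after the upstream call returns.
import hashlib
import json

def cache_key(model, payload):
    """Deterministic key over the normalized request body."""
    blob = json.dumps({"model": model, "payload": payload}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def cacheable(payload, headers, cache_enabled):
    if not cache_enabled:            # per-virtual-model toggle (cache_enabled)
        return False
    if payload.get("stream"):        # only non-stream requests are cached
        return False
    if "no-store" in headers.get("Cache-Control", "").lower():
        return False                 # explicit client bypass
    return True
```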

Usage and Savings Metrics

Usage dashboard includes cache metrics:

  • cache attempts
  • cache hits
  • cache hit rate
  • estimated cache savings

Settings Reference (Health/Failover/Cache)

Settings page:

  • Chat Cache TTL (seconds): TTL for non-stream chat/completions cache entries
  • Embeddings Cache TTL (seconds): TTL for non-stream embeddings cache entries
  • Health Check Interval (seconds): poll interval for endpoint health URLs
  • Circuit Failure Threshold: retryable failure count before a circuit opens
  • Circuit Failure Window (seconds): time window for counting failures
  • Circuit Cooldown (seconds): how long a circuit stays open before retry

Endpoint modal:

  • Health Check URL (optional): custom health URL for endpoint polling

Virtual model modal:

  • Enable Non-Stream Cache: enables/disables cache for that model
  • Enable Failover: enables/disables failover for that model
  • Failover Strategy: backup, rotational, or duplicate
  • Failover Targets: target virtual models for backup/rotational
  • Max Attempts: optional per-model cap on failover tries
  • Failure Threshold: optional per-model override for circuit threshold
  • Cooldown Seconds: optional per-model override for circuit cooldown

Activity Visibility

When failover substitutes a route, Activity shows:

  • virtual model and routed model (virtual_model -> actual_model)
  • routed endpoint
  • failover note in the activity details row

Environment Variables

  • API_PORT: OpenAI-compatible API port (default: 8002)
  • FLASK_PORT: Admin UI port (default: 5001)
  • DATABASE_PATH: SQLite database path (default: /data/proxy.db)
  • TIMEOUT: request timeout in seconds (default: 300)
  • AUTH_ENABLED: enable admin authentication (default: true)
  • AIMENU_URL: auth service URL (default: http://localhost:5000)

Authentication

By default, the admin dashboard requires authentication. See docs/authentication.md for:

  • How to disable authentication for fresh installs
  • How to implement your own auth service
  • Full API specification for the /session/validate endpoint

Tool Pattern Matching (Patterns Tab)

The admin dashboard includes a Patterns tab for fixing model-specific tool call formats without editing code.

  • Add/update/delete regex-based extraction patterns
  • Control match priority (higher first)
  • Map tool names and parameter keys into schema-compatible names
  • Support malformed or non-standard XML/bracket/inline formats

See docs/tool_patterns.md for full details and examples.

Qwen 3.5 tool-call compatibility

Qwen 3.5 may emit XML-style tool calls instead of OpenAI JSON function calls, for example:

<tool_call>
<function=read>
<parameter=filePath>
/mnt/ai/ai-queue-master/app/config.py
</parameter>
</function>
</tool_call>

or:

<tool_call>
<function=bash>
<parameter=command>
ls -la /mnt/ai/ai-queue-master/app/
</parameter>
<parameter=description>
List app directory
</parameter>
</function>
</tool_call>

The proxy supports these via DB-backed tool_patterns records (not hardcoded), so compatibility can be adjusted from the Patterns UI/API without code edits.
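A minimal sketch of the kind of regex extraction a tool_patterns record drives, converting the XML form above into OpenAI-style tool calls (the regexes here are illustrative, not the shipped patterns):

```python
# Sketch: extract XML-style tool calls into OpenAI tool_call dicts.
import json
import re

FUNC_RE = re.compile(r"<function=(\w+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=(\w+)>\s*(.*?)\s*</parameter>", re.DOTALL)

def extract_tool_calls(text):
    """Convert <tool_call> XML blocks into OpenAI-style tool_call dicts."""
    calls = []
    for name, body in FUNC_RE.findall(text):
        args = {k: v for k, v in PARAM_RE.findall(body)}
        calls.append({
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    return calls
```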

Docker Ports

  • 8002: OpenAI-compatible API
  • 5001: Admin UI and Admin API

Admin Dashboard

Access the admin dashboard at /proxy-dashboard. Authentication is handled by the AI Menu System.

Features

  • Endpoint Management: Add, edit, delete backend endpoints
  • Virtual Model Mapping: Map virtual model names to actual backend models
  • Activity Tab: Recent request feed (route/model/IP/source/status/latency) with filters and auto-refresh
  • Patterns Tab: Manage tool-call translation patterns in the UI
  • Model Discovery: Fetch available models from endpoints
  • Enable/Disable: Toggle endpoints and virtual models

Endpoint Configuration

Configure backend endpoints with:

  • Name: Friendly identifier
  • URL: Base URL (e.g., http://localhost:11434, https://api.runpod.ai/v2/xxxx)
  • API Key: Authorization token (if required)
  • Type: openwebui, openai, openai_oauth, ollama, vllm, together, runpod, anthropic, deepinfra, queue
  • Priority: Higher priority endpoints are preferred
  • Enabled: Enable/disable endpoint

OAuth Endpoint Configuration (openai_oauth)

Use openai_oauth when your provider requires OAuth instead of a static API key.

Important reference guide:

  • See docs/openai-oauth-setup.md for the full current setup flow, web OAuth instructions, Codex auth.json import path, model fallback guidance, token estimation notes, and reverse-proxy/Caddy requirements.

When selected in the dashboard, the form auto-fills OpenAI-compatible defaults:

  • url: https://chatgpt.com
  • oauth_enabled: true
  • oauth_grant_type: refresh_token
  • oauth_token_url: https://auth.openai.com/oauth/token
  • oauth_token_request_format: json
  • oauth_client_auth_method: client_secret_post

You can override all fields for non-OpenAI providers.

OAuth helper buttons in endpoint form:

  • Start Web OAuth - Launches browser PKCE login (then paste redirect URL/code to complete)
  • Import from Codex auth.json - Imports OAuth fields from local Codex/ChatGPT auth cache

Supported OAuth grant types

  • refresh_token
  • client_credentials

Supported token request compatibility options

  • Request format: json, form (application/x-www-form-urlencoded)
  • Client auth method: client_secret_post, client_secret_basic

Authentication precedence

For an endpoint, the proxy resolves auth in this order:

  1. OAuth bearer token (if OAuth is enabled and fully configured)
  2. Static API key bearer token (fallback)
  3. No Authorization header
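A sketch of that precedence, assuming endpoint fields named oauth_enabled and api_key as described in this README:

```python
# Illustrative auth resolution; the proxy's real code may differ.
def auth_header(endpoint, oauth_access_token=None):
    """Return the Authorization header value for an endpoint, or None."""
    if endpoint.get("oauth_enabled") and oauth_access_token:
        return f"Bearer {oauth_access_token}"   # 1. OAuth bearer token
    if endpoint.get("api_key"):
        return f"Bearer {endpoint['api_key']}"  # 2. static API key fallback
    return None                                 # 3. no Authorization header
```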

Token lifecycle and persistence

  • Access tokens are cached in memory for runtime efficiency
  • Refresh token rotations returned by the provider are persisted to SQLite immediately
  • oauth_token_expires_at metadata is persisted
  • Behavior survives container restarts because durable OAuth state is stored in the DB

OpenAI OAuth routing behavior

By default, openai_oauth endpoints are routed to the Codex/ChatGPT-style endpoint and payload:

  • POST /backend-api/codex/responses
  • Converts incoming OpenAI Chat Completions payload to OAuth-compatible responses payload
  • Converts responses back to OpenAI-style chat completion output for clients

Model listing is best-effort for OAuth backends; some tokens/scopes may not expose /models routes. If model discovery fails, set the model manually in your virtual model mapping (for this setup, gpt-5.4 is confirmed to work).

Security and encryption-ready schema

The endpoints table provisions OAuth and encryption-ready columns for a future at-rest encryption rollout. For now, secrets are stored in plaintext (the same model as existing api_key handling).

Detailed implementation and migration runbook:

  • docs/oauth-encryption-secrets-storage.md
  • docs/openai-oauth-setup.md

Virtual Models

Map virtual model names to actual backend models:

  • Virtual Name: What clients will request (e.g., gpt-4, prod-llama)
  • Endpoint: Which backend to route to
  • Actual Model: The model name on the backend (e.g., gpt-4o, llama3:70b)
  • Show Reasoning: Toggle chain-of-thought display (for models like MiniMax that output thinking separately)
  • Cost per 1M Input Tokens ($): Price per 1M input tokens you send
  • Cost per 1M Output Tokens ($): Price per 1M output tokens you receive
  • Cost per 1M Cached Input Tokens ($): Discounted price per 1M cached input tokens (see provider pricing)
  • Cost per 1M Cached Output Tokens ($): Discounted price per 1M cached output tokens

Cached Token Pricing

The proxy supports tracking and pricing for cached tokens:

  • How it works: When you make repeated requests with similar prompts, providers cache the input tokens
  • Pricing: Cached tokens are billed at a significantly discounted rate (typically 10-90% cheaper)
  • Configuration: Enter your provider's cached token pricing in the virtual model settings
  • Tracking: The Usage page displays cached token counts and costs separately
  • Supported Providers: OpenAI, DeepInfra, and Anthropic APIs return cached token information

To configure:

  1. Look up your provider's pricing (e.g., DeepInfra pricing page shows "$0.26 / $0.13 cached")
  2. Enter the base price in "Cost per 1M Input Tokens"
  3. Enter the cached price in "Cost per 1M Cached Input Tokens"
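A worked example of the resulting cost arithmetic (the $0.70 output rate below is a placeholder, not a quoted provider price):

```python
# All rates are USD per 1M tokens; cached input tokens bill at the
# discounted cached rate instead of the base input rate.
def request_cost(input_tokens, output_tokens, cached_input_tokens,
                 in_rate, out_rate, cached_in_rate):
    uncached_input = input_tokens - cached_input_tokens
    return (uncached_input * in_rate
            + cached_input_tokens * cached_in_rate
            + output_tokens * out_rate) / 1_000_000

# 100k input tokens (40k of them cached) + 10k output tokens
# at $0.26 input / $0.70 output / $0.13 cached input:
cost = request_cost(100_000, 10_000, 40_000, 0.26, 0.70, 0.13)
# (60_000*0.26 + 40_000*0.13 + 10_000*0.70) / 1e6 = $0.0278
```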

Cost Tracking & Usage Monitoring

The proxy provides comprehensive cost tracking per model:

  • Per-model pricing: Configure input/output/cached token rates for each virtual model
  • Usage dashboard: View token counts, costs, and response times in the admin UI
  • Daily breakdown: Track usage patterns over time
  • Cost estimation: Automatic calculation based on configured rates

Configure pricing per virtual model:

  • Input tokens: Tokens sent in requests (prompt)
  • Output tokens: Tokens received in responses (completion)
  • Cached tokens: Discounted rate for cached input tokens (when providers support caching)

The Usage page shows:

  • Total requests and token counts
  • Input vs Output token breakdown
  • Cached token counts and costs
  • Average response times
  • Cost per model and daily trends

Activity Feed (Admin)

The admin dashboard includes an Activity tab for quick operational visibility.

  • Recent traffic table (newest first)
  • Default view: latest 100 rows, /health excluded
  • Filter by status, model, IP, and path
  • Auto-refresh every 10 seconds (toggleable)
  • Metadata-only storage (no prompts/tool args/response bodies)

API Endpoints

OpenAI-Compatible API (port 8002)

# List models
curl http://localhost:8002/v1/models

# Chat completions
curl -X POST http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-virtual-model", "messages": [{"role": "user", "content": "Hello!"}]}'

Supported Endpoints

  • GET /v1/models - List available models (virtual models + default)
  • POST /v1/chat/completions - Chat completions
  • POST /v1/completions - Text completions
  • POST /v1/embeddings - Embeddings
  • GET /health - Health check

Ollama Compatibility API (port 8002)

The proxy now supports both Ollama-native routes and OpenAI-compatible routes when a virtual model maps to an ollama endpoint.

Runtime inference routes (Phase 1)

  • GET /api/tags
  • GET /api/version
  • POST /api/chat
  • POST /api/generate
  • POST /api/embed
  • POST /api/embeddings (alias)

OpenAI client compatibility aliases

  • POST /chat/completions -> OpenAI handler alias
  • POST /api/chat/completions -> OpenAI handler alias
  • GET /models, GET /api/models, GET /api/v1/models -> model listing aliases
  • POST /embeddings, POST /api/v1/embeddings -> embeddings aliases

Full native surface passthrough (Phase 2)

Requests are forwarded to the configured Ollama endpoint (resolved from the model's backend first, falling back to the default enabled Ollama endpoint):

  • POST /api/show
  • GET|POST /api/ps
  • POST /api/pull
  • POST /api/push
  • POST /api/create
  • POST /api/copy
  • DELETE|POST /api/delete
  • HEAD /api/blobs/{digest}
  • POST /api/blobs/{digest}

Behavior details

  • Upstream selection for ollama virtual models:
    • tries Ollama OpenAI-compatible upstream POST /v1/chat/completions first
    • falls back to native POST /api/chat if upstream returns 404/405
  • Message normalization:
    • converts OpenAI block-style messages[].content arrays to Ollama-safe string content for native /api/chat
  • Streaming:
    • native Ollama routes stream as NDJSON (application/x-ndjson)
    • OpenAI routes stream as SSE (text/event-stream)
  • Embeddings:
    • for Ollama endpoints, proxy tries /api/embed then falls back to /api/embeddings
    • capability errors from upstream are returned as non-200 responses (for example, 501 if the model lacks embedding support)
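The block-style message normalization mentioned above can be sketched as follows (normalize_content is an illustrative name; the proxy's exact handling may differ):

```python
# OpenAI clients may send messages[].content as a list of typed blocks,
# while native /api/chat expects a plain string.
def normalize_content(content):
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Keep text blocks; drop non-text blocks (e.g. image_url).
        parts = [b.get("text", "") for b in content
                 if isinstance(b, dict) and b.get("type") == "text"]
        return "\n".join(p for p in parts if p)
    return str(content)
```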

Diagnostics for compatibility debugging

  • Request/response diagnostics under debug mode:
    • [HTTP_IN], [HTTP_OUT], [HTTP_ERR]
  • Ollama upstream payload diagnostics:
    • [OLLAMA_400] ... body=... payload=...
  • Response marker header:
    • X-Proxy: serverless-proxy

Conformance and roadmap docs

  • Runtime conformance smoke script: scripts/ollama_runtime_conformance.sh
  • Full-surface conformance script: scripts/ollama_full_surface_conformance.sh
  • Compatibility roadmap and checklist: docs/ollama-compatibility-roadmap.md
  • Version-dependent compatibility notes: docs/ollama-version-notes.md

Run conformance checks:

OLLAMA_PROXY_BASE_URL=http://localhost:8002 \
OLLAMA_TEST_MODEL=gemma4:26b \
./scripts/ollama_runtime_conformance.sh

# Full-surface non-mutating checks
./scripts/ollama_full_surface_conformance.sh

# Full-surface including mutating lifecycle checks
OLLAMA_RUN_MUTATING=1 ./scripts/ollama_full_surface_conformance.sh

Admin API (port 5001)

  • /api/admin/endpoints (GET, POST): list/create endpoints
  • /api/admin/activity (GET): recent activity feed (FastAPI)
  • /api/admin/endpoints/activity (GET): recent activity feed (Flask/admin-compatible alias)
  • /endpoints (GET, POST): manage endpoints
  • /endpoints/<id> (PUT): update endpoint
  • /endpoints/<id>/delete (GET, DELETE): delete endpoint
  • /endpoints/<id>/test (POST): test endpoint connection
  • /endpoints/<id>/models (GET): fetch available models
  • /api/admin/oauth/openai/start-web-auth (POST): start OpenAI web OAuth flow
  • /api/admin/oauth/openai/complete-web-auth (POST): complete OpenAI web OAuth with pasted URL/code
  • /api/admin/oauth/openai/import-codex (POST): import OAuth fields from Codex auth.json
  • /api/admin/oauth/openai/auth-result (GET): poll OAuth popup result by state
  • /api/admin/oauth/openai/callback (GET): OAuth callback handler
  • /api/admin/virtual-models (GET): list virtual models
  • /virtual-models (POST): create virtual model
  • /virtual-models/<id> (PUT): update virtual model
  • /virtual-models/<id>/delete (GET, DELETE): delete virtual model
  • /api/admin/tool-patterns (GET, POST): list/create tool patterns
  • /api/admin/tool-patterns/<id> (PUT, DELETE): update/delete tool pattern

OAuth fields are accepted by endpoint create/update APIs (/endpoints, /endpoints/<id>, /api/admin/endpoints, /api/admin/endpoints/<id>).

Common OAuth payload fields:

  • oauth_enabled (bool)
  • oauth_grant_type (refresh_token or client_credentials)
  • oauth_token_url
  • oauth_client_id
  • oauth_client_secret
  • oauth_scope
  • oauth_refresh_token
  • oauth_token_request_format (json or form)
  • oauth_client_auth_method (client_secret_post or client_secret_basic)
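A minimal sketch of a create-endpoint request body using the OAuth field names above (all values are placeholders):

```python
# Example create-endpoint payload with OAuth fields (placeholder values).
payload = {
    "name": "My OAuth Provider",
    "url": "https://chatgpt.com",
    "type": "openai_oauth",
    "oauth_enabled": True,
    "oauth_grant_type": "refresh_token",
    "oauth_token_url": "https://auth.openai.com/oauth/token",
    "oauth_client_id": "your-client-id",
    "oauth_refresh_token": "your-refresh-token",
    "oauth_token_request_format": "json",
    "oauth_client_auth_method": "client_secret_post",
}
# POST this as JSON to /endpoints (or /api/admin/endpoints) on port 5001.
```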

Reverse Proxy Routing Notes (Caddy)

If you front the dashboard with Caddy (or another reverse proxy), route admin paths to the correct backend services before catch-all /api/* rules.

  • 127.0.0.1:5001 (Flask/admin):
    • /api/admin/endpoints*
    • /api/admin/virtual-models*
    • /api/admin/oauth/*
    • /api/admin/endpoints/activity
    • /endpoints*, /virtual-models*
  • 127.0.0.1:8002 (FastAPI/API):
    • /api/admin/usage*
    • /api/admin/activity

Example Caddy matchers:

@proxy-api-usage path /api/admin/usage*
handle @proxy-api-usage {
    reverse_proxy 127.0.0.1:8002
}

@proxy-api-oauth path /api/admin/oauth/*
handle @proxy-api-oauth {
    reverse_proxy 127.0.0.1:5001
}

@proxy-api path /api/admin/endpoints* /api/admin/virtual-models* /api/admin/endpoints/activity /endpoints* /virtual-models*
handle @proxy-api {
    reverse_proxy 127.0.0.1:5001
}

Backend Types

  • openwebui: OpenWebUI API (/api/chat/completions, /api/models, /api/v1/embeddings)
  • openai: OpenAI-compatible API (/v1/chat/completions, /v1/models, /v1/embeddings)
  • openai_oauth: OpenAI OAuth/Codex-style backend (/backend-api/codex/responses) with OpenAI chat-completions request/response translation
  • ollama: Ollama API (native /api/* plus OpenAI-compatible /v1/* bridging)
  • vllm: vLLM API
  • together: Together AI
  • runpod: RunPod Serverless
  • anthropic: Anthropic Messages API (/v1/messages)
  • deepinfra: DeepInfra OpenAI-compatible API (/v1/openai/chat/completions)
  • queue: AI Queue endpoint (/v1/chat/completions, /v1/embeddings)

AI Queue Integration (Optional)

Route requests through AI Queue Master for priority queuing and request tracking.

USE_AI_QUEUE=true
AI_QUEUE_URL=http://host.docker.internal:8102
AI_QUEUE_API_KEY=your_queue_api_key
AI_QUEUE_PRIORITY=NORMAL

Features

  • Tool call parsing — Automatically extracts tool calls from model output
  • Chain-of-thought stripping — Removes reasoning prefixes
  • Streaming & non-streaming — Full SSE streaming support
  • Job polling — Automatically polls for queued job completion
  • Session-based auth — Uses AI Menu System for admin authentication
  • Claude Code / OpenCode support — Compatible with AI coding assistants

Supporting AI Coding Assistants (Claude Code, OpenCode, Cursor, etc.)

AI coding assistants require specific configurations to work properly. The proxy includes special handling to ensure compatibility:

Proxy Adjustments for AI Coding Assistants

  • Tool call normalization — Automatically fixes malformed tool calls from models
  • System prompt preservation — Maintains context across code generation sessions
  • Streaming optimization — Real-time tool execution for interactive coding
  • Response format conversion — Ensures OpenAI-compatible format for tool results
  • Error handling — Graceful fallbacks when models produce unexpected output
  • Claude Code compatibility — Claude Code works best with OpenAI-compatible endpoints through the proxy, even when using non-OpenAI models

Model Requirements

Use models with strong tool-calling capabilities. Recommended:

  • Qwen series (e.g., Qwen3-80B, Qwen3-Coder) - Excellent tool calling
  • Claude 3.5+ - Native tool support via Anthropic API
  • DeepSeek-V3 - Good tool calling performance

Endpoint Configuration

For best results with coding assistants:

  1. Use OpenAI-compatible or DeepInfra endpoint types
  2. Enable streaming for real-time tool execution
  3. Configure adequate max_tokens (8192-128000 for code generation)

Virtual Model Setup

When creating virtual models for coding assistants:

  • Set appropriate max_tokens to allow long code outputs
  • Use models that support tool calls (check provider docs)
  • For Anthropic models, ensure endpoint type is set to anthropic

Troubleshooting

Tools not executing:

  • Check model supports tool calls (not all models do)
  • Verify streaming is enabled
  • Check response format in logs

Code execution errors:

  • Verify model output is valid JSON for tool calls
  • Check custom headers if required by your setup
# View container logs
docker logs serverless-proxy

# Restart container
docker restart serverless-proxy

# Check health
curl http://localhost:8002/health

Project Structure

.
├── simple_bridge.py          # Main proxy application (FastAPI + Flask)
├── docker-compose.yml        # Docker Compose configuration
├── Dockerfile                # Container image definition
├── requirements.txt          # Python dependencies
├── templates/
│   └── admin_dashboard.html # Admin UI (static HTML)
├── .env.example              # Environment variable template
├── README.md
└── CHANGELOG.md

License

MIT License — see LICENSE.md

Acknowledgments

Based on RunPod serverless API patterns. Extended with virtual model configuration, Anthropic API compatibility, and admin UI capabilities.