Skip to content

hyper07/ollama_exo_proxy_server

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

11 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Exo Proxy Fortress: Secure AI Cluster Management πŸ›‘οΈ

License Python Version Built with Release GitHub stars

Secure your distributed AI infrastructure. Exo Proxy Fortress is the ultimate security and management layer for your Exo AI clusters, designed to be set up in 60 seconds by anyone, on any operating system.

Note: This is a fork of ParisNeo/exo_proxy_server with significant enhancements including full Dockerization and RAG (Retrieval-Augmented Generation) implementation for knowledge base management and semantic search capabilities.

Whether you're running a home AI cluster or managing enterprise-scale distributed inference, this tool transforms your vulnerable Exo endpoints into a managed, secure, and deeply customizable AI command center.


πŸ“‹ Project Status & Roadmap

βœ… Completed Features

  • Full Dockerization with MongoDB, Redis, Nginx, and automatic SSL certificate generation
  • RAG Implementation with knowledge base management
  • Document upload and automatic chunking
  • Vector embeddings with ChromaDB
  • Semantic search and retrieval
  • Chat playground integration
  • Gunicorn Production Setup with application-level cluster management
  • SSL Certificate Upload via Web UI
  • Multi-Server Management for Exo clusters and Ollama servers
  • Automatic SSL Generation for localhost
  • Enhanced Documentation
  • KV Cache System with Redis
  • API response caching
  • Embedding caching for RAG operations
  • RAG query result caching
  • Model metadata caching

🚧 In Progress

  • RAG Performance Optimization
  • Enhanced Error Handling for RAG operations
  • Documentation Improvements

πŸ“‹ Planned Features

  • PDF and binary document format support
  • Automatic document metadata extraction
  • Multi-language document support
  • Document versioning and update management
  • RAG usage statistics and metrics
  • Knowledge base performance monitoring
  • Query analytics and optimization suggestions
  • RESTful API for knowledge base management
  • Webhook support for document indexing events
  • Batch document upload API
  • Document-level access control for knowledge bases
  • Encrypted document storage option
  • Audit logging for RAG operations
  • Webhook integrations for external systems
  • Export/import knowledge bases
  • Backup and restore functionality
  • Real-time document indexing progress
  • Advanced search filters for knowledge bases
  • Document preview in chat context
  • Distributed ChromaDB support
  • Caching layer for frequently accessed documents
  • Async document processing queue

Why You Need Exo Proxy Fortress

Exo enables running massive AI models across multiple devices with automatic discovery, RDMA over Thunderbolt, and tensor parallelism. However, exposing Exo clusters directly to the internet introduces significant security risks. Exo Proxy Fortress provides enterprise-grade security while preserving all the power of distributed AI inference.

Key Benefits

  • πŸ›‘οΈ Rock-Solid Security:

    • API Key Authentication: Eliminate anonymous access to your AI clusters.
    • Endpoint Blocking: Prevent API key holders from accessing sensitive endpoints like pull, delete, and create to protect your servers from abuse.
    • One-Click HTTPS/SSL: Encrypt all traffic with easy certificate uploads or path-based configuration.
    • IP Filtering: Create granular allow/deny lists to control exactly which machines can connect.
    • Rate Limiting & Brute-Force Protection: Prevent abuse and secure your admin interface (powered by Redis).
  • πŸš€ High-Performance Distributed Engine:

    • Intelligent Load Balancing: Distribute requests across multiple Exo servers, Ollama instances, or mixed clusters for maximum speed and high availability.
    • Smart Model Routing: Automatically route requests only to servers that have the specific model available, preventing failed requests and saving compute resources.
    • Automatic Retries: Handle temporary server hiccups gracefully with exponential backoff strategy.
    • Application-Level Management: Built on Gunicorn for production-grade performance, allowing you to manage multiple cluster servers without Kubernetes complexity.
  • 🌐 Centralized Cluster & Model Management:

    • Monitor and manage all your Exo nodes, Ollama servers, or mixed environments from a single web interface.
    • Pull, update, and delete models on any connected server directly from the proxy's web UIβ€”no terminal commands required.
    • Track model distribution, device health, and cluster performance in real-time.
    • Add or remove servers dynamically through the web UI without infrastructure changes.
  • πŸ§ͺ Model Playgrounds & RAG:

    • Interactive Chat Playground: Test models with streaming responses, multi-modal inputs (paste images directly!), and full conversation history management (import/export).
    • RAG Knowledge Bases: Upload documents, automatically chunk and index them with vector embeddings, and retrieve relevant context during chat conversations for more accurate responses.
  • πŸ“Š Mission Control Dashboard:

    • Real-time monitoring of proxy health (CPU, Memory, Disk), active models across all cluster nodes, live load balancer status, and API rate-limit queues.
  • πŸ“ˆ Comprehensive Analytics Suite:

    • Interactive charts for daily and hourly requests, model popularity, and cluster load.
    • Per-user analytics with exportable data (CSV or PNG).
  • πŸ‘€ Granular User & API Key Management:

    • Create and manage users with sortable tables showing key counts, total requests, and last activity.
    • Manage individual API keys with per-key rate limits, and temporarily disable or re-enable keys on the fly.
  • 🎨 Radical Theming Engine:

    • Choose from over a dozen stunning UI themes including Material Design, Cyberpunk, CRT Terminal, and Brutalist styles.
  • ✨ Effortless Setup:

    • One-command Docker deployment. No complex configuration required.

🧠 RAG (Retrieval-Augmented Generation)

This fork includes a complete RAG implementation that allows you to build knowledge bases from your documents and enhance chat interactions with semantic search. The RAG system uses KV caching to dramatically improve performance by caching embeddings and query results.

⚑ KV Cache System

The proxy includes a comprehensive Redis-based KV cache system that significantly improves performance:

Cache Types

  • Embedding Cache: Caches generated embeddings for document chunks and queries, avoiding expensive regeneration. Embeddings are cached for 7 days by default.
  • RAG Query Cache: Caches semantic search results for frequently asked questions, reducing vector database queries.
  • API Response Cache: Caches non-streaming API responses (like model lists) to reduce backend server load.
  • Model Metadata Cache: Caches server model information to reduce database queries.

Benefits

  • Faster RAG Operations: Embedding generation is expensive - caching dramatically speeds up document indexing and query processing
  • Reduced Backend Load: Cached responses reduce requests to Exo/Ollama servers
  • Better User Experience: Faster response times for repeated queries
  • Cost Savings: Fewer API calls to embedding models means lower compute costs

Cache Configuration

The cache system is automatically enabled when Redis is available. Cache TTLs (Time To Live) are configurable:

  • Embeddings: 7 days (configurable)
  • RAG queries: 1 hour (configurable)
  • API responses: 5 minutes (configurable)
  • Model metadata: 10 minutes (configurable)

Cache statistics and management can be accessed through the admin interface (coming soon).


Key RAG Features

  • Knowledge Base Management: Create multiple knowledge bases to organize your documents by topic, project, or domain.
  • Document Upload & Indexing: Upload text files (.txt, .md, .json, etc.) and automatically chunk them with configurable size and overlap.
  • Vector Embeddings: Documents are automatically embedded using your Exo embedding models and stored in ChromaDB for fast retrieval.
  • Semantic Search: Query your knowledge bases to find the most relevant document chunks based on semantic similarity.
  • Chat Integration: Select knowledge bases in the Chat Playground to automatically include relevant context in your conversations.
  • Multi-KB Support: Query multiple knowledge bases simultaneously for comprehensive context retrieval.

How to Use RAG

  1. Navigate to Knowledge Bases (RAG) in the sidebar
  2. Create a new knowledge base and specify an embedding model
  3. Upload documents to your knowledge base (they'll be automatically indexed)
  4. In the Chat Playground, select one or more knowledge bases from the dropdown
  5. Your queries will automatically include relevant context from your documents

πŸ›‘οΈ Harden Your Defenses: Endpoint Blocking

Giving every user an API key shouldn't mean giving them the keys to the kingdom. By default, Exo Proxy Fortress blocks access to dangerous and resource-intensive API endpoints for all API key holders.

  • Prevent Resource Abuse: Stop users from accessing sensitive management endpoints that could affect cluster performance.
  • Protect Your Models: Prevent API users from modifying model configurations on your backend nodes.
  • Full Admin Control: As an administrator, you can still perform all management actions securely through the web UI's Cluster Management page.
  • Customizable: You have full control to change which endpoints are blocked via the Settings -> Endpoint Security menu.

πŸ”’ Encrypt Everything with One-Click HTTPS/SSL

Securing your AI traffic is now dead simple. In the Settings -> HTTPS/SSL menu, you have two easy options:

  1. Upload & Go (Easiest): Simply upload your key.pem and cert.pem files directly through the UI. The files are automatically saved to the .ssl directory and configured for use. After uploading, restart the server for HTTPS to take effect.
  2. Path-Based: If your certificates are already on the server (e.g., managed by Certbot), just provide the full file paths in the text fields.

Important: After uploading SSL certificates or changing SSL settings, you must restart the server for the changes to take effect. The uploaded files are stored in the .ssl directory in the project root.

For local testing, you can generate a self-signed certificate with OpenSSL:

openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -nodes -subj "/CN=localhost"

A server restart is required to apply changes, ensuring your connection is fully encrypted and secure from eavesdropping.


πŸš€ Quick Start

One command (HTTP only):

docker-compose up -d

This starts the app, MongoDB, and Redis (HTTP on port 8080).

Access the web interface: http://localhost:8080
Default admin: admin / changeme (change this!)

Optional: Full HTTPS with Certbot + Nginx

  1. Point your DNS A/AAAA to this host (e.g., example.com).

  2. Issue a cert (webroot):

docker-compose run --rm certbot certonly \
    --webroot -w /var/www/certbot \
    -d example.com \
    --email you@example.com --agree-tos --no-eff-email
  1. Edit infra/nginx/conf.d/exo.conf and replace example.com with your domain (both server blocks).

  2. Start/restart with Nginx TLS termination:

docker-compose up -d nginx app redis mongodb
  1. If you want the app itself to also serve HTTPS (optional), set in settings or .env:
  • SSL_CERTFILE=/etc/letsencrypt/live/example.com/fullchain.pem
  • SSL_KEYFILE=/etc/letsencrypt/live/example.com/privkey.pem and restart the app container.

Renewal (cron-friendly):

docker-compose run --rm certbot renew --webroot -w /var/www/certbot
docker-compose restart nginx

Manual Setup (Optional)

If you want to customize settings before starting:

# 1. Run setup wizard
docker-compose run --rm app python setup_wizard.py

# 2. Start services
docker-compose up -d

Native Python Installation (Advanced)

For development or custom deployments, use the legacy installer:

./run.sh

Then choose option 2 for native Python installation.


πŸ”‘ Using the API with API Keys

Once you've created an API key through the admin interface, you can use it to authenticate your requests to the proxy. All API requests must include your API key in the Authorization header using the Bearer token format.

Getting Your API Key

  1. Log in to the admin interface at http://localhost:8080/admin
  2. Navigate to Users β†’ Select a user β†’ Create API Key
  3. Copy the generated API key (you'll only see it once!)

API Request Examples

Using cURL

Chat Completion:

curl -X POST http://localhost:8080/api/chat/completions \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-1b",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "stream": false
  }'

Using Python

import requests

API_KEY = "your_api_key_here"
BASE_URL = "http://localhost:8080"

# Chat completion
response = requests.post(
    f"{BASE_URL}/api/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "llama-3.2-1b",
        "messages": [
            {"role": "user", "content": "What is machine learning?"}
        ],
        "stream": False
    }
)
print(response.json())

Using JavaScript/Node.js

const API_KEY = "your_api_key_here";
const BASE_URL = "http://localhost:8080";

// Chat completion
fetch(`${BASE_URL}/api/chat/completions`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "llama-3.2-1b",
    messages: [
      { role: "user", content: "Explain quantum computing" }
    ],
    stream: false
  })
})
  .then(res => res.json())
  .then(data => console.log(data))
  .catch(err => console.error("Error:", err));

Streaming Responses

For streaming responses, set "stream": true in your request:

curl -X POST http://localhost:8080/api/chat/completions \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-1b",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'

Important Notes

  • API Key Format: Your API key should be included exactly as generated (format: op_prefix_secret)
  • HTTPS: If you've configured HTTPS, replace http:// with https:// in all examples
  • Rate Limiting: Each API key has configurable rate limits set by the administrator
  • Endpoint Restrictions: Some sensitive management endpoints may be blocked for API key users by default

Troubleshooting API Requests

"Method Not Allowed" Error:

  • Ensure you're using the correct HTTP method (POST for /api/chat/completions, etc.)
  • Verify your API key is valid and active in the admin interface
  • Check that the endpoint isn't blocked in Settings β†’ Endpoint Security
  • Try restarting the proxy server if you recently made configuration changes
  • Verify the route is accessible: curl -X GET https://proxy.local/api/models -H "Authorization: Bearer your_api_key"

"401 Unauthorized" Error:

  • Check that your API key is correctly formatted in the Authorization header: Bearer your_api_key_here
  • Ensure there's a space between "Bearer" and your API key
  • Verify the API key hasn't been revoked or disabled
  • Make sure you're using the full API key (both prefix and secret parts)

"403 Forbidden" Error:

  • The endpoint you're trying to access may be blocked for API key users
  • Check Settings β†’ Endpoint Security to see which endpoints are restricted
  • Some management endpoints are blocked by default for security

Connection Issues:

  • Verify the proxy server is running and accessible
  • Check that you're using the correct base URL (including port if not using standard 80/443)
  • For HTTPS, ensure your SSL certificates are properly configured

Visual Showcase

Step 1: The Command Center Dashboard

Your new mission control. Instantly see system health, active models, server status, and live rate-limit queues, all updating automatically.

Model List

Step 2: Manage Your Servers & Models

No more SSH or terminal juggling. Add all your Exo instances, then pull, update, and delete models on any server with a few clicks.

EXO Cluster List

Step 3: Manage Users & Drill Down into Analytics

The User Management page gives you a sortable, high-level overview. From here, click "View Usage" to dive into a dedicated analytics page for any specific user.

User Management

Step 4: Test & Benchmark in the Playgrounds

Use the built-in playground to evaluate your models. The Chat Playground provides a familiar UI to test conversational models with streaming and image support. You can also create Knowledge Bases and enable RAG to enhance your chat interactions with document context.

Step 5: EXO Master API Tester

Test the EXO Master API endpoints directly. This interface makes direct calls to your EXO instance (no proxy layer). Configure your EXO instance URL and test endpoints like /models, /state, and /place_instance with ease.

![EXO API Tester](assets/EXO API TESTER.gif)


πŸš€ Architecture: Gunicorn-Based Application-Level Management

This proxy uses Gunicorn with Uvicorn workers instead of Kubernetes for deployment. This architectural choice provides significant advantages for managing distributed AI clusters:

Why Gunicorn Instead of Kubernetes?

  • Application-Level Cluster Management: Unlike Kubernetes which operates at the infrastructure level, Gunicorn allows us to manage multiple Ollama or Exo cluster servers directly at the application level. This means you can add, remove, and configure backend servers through the web UI without touching infrastructure configuration.

  • Simplified Deployment: No need for complex Kubernetes manifests, service discovery, or ingress controllers. The proxy handles all server management through its built-in load balancer and routing logic.

  • Unified Control Plane: All your distributed AI servers (whether Exo clusters, Ollama instances, or mixed environments) are managed from a single application interface. The proxy automatically discovers models, routes requests intelligently, and provides unified analytics across all your servers.

  • Flexible Server Configuration: Add servers dynamically through the web interface. Each server can have different configurations, API keys, and capabilities, all managed at the application level without infrastructure changes.

  • Production-Ready Performance: Gunicorn with Uvicorn workers provides excellent performance for async FastAPI applications, with configurable worker processes for optimal resource utilization.

Multi-Server Management

The proxy excels at managing heterogeneous AI infrastructure:

  • Multiple Exo Clusters: Connect multiple Exo cluster nodes and the proxy will intelligently route requests based on model availability
  • Mixed Environments: Combine Exo clusters with Ollama servers in a single unified interface
  • Dynamic Discovery: Automatically discover and catalog all available models across all connected servers
  • Smart Routing: Route requests only to servers that have the required model, preventing failed requests

🐳 Docker Deployment (Recommended)

This version has been fully dockerized for easy deployment. The easiest way to run Exo Proxy Fortress is with Docker Compose, which provides the full stack (app, MongoDB, Redis, Nginx) in isolated containers with automatic SSL certificate generation for localhost. The application runs with Gunicorn for production-grade performance and reliability.

Quick Start

./start.sh

Manual Control

# Setup (first time only)
./setup.sh

# Start services
docker-compose up -d

# View logs
docker-compose logs -f app

# Stop services
docker-compose down

Docker Commands

  • Start: docker-compose up -d
  • Stop: docker-compose down
  • Restart: docker-compose restart
  • Logs: docker-compose logs -f app
  • Rebuild: docker-compose up --build

The setup automatically handles:

  • MongoDB database initialization
  • Redis caching setup
  • Nginx reverse proxy with automatic SSL certificate generation (for localhost)
  • Gunicorn WSGI server with Uvicorn workers for production deployment
  • Volume persistence for data
  • Automatic service dependencies
  • ChromaDB vector database for RAG functionality

Production Configuration

The Docker setup uses Gunicorn with configurable workers (default: 4). You can customize this via environment variables:

# Set number of worker processes
GUNICORN_WORKERS=8

# Set bind address and port
GUNICORN_BIND=0.0.0.0:8080

πŸ› οΈ Development Mode (Hot Reload)

For developers who want to modify the codebase with instant feedback:

Prerequisites

# Install development dependencies
pip install -r requirements.txt
pip install beanie motor redis

# Or with Poetry (if available)
poetry install

Quick Development Start

# Use the development script with optimized hot reload
python dev.py

Manual Development Start

# Activate virtual environment
source venv/bin/activate

# Start with hot reload
uvicorn app.main:app --host 0.0.0.0 --port 8000 \
  --reload \
  --reload-delay 0.1 \
  --reload-include "*.py" \
  --reload-include "*.html" \
  --reload-include "*.css" \
  --reload-include "*.js" \
  --reload-include "*.jinja2" \
  --log-level info

Hot Reload Features

  • Instant Python code reloading - Changes to .py files reload automatically
  • Template hot reload - HTML template changes reflect immediately (no caching)
  • Static file watching - CSS/JS changes trigger reload
  • Fast reload delay - 0.1 second response time
  • Comprehensive file watching - Watches entire project directory

Development Tips

  • Server runs on http://localhost:8000
  • Template changes are picked up immediately (no server restart needed)
  • Use browser developer tools to clear cache if needed
  • Check terminal for reload confirmation messages

Resetting Your Installation (Troubleshooting)

WARNING: IRREVERSIBLE ACTION

The reset scripts are for troubleshooting or starting over completely. They will PERMANENTLY DELETE your database, configuration, and Python environment.

If you encounter critical errors or wish to perform a completely fresh installation, use the provided reset scripts.

On Windows: Double-click the reset.bat file.

On macOS or Linux:

chmod +x reset.sh
./reset.sh

Credits and Acknowledgements

The Exo Proxy was developed with passion by the open-source community. A special thank you to:

  • Exo - The distributed AI inference framework that powers this proxy.
  • ParisNeo/exo_proxy_server - The original project that this fork is based on.
  • hyper07 - Dockerization, RAG implementation, and additional enhancements.
  • All contributors who have helped find and fix bugs.
  • The teams behind FastAPI, MongoDB, ChromaDB, Jinja2, Chart.js, and Tailwind CSS.

Visit this project on GitHub to contribute, report issues, or star the repository!


License

This project is licensed under the Apache License 2.0. Feel free to use, modify, and distribute.

About

Secure, dockerized proxy and management layer for Exo & Ollama AI clusters with load balancing, API key security, analytics, and built-in RAG (Retrieval-Augmented Generation).

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors