A high-performance API server that exposes OpenAI-compatible endpoints for MLX models. Built in Python on the FastAPI framework, it offers an efficient, scalable, and user-friendly way to run MLX-based vision and language models locally behind an OpenAI-compatible interface.
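As a rough illustration of what such an OpenAI-compatible surface looks like, here is a minimal FastAPI sketch of a `/v1/chat/completions` route. The `generate()` stub and the request/response fields follow the public OpenAI chat-completions schema; everything here is an assumption standing in for the actual MLX backend, not this project's code.

```python
# Minimal sketch of an OpenAI-compatible chat endpoint in FastAPI.
# generate() is a placeholder for the real MLX generation call.
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    model: str
    messages: list[ChatMessage]
    max_tokens: int = 256
    temperature: float = 0.7


def generate(req: ChatRequest) -> str:
    # Placeholder: the real server would run the MLX model here.
    return "Hello from the MLX backend."


@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    text = generate(req)
    # Response shaped like OpenAI's chat.completion object, so existing
    # OpenAI client libraries can consume it unchanged.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }
```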
A from-scratch LLM inference engine built in PyTorch, with custom GPT-2 and LLaMA transformers, a KV cache, a paged KV cache, continuous batching, and A100 benchmarks.
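For context, a minimal PyTorch sketch of the KV-cache idea: keys and values for past tokens are stored once and reused each decode step, so each step only computes projections for the new token. The `KVCache` class, shapes, and sizes below are illustrative assumptions, not this engine's actual implementation.

```python
# Illustrative per-layer KV cache: append the new token's K/V, then
# attend the new query against the full cached prefix.
import torch


class KVCache:
    def __init__(self, max_len: int, n_heads: int, head_dim: int, device="cpu"):
        self.k = torch.zeros(1, n_heads, max_len, head_dim, device=device)
        self.v = torch.zeros(1, n_heads, max_len, head_dim, device=device)
        self.len = 0  # number of cached positions

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new / v_new: (1, n_heads, t_new, head_dim)
        t_new = k_new.shape[2]
        self.k[:, :, self.len:self.len + t_new] = k_new
        self.v[:, :, self.len:self.len + t_new] = v_new
        self.len += t_new
        # Views over the valid prefix, used as attention keys/values.
        return self.k[:, :, :self.len], self.v[:, :, :self.len]


# One decode step: a single-token query attends over all cached positions.
cache = KVCache(max_len=128, n_heads=4, head_dim=16)
q = torch.randn(1, 4, 1, 16)
k, v = cache.append(torch.randn(1, 4, 1, 16), torch.randn(1, 4, 1, 16))
out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 1, 16])
```

A paged KV cache extends this by storing the cache in fixed-size blocks indexed through a page table, so memory for many concurrent sequences can be allocated and freed at block granularity instead of reserving `max_len` per request.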
Fork of an OpenAI- and Anthropic-compatible server for Apple Silicon. Native MLX backend reaching 500+ tok/s. Runs LLMs and vision-language models with continuous batching, MCP tool calling, and multimodal support.
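To make the continuous-batching idea concrete, here is a hedged Python sketch of the scheduling pattern: new requests join the batch between decode steps and finished sequences leave immediately, rather than the whole batch draining together. `Request`, `decode_step`, and `EOS` are hypothetical names, not this fork's API.

```python
# Sketch of a continuous-batching decode loop under assumed names.
from collections import deque
from dataclasses import dataclass, field

EOS = -1  # hypothetical end-of-sequence token


@dataclass
class Request:
    prompt: list[int]
    max_new: int
    out: list[int] = field(default_factory=list)


def decode_step(batch: list[Request]) -> list[int]:
    # Placeholder for one batched forward pass; emits a dummy token
    # per request, then EOS once the request hits its token budget.
    return [EOS if len(r.out) + 1 >= r.max_new else 0 for r in batch]


def serve(waiting: deque, max_batch: int = 8):
    active: list[Request] = []
    while waiting or active:
        # Admit queued requests into any free batch slots.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        tokens = decode_step(active)
        for req, tok in zip(active, tokens):
            req.out.append(tok)
        # Retire finished sequences so their slots free up immediately.
        active = [r for r in active if r.out[-1] != EOS]


reqs = deque(Request(prompt=[1, 2], max_new=n) for n in (2, 5, 3))
serve(reqs)
```

The key property is that batch membership is re-evaluated every step, which keeps the accelerator busy even when requests have very different output lengths.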
Mistral-7B inference server achieving roughly 15x throughput gains through continuous batching, priority scheduling, and CPU-optimized execution. Exposes a FastAPI interface with per-request latency and queue metrics.
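A minimal sketch of how priority scheduling and per-request queue/latency metrics might fit together, assuming a simple heap-based scheduler. `PriorityScheduler` and all field names are hypothetical, not this server's actual code.

```python
# Heap-based priority scheduling with queue-time and inference-time
# metrics recorded per request. Lower priority values run first.
import heapq
import itertools
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int                        # tie-breaker for equal priorities
    prompt: str = field(compare=False)
    enqueued_at: float = field(compare=False)


class PriorityScheduler:
    def __init__(self):
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()

    def submit(self, prompt: str, priority: int = 10):
        heapq.heappush(
            self._heap,
            QueuedRequest(priority, next(self._counter), prompt, time.monotonic()),
        )

    def run_next(self, infer):
        req = heapq.heappop(self._heap)
        start = time.monotonic()
        result = infer(req.prompt)
        end = time.monotonic()
        metrics = {
            "queue_s": start - req.enqueued_at,  # time spent waiting
            "latency_s": end - start,            # time spent in inference
            "priority": req.priority,
        }
        return result, metrics


sched = PriorityScheduler()
sched.submit("low-priority batch job", priority=50)
sched.submit("interactive chat turn", priority=1)
_, m = sched.run_next(lambda p: p.upper())  # interactive request runs first
print(m)
```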