A high-performance API server that exposes OpenAI-compatible endpoints for MLX models. Built in Python on the FastAPI framework, it offers an efficient, scalable, and user-friendly way to run MLX-based vision and language models locally behind an OpenAI-compatible interface.
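As a rough illustration of what such an OpenAI-compatible surface looks like, here is a minimal FastAPI sketch of a `/v1/chat/completions` route. The `generate()` stub and the request/response fields follow the public OpenAI chat-completions schema; everything here is an assumption standing in for the actual MLX backend, not this project's code.

```python
# Minimal sketch of an OpenAI-compatible chat endpoint in FastAPI.
# generate() is a placeholder for the real MLX generation call.
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    model: str
    messages: list[ChatMessage]
    max_tokens: int = 256
    temperature: float = 0.7


def generate(req: ChatRequest) -> str:
    # Placeholder: the real server would run the MLX model here.
    return "Hello from the MLX backend."


@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    text = generate(req)
    # Response shaped like OpenAI's chat.completion object, so existing
    # OpenAI client libraries can consume it unchanged.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }
```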
A from-scratch LLM inference engine built in PyTorch, with custom GPT-2 and LLaMA transformers, a KV cache, a paged KV cache, continuous batching, and A100 benchmarks.
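For context, a minimal PyTorch sketch of the KV-cache idea: keys and values for past tokens are stored once and reused each decode step, so each step only computes projections for the new token. The `KVCache` class, shapes, and sizes below are illustrative assumptions, not this engine's actual implementation.

```python
# Illustrative per-layer KV cache: append the new token's K/V, then
# attend the new query against the full cached prefix.
import torch


class KVCache:
    def __init__(self, max_len: int, n_heads: int, head_dim: int, device="cpu"):
        self.k = torch.zeros(1, n_heads, max_len, head_dim, device=device)
        self.v = torch.zeros(1, n_heads, max_len, head_dim, device=device)
        self.len = 0  # number of cached positions

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new / v_new: (1, n_heads, t_new, head_dim)
        t_new = k_new.shape[2]
        self.k[:, :, self.len:self.len + t_new] = k_new
        self.v[:, :, self.len:self.len + t_new] = v_new
        self.len += t_new
        # Views over the valid prefix, used as attention keys/values.
        return self.k[:, :, :self.len], self.v[:, :, :self.len]


# One decode step: a single-token query attends over all cached positions.
cache = KVCache(max_len=128, n_heads=4, head_dim=16)
q = torch.randn(1, 4, 1, 16)
k, v = cache.append(torch.randn(1, 4, 1, 16), torch.randn(1, 4, 1, 16))
out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 1, 16])
```

A paged KV cache extends this by storing the cache in fixed-size blocks indexed through a page table, so memory for many concurrent sequences can be allocated and freed at block granularity instead of reserving `max_len` per request.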
Fork of an OpenAI- and Anthropic-compatible server for Apple Silicon. Native MLX backend reaching 500+ tok/s. Runs LLMs and vision-language models with continuous batching, MCP tool calling, and multimodal support.
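To make the continuous-batching idea concrete, here is a hedged Python sketch of the scheduling pattern: new requests join the batch between decode steps and finished sequences leave immediately, rather than the whole batch draining together. `Request`, `decode_step`, and `EOS` are hypothetical names, not this fork's API.

```python
# Sketch of a continuous-batching decode loop under assumed names.
from collections import deque
from dataclasses import dataclass, field

EOS = -1  # hypothetical end-of-sequence token


@dataclass
class Request:
    prompt: list[int]
    max_new: int
    out: list[int] = field(default_factory=list)


def decode_step(batch: list[Request]) -> list[int]:
    # Placeholder for one batched forward pass; emits a dummy token
    # per request, then EOS once the request hits its token budget.
    return [EOS if len(r.out) + 1 >= r.max_new else 0 for r in batch]


def serve(waiting: deque, max_batch: int = 8):
    active: list[Request] = []
    while waiting or active:
        # Admit queued requests into any free batch slots.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        tokens = decode_step(active)
        for req, tok in zip(active, tokens):
            req.out.append(tok)
        # Retire finished sequences so their slots free up immediately.
        active = [r for r in active if r.out[-1] != EOS]


reqs = deque(Request(prompt=[1, 2], max_new=n) for n in (2, 5, 3))
serve(reqs)
```

The key property is that batch membership is re-evaluated every step, which keeps the accelerator busy even when requests have very different output lengths.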
Mistral-7B inference server achieving roughly 15x throughput gains through continuous batching, priority scheduling, and CPU-optimized execution. Exposes a FastAPI interface with per-request latency and queue metrics.
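A minimal sketch of how priority scheduling and per-request queue/latency metrics might fit together, assuming a simple heap-based scheduler. `PriorityScheduler` and all field names are hypothetical, not this server's actual code.

```python
# Heap-based priority scheduling with queue-time and inference-time
# metrics recorded per request. Lower priority values run first.
import heapq
import itertools
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int                        # tie-breaker for equal priorities
    prompt: str = field(compare=False)
    enqueued_at: float = field(compare=False)


class PriorityScheduler:
    def __init__(self):
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()

    def submit(self, prompt: str, priority: int = 10):
        heapq.heappush(
            self._heap,
            QueuedRequest(priority, next(self._counter), prompt, time.monotonic()),
        )

    def run_next(self, infer):
        req = heapq.heappop(self._heap)
        start = time.monotonic()
        result = infer(req.prompt)
        end = time.monotonic()
        metrics = {
            "queue_s": start - req.enqueued_at,  # time spent waiting
            "latency_s": end - start,            # time spent in inference
            "priority": req.priority,
        }
        return result, metrics


sched = PriorityScheduler()
sched.submit("low-priority batch job", priority=50)
sched.submit("interactive chat turn", priority=1)
_, m = sched.run_next(lambda p: p.upper())  # interactive request runs first
print(m)
```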