Start the server:

```bash
# Simple mode - maximum throughput for a single user
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# Continuous batching - for multiple concurrent users
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
```

Use it with the OpenAI Python SDK:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Or with curl:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'
```

You can also run models directly through the Python API, without a server:

```python
from vllm_mlx.models import MLXLanguageModel

model = MLXLanguageModel("mlx-community/Llama-3.2-3B-Instruct-4bit")
model.load()

# Generate text
output = model.generate("What is the capital of France?", max_tokens=100)
print(output.text)

# Streaming
for chunk in model.stream_generate("Tell me a story"):
    print(chunk.text, end="", flush=True)
```

To chat from the browser instead, launch the bundled UI:

```bash
vllm-mlx-chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
```

This opens a web interface at http://localhost:7860.
For image/video understanding, serve a VLM model:

```bash
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
```

Then send mixed text-and-image content through the chat API:

```python
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
    max_tokens=256,
)
```

Reasoning models can separate their thinking process from the final answer.
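Conceptually, the parser splits tagged thinking tokens out of the raw completion. A rough sketch, assuming Qwen-style `<think>` tags (the real parsing happens server-side via `--reasoning-parser`, and the tag format is model-specific):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer)."""
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", text.strip()

raw = "<think>17 × 23 = 17 × 20 + 17 × 3 = 340 + 51</think>The answer is 391."
reasoning, answer = split_reasoning(raw)
print(answer)  # The answer is 391.
```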
Serve with a reasoning parser:

```bash
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
```

```python
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}],
)
print(response.choices[0].message.content)  # Final answer
```

vllm-mlx can also generate text embeddings for semantic search and RAG.
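Downstream, retrieval typically ranks documents by cosine similarity between embedding vectors. A minimal pure-Python sketch (the vectors here are made up for illustration; real embeddings come from the API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.3, 0.5]
docs = {"doc_a": [0.1, 0.29, 0.52], "doc_b": [0.9, -0.2, 0.1]}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # doc_a is the closer match
```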
Serve an embedding model alongside the LLM:

```bash
vllm-mlx serve mlx-community/Qwen3-4B-4bit --embedding-model mlx-community/multilingual-e5-small-mlx
```

```python
response = client.embeddings.create(
    model="mlx-community/multilingual-e5-small-mlx",
    input="Hello world",
)
```

Function calling can be enabled with any supported model.
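Tools are declared with the OpenAI function-calling schema, and the model returns structured tool calls that your code executes. A sketch of the round trip, with an illustrative `get_weather` tool and a hand-written tool call standing in for a live server response:

```python
import json

# OpenAI-style tool schema; get_weather is an illustrative example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real lookup

# With a running server you would pass tools=tools to
# client.chat.completions.create(...) and read the tool calls from
# response.choices[0].message.tool_calls. Here we dispatch a
# hand-written call of the same shape:
tool_call = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
args = json.loads(tool_call["arguments"])
result = {"get_weather": get_weather}[tool_call["name"]](**args)
print(result)  # Sunny in Paris
```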
Serve with a tool-call parser:

```bash
vllm-mlx serve mlx-community/Devstral-Small-2507-4bit \
  --enable-auto-tool-choice --tool-call-parser mistral
```

Further reading:

- Server Guide - Full server configuration
- Python API - Direct API usage
- Multimodal Guide - Images and video
- Audio Guide - Speech-to-Text and Text-to-Speech
- Embeddings Guide - Text embeddings
- Reasoning Models - Thinking models
- Tool Calling - Function calling
- Supported Models - Available models