Go SDK for building agentic applications backed by a local or self-hosted vLLM OpenAI-compatible server.
- Package: `vllmsdk`
- Default backend: `http://127.0.0.1:8000/v1`

```bash
go get github.com/ethpandaops/vllm-agent-sdk-go
```

The SDK resolves configuration from explicit options first, then environment variables, then defaults.
| Variable | Description | Default |
|---|---|---|
| `VLLM_BASE_URL` | vLLM server base URL | `http://127.0.0.1:8000/v1` |
| `VLLM_API_KEY` | Bearer auth token (optional, only if your server enforces auth) | (none) |
| `VLLM_MODEL` | Model name | (none — must be set via env or `WithModel()`) |
| `VLLM_AGENT_SESSION_STORE_PATH` | Local session store directory | (none) |
Example-only variables (not resolved by the core SDK):
| Variable | Description | Default |
|---|---|---|
| `VLLM_IMAGE_MODEL` | Image-capable model for multimodal examples | `QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ` |
| `VLLM_VISION_MODEL` | Vision model for multimodal input examples | Falls back to `VLLM_IMAGE_MODEL`, then `VLLM_MODEL` |
| `VLLM_IMAGE_OUTPUT_DIR` | Directory for saving generated images | (none) |
All settings follow the same resolution order:

1. Explicit option (e.g. `WithBaseURL(...)`, `WithAPIKey(...)`, `WithModel(...)`)
2. Environment variable (`VLLM_BASE_URL`, `VLLM_API_KEY`, `VLLM_MODEL`)
3. Built-in default (where applicable)
The repo ships a sibling-style Makefile:

- `make test` runs race-enabled package tests with coverage output.
- `make test-integration` runs `./integration/...` with `-tags=integration`.
- `make audit` runs the aggregate quality gate.
Integration setup:

- Set `VLLM_BASE_URL` or default to `http://127.0.0.1:8000/v1`.
- Set `VLLM_MODEL` to the model served by your vLLM instance.
- Set `VLLM_API_KEY` if your vLLM server enforces bearer auth.
- Integration tests skip when the local vLLM server is unavailable.
```go
package main

import (
	"context"
	"fmt"
	"time"

	vllmsdk "github.com/ethpandaops/vllm-agent-sdk-go"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	// Model resolved from VLLM_MODEL env var, or set explicitly:
	for msg, err := range vllmsdk.Query(
		ctx,
		vllmsdk.Text("Write a two-line haiku about Go concurrency."),
		// vllmsdk.WithModel("QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ"),
	) {
		if err != nil {
			panic(err)
		}
		if result, ok := msg.(*vllmsdk.ResultMessage); ok && result.Result != nil {
			fmt.Println(*result.Result)
		}
	}
}
```

- `Query(ctx, content, ...opts)` and `QueryStream(...)` return `iter.Seq2[Message, error]`.
- `NewClient()` exposes `Start`, `StartWithContent`, `StartWithStream`, `Query`, `ReceiveMessages`, `ReceiveResponse`, `Interrupt`, `SetPermissionMode`, `SetModel`, `ListModels`, `ListModelsResponse`, `GetMCPStatus`, `RewindFiles`, and `Close`.
- Unsupported peer-parity controls such as `ReconnectMCPServer`, `ToggleMCPServer`, `StopTask`, and `SendToolResult` are present on `Client` and return typed `UnsupportedControlError`s.
- `UserMessageContent` is the canonical input shape. Use `Text(...)` for text-only calls and `Blocks(...)` with `ImageInput(...)`, `FileInput(...)`, `AudioInput(...)`, or `VideoInput(...)` for multimodal chat-completions requests.
- `WithSDKTools(...)` registers high-level in-process tools under `mcp__sdk__<name>`.
- `WithOnUserInput(...)` handles SDK-owned user-input prompts built on top of tool calling.
- `ListModels(...)` and `ListModelsResponse(...)` use vLLM model discovery via `/v1/models`.
- `StatSession(...)`, `ListSessions(...)`, and `GetSessionMessages(...)` operate on the SDK's local persisted session store.
- Discovery uses `/v1/models`.
- Returned `ModelInfo` values are projected from the OpenAI-compatible model cards that vLLM serves, so provider-rich metadata is no longer guaranteed.
- `ModelInfo` still exposes helper methods such as `CostTier()`, `SupportsToolCalling()`, `SupportsStructuredOutput()`, `SupportsReasoning()`, `SupportsImageInput()`, `SupportsImageOutput()`, `SupportsWebSearch()`, `SupportsPromptCaching()`, `MaxContextLength()`, and parsed pricing helpers.
- Generated images are surfaced as `*ImageBlock` values inside `AssistantMessage.Content`.
- `ImageBlock.Decode()` returns raw bytes plus media type for data-URL-backed images.
- `ImageBlock.Save(path)` writes generated images to disk.
- Live image-generation coverage is available behind the integration build tag when `VLLM_IMAGE_MODEL` is set.
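For data-URL-backed images, decoding amounts to splitting the declared media type from the base64 payload. A stdlib sketch of what a `Decode()`-style helper does, using a hypothetical `parseDataURL` (not the SDK's implementation):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// parseDataURL splits a "data:<media>;base64,<payload>" URL into raw
// bytes and the declared media type.
func parseDataURL(s string) ([]byte, string, error) {
	rest, ok := strings.CutPrefix(s, "data:")
	if !ok {
		return nil, "", fmt.Errorf("not a data URL")
	}
	meta, payload, ok := strings.Cut(rest, ",")
	if !ok {
		return nil, "", fmt.Errorf("malformed data URL")
	}
	mediaType := strings.TrimSuffix(meta, ";base64")
	raw, err := base64.StdEncoding.DecodeString(payload)
	if err != nil {
		return nil, "", err
	}
	return raw, mediaType, nil
}

func main() {
	raw, mt, err := parseDataURL("data:image/png;base64,aGVsbG8=")
	if err != nil {
		panic(err)
	}
	fmt.Println(mt, len(raw)) // media type and decoded payload size
}
```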
Multimodal input in this SDK is block-based and targets the vLLM OpenAI-compatible chat surface.
```go
content := vllmsdk.Blocks(
	vllmsdk.TextInput("Compare these two screenshots and the attached spec file."),
	vllmsdk.ImageInput("https://example.com/before.png"),
	vllmsdk.ImageInput("data:image/png;base64,..."),
	vllmsdk.FileInput("spec.pdf", "data:application/pdf;base64,..."),
)

for msg, err := range vllmsdk.Query(ctx, content,
	// vllmsdk.WithModel("QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ"),
) {
	_ = msg
	_ = err
}
```

- `ImageInput(...)` accepts a normal URL or a base64 data URL.
- `FileInput(...)` accepts a filename plus a `file_data` URL/data URL.
- `AudioInput(...)` accepts base64 audio data plus a format.
- `VideoInput(...)` accepts a normal URL or a data URL.
- Responses mode is routed to the vLLM `/v1/responses` surface when selected.
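Raw bytes can be wrapped into the data-URL form these helpers accept. A sketch with a hypothetical `toDataURL` helper; the stand-in bytes are not a real PDF:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// toDataURL encodes raw bytes as a base64 data URL with the given
// media type, suitable for data-URL arguments like FileInput's.
func toDataURL(mediaType string, raw []byte) string {
	return "data:" + mediaType + ";base64," + base64.StdEncoding.EncodeToString(raw)
}

func main() {
	pdf := []byte("%PDF-1.7 ...") // stand-in bytes, not a real PDF
	fmt.Println(toDataURL("application/pdf", pdf))
}
```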
Session APIs are local SDK APIs, not remote vLLM server sessions.
- They read from the SDK session store configured with `WithSessionStorePath(...)` or `VLLM_AGENT_SESSION_STORE_PATH`.
- They do not derive from chat `session_id`.
- They do not derive from Responses `previous_response_id`.
vLLM does not have meaningful backend equivalents for some sibling control-plane methods. The SDK exposes those methods where peer parity matters, but they fail explicitly with `UnsupportedControlError` instead of faking semantics.
The SDK provides opt-in OpenTelemetry metrics and distributed tracing. When no provider is configured, all recording is a no-op with zero overhead.
| Option | Description |
|---|---|
| `WithMeterProvider(mp)` | Sets an OTel `metric.MeterProvider` for SDK metrics |
| `WithTracerProvider(tp)` | Sets an OTel `trace.TracerProvider` for SDK spans |
| `WithPrometheusRegisterer(reg)` | Convenience: creates an OTel `MeterProvider` backed by a Prometheus `Registerer` |
GenAI semantic convention metrics:

| Metric | Type | Description |
|---|---|---|
| `gen_ai.client.operation.duration` | Histogram (s) | Duration of query operations |
| `gen_ai.client.token.usage` | Counter | Token usage by type (input/output) |
| `gen_ai.client.time_to_first_token` | Histogram (s) | Time to first content token |
| `gen_ai.client.time_per_output_token` | Histogram (s) | Inter-token arrival time |
vLLM-specific metrics:

| Metric | Type | Description |
|---|---|---|
| `vllm.http.requests` | Counter | HTTP requests by status class and retry |
| `vllm.tool.calls` | Counter | Tool calls by name and outcome |
| `vllm.tool.duration` | Histogram (s) | Tool call duration |
| `vllm.checkpoint.operations` | Counter | Checkpoint create/restore operations |
| `vllm.model.load_errors` | Counter | Model listing errors |
| `vllm.hook.duration` | Histogram (s) | Hook execution duration by event |
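The status-class attribute on `vllm.http.requests` is a coarse bucket of the HTTP status code. A sketch of one plausible bucketing (the SDK's exact label scheme is an assumption):

```go
package main

import "fmt"

// statusClass buckets an HTTP status code into a coarse class label
// usable as a low-cardinality metric attribute ("2xx", "4xx", "5xx").
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

func main() {
	for _, c := range []int{200, 404, 503} {
		fmt.Println(c, statusClass(c))
	}
}
```

Bucketing keeps metric cardinality bounded regardless of how many distinct status codes the server returns.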
| Span | Kind | Description |
|---|---|---|
| `gen_ai.query` | Client | Root span per `Query`/`QueryStream` call |
| `gen_ai.stream` | Internal | Streaming request (child of query) |
| `http.request` | Client | Individual HTTP request |
| `tool.execute` | Internal | Tool invocation |
| `hook.run` | Internal | Hook dispatch |
| `vllm.list_models` | Client | Model listing HTTP call |
```go
reg := prometheus.NewRegistry()

for msg, err := range vllmsdk.Query(ctx,
	vllmsdk.Text("Hello"),
	vllmsdk.WithPrometheusRegisterer(reg),
	vllmsdk.WithModel("my-model"),
) {
	// ...
	_ = msg
	_ = err
}

// Serve metrics
http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
```

See `examples/prometheus_observability` for a complete working example.
Runnable examples live under `examples`.