azure-foundry-provider is a highly specialized, production-grade AI SDK provider designed for Azure AI Foundry and Azure OpenAI-compatible endpoints.
This repository also includes packages/azure-foundry-inspect, a standalone read-only utility that uses the Azure Cognitive Services management SDK to inspect one Azure Foundry resource and emit JSON or HTML for its account metadata, deployments, usages, and model capacities.
While many providers rely on fragile model-name heuristics to decide how to route requests, this provider is built on a URL-first routing architecture. By treating the full Azure endpoint URL (including query parameters) as the single source of truth, routing, API versions, and operation modes are resolved deterministically. This eliminates "magic" string matching and makes the provider a strong fit for enterprise environments where reliability is non-negotiable.
The implementation features a sophisticated quota management system known as the Governor. It goes beyond static limits by implementing adaptive throttling that parses real-time x-ratelimit-* headers directly from Azure responses. This allows the provider to dynamically apply "soft" or "hard" cooldowns, preventing 429 failures before they occur. Combined with jittered exponential backoff and abort-aware request queueing, the Governor ensures robust performance even under high-concurrency workloads.
A standout feature of the provider is its operation-mismatch fallback for languageModel(...). Azure's model-operation compatibility can vary; when languageModel(...) is routed through the wrong operation because mode came from URL inference or global apiMode, the provider can detect a known Azure operation-mismatch error and retry exactly once through the opposite transport. This recovery stays disabled for strict transport accessors and for models with an explicit per-model apiMode.
The codebase adheres to the highest standards of modern TypeScript development, utilizing strict compiler configurations to eliminate boundary errors between the Azure API and your application. The architecture is highly modular, with specialized components for request sanitization and error analysis, and is backed by a comprehensive test suite (>90% coverage) that utilizes deterministic time-injection to verify complex throttling logic.
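The deterministic time-injection pattern mentioned above can be sketched as follows. This is a generic illustration, not the provider's actual test harness; the `Clock` type and `RateWindow` class are hypothetical names:

```typescript
// Hypothetical sketch of deterministic time injection for testing
// throttling logic: the unit under test receives a now() function
// instead of calling Date.now() directly.
type Clock = () => number

class RateWindow {
  private stamps: number[] = []
  constructor(private readonly windowMs: number, private readonly now: Clock) {}

  // Record one request and return how many fall inside the window.
  hit(): number {
    const t = this.now()
    this.stamps = this.stamps.filter((s) => s > t - this.windowMs)
    this.stamps.push(t)
    return this.stamps.length
  }
}

// In tests, a manually advanced clock makes window expiry deterministic.
let fakeNow = 0
const clock: Clock = () => fakeNow

const win = new RateWindow(60_000, clock)
console.log(win.hit()) // 1
fakeNow += 30_000
console.log(win.hit()) // 2 (first stamp still inside the 60s window)
fakeNow += 40_000
console.log(win.hit()) // 2 (stamp at t=0 has aged out)
```

Because the clock never depends on wall time, assertions about expiry and cooldown windows cannot flake.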
- URL-first routing from copied Azure endpoint URLs (no model-name endpoint heuristics)
- Supports both `chat` and `responses` operation paths
- Supports Azure v1 operation paths and the `/openai/v1` base root with mode-driven routing
- Chat transport uses OpenAI-compatible semantics (system role remains `system`, `max_tokens` is used)
- Request policy control for tools (`auto`, `off`, `on`)
- Built-in retries and 429 handling with exponential backoff + jitter
- Event-driven waiter queue for `maxConcurrent` admission (wake-on-release, abort-aware waits)
- Optional static quota controls (`rpm`, `tpm`, `maxConcurrent`, `maxOutputTokensCap`)
- Adaptive throttling from Azure `x-ratelimit-*` headers
- Request sanitization for chat history compatibility (removes assistant `reasoning_content`/`reasoning` fields)
- Automatic bidirectional fallback in `languageModel(...)` for known operation-mismatch errors when mode is global or inferred
- Optional observability callbacks (`onRetry`, `onAdaptiveCooldown`, `onSanitizedRetry`, `onFallback`)
- Timeout support via `AbortSignal`
Accepted hostnames:

- `*.services.ai.azure.com`
- `*.cognitiveservices.azure.com`
- `*.openai.azure.com`

Accepted operation suffixes:

- `/models/chat/completions`
- `/chat/completions`
- `/responses`
- `/openai/v1/chat/completions`
- `/openai/v1/responses`
- `/openai/v1` (base root, requires `apiMode` configuration)

Examples:

- `https://<id>.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview`
- `https://<res>.cognitiveservices.azure.com/openai/chat/completions?api-version=preview`
- `https://<res>.cognitiveservices.azure.com/openai/responses?api-version=preview`
- `https://<res>.openai.azure.com/openai/chat/completions?api-version=2024-05-01-preview`
- `https://<res>.openai.azure.com/openai/v1/chat/completions`
- `https://<res>.services.ai.azure.com/openai/v1/responses`
- `https://<res>.cognitiveservices.azure.com/openai/v1`

Note on `api-version`:

- `/models/chat/completions` requires `api-version`.
- `/openai/v1/*` endpoints do not require `api-version`.
- The `/openai/v1` base endpoint requires an effective `apiMode` (global `apiMode` or per-model `modelOptions[modelId].apiMode`).
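The routing rules above can be sketched as a pure function over the endpoint URL. This is a simplified illustration of the idea, not the provider's actual parser; `inferMode` and the `"needs-apiMode"` sentinel are hypothetical names:

```typescript
// Simplified sketch of URL-first mode inference: the operation mode is
// derived from the endpoint path suffix, never from the model name.
type ApiMode = "chat" | "responses" | "needs-apiMode"

function inferMode(endpoint: string): ApiMode {
  const path = new URL(endpoint).pathname
  if (path.endsWith("/chat/completions")) return "chat" // covers /models/... and /openai/v1/...
  if (path.endsWith("/responses")) return "responses"
  if (path.endsWith("/openai/v1")) return "needs-apiMode" // base root: apiMode config required
  throw new Error(`Unsupported operation suffix: ${path}`)
}

inferMode("https://r.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview") // "chat"
inferMode("https://r.openai.azure.com/openai/v1/responses") // "responses"
inferMode("https://r.cognitiveservices.azure.com/openai/v1") // "needs-apiMode"
```

Because only the path suffix is consulted, query parameters and resource names cannot change the routing decision.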
```ts
import { createAzureFoundryProvider } from "azure-foundry-provider"

const provider = createAzureFoundryProvider({
  endpoint:
    "https://my-resource.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview",
  apiKey: process.env.AZURE_API_KEY,
})

const model = provider.languageModel("DeepSeek-V3.1")
```

```json
{
  "provider": {
    "azure-foundry": {
      "name": "Azure Foundry",
      "npm": "file:///usr/local/bun/providers/azure-foundry-provider/src/index.ts",
      "models": {
        "deepseek-v3.1": {
          "id": "DeepSeek-V3.1",
          "name": "DeepSeek V3.1",
          "tool_call": false,
          "reasoning": false,
          "limit": { "context": 64000, "output": 1024 },
          "modalities": { "input": ["text"], "output": ["text"] }
        }
      },
      "options": {
        "endpoint": "https://<id>.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview",
        "apiKey": "{env:AZURE_API_KEY}",
        "timeout": 90000,
        "quota": {
          "adaptive": {
            "enabled": true
          }
        }
      }
    }
  }
}
```

OpenCode supports switching reasoning effort levels (variants) for reasoning-capable models. When configured correctly, the Azure Foundry provider passes variant options through to the model automatically.
- Models with `reasoning: true` expose variant options in OpenCode's UI
- Variants are passed as `providerOptions` to the model at call time

Set `reasoning: true` and the appropriate `npm` value. OpenCode will auto-generate variants based on the npm package:
```json
{
  "provider": {
    "azure-foundry": {
      "name": "Azure Foundry",
      "npm": "@ai-sdk/openai-compatible",
      "models": {
        "gpt-5.4-mini": {
          "id": "gpt-5.4-mini",
          "name": "GPT-5.4 Mini",
          "reasoning": true,
          "tool_call": true,
          "limit": { "context": 128000, "output": 16000 }
        }
      },
      "options": {
        "endpoint": "https://YOUR-ENDPOINT.services.ai.azure.com/openai/v1/chat/completions",
        "apiKey": "{env:AZURE_API_KEY}"
      }
    }
  }
}
```

Auto-generated variant shapes by `npm` value:
| npm value | Auto-generated variants | Options per variant |
|---|---|---|
| `@ai-sdk/openai-compatible` | `low`, `medium`, `high` | `{ reasoningEffort }` |
| `@ai-sdk/azure` | `minimal`, `low`, `medium`, `high` | `{ reasoningEffort, reasoningSummary: "auto", include: ["reasoning.encrypted_content"] }` |
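Conceptually, a variant is just a named bag of options merged into the call's provider options. The sketch below illustrates that merge under simplified assumptions; the `resolveCallOptions` helper is hypothetical and OpenCode's real implementation may differ:

```typescript
// Hypothetical sketch: resolving a variant name to call-time options.
type Variant = Record<string, unknown>

const variants: Record<string, Variant> = {
  low: { reasoningEffort: "low" },
  high: { reasoningEffort: "high" },
  custom: { reasoningEffort: "high", temperature: 0.5 },
}

function resolveCallOptions(variantName: string, base: Variant = {}): Variant {
  const variant = variants[variantName]
  if (!variant) throw new Error(`Unknown variant: ${variantName}`)
  // Variant fields override base options.
  return { ...base, ...variant }
}

resolveCallOptions("custom", { maxOutputTokens: 1024 })
// → { maxOutputTokens: 1024, reasoningEffort: "high", temperature: 0.5 }
```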
Define variants explicitly in the model config:
```json
{
  "provider": {
    "azure-foundry": {
      "name": "Azure Foundry",
      "npm": "@ai-sdk/openai-compatible",
      "models": {
        "my-reasoning-model": {
          "id": "my-reasoning-model",
          "name": "My Reasoning Model",
          "reasoning": true,
          "limit": { "context": 128000, "output": 16000 },
          "variants": {
            "low": { "reasoningEffort": "low" },
            "medium": { "reasoningEffort": "medium" },
            "high": { "reasoningEffort": "high" },
            "custom": { "reasoningEffort": "high", "temperature": 0.5 }
          }
        }
      },
      "options": {
        "endpoint": "https://YOUR-ENDPOINT.services.ai.azure.com/openai/v1/chat/completions",
        "apiKey": "{env:AZURE_API_KEY}"
      }
    }
  }
}
```

```jsonc
{
  "provider": {
    "<provider-name>": {
      "models": {
        "<model-alias>": {
          "reasoning": true, // Required to enable the variant switching UI
          "variants": {
            // Optional: override auto-generated variants
            "<variant-name>": {
              // Any valid providerOptions for the model
              "reasoningEffort": "low|medium|high|minimal"
              // Additional options as needed
            }
          }
        }
      }
    }
  }
}
```

- `reasoning: true`: required to show the variant selector in OpenCode's UI
- `npm` field: determines the auto-generated variant structure if you don't define custom variants
- Custom variants: fully customizable; any fields in the variant object are merged into call options
- No provider changes needed: the Azure Foundry provider passes `providerOptions` through automatically
Variants don't appear in the OpenCode UI:

- Ensure `reasoning: true` is set on the model
- Check that the model is visible in OpenCode's model list

Variant options don't take effect:

- Verify the variant name matches what's configured
- Check that `providerOptions` are being passed (enable the `onRetry` callback to inspect)
This repo ships a derived OpenCode schema at `schemas/opencode.azure-foundry.schema.json`.
It composes with the upstream OpenCode schema at https://opencode.ai/config.json and adds IntelliSense for Azure Foundry provider entries whose `npm` value contains `azure-foundry-provider`.
Use it by pointing your local config at the schema file:
```json
{
  "$schema": "./schemas/opencode.azure-foundry.schema.json"
}
```

Or map it in `.vscode/settings.json`:
```json
{
  "json.schemas": [
    {
      "fileMatch": ["opencode.json"],
      "url": "./schemas/opencode.azure-foundry.schema.json"
    }
  ]
}
```

Notes:
- The derived schema keeps upstream OpenCode validation and only narrows `provider.<name>.options` for Azure Foundry provider entries.
- Runtime-only options such as `fetch` and callback hooks are intentionally omitted because they are not JSON-serializable OpenCode config values.
- `modelOptions` and `quota.models` are keyed by the runtime model id the provider receives. OpenCode examples may use aliases or ids, so treat those keys carefully.
- Endpoint semantics, such as `/openai/v1` requiring an effective `apiMode` and `/models/chat/completions` requiring `api-version`, are documented in schema descriptions, but JSON Schema cannot fully enforce all runtime constraints.
Creates a provider that implements the AI SDK `ProviderV2` interface plus convenience methods:

- `provider(modelId)`
- `provider.languageModel(modelId)`
- `provider.chat(modelId)`
- `provider.responses(modelId)`

Unsupported model families intentionally throw `NoSuchModelError`:

- `provider.textEmbeddingModel(modelId)`
- `provider.imageModel(modelId)`
```ts
type AzureFoundryOptions = {
  endpoint?: string
  apiKey?: string
  headers?: Record<string, string>
  apiMode?: "chat" | "responses"
  toolPolicy?: "auto" | "off" | "on"
  timeout?: number | false
  quota?: QuotaOptions
  cooldownScope?: "global" | "per-model"
  assistantReasoningSanitization?: "auto" | "always" | "never"
  modelOptions?: Record<
    string,
    {
      apiMode?: "chat" | "responses"
      assistantReasoningSanitization?: "auto" | "always" | "never"
    }
  >
  onRetry?: (event: RetryEvent) => void
  onAdaptiveCooldown?: (event: AdaptiveCooldownEvent) => void
  onSanitizedRetry?: (event: SanitizedRetryEvent) => void
  onFallback?: (event: FallbackEvent) => void
  fetch?: FetchFunction
  name?: string
}
```

`endpoint`:

- Full URL to the Azure endpoint.
- If omitted, loads from `AZURE_FOUNDRY_ENDPOINT`.
- Must use `https://`.
- `/models/chat/completions` requires the `api-version` query.
`apiKey`:

- API key value.
- If omitted, loads from `AZURE_API_KEY`.

`headers`:

- Extra headers to include on every request.
- If `Authorization` or `api-key` is present, the provider does not inject `api-key` automatically.

`apiMode`:

- Optional override: `"chat"` or `"responses"`.
- If omitted, the mode is inferred from the URL path.
- The override rewrites only the operation suffix while preserving the origin, path prefix, and query params.
`toolPolicy` (default `"auto"`):

- Optional override for tool request policy.
- `"auto"`: pass-through.
- `"off"`: strips tools and enforces `toolChoice: { type: "none" }`.
- `"on"`: if tools exist and tool choice is not fixed, forces `toolChoice: { type: "required" }`.

`timeout`:

- `number`: request timeout in milliseconds.
- `false`: explicitly disables the timeout.
- `undefined`: no timeout wrapper.

`quota`:

- Static quota limits plus retry and adaptive throttling options.

`cooldownScope` (default `"global"`):

- Controls how cooldown is applied after rate-limit pressure.
- `"global"`: one cooldown can pause all models using this provider instance.
- `"per-model"`: cooldown is isolated to the model that triggered it.

`assistantReasoningSanitization` (default `"auto"`):

- Global policy for assistant reasoning field sanitization.
- `"always"`: sanitize before the first request.
- `"auto"`: send raw first, sanitize only when the endpoint rejects reasoning fields.
- `"never"`: never sanitize.
`modelOptions`:

- Model-specific overrides for provider behavior.
- Supports per-model `apiMode` and `assistantReasoningSanitization`.

`onRetry`:

- Optional callback emitted before retry waits on retryable responses.
- Use it to understand retry pressure and tune retry/quota settings.
- Event contract: `{ eventVersion: "v1", phase: "retry", attempt, reason, status?, retryAfterMs?, modelId? }`.

`onAdaptiveCooldown`:

- Optional callback emitted when adaptive rate-limit headers trigger a cooldown.
- Use it to correlate cooldown windows with endpoint pressure.
- Event contract: `{ eventVersion: "v1", phase: "adaptive_cooldown", cooldownMs, reason, remainingRequests?, remainingTokens?, modelId? }`.

`onSanitizedRetry`:

- Optional callback emitted when `assistantReasoningSanitization: "auto"` retries after a schema rejection.
- Use it to identify strict endpoints/models that should move to per-model `"always"` sanitization.
- Event contract: `{ eventVersion: "v1", phase: "sanitized_retry", reason, sanitizedFields, status?, modelId? }`.

`onFallback`:

- Optional callback emitted when `languageModel(...)` falls back across transports on a known operation mismatch.
- Use it to find models that should be explicitly configured for their correct mode.
- Event contract: `{ eventVersion: "v1", phase: "fallback", fromMode: "chat" | "responses", toMode: "chat" | "responses", reason, status?, modelId? }`.
- Fallback remains disabled for strict transport accessors and when `modelOptions[modelId].apiMode` is explicitly set.

`fetch`:

- Custom fetch implementation.

`name` (default `"azure-foundry"`):

- Provider id prefix in `model.provider` (for diagnostics).
- `/chat/completions` or `/models/chat/completions` -> `chat`
- `/responses` -> `responses`
If the URL path and apiMode differ, the provider rewrites the operation suffix and keeps:
- hostname/origin
- any path prefix before operation suffix
- all query parameters in original order
A per-model mode override is also supported via `modelOptions[modelId].apiMode` and takes precedence over the global `apiMode`.
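The suffix rewrite described above can be sketched as follows. This is an illustrative approximation, not the provider's exact code; `OP_SUFFIXES` and `rewriteMode` are hypothetical names, and the real rewrite handles the full set of accepted suffixes:

```typescript
// Sketch: rewrite only the operation suffix, preserving the origin,
// any path prefix, and the query string.
const OP_SUFFIXES = ["/models/chat/completions", "/chat/completions", "/responses"]

function rewriteMode(endpoint: string, mode: "chat" | "responses"): string {
  const url = new URL(endpoint)
  const suffix = OP_SUFFIXES.find((s) => url.pathname.endsWith(s))
  if (!suffix) throw new Error(`No known operation suffix in ${url.pathname}`)
  const prefix = url.pathname.slice(0, url.pathname.length - suffix.length)
  const newSuffix = mode === "chat" ? "/chat/completions" : "/responses"
  url.pathname = prefix + newSuffix
  return url.toString() // query parameters survive untouched
}

rewriteMode(
  "https://r.cognitiveservices.azure.com/openai/chat/completions?api-version=preview",
  "responses",
)
// → "https://r.cognitiveservices.azure.com/openai/responses?api-version=preview"
```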
`provider.languageModel(modelId)` can retry exactly once through the opposite transport when Azure returns a known operation-mismatch error for the attempted operation, for example:

- `The chatCompletion operation does not work with the specified model ...`
- a corresponding responses-operation rejection for `/responses`

Fallback is allowed only when the first attempt's mode came from:

- URL inference
- the global `apiMode`

Fallback is disabled when:

- `modelOptions[modelId].apiMode` is explicitly set
- you use `provider.chat(modelId)`
- you use `provider.responses(modelId)`

Guardrails:

- generic `400` errors do not trigger fallback
- fallback requires a known Azure operation-mismatch signal for the operation that was actually attempted
- the retry happens at most once and only across the opposite transport
This keeps explicit per-model mode selection and transport-specific accessors deterministic while improving resilience for mixed model setups.
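The eligibility rules can be condensed into a single guard function. This is an illustrative sketch; the `ModeSource` union, the assumption that mismatches arrive as HTTP 400, and the error-matching regex are assumptions, not the provider's actual internals:

```typescript
// Sketch: fallback is allowed only for inferred/global modes and only
// on a known Azure operation-mismatch signal, never on a generic 400.
type ModeSource = "url-inference" | "global-apiMode" | "per-model-apiMode" | "strict-accessor"

function fallbackAllowed(source: ModeSource, errorMessage: string, status: number): boolean {
  // Explicit per-model modes and transport-specific accessors stay deterministic.
  if (source === "per-model-apiMode" || source === "strict-accessor") return false
  // Assumed: mismatch rejections surface as HTTP 400.
  if (status !== 400) return false
  // A generic 400 is not enough; require a known mismatch signal.
  return /operation does not work with the specified model/i.test(errorMessage)
}

fallbackAllowed(
  "url-inference",
  "The chatCompletion operation does not work with the specified model, DeepSeek-V3.1",
  400,
) // true
fallbackAllowed("per-model-apiMode", "The chatCompletion operation does not work ...", 400) // false
```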
The provider can emit callback events for retry, adaptive cooldown, sanitization retry, and cross-transport fallback decisions.
Why these callbacks exist:
- explain provider decisions in production without changing transport behavior
- help tune retry/quota/sanitization settings from real runtime signals
- make it easier to detect model/endpoint mismatches early
Callbacks are optional. If you do not set them, behavior is unchanged.
| Callback | Required fields | Optional fields | Typical reasons |
|---|---|---|---|
| `onRetry` | `eventVersion`, `phase`, `attempt`, `reason` | `status`, `retryAfterMs`, `modelId` | `status_429`, `retryable_status` |
| `onAdaptiveCooldown` | `eventVersion`, `phase`, `cooldownMs`, `reason` | `remainingRequests`, `remainingTokens`, `modelId` | `requests_depleted`, `tokens_depleted`, `low_watermark` |
| `onSanitizedRetry` | `eventVersion`, `phase`, `reason`, `sanitizedFields` | `status`, `modelId` | `schema_rejection` |
| `onFallback` | `eventVersion`, `phase`, `fromMode`, `toMode`, `reason` | `status`, `modelId` | `chat_operation_mismatch`, `responses_operation_mismatch` |
Callback payloads are metadata-only. They intentionally exclude:
- raw headers
- raw request/response bodies
- API keys and bearer tokens
```ts
import { createAzureFoundryProvider } from "azure-foundry-provider"

const provider = createAzureFoundryProvider({
  endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
  apiKey: process.env.AZURE_API_KEY,
  onRetry: (event) => {
    console.info("provider.retry", event)
  },
  onAdaptiveCooldown: (event) => {
    console.info("provider.cooldown", event)
  },
  onSanitizedRetry: (event) => {
    console.info("provider.sanitized_retry", event)
  },
  onFallback: (event) => {
    console.info("provider.fallback", event)
  },
})
```

```ts
import { createAzureFoundryProvider } from "azure-foundry-provider"

type Metrics = {
  count: (name: string, tags?: Record<string, string>) => void
  histogram: (name: string, value: number, tags?: Record<string, string>) => void
}

const metrics: Metrics = {
  count: (name, tags) => {
    // wire to your metrics backend
  },
  histogram: (name, value, tags) => {
    // wire to your metrics backend
  },
}

const provider = createAzureFoundryProvider({
  endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
  apiKey: process.env.AZURE_API_KEY,
  onRetry: (event) => {
    metrics.count("azure_provider_retry_total", {
      reason: event.reason,
      status: String(event.status ?? "none"),
      model: event.modelId ?? "unknown",
    })
  },
  onAdaptiveCooldown: (event) => {
    metrics.histogram("azure_provider_cooldown_ms", event.cooldownMs, {
      reason: event.reason,
      model: event.modelId ?? "unknown",
    })
  },
  onFallback: (event) => {
    metrics.count("azure_provider_fallback_total", {
      reason: event.reason,
      from: event.fromMode,
      to: event.toMode,
      model: event.modelId ?? "unknown",
    })
  },
})
```

Use callbacks to turn runtime observations into explicit config:
- If `onFallback` fires repeatedly for a model, set `modelOptions[modelId].apiMode` to the mode that model actually requires.
- If `onSanitizedRetry` fires repeatedly for a model, set `modelOptions[modelId].assistantReasoningSanitization` to `"always"`.
- If `onRetry` + `onAdaptiveCooldown` rates are high, tune `quota.retry` and review endpoint capacity.
Priority:

- Use `headers.Authorization` or `headers["api-key"]` if explicitly provided.
- Otherwise inject `api-key` from the `apiKey` option.
- Otherwise inject `api-key` from `AZURE_API_KEY`.

The User-Agent suffix `azure-foundry-provider/<version>` is appended automatically.
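The priority order above can be sketched as a small resolver. This is illustrative only; `resolveAuthHeaders` is a hypothetical name, and it simplifies by treating header keys case-sensitively:

```typescript
// Sketch: resolve auth headers in priority order. Explicit Authorization
// or api-key headers win; otherwise api-key is injected from the option
// or the AZURE_API_KEY environment variable.
function resolveAuthHeaders(
  headers: Record<string, string> = {},
  apiKey?: string,
  env: Record<string, string | undefined> = process.env,
): Record<string, string> {
  const hasExplicit = "Authorization" in headers || "api-key" in headers
  if (hasExplicit) return { ...headers } // user-supplied auth wins; nothing injected
  const key = apiKey ?? env["AZURE_API_KEY"]
  if (!key) throw new Error("No API key available")
  return { ...headers, "api-key": key }
}

resolveAuthHeaders({ Authorization: "Bearer token" }) // unchanged, no api-key injected
resolveAuthHeaders({}, "my-key") // { "api-key": "my-key" }
resolveAuthHeaders({}, undefined, { AZURE_API_KEY: "env-key" }) // { "api-key": "env-key" }
```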
For chat requests, the provider applies compatibility safeguards:

- Preserves the `system` role (no remap to `developer`)
- Uses `max_tokens` for the output budget

Assistant reasoning field sanitization is configurable:

- Global: `assistantReasoningSanitization`
- Per model: `modelOptions[modelId].assistantReasoningSanitization`
- Effective policy precedence: per-model override, then the global policy, then the default `auto`

When sanitization is active, the provider removes assistant fields that break strict endpoints:

- `reasoning_content`
- `reasoning`
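The removal step can be sketched as follows. This is a simplified illustration; the provider's real behavior additionally applies the policy precedence above and in-process retry memory:

```typescript
// Sketch: strip assistant reasoning fields that strict endpoints reject.
type ChatMessage = { role: string; content: string; [key: string]: unknown }

function sanitizeAssistantReasoning(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((m) => {
    if (m.role !== "assistant") return m
    // Drop only the two reasoning fields; everything else passes through.
    const { reasoning_content: _rc, reasoning: _r, ...rest } = m
    return rest as ChatMessage
  })
}

const history: ChatMessage[] = [
  { role: "user", content: "hi" },
  { role: "assistant", content: "hello", reasoning_content: "the user greeted me" },
]
sanitizeAssistantReasoning(history)
// → the assistant message keeps role/content; reasoning_content is removed
```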
In `auto` mode, the provider retries once with sanitized assistant fields after a 400 schema-like rejection for reasoning fields, then remembers this path for that model in-process.
Some orchestration stacks include prior assistant thinking in conversation history using fields such as reasoning_content (or similar metadata). Several Azure Foundry chat endpoints reject those assistant fields as unknown/forbidden input. When that happens, requests fail even though the rest of the payload is valid.
When an endpoint supports assistant reasoning fields, preserving them can improve multi-turn behavior:
- Better continuity across long tasks: the model can reuse its prior intermediate plan instead of rebuilding context from scratch.
- More stable tool workflows: follow-up calls can align with earlier tool-selection rationale, reducing unnecessary tool churn.
- Fewer repeated clarifications: prior reasoning state can help the model avoid re-asking already resolved constraints.
- Better long-horizon decomposition: complex tasks that span many turns often benefit when the model can reference previous internal decomposition.
Trade-off: compatibility varies by endpoint and model. Strict validators (notably some Mistral Foundry chat paths) reject `reasoning_content`/`reasoning`, so pass-through is not universally safe.
Practical recommendation:
- Use the global `assistantReasoningSanitization: "auto"`.
- Set strict models to `"always"` via `modelOptions`.
- Use `"never"` only when you know the endpoint accepts reasoning fields and you explicitly want pass-through behavior.
Typical failure pattern on strict models/endpoints (for example Mistral via strict Foundry chat validation):
- HTTP `400 Bad Request`
- validation details mention forbidden extra fields
- error details reference assistant reasoning fields in the message history, for example:
  - `type: "extra_forbidden"`
  - a location like `messages[*].assistant.reasoning_content`
Concrete example observed:
```json
{
  "detail": [
    {
      "type": "extra_forbidden",
      "loc": ["body", "messages", 2, "assistant", "reasoning_content"],
      "msg": "Extra inputs are not permitted"
    },
    {
      "type": "extra_forbidden",
      "loc": ["body", "messages", 4, "assistant", "reasoning_content"],
      "msg": "Extra inputs are not permitted"
    }
  ]
}
```

If you observe this class of error, set model-specific sanitization to `"always"` for that model to skip the first failing roundtrip and improve latency.
```ts
const provider = createAzureFoundryProvider({
  endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
  apiKey: process.env.AZURE_API_KEY,
  assistantReasoningSanitization: "auto",
  modelOptions: {
    "Mistral-Large-3": {
      assistantReasoningSanitization: "always",
    },
  },
})
```

Policy guidance:

- `auto`: best default when model behavior is mixed/unknown.
- `always`: best for models you know reject assistant reasoning fields (avoids one failed HTTP 400 attempt).
- `never`: only use when the endpoint explicitly supports these fields and you require exact pass-through.
```ts
type QuotaRule = {
  rpm?: number
  tpm?: number
  maxConcurrent?: number
  maxOutputTokensCap?: number
}

type QuotaRetryOptions = {
  maxAttempts?: number
  baseDelayMs?: number
  maxDelayMs?: number
  jitterRatio?: number
  honorRetryAfter?: boolean
  cooldownOn429Ms?: number
}

type QuotaAdaptiveOptions = {
  enabled?: boolean
  minCooldownMs?: number
  lowWatermarkRatio?: number
  lowCooldownMs?: number
}

type QuotaOptions = {
  default?: QuotaRule
  models?: Record<string, QuotaRule>
  retry?: QuotaRetryOptions
  adaptive?: QuotaAdaptiveOptions
}
```

Retry defaults (used unless overridden):
- `maxAttempts: 4`
- `baseDelayMs: 1200`
- `maxDelayMs: 30000`
- `jitterRatio: 0.25`
- `honorRetryAfter: true`
- `cooldownOn429Ms: 10000`

Adaptive defaults (used unless overridden):

- `enabled: true`
- `minCooldownMs: 1000`
- `lowWatermarkRatio: 0.1`
- `lowCooldownMs: 250`
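With those defaults, the retry delay can be sketched as bounded exponential backoff with proportional jitter. This is an illustrative formula; the provider's exact jitter distribution is an assumption:

```typescript
// Sketch: bounded exponential backoff with proportional jitter.
// delay = min(base * 2^(attempt-1), max), then spread by +/- jitterRatio.
function backoffDelayMs(
  attempt: number, // 1-based attempt number
  baseDelayMs = 1200,
  maxDelayMs = 30000,
  jitterRatio = 0.25,
  rng: () => number = Math.random,
): number {
  const exp = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs)
  const jitter = exp * jitterRatio * (rng() * 2 - 1) // uniform in [-ratio, +ratio]
  return Math.max(0, Math.round(exp + jitter))
}

backoffDelayMs(1, 1200, 30000, 0) // 1200
backoffDelayMs(3, 1200, 30000, 0) // 4800
backoffDelayMs(10, 1200, 30000, 0) // 30000 (capped at maxDelayMs)
```

Injecting `rng` keeps the jitter testable, mirroring the deterministic-time approach used elsewhere in the codebase.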
Static limits (`default` and `models`) are opt-in only.

- Queues requests when any configured limit would be exceeded.
- Supports per-model overrides keyed by the model id string in the request body.
- Applies output token clamping via `maxOutputTokensCap`.
- Retries retryable statuses (`429`, `408`, `500`, `502`, `503`, `504`) with bounded backoff.
- Honors `Retry-After` on `429` when enabled.
- Uses adaptive cooldown from headers when near or at the budget floor:
  - `x-ratelimit-limit-requests`
  - `x-ratelimit-limit-tokens`
  - `x-ratelimit-remaining-requests`
  - `x-ratelimit-remaining-tokens`
- Waits are abort-aware; canceled requests do not stay queued indefinitely.
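The adaptive decision can be sketched from those headers using the documented `lowWatermarkRatio` semantics. This is a simplified model; the provider's exact thresholds and combination rules may differ:

```typescript
// Sketch: decide a cooldown from x-ratelimit-* response headers.
// Depleted budget -> minCooldownMs; budget under the low watermark -> lowCooldownMs.
function adaptiveCooldownMs(
  headers: Record<string, string>,
  minCooldownMs = 1000,
  lowWatermarkRatio = 0.1,
  lowCooldownMs = 250,
): number {
  const num = (name: string) => Number(headers[name] ?? NaN)
  const check = (remaining: number, limit: number): number => {
    if (!Number.isFinite(remaining) || !Number.isFinite(limit) || limit <= 0) return 0
    if (remaining <= 0) return minCooldownMs // budget depleted
    if (remaining / limit <= lowWatermarkRatio) return lowCooldownMs // near the floor
    return 0
  }
  // Take the stronger of the request-budget and token-budget signals.
  return Math.max(
    check(num("x-ratelimit-remaining-requests"), num("x-ratelimit-limit-requests")),
    check(num("x-ratelimit-remaining-tokens"), num("x-ratelimit-limit-tokens")),
  )
}

adaptiveCooldownMs({ "x-ratelimit-remaining-requests": "0", "x-ratelimit-limit-requests": "60" }) // 1000
adaptiveCooldownMs({ "x-ratelimit-remaining-tokens": "500", "x-ratelimit-limit-tokens": "10000" }) // 250
```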
The governor maintains sliding windows for request-rate and token-rate accounting. These windows are pruned frequently, so their data structure matters for both throughput and latency.
In a naive FIFO queue, repeatedly pruning old entries with `shift()` can become expensive because each `shift()` reindexes the remaining array. Under sustained load, this adds avoidable CPU overhead in a hot path.
To avoid that, the provider uses head-index pruning:
- events are appended to arrays (`requests`, `tokens`)
- pruning advances a head pointer (`requestHead`, `tokenHead`) instead of shifting array elements
- the active window length is computed from `array.length - head`
- periodic compaction trims consumed prefixes when head growth crosses thresholds
This keeps pruning cost proportional to the number of expired entries without repeated reindexing work.
For each model window, the governor tracks:
- `requests`: request timestamps for RPM checks
- `requestHead`: start index of the currently active request timestamps
- `tokens`: `{ at, tokens }` events for TPM checks
- `tokenHead`: start index of the currently active token events
At each acquire loop:
- Calculate the minimum active timestamp (`now - windowMs`).
- Advance `requestHead` while old request timestamps are out of the window.
- Advance `tokenHead` while old token events are out of the window.
- Evaluate waits (`maxConcurrent`, RPM, TPM) against the active slices.
- Append the current event on successful admission.
Compaction policy:
- when the head index grows large relative to the array size, the window compacts the live slice and resets the head to `0`
- this avoids unbounded stale prefix growth while preserving simple, predictable behavior
- Lower CPU churn in prune-heavy workloads.
- Better tail latency stability when many requests age out in bursts.
- No change to external quota semantics; this is an internal queue-maintenance optimization.
```ts
type TokenEvent = { at: number; tokens: number }

let requests: number[] = []
let requestHead = 0
let tokens: TokenEvent[] = []
let tokenHead = 0

function prune(now: number, windowMs: number) {
  const min = now - windowMs
  while (requestHead < requests.length && requests[requestHead]! < min) {
    requestHead += 1
  }
  while (tokenHead < tokens.length && tokens[tokenHead]!.at < min) {
    tokenHead += 1
  }
}

function activeRequests() {
  return requests.slice(requestHead)
}

function activeTokens() {
  return tokens.slice(tokenHead)
}
```

The real implementation also adds bounded compaction and integrates these active windows directly into RPM/TPM wait calculations.
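The bounded compaction might look like the following. The thresholds here are hypothetical; the real policy is internal to the provider:

```typescript
// Sketch: when the consumed prefix grows past a threshold, copy the live
// slice down and reset the head index, bounding stale memory.
function maybeCompact<T>(arr: T[], head: number, minHead = 1024): { arr: T[]; head: number } {
  // Compact only when at least minHead entries are consumed and the
  // consumed prefix makes up at least half of the array.
  if (head >= minHead && head * 2 >= arr.length) {
    return { arr: arr.slice(head), head: 0 }
  }
  return { arr, head }
}

const big = Array.from({ length: 3000 }, (_, i) => i)
const compacted = maybeCompact(big, 2000)
// compacted.arr.length === 1000, compacted.head === 0, compacted.arr[0] === 2000
```

Requiring both conditions keeps compaction amortized: small windows are never copied, and large windows are copied at most once per `minHead` consumed entries.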
When maxConcurrent is configured, the governor must decide when waiting requests are allowed to start. The provider uses an event-driven waiter queue for this path.
A polling approach (for example waking every fixed interval) adds avoidable wakeups and increases contention jitter. Under sustained load, polling can make latency less predictable because many wait cycles are spent checking unchanged state.
The event-driven queue removes that polling loop:
- if capacity is available, request is admitted immediately
- if capacity is full, request registers a waiter and sleeps
- when a running request releases capacity, one waiter is signaled
This converts concurrency waiting from timer-driven checks to release-driven notifications.
For each model window, the governor keeps a FIFO waiter list used only for maxConcurrent contention.
Acquire path (simplified):

- Check the current `active` count against `maxConcurrent`.
- If at capacity, enqueue a waiter callback.
- If the `AbortSignal` triggers while queued, remove the waiter and reject with an `AbortError`.
- On wakeup, re-enter the admission checks and proceed when allowed.

Release path (simplified):

- Decrement the `active` count.
- Pop the next waiter from the FIFO queue.
- Signal exactly one waiter to continue.
This preserves deterministic queueing behavior while avoiding broadcast wakeups.
Queued waits remain abort-aware:
- aborted waiters are removed from the queue
- aborted requests do not hold a slot and do not block later waiters
This avoids stale waiters accumulating during client-side timeouts or cancellations.
- Fewer unnecessary timer wakeups in `maxConcurrent` contention scenarios.
- Better tail latency stability compared with fixed-interval polling.
- No change to public quota semantics; only the internal waiting strategy is changed.
```ts
const waiters: Array<() => void> = []
let active = 0
const maxConcurrent = 1

async function acquire(signal?: AbortSignal) {
  while (active >= maxConcurrent) {
    await new Promise<void>((resolve, reject) => {
      const onWake = () => {
        signal?.removeEventListener("abort", onAbort)
        resolve()
      }
      const onAbort = () => {
        const i = waiters.indexOf(onWake)
        if (i >= 0) waiters.splice(i, 1)
        reject(new DOMException("aborted", "AbortError"))
      }
      waiters.push(onWake)
      signal?.addEventListener("abort", onAbort, { once: true })
    })
  }
  active += 1
  return () => {
    active = Math.max(0, active - 1)
    const next = waiters.shift()
    next?.()
  }
}
```

In the provider, this waiter mechanism is integrated with RPM/TPM checks, adaptive cooldown, retry behavior, and abort-aware request handling.
Token-per-minute (TPM) limiting can become expensive if each admission check re-sums all active token events in the window. To keep the hot path stable under load, the provider maintains a rolling token sum per model window.
A scan-based TPM check often looks like this:
- prune old token events
- sum all remaining token values
- compare `sum + pendingTokens` against `tpm`
That repeated summation adds avoidable work at high request rates.
The provider avoids this by tracking a running aggregate:
- append token events on admit
- keep `tokenSum` as the current active-window total
- subtract evicted event values during prune
This makes the common-path accounting constant-time with amortized pruning work.
For each model window, the governor tracks:
- `tokens`: token events (`{ at, tokens }`)
- `tokenHead`: index of the first active token event
- `tokenSum`: running total of active-window token usage
Admission flow for TPM (simplified):
- Prune expired token events (`at < now - windowMs`).
- For each evicted event, decrement `tokenSum`.
- Fast check: if `tokenSum + pendingTokens <= tpm`, admit immediately.
- If over the limit, compute the next admissible time by walking forward from `tokenHead` until the projected sum fits.
- On admit, append the event and increment `tokenSum`.
The fast check is O(1). The fallback walk only occurs when the request is currently over budget.
- The running sum always reflects active-window token usage after prune.
- Each committed token event is added once and removed once.
- Oversized single requests (`pendingTokens > tpm`) keep existing behavior and are not blocked by the TPM wait logic.
- No external API/contract changes: this is internal governor accounting behavior.
- Lower CPU overhead in TPM-heavy workloads.
- More predictable latency during sustained token traffic.
- Fewer full-window token summations in steady-state admission checks.
```ts
type TokenEvent = { at: number; tokens: number }

let tokenEvents: TokenEvent[] = []
let tokenHead = 0
let tokenSum = 0

function prune(now: number, windowMs: number) {
  const min = now - windowMs
  while (tokenHead < tokenEvents.length && tokenEvents[tokenHead]!.at < min) {
    tokenSum -= tokenEvents[tokenHead]!.tokens
    tokenHead += 1
  }
}

function canAdmit(now: number, windowMs: number, tpm: number, pendingTokens: number) {
  prune(now, windowMs)
  return pendingTokens > tpm || tokenSum + pendingTokens <= tpm
}

function commit(now: number, pendingTokens: number) {
  tokenEvents.push({ at: now, tokens: pendingTokens })
  tokenSum += pendingTokens
}
```

The provider combines this accounting with head-index pruning, event-driven `maxConcurrent` waiting, adaptive cooldown, and retry behavior in one admission loop.
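The over-budget fallback walk mentioned in the admission flow can be sketched as follows. The names mirror the sketch above, but the walk itself is a reconstruction, not the provider's exact code:

```typescript
// Sketch: when tokenSum + pendingTokens exceeds tpm, walk forward from
// the head releasing the oldest events until the projection fits; the
// earliest admissible time is when the last released event expires.
type TokenEvent = { at: number; tokens: number }

function nextAdmissibleAt(
  events: TokenEvent[],
  head: number,
  tokenSum: number,
  pendingTokens: number,
  tpm: number,
  windowMs: number,
): number | null {
  let sum = tokenSum
  for (let i = head; i < events.length; i++) {
    if (sum + pendingTokens <= tpm) break
    sum -= events[i]!.tokens // this event must age out first
    if (sum + pendingTokens <= tpm) {
      return events[i]!.at + windowMs // time when that event leaves the window
    }
  }
  return sum + pendingTokens <= tpm ? 0 : null // 0: admissible now; null: never fits
}

const events = [
  { at: 1_000, tokens: 600 },
  { at: 20_000, tokens: 300 },
]
nextAdmissibleAt(events, 0, 900, 200, 1000, 60_000)
// → 61_000 (admissible once the 600-token event at t=1000 ages out)
```

The `null` branch covers oversized single requests (`pendingTokens > tpm`), which, as noted above, the real governor handles separately rather than blocking forever.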
Cooldown is the governor's temporary pause mechanism when rate-limit pressure is detected (for example from 429 handling or adaptive header signals). cooldownScope controls who the pause applies to.
In mixed-model workloads, one model can be much noisier than another. Without scope control, a cooldown triggered by one model can slow unrelated traffic.
Use cooldownScope to choose behavior explicitly:
- `"global"` (default): conservative shared backpressure across the provider instance
- `"per-model"`: isolate cooldown impact to the model that triggered it
`"global"`:

- one cooldown window is shared by all models using the provider instance
- the simplest and most conservative behavior for shared quotas
- best when all models map to the same constrained upstream budget

`"per-model"`:

- cooldown windows are tracked per model id
- model `A` can be paused while model `B` continues if `B` has available budget
- best for mixed-model deployments where isolation matters
Use "global" when:
- you want strict backpressure for the whole provider
- your deployment has one shared quota envelope and fairness between models is less important than global stability
Use "per-model" when:
- one model is frequently rate-limited and should not slow all others
- you run heterogeneous model traffic and want better isolation
- Adaptive throttling (`x-ratelimit-*`) still decides when to apply cooldown.
- The retry policy (`Retry-After`, jitter/backoff, `cooldownOn429Ms`) still decides the delay magnitudes.
- `cooldownScope` changes only the cooldown target (all models vs. the triggering model).
- Default is "global" for backward-compatible behavior.
- Scope is configured per provider instance via cooldownScope.
- If omitted, behavior is identical to the prior global cooldown behavior.
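For intuition, the jittered exponential backoff mentioned in the retry notes can be sketched from the baseDelayMs, maxDelayMs, and jitterRatio options. The provider's exact formula is internal, so treat this as an approximation:

```typescript
// Sketch of a jittered exponential backoff delay: double the base delay per
// attempt, cap at maxDelayMs, then add +/- jitterRatio of random spread.
// The rand parameter is injectable so the math is testable deterministically.
function backoffDelay(
  attempt: number, // 1-based attempt counter
  baseDelayMs: number,
  maxDelayMs: number,
  jitterRatio: number,
  rand: () => number = Math.random,
): number {
  const exp = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1))
  const jitter = exp * jitterRatio * (rand() * 2 - 1) // uniform in [-jitterRatio, +jitterRatio]
  return Math.max(0, Math.round(exp + jitter))
}
```

Injecting the random source mirrors the deterministic time-injection approach the test suite uses for throttling logic.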
const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
quota: {
adaptive: { enabled: true },
retry: { maxAttempts: 4, cooldownOn429Ms: 10_000 },
},
// cooldownScope defaults to "global"
})

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
cooldownScope: "per-model",
quota: {
adaptive: { enabled: true },
retry: { maxAttempts: 4, cooldownOn429Ms: 10_000 },
},
})

To validate your choice in production:
- track onAdaptiveCooldown and onRetry by modelId
- compare cooldown and retry rates before/after changing cooldownScope
- if unrelated models are throttled together too often, switch to "per-model"
const provider = createAzureFoundryProvider({
endpoint:
"https://ais123.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview",
apiKey: process.env.AZURE_API_KEY,
quota: {
adaptive: { enabled: true },
},
})

const provider = createAzureFoundryProvider({
endpoint:
"https://ais123.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview",
apiKey: process.env.AZURE_API_KEY,
timeout: 90_000,
quota: {
default: {
rpm: 6,
tpm: 20_000,
maxConcurrent: 1,
maxOutputTokensCap: 1024,
},
models: {
"Kimi-K2.5": {
rpm: 3,
tpm: 12_000,
maxConcurrent: 1,
maxOutputTokensCap: 768,
},
"Kimi-K2-Thinking": {
rpm: 2,
tpm: 8_000,
maxConcurrent: 1,
maxOutputTokensCap: 640,
},
"Mistral-Large-3": {
rpm: 4,
tpm: 16_000,
maxConcurrent: 1,
maxOutputTokensCap: 1024,
},
},
},
})

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
quota: {
retry: {
maxAttempts: 5,
baseDelayMs: 800,
maxDelayMs: 20_000,
jitterRatio: 0.2,
honorRetryAfter: true,
cooldownOn429Ms: 5000,
},
},
})

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
quota: {
adaptive: {
enabled: true,
minCooldownMs: 1000,
lowWatermarkRatio: 0.1,
lowCooldownMs: 250,
},
},
})

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
quota: {
adaptive: { enabled: false },
},
})

const provider = createAzureFoundryProvider({
endpoint: "https://myres.cognitiveservices.azure.com/openai/chat/completions?api-version=preview",
apiMode: "responses",
apiKey: process.env.AZURE_API_KEY,
})
const model = provider.languageModel("gpt-4.1")

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
toolPolicy: "off",
})

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
toolPolicy: "on",
})

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
headers: {
Authorization: `Bearer ${process.env.AZURE_ACCESS_TOKEN}`,
},
})

// 45s timeout
const providerA = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
timeout: 45_000,
})
// explicitly disable timeout wrapper
const providerB = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
timeout: false,
})

const tracedFetch: typeof fetch = Object.assign(
async (input: RequestInfo | URL, init?: RequestInit) => {
const start = Date.now()
const response = await fetch(input, init)
const ms = Date.now() - start
console.log("Azure call", response.status, `${ms}ms`)
return response
},
{ preconnect: fetch.preconnect },
)
const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
fetch: tracedFetch,
})

import { parseEndpoint } from "azure-foundry-provider"
const parsed = parseEndpoint(
"https://foo.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview&x=1",
)
console.log(parsed.mode) // chat
console.log(parsed.requestURL)

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
assistantReasoningSanitization: "auto",
})

const provider = createAzureFoundryProvider({
endpoint: process.env.AZURE_FOUNDRY_ENDPOINT!,
apiKey: process.env.AZURE_API_KEY,
apiMode: "chat",
assistantReasoningSanitization: "auto",
modelOptions: {
"DeepSeek-V3.1": {
apiMode: "responses",
},
"Mistral-Large-3": {
assistantReasoningSanitization: "always",
},
},
})

const provider = createAzureFoundryProvider({
endpoint: "https://YOUR-RESOURCE.cognitiveservices.azure.com/openai/v1",
apiKey: process.env.AZURE_API_KEY,
apiMode: "chat",
modelOptions: {
"gpt-5.3-codex": {
apiMode: "responses",
},
"Kimi-K2.5": {
apiMode: "responses",
},
},
})

This pattern is useful when a provider points to a single v1 base root but individual models must use different operations.
Global apiMode controls the first attempt for models without a per-model override. If a model has modelOptions[modelId].apiMode, that per-model mode is strict for that model and disables automatic cross-transport recovery.
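The precedence just described can be condensed into a small decision helper. This is an illustrative sketch — the function name, types, and error message are assumptions, not the provider's actual resolver:

```typescript
// Hypothetical first-attempt resolution per the precedence above:
// a per-model apiMode is strict (no fallback); otherwise global apiMode,
// then URL inference, picks the mode and leaves one-shot fallback enabled.
type ApiMode = "chat" | "responses"

function resolveFirstAttempt(
  perModel: ApiMode | undefined,
  globalMode: ApiMode | undefined,
  urlInferred: ApiMode | undefined,
): { mode: ApiMode; fallbackAllowed: boolean } {
  if (perModel) return { mode: perModel, fallbackAllowed: false }
  const mode = globalMode ?? urlInferred
  if (!mode) throw new Error("apiMode required (e.g. for a bare /openai/v1 root)")
  return { mode, fallbackAllowed: true }
}
```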
{
"provider": {
"azure-foundry": {
"name": "Azure Foundry",
"npm": "file:///usr/local/provider/azure-foundry-provider/index.js",
"models": {
"deepseek-v3.1": {
"id": "DeepSeek-V3.1",
"name": "DeepSeek V3.1"
}
},
"options": {
"endpoint": "https://ais123.services.ai.azure.com/openai/v1",
"apiKey": "{env:AZURE_API_KEY}",
"apiMode": "chat",
"quota": {
"default": {
"rpm": 60,
"tpm": 100000,
"maxConcurrent": 4
}
}
}
}
}
}

{
"provider": {
"azure-foundry": {
"name": "Azure Foundry",
"npm": "file:///usr/local/provider/azure-foundry-provider/index.js",
"models": {
"kimi-k2.5": {
"id": "FW-Kimi-K2.5",
"name": "Kimi K2.5"
},
"mistral-large-3": {
"id": "Mistral-Large-3",
"name": "Mistral Large 3"
}
},
"options": {
"endpoint": "https://ais123.services.ai.azure.com/openai/v1",
"apiKey": "{env:AZURE_API_KEY}",
"apiMode": "chat",
"quota": {
"default": {
"rpm": 30,
"tpm": 50000,
"maxConcurrent": 2
},
"models": {
"FW-Kimi-K2.5": {
"rpm": 10,
"tpm": 20000,
"maxConcurrent": 1
},
"Mistral-Large-3": {
"rpm": 20,
"tpm": 40000,
"maxConcurrent": 2
}
}
}
}
}
}
}

{
"provider": {
"azure-foundry": {
"name": "Azure Foundry",
"npm": "file:///usr/local/provider/azure-foundry-provider/index.js",
"models": {
"deepseek-v3.1": {
"id": "DeepSeek-V3.1",
"name": "DeepSeek V3.1"
}
},
"options": {
"endpoint": "https://ais123.services.ai.azure.com/openai/v1",
"apiKey": "{env:AZURE_API_KEY}",
"quota": {
"retry": {
"maxAttempts": 5,
"baseDelayMs": 800,
"maxDelayMs": 20000,
"jitterRatio": 0.2,
"honorRetryAfter": true,
"cooldownOn429Ms": 5000
}
}
}
}
}
}

{
"provider": {
"azure-foundry": {
"name": "Azure Foundry",
"npm": "file:///usr/local/provider/azure-foundry-provider/index.js",
"models": {
"deepseek-v3.1": {
"id": "DeepSeek-V3.1",
"name": "DeepSeek V3.1"
}
},
"options": {
"endpoint": "https://ais123.services.ai.azure.com/openai/v1",
"apiKey": "{env:AZURE_API_KEY}",
"quota": {
"adaptive": {
"enabled": true,
"minCooldownMs": 1000,
"lowWatermarkRatio": 0.1,
"lowCooldownMs": 250
}
}
}
}
}
}

{
"provider": {
"azure-foundry": {
"name": "Azure Foundry",
"npm": "file:///usr/local/provider/azure-foundry-provider/index.js",
"models": {
"deepseek-v3.1": {
"id": "DeepSeek-V3.1",
"name": "DeepSeek V3.1"
}
},
"options": {
"endpoint": "https://ais123.services.ai.azure.com/openai/v1",
"apiKey": "{env:AZURE_API_KEY}",
"quota": {
"adaptive": {
"enabled": false
}
}
}
}
}
}

{
"provider": {
"azure-foundry": {
"name": "Azure Foundry",
"npm": "file:///usr/local/provider/azure-foundry-provider/index.js",
"models": {
"kimi-k2.5": {
"id": "FW-Kimi-K2.5",
"name": "Kimi K2.5"
},
"mistral-large-3": {
"id": "Mistral-Large-3",
"name": "Mistral Large 3"
}
},
"options": {
"endpoint": "https://ais123.services.ai.azure.com/openai/v1",
"apiKey": "{env:AZURE_API_KEY}",
"cooldownScope": "per-model",
"quota": {
"adaptive": {
"enabled": true
},
"retry": {
"maxAttempts": 4,
"cooldownOn429Ms": 10000
}
}
}
}
}
}

{
"provider": {
"azure-foundry": {
"name": "Azure Foundry",
"npm": "file:///usr/local/provider/azure-foundry-provider/index.js",
"models": {
"mistral-large-3": {
"id": "Mistral-Large-3",
"name": "Mistral Large 3",
"modalities": { "input": ["text"], "output": ["text"] }
},
"deepseek-v3.1": {
"id": "DeepSeek-V3.1",
"name": "DeepSeek V3.1",
"modalities": { "input": ["text"], "output": ["text"] }
}
},
"options": {
"endpoint": "https://ais123.services.ai.azure.com/openai/v1",
"apiKey": "{env:AZURE_API_KEY}",
"assistantReasoningSanitization": "auto",
"modelOptions": {
"Mistral-Large-3": {
"assistantReasoningSanitization": "always"
}
}
}
}
}
}

- AZURE_FOUNDRY_ENDPOINT: fallback for options.endpoint
- AZURE_API_KEY: fallback for options.apiKey
If you see an error like The chatCompletion operation does not work with the specified model, the model likely does not support the chat operation.
- Fix: Update your endpoint or set modelOptions[modelId].apiMode = "responses" for that model.
- Automatic recovery: provider.languageModel(modelId) can retry once through responses when the initial mode came from URL inference or global apiMode.
- Strict cases: provider.chat(modelId) does not fall back, and a per-model apiMode: "chat" also stays strict.
If you see the corresponding mismatch for /responses, the model likely requires chat instead.
- Fix: Update your endpoint or set modelOptions[modelId].apiMode = "chat" for that model.
- Automatic recovery: provider.languageModel(modelId) can retry once through chat when the initial mode came from URL inference or global apiMode.
- Strict cases: provider.responses(modelId) does not fall back, and a per-model apiMode: "responses" also stays strict.
Generic 400 Bad Request responses do not trigger fallback. Only known Azure operation-mismatch errors for the attempted operation qualify.
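Conceptually, the recovery rule amounts to a retry-once loop like the following sketch, where isOperationMismatch stands in for the provider's internal error classifier:

```typescript
// Illustrative one-shot cross-transport recovery: retry through the opposite
// operation only when fallback is allowed AND the error matches a known
// Azure operation-mismatch (generic errors, e.g. plain 400s, rethrow).
type ApiMode = "chat" | "responses"
const opposite = (m: ApiMode): ApiMode => (m === "chat" ? "responses" : "chat")

async function callOnceWithFallback<T>(
  mode: ApiMode,
  fallbackAllowed: boolean,
  send: (mode: ApiMode) => Promise<T>,
  isOperationMismatch: (err: unknown) => boolean,
): Promise<T> {
  try {
    return await send(mode)
  } catch (err) {
    if (fallbackAllowed && isOperationMismatch(err)) return send(opposite(mode))
    throw err // no second fallback, no retry for unrecognized errors
  }
}
```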
Some strict Azure Foundry endpoints (notably Mistral-based ones) reject assistant messages that contain reasoning_content or reasoning fields in their history.
- Symptom: You receive an extra_forbidden validation error.
- Fix: Use assistantReasoningSanitization: "auto" (the default) or set it to "always" for that specific model in modelOptions to skip the failing round-trip.
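Conceptually, sanitization drops the offending fields from assistant history messages before the request is sent. A minimal sketch of that transformation (the provider's real policy logic is more involved):

```typescript
// Drop the reasoning fields that strict endpoints reject with extra_forbidden;
// field names come from the error description above, the helper is illustrative.
function stripReasoningFields(msg: Record<string, unknown>): Record<string, unknown> {
  const { reasoning: _r, reasoning_content: _rc, ...rest } = msg
  return rest
}
```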
The provider handles 429 errors automatically via retries and adaptive throttling.
- If you are still hitting limits: Check your rpm and tpm settings in the quota block.
- Adaptive throttling: Ensure quota.adaptive.enabled is true (the default) so the provider can react to Azure's ratelimit headers before a failure occurs.
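For intuition, the adaptive options shown earlier (minCooldownMs, lowWatermarkRatio, lowCooldownMs) suggest a watermark rule along these lines; the provider's actual header parsing and thresholds are internal, so this is only a sketch:

```typescript
// Hedged sketch of adaptive cooldown selection from x-ratelimit-style
// remaining/limit values: hard pause when exhausted, soft pause below the
// low watermark, no pause with headroom.
function adaptiveCooldownMs(
  remaining: number,
  limit: number,
  opts: { minCooldownMs: number; lowWatermarkRatio: number; lowCooldownMs: number },
): number {
  if (remaining <= 0) return opts.minCooldownMs // budget exhausted: hard cooldown
  if (remaining / limit < opts.lowWatermarkRatio) return opts.lowCooldownMs // running low: soft cooldown
  return 0 // plenty of headroom: no cooldown
}
```

This is the "preventing 429 failures before they occur" idea: the pause is taken from header signals rather than waiting for a rejection.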
- Unsupported Azure hostname: Ensure your host matches *.services.ai.azure.com, *.cognitiveservices.azure.com, or *.openai.azure.com.
- Unsupported endpoint path: Path must end with /chat/completions, /responses, /models/chat/completions, or a supported /openai/v1 variant.
- Missing required api-version: Add ?api-version=... to your Foundry URL if using /models/chat/completions.
- Endpoint path /openai/v1 requires apiMode: When using the base v1 root, you must explicitly set apiMode globally or per-model.
- content_filter / ResponsibleAIPolicyViolation: This is Azure's content policy, not a transport error. Adjust the prompt and retry.
- Query parameters are preserved as provided in the endpoint URL.
- Chat and responses requests route deterministically from endpoint parsing + apiMode override.
- Retry/backoff is active even if you do not configure static quota limits.
- Adaptive throttling is enabled by default and uses Azure ratelimit headers when available.
- createAzureFoundryProvider
- azureFoundryProvider (default instance with environment-based settings)
- parseEndpoint
- Types: AzureFoundryOptions, AzureFoundryProvider, ApiMode, HostType, PathType, ParsedEndpoint, ToolPolicy, QuotaOptions, QuotaRule, QuotaRetryOptions, QuotaAdaptiveOptions, AssistantReasoningSanitizationPolicy, ModelRequestOptions, RequestPolicyOptions