How BenchmarkConfig gets built, validated, and used — from user input to benchmark execution.
For usage examples and flag reference, see CLI_QUICK_REFERENCE.md.
$ uv run inference-endpoint -h
Usage: inference-endpoint COMMAND
╭─ Commands ───────────────────────────────────────────────────────────────────╮
│ benchmark Run benchmarks (offline, online, from-config) │
│ eval Run accuracy evaluation. (not yet implemented) │
│ info Show system information. │
│ init Generate config template. │
│ probe Test endpoint connectivity. │
│ validate-yaml Validate YAML configuration file. │
╰──────────────────────────────────────────────────────────────────────────────╯
| Subcommand | Purpose | Config source |
|---|---|---|
| `benchmark offline` | Max throughput (all queries at t=0) | CLI flags → Pydantic |
| `benchmark online` | Sustained QPS with load pattern | CLI flags → Pydantic |
| `benchmark from-config` | Run from YAML file | YAML → Pydantic |
Global options: `--version`, `-v` (INFO), `-vv` (DEBUG). Verbosity is handled by the meta-app, not `BenchmarkConfig`.
CLI YAML
┌─────────┴─────────┐ benchmark from-config
offline online │
│ │ │ yaml.safe_load(path)
│ cyclopts builds │ cyclopts builds │ resolve_env_vars(data)
│ OfflineBenchmark │ OnlineBenchmark │ TypeAdapter picks subclass
│ Config │ Config │ by type field
│ │ │
│ inject --dataset │ inject --dataset │ optional --timeout/--mode
│ via with_updates │ via with_updates │ via with_updates
│ │ │
▼ ▼ ▼
┌────────────────────────────────────────────────────────────┐
│ _resolve_and_validate() (model_validator) │
└─────────────────────────────┬──────────────────────────────┘
│
▼
setup → execute → finalize
Both paths produce the same subclass with the same defaults. A YAML file with `type: offline` gets `OfflineBenchmarkConfig` — identical to what `benchmark offline` constructs.
- **cyclopts constructs the subclass directly.** `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` are Pydantic models in `config/schema.py` with `@cyclopts.Parameter(name="*")`. cyclopts generates flags from their fields.
- **Type locked at class level.** `OfflineBenchmarkConfig.type` is `Literal[TestType.OFFLINE]` — determined by subcommand, not user input.
- **Datasets injected after construction.** `--dataset` strings are parsed by a `BeforeValidator` on the `datasets` field, then merged via `config.with_updates(datasets=...)`.
- **`from_yaml_file(path)`** loads YAML, resolves `${VAR}` env vars on parsed values (sketched after this list), then passes the dict to a Pydantic `TypeAdapter` with a `Discriminator`.
- **Auto-selects subclass.** `type: "offline"` → `OfflineBenchmarkConfig`, `type: "online"` → `OnlineBenchmarkConfig`, others → base `BenchmarkConfig`.
- **Optional CLI overrides.** `--timeout` and `--mode` are applied via `config.with_updates(...)`, which re-runs validators.
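The env-var resolution step is plain string substitution over the parsed YAML tree. A minimal sketch of what `resolve_env_vars` could look like — the recursion and the error-on-undefined behavior are assumptions, not the project's actual implementation:

```python
import os
import re
from typing import Any

_ENV_VAR = re.compile(r"\$\{(\w+)\}")

def resolve_env_vars(data: Any) -> Any:
    """Recursively expand ${VAR} references in values parsed from YAML."""
    if isinstance(data, dict):
        return {key: resolve_env_vars(value) for key, value in data.items()}
    if isinstance(data, list):
        return [resolve_env_vars(item) for item in data]
    if isinstance(data, str):
        # Fail loudly on undefined variables rather than passing "${VAR}" through.
        def replace(match: re.Match) -> str:
            name = match.group(1)
            if name not in os.environ:
                raise ValueError(f"Undefined environment variable: {name}")
            return os.environ[name]
        return _ENV_VAR.sub(replace, data)
    return data
```

Resolving after `yaml.safe_load` but before validation means Pydantic only ever sees final values, so type coercion and constraints apply to the substituted strings.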
`OfflineBenchmarkConfig` and `OnlineBenchmarkConfig` exist in the schema (not just the CLI) so both paths share them:
BenchmarkConfig (base — submission/eval fallback)
├── OfflineBenchmarkConfig (type=OFFLINE, OfflineSettings)
└── OnlineBenchmarkConfig (type=ONLINE, OnlineSettings)
They provide:
- **Type safety** — `Literal` type field, impossible to mismatch
- **Unified defaults** — CLI and YAML get identical subclass behavior
- **Per-mode `--help`** — each subcommand shows only relevant flags
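The discriminated-union mechanics behind "auto-selects subclass" fit in a few lines. Class and field names below mirror the hierarchy above, but the bodies are illustrative, not the real schema:

```python
from enum import Enum
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Discriminator, TypeAdapter

class TestType(str, Enum):
    OFFLINE = "offline"
    ONLINE = "online"

class OfflineBenchmarkConfig(BaseModel):
    # Locked at class level: the subcommand, not the user, decides this.
    type: Literal[TestType.OFFLINE] = TestType.OFFLINE
    model: str

class OnlineBenchmarkConfig(BaseModel):
    type: Literal[TestType.ONLINE] = TestType.ONLINE
    model: str

# TypeAdapter reads the `type` field and instantiates the matching subclass.
adapter = TypeAdapter(
    Annotated[
        Union[OfflineBenchmarkConfig, OnlineBenchmarkConfig],
        Discriminator("type"),
    ]
)

config = adapter.validate_python({"type": "offline", "model": "M"})
assert isinstance(config, OfflineBenchmarkConfig)
```

Because `type` is a `Literal` with a default, the CLI path can construct the subclass directly while the YAML path dispatches on the same field.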
| Aspect | offline | online |
|---|---|---|
| Streaming default | AUTO → OFF | AUTO → ON |
| Settings class | `OfflineSettings` | `OnlineSettings` |
`--dataset` is repeatable and accepts a string with TOML-style dotted paths:

```text
--dataset [perf|acc:]<path>[,key=value...]
```

The first segment is the file path, optionally prefixed with `perf:` or `acc:` to set the dataset type (defaults to performance). Additional comma-separated `key=value` pairs set `Dataset` fields, using dotted paths for nesting.
```shell
# Simple
--dataset data.jsonl

# Accuracy dataset
--dataset acc:eval.jsonl

# With samples limit and column remap
--dataset data.csv,samples=500,parser.prompt=article

# With accuracy config
--dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer

# Multiple datasets
--dataset perf:train.jsonl --dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1 --mode both
```

Parser remaps use `parser.TARGET=SOURCE` — "rename my dataset's SOURCE column to TARGET". Valid targets are derived from `MakeAdapterCompatible` (`prompt`, `system`). Invalid targets are rejected at parse time; invalid source columns are rejected at dataset load time.
Pydantic validates all fields: `extra="forbid"` on `Dataset` and `AccuracyConfig` catches typos like `--dataset data.jsonl,samles=500`. Format is auto-detected from the file extension.
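A sketch of the parsing a `BeforeValidator` on `datasets` might perform: split off the optional `perf:`/`acc:` prefix, then fold dotted `key=value` pairs into a nested dict for Pydantic to validate. The function name and dict shape here are assumptions:

```python
from typing import Any

def parse_dataset_string(spec: str) -> dict[str, Any]:
    """Turn 'acc:eval.jsonl,samples=500,parser.prompt=article' into a dict."""
    head, *pairs = spec.split(",")
    dataset: dict[str, Any] = {"type": "performance"}
    # Optional perf:/acc: prefix sets the dataset type.
    if head.startswith("perf:"):
        dataset["path"] = head[len("perf:"):]
    elif head.startswith("acc:"):
        dataset["type"] = "accuracy"
        dataset["path"] = head[len("acc:"):]
    else:
        dataset["path"] = head
    # Dotted keys nest: parser.prompt=article -> {"parser": {"prompt": "article"}}
    for pair in pairs:
        key, _, value = pair.partition("=")
        node = dataset
        *parents, leaf = key.split(".")
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value  # left as a string; Pydantic coerces types later
    return dataset

print(parse_dataset_string("acc:eval.jsonl,samples=500,parser.prompt=article"))
# {'type': 'accuracy', 'path': 'eval.jsonl', 'samples': '500',
#  'parser': {'prompt': 'article'}}
```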
The only YAML-only features are `submission_ref` and `benchmark_mode` (for official submissions).
Validation is layered, executing in order:
1. cyclopts → required args? unknown flags?
2. Pydantic fields → type coercion, `ge`/`le` constraints
3. Sub-model validators:
   ├── `RuntimeConfig._validate_durations` → max >= min duration
   ├── `LoadPattern._validate_completeness` → poisson needs qps, concurrency needs target
   └── `HTTPClientConfig._workers_not_zero` → num_workers != 0
4. `BenchmarkConfig._resolve_and_validate`:
   ├── resolve defaults (name, streaming, model name from `submission_ref`)
   ├── load pattern type vs test type (offline → max_throughput, online → poisson/concurrency)
   ├── submission needs `benchmark_mode`
   └── duplicate dataset detection
5. Runtime (`execute.py`) → files exist, endpoints reachable

Sub-models self-validate their own constraints; `BenchmarkConfig` only handles cross-model checks.
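As an illustration of the layer-4 cross-model check, here is a hedged sketch of a `model_validator` enforcing the load-pattern/test-type pairing described above. The field names and the `_ALLOWED` table are assumptions:

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, model_validator

class TestType(str, Enum):
    OFFLINE = "offline"
    ONLINE = "online"

class LoadPatternType(str, Enum):
    MAX_THROUGHPUT = "max_throughput"
    POISSON = "poisson"
    CONCURRENCY = "concurrency"

# Which load patterns each test type accepts (assumed table).
_ALLOWED = {
    TestType.OFFLINE: {LoadPatternType.MAX_THROUGHPUT},
    TestType.ONLINE: {LoadPatternType.POISSON, LoadPatternType.CONCURRENCY},
}

class BenchmarkConfig(BaseModel):
    type: TestType
    load_pattern: LoadPatternType
    name: Optional[str] = None

    @model_validator(mode="after")
    def _resolve_and_validate(self) -> "BenchmarkConfig":
        # Resolve defaults that depend on other fields.
        if self.name is None:
            self.name = f"{self.type.value}-benchmark"
        # Cross-model check: the load pattern must match the test type.
        if self.load_pattern not in _ALLOWED[self.type]:
            raise ValueError(
                f"{self.type.value} benchmarks do not support "
                f"load pattern {self.load_pattern.value!r}"
            )
        return self

BenchmarkConfig(type="offline", load_pattern="max_throughput")  # ok
# BenchmarkConfig(type="offline", load_pattern="poisson")       # would raise ValueError
```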
Errors from cyclopts (missing args, unknown flags) and from Pydantic validation go through `cli_error_formatter` in `config/utils.py`:
$ uv run inference-endpoint benchmark offline
╭── Error ─────────────────────────────────────────────────────────────────────╮
│ Required: --dataset │
╰──────────────────────────────────────────────────────────────────────────────╯
$ uv run inference-endpoint benchmark offline --endpoints x --model M --dataset D --workers abc
╭── Error ─────────────────────────────────────────────────────────────────────╮
│ settings.client.num_workers: Input should be a valid integer, unable to │
│ parse string as an integer │
╰──────────────────────────────────────────────────────────────────────────────╯
The formatter resolves aliases (shows `--dataset`, not `--endpoint-config.endpoints`) and strips Pydantic boilerplate.
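One plausible shape for that alias resolution: map Pydantic error locations to the user-facing flag, falling back to the dotted path. This is a hypothetical sketch, not the actual `cli_error_formatter`:

```python
# Hypothetical alias table from Pydantic error locations to user-facing flags.
FLAG_ALIASES: dict[tuple[str, ...], str] = {
    ("endpoint_config", "endpoints"): "--endpoints",
}

def format_field_error(loc: tuple[str, ...], msg: str) -> str:
    """Prefer the CLI flag the user typed; fall back to the dotted field path."""
    flag = FLAG_ALIASES.get(loc, ".".join(loc))
    return f"{flag}: {msg}"

print(format_field_error(("endpoint_config", "endpoints"), "Field required"))
# --endpoints: Field required
print(format_field_error(
    ("settings", "client", "num_workers"),
    "Input should be a valid integer",
))
# settings.client.num_workers: Input should be a valid integer
```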
| Exception class | Exit code | When |
|---|---|---|
| `InputValidationError` | 2 | Bad user input, invalid config |
| `SetupError` | 3 | Dataset load failure, connection error |
| `ExecutionError` | 4 | Benchmark failed after setup |
| `CLIError` | 1 | Generic CLI error (base class) |

The reserved `eval` command currently raises `CLIError` with a tracking-issue link rather than a dedicated exception type.
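A sketch of how such an exception hierarchy maps to exit codes; the `exit_code` attribute and the wrapper are assumptions about the implementation:

```python
import sys
from typing import Callable

class CLIError(Exception):
    """Base class: generic CLI errors exit with code 1."""
    exit_code = 1

class InputValidationError(CLIError):
    exit_code = 2   # bad user input, invalid config

class SetupError(CLIError):
    exit_code = 3   # dataset load failure, connection error

class ExecutionError(CLIError):
    exit_code = 4   # benchmark failed after setup

def run_cli(main: Callable[[], None]) -> int:
    """Translate raised CLIErrors into the process exit code."""
    try:
        main()
    except CLIError as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return exc.exit_code
    return 0
```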
Annotate the schema field — zero CLI code changes:
```python
class HTTPClientConfig(WithUpdatesMixin, BaseModel):
    buffer_size: Annotated[
        int,
        cyclopts.Parameter(alias="--buffer-size", help="Socket buffer size"),
    ] = 4096
    # → --client.buffer-size AND --buffer-size
```

- Dotted paths auto-generated in kebab-case from the model hierarchy
- Shorthands explicit via `cyclopts.Parameter(alias=...)`
- Booleans get `--no-` negation
- `show=False` hides a flag from `--help`
`BenchmarkConfig` is frozen. Use `with_updates()` to produce new instances with re-validation:

```python
config = config.with_updates(timeout=300, datasets=["new_data.jsonl"])
```
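A minimal sketch of what such a mixin could look like in Pydantic v2. `model_copy(update=...)` skips validation, so this version round-trips through `model_validate` to re-run validators; the actual `WithUpdatesMixin` may differ:

```python
from typing import Any

from pydantic import BaseModel

class WithUpdatesMixin:
    def with_updates(self, **updates: Any):
        """Return a new, re-validated instance with the given fields replaced."""
        data = self.model_dump() | updates       # merge new values over old
        return type(self).model_validate(data)   # re-runs all validators

class BenchmarkConfig(WithUpdatesMixin, BaseModel):
    model_config = {"frozen": True}              # direct mutation raises
    timeout: int = 60
    datasets: list[str] = []

config = BenchmarkConfig(timeout=120, datasets=["data.jsonl"])
config = config.with_updates(timeout=300, datasets=["new_data.jsonl"])
assert config.timeout == 300
```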