
CLI Design: Config Loading, Validation, and Execution

How BenchmarkConfig gets built, validated, and used — from user input to benchmark execution.

For usage examples and flag reference, see CLI_QUICK_REFERENCE.md.

Command Structure

$ uv run inference-endpoint -h
Usage: inference-endpoint COMMAND

╭─ Commands ───────────────────────────────────────────────────────────────────╮
│ benchmark      Run benchmarks (offline, online, from-config)                 │
│ eval           Run accuracy evaluation. (not yet implemented)                │
│ info           Show system information.                                      │
│ init           Generate config template.                                     │
│ probe          Test endpoint connectivity.                                   │
│ validate-yaml  Validate YAML configuration file.                             │
╰──────────────────────────────────────────────────────────────────────────────╯
Subcommand              Purpose                                Config source
─────────────────────   ────────────────────────────────────   ────────────────────
benchmark offline       Max throughput (all queries at t=0)    CLI flags → Pydantic
benchmark online        Sustained QPS with load pattern        CLI flags → Pydantic
benchmark from-config   Run from YAML file                     YAML → Pydantic

Global options: --version, -v (INFO), -vv (DEBUG). Verbosity is handled by the meta-app, not BenchmarkConfig.

Config Construction

              CLI                                          YAML
    ┌─────────┴─────────┐                          benchmark from-config
 offline              online                               │
    │                    │                                 │  yaml.safe_load(path)
    │  cyclopts builds   │  cyclopts builds                │  resolve_env_vars(data)
    │  OfflineBenchmark  │  OnlineBenchmark                │  TypeAdapter picks subclass
    │  Config            │  Config                         │  by type field
    │                    │                                 │
    │  inject --dataset  │  inject --dataset               │  optional --timeout/--mode
    │  via with_updates  │  via with_updates               │  via with_updates
    │                    │                                 │
    ▼                    ▼                                 ▼
  ┌────────────────────────────────────────────────────────────┐
  │  _resolve_and_validate() (model_validator)                 │
  └─────────────────────────────┬──────────────────────────────┘
                                │
                                ▼
                  setup → execute → finalize

Both paths produce the same subclass with the same defaults. A YAML file with type: offline gets OfflineBenchmarkConfig — identical to what benchmark offline constructs.

CLI path

  1. cyclopts constructs the subclass directly. OfflineBenchmarkConfig / OnlineBenchmarkConfig are Pydantic models in config/schema.py with @cyclopts.Parameter(name="*"). cyclopts generates flags from their fields.

  2. Type locked at class level. OfflineBenchmarkConfig.type is Literal[TestType.OFFLINE] — determined by subcommand, not user input.

  3. Datasets injected after construction. --dataset strings are parsed by a BeforeValidator on the datasets field, then merged via config.with_updates(datasets=...).
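
A minimal sketch of this wiring, assuming simplified field names and a plain string Literal instead of the project's TestType enum (the real classes live in config/schema.py):

from typing import Annotated, Literal

import cyclopts
from pydantic import BaseModel, Field

app = cyclopts.App(name="inference-endpoint")

@cyclopts.Parameter(name="*")              # flatten: fields become top-level flags
class OfflineBenchmarkConfig(BaseModel):
    type: Literal["offline"] = "offline"   # locked by the class, not user input
    model: str = ""
    timeout: Annotated[int, Field(ge=1)] = 600

@app.command
def offline(*, config: OfflineBenchmarkConfig):
    # cyclopts builds the Pydantic model directly from --model, --timeout, ...
    print(config)

if __name__ == "__main__":
    app()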

YAML path

  1. from_yaml_file(path) loads YAML, resolves ${VAR} env vars on parsed values, then passes the dict to a Pydantic TypeAdapter with Discriminator.

  2. Auto-selects subclass. type: "offline" → OfflineBenchmarkConfig, type: "online" → OnlineBenchmarkConfig, others → base BenchmarkConfig.

  3. Optional CLI overrides. --timeout and --mode applied via config.with_updates(...) which re-runs validators.
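
A minimal sketch of the YAML path. The import path config.schema and the helper bodies are assumptions; the real resolve_env_vars and from_yaml_file may differ, and the fallback to the base BenchmarkConfig is omitted here:

import os
import re
from typing import Annotated, Union

import yaml
from pydantic import Discriminator, TypeAdapter

from config.schema import OfflineBenchmarkConfig, OnlineBenchmarkConfig  # assumed import path

AnyBenchmarkConfig = Annotated[
    Union[OfflineBenchmarkConfig, OnlineBenchmarkConfig],
    Discriminator("type"),   # the `type` field picks the subclass
]
_ADAPTER = TypeAdapter(AnyBenchmarkConfig)

_ENV_VAR = re.compile(r"\$\{(\w+)\}")

def resolve_env_vars(value):
    # Recursively substitute ${VAR} references with environment values.
    if isinstance(value, str):
        return _ENV_VAR.sub(lambda m: os.environ.get(m[1], m[0]), value)
    if isinstance(value, dict):
        return {k: resolve_env_vars(v) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_env_vars(v) for v in value]
    return value

def from_yaml_file(path):
    with open(path) as fh:
        data = yaml.safe_load(fh)
    return _ADAPTER.validate_python(resolve_env_vars(data))

# config.yaml with `type: offline` comes back as an OfflineBenchmarkConfig,
# identical to what `benchmark offline` would construct.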

Why subclasses?

OfflineBenchmarkConfig and OnlineBenchmarkConfig exist in the schema (not just CLI) so both paths share them:

BenchmarkConfig (base — submission/eval fallback)
  ├── OfflineBenchmarkConfig  (type=OFFLINE, OfflineSettings)
  └── OnlineBenchmarkConfig   (type=ONLINE, OnlineSettings)

They provide:

  • Type safety — Literal type field, impossible to mismatch
  • Unified defaults — CLI and YAML get identical subclass behavior
  • Per-mode --help — each subcommand shows only relevant flags
Aspect              offline           online
─────────────────   ───────────────   ───────────────
Streaming default   AUTO → OFF        AUTO → ON
Settings class      OfflineSettings   OnlineSettings

Dataset Specification

--dataset is repeatable and accepts a string with TOML-style dotted paths:

--dataset [perf|acc:]<path>[,key=value...]

The first segment is the file path, optionally prefixed with perf: or acc: to set the dataset type (defaults to performance). Additional comma-separated key=value pairs set Dataset fields using dotted paths for nesting.

# Simple
--dataset data.jsonl

# Accuracy dataset
--dataset acc:eval.jsonl

# With samples limit and column remap
--dataset data.csv,samples=500,parser.prompt=article

# With accuracy config
--dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer

# Multiple datasets
--dataset perf:train.jsonl --dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1 --mode both

Parser remaps use parser.TARGET=SOURCE — "rename my dataset's SOURCE column to TARGET". Valid targets are derived from MakeAdapterCompatible (prompt, system). Invalid targets are rejected at parse time. Invalid source columns are rejected at dataset load time.

Pydantic validates all fields: extra="forbid" on Dataset and AccuracyConfig catches typos like --dataset data.jsonl,samles=500. Format is auto-detected from file extension.
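
A rough sketch of how such a spec string could be parsed before Pydantic validation. parse_dataset_spec and the exact Dataset field names ("path", "type") are illustrative, not the project's actual code:

from typing import Any

def parse_dataset_spec(spec: str) -> dict[str, Any]:
    # "[perf|acc:]<path>[,key=value...]" → dict later validated by the Dataset model
    head, *pairs = spec.split(",")
    prefix, _, rest = head.partition(":")
    if prefix in ("perf", "acc") and rest:
        data: dict[str, Any] = {"type": "accuracy" if prefix == "acc" else "performance",
                                "path": rest}
    else:
        data = {"type": "performance", "path": head}
    for pair in pairs:
        key, _, value = pair.partition("=")
        node = data
        *parents, leaf = key.split(".")      # dotted path → nested dict for sub-models
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return data

# parse_dataset_spec("acc:eval.jsonl,accuracy_config.eval_method=pass_at_1")
# → {"type": "accuracy", "path": "eval.jsonl",
#    "accuracy_config": {"eval_method": "pass_at_1"}}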

The only YAML-only features are submission_ref and benchmark_mode (for official submissions).

Validation

Validation is layered, executing in order:

 1. cyclopts        → required args? unknown flags?
 2. Pydantic fields → type coercion, ge/le constraints
 3. Sub-model validators:
    ├── RuntimeConfig._validate_durations    → max >= min duration
    ├── LoadPattern._validate_completeness   → poisson needs qps, concurrency needs target
    └── HTTPClientConfig._workers_not_zero   → num_workers != 0
 4. BenchmarkConfig._resolve_and_validate:
    ├── resolve defaults (name, streaming, model name from submission_ref)
    ├── load pattern type vs test type (offline→max_throughput, online→poisson/concurrency)
    ├── submission needs benchmark_mode
    └── duplicate dataset detection
 5. Runtime (execute.py) → files exist, endpoints reachable

Sub-models self-validate their own constraints. BenchmarkConfig only handles cross-model checks.
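
A compressed sketch of that split, with simplified field names (the real validators also resolve defaults and check submission and dataset rules):

from typing import Literal

from pydantic import BaseModel, model_validator

class LoadPattern(BaseModel):
    type: Literal["max_throughput", "poisson", "concurrency"] = "max_throughput"
    qps: float | None = None

    @model_validator(mode="after")
    def _validate_completeness(self):
        # the sub-model checks only its own internal consistency
        if self.type == "poisson" and self.qps is None:
            raise ValueError("poisson load pattern requires qps")
        return self

class BenchmarkConfig(BaseModel):
    type: Literal["offline", "online"] = "offline"
    load_pattern: LoadPattern = LoadPattern()

    @model_validator(mode="after")
    def _resolve_and_validate(self):
        # cross-model check: load pattern type must match the test type
        if self.type == "offline" and self.load_pattern.type != "max_throughput":
            raise ValueError("offline benchmarks only use max_throughput")
        if self.type == "online" and self.load_pattern.type == "max_throughput":
            raise ValueError("online benchmarks need poisson or concurrency")
        return self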

Error formatting

Errors from cyclopts (missing args, unknown flags, Pydantic validation) go through cli_error_formatter in config/utils.py:

$ uv run inference-endpoint benchmark offline
╭── Error ─────────────────────────────────────────────────────────────────────╮
│ Required: --dataset                                                          │
╰──────────────────────────────────────────────────────────────────────────────╯

$ uv run inference-endpoint benchmark offline --endpoints x --model M --dataset D --workers abc
╭── Error ─────────────────────────────────────────────────────────────────────╮
│   settings.client.num_workers: Input should be a valid integer, unable to    │
│ parse string as an integer                                                   │
╰──────────────────────────────────────────────────────────────────────────────╯

The formatter resolves aliases (shows --dataset not --endpoint-config.endpoints) and strips Pydantic boilerplate.

Error Handling

Exception Class         Exit Code   When
─────────────────────   ─────────   ─────────────────────────────────
InputValidationError    2           Bad user input, invalid config
SetupError              3           Dataset load failure, connection error
ExecutionError          4           Benchmark failed after setup
CLIError                1           Generic CLI error (base class)

The reserved eval command currently raises CLIError with a tracking issue link rather than a dedicated exception type.
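
One plausible shape for this hierarchy; the class names and codes come from the table above, while the exit_code attribute is an assumption about how the mapping is carried:

class CLIError(Exception):
    """Base class; generic CLI failures exit with code 1."""
    exit_code = 1

class InputValidationError(CLIError):
    exit_code = 2  # bad user input, invalid config

class SetupError(CLIError):
    exit_code = 3  # dataset load failure, connection error

class ExecutionError(CLIError):
    exit_code = 4  # benchmark failed after setup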

Development Guide

Adding a CLI flag

Annotate the schema field — zero CLI code changes:

from typing import Annotated

import cyclopts
from pydantic import BaseModel

class HTTPClientConfig(WithUpdatesMixin, BaseModel):   # mixin provides with_updates(), see below
    buffer_size: Annotated[
        int,
        cyclopts.Parameter(alias="--buffer-size", help="Socket buffer size"),
    ] = 4096
    # → --client.buffer-size AND --buffer-size

Flag generation rules

  • Dotted paths auto-generated in kebab-case from model hierarchy
  • Shorthands explicit via cyclopts.Parameter(alias=...)
  • Booleans get --no- negation
  • show=False hides from --help

Config modification

BenchmarkConfig is frozen. Use with_updates() to produce new instances with re-validation:

config = config.with_updates(timeout=300, datasets=["new_data.jsonl"])
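
A possible sketch of what the mixin does; the real implementation may differ, but the point is to re-run validation rather than use model_copy(update=...), which skips validators:

from typing import Any

from pydantic import BaseModel

class WithUpdatesMixin:
    def with_updates(self: "BaseModel", **updates: Any):
        # Merge updates into a plain dict, then round-trip through model_validate
        # so field, sub-model, and cross-model validators all run again.
        merged = {**self.model_dump(), **updates}
        return type(self).model_validate(merged)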