Key terminology for understanding the evals framework.
| Term | Definition |
| --- | --- |
| Model | The LLM being evaluated (e.g., google/gemini-2.5-pro, anthropic/claude-3-5-haiku) |
| Task | An Inspect AI evaluation function that processes samples (e.g., question_answer, bug_fix, code_gen) |
| Sample | A single test case containing an input prompt and expected output (grading criteria) |
| Variant | Named configuration that modifies how a task runs: controls context injection, MCP tools, and skill availability |
| Eval Run | A complete execution of a task against one or more models, producing results for all samples |
| Term | Definition |
| --- | --- |
| task.yaml | Task definition file specifying the task function, samples, and optional variant restrictions |
| job.yaml | Runtime configuration defining what to run: filters tasks, variants, and models for a specific run |
| EvalSet JSON | Resolved configuration produced by the Dart dataset_config_dart package and consumed by the Python runner |
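
The way a job narrows what gets run can be pictured as filtering three lists and taking their combinations. The following is only a conceptual sketch in Python, not the resolver's actual code (the real resolution happens in the Dart package): the dictionary keys `tasks`, `variants`, and `models`, the variant names, and the helper functions are all hypothetical.

```python
from itertools import product


def select(available, wanted):
    """Keep the order of `available`; a missing or empty filter keeps everything."""
    if not wanted:
        return list(available)
    return [item for item in available if item in wanted]


def expand_runs(tasks, variants, models, job):
    """Return every (task, variant, model) combination allowed by the job filters.

    `job` is a plain dict standing in for a parsed job file; its keys are
    assumptions made for this illustration.
    """
    return list(
        product(
            select(tasks, job.get("tasks")),
            select(variants, job.get("variants")),
            select(models, job.get("models")),
        )
    )


tasks = ["question_answer", "bug_fix", "code_gen"]
variants = ["baseline", "with_mcp"]  # hypothetical variant names
models = ["google/gemini-2.5-pro", "anthropic/claude-3-5-haiku"]

# Hypothetical job: one task, one model, no variant filter.
job = {"tasks": ["bug_fix"], "models": ["google/gemini-2.5-pro"]}

runs = expand_runs(tasks, variants, models, job)
for run in runs:
    print(run)
```

With this job, only the two variant combinations for bug_fix on a single model remain.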
| Term | Definition |
| --- | --- |
| Context File | Markdown file with YAML frontmatter providing additional context injected into prompts |
| Workspace Template | Reusable project scaffolds (Flutter app, Dart package) mounted in the sandbox |
| Sandbox | Containerized environment (Podman/Docker) for safe code execution during evaluations |
| MCP Server | Model Context Protocol server providing tools/context to the model during evaluation |
| Term | Definition |
| --- | --- |
| Scorer | Logic that determines if a model's output is correct (e.g., model-graded semantic match) |
| Accuracy | Percentage of samples scored as correct in an eval run |
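
As a worked illustration of the accuracy definition, here is a minimal Python sketch. The `SampleResult` type and its fields are hypothetical and exist only for this example; the framework's own result objects may look different.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Hypothetical per-sample outcome: the sample's ID and the scorer's verdict."""
    sample_id: str
    correct: bool


def accuracy(results: list[SampleResult]) -> float:
    """Percentage of samples scored as correct; 0.0 for an empty run."""
    if not results:
        return 0.0
    return 100.0 * sum(r.correct for r in results) / len(results)


run = [
    SampleResult("s1", True),
    SampleResult("s2", False),
    SampleResult("s3", True),
    SampleResult("s4", True),
]
print(f"{accuracy(run):.1f}%")  # 3 of 4 correct -> 75.0%
```

Returning 0.0 for an empty run is a choice made for this sketch; raising an error would be an equally reasonable convention.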
| Package | Definition |
| --- | --- |
| dataset_config_dart | Dart package that parses dataset YAML and resolves it into EvalSet JSON via a layered Parser → Resolver → Writer pipeline |
| dash_evals | Python package that consumes EvalSet JSON (or direct CLI arguments) and executes evaluations using Inspect AI |
| devals_cli | Dart CLI (devals) for creating and managing tasks, samples, and jobs |
Dart (dataset_config_dart)
| Class | Definition |
| --- | --- |
| EvalSet | Top-level container representing a fully resolved evaluation configuration, serialized to JSON for the runner |
| Task | Inspect domain task definition with a name, task function reference, dataset, and configuration |
| Sample | An input/target test case with optional metadata, workspace, and sandbox configuration |
| Variant | Named variant configuration with context files, MCP servers, and skills |
| TaskInfo | Lightweight task metadata (name and function reference) |
| ParsedTask | Intermediate representation produced by parsers, consumed by the resolver |
| Job | Parsed job file with runtime overrides and task/variant/model filters |
| ConfigResolver | Facade providing single-call convenience API for the full parse → resolve → write pipeline |
| Module | Definition |
| --- | --- |
| json_runner | Module that reads EvalSet JSON, resolves task functions via importlib, builds inspect_ai.Task objects, and calls eval_set() |
| args_runner | Module that builds a single task from direct CLI arguments (--task, --model, --dataset) |
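
Resolving a task function "via importlib" generally means turning a string reference into a callable at runtime. The sketch below shows one common way to do that with the standard library. The `module:function` reference format and the helper name `resolve_task_function` are assumptions for illustration, not the format json_runner actually uses.

```python
import importlib
from typing import Callable


def resolve_task_function(reference: str) -> Callable:
    """Resolve a 'package.module:function' string to a callable.

    The 'module:function' format is an assumption made for this sketch.
    """
    module_path, sep, func_name = reference.partition(":")
    if not sep or not module_path or not func_name:
        raise ValueError(f"expected 'module:function', got {reference!r}")

    # Import the module by its dotted path, then look the function up by name.
    module = importlib.import_module(module_path)
    func = getattr(module, func_name, None)
    if not callable(func):
        raise AttributeError(f"{module_path!r} has no callable {func_name!r}")
    return func


# Demonstrated with a standard-library function so the example runs anywhere.
sqrt = resolve_task_function("math:sqrt")
print(sqrt(9))  # 3.0
```

Failing early with a clear error when a reference is malformed or does not point at a callable makes configuration mistakes visible before any evaluation starts.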
See the Configuration Reference for detailed configuration file documentation.
Learn more about Inspect AI