
# Glossary

Key terminology for understanding the evals framework.

## Core Concepts

| Term | Definition |
| --- | --- |
| Model | The LLM being evaluated (e.g., `google/gemini-2.5-pro`, `anthropic/claude-3-5-haiku`) |
| Task | An Inspect AI evaluation function that processes samples (e.g., `question_answer`, `bug_fix`, `code_gen`) |
| Sample | A single test case containing an input prompt and expected output (grading criteria) |
| Variant | Named configuration that modifies how a task runs: controls context injection, MCP tools, and skill availability |
| Eval Run | A complete execution of a task against one or more models, producing results for all samples |

## Configuration Files

| Term | Definition |
| --- | --- |
| `task.yaml` | Task definition file specifying the task function, samples, and optional variant restrictions |
| `job.yaml` | Runtime configuration defining what to run: filters tasks, variants, and models for a specific run |
| EvalSet JSON | Resolved configuration produced by the Dart `dataset_config_dart` package and consumed by the Python runner |
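As a sketch of how these files relate, a minimal `task.yaml` might look like the following. The field names here are illustrative assumptions, not the framework's documented schema; consult the Configuration Reference for the actual format.

```yaml
# Hypothetical task.yaml sketch. Field names are assumptions,
# not the framework's actual schema.
name: question_answer
function: question_answer   # Inspect AI task function to run
samples:
  - input: "What widget lays out children vertically in Flutter?"
    target: "Column"        # grading criteria for the scorer
variants:
  - default
```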

## Resources

| Term | Definition |
| --- | --- |
| Context File | Markdown file with YAML frontmatter providing additional context injected into prompts |
| Workspace Template | Reusable project scaffold (e.g., a Flutter app or Dart package) mounted in the sandbox |
| Sandbox | Containerized environment (Podman/Docker) for safe code execution during evaluations |
| MCP Server | Model Context Protocol server providing tools and context to the model during evaluation |

## Scoring

| Term | Definition |
| --- | --- |
| Scorer | Logic that determines whether a model's output is correct (e.g., model-graded semantic match) |
| Accuracy | Percentage of samples scored as correct in an eval run |
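Since accuracy is just the share of correctly scored samples expressed as a percentage, it can be sketched in a few lines (the function name and signature here are illustrative, not part of the framework):

```python
# Illustrative sketch of the accuracy metric defined above:
# the percentage of samples scored as correct in an eval run.
def accuracy(scores: list[bool]) -> float:
    """Return the percent of samples scored correct (0.0 for an empty run)."""
    if not scores:
        return 0.0
    return 100.0 * sum(scores) / len(scores)

print(accuracy([True, True, False, True]))  # 75.0
```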

## Key Packages

| Package | Definition |
| --- | --- |
| `dataset_config_dart` | Dart package that parses dataset YAML and resolves it into EvalSet JSON via a layered Parser → Resolver → Writer pipeline |
| `dash_evals` | Python package that consumes EvalSet JSON (or direct CLI arguments) and executes evaluations using Inspect AI |
| `devals_cli` | Dart CLI (`devals`) for creating and managing tasks, samples, and jobs |

## Internal Classes

### Dart (`dataset_config_dart`)

| Class | Definition |
| --- | --- |
| `EvalSet` | Top-level container representing a fully resolved evaluation configuration, serialized to JSON for the runner |
| `Task` | Inspect domain task definition with a name, task function reference, dataset, and configuration |
| `Sample` | An input/target test case with optional metadata, workspace, and sandbox configuration |
| `Variant` | Named variant configuration with context files, MCP servers, and skills |
| `TaskInfo` | Lightweight task metadata (name and function reference) |
| `ParsedTask` | Intermediate representation produced by parsers and consumed by the resolver |
| `Job` | Parsed job file with runtime overrides and task/variant/model filters |
| `ConfigResolver` | Facade providing a single-call convenience API for the full parse → resolve → write pipeline |

### Python (`dash_evals`)

| Module | Definition |
| --- | --- |
| `json_runner` | Module that reads EvalSet JSON, resolves task functions via `importlib`, builds `inspect_ai.Task` objects, and calls `eval_set()` |
| `args_runner` | Module that builds a single task from direct CLI arguments (`--task`, `--model`, `--dataset`) |
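The `importlib`-based lookup that `json_runner` performs can be sketched as follows. This is a simplified illustration, not the package's actual code: the helper name and the `"module:function"` reference format are assumptions.

```python
import importlib


def resolve_task_function(ref: str):
    """Resolve a 'module.path:function' reference to a callable.

    Hypothetical helper illustrating the importlib-based lookup
    json_runner is described as performing; the real reference
    format and API may differ.
    """
    module_name, _, func_name = ref.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# Demonstrate with a stdlib function resolved the same way a task
# function reference would be.
print(resolve_task_function("math:sqrt")(9))  # 3.0
```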

See the Configuration Reference for detailed configuration file documentation.

Learn more about Inspect AI