From 4e75faf9e27595ffc8011c9a94aae40eef313455 Mon Sep 17 00:00:00 2001
From: Eric Windmill
Date: Thu, 12 Mar 2026 12:47:42 -0700
Subject: [PATCH 01/21] feat: Introduce a dedicated `yaml_config.md` for
 detailed configuration fields, refactoring `configuration_reference.md` to
 link to it and updating `index.md` and `custom.css`.

---
 docs/_static/custom.css                   |  19 +
 docs/reference/configuration_reference.md | 242 +-------
 docs/reference/index.md                   |   1 +
 docs/reference/yaml_config.md             | 691 ++++++++++++++++++++++
 4 files changed, 718 insertions(+), 235 deletions(-)
 create mode 100644 docs/reference/yaml_config.md

diff --git a/docs/_static/custom.css b/docs/_static/custom.css
index 9243a6b..a0d57ff 100644
--- a/docs/_static/custom.css
+++ b/docs/_static/custom.css
@@ -418,3 +418,22 @@ html[data-theme="dark"] .sig > span.pre:not(:first-child) {
 html[data-theme="dark"] .sig-paren {
   color: #888888;
 }
+
+
+/* ============================================
+   COLLAPSIBLE SIDEBARS ON WIDE SCREENS
+   ============================================ */
+
+.bd-sidebar-primary {
+  padding-right: 30px;
+  width: auto !important;
+}
+
+
+.bd-sidebar-secondary {
+  width: auto !important;
+}
+
+.bd-article-container {
+  max-width: none !important;
+}
\ No newline at end of file
diff --git a/docs/reference/configuration_reference.md b/docs/reference/configuration_reference.md
index deb2193..a1f4d68 100644
--- a/docs/reference/configuration_reference.md
+++ b/docs/reference/configuration_reference.md
@@ -12,6 +12,8 @@ The evaluation framework uses the `eval/` directory as its entry point. It conta
 
 Configuration is parsed and resolved by the Dart `dataset_config_dart` package, which produces an EvalSet JSON manifest consumed by the Python `dash_evals`.
 
+> **See also:** [YAML Configuration Fields](yaml_config.md) for a complete field-by-field reference with Dart and Python cross-references.
+
 ## Directory Structure
 
 ```
@@ -84,77 +86,7 @@ samples:
       The fix should handle the disposed controller properly.
 ```
-### Task-Level Fields
-
-#### Core Fields
-
-| Field | Type | Required | Description |
-|-------|------|----------|-------------|
-| `func` | string | Yes | Name of the `@task` function (resolved dynamically via `importlib`) |
-| `description` | string | No | Human-readable description |
-| `samples` | object | Yes | Samples config with `inline` and/or `paths` keys |
-| `allowed_variants` | list | No | Whitelist of variant names this task accepts (omit to accept all) |
-| `system_message` | string | No | Custom system prompt for this task |
-| `workspace` | object | No | Default workspace for all samples |
-| `tests` | object | No | Default test files for all samples |
-
-#### Inspect AI Task Parameters
-
-These map directly to [Inspect AI's `Task` constructor](https://inspect.aisi.org.uk/reference/inspect_ai.html#task). All are optional and override any `task_defaults` set in the job file.
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `model` | string | Default model for this task (overrides the eval model) |
-| `config` | object | Model generation config (e.g., `{temperature: 0.2, max_tokens: 4096}`) |
-| `model_roles` | object | Named roles for use in `get_model()` |
-| `sandbox` | string/object | Sandbox environment type or `[type, config_path]` |
-| `approval` | string/object | Tool use approval policies |
-| `epochs` | int/object | Number of times to repeat each sample (optionally with score reducer) |
-| `fail_on_error` | number/bool | `true` = fail on first error, `0.0–1.0` = fail if proportion exceeds threshold |
-| `continue_on_fail` | bool | Continue running if `fail_on_error` condition is met |
-| `message_limit` | int | Max total messages per sample |
-| `token_limit` | int | Max total tokens per sample |
-| `time_limit` | int | Max clock time (seconds) per sample |
-| `working_limit` | int | Max working time (seconds) per sample (excludes wait time) |
-| `cost_limit` | float | Max cost (dollars) per sample |
-| `early_stopping` | string/object | Early stopping callbacks |
-| `display_name` | string | Task display name (e.g., for plotting) |
-| `version` | int | Version of task spec (to distinguish evolutions) |
-| `metadata` | object | Additional metadata to associate with the task |
-
-### Samples Object
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `inline` | list | Inline sample definitions |
-| `paths` | list | Glob patterns for external sample YAML files (relative to task dir) |
-
-### Sample Fields (inline in task.yaml)
-
-#### Core Fields
-
-| Field | Type | Required | Description |
-|-------|------|----------|-------------|
-| `id` | string | Yes | Unique sample identifier |
-| `input` | string | Yes | The prompt given to the model |
-| `target` | string | Yes | Expected output or grading criteria |
-| `difficulty` | string | No | `easy`, `medium`, or `hard` |
-| `tags` | list | No | Categories for filtering |
-| `system_message` | string | No | Override system prompt for this sample |
-| `metadata` | object | No | Arbitrary metadata |
-| `workspace` | object | No | Override task-level workspace |
-| `tests` | object | No | Override task-level tests |
-
-#### Inspect AI Sample Parameters
-
-These map directly to [Inspect AI's `Sample`](https://inspect.aisi.org.uk/reference/inspect_ai.dataset.html#sample).
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `choices` | list | Answer choices for multiple-choice evaluations |
-| `sandbox` | string/object | Override sandbox environment for this sample |
-| `files` | object | Files to copy into the sandbox (`{destination: source}`) |
-| `setup` | string | Setup script to run in the sandbox before evaluation |
+For the complete list of task fields (including Inspect AI `Task` parameters), see the [Task fields table](yaml_config.md#task).
 
 ### Workspace/Tests References
@@ -201,34 +133,7 @@ samples:
       category: language_fundamentals
 ```
 
----
-
-### Core Fields
-
-| Field | Type | Required | Description |
-|-------|------|----------|-------------|
-| `id` | string | Yes | Unique sample identifier |
-| `input` | string | Yes | The prompt given to the model |
-| `target` | string | Yes | Expected output or grading criteria |
-| `difficulty` | string | No | `easy`, `medium`, or `hard` |
-| `tags` | list | No | Categories for filtering |
-| `system_message` | string | No | Override system prompt for this sample |
-| `metadata` | object | No | Arbitrary metadata |
-| `workspace` | object | No | Override task-level workspace |
-| `tests` | object | No | Override task-level tests |
-
----
-
-### Inspect AI Sample Parameters
-
-These map directly to [Inspect AI's `Sample`](https://inspect.aisi.org.uk/reference/inspect_ai.dataset.html#sample).
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `choices` | list | Answer choices for multiple-choice evaluations |
-| `sandbox` | string/object | Override sandbox environment for this sample |
-| `files` | object | Files to copy into the sandbox (`{destination: source}`) |
-| `setup` | string | Setup script to run in the sandbox before evaluation |
+For the complete list of sample fields, see the [Sample fields table](yaml_config.md#sample).
 
 ### Multiple Choice Example
@@ -256,33 +161,6 @@ These map directly to [Inspect AI's `Sample`](https://inspect.aisi.org.uk/refere
     setup: "cd /workspace && flutter pub get"
 ```
 
----
-
-### Workspace & Tests References
-
-Workspaces and test paths can be specified at task level (inherited by all samples) or per-sample (overrides task level).
-
-```yaml
-# Reference a reusable template
-workspace:
-  template: flutter_app
-
-# Reference a path relative to task directory
-workspace:
-  path: ./project
-
-# Clone from git
-workspace:
-  git: https://github.com/example/repo.git
-
-# Shorthand (equivalent to path:)
-workspace: ./project
-```
-
-> [!NOTE]
-> Paths in `workspace` and `tests` are resolved **relative to the task directory** (e.g., `tasks/flutter_bug_fix/`).
-
-
 ---
 
 ## Job files
@@ -330,110 +208,13 @@ task_defaults:
   # log_images: true
 ```
 
-
-### Core Job Fields
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `logs_dir` | string | Override logs directory (default: `../logs`) |
-| `sandbox_type` | string | Sandbox type: `local`, `docker`, or `podman` (default: `local`) |
-| `max_connections` | int | Max concurrent API connections (default: `10`) |
-| `max_retries` | int | Max retry attempts for failed samples (default: `3`) |
-| `save_examples` | bool | If `true`, copies the agent's final workspace to `//examples/` after each sample. (default: `false`) |
-| `models` | list | Filter to specific models — omit to run all |
-| `variants` | map | Named variant definitions (see Variants section) — omit to run all defined in task files |
-| `tasks` | object | Task discovery and overrides (see below) |
-
-### Inspect AI eval_set() Parameters
-
-All [Inspect AI `eval_set()` parameters](https://inspect.aisi.org.uk/reference/inspect_ai.html#eval_set) are available as top-level keys in the job file. These control retry behavior, concurrency, logging, and more.
-
-#### Retry & Error Handling
-
-| Field | Type | Default | Description |
-|-------|------|---------|-------------|
-| `retry_attempts` | int | `10` | Max retry attempts before giving up |
-| `retry_wait` | float | `60` | Seconds between retries (exponential backoff) |
-| `retry_connections` | float | `0.5` | Reduce max_connections at this rate per retry |
-| `retry_cleanup` | bool | `true` | Cleanup failed log files after retries |
-| `retry_on_error` | int | — | Retry samples on error (per-sample) |
-| `fail_on_error` | float | `0.05` | Fail if error proportion exceeds threshold |
-| `continue_on_fail` | bool | — | Continue running even if fail_on_error is met |
-| `debug_errors` | bool | `false` | Raise task errors for debugging |
-
-#### Concurrency
-
-| Field | Type | Default | Description |
-|-------|------|---------|-------------|
-| `max_samples` | int | `max_connections` | Max concurrent samples per task |
-| `max_tasks` | int | `max(4, models)` | Max tasks to run in parallel |
-| `max_subprocesses` | int | `cpu_count` | Max subprocesses in parallel |
-| `max_sandboxes` | int | — | Max sandboxes per-provider in parallel |
-
-#### Logging
-
-| Field | Type | Default | Description |
-|-------|------|---------|-------------|
-| `log_level` | string | `info` | Console log level (`debug`, `info`, `warning`, `error`) |
-| `log_level_transcript` | string | `info` | Log file level |
-| `log_format` | string | `json` | Log format (`eval` or `json`) |
-| `log_samples` | bool | `true` | Log detailed samples and scores |
-| `log_realtime` | bool | `true` | Log events in realtime |
-| `log_images` | bool | `false` | Log base64-encoded images |
-| `log_buffer` | int | — | Samples to buffer before log write |
-| `log_shared` | int | — | Sync sample events for realtime viewing |
-| `log_dir_allow_dirty` | bool | `false` | Allow log dir with unrelated logs |
-
-#### Model Configuration
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `model_base_url` | string | Base URL for the model API |
-| `model_args` | object | Model creation arguments |
-| `model_roles` | object | Named roles for `get_model()` |
-| `task_args` | object | Task creation arguments |
-| `model_cost_config` | object | Model prices for cost tracking |
-
-#### Sample Control
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `limit` | int/list | Limit samples (count or `[start, end]` range) |
-| `sample_id` | string/list | Evaluate specific sample(s) |
-| `sample_shuffle` | bool/int | Shuffle samples (pass seed for deterministic order) |
-| `epochs` | int/object | Repeat samples and optional score reducer |
-
-#### Limits (Applied to All Samples)
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `message_limit` | int | Max messages per sample |
-| `token_limit` | int | Max tokens per sample |
-| `time_limit` | int | Max clock time (seconds) per sample |
-| `working_limit` | int | Max working time (seconds) per sample |
-| `cost_limit` | float | Max cost (dollars) per sample |
-
-#### Miscellaneous
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `tags` | list | Tags for this evaluation run |
-| `metadata` | object | Metadata for this evaluation run |
-| `trace` | bool | Trace model interactions to terminal |
-| `display` | string | Task display type (default: `full`) |
-| `score` | bool | Score output (default: `true`) |
-| `approval` | string/object | Tool use approval policies |
-| `solver` | string/object | Alternative solver(s) |
-| `sandbox_cleanup` | bool | Cleanup sandbox after task (default: `true`) |
-| `bundle_dir` | string | Directory for bundled logs + viewer |
-| `bundle_overwrite` | bool | Overwrite files in bundle_dir |
-| `eval_set_id` | string | Custom ID for the eval set |
+For the complete list of job fields (including all Inspect AI `eval_set()` parameters), see the [Job fields table](yaml_config.md#job).
 
 ### Pass-Through Sections
 
 #### `task_defaults`
 
-Default [Task parameters](#inspect-ai-task-parameters) applied to **every task** in this job. Per-task overrides from `task.yaml` take precedence.
+Default [Task parameters](yaml_config.md#task) applied to **every task** in this job. Per-task overrides from `task.yaml` take precedence.
 
 ```yaml
 task_defaults:
@@ -467,11 +248,6 @@ tasks:
     exclude-samples: [slow_test]  # Exclude these samples
 ```
 
-| Field | Type | Description |
-|-------|------|-------------|
-| `paths` | list | Glob patterns for discovering task directories |
-| `inline` | object | Per-task configuration overrides |
-
 ---
 
 ## Variants
@@ -486,11 +262,7 @@ variants:
   full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] }
 ```
 
-| Field | Type | Default | Description |
-|-------|------|---------|-------------|
-| `context_files` | list | `[]` | Paths or glob patterns to context files (relative to task dir) |
-| `skills` | list | `[]` | Paths or glob patterns to skill directories (relative to task dir) |
-| `mcp_servers` | list | `[]` | MCP server identifiers |
+Variant sub-fields (`context_files`, `mcp_servers`, `skills`, `flutter_channel`) are documented in the [Job fields table](yaml_config.md#job).
 
 Tasks can optionally restrict which variants apply to them via `allowed_variants` in their `task.yaml`:
 
diff --git a/docs/reference/index.md b/docs/reference/index.md
index 1576729..86879cb 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -8,6 +8,7 @@ API documentation, CLI usage, and other reference material.
 glossary
 cli
 configuration_reference
+yaml_config
 ```
 
 ```{toctree}
diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md
new file mode 100644
index 0000000..8e6e632
--- /dev/null
+++ b/docs/reference/yaml_config.md
@@ -0,0 +1,691 @@
+# YAML Configuration Fields
+
+This page provides a complete field-by-field reference for each YAML configuration file type, cross-referenced with the corresponding Dart and Python object field names.
+
+## Job
+
+Job files define runtime settings for an evaluation run, including sandbox configuration, rate limits, model selection, variant definitions, and pass-through parameters for Inspect AI's `eval_set()` and `Task` constructors. Located in `eval/jobs/`.
+
+```{list-table}
+:header-rows: 1
+:widths: 20 8 5 12 12 43
+
+* - Field name
+  - YAML type
+  - Optional
+  - Dart field
+  - Python field
+  - Description
+* - `log_dir`
+  - string
+  - N
+  - `logDir`
+  - `log_dir`
+  - Directory to write evaluation logs to
+* - `sandbox_type`
+  - string
+  - Y
+  - `sandboxType`
+  - `sandbox_type`
+  - Sandbox type: `local`, `docker`, or `podman` (default: `local`)
+* - `max_connections`
+  - int
+  - Y
+  - `maxConnections`
+  - `max_connections`
+  - Maximum concurrent API connections (default: `10`)
+* - `models`
+  - list
+  - Y
+  - `models`
+  - `models`
+  - Filter to specific models — omit to use defaults
+* - `variants`
+  - map
+  - Y
+  - `variants`
+  - `variants`
+  - Named variant definitions (keys are names, values are config maps)
+* - `variants`\
+    `.<name>`\
+    `.context_files`
+  - list
+  - Y
+  -
+  -
+  - Paths or glob patterns to context files
+* - `variants`\
+    `.<name>`\
+    `.mcp_servers`
+  - list
+  - Y
+  -
+  -
+  - MCP server identifiers
+* - `variants`\
+    `.<name>`\
+    `.skills`
+  - list
+  - Y
+  -
+  -
+  - Paths or glob patterns to skill directories
+* - `variants`\
+    `.<name>`\
+    `.flutter_channel`
+  - string
+  - Y
+  -
+  -
+  - Flutter SDK channel (`stable`, `beta`, `main`)
+* - `task_paths`
+  - list
+  - Y
+  - `taskPaths`
+  - `task_paths`
+  - Glob patterns for discovering task directories (relative to dataset root)
+* - `tasks`
+  - object
+  - Y
+  - `tasks`
+  - `tasks`
+  - Per-task configurations with inline overrides
+* - `tasks`\
+    `.<name>`\
+    `.include-samples`
+  - list
+  - Y
+  - `JobTask.includeSamples`
+  - `JobTask.include_samples`
+  - Only run these sample IDs
+* - `tasks`\
+    `.<name>`\
+    `.exclude-samples`
+  - list
+  - Y
+  - `JobTask.excludeSamples`
+  - `JobTask.exclude_samples`
+  - Exclude these sample IDs
+* - `tasks`\
+    `.<name>`\
+    `.system_message`
+  - string
+  - Y
+  - `JobTask.systemMessage`
+  - `JobTask.system_message`
+  - Override system message for this task
+* - `save_examples`
+  - bool
+  - Y
+  - `saveExamples`
+  - `save_examples`
+  - Copy final workspace to `/examples/` after each sample (default: `false`)
+* - `retry_attempts`
+  - int
+  - Y
+  - `retryAttempts`
+  - `retry_attempts`
+  - Max retry attempts before giving up
+* - `max_retries`
+  - int
+  - Y
+  - `maxRetries`
+  - `max_retries`
+  - Max retry attempts for failed samples
+* - `retry_wait`
+  - float
+  - Y
+  - `retryWait`
+  - `retry_wait`
+  - Seconds between retries (exponential backoff)
+* - `retry_connections`
+  - float
+  - Y
+  - `retryConnections`
+  - `retry_connections`
+  - Reduce `max_connections` at this rate per retry
+* - `retry_cleanup`
+  - bool
+  - Y
+  - `retryCleanup`
+  - `retry_cleanup`
+  - Cleanup failed log files after retries
+* - `fail_on_error`
+  - float
+  - Y
+  - `failOnError`
+  - `fail_on_error`
+  - Fail if error proportion exceeds threshold (`0.0–1.0`)
+* - `continue_on_fail`
+  - bool
+  - Y
+  - `continueOnFail`
+  - `continue_on_fail`
+  - Continue running even if `fail_on_error` condition is met
+* - `retry_on_error`
+  - int
+  - Y
+  - `retryOnError`
+  - `retry_on_error`
+  - Retry samples on error (per-sample)
+* - `debug_errors`
+  - bool
+  - Y
+  - `debugErrors`
+  - `debug_errors`
+  - Raise task errors for debugging
+* - `max_samples`
+  - int
+  - Y
+  - `maxSamples`
+  - `max_samples`
+  - Max concurrent samples per task
+* - `max_tasks`
+  - int
+  - Y
+  - `maxTasks`
+  - `max_tasks`
+  - Max tasks to run in parallel
+* - `max_subprocesses`
+  - int
+  - Y
+  - `maxSubprocesses`
+  - `max_subprocesses`
+  - Max subprocesses in parallel
+* - `max_sandboxes`
+  - int
+  - Y
+  - `maxSandboxes`
+  - `max_sandboxes`
+  - Max sandboxes (per-provider) in parallel
+* - `log_level`
+  - string
+  - Y
+  - `logLevel`
+  - `log_level`
+  - Console log level (`debug`, `info`, `warning`, `error`)
+* - `log_level_transcript`
+  - string
+  - Y
+  - `logLevelTranscript`
+  - `log_level_transcript`
+  - Log file level
+* - `log_format`
+  - string
+  - Y
+  - `logFormat`
+  - `log_format`
+  - Log format (`eval` or `json`)
+* - `log_samples`
+  - bool
+  - Y
+  - `logSamples`
+  - `log_samples`
+  - Log detailed samples and scores
+* - `log_realtime`
+  - bool
+  - Y
+  - `logRealtime`
+  - `log_realtime`
+  - Log events in realtime
+* - `log_images`
+  - bool
+  - Y
+  - `logImages`
+  - `log_images`
+  - Log base64-encoded images
+* - `log_buffer`
+  - int
+  - Y
+  - `logBuffer`
+  - `log_buffer`
+  - Samples to buffer before log write
+* - `log_shared`
+  - int
+  - Y
+  - `logShared`
+  - `log_shared`
+  - Sync sample events for realtime viewing
+* - `log_dir_allow_dirty`
+  - bool
+  - Y
+  - `logDirAllowDirty`
+  - `log_dir_allow_dirty`
+  - Allow log dir with unrelated logs
+* - `model_base_url`
+  - string
+  - Y
+  - `modelBaseUrl`
+  - `model_base_url`
+  - Base URL for the model API
+* - `model_args`
+  - object
+  - Y
+  - `modelArgs`
+  - `model_args`
+  - Model creation arguments
+* - `model_roles`
+  - object
+  - Y
+  - `modelRoles`
+  - `model_roles`
+  - Named roles for `get_model()`
+* - `task_args`
+  - object
+  - Y
+  - `taskArgs`
+  - `task_args`
+  - Task creation arguments
+* - `model_cost_config`
+  - object
+  - Y
+  - `modelCostConfig`
+  - `model_cost_config`
+  - Model prices for cost tracking
+* - `limit`
+  - int/list
+  - Y
+  - `limit`
+  - `limit`
+  - Limit samples (count or `[start, end]` range)
+* - `sample_id`
+  - string/list
+  - Y
+  - `sampleId`
+  - `sample_id`
+  - Evaluate specific sample(s)
+* - `sample_shuffle`
+  - bool/int
+  - Y
+  - `sampleShuffle`
+  - `sample_shuffle`
+  - Shuffle samples (pass seed for deterministic order)
+* - `epochs`
+  - int/object
+  - Y
+  - `epochs`
+  - `epochs`
+  - Repeat samples and optional score reducer
+* - `message_limit`
+  - int
+  - Y
+  - `messageLimit`
+  - `message_limit`
+  - Max messages per sample
+* - `token_limit`
+  - int
+  - Y
+  - `tokenLimit`
+  - `token_limit`
+  - Max tokens per sample
+* - `time_limit`
+  - int
+  - Y
+  - `timeLimit`
+  - `time_limit`
+  - Max clock time (seconds) per sample
+* - `working_limit`
+  - int
+  - Y
+  - `workingLimit`
+  - `working_limit`
+  - Max working time (seconds) per sample
+* - `cost_limit`
+  - float
+  - Y
+  - `costLimit`
+  - `cost_limit`
+  - Max cost (dollars) per sample
+* - `tags`
+  - list
+  - Y
+  - `tags`
+  - `tags`
+  - Tags for this evaluation run
+* - `metadata`
+  - object
+  - Y
+  - `metadata`
+  - `metadata`
+  - Metadata for this evaluation run
+* - `trace`
+  - bool
+  - Y
+  - `trace`
+  - `trace`
+  - Trace model interactions to terminal
+* - `display`
+  - string
+  - Y
+  - `display`
+  - `display`
+  - Task display type (default: `full`)
+* - `score`
+  - bool
+  - Y
+  - `score`
+  - `score`
+  - Score output (default: `true`)
+* - `approval`
+  - string/object
+  - Y
+  - `approval`
+  - `approval`
+  - Tool use approval policies
+* - `solver`
+  - string/object
+  - Y
+  - `solver`
+  - `solver`
+  - Alternative solver(s)
+* - `sandbox_cleanup`
+  - bool
+  - Y
+  - `sandboxCleanup`
+  - `sandbox_cleanup`
+  - Cleanup sandbox after task
+* - `bundle_dir`
+  - string
+  - Y
+  - `bundleDir`
+  - `bundle_dir`
+  - Directory for bundled logs + viewer
+* - `bundle_overwrite`
+  - bool
+  - Y
+  - `bundleOverwrite`
+  - `bundle_overwrite`
+  - Overwrite files in `bundle_dir`
+* - `eval_set_id`
+  - string
+  - Y
+  - `evalSetId`
+  - `eval_set_id`
+  - Custom ID for the eval set
+* - `eval_set_overrides`
+  - object
+  - Y
+  - `evalSetOverrides`
+  - `eval_set_overrides`
+  - Additional `eval_set()` kwargs not covered by top-level fields
+* - `task_defaults`
+  - object
+  - Y
+  - `taskDefaults`
+  - `task_defaults`
+  - Default `Task` kwargs applied to every task in this job
+```
+
+## Task
+
+Task files define a single evaluation task with its samples, prompt configuration, and optional Inspect AI `Task` parameter overrides. Located in `eval/tasks/<name>/task.yaml`.
+
+```{list-table}
+:header-rows: 1
+:widths: 20 8 5 12 12 43
+
+* - Field name
+  - YAML type
+  - Optional
+  - Dart field
+  - Python field
+  - Description
+* - `func`
+  - string
+  - Y
+  -
+  -
+  - Name of the `@task` function (defaults to directory name)
+* - `id`
+  - string
+  - Y
+  -
+  -
+  - Task identifier (defaults to directory name)
+* - `description`
+  - string
+  - Y
+  -
+  -
+  - Human-readable description
+* - `system_message`
+  - string
+  - Y
+  -
+  -
+  - Custom system prompt for this task
+* - `samples`
+  - object
+  - N
+  -
+  -
+  - Samples config with `inline` and/or `paths` keys
+* - `samples`\
+    `.inline`
+  - list
+  - Y
+  -
+  -
+  - Inline sample definitions (list of sample objects)
+* - `samples`\
+    `.paths`
+  - list
+  - Y
+  -
+  -
+  - Glob patterns for external sample YAML files (relative to task dir)
+* - `allowed_variants`
+  - list
+  - Y
+  -
+  -
+  - Whitelist of variant names this task accepts
+* - `workspace`
+  - string/object
+  - Y
+  -
+  -
+  - Default workspace for all samples
+* - `tests`
+  - string/object
+  - Y
+  -
+  -
+  - Default test files for all samples
+* - `model`
+  - string
+  - Y
+  - `model`
+  - `model`
+  - Default model for this task
+* - `config`
+  - object
+  - Y
+  - `config`
+  - `config`
+  - Model generation config (e.g. `{temperature: 0.2}`)
+* - `model_roles`
+  - object
+  - Y
+  - `modelRoles`
+  - `model_roles`
+  - Named roles for `get_model()`
+* - `sandbox`
+  - string/object
+  - Y
+  - `sandbox`
+  - `sandbox`
+  - Sandbox environment type or config
+* - `approval`
+  - string/object
+  - Y
+  - `approval`
+  - `approval`
+  - Tool use approval policies
+* - `epochs`
+  - int/object
+  - Y
+  - `epochs`
+  - `epochs`
+  - Number of times to repeat each sample
+* - `fail_on_error`
+  - number/bool
+  - Y
+  - `failOnError`
+  - `fail_on_error`
+  - Fail threshold for sample errors
+* - `continue_on_fail`
+  - bool
+  - Y
+  - `continueOnFail`
+  - `continue_on_fail`
+  - Continue running if `fail_on_error` condition is met
+* - `message_limit`
+  - int
+  - Y
+  - `messageLimit`
+  - `message_limit`
+  - Max total messages per sample
+* - `token_limit`
+  - int
+  - Y
+  - `tokenLimit`
+  - `token_limit`
+  - Max total tokens per sample
+* - `time_limit`
+  - int
+  - Y
+  - `timeLimit`
+  - `time_limit`
+  - Max clock time (seconds) per sample
+* - `working_limit`
+  - int
+  - Y
+  - `workingLimit`
+  - `working_limit`
+  - Max working time (seconds) per sample
+* - `cost_limit`
+  - float
+  - Y
+  - `costLimit`
+  - `cost_limit`
+  - Max cost (dollars) per sample
+* - `early_stopping`
+  - string/object
+  - Y
+  - `earlyStopping`
+  - `early_stopping`
+  - Early stopping callbacks
+* - `display_name`
+  - string
+  - Y
+  - `displayName`
+  - `display_name`
+  - Task display name (e.g. for plotting)
+* - `version`
+  - int
+  - Y
+  - `version`
+  - `version`
+  - Version of task spec
+* - `metadata`
+  - object
+  - Y
+  - `metadata`
+  - `metadata`
+  - Additional metadata to associate with the task
+```
+
+## Sample
+
+Samples are individual test cases defined either inline in `task.yaml` under `samples.inline`, or in external YAML files referenced via `samples.paths`. Fields like `difficulty`, `tags`, `workspace`, and `tests` are parsed from YAML and stored inside the sample's `metadata` dict.
+
+```{list-table}
+:header-rows: 1
+:widths: 20 8 5 12 12 43
+
+* - Field name
+  - YAML type
+  - Optional
+  - Dart field
+  - Python field
+  - Description
+* - `id`
+  - string
+  - N
+  - `id`
+  - `id`
+  - Unique sample identifier
+* - `input`
+  - string
+  - N
+  - `input`
+  - `input`
+  - The prompt given to the model
+* - `target`
+  - string
+  - N
+  - `target`
+  - `target`
+  - Expected output or grading criteria
+* - `difficulty`
+  - string
+  - Y
+  -
+  -
+  - `easy`, `medium`, or `hard` (stored in `metadata["difficulty"]`)
+* - `tags`
+  - list
+  - Y
+  -
+  -
+  - Categories for filtering (stored in `metadata["tags"]`)
+* - `system_message`
+  - string
+  - Y
+  -
+  -
+  - Override system prompt for this sample (stored in `metadata`)
+* - `workspace`
+  - string/object
+  - Y
+  -
+  -
+  - Override task-level workspace (resolved path stored in `metadata["workspace"]`)
+* - `tests`
+  - string/object
+  - Y
+  -
+  -
+  - Override task-level tests (resolved path stored in `metadata["tests"]`)
+* - `choices`
+  - list
+  - Y
+  - `choices`
+  - `choices`
+  - Answer choices for multiple-choice evaluations
+* - `metadata`
+  - object
+  - Y
+  - `metadata`
+  - `metadata`
+  - Arbitrary metadata
+* - `sandbox`
+  - string/object
+  - Y
+  - `sandbox`
+  - `sandbox`
+  - Override sandbox environment for this sample
+* - `files`
+  - object
+  - Y
+  - `files`
+  - `files`
+  - Files to copy into sandbox (`{destination: source}`)
+* - `setup`
+  - string
+  - Y
+  - `setup`
+  - `setup`
+  - Setup script to run in sandbox before evaluation
+```

From 3cce70806c9aef115f2a51bbd44181c7c55c8b93 Mon Sep 17 00:00:00 2001
From: Eric Windmill
Date: Fri, 13 Mar 2026 15:56:27 -0700
Subject: [PATCH 02/21] updates in flight

---
 CHANGELOG.md                                  | 105 ++++++++++++++++++
 docs/reference/yaml_config.md                 |  84 +++++++++++---
 .../src/dataset_config_python/models/job.py   |   2 +-
 3 files changed, 177 insertions(+), 14 deletions(-)
 create mode 100644 CHANGELOG.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..13b3b70
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,105 @@
+# Changelog
+
+## Unreleased
+
+### New
+
+- **`Job.description`.** Optional human-readable description field on Job.
+
+- **`Job.imagePrefix` / `Job.image_prefix`.** Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files.
+
+- **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at three levels:
+  - `Job.taskFilters` / `Job.task_filters` — select tasks by metadata tags
+  - `Job.sampleFilters` / `Job.sample_filters` — select samples by metadata tags
+  - `variant_filters` on task YAML — restrict which variants apply to a task (supplements `allowed_variants`)
+
+- **`JobTask.args`.** Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks.
+
+- **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level. Previously only available as a job-level override via `JobTask`.
+
+- **`Task.sandboxParameters` / `Task.sandbox_parameters`.** Pass-through dictionary for sandbox plugin configuration.
+
+- **`module:task` syntax.** Task function references can now use `module.path:function_name` format for Python tasks.
+
+### Breaking Changes
+
+- **`Task.taskFunc` → `Task.func`.** Renamed model field to match the YAML key name. JSON serialization key changes from `"task_func"` to `"func"`. Both Dart and Python packages must update in lockstep.
+
+- **Sandbox registry is now configurable.** The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations.
+
+- **Workspace resolution uses native Inspect fields.** The `workspace` YAML key remains as parser-level sugar but resolves into Inspect AI's native `Sample.files` and `Sample.setup` fields. The `Sample.setup` command is no longer hardcoded to `cd /workspace && flutter pub get`; it is configurable or omitted for non-Flutter tasks.
+
+### Documentation
+
+- Updated `docs/reference/yaml_config.md` with all new fields and updated descriptions.
+- Updated `docs/guides/config.md` (pending — after implementation).
+
+## 11 March, 2025
+
+### New
+
+- **`dataset_config_python` package.** Python port of the Dart config package (`dataset_config_dart`), providing full parity for YAML parsing, resolution, and JSON output. Includes Pydantic models for `Job`, `Task`, `Sample`, `EvalSet`, `Variant`, `Dataset`, and `ContextFile`. Exposes `resolve()` and `write_eval_sets()` as the public API. No Dart SDK or Inspect AI dependency required — can be installed standalone by any team that needs to parse eval config YAML.
+
+### Breaking Changes
+
+- **Renamed `dataset_config` → `dataset_config_dart`.** The Dart config package was renamed for clarity alongside the new Python package.
+
+- **Renamed `dash_evals_config` → `dataset_config_python`.** The Python config package was renamed from its original name for consistency with the Dart package.
+
+## 28 February, 2025
+
+### New
+
+- **`eval_config` Dart package.** New package with a layered Parser → Resolver → Writer architecture that converts dataset YAML into EvalSet JSON for the Python runner. Provides `ConfigResolver` facade plus direct access to `YamlParser`, `JsonParser`, `EvalSetResolver`, and `EvalSetWriter`.
+
+- **Dual-mode eval runner.** The Python runner now supports two invocation modes:
+  - `run-evals --json ./eval_set.json` — consume a JSON manifest produced by the Dart CLI
+  - `run-evals --task <task> --model <model>` — run a single task directly from CLI arguments
+
+- **Generalized task functions.** Task implementations are now language-agnostic by default. Flutter-specific tasks (`flutter_bug_fix`, `flutter_code_gen`) are thin wrappers around the generic `bug_fix` and `code_gen` tasks. New tasks: `analyze_codebase`, `mcp_tool`, `skill_test`.
+
+- **New Dart domain models.** `EvalSet`, `Task`, `Sample`, `Variant`, and `TaskInfo` models in the `models` package map directly to the Inspect AI evaluation structure.
+
+### Breaking Changes
+
+- **Removed Python `registries.py`.** Task/model/sandbox registries are removed. Task functions are now discovered dynamically via `importlib` (short names like `"flutter_code_gen"` resolve automatically).
+
+- **Removed `TaskConfig` and `SampleConfig`.** Replaced by `ParsedTask` (intermediate parsing type in `eval_config`) and `Sample` (Inspect AI domain model).
+
+- **Removed legacy Python config parsing.** The `config/parsers/` directory, `load_yaml` utility, and associated model definitions have been removed from `eval_runner`. Configuration is now handled by the Dart `eval_config` package.
+
+- **Models package reorganized.** Report-app models (used by the Flutter results viewer) moved to `models/lib/src/report_app/`. The top-level `models/lib/src/` now contains inspect-domain models.
+
+- **Dataset utilities moved.** `DatasetReader`, `filesystem_utils`, and discovery helpers moved from `eval_config` to `eval_cli`.
+
+## 25 February, 2025
+
+### Breaking Changes
+
+- **Variant format changed from list to named map.** Job YAML files now define variants as a named map instead of a list. Tasks can optionally restrict applicable variants via `allowed_variants` in their `task.yaml`.
+ + **Before (list format):** + ```yaml + variants: + - baseline + - { mcp_servers: [dart] } + ``` + + **After (named map format):** + ```yaml + # job.yaml + variants: + baseline: {} + mcp_only: { mcp_servers: [dart] } + context_only: { context_files: [./context_files/flutter.md] } + full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] } + ``` + + ```yaml + # task.yaml (optional — omit to accept all job variants) + allowed_variants: [baseline, mcp_only] + ``` + +- **Removed `DEFAULT_VARIANTS` registry.** Variants are no longer defined globally in `registries.py`. Each job file defines its own variants. + +- **Removed `variants` from `JobTask`.** Per-task variant overrides (`job.tasks..variants`) are replaced by task-level `allowed_variants` whitelists. \ No newline at end of file diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md index 8e6e632..05d63b8 100644 --- a/docs/reference/yaml_config.md +++ b/docs/reference/yaml_config.md @@ -4,7 +4,7 @@ This page provides a complete field-by-field reference for each YAML configurati ## Job -Job files define runtime settings for an evaluation run, including sandbox configuration, rate limits, model selection, variant definitions, and pass-through parameters for Inspect AI's `eval_set()` and `Task` constructors. Located in `eval/jobs/`. +Job files define runtime settings for an evaluation run, including sandbox configuration, rate limits, model selection, variant definitions, tag-based filtering, and pass-through parameters for Inspect AI's `eval_set()` and `Task` constructors. Located in `eval/jobs/`. 
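A job file combining several of the new fields might look like this (an illustrative sketch; the directory, model concurrency, tag, and registry-prefix values below are invented for this example):

```yaml
description: Nightly bug-fix evals
log_dir: ./logs/nightly
sandbox_type: docker
image_prefix: us-central1-docker.pkg.dev/project/repo/
max_connections: 8
task_filters:
  include_tags: [flutter]
  exclude_tags: [slow]
```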
```{list-table} :header-rows: 1 @@ -16,6 +16,12 @@ Job files define runtime settings for an evaluation run, including sandbox confi - Dart field - Python field - Description +* - `description` + - string + - Y + - `description` + - `description` + - Human-readable description of the job * - `log_dir` - string - N @@ -28,6 +34,12 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `sandboxType` - `sandbox_type` - Sandbox type: `local`, `docker`, or `podman` (default: `local`) +* - `image_prefix` + - string + - Y + - `imagePrefix` + - `image_prefix` + - Registry prefix prepended to image names during sandbox resolution (e.g. `us-central1-docker.pkg.dev/project/repo/`) * - `max_connections` - int - Y @@ -78,6 +90,32 @@ Job files define runtime settings for an evaluation run, including sandbox confi - - - Flutter SDK channel (`stable`, `beta`, `main`) +* - `task_filters` + - object + - Y + - `taskFilters` + - `task_filters` + - Tag-based task selection filter +* - `task_filters`\ +   `.include_tags` + - list + - Y + - `TagFilter.includeTags` + - `TagFilter.include_tags` + - Only run tasks whose metadata tags include **all** of these +* - `task_filters`\ +   `.exclude_tags` + - list + - Y + - `TagFilter.excludeTags` + - `TagFilter.exclude_tags` + - Exclude tasks whose metadata tags include **any** of these +* - `sample_filters` + - object + - Y + - `sampleFilters` + - `sample_filters` + - Tag-based sample selection filter (same schema as `task_filters`) * - `task_paths` - list - Y @@ -114,6 +152,14 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `JobTask.systemMessage` - `JobTask.system_message` - Override system message for this task +* - `tasks`\ +   `.`\ +   `.args` + - object + - Y + - `JobTask.args` + - `JobTask.args` + - Per-task argument overrides passed to the task function * - `save_examples` - bool - Y @@ -433,9 +479,9 @@ Task files define a single evaluation task with its samples, prompt 
configuratio * - `func` - string - Y - - - - - - Name of the `@task` function (defaults to directory name) + - `func` + - `func` + - Name of the `@task` function or `module:function` reference (defaults to directory name) * - `id` - string - Y @@ -445,15 +491,9 @@ Task files define a single evaluation task with its samples, prompt configuratio * - `description` - string - Y - - - - + - `description` + - `description` - Human-readable description -* - `system_message` - - string - - Y - - - - - - Custom system prompt for this task * - `samples` - object - N @@ -480,12 +520,30 @@ Task files define a single evaluation task with its samples, prompt configuratio - - - Whitelist of variant names this task accepts +* - `variant_filters` + - object + - Y + - + - + - Tag-based variant filter (same schema as job-level `task_filters`) +* - `system_message` + - string + - Y + - `systemMessage` + - `system_message` + - Custom system prompt for this task +* - `sandbox_parameters` + - object + - Y + - `sandboxParameters` + - `sandbox_parameters` + - Pass-through parameters for sandbox plugin configuration * - `workspace` - string/object - Y - - - - Default workspace for all samples + - Default workspace for all samples (resolved into `Sample.files` and `Sample.setup`) * - `tests` - string/object - Y diff --git a/packages/dataset_config_python/src/dataset_config_python/models/job.py b/packages/dataset_config_python/src/dataset_config_python/models/job.py index c82ccc1..683e09f 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/job.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/job.py @@ -4,7 +4,7 @@ from typing import Any -from pydantic import BaseModel, Field +from pydantic import BaseModel class JobTask(BaseModel): From b9af6a29d79286497daa533da5abdbb55c861c4f Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 13 Mar 2026 16:45:25 -0700 Subject: [PATCH 03/21] rename func --- IMPLEMENTATION_PLAN.md | 315 ++++++++++++++++++ 
.../src/dash_evals/runner/json_runner.py | 4 +- .../lib/src/models/context_file.g.dart | 2 +- .../lib/src/models/dataset.g.dart | 2 +- .../lib/src/models/eval_log.g.dart | 76 ++--- .../lib/src/models/eval_set.g.dart | 2 +- .../lib/src/models/job.dart | 23 ++ .../lib/src/models/job.freezed.dart | 186 ++++++++--- .../lib/src/models/job.g.dart | 16 +- .../lib/src/models/models.dart | 1 + .../lib/src/models/tag_filter.dart | 33 ++ .../lib/src/models/tag_filter.freezed.dart | 290 ++++++++++++++++ .../lib/src/models/tag_filter.g.dart | 22 ++ .../lib/src/models/task.dart | 14 +- .../lib/src/models/task.freezed.dart | 68 ++-- .../lib/src/models/task.g.dart | 10 +- .../lib/src/models/variant.g.dart | 2 +- .../lib/src/parsed_task.dart | 14 +- .../lib/src/parsers/json_parser.dart | 4 +- .../lib/src/parsers/yaml_parser.dart | 4 +- .../lib/src/resolvers/eval_set_resolver.dart | 2 +- packages/dataset_config_dart/pubspec.yaml | 3 + .../test/eval_set_resolver_test.dart | 10 +- .../test/eval_set_writer_test.dart | 2 +- .../test/json_parser_test.dart | 4 +- .../test/parsed_task_test.dart | 16 +- .../dataset_config_python/models/__init__.py | 3 + .../src/dataset_config_python/models/job.py | 12 + .../models/tag_filter.py | 30 ++ .../src/dataset_config_python/models/task.py | 8 +- .../src/dataset_config_python/parser.py | 10 +- .../src/dataset_config_python/resolver.py | 2 +- .../tests/test_config.py | 6 +- .../devals_cli/lib/src/dataset/dry_run.dart | 4 +- tool/config_parity/pubspec.lock | 108 ------ 35 files changed, 1038 insertions(+), 270 deletions(-) create mode 100644 IMPLEMENTATION_PLAN.md create mode 100644 packages/dataset_config_dart/lib/src/models/tag_filter.dart create mode 100644 packages/dataset_config_dart/lib/src/models/tag_filter.freezed.dart create mode 100644 packages/dataset_config_dart/lib/src/models/tag_filter.g.dart create mode 100644 packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py delete mode 100644 tool/config_parity/pubspec.lock 
diff --git a/IMPLEMENTATION_PLAN.md b/IMPLEMENTATION_PLAN.md new file mode 100644 index 0000000..74441ea --- /dev/null +++ b/IMPLEMENTATION_PLAN.md @@ -0,0 +1,315 @@ +# Config Improvements — Implementation Plan + +This document details the implementation steps for all decided config improvements. Each section includes the specific files to modify in both Dart and Python packages, what to change, and relevant context. + +> **Branch:** `yardstick-config-updates` +> **Related docs:** `CHANGELOG.md`, `docs/reference/yaml_config.md` +> **Design analysis:** The original design doc (`config_improvements.md`) has been deleted. The finalized decisions are captured in `CHANGELOG.md`. + +--- + +## Table of Contents + +1. [Model Changes](#1-model-changes) +2. [Parser/Resolver Changes](#2-parserresolver-changes) +3. [Tag-Based Filtering](#3-tag-based-filtering) +4. [File Index](#4-file-index) +5. [Verification](#5-verification) + +--- + +## 1. Model Changes + +### 1.1 Add `description` to Job + +Simple optional string field. + +**Dart** — `packages/dataset_config_dart/lib/src/models/job.dart` +```dart +String? description, // Add to Job freezed class +``` + +**Python** — `packages/dataset_config_python/src/dataset_config_python/models/job.py` +```python +description: str | None = None +``` + +**Parser** — `packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart` +```dart +final description = data['description'] as String?; +// Pass to Job constructor +``` + +--- + +### 1.2 Add `image_prefix` to Job + +Registry URL prefix prepended to image names during sandbox resolution (e.g. `us-central1-docker.pkg.dev/project/repo/`). + +**Dart** — `packages/dataset_config_dart/lib/src/models/job.dart` +```dart +String? imagePrefix, +``` + +**Python** — `packages/dataset_config_python/src/dataset_config_python/models/job.py` +```python +image_prefix: str | None = None +``` + +**Parser** — read `image_prefix` from YAML, pass to Job. 
+ +**Resolver** — `packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart` +- In `_resolveSandbox()`, prepend `job.imagePrefix` to image names when constructing sandbox specs. + +--- + +### 1.3 Add `args` to JobTask + +Per-task argument overrides passed to the task function. + +**Dart** — `packages/dataset_config_dart/lib/src/models/job.dart` (on `JobTask` class) +```dart +@JsonKey(name: 'args') Map<String, dynamic>? args, +``` + +**Python** — `packages/dataset_config_python/src/dataset_config_python/models/job.py` (on `JobTask` class) +```python +args: dict[str, Any] | None = None +``` + +**Parser** — In `JobTask.fromYaml()` (both Dart and Python), read `args` from the per-task map. + +--- + +### 1.4 Add `system_message` to Task model + +Currently exists on `ParsedTask` but not the output `Task` model. Promote it. + +**Dart** — `packages/dataset_config_dart/lib/src/models/task.dart` +```dart +@JsonKey(name: 'system_message') String? systemMessage, +``` + +**Python** — `packages/dataset_config_python/src/dataset_config_python/models/task.py` +```python +system_message: str | None = None +``` + +**Resolver** — `eval_set_resolver.dart` already puts `system_message` into Task metadata. After this change, set it as a first-class field on the Task object instead. + +--- + +### 1.5 Add `sandbox_parameters` to Task + +Pass-through dict for sandbox plugin configuration. + +**Dart** — `packages/dataset_config_dart/lib/src/models/task.dart` +```dart +@JsonKey(name: 'sandbox_parameters') Map<String, dynamic>? sandboxParameters, +``` + +**Python** — `packages/dataset_config_python/src/dataset_config_python/models/task.py` +```python +sandbox_parameters: dict[str, Any] | None = None +``` + +**Parser** — read `sandbox_parameters` from task.yaml. + +--- + +### 1.6 Rename `task_func` → `func` + +The YAML parser already aliases `func` → `task_func`. This renames the model field to match. 
+ +**Dart** — `packages/dataset_config_dart/lib/src/models/task.dart` +- Rename `taskFunc` → `func` +- Update `@JsonKey(name: 'task_func')` → `@JsonKey(name: 'func')` +- Regenerate `.freezed.dart` / `.g.dart` + +**Python** — `packages/dataset_config_python/src/dataset_config_python/models/task.py` +- Rename `task_func` → `func` + +**Other files to update:** +- `packages/dataset_config_dart/lib/src/parsed_task.dart` — `taskFunc` field and `copyWith` +- `packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart` — variable names referencing `taskFunc` +- `packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart` — `tc.taskFunc` +- `packages/devals_cli/lib/src/dataset/dry_run.dart` — references `task_func` +- `packages/dash_evals/src/dash_evals/runner/json_runner.py` — `task_def.get("task_func")` +- `packages/dataset_config_python/tests/test_config.py` — Task construction with `task_func=` +- `tool/config_parity/` — both `resolve_dart.dart` and `resolve_python.py` + +--- + +## 2. Parser/Resolver Changes + +### 2.1 Support `module:task` syntax + +Task function references can use `module.path:function_name` format. + +**Python** — `packages/dash_evals/src/dash_evals/runner/json_runner.py` +- Update `_resolve_task_func()` to split on `:` and import the module, then get the function by attribute name. + +**Dart parser** — `yaml_parser.dart` L53 already reads `func` as a string. No Dart change needed — the module resolution happens in the Python runner. + +--- + +### 2.2 Make sandbox registry configurable + +The hardcoded `kSandboxRegistry` and `kSdkChannels` in `eval_set_resolver.dart` (lines 25-42) need to become data-driven. + +**Approach:** +1. Move `kSandboxRegistry` and `kSdkChannels` out of the resolver +2. Add an optional `sandbox_registry` parameter to `EvalSetResolver.resolve()`, or make it a field on the resolver +3. The consuming project (dash_evals CLI) passes its sandbox registry when calling the resolver +4. 
Default to an empty registry if none provided (no sandbox resolution) + +**Files:** +- `packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart` — extract constants, add parameter +- `packages/devals_cli/` — pass the Flutter-specific registry when calling the resolver +- Python resolver (`packages/dataset_config_python/src/dataset_config_python/resolver.py`) — mirror the same approach + +--- + +### 2.3 Workspace: use native Inspect fields + +The `workspace` YAML key stays as parser sugar but resolves into Inspect's native `Sample.files` and `Sample.setup`. + +**Current behavior** (`eval_set_resolver.dart` L132-141): +```dart +if (workspace != null && isContainer) { + files = {...?files, '/workspace': workspace}; + setup = setup ?? 'cd /workspace && flutter pub get'; + enriched['workspace'] = '/workspace'; +} +``` + +**Change:** +- Make the auto-generated `setup` command configurable. Options: + - Add a `workspace_setup` field to Task YAML (e.g. `workspace_setup: "cd /workspace && npm install"`) + - Or: only auto-generate setup for tasks that have a Flutter-specific tag/metadata + - Or: remove auto-generation entirely; require the task author to specify `setup` if needed +- The resolver should still map `workspace` → `Sample.files['/workspace']`, but not assume Flutter. + +**Files:** +- `packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart` — update workspace → files mapping +- `packages/dataset_config_python/src/dataset_config_python/resolver.py` — mirror + +--- + +## 3. Tag-Based Filtering + +### 3.1 New `TagFilter` model + +**Dart** — new file `packages/dataset_config_dart/lib/src/models/tag_filter.dart` +```dart +@freezed +sealed class TagFilter with _$TagFilter { + const factory TagFilter({ + @JsonKey(name: 'include_tags') List<String>? includeTags, + @JsonKey(name: 'exclude_tags') List<String>? 
excludeTags, + }) = _TagFilter; + + factory TagFilter.fromJson(Map<String, dynamic> json) => + _$TagFilterFromJson(json); +} +``` + +**Python** — new file `packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py` +```python +class TagFilter(BaseModel): + include_tags: list[str] | None = None + exclude_tags: list[str] | None = None +``` + +**Shared matching function** (add to both languages; the parameter is named `tag_filter` to avoid shadowing Python's built-in `filter`): +```python +def matches_filter(item_tags: list[str], tag_filter: TagFilter) -> bool: + if tag_filter.include_tags and not all(t in item_tags for t in tag_filter.include_tags): + return False + if tag_filter.exclude_tags and any(t in item_tags for t in tag_filter.exclude_tags): + return False + return True +``` + +### 3.2 Add filters to Job and Task + +**Job model:** +- `taskFilters: TagFilter?` / `task_filters: TagFilter | None` +- `sampleFilters: TagFilter?` / `sample_filters: TagFilter | None` + +**Task YAML (parser-level, not model):** +- `variant_filters: TagFilter?` — parsed from task.yaml, stored on `ParsedTask` + +### 3.3 Apply filters in resolver + +In `_expandTaskConfigs()` (`eval_set_resolver.dart` L418-493), add filtering steps: + +1. **Task filtering** (after L431): if `job.taskFilters` is set, check `taskConfig.metadata['tags']` against the filter +2. **Sample filtering** (after L460): if `job.sampleFilters` is set, filter samples by `sample.metadata['tags']` +3. **Variant filtering** (after L440): if `taskConfig.variantFilters` is set, check variant metadata tags + +These run alongside (not replacing) the existing ID-based filters. + +--- + +## 4. 
File Index + +All files that need modification, grouped by package: + +### `dataset_config_dart` +| File | Changes | +|---|---| +| `lib/src/models/job.dart` | Add `description`, `imagePrefix`, `taskFilters`, `sampleFilters` | +| `lib/src/models/job.dart` (JobTask) | Add `args` | +| `lib/src/models/task.dart` | Rename `taskFunc` → `func`, add `systemMessage`, `sandboxParameters` | +| `lib/src/models/tag_filter.dart` | **New file** — `TagFilter` model | +| `lib/src/models/models.dart` | Export `tag_filter.dart` | +| `lib/src/parsed_task.dart` | Rename `taskFunc` → `func`, add `variantFilters` | +| `lib/src/parsers/yaml_parser.dart` | Read new fields from YAML | +| `lib/src/resolvers/eval_set_resolver.dart` | Configurable sandbox registry, tag filtering, workspace setup | +| `test/` | Update tests for renamed fields and new features | + +### `dataset_config_python` +| File | Changes | +|---|---| +| `models/job.py` | Add `description`, `image_prefix`, `task_filters`, `sample_filters` | +| `models/job.py` (JobTask) | Add `args` | +| `models/task.py` | Rename `task_func` → `func`, add `system_message`, `sandbox_parameters` | +| `models/tag_filter.py` | **New file** — `TagFilter` model | +| `models/__init__.py` | Export `TagFilter` | +| `parser.py` | Read new fields from YAML | +| `resolver.py` | Configurable sandbox registry, tag filtering, workspace setup | +| `tests/test_config.py` | Update tests | + +### `dash_evals` (Python runner) +| File | Changes | +|---|---| +| `runner/json_runner.py` | `task_func` → `func`, `module:task` syntax support | + +### `devals_cli` (Dart CLI) +| File | Changes | +|---|---| +| `lib/src/dataset/dry_run.dart` | `task_func` → `func` references | + +### Other +| File | Changes | +|---|---| +| `tool/config_parity/` | Update both resolve scripts for renamed fields | +| `docs/reference/yaml_config.md` | Already updated | +| `CHANGELOG.md` | Already updated | +| `docs/guides/config.md` | Update after implementation | + +--- + +## 5. 
Verification + +### Automated +- Run `dart test` in `dataset_config_dart` +- Run `pytest` in `dataset_config_python` +- Run `tool/config_parity` to verify Dart/Python output parity +- Run `dart analyze` across workspace + +### Manual +- Verify `make html` in `docs/` builds without new errors +- Verify a sample job YAML with the new fields parses correctly +- Verify tag filtering produces expected task/sample subsets diff --git a/packages/dash_evals/src/dash_evals/runner/json_runner.py b/packages/dash_evals/src/dash_evals/runner/json_runner.py index a5d7a5b..7db7e89 100644 --- a/packages/dash_evals/src/dash_evals/runner/json_runner.py +++ b/packages/dash_evals/src/dash_evals/runner/json_runner.py @@ -146,13 +146,13 @@ def _run_single_manifest(manifest: dict) -> bool: task_instances: list[inspect_ai.Task] = [] for task_def in task_defs: - task_func_name = task_def.get("task_func") + task_func_name = task_def.get("func") task_name = task_def.get("name", task_func_name or "(unknown)") if not task_func_name: # Mode 2: hydrate directly from JSON (future) job_logger.warning( - f" ⚠ {task_name}: no task_func — Mode 2 hydration not yet supported" + f" ⚠ {task_name}: no func — Mode 2 hydration not yet supported" ) continue diff --git a/packages/dataset_config_dart/lib/src/models/context_file.g.dart b/packages/dataset_config_dart/lib/src/models/context_file.g.dart index fcea90e..7489275 100644 --- a/packages/dataset_config_dart/lib/src/models/context_file.g.dart +++ b/packages/dataset_config_dart/lib/src/models/context_file.g.dart @@ -37,7 +37,7 @@ _ContextFile _$ContextFileFromJson(Map json) => _ContextFile( Map _$ContextFileToJson(_ContextFile instance) => { - 'metadata': instance.metadata.toJson(), + 'metadata': instance.metadata, 'content': instance.content, 'file_path': instance.filePath, }; diff --git a/packages/dataset_config_dart/lib/src/models/dataset.g.dart b/packages/dataset_config_dart/lib/src/models/dataset.g.dart index 0b281d8..a3c87a3 100644 --- 
a/packages/dataset_config_dart/lib/src/models/dataset.g.dart +++ b/packages/dataset_config_dart/lib/src/models/dataset.g.dart @@ -18,7 +18,7 @@ _Dataset _$DatasetFromJson(Map json) => _Dataset( ); Map _$DatasetToJson(_Dataset instance) => { - 'samples': instance.samples.map((e) => e.toJson()).toList(), + 'samples': instance.samples, 'name': instance.name, 'location': instance.location, 'shuffled': instance.shuffled, diff --git a/packages/dataset_config_dart/lib/src/models/eval_log.g.dart b/packages/dataset_config_dart/lib/src/models/eval_log.g.dart index f6fa452..d55efb0 100644 --- a/packages/dataset_config_dart/lib/src/models/eval_log.g.dart +++ b/packages/dataset_config_dart/lib/src/models/eval_log.g.dart @@ -39,17 +39,17 @@ _EvalLog _$EvalLogFromJson(Map json) => _EvalLog( Map _$EvalLogToJson(_EvalLog instance) => { 'version': instance.version, 'status': instance.status, - 'eval': instance.eval.toJson(), - 'plan': instance.plan?.toJson(), - 'results': instance.results?.toJson(), - 'stats': instance.stats?.toJson(), - 'error': instance.error?.toJson(), + 'eval': instance.eval, + 'plan': instance.plan, + 'results': instance.results, + 'stats': instance.stats, + 'error': instance.error, 'invalidated': instance.invalidated, - 'samples': instance.samples?.map((e) => e.toJson()).toList(), - 'reductions': instance.reductions?.map((e) => e.toJson()).toList(), + 'samples': instance.samples, + 'reductions': instance.reductions, 'location': instance.location, 'etag': instance.etag, - 'eval_set_info': instance.evalSetInfo?.toJson(), + 'eval_set_info': instance.evalSetInfo, }; _EvalSpec _$EvalSpecFromJson(Map json) => _EvalSpec( @@ -125,15 +125,15 @@ Map _$EvalSpecToJson(_EvalSpec instance) => { 'solver_args': instance.solverArgs, 'solver_args_passed': instance.solverArgsPassed, 'tags': instance.tags, - 'dataset': instance.dataset?.toJson(), + 'dataset': instance.dataset, 'sandbox': instance.sandbox, 'model': instance.model, - 'model_generate_config': 
instance.modelGenerateConfig?.toJson(), + 'model_generate_config': instance.modelGenerateConfig, 'model_base_url': instance.modelBaseUrl, 'model_args': instance.modelArgs, 'model_roles': instance.modelRoles, - 'config': instance.config.toJson(), - 'revision': instance.revision?.toJson(), + 'config': instance.config, + 'revision': instance.revision, 'packages': instance.packages, 'metadata': instance.metadata, 'scorers': instance.scorers, @@ -249,9 +249,9 @@ _EvalPlan _$EvalPlanFromJson(Map json) => _EvalPlan( Map _$EvalPlanToJson(_EvalPlan instance) => { 'name': instance.name, - 'steps': instance.steps.map((e) => e.toJson()).toList(), - 'finish': instance.finish?.toJson(), - 'config': instance.config.toJson(), + 'steps': instance.steps, + 'finish': instance.finish, + 'config': instance.config, }; _EvalPlanStep _$EvalPlanStepFromJson(Map json) => @@ -291,12 +291,10 @@ Map _$EvalResultsToJson(_EvalResults instance) => { 'total_samples': instance.totalSamples, 'completed_samples': instance.completedSamples, - 'early_stopping': instance.earlyStopping?.toJson(), - 'scores': instance.scores.map((e) => e.toJson()).toList(), + 'early_stopping': instance.earlyStopping, + 'scores': instance.scores, 'metadata': instance.metadata, - 'sample_reductions': instance.sampleReductions - ?.map((e) => e.toJson()) - .toList(), + 'sample_reductions': instance.sampleReductions, }; _EarlyStoppingSummary _$EarlyStoppingSummaryFromJson( @@ -338,7 +336,7 @@ Map _$EvalScoreToJson(_EvalScore instance) => 'scored_samples': instance.scoredSamples, 'unscored_samples': instance.unscoredSamples, 'params': instance.params, - 'metrics': instance.metrics.map((e) => e.toJson()).toList(), + 'metrics': instance.metrics, 'metadata': instance.metadata, }; @@ -372,7 +370,7 @@ Map _$EvalSampleReductionsToJson( ) => { 'scorer': instance.scorer, 'reducer': instance.reducer, - 'samples': instance.samples.map((e) => e.toJson()).toList(), + 'samples': instance.samples, }; _EvalStats _$EvalStatsFromJson(Map json) 
=> _EvalStats( @@ -389,7 +387,7 @@ Map _$EvalStatsToJson(_EvalStats instance) => { 'started_at': instance.startedAt, 'completed_at': instance.completedAt, - 'model_usage': instance.modelUsage.map((k, e) => MapEntry(k, e.toJson())), + 'model_usage': instance.modelUsage, }; _EvalError _$EvalErrorFromJson(Map json) => _EvalError( @@ -470,22 +468,22 @@ Map _$EvalSampleToJson(_EvalSample instance) => 'sandbox': instance.sandbox, 'files': instance.files, 'setup': instance.setup, - 'messages': instance.messages.map((e) => e.toJson()).toList(), - 'output': instance.output.toJson(), - 'scores': instance.scores?.map((k, e) => MapEntry(k, e.toJson())), + 'messages': instance.messages, + 'output': instance.output, + 'scores': instance.scores, 'store': instance.store, 'events': instance.events, - 'model_usage': instance.modelUsage.map((k, e) => MapEntry(k, e.toJson())), + 'model_usage': instance.modelUsage, 'started_at': instance.startedAt, 'completed_at': instance.completedAt, 'total_time': instance.totalTime, 'working_time': instance.workingTime, 'uuid': instance.uuid, - 'invalidation': instance.invalidation?.toJson(), - 'error': instance.error?.toJson(), - 'error_retries': instance.errorRetries?.map((e) => e.toJson()).toList(), + 'invalidation': instance.invalidation, + 'error': instance.error, + 'error_retries': instance.errorRetries, 'attachments': instance.attachments, - 'limit': instance.limit?.toJson(), + 'limit': instance.limit, }; _ModelOutput _$ModelOutputFromJson(Map json) => _ModelOutput( @@ -511,14 +509,14 @@ _ModelOutput _$ModelOutputFromJson(Map json) => _ModelOutput( Map _$ModelOutputToJson(_ModelOutput instance) => { 'model': instance.model, - 'choices': instance.choices.map((e) => e.toJson()).toList(), - 'usage': instance.usage?.toJson(), + 'choices': instance.choices, + 'usage': instance.usage, 'completion': instance.completion, 'stop_reason': instance.stopReason, 'time': instance.time, 'metadata': instance.metadata, 'error': instance.error, - 'message': 
instance.message?.toJson(), + 'message': instance.message, }; _ChatCompletionChoice _$ChatCompletionChoiceFromJson( @@ -536,9 +534,9 @@ _ChatCompletionChoice _$ChatCompletionChoiceFromJson( Map _$ChatCompletionChoiceToJson( _ChatCompletionChoice instance, ) => { - 'message': instance.message.toJson(), + 'message': instance.message, 'stop_reason': instance.stopReason, - 'logprobs': instance.logprobs?.toJson(), + 'logprobs': instance.logprobs, }; _ModelUsage _$ModelUsageFromJson(Map json) => _ModelUsage( @@ -620,7 +618,7 @@ Map _$ChatMessageAssistantToJson( 'source': instance.source, 'metadata': instance.metadata, 'role': instance.role, - 'tool_calls': instance.toolCalls?.map((e) => e.toJson()).toList(), + 'tool_calls': instance.toolCalls, 'model': instance.model, }; @@ -647,7 +645,7 @@ Map _$ChatMessageToolToJson(ChatMessageTool instance) => 'role': instance.role, 'tool_call_id': instance.toolCallId, 'function': instance.function, - 'error': instance.error?.toJson(), + 'error': instance.error, }; ContentText _$ContentTextFromJson(Map json) => ContentText( @@ -932,7 +930,7 @@ _EvalSetInfo _$EvalSetInfoFromJson(Map json) => _EvalSetInfo( Map _$EvalSetInfoToJson(_EvalSetInfo instance) => { 'eval_set_id': instance.evalSetId, - 'tasks': instance.tasks.map((e) => e.toJson()).toList(), + 'tasks': instance.tasks, }; _EvalSetTask _$EvalSetTaskFromJson(Map json) => _EvalSetTask( diff --git a/packages/dataset_config_dart/lib/src/models/eval_set.g.dart b/packages/dataset_config_dart/lib/src/models/eval_set.g.dart index 7b0db55..4e91dab 100644 --- a/packages/dataset_config_dart/lib/src/models/eval_set.g.dart +++ b/packages/dataset_config_dart/lib/src/models/eval_set.g.dart @@ -64,7 +64,7 @@ _EvalSet _$EvalSetFromJson(Map json) => _EvalSet( ); Map _$EvalSetToJson(_EvalSet instance) => { - 'tasks': instance.tasks.map((e) => e.toJson()).toList(), + 'tasks': instance.tasks, 'log_dir': instance.logDir, 'retry_attempts': instance.retryAttempts, 'retry_wait': instance.retryWait, diff 
--git a/packages/dataset_config_dart/lib/src/models/job.dart b/packages/dataset_config_dart/lib/src/models/job.dart index 800f19c..0d8f49d 100644 --- a/packages/dataset_config_dart/lib/src/models/job.dart +++ b/packages/dataset_config_dart/lib/src/models/job.dart @@ -1,4 +1,5 @@ import 'package:freezed_annotation/freezed_annotation.dart'; +import 'tag_filter.dart'; part 'job.freezed.dart'; part 'job.g.dart'; @@ -45,6 +46,14 @@ sealed class Job with _$Job { // Core job settings // ------------------------------------------------------------------ + /// Human-readable description of this job. + String? description, + + /// Registry URL prefix prepended to image names during sandbox resolution. + /// + /// Example: `us-central1-docker.pkg.dev/project/repo/` + @JsonKey(name: 'image_prefix') String? imagePrefix, + /// Directory to write evaluation logs to. @JsonKey(name: 'log_dir') required String logDir, @@ -233,6 +242,16 @@ sealed class Job with _$Job { /// /// Per-task overrides (from `task.yaml`) take precedence. @JsonKey(name: 'task_defaults') Map? taskDefaults, + + // ------------------------------------------------------------------ + // Tag-based filtering + // ------------------------------------------------------------------ + + /// Tag filters applied to tasks. + @JsonKey(name: 'task_filters') TagFilter? taskFilters, + + /// Tag filters applied to samples. + @JsonKey(name: 'sample_filters') TagFilter? sampleFilters, }) = _Job; factory Job.fromJson(Map json) => _$JobFromJson(json); @@ -256,6 +275,9 @@ sealed class JobTask with _$JobTask { /// Override system message for this task. @JsonKey(name: 'system_message') String? systemMessage, + + /// Per-task argument overrides passed to the task function. + @JsonKey(name: 'args') Map? 
args, }) = _JobTask; factory JobTask.fromJson(Map json) => @@ -274,6 +296,7 @@ sealed class JobTask with _$JobTask { includeSamples: (data['include-samples'] as List?)?.cast(), excludeSamples: (data['exclude-samples'] as List?)?.cast(), systemMessage: data['system_message'] as String?, + args: (data['args'] as Map?)?.cast(), ); } } diff --git a/packages/dataset_config_dart/lib/src/models/job.freezed.dart b/packages/dataset_config_dart/lib/src/models/job.freezed.dart index e249877..4b955bd 100644 --- a/packages/dataset_config_dart/lib/src/models/job.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/job.freezed.dart @@ -18,7 +18,11 @@ mixin _$Job { // ------------------------------------------------------------------ // Core job settings // ------------------------------------------------------------------ -/// Directory to write evaluation logs to. +/// Human-readable description of this job. + String? get description;/// Registry URL prefix prepended to image names during sandbox resolution. +/// +/// Example: `us-central1-docker.pkg.dev/project/repo/` +@JsonKey(name: 'image_prefix') String? get imagePrefix;/// Directory to write evaluation logs to. @JsonKey(name: 'log_dir') String get logDir;/// Sandbox type: `'local'`, `'docker'`, or `'podman'`. @JsonKey(name: 'sandbox_type') String get sandboxType;/// Maximum concurrent API connections. @JsonKey(name: 'max_connections') int get maxConnections;/// Models to run. `null` means use defaults from registries. @@ -91,7 +95,12 @@ mixin _$Job { @JsonKey(name: 'eval_set_overrides') Map? get evalSetOverrides;/// Default `Task` kwargs applied to every task in this job. /// /// Per-task overrides (from `task.yaml`) take precedence. -@JsonKey(name: 'task_defaults') Map? get taskDefaults; +@JsonKey(name: 'task_defaults') Map? 
get taskDefaults;// ------------------------------------------------------------------ +// Tag-based filtering +// ------------------------------------------------------------------ +/// Tag filters applied to tasks. +@JsonKey(name: 'task_filters') TagFilter? get taskFilters;/// Tag filters applied to samples. +@JsonKey(name: 'sample_filters') TagFilter? get sampleFilters; /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. @JsonKey(includeFromJson: false, includeToJson: false) @@ -104,16 +113,16 @@ $JobCopyWith get copyWith => _$JobCopyWithImpl(this as Job, _$identity @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Job&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.sandboxType, sandboxType) || other.sandboxType == sandboxType)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other.models, models)&&const DeepCollectionEquality().equals(other.variants, variants)&&const DeepCollectionEquality().equals(other.taskPaths, taskPaths)&&const DeepCollectionEquality().equals(other.tasks, tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&(identical(other.retryAttempts, retryAttempts) || other.retryAttempts == retryAttempts)&&(identical(other.maxRetries, maxRetries) || other.maxRetries == maxRetries)&&(identical(other.retryWait, retryWait) || other.retryWait == retryWait)&&(identical(other.retryConnections, retryConnections) || other.retryConnections == retryConnections)&&(identical(other.retryCleanup, retryCleanup) || other.retryCleanup == retryCleanup)&&(identical(other.failOnError, failOnError) || other.failOnError == failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.retryOnError, retryOnError) || other.retryOnError == 
retryOnError)&&(identical(other.debugErrors, debugErrors) || other.debugErrors == debugErrors)&&(identical(other.maxSamples, maxSamples) || other.maxSamples == maxSamples)&&(identical(other.maxTasks, maxTasks) || other.maxTasks == maxTasks)&&(identical(other.maxSubprocesses, maxSubprocesses) || other.maxSubprocesses == maxSubprocesses)&&(identical(other.maxSandboxes, maxSandboxes) || other.maxSandboxes == maxSandboxes)&&(identical(other.logLevel, logLevel) || other.logLevel == logLevel)&&(identical(other.logLevelTranscript, logLevelTranscript) || other.logLevelTranscript == logLevelTranscript)&&(identical(other.logFormat, logFormat) || other.logFormat == logFormat)&&const DeepCollectionEquality().equals(other.tags, tags)&&const DeepCollectionEquality().equals(other.metadata, metadata)&&(identical(other.trace, trace) || other.trace == trace)&&(identical(other.display, display) || other.display == display)&&(identical(other.score, score) || other.score == score)&&const DeepCollectionEquality().equals(other.limit, limit)&&const DeepCollectionEquality().equals(other.sampleId, sampleId)&&const DeepCollectionEquality().equals(other.sampleShuffle, sampleShuffle)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.solver, solver)&&(identical(other.sandboxCleanup, sandboxCleanup) || other.sandboxCleanup == sandboxCleanup)&&(identical(other.modelBaseUrl, modelBaseUrl) || other.modelBaseUrl == modelBaseUrl)&&const DeepCollectionEquality().equals(other.modelArgs, modelArgs)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.taskArgs, taskArgs)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == 
timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.modelCostConfig, modelCostConfig)&&(identical(other.logSamples, logSamples) || other.logSamples == logSamples)&&(identical(other.logRealtime, logRealtime) || other.logRealtime == logRealtime)&&(identical(other.logImages, logImages) || other.logImages == logImages)&&(identical(other.logBuffer, logBuffer) || other.logBuffer == logBuffer)&&(identical(other.logShared, logShared) || other.logShared == logShared)&&(identical(other.bundleDir, bundleDir) || other.bundleDir == bundleDir)&&(identical(other.bundleOverwrite, bundleOverwrite) || other.bundleOverwrite == bundleOverwrite)&&(identical(other.logDirAllowDirty, logDirAllowDirty) || other.logDirAllowDirty == logDirAllowDirty)&&(identical(other.evalSetId, evalSetId) || other.evalSetId == evalSetId)&&const DeepCollectionEquality().equals(other.evalSetOverrides, evalSetOverrides)&&const DeepCollectionEquality().equals(other.taskDefaults, taskDefaults)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Job&&(identical(other.description, description) || other.description == description)&&(identical(other.imagePrefix, imagePrefix) || other.imagePrefix == imagePrefix)&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.sandboxType, sandboxType) || other.sandboxType == sandboxType)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other.models, models)&&const DeepCollectionEquality().equals(other.variants, variants)&&const DeepCollectionEquality().equals(other.taskPaths, taskPaths)&&const DeepCollectionEquality().equals(other.tasks, tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&(identical(other.retryAttempts, retryAttempts) || 
other.retryAttempts == retryAttempts)&&(identical(other.maxRetries, maxRetries) || other.maxRetries == maxRetries)&&(identical(other.retryWait, retryWait) || other.retryWait == retryWait)&&(identical(other.retryConnections, retryConnections) || other.retryConnections == retryConnections)&&(identical(other.retryCleanup, retryCleanup) || other.retryCleanup == retryCleanup)&&(identical(other.failOnError, failOnError) || other.failOnError == failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.retryOnError, retryOnError) || other.retryOnError == retryOnError)&&(identical(other.debugErrors, debugErrors) || other.debugErrors == debugErrors)&&(identical(other.maxSamples, maxSamples) || other.maxSamples == maxSamples)&&(identical(other.maxTasks, maxTasks) || other.maxTasks == maxTasks)&&(identical(other.maxSubprocesses, maxSubprocesses) || other.maxSubprocesses == maxSubprocesses)&&(identical(other.maxSandboxes, maxSandboxes) || other.maxSandboxes == maxSandboxes)&&(identical(other.logLevel, logLevel) || other.logLevel == logLevel)&&(identical(other.logLevelTranscript, logLevelTranscript) || other.logLevelTranscript == logLevelTranscript)&&(identical(other.logFormat, logFormat) || other.logFormat == logFormat)&&const DeepCollectionEquality().equals(other.tags, tags)&&const DeepCollectionEquality().equals(other.metadata, metadata)&&(identical(other.trace, trace) || other.trace == trace)&&(identical(other.display, display) || other.display == display)&&(identical(other.score, score) || other.score == score)&&const DeepCollectionEquality().equals(other.limit, limit)&&const DeepCollectionEquality().equals(other.sampleId, sampleId)&&const DeepCollectionEquality().equals(other.sampleShuffle, sampleShuffle)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.solver, 
solver)&&(identical(other.sandboxCleanup, sandboxCleanup) || other.sandboxCleanup == sandboxCleanup)&&(identical(other.modelBaseUrl, modelBaseUrl) || other.modelBaseUrl == modelBaseUrl)&&const DeepCollectionEquality().equals(other.modelArgs, modelArgs)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.taskArgs, taskArgs)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.modelCostConfig, modelCostConfig)&&(identical(other.logSamples, logSamples) || other.logSamples == logSamples)&&(identical(other.logRealtime, logRealtime) || other.logRealtime == logRealtime)&&(identical(other.logImages, logImages) || other.logImages == logImages)&&(identical(other.logBuffer, logBuffer) || other.logBuffer == logBuffer)&&(identical(other.logShared, logShared) || other.logShared == logShared)&&(identical(other.bundleDir, bundleDir) || other.bundleDir == bundleDir)&&(identical(other.bundleOverwrite, bundleOverwrite) || other.bundleOverwrite == bundleOverwrite)&&(identical(other.logDirAllowDirty, logDirAllowDirty) || other.logDirAllowDirty == logDirAllowDirty)&&(identical(other.evalSetId, evalSetId) || other.evalSetId == evalSetId)&&const DeepCollectionEquality().equals(other.evalSetOverrides, evalSetOverrides)&&const DeepCollectionEquality().equals(other.taskDefaults, taskDefaults)&&(identical(other.taskFilters, taskFilters) || other.taskFilters == taskFilters)&&(identical(other.sampleFilters, sampleFilters) || other.sampleFilters == sampleFilters)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => 
Object.hashAll([runtimeType,logDir,sandboxType,maxConnections,const DeepCollectionEquality().hash(models),const DeepCollectionEquality().hash(variants),const DeepCollectionEquality().hash(taskPaths),const DeepCollectionEquality().hash(tasks),saveExamples,retryAttempts,maxRetries,retryWait,retryConnections,retryCleanup,failOnError,continueOnFail,retryOnError,debugErrors,maxSamples,maxTasks,maxSubprocesses,maxSandboxes,logLevel,logLevelTranscript,logFormat,const DeepCollectionEquality().hash(tags),const DeepCollectionEquality().hash(metadata),trace,display,score,const DeepCollectionEquality().hash(limit),const DeepCollectionEquality().hash(sampleId),const DeepCollectionEquality().hash(sampleShuffle),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(solver),sandboxCleanup,modelBaseUrl,const DeepCollectionEquality().hash(modelArgs),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(taskArgs),messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(modelCostConfig),logSamples,logRealtime,logImages,logBuffer,logShared,bundleDir,bundleOverwrite,logDirAllowDirty,evalSetId,const DeepCollectionEquality().hash(evalSetOverrides),const DeepCollectionEquality().hash(taskDefaults)]); +int get hashCode => Object.hashAll([runtimeType,description,imagePrefix,logDir,sandboxType,maxConnections,const DeepCollectionEquality().hash(models),const DeepCollectionEquality().hash(variants),const DeepCollectionEquality().hash(taskPaths),const DeepCollectionEquality().hash(tasks),saveExamples,retryAttempts,maxRetries,retryWait,retryConnections,retryCleanup,failOnError,continueOnFail,retryOnError,debugErrors,maxSamples,maxTasks,maxSubprocesses,maxSandboxes,logLevel,logLevelTranscript,logFormat,const DeepCollectionEquality().hash(tags),const DeepCollectionEquality().hash(metadata),trace,display,score,const DeepCollectionEquality().hash(limit),const 
DeepCollectionEquality().hash(sampleId),const DeepCollectionEquality().hash(sampleShuffle),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(solver),sandboxCleanup,modelBaseUrl,const DeepCollectionEquality().hash(modelArgs),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(taskArgs),messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(modelCostConfig),logSamples,logRealtime,logImages,logBuffer,logShared,bundleDir,bundleOverwrite,logDirAllowDirty,evalSetId,const DeepCollectionEquality().hash(evalSetOverrides),const DeepCollectionEquality().hash(taskDefaults),taskFilters,sampleFilters]); @override String toString() { - return 'Job(logDir: $logDir, sandboxType: $sandboxType, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, retryAttempts: $retryAttempts, maxRetries: $maxRetries, retryWait: $retryWait, retryConnections: $retryConnections, retryCleanup: $retryCleanup, failOnError: $failOnError, continueOnFail: $continueOnFail, retryOnError: $retryOnError, debugErrors: $debugErrors, maxSamples: $maxSamples, maxTasks: $maxTasks, maxSubprocesses: $maxSubprocesses, maxSandboxes: $maxSandboxes, logLevel: $logLevel, logLevelTranscript: $logLevelTranscript, logFormat: $logFormat, tags: $tags, metadata: $metadata, trace: $trace, display: $display, score: $score, limit: $limit, sampleId: $sampleId, sampleShuffle: $sampleShuffle, epochs: $epochs, approval: $approval, solver: $solver, sandboxCleanup: $sandboxCleanup, modelBaseUrl: $modelBaseUrl, modelArgs: $modelArgs, modelRoles: $modelRoles, taskArgs: $taskArgs, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, modelCostConfig: $modelCostConfig, logSamples: $logSamples, logRealtime: $logRealtime, logImages: $logImages, 
logBuffer: $logBuffer, logShared: $logShared, bundleDir: $bundleDir, bundleOverwrite: $bundleOverwrite, logDirAllowDirty: $logDirAllowDirty, evalSetId: $evalSetId, evalSetOverrides: $evalSetOverrides, taskDefaults: $taskDefaults)'; + return 'Job(description: $description, imagePrefix: $imagePrefix, logDir: $logDir, sandboxType: $sandboxType, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, retryAttempts: $retryAttempts, maxRetries: $maxRetries, retryWait: $retryWait, retryConnections: $retryConnections, retryCleanup: $retryCleanup, failOnError: $failOnError, continueOnFail: $continueOnFail, retryOnError: $retryOnError, debugErrors: $debugErrors, maxSamples: $maxSamples, maxTasks: $maxTasks, maxSubprocesses: $maxSubprocesses, maxSandboxes: $maxSandboxes, logLevel: $logLevel, logLevelTranscript: $logLevelTranscript, logFormat: $logFormat, tags: $tags, metadata: $metadata, trace: $trace, display: $display, score: $score, limit: $limit, sampleId: $sampleId, sampleShuffle: $sampleShuffle, epochs: $epochs, approval: $approval, solver: $solver, sandboxCleanup: $sandboxCleanup, modelBaseUrl: $modelBaseUrl, modelArgs: $modelArgs, modelRoles: $modelRoles, taskArgs: $taskArgs, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, modelCostConfig: $modelCostConfig, logSamples: $logSamples, logRealtime: $logRealtime, logImages: $logImages, logBuffer: $logBuffer, logShared: $logShared, bundleDir: $bundleDir, bundleOverwrite: $bundleOverwrite, logDirAllowDirty: $logDirAllowDirty, evalSetId: $evalSetId, evalSetOverrides: $evalSetOverrides, taskDefaults: $taskDefaults, taskFilters: $taskFilters, sampleFilters: $sampleFilters)'; } @@ -124,11 +133,11 @@ abstract mixin class $JobCopyWith<$Res> { factory $JobCopyWith(Job value, $Res Function(Job) _then) = _$JobCopyWithImpl; @useResult $Res call({ -@JsonKey(name: 'log_dir') 
String logDir,@JsonKey(name: 'sandbox_type') String sandboxType,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples,@JsonKey(name: 'retry_attempts') int? retryAttempts,@JsonKey(name: 'max_retries') int? maxRetries,@JsonKey(name: 'retry_wait') double? retryWait,@JsonKey(name: 'retry_connections') double? retryConnections,@JsonKey(name: 'retry_cleanup') bool? retryCleanup,@JsonKey(name: 'fail_on_error') double? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'retry_on_error') int? retryOnError,@JsonKey(name: 'debug_errors') bool? debugErrors,@JsonKey(name: 'max_samples') int? maxSamples,@JsonKey(name: 'max_tasks') int? maxTasks,@JsonKey(name: 'max_subprocesses') int? maxSubprocesses,@JsonKey(name: 'max_sandboxes') int? maxSandboxes,@JsonKey(name: 'log_level') String? logLevel,@JsonKey(name: 'log_level_transcript') String? logLevelTranscript,@JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit,@JsonKey(name: 'sample_id') Object? sampleId,@JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver,@JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,@JsonKey(name: 'model_base_url') String? modelBaseUrl,@JsonKey(name: 'model_args') Map? modelArgs,@JsonKey(name: 'model_roles') Map? modelRoles,@JsonKey(name: 'task_args') Map? taskArgs,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'model_cost_config') Map? modelCostConfig,@JsonKey(name: 'log_samples') bool? logSamples,@JsonKey(name: 'log_realtime') bool? logRealtime,@JsonKey(name: 'log_images') bool? 
logImages,@JsonKey(name: 'log_buffer') int? logBuffer,@JsonKey(name: 'log_shared') int? logShared,@JsonKey(name: 'bundle_dir') String? bundleDir,@JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,@JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,@JsonKey(name: 'eval_set_id') String? evalSetId,@JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,@JsonKey(name: 'task_defaults') Map? taskDefaults + String? description,@JsonKey(name: 'image_prefix') String? imagePrefix,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'sandbox_type') String sandboxType,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples,@JsonKey(name: 'retry_attempts') int? retryAttempts,@JsonKey(name: 'max_retries') int? maxRetries,@JsonKey(name: 'retry_wait') double? retryWait,@JsonKey(name: 'retry_connections') double? retryConnections,@JsonKey(name: 'retry_cleanup') bool? retryCleanup,@JsonKey(name: 'fail_on_error') double? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'retry_on_error') int? retryOnError,@JsonKey(name: 'debug_errors') bool? debugErrors,@JsonKey(name: 'max_samples') int? maxSamples,@JsonKey(name: 'max_tasks') int? maxTasks,@JsonKey(name: 'max_subprocesses') int? maxSubprocesses,@JsonKey(name: 'max_sandboxes') int? maxSandboxes,@JsonKey(name: 'log_level') String? logLevel,@JsonKey(name: 'log_level_transcript') String? logLevelTranscript,@JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit,@JsonKey(name: 'sample_id') Object? sampleId,@JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver,@JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,@JsonKey(name: 'model_base_url') String? modelBaseUrl,@JsonKey(name: 'model_args') Map? 
modelArgs,@JsonKey(name: 'model_roles') Map? modelRoles,@JsonKey(name: 'task_args') Map? taskArgs,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'model_cost_config') Map? modelCostConfig,@JsonKey(name: 'log_samples') bool? logSamples,@JsonKey(name: 'log_realtime') bool? logRealtime,@JsonKey(name: 'log_images') bool? logImages,@JsonKey(name: 'log_buffer') int? logBuffer,@JsonKey(name: 'log_shared') int? logShared,@JsonKey(name: 'bundle_dir') String? bundleDir,@JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,@JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,@JsonKey(name: 'eval_set_id') String? evalSetId,@JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,@JsonKey(name: 'task_defaults') Map? taskDefaults,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters }); - +$TagFilterCopyWith<$Res>? get taskFilters;$TagFilterCopyWith<$Res>? get sampleFilters; } /// @nodoc @@ -141,9 +150,11 @@ class _$JobCopyWithImpl<$Res> /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? logDir = null,Object? sandboxType = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? retryAttempts = freezed,Object? maxRetries = freezed,Object? retryWait = freezed,Object? retryConnections = freezed,Object? retryCleanup = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? retryOnError = freezed,Object? debugErrors = freezed,Object? maxSamples = freezed,Object? maxTasks = freezed,Object? maxSubprocesses = freezed,Object? maxSandboxes = freezed,Object? 
logLevel = freezed,Object? logLevelTranscript = freezed,Object? logFormat = freezed,Object? tags = freezed,Object? metadata = freezed,Object? trace = freezed,Object? display = freezed,Object? score = freezed,Object? limit = freezed,Object? sampleId = freezed,Object? sampleShuffle = freezed,Object? epochs = freezed,Object? approval = freezed,Object? solver = freezed,Object? sandboxCleanup = freezed,Object? modelBaseUrl = freezed,Object? modelArgs = freezed,Object? modelRoles = freezed,Object? taskArgs = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? modelCostConfig = freezed,Object? logSamples = freezed,Object? logRealtime = freezed,Object? logImages = freezed,Object? logBuffer = freezed,Object? logShared = freezed,Object? bundleDir = freezed,Object? bundleOverwrite = freezed,Object? logDirAllowDirty = freezed,Object? evalSetId = freezed,Object? evalSetOverrides = freezed,Object? taskDefaults = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? description = freezed,Object? imagePrefix = freezed,Object? logDir = null,Object? sandboxType = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? retryAttempts = freezed,Object? maxRetries = freezed,Object? retryWait = freezed,Object? retryConnections = freezed,Object? retryCleanup = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? retryOnError = freezed,Object? debugErrors = freezed,Object? maxSamples = freezed,Object? maxTasks = freezed,Object? maxSubprocesses = freezed,Object? maxSandboxes = freezed,Object? logLevel = freezed,Object? logLevelTranscript = freezed,Object? logFormat = freezed,Object? tags = freezed,Object? metadata = freezed,Object? trace = freezed,Object? display = freezed,Object? score = freezed,Object? 
limit = freezed,Object? sampleId = freezed,Object? sampleShuffle = freezed,Object? epochs = freezed,Object? approval = freezed,Object? solver = freezed,Object? sandboxCleanup = freezed,Object? modelBaseUrl = freezed,Object? modelArgs = freezed,Object? modelRoles = freezed,Object? taskArgs = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? modelCostConfig = freezed,Object? logSamples = freezed,Object? logRealtime = freezed,Object? logImages = freezed,Object? logBuffer = freezed,Object? logShared = freezed,Object? bundleDir = freezed,Object? bundleOverwrite = freezed,Object? logDirAllowDirty = freezed,Object? evalSetId = freezed,Object? evalSetOverrides = freezed,Object? taskDefaults = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { return _then(_self.copyWith( -logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable +description: freezed == description ? _self.description : description // ignore: cast_nullable_to_non_nullable +as String?,imagePrefix: freezed == imagePrefix ? _self.imagePrefix : imagePrefix // ignore: cast_nullable_to_non_nullable +as String?,logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable as String,sandboxType: null == sandboxType ? _self.sandboxType : sandboxType // ignore: cast_nullable_to_non_nullable as String,maxConnections: null == maxConnections ? _self.maxConnections : maxConnections // ignore: cast_nullable_to_non_nullable as int,models: freezed == models ? _self.models : models // ignore: cast_nullable_to_non_nullable @@ -194,10 +205,36 @@ as bool?,logDirAllowDirty: freezed == logDirAllowDirty ? _self.logDirAllowDirty as bool?,evalSetId: freezed == evalSetId ? _self.evalSetId : evalSetId // ignore: cast_nullable_to_non_nullable as String?,evalSetOverrides: freezed == evalSetOverrides ? 
_self.evalSetOverrides : evalSetOverrides // ignore: cast_nullable_to_non_nullable as Map?,taskDefaults: freezed == taskDefaults ? _self.taskDefaults : taskDefaults // ignore: cast_nullable_to_non_nullable -as Map?, +as Map?,taskFilters: freezed == taskFilters ? _self.taskFilters : taskFilters // ignore: cast_nullable_to_non_nullable +as TagFilter?,sampleFilters: freezed == sampleFilters ? _self.sampleFilters : sampleFilters // ignore: cast_nullable_to_non_nullable +as TagFilter?, )); } - +/// Create a copy of Job +/// with the given fields replaced by the non-null parameter values. +@override +@pragma('vm:prefer-inline') +$TagFilterCopyWith<$Res>? get taskFilters { + if (_self.taskFilters == null) { + return null; + } + + return $TagFilterCopyWith<$Res>(_self.taskFilters!, (value) { + return _then(_self.copyWith(taskFilters: value)); + }); +}/// Create a copy of Job +/// with the given fields replaced by the non-null parameter values. +@override +@pragma('vm:prefer-inline') +$TagFilterCopyWith<$Res>? get sampleFilters { + if (_self.sampleFilters == null) { + return null; + } + + return $TagFilterCopyWith<$Res>(_self.sampleFilters!, (value) { + return _then(_self.copyWith(sampleFilters: value)); + }); +} } @@ -276,10 +313,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function(@JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? 
continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String? description, @JsonKey(name: 'image_prefix') String? 
imagePrefix, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? 
logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Job() when $default != null: -return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults);case _: +return 
$default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults,_that.taskFilters,_that.sampleFilters);case _: return orElse(); } @@ -297,10 +334,10 @@ return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models /// } /// ``` -@optionalTypeArgs TResult when(TResult Function(@JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? 
retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String? description, @JsonKey(name: 'image_prefix') String? 
imagePrefix, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? 
logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters) $default,) {final _that = this; switch (_that) { case _Job(): -return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults);} +return 
$default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults,_that.taskFilters,_that.sampleFilters);} } /// A variant of `when` that fallback to returning `null` /// @@ -314,10 +351,10 @@ return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function(@JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? 
retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( String? description, @JsonKey(name: 'image_prefix') String? 
imagePrefix, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? 
logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,) {final _that = this; switch (_that) { case _Job() when $default != null: -return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults);case _: +return 
$default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults,_that.taskFilters,_that.sampleFilters);case _: return null; } @@ -329,12 +366,18 @@ return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models @JsonSerializable() class _Job implements Job { - const _Job({@JsonKey(name: 'log_dir') required this.logDir, @JsonKey(name: 'sandbox_type') this.sandboxType = 'local', @JsonKey(name: 'max_connections') this.maxConnections = 10, final List? models, final Map>? variants, @JsonKey(name: 'task_paths') final List? taskPaths, final Map? 
tasks, @JsonKey(name: 'save_examples') this.saveExamples = false, @JsonKey(name: 'retry_attempts') this.retryAttempts, @JsonKey(name: 'max_retries') this.maxRetries, @JsonKey(name: 'retry_wait') this.retryWait, @JsonKey(name: 'retry_connections') this.retryConnections, @JsonKey(name: 'retry_cleanup') this.retryCleanup, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'retry_on_error') this.retryOnError, @JsonKey(name: 'debug_errors') this.debugErrors, @JsonKey(name: 'max_samples') this.maxSamples, @JsonKey(name: 'max_tasks') this.maxTasks, @JsonKey(name: 'max_subprocesses') this.maxSubprocesses, @JsonKey(name: 'max_sandboxes') this.maxSandboxes, @JsonKey(name: 'log_level') this.logLevel, @JsonKey(name: 'log_level_transcript') this.logLevelTranscript, @JsonKey(name: 'log_format') this.logFormat, final List? tags, final Map? metadata, this.trace, this.display, this.score, this.limit, @JsonKey(name: 'sample_id') this.sampleId, @JsonKey(name: 'sample_shuffle') this.sampleShuffle, this.epochs, this.approval, this.solver, @JsonKey(name: 'sandbox_cleanup') this.sandboxCleanup, @JsonKey(name: 'model_base_url') this.modelBaseUrl, @JsonKey(name: 'model_args') final Map? modelArgs, @JsonKey(name: 'model_roles') final Map? modelRoles, @JsonKey(name: 'task_args') final Map? taskArgs, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'model_cost_config') final Map? 
modelCostConfig, @JsonKey(name: 'log_samples') this.logSamples, @JsonKey(name: 'log_realtime') this.logRealtime, @JsonKey(name: 'log_images') this.logImages, @JsonKey(name: 'log_buffer') this.logBuffer, @JsonKey(name: 'log_shared') this.logShared, @JsonKey(name: 'bundle_dir') this.bundleDir, @JsonKey(name: 'bundle_overwrite') this.bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') this.logDirAllowDirty, @JsonKey(name: 'eval_set_id') this.evalSetId, @JsonKey(name: 'eval_set_overrides') final Map? evalSetOverrides, @JsonKey(name: 'task_defaults') final Map? taskDefaults}): _models = models,_variants = variants,_taskPaths = taskPaths,_tasks = tasks,_tags = tags,_metadata = metadata,_modelArgs = modelArgs,_modelRoles = modelRoles,_taskArgs = taskArgs,_modelCostConfig = modelCostConfig,_evalSetOverrides = evalSetOverrides,_taskDefaults = taskDefaults; + const _Job({this.description, @JsonKey(name: 'image_prefix') this.imagePrefix, @JsonKey(name: 'log_dir') required this.logDir, @JsonKey(name: 'sandbox_type') this.sandboxType = 'local', @JsonKey(name: 'max_connections') this.maxConnections = 10, final List? models, final Map>? variants, @JsonKey(name: 'task_paths') final List? taskPaths, final Map? 
tasks, @JsonKey(name: 'save_examples') this.saveExamples = false, @JsonKey(name: 'retry_attempts') this.retryAttempts, @JsonKey(name: 'max_retries') this.maxRetries, @JsonKey(name: 'retry_wait') this.retryWait, @JsonKey(name: 'retry_connections') this.retryConnections, @JsonKey(name: 'retry_cleanup') this.retryCleanup, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'retry_on_error') this.retryOnError, @JsonKey(name: 'debug_errors') this.debugErrors, @JsonKey(name: 'max_samples') this.maxSamples, @JsonKey(name: 'max_tasks') this.maxTasks, @JsonKey(name: 'max_subprocesses') this.maxSubprocesses, @JsonKey(name: 'max_sandboxes') this.maxSandboxes, @JsonKey(name: 'log_level') this.logLevel, @JsonKey(name: 'log_level_transcript') this.logLevelTranscript, @JsonKey(name: 'log_format') this.logFormat, final List? tags, final Map? metadata, this.trace, this.display, this.score, this.limit, @JsonKey(name: 'sample_id') this.sampleId, @JsonKey(name: 'sample_shuffle') this.sampleShuffle, this.epochs, this.approval, this.solver, @JsonKey(name: 'sandbox_cleanup') this.sandboxCleanup, @JsonKey(name: 'model_base_url') this.modelBaseUrl, @JsonKey(name: 'model_args') final Map? modelArgs, @JsonKey(name: 'model_roles') final Map? modelRoles, @JsonKey(name: 'task_args') final Map? taskArgs, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'model_cost_config') final Map? 
modelCostConfig, @JsonKey(name: 'log_samples') this.logSamples, @JsonKey(name: 'log_realtime') this.logRealtime, @JsonKey(name: 'log_images') this.logImages, @JsonKey(name: 'log_buffer') this.logBuffer, @JsonKey(name: 'log_shared') this.logShared, @JsonKey(name: 'bundle_dir') this.bundleDir, @JsonKey(name: 'bundle_overwrite') this.bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') this.logDirAllowDirty, @JsonKey(name: 'eval_set_id') this.evalSetId, @JsonKey(name: 'eval_set_overrides') final Map? evalSetOverrides, @JsonKey(name: 'task_defaults') final Map? taskDefaults, @JsonKey(name: 'task_filters') this.taskFilters, @JsonKey(name: 'sample_filters') this.sampleFilters}): _models = models,_variants = variants,_taskPaths = taskPaths,_tasks = tasks,_tags = tags,_metadata = metadata,_modelArgs = modelArgs,_modelRoles = modelRoles,_taskArgs = taskArgs,_modelCostConfig = modelCostConfig,_evalSetOverrides = evalSetOverrides,_taskDefaults = taskDefaults; factory _Job.fromJson(Map json) => _$JobFromJson(json); // ------------------------------------------------------------------ // Core job settings // ------------------------------------------------------------------ +/// Human-readable description of this job. +@override final String? description; +/// Registry URL prefix prepended to image names during sandbox resolution. +/// +/// Example: `us-central1-docker.pkg.dev/project/repo/` +@override@JsonKey(name: 'image_prefix') final String? imagePrefix; /// Directory to write evaluation logs to. @override@JsonKey(name: 'log_dir') final String logDir; /// Sandbox type: `'local'`, `'docker'`, or `'podman'`. @@ -583,6 +626,13 @@ class _Job implements Job { return EqualUnmodifiableMapView(value); } +// ------------------------------------------------------------------ +// Tag-based filtering +// ------------------------------------------------------------------ +/// Tag filters applied to tasks. +@override@JsonKey(name: 'task_filters') final TagFilter? 
taskFilters; +/// Tag filters applied to samples. +@override@JsonKey(name: 'sample_filters') final TagFilter? sampleFilters; /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. @@ -597,16 +647,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Job&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.sandboxType, sandboxType) || other.sandboxType == sandboxType)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other._models, _models)&&const DeepCollectionEquality().equals(other._variants, _variants)&&const DeepCollectionEquality().equals(other._taskPaths, _taskPaths)&&const DeepCollectionEquality().equals(other._tasks, _tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&(identical(other.retryAttempts, retryAttempts) || other.retryAttempts == retryAttempts)&&(identical(other.maxRetries, maxRetries) || other.maxRetries == maxRetries)&&(identical(other.retryWait, retryWait) || other.retryWait == retryWait)&&(identical(other.retryConnections, retryConnections) || other.retryConnections == retryConnections)&&(identical(other.retryCleanup, retryCleanup) || other.retryCleanup == retryCleanup)&&(identical(other.failOnError, failOnError) || other.failOnError == failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.retryOnError, retryOnError) || other.retryOnError == retryOnError)&&(identical(other.debugErrors, debugErrors) || other.debugErrors == debugErrors)&&(identical(other.maxSamples, maxSamples) || other.maxSamples == maxSamples)&&(identical(other.maxTasks, maxTasks) || other.maxTasks == maxTasks)&&(identical(other.maxSubprocesses, maxSubprocesses) || other.maxSubprocesses == 
maxSubprocesses)&&(identical(other.maxSandboxes, maxSandboxes) || other.maxSandboxes == maxSandboxes)&&(identical(other.logLevel, logLevel) || other.logLevel == logLevel)&&(identical(other.logLevelTranscript, logLevelTranscript) || other.logLevelTranscript == logLevelTranscript)&&(identical(other.logFormat, logFormat) || other.logFormat == logFormat)&&const DeepCollectionEquality().equals(other._tags, _tags)&&const DeepCollectionEquality().equals(other._metadata, _metadata)&&(identical(other.trace, trace) || other.trace == trace)&&(identical(other.display, display) || other.display == display)&&(identical(other.score, score) || other.score == score)&&const DeepCollectionEquality().equals(other.limit, limit)&&const DeepCollectionEquality().equals(other.sampleId, sampleId)&&const DeepCollectionEquality().equals(other.sampleShuffle, sampleShuffle)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.solver, solver)&&(identical(other.sandboxCleanup, sandboxCleanup) || other.sandboxCleanup == sandboxCleanup)&&(identical(other.modelBaseUrl, modelBaseUrl) || other.modelBaseUrl == modelBaseUrl)&&const DeepCollectionEquality().equals(other._modelArgs, _modelArgs)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other._taskArgs, _taskArgs)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other._modelCostConfig, _modelCostConfig)&&(identical(other.logSamples, logSamples) || other.logSamples == 
logSamples)&&(identical(other.logRealtime, logRealtime) || other.logRealtime == logRealtime)&&(identical(other.logImages, logImages) || other.logImages == logImages)&&(identical(other.logBuffer, logBuffer) || other.logBuffer == logBuffer)&&(identical(other.logShared, logShared) || other.logShared == logShared)&&(identical(other.bundleDir, bundleDir) || other.bundleDir == bundleDir)&&(identical(other.bundleOverwrite, bundleOverwrite) || other.bundleOverwrite == bundleOverwrite)&&(identical(other.logDirAllowDirty, logDirAllowDirty) || other.logDirAllowDirty == logDirAllowDirty)&&(identical(other.evalSetId, evalSetId) || other.evalSetId == evalSetId)&&const DeepCollectionEquality().equals(other._evalSetOverrides, _evalSetOverrides)&&const DeepCollectionEquality().equals(other._taskDefaults, _taskDefaults)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Job&&(identical(other.description, description) || other.description == description)&&(identical(other.imagePrefix, imagePrefix) || other.imagePrefix == imagePrefix)&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.sandboxType, sandboxType) || other.sandboxType == sandboxType)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other._models, _models)&&const DeepCollectionEquality().equals(other._variants, _variants)&&const DeepCollectionEquality().equals(other._taskPaths, _taskPaths)&&const DeepCollectionEquality().equals(other._tasks, _tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&(identical(other.retryAttempts, retryAttempts) || other.retryAttempts == retryAttempts)&&(identical(other.maxRetries, maxRetries) || other.maxRetries == maxRetries)&&(identical(other.retryWait, retryWait) || other.retryWait == retryWait)&&(identical(other.retryConnections, retryConnections) || other.retryConnections == 
retryConnections)&&(identical(other.retryCleanup, retryCleanup) || other.retryCleanup == retryCleanup)&&(identical(other.failOnError, failOnError) || other.failOnError == failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.retryOnError, retryOnError) || other.retryOnError == retryOnError)&&(identical(other.debugErrors, debugErrors) || other.debugErrors == debugErrors)&&(identical(other.maxSamples, maxSamples) || other.maxSamples == maxSamples)&&(identical(other.maxTasks, maxTasks) || other.maxTasks == maxTasks)&&(identical(other.maxSubprocesses, maxSubprocesses) || other.maxSubprocesses == maxSubprocesses)&&(identical(other.maxSandboxes, maxSandboxes) || other.maxSandboxes == maxSandboxes)&&(identical(other.logLevel, logLevel) || other.logLevel == logLevel)&&(identical(other.logLevelTranscript, logLevelTranscript) || other.logLevelTranscript == logLevelTranscript)&&(identical(other.logFormat, logFormat) || other.logFormat == logFormat)&&const DeepCollectionEquality().equals(other._tags, _tags)&&const DeepCollectionEquality().equals(other._metadata, _metadata)&&(identical(other.trace, trace) || other.trace == trace)&&(identical(other.display, display) || other.display == display)&&(identical(other.score, score) || other.score == score)&&const DeepCollectionEquality().equals(other.limit, limit)&&const DeepCollectionEquality().equals(other.sampleId, sampleId)&&const DeepCollectionEquality().equals(other.sampleShuffle, sampleShuffle)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.solver, solver)&&(identical(other.sandboxCleanup, sandboxCleanup) || other.sandboxCleanup == sandboxCleanup)&&(identical(other.modelBaseUrl, modelBaseUrl) || other.modelBaseUrl == modelBaseUrl)&&const DeepCollectionEquality().equals(other._modelArgs, _modelArgs)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other._taskArgs, _taskArgs)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other._modelCostConfig, _modelCostConfig)&&(identical(other.logSamples, logSamples) || other.logSamples == logSamples)&&(identical(other.logRealtime, logRealtime) || other.logRealtime == logRealtime)&&(identical(other.logImages, logImages) || other.logImages == logImages)&&(identical(other.logBuffer, logBuffer) || other.logBuffer == logBuffer)&&(identical(other.logShared, logShared) || other.logShared == logShared)&&(identical(other.bundleDir, bundleDir) || other.bundleDir == bundleDir)&&(identical(other.bundleOverwrite, bundleOverwrite) || other.bundleOverwrite == bundleOverwrite)&&(identical(other.logDirAllowDirty, logDirAllowDirty) || other.logDirAllowDirty == logDirAllowDirty)&&(identical(other.evalSetId, evalSetId) || other.evalSetId == evalSetId)&&const DeepCollectionEquality().equals(other._evalSetOverrides, _evalSetOverrides)&&const DeepCollectionEquality().equals(other._taskDefaults, _taskDefaults)&&(identical(other.taskFilters, taskFilters) || other.taskFilters == taskFilters)&&(identical(other.sampleFilters, sampleFilters) || other.sampleFilters == sampleFilters));
}

@JsonKey(includeFromJson: false, includeToJson: false)
@override
-int get hashCode => Object.hashAll([runtimeType,logDir,sandboxType,maxConnections,const DeepCollectionEquality().hash(_models),const DeepCollectionEquality().hash(_variants),const DeepCollectionEquality().hash(_taskPaths),const DeepCollectionEquality().hash(_tasks),saveExamples,retryAttempts,maxRetries,retryWait,retryConnections,retryCleanup,failOnError,continueOnFail,retryOnError,debugErrors,maxSamples,maxTasks,maxSubprocesses,maxSandboxes,logLevel,logLevelTranscript,logFormat,const DeepCollectionEquality().hash(_tags),const DeepCollectionEquality().hash(_metadata),trace,display,score,const DeepCollectionEquality().hash(limit),const DeepCollectionEquality().hash(sampleId),const DeepCollectionEquality().hash(sampleShuffle),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(solver),sandboxCleanup,modelBaseUrl,const DeepCollectionEquality().hash(_modelArgs),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(_taskArgs),messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(_modelCostConfig),logSamples,logRealtime,logImages,logBuffer,logShared,bundleDir,bundleOverwrite,logDirAllowDirty,evalSetId,const DeepCollectionEquality().hash(_evalSetOverrides),const DeepCollectionEquality().hash(_taskDefaults)]);
+int get hashCode => Object.hashAll([runtimeType,description,imagePrefix,logDir,sandboxType,maxConnections,const DeepCollectionEquality().hash(_models),const DeepCollectionEquality().hash(_variants),const DeepCollectionEquality().hash(_taskPaths),const DeepCollectionEquality().hash(_tasks),saveExamples,retryAttempts,maxRetries,retryWait,retryConnections,retryCleanup,failOnError,continueOnFail,retryOnError,debugErrors,maxSamples,maxTasks,maxSubprocesses,maxSandboxes,logLevel,logLevelTranscript,logFormat,const DeepCollectionEquality().hash(_tags),const DeepCollectionEquality().hash(_metadata),trace,display,score,const DeepCollectionEquality().hash(limit),const DeepCollectionEquality().hash(sampleId),const DeepCollectionEquality().hash(sampleShuffle),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(solver),sandboxCleanup,modelBaseUrl,const DeepCollectionEquality().hash(_modelArgs),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(_taskArgs),messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(_modelCostConfig),logSamples,logRealtime,logImages,logBuffer,logShared,bundleDir,bundleOverwrite,logDirAllowDirty,evalSetId,const DeepCollectionEquality().hash(_evalSetOverrides),const DeepCollectionEquality().hash(_taskDefaults),taskFilters,sampleFilters]);

@override
String toString() {
- return 'Job(logDir: $logDir, sandboxType: $sandboxType, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, retryAttempts: $retryAttempts, maxRetries: $maxRetries, retryWait: $retryWait, retryConnections: $retryConnections, retryCleanup: $retryCleanup, failOnError: $failOnError, continueOnFail: $continueOnFail, retryOnError: $retryOnError, debugErrors: $debugErrors, maxSamples: $maxSamples, maxTasks: $maxTasks, maxSubprocesses: $maxSubprocesses, maxSandboxes: $maxSandboxes, logLevel: $logLevel, logLevelTranscript: $logLevelTranscript, logFormat: $logFormat, tags: $tags, metadata: $metadata, trace: $trace, display: $display, score: $score, limit: $limit, sampleId: $sampleId, sampleShuffle: $sampleShuffle, epochs: $epochs, approval: $approval, solver: $solver, sandboxCleanup: $sandboxCleanup, modelBaseUrl: $modelBaseUrl, modelArgs: $modelArgs, modelRoles: $modelRoles, taskArgs: $taskArgs, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, modelCostConfig: $modelCostConfig, logSamples: $logSamples, logRealtime: $logRealtime, logImages: $logImages, logBuffer: $logBuffer, logShared: $logShared, bundleDir: $bundleDir, bundleOverwrite: $bundleOverwrite, logDirAllowDirty: $logDirAllowDirty, evalSetId: $evalSetId, evalSetOverrides: $evalSetOverrides, taskDefaults: $taskDefaults)';
+ return 'Job(description: $description, imagePrefix: $imagePrefix, logDir: $logDir, sandboxType: $sandboxType, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, retryAttempts: $retryAttempts, maxRetries: $maxRetries, retryWait: $retryWait, retryConnections: $retryConnections, retryCleanup: $retryCleanup, failOnError: $failOnError, continueOnFail: $continueOnFail, retryOnError: $retryOnError, debugErrors: $debugErrors, maxSamples: $maxSamples, maxTasks: $maxTasks, maxSubprocesses: $maxSubprocesses, maxSandboxes: $maxSandboxes, logLevel: $logLevel, logLevelTranscript: $logLevelTranscript, logFormat: $logFormat, tags: $tags, metadata: $metadata, trace: $trace, display: $display, score: $score, limit: $limit, sampleId: $sampleId, sampleShuffle: $sampleShuffle, epochs: $epochs, approval: $approval, solver: $solver, sandboxCleanup: $sandboxCleanup, modelBaseUrl: $modelBaseUrl, modelArgs: $modelArgs, modelRoles: $modelRoles, taskArgs: $taskArgs, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, modelCostConfig: $modelCostConfig, logSamples: $logSamples, logRealtime: $logRealtime, logImages: $logImages, logBuffer: $logBuffer, logShared: $logShared, bundleDir: $bundleDir, bundleOverwrite: $bundleOverwrite, logDirAllowDirty: $logDirAllowDirty, evalSetId: $evalSetId, evalSetOverrides: $evalSetOverrides, taskDefaults: $taskDefaults, taskFilters: $taskFilters, sampleFilters: $sampleFilters)';
}

@@ -617,11 +667,11 @@ abstract mixin class _$JobCopyWith<$Res> implements $JobCopyWith<$Res> {
factory _$JobCopyWith(_Job value, $Res Function(_Job) _then) = __$JobCopyWithImpl;
@override @useResult
$Res call({
-@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'sandbox_type') String sandboxType,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples,@JsonKey(name: 'retry_attempts') int? retryAttempts,@JsonKey(name: 'max_retries') int? maxRetries,@JsonKey(name: 'retry_wait') double? retryWait,@JsonKey(name: 'retry_connections') double? retryConnections,@JsonKey(name: 'retry_cleanup') bool? retryCleanup,@JsonKey(name: 'fail_on_error') double? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'retry_on_error') int? retryOnError,@JsonKey(name: 'debug_errors') bool? debugErrors,@JsonKey(name: 'max_samples') int? maxSamples,@JsonKey(name: 'max_tasks') int? maxTasks,@JsonKey(name: 'max_subprocesses') int? maxSubprocesses,@JsonKey(name: 'max_sandboxes') int? maxSandboxes,@JsonKey(name: 'log_level') String? logLevel,@JsonKey(name: 'log_level_transcript') String? logLevelTranscript,@JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit,@JsonKey(name: 'sample_id') Object? sampleId,@JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver,@JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,@JsonKey(name: 'model_base_url') String? modelBaseUrl,@JsonKey(name: 'model_args') Map? modelArgs,@JsonKey(name: 'model_roles') Map? modelRoles,@JsonKey(name: 'task_args') Map? taskArgs,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'model_cost_config') Map? modelCostConfig,@JsonKey(name: 'log_samples') bool? logSamples,@JsonKey(name: 'log_realtime') bool? logRealtime,@JsonKey(name: 'log_images') bool? logImages,@JsonKey(name: 'log_buffer') int? logBuffer,@JsonKey(name: 'log_shared') int? logShared,@JsonKey(name: 'bundle_dir') String? bundleDir,@JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,@JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,@JsonKey(name: 'eval_set_id') String? evalSetId,@JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,@JsonKey(name: 'task_defaults') Map? taskDefaults
+ String? description,@JsonKey(name: 'image_prefix') String? imagePrefix,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'sandbox_type') String sandboxType,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples,@JsonKey(name: 'retry_attempts') int? retryAttempts,@JsonKey(name: 'max_retries') int? maxRetries,@JsonKey(name: 'retry_wait') double? retryWait,@JsonKey(name: 'retry_connections') double? retryConnections,@JsonKey(name: 'retry_cleanup') bool? retryCleanup,@JsonKey(name: 'fail_on_error') double? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'retry_on_error') int? retryOnError,@JsonKey(name: 'debug_errors') bool? debugErrors,@JsonKey(name: 'max_samples') int? maxSamples,@JsonKey(name: 'max_tasks') int? maxTasks,@JsonKey(name: 'max_subprocesses') int? maxSubprocesses,@JsonKey(name: 'max_sandboxes') int? maxSandboxes,@JsonKey(name: 'log_level') String? logLevel,@JsonKey(name: 'log_level_transcript') String? logLevelTranscript,@JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit,@JsonKey(name: 'sample_id') Object? sampleId,@JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver,@JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,@JsonKey(name: 'model_base_url') String? modelBaseUrl,@JsonKey(name: 'model_args') Map? modelArgs,@JsonKey(name: 'model_roles') Map? modelRoles,@JsonKey(name: 'task_args') Map? taskArgs,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'model_cost_config') Map? modelCostConfig,@JsonKey(name: 'log_samples') bool? logSamples,@JsonKey(name: 'log_realtime') bool? logRealtime,@JsonKey(name: 'log_images') bool? logImages,@JsonKey(name: 'log_buffer') int? logBuffer,@JsonKey(name: 'log_shared') int? logShared,@JsonKey(name: 'bundle_dir') String? bundleDir,@JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,@JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,@JsonKey(name: 'eval_set_id') String? evalSetId,@JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,@JsonKey(name: 'task_defaults') Map? taskDefaults,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters
});
-
+@override $TagFilterCopyWith<$Res>? get taskFilters;@override $TagFilterCopyWith<$Res>? get sampleFilters;
}

/// @nodoc
@@ -634,9 +684,11 @@ class __$JobCopyWithImpl<$Res>
/// Create a copy of Job
/// with the given fields replaced by the non-null parameter values.
-@override @pragma('vm:prefer-inline') $Res call({Object? logDir = null,Object? sandboxType = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? retryAttempts = freezed,Object? maxRetries = freezed,Object? retryWait = freezed,Object? retryConnections = freezed,Object? retryCleanup = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? retryOnError = freezed,Object? debugErrors = freezed,Object? maxSamples = freezed,Object? maxTasks = freezed,Object? maxSubprocesses = freezed,Object? maxSandboxes = freezed,Object? logLevel = freezed,Object? logLevelTranscript = freezed,Object? logFormat = freezed,Object? tags = freezed,Object? metadata = freezed,Object? trace = freezed,Object? display = freezed,Object? score = freezed,Object? limit = freezed,Object? sampleId = freezed,Object? sampleShuffle = freezed,Object? epochs = freezed,Object? approval = freezed,Object? solver = freezed,Object? sandboxCleanup = freezed,Object? modelBaseUrl = freezed,Object? modelArgs = freezed,Object? modelRoles = freezed,Object? taskArgs = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? modelCostConfig = freezed,Object? logSamples = freezed,Object? logRealtime = freezed,Object? logImages = freezed,Object? logBuffer = freezed,Object? logShared = freezed,Object? bundleDir = freezed,Object? bundleOverwrite = freezed,Object? logDirAllowDirty = freezed,Object? evalSetId = freezed,Object? evalSetOverrides = freezed,Object? taskDefaults = freezed,}) {
+@override @pragma('vm:prefer-inline') $Res call({Object? description = freezed,Object? imagePrefix = freezed,Object? logDir = null,Object? sandboxType = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? retryAttempts = freezed,Object? maxRetries = freezed,Object? retryWait = freezed,Object? retryConnections = freezed,Object? retryCleanup = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? retryOnError = freezed,Object? debugErrors = freezed,Object? maxSamples = freezed,Object? maxTasks = freezed,Object? maxSubprocesses = freezed,Object? maxSandboxes = freezed,Object? logLevel = freezed,Object? logLevelTranscript = freezed,Object? logFormat = freezed,Object? tags = freezed,Object? metadata = freezed,Object? trace = freezed,Object? display = freezed,Object? score = freezed,Object? limit = freezed,Object? sampleId = freezed,Object? sampleShuffle = freezed,Object? epochs = freezed,Object? approval = freezed,Object? solver = freezed,Object? sandboxCleanup = freezed,Object? modelBaseUrl = freezed,Object? modelArgs = freezed,Object? modelRoles = freezed,Object? taskArgs = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? modelCostConfig = freezed,Object? logSamples = freezed,Object? logRealtime = freezed,Object? logImages = freezed,Object? logBuffer = freezed,Object? logShared = freezed,Object? bundleDir = freezed,Object? bundleOverwrite = freezed,Object? logDirAllowDirty = freezed,Object? evalSetId = freezed,Object? evalSetOverrides = freezed,Object? taskDefaults = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) {
return _then(_Job(
-logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable
+description: freezed == description ? _self.description : description // ignore: cast_nullable_to_non_nullable
+as String?,imagePrefix: freezed == imagePrefix ? _self.imagePrefix : imagePrefix // ignore: cast_nullable_to_non_nullable
+as String?,logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable
as String,sandboxType: null == sandboxType ? _self.sandboxType : sandboxType // ignore: cast_nullable_to_non_nullable
as String,maxConnections: null == maxConnections ? _self.maxConnections : maxConnections // ignore: cast_nullable_to_non_nullable
as int,models: freezed == models ? _self._models : models // ignore: cast_nullable_to_non_nullable
@@ -687,11 +739,37 @@ as bool?,logDirAllowDirty: freezed == logDirAllowDirty ? _self.logDirAllowDirty
as bool?,evalSetId: freezed == evalSetId ? _self.evalSetId : evalSetId // ignore: cast_nullable_to_non_nullable
as String?,evalSetOverrides: freezed == evalSetOverrides ? _self._evalSetOverrides : evalSetOverrides // ignore: cast_nullable_to_non_nullable
as Map?,taskDefaults: freezed == taskDefaults ? _self._taskDefaults : taskDefaults // ignore: cast_nullable_to_non_nullable
-as Map?,
+as Map?,taskFilters: freezed == taskFilters ? _self.taskFilters : taskFilters // ignore: cast_nullable_to_non_nullable
+as TagFilter?,sampleFilters: freezed == sampleFilters ? _self.sampleFilters : sampleFilters // ignore: cast_nullable_to_non_nullable
+as TagFilter?,
));
}
-
+/// Create a copy of Job
+/// with the given fields replaced by the non-null parameter values.
+@override
+@pragma('vm:prefer-inline')
+$TagFilterCopyWith<$Res>? get taskFilters {
+  if (_self.taskFilters == null) {
+    return null;
+  }
+
+  return $TagFilterCopyWith<$Res>(_self.taskFilters!, (value) {
+    return _then(_self.copyWith(taskFilters: value));
+  });
+}/// Create a copy of Job
+/// with the given fields replaced by the non-null parameter values.
+@override
+@pragma('vm:prefer-inline')
+$TagFilterCopyWith<$Res>? get sampleFilters {
+  if (_self.sampleFilters == null) {
+    return null;
+  }
+
+  return $TagFilterCopyWith<$Res>(_self.sampleFilters!, (value) {
+    return _then(_self.copyWith(sampleFilters: value));
+  });
+}
}

@@ -702,7 +780,8 @@ mixin _$JobTask {
String get id;/// Only run these sample IDs. Mutually exclusive with [excludeSamples].
@JsonKey(name: 'include_samples') List? get includeSamples;/// Exclude these sample IDs. Mutually exclusive with [includeSamples].
@JsonKey(name: 'exclude_samples') List? get excludeSamples;/// Override system message for this task.
-@JsonKey(name: 'system_message') String? get systemMessage;
+@JsonKey(name: 'system_message') String? get systemMessage;/// Per-task argument overrides passed to the task function.
+@JsonKey(name: 'args') Map? get args;

/// Create a copy of JobTask
/// with the given fields replaced by the non-null parameter values.
@JsonKey(includeFromJson: false, includeToJson: false)
@@ -715,16 +794,16 @@ $JobTaskCopyWith get copyWith => _$JobTaskCopyWithImpl(this as
@override
bool operator ==(Object other) {
- return identical(this, other) || (other.runtimeType == runtimeType&&other is JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other.includeSamples, includeSamples)&&const DeepCollectionEquality().equals(other.excludeSamples, excludeSamples)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage));
+ return identical(this, other) || (other.runtimeType == runtimeType&&other is JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other.includeSamples, includeSamples)&&const DeepCollectionEquality().equals(other.excludeSamples, excludeSamples)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other.args, args));
}

@JsonKey(includeFromJson: false, includeToJson: false)
@override
-int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(includeSamples),const DeepCollectionEquality().hash(excludeSamples),systemMessage);
+int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(includeSamples),const DeepCollectionEquality().hash(excludeSamples),systemMessage,const DeepCollectionEquality().hash(args));

@override
String toString() {
- return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, systemMessage: $systemMessage)';
+ return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, systemMessage: $systemMessage, args: $args)';
}

@@ -735,7 +814,7 @@ abstract mixin class $JobTaskCopyWith<$Res> {
factory $JobTaskCopyWith(JobTask value, $Res Function(JobTask) _then) = _$JobTaskCopyWithImpl;
@useResult
$Res call({
- String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'system_message') String? systemMessage
+ String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'args') Map? args
});

@@ -752,13 +831,14 @@ class _$JobTaskCopyWithImpl<$Res>
/// Create a copy of JobTask
/// with the given fields replaced by the non-null parameter values.
-@pragma('vm:prefer-inline') @override $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? systemMessage = freezed,}) {
+@pragma('vm:prefer-inline') @override $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? systemMessage = freezed,Object? args = freezed,}) {
return _then(_self.copyWith(
id: null == id ? _self.id : id // ignore: cast_nullable_to_non_nullable
as String,includeSamples: freezed == includeSamples ? _self.includeSamples : includeSamples // ignore: cast_nullable_to_non_nullable
as List?,excludeSamples: freezed == excludeSamples ? _self.excludeSamples : excludeSamples // ignore: cast_nullable_to_non_nullable
as List?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable
-as String?,
+as String?,args: freezed == args ? _self.args : args // ignore: cast_nullable_to_non_nullable
+as Map?,
));
}

@@ -840,10 +920,10 @@ return $default(_that);case _:
/// }
/// ```
-@optionalTypeArgs TResult maybeWhen(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage)? $default,{required TResult orElse(),}) {final _that = this;
+@optionalTypeArgs TResult maybeWhen(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'args') Map? args)? $default,{required TResult orElse(),}) {final _that = this;
switch (_that) {
case _JobTask() when $default != null:
-return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage);case _:
+return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage,_that.args);case _:
return orElse();

}
@@ -861,10 +941,10 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM
/// }
/// ```
-@optionalTypeArgs TResult when(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage) $default,) {final _that = this;
+@optionalTypeArgs TResult when(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'args') Map? args) $default,) {final _that = this;
switch (_that) {
case _JobTask():
-return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage);}
+return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage,_that.args);}
}
/// A variant of `when` that fallback to returning `null`
///
@@ -878,10 +958,10 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM
/// }
/// ```
-@optionalTypeArgs TResult? whenOrNull(TResult? Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage)? $default,) {final _that = this;
+@optionalTypeArgs TResult? whenOrNull(TResult? Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'args') Map? args)? $default,) {final _that = this;
switch (_that) {
case _JobTask() when $default != null:
-return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage);case _:
+return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage,_that.args);case _:
return null;

}
@@ -893,7 +973,7 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM
@JsonSerializable()
class _JobTask implements JobTask {
- const _JobTask({required this.id, @JsonKey(name: 'include_samples') final List? includeSamples, @JsonKey(name: 'exclude_samples') final List? excludeSamples, @JsonKey(name: 'system_message') this.systemMessage}): _includeSamples = includeSamples,_excludeSamples = excludeSamples;
+ const _JobTask({required this.id, @JsonKey(name: 'include_samples') final List? includeSamples, @JsonKey(name: 'exclude_samples') final List? excludeSamples, @JsonKey(name: 'system_message') this.systemMessage, @JsonKey(name: 'args') final Map? args}): _includeSamples = includeSamples,_excludeSamples = excludeSamples,_args = args;
factory _JobTask.fromJson(Map json) => _$JobTaskFromJson(json);

/// Task identifier matching a task directory name in `tasks/`.
@@ -922,6 +1002,17 @@ class _JobTask implements JobTask {
/// Override system message for this task.
@override@JsonKey(name: 'system_message') final String? systemMessage;
+/// Per-task argument overrides passed to the task function.
+ final Map? _args;
+/// Per-task argument overrides passed to the task function.
+@override@JsonKey(name: 'args') Map? get args {
+  final value = _args;
+  if (value == null) return null;
+  if (_args is EqualUnmodifiableMapView) return _args;
+  // ignore: implicit_dynamic_type
+  return EqualUnmodifiableMapView(value);
+}
+
/// Create a copy of JobTask
/// with the given fields replaced by the non-null parameter values.
@@ -936,16 +1027,16 @@ Map toJson() {

@override
bool operator ==(Object other) {
- return identical(this, other) || (other.runtimeType == runtimeType&&other is _JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other._includeSamples, _includeSamples)&&const DeepCollectionEquality().equals(other._excludeSamples, _excludeSamples)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage));
+ return identical(this, other) || (other.runtimeType == runtimeType&&other is _JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other._includeSamples, _includeSamples)&&const DeepCollectionEquality().equals(other._excludeSamples, _excludeSamples)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other._args, _args));
}

@JsonKey(includeFromJson: false, includeToJson: false)
@override
-int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(_includeSamples),const DeepCollectionEquality().hash(_excludeSamples),systemMessage);
+int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(_includeSamples),const DeepCollectionEquality().hash(_excludeSamples),systemMessage,const DeepCollectionEquality().hash(_args));

@override
String toString() {
- return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, systemMessage: $systemMessage)';
+ return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, systemMessage: $systemMessage, args: $args)';
}

@@ -956,7 +1047,7 @@ abstract mixin class _$JobTaskCopyWith<$Res> implements $JobTaskCopyWith<$Res> {
factory _$JobTaskCopyWith(_JobTask value, $Res Function(_JobTask) _then) = __$JobTaskCopyWithImpl;
@override @useResult
$Res call({
- String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'system_message') String? systemMessage
+ String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'args') Map? args
});

@@ -973,13 +1064,14 @@ class __$JobTaskCopyWithImpl<$Res>
/// Create a copy of JobTask
/// with the given fields replaced by the non-null parameter values.
-@override @pragma('vm:prefer-inline') $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? systemMessage = freezed,}) {
+@override @pragma('vm:prefer-inline') $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? systemMessage = freezed,Object? args = freezed,}) {
return _then(_JobTask(
id: null == id ? _self.id : id // ignore: cast_nullable_to_non_nullable
as String,includeSamples: freezed == includeSamples ? _self._includeSamples : includeSamples // ignore: cast_nullable_to_non_nullable
as List?,excludeSamples: freezed == excludeSamples ? _self._excludeSamples : excludeSamples // ignore: cast_nullable_to_non_nullable
as List?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable
-as String?,
+as String?,args: freezed == args ? _self._args : args // ignore: cast_nullable_to_non_nullable
+as Map?,
));
}

diff --git a/packages/dataset_config_dart/lib/src/models/job.g.dart b/packages/dataset_config_dart/lib/src/models/job.g.dart
index f62e5b3..a3abef1 100644
--- a/packages/dataset_config_dart/lib/src/models/job.g.dart
+++ b/packages/dataset_config_dart/lib/src/models/job.g.dart
@@ -7,6 +7,8 @@ part of 'job.dart';
// **************************************************************************

_Job _$JobFromJson(Map json) => _Job(
+ description: json['description'] as String?,
+ imagePrefix: json['image_prefix'] as String?,
logDir: json['log_dir'] as String,
sandboxType: json['sandbox_type'] as String? ?? 'local',
maxConnections: (json['max_connections'] as num?)?.toInt() ?? 10,
@@ -72,16 +74,24 @@ _Job _$JobFromJson(Map json) => _Job(
evalSetId: json['eval_set_id'] as String?,
evalSetOverrides: json['eval_set_overrides'] as Map?,
taskDefaults: json['task_defaults'] as Map?,
+ taskFilters: json['task_filters'] == null
+     ? null
+     : TagFilter.fromJson(json['task_filters'] as Map),
+ sampleFilters: json['sample_filters'] == null
+     ? null
+     : TagFilter.fromJson(json['sample_filters'] as Map),
);

Map _$JobToJson(_Job instance) => {
+ 'description': instance.description,
+ 'image_prefix': instance.imagePrefix,
'log_dir': instance.logDir,
'sandbox_type': instance.sandboxType,
'max_connections': instance.maxConnections,
'models': instance.models,
'variants': instance.variants,
'task_paths': instance.taskPaths,
- 'tasks': instance.tasks?.map((k, e) => MapEntry(k, e.toJson())),
+ 'tasks': instance.tasks,
'save_examples': instance.saveExamples,
'retry_attempts': instance.retryAttempts,
'max_retries': instance.maxRetries,
@@ -132,6 +142,8 @@ Map _$JobToJson(_Job instance) => {
'eval_set_id': instance.evalSetId,
'eval_set_overrides': instance.evalSetOverrides,
'task_defaults': instance.taskDefaults,
+ 'task_filters': instance.taskFilters,
+ 'sample_filters': instance.sampleFilters,
};

_JobTask _$JobTaskFromJson(Map json) => _JobTask(
@@ -143,6 +155,7 @@ _JobTask _$JobTaskFromJson(Map json) => _JobTask(
?.map((e) => e as String)
.toList(),
systemMessage: json['system_message'] as String?,
+ args: json['args'] as Map?,
);

Map _$JobTaskToJson(_JobTask instance) => {
@@ -150,4 +163,5 @@ Map _$JobTaskToJson(_JobTask instance) => {
'include_samples': instance.includeSamples,
'exclude_samples': instance.excludeSamples,
'system_message': instance.systemMessage,
+ 'args': instance.args,
};

diff --git a/packages/dataset_config_dart/lib/src/models/models.dart b/packages/dataset_config_dart/lib/src/models/models.dart
index 5b590fb..4fba25c 100644
--- a/packages/dataset_config_dart/lib/src/models/models.dart
+++ b/packages/dataset_config_dart/lib/src/models/models.dart
@@ -1,6 +1,7 @@
// Config models (eval runner input configuration)
export 'context_file.dart';
export 'job.dart';
+export 'tag_filter.dart';
export 'variant.dart';

// Inspect AI models (mirrors the Python Inspect AI API types)
diff --git a/packages/dataset_config_dart/lib/src/models/tag_filter.dart b/packages/dataset_config_dart/lib/src/models/tag_filter.dart
new file mode 100644
index 0000000..f5a4ec1
--- /dev/null
+++ b/packages/dataset_config_dart/lib/src/models/tag_filter.dart
@@ -0,0 +1,33 @@
+import 'package:freezed_annotation/freezed_annotation.dart';
+
+part 'tag_filter.freezed.dart';
+part 'tag_filter.g.dart';
+
+/// Tag-based filter for including/excluding items by their tags.
+@freezed
+sealed class TagFilter with _$TagFilter {
+  const factory TagFilter({
+    @JsonKey(name: 'include_tags') List? includeTags,
+    @JsonKey(name: 'exclude_tags') List? excludeTags,
+  }) = _TagFilter;
+
+  factory TagFilter.fromJson(Map json) =>
+      _$TagFilterFromJson(json);
+}
+
+/// Check whether a set of [itemTags] matches the given [filter].
+///
+/// Returns `true` if:
+/// - All include_tags (if any) are present in [itemTags]
+/// - No exclude_tags (if any) are present in [itemTags]
+bool matchesTagFilter(List itemTags, TagFilter filter) {
+  if (filter.includeTags != null &&
+      !filter.includeTags!.every((t) => itemTags.contains(t))) {
+    return false;
+  }
+  if (filter.excludeTags != null &&
+      filter.excludeTags!.any((t) => itemTags.contains(t))) {
+    return false;
+  }
+  return true;
+}
diff --git a/packages/dataset_config_dart/lib/src/models/tag_filter.freezed.dart b/packages/dataset_config_dart/lib/src/models/tag_filter.freezed.dart
new file mode 100644
index 0000000..5df78eb
--- /dev/null
+++ b/packages/dataset_config_dart/lib/src/models/tag_filter.freezed.dart
@@ -0,0 +1,290 @@
+// GENERATED CODE - DO NOT MODIFY BY HAND
+// coverage:ignore-file
+// ignore_for_file: type=lint
+// ignore_for_file: unused_element, deprecated_member_use, deprecated_member_use_from_same_package, use_function_type_syntax_for_parameters, unnecessary_const, avoid_init_to_null, invalid_override_different_default_values_named, prefer_expression_function_bodies, annotate_overrides, invalid_annotation_target, unnecessary_question_mark
+
+part of 'tag_filter.dart';
+
+// **************************************************************************
+// FreezedGenerator
+// **************************************************************************
+
+// dart format off
+T _$identity(T value) => value;
+
+/// @nodoc
+mixin _$TagFilter {
+
+@JsonKey(name: 'include_tags') List? get includeTags;@JsonKey(name: 'exclude_tags') List? get excludeTags;
+/// Create a copy of TagFilter
+/// with the given fields replaced by the non-null parameter values.
+@JsonKey(includeFromJson: false, includeToJson: false)
+@pragma('vm:prefer-inline')
+$TagFilterCopyWith get copyWith => _$TagFilterCopyWithImpl(this as TagFilter, _$identity);
+
+  /// Serializes this TagFilter to a JSON map.
+  Map toJson();
+
+
+@override
+bool operator ==(Object other) {
+  return identical(this, other) || (other.runtimeType == runtimeType&&other is TagFilter&&const DeepCollectionEquality().equals(other.includeTags, includeTags)&&const DeepCollectionEquality().equals(other.excludeTags, excludeTags));
+}
+
+@JsonKey(includeFromJson: false, includeToJson: false)
+@override
+int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(includeTags),const DeepCollectionEquality().hash(excludeTags));
+
+@override
+String toString() {
+  return 'TagFilter(includeTags: $includeTags, excludeTags: $excludeTags)';
+}
+
+
+}
+
+/// @nodoc
+abstract mixin class $TagFilterCopyWith<$Res> {
+  factory $TagFilterCopyWith(TagFilter value, $Res Function(TagFilter) _then) = _$TagFilterCopyWithImpl;
+@useResult
+$Res call({
+@JsonKey(name: 'include_tags') List? includeTags,@JsonKey(name: 'exclude_tags') List? excludeTags
+});
+
+
+
+
+}
+/// @nodoc
+class _$TagFilterCopyWithImpl<$Res>
+    implements $TagFilterCopyWith<$Res> {
+  _$TagFilterCopyWithImpl(this._self, this._then);
+
+  final TagFilter _self;
+  final $Res Function(TagFilter) _then;
+
+/// Create a copy of TagFilter
+/// with the given fields replaced by the non-null parameter values.
+@pragma('vm:prefer-inline') @override $Res call({Object? includeTags = freezed,Object? excludeTags = freezed,}) {
+  return _then(_self.copyWith(
+includeTags: freezed == includeTags ? _self.includeTags : includeTags // ignore: cast_nullable_to_non_nullable
+as List?,excludeTags: freezed == excludeTags ? _self.excludeTags : excludeTags // ignore: cast_nullable_to_non_nullable
+as List?,
+  ));
+}
+
+}
+
+
+/// Adds pattern-matching-related methods to [TagFilter].
+extension TagFilterPatterns on TagFilter {
+/// A variant of `map` that fallback to returning `orElse`.
+///
+/// It is equivalent to doing:
+/// ```dart
+/// switch (sealedClass) {
+///   case final Subclass value:
+///     return ...;
+///   case _:
+///     return orElse();
+/// }
+/// ```
+
+@optionalTypeArgs TResult maybeMap(TResult Function( _TagFilter value)? $default,{required TResult orElse(),}){
+final _that = this;
+switch (_that) {
+case _TagFilter() when $default != null:
+return $default(_that);case _:
+  return orElse();
+
+}
+}
+/// A `switch`-like method, using callbacks.
+///
+/// Callbacks receives the raw object, upcasted.
+/// It is equivalent to doing:
+/// ```dart
+/// switch (sealedClass) {
+///   case final Subclass value:
+///     return ...;
+///   case final Subclass2 value:
+///     return ...;
+/// }
+/// ```
+
+@optionalTypeArgs TResult map(TResult Function( _TagFilter value) $default,){
+final _that = this;
+switch (_that) {
+case _TagFilter():
+return $default(_that);}
+}
+/// A variant of `map` that fallback to returning `null`.
+///
+/// It is equivalent to doing:
+/// ```dart
+/// switch (sealedClass) {
+///   case final Subclass value:
+///     return ...;
+///   case _:
+///     return null;
+/// }
+/// ```
+
+@optionalTypeArgs TResult? mapOrNull(TResult? Function( _TagFilter value)? $default,){
+final _that = this;
+switch (_that) {
+case _TagFilter() when $default != null:
+return $default(_that);case _:
+  return null;
+
+}
+}
+/// A variant of `when` that fallback to an `orElse` callback.
+/// +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case Subclass(:final field): +/// return ...; +/// case _: +/// return orElse(); +/// } +/// ``` + +@optionalTypeArgs TResult maybeWhen(TResult Function(@JsonKey(name: 'include_tags') List? includeTags, @JsonKey(name: 'exclude_tags') List? excludeTags)? $default,{required TResult orElse(),}) {final _that = this; +switch (_that) { +case _TagFilter() when $default != null: +return $default(_that.includeTags,_that.excludeTags);case _: + return orElse(); + +} +} +/// A `switch`-like method, using callbacks. +/// +/// As opposed to `map`, this offers destructuring. +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case Subclass(:final field): +/// return ...; +/// case Subclass2(:final field2): +/// return ...; +/// } +/// ``` + +@optionalTypeArgs TResult when(TResult Function(@JsonKey(name: 'include_tags') List? includeTags, @JsonKey(name: 'exclude_tags') List? excludeTags) $default,) {final _that = this; +switch (_that) { +case _TagFilter(): +return $default(_that.includeTags,_that.excludeTags);} +} +/// A variant of `when` that fallback to returning `null` +/// +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case Subclass(:final field): +/// return ...; +/// case _: +/// return null; +/// } +/// ``` + +@optionalTypeArgs TResult? whenOrNull(TResult? Function(@JsonKey(name: 'include_tags') List? includeTags, @JsonKey(name: 'exclude_tags') List? excludeTags)? $default,) {final _that = this; +switch (_that) { +case _TagFilter() when $default != null: +return $default(_that.includeTags,_that.excludeTags);case _: + return null; + +} +} + +} + +/// @nodoc +@JsonSerializable() + +class _TagFilter implements TagFilter { + const _TagFilter({@JsonKey(name: 'include_tags') final List? includeTags, @JsonKey(name: 'exclude_tags') final List? 
excludeTags}): _includeTags = includeTags,_excludeTags = excludeTags; + factory _TagFilter.fromJson(Map json) => _$TagFilterFromJson(json); + + final List? _includeTags; +@override@JsonKey(name: 'include_tags') List? get includeTags { + final value = _includeTags; + if (value == null) return null; + if (_includeTags is EqualUnmodifiableListView) return _includeTags; + // ignore: implicit_dynamic_type + return EqualUnmodifiableListView(value); +} + + final List? _excludeTags; +@override@JsonKey(name: 'exclude_tags') List? get excludeTags { + final value = _excludeTags; + if (value == null) return null; + if (_excludeTags is EqualUnmodifiableListView) return _excludeTags; + // ignore: implicit_dynamic_type + return EqualUnmodifiableListView(value); +} + + +/// Create a copy of TagFilter +/// with the given fields replaced by the non-null parameter values. +@override @JsonKey(includeFromJson: false, includeToJson: false) +@pragma('vm:prefer-inline') +_$TagFilterCopyWith<_TagFilter> get copyWith => __$TagFilterCopyWithImpl<_TagFilter>(this, _$identity); + +@override +Map toJson() { + return _$TagFilterToJson(this, ); +} + +@override +bool operator ==(Object other) { + return identical(this, other) || (other.runtimeType == runtimeType&&other is _TagFilter&&const DeepCollectionEquality().equals(other._includeTags, _includeTags)&&const DeepCollectionEquality().equals(other._excludeTags, _excludeTags)); +} + +@JsonKey(includeFromJson: false, includeToJson: false) +@override +int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(_includeTags),const DeepCollectionEquality().hash(_excludeTags)); + +@override +String toString() { + return 'TagFilter(includeTags: $includeTags, excludeTags: $excludeTags)'; +} + + +} + +/// @nodoc +abstract mixin class _$TagFilterCopyWith<$Res> implements $TagFilterCopyWith<$Res> { + factory _$TagFilterCopyWith(_TagFilter value, $Res Function(_TagFilter) _then) = __$TagFilterCopyWithImpl; +@override @useResult +$Res 
call({ +@JsonKey(name: 'include_tags') List? includeTags,@JsonKey(name: 'exclude_tags') List? excludeTags +}); + + + + +} +/// @nodoc +class __$TagFilterCopyWithImpl<$Res> + implements _$TagFilterCopyWith<$Res> { + __$TagFilterCopyWithImpl(this._self, this._then); + + final _TagFilter _self; + final $Res Function(_TagFilter) _then; + +/// Create a copy of TagFilter +/// with the given fields replaced by the non-null parameter values. +@override @pragma('vm:prefer-inline') $Res call({Object? includeTags = freezed,Object? excludeTags = freezed,}) { + return _then(_TagFilter( +includeTags: freezed == includeTags ? _self._includeTags : includeTags // ignore: cast_nullable_to_non_nullable +as List?,excludeTags: freezed == excludeTags ? _self._excludeTags : excludeTags // ignore: cast_nullable_to_non_nullable +as List?, + )); +} + + +} + +// dart format on diff --git a/packages/dataset_config_dart/lib/src/models/tag_filter.g.dart b/packages/dataset_config_dart/lib/src/models/tag_filter.g.dart new file mode 100644 index 0000000..db8553c --- /dev/null +++ b/packages/dataset_config_dart/lib/src/models/tag_filter.g.dart @@ -0,0 +1,22 @@ +// GENERATED CODE - DO NOT MODIFY BY HAND + +part of 'tag_filter.dart'; + +// ************************************************************************** +// JsonSerializableGenerator +// ************************************************************************** + +_TagFilter _$TagFilterFromJson(Map json) => _TagFilter( + includeTags: (json['include_tags'] as List?) + ?.map((e) => e as String) + .toList(), + excludeTags: (json['exclude_tags'] as List?) 
+ ?.map((e) => e as String) + .toList(), +); + +Map _$TagFilterToJson(_TagFilter instance) => + { + 'include_tags': instance.includeTags, + 'exclude_tags': instance.excludeTags, + }; diff --git a/packages/dataset_config_dart/lib/src/models/task.dart b/packages/dataset_config_dart/lib/src/models/task.dart index ccb568b..19e4f02 100644 --- a/packages/dataset_config_dart/lib/src/models/task.dart +++ b/packages/dataset_config_dart/lib/src/models/task.dart @@ -95,7 +95,13 @@ sealed class Task with _$Task { /// `@task` function (e.g. `"flutter_code_gen"` or /// `"dash_evals.runner.tasks.flutter_code_gen"`). /// When absent, the runner hydrates directly from JSON (Mode 2 — future). - @JsonKey(name: 'task_func') String? taskFunc, + @JsonKey(name: 'func') String? func, + + /// System message override for this task. + @JsonKey(name: 'system_message') String? systemMessage, + + /// Pass-through dict for sandbox plugin configuration. + @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, /// Task name. /// @@ -113,14 +119,14 @@ sealed class Task with _$Task { } class TaskMetadata { - final String taskFunc; + final String func; final Map additional; - TaskMetadata(this.taskFunc, this.additional); + TaskMetadata(this.func, this.additional); Map toJson() { return { - 'taskFunc': taskFunc, + 'func': func, }; } } diff --git a/packages/dataset_config_dart/lib/src/models/task.freezed.dart b/packages/dataset_config_dart/lib/src/models/task.freezed.dart index 94a4a37..b38f7d9 100644 --- a/packages/dataset_config_dart/lib/src/models/task.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/task.freezed.dart @@ -50,14 +50,15 @@ mixin _$Task { @JsonKey(name: 'early_stopping') Object? get earlyStopping;/// Task display name (e.g. for plotting). /// /// Defaults to the registered task name. -@JsonKey(name: 'display_name') String? get displayName; -/// Task function identifier for Mode 1 hydration. +@JsonKey(name: 'display_name') String? 
get displayName;/// Task function identifier for Mode 1 hydration. /// /// When present, the Python runner uses this to look up a pre-built /// `@task` function (e.g. `"flutter_code_gen"` or /// `"dash_evals.runner.tasks.flutter_code_gen"`). /// When absent, the runner hydrates directly from JSON (Mode 2 — future). -@JsonKey(name: 'task_func') String? get taskFunc;/// Task name. +@JsonKey(name: 'func') String? get func;/// System message override for this task. +@JsonKey(name: 'system_message') String? get systemMessage;/// Pass-through dict for sandbox plugin configuration. +@JsonKey(name: 'sandbox_parameters') Map? get sandboxParameters;/// Task name. /// /// Automatically determined based on the registered name if not specified. String? get name;/// Version of task (to distinguish evolutions of the task spec). @@ -75,16 +76,16 @@ $TaskCopyWith get copyWith => _$TaskCopyWithImpl(this as Task, _$ide @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) 
|| other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.taskFunc, taskFunc) || other.taskFunc == taskFunc)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other.metadata, metadata)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == 
timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.func, func) || other.func == func)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other.sandboxParameters, sandboxParameters)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other.metadata, metadata)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,taskFunc,name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(metadata)]); +int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(modelRoles),const 
DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,func,systemMessage,const DeepCollectionEquality().hash(sandboxParameters),name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(metadata)]); @override String toString() { - return 'Task(dataset: $dataset, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, taskFunc: $taskFunc, name: $name, version: $version, metadata: $metadata)'; + return 'Task(dataset: $dataset, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, func: $func, systemMessage: $systemMessage, sandboxParameters: $sandboxParameters, name: $name, version: $version, metadata: $metadata)'; } @@ -95,7 +96,7 @@ abstract mixin class $TaskCopyWith<$Res> { factory $TaskCopyWith(Task value, $Res Function(Task) _then) = _$TaskCopyWithImpl; @useResult $Res call({ - Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? 
config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? metadata + Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'func') String? func,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata }); @@ -112,7 +113,7 @@ class _$TaskCopyWithImpl<$Res> /// Create a copy of Task /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? dataset = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? 
approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? taskFunc = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? dataset = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? func = freezed,Object? systemMessage = freezed,Object? sandboxParameters = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { return _then(_self.copyWith( dataset: freezed == dataset ? _self.dataset : dataset // ignore: cast_nullable_to_non_nullable as Dataset?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable @@ -124,8 +125,10 @@ as int?,timeLimit: freezed == timeLimit ? _self.timeLimit : timeLimit // ignore: as int?,workingLimit: freezed == workingLimit ? _self.workingLimit : workingLimit // ignore: cast_nullable_to_non_nullable as int?,costLimit: freezed == costLimit ? 
_self.costLimit : costLimit // ignore: cast_nullable_to_non_nullable as double?,earlyStopping: freezed == earlyStopping ? _self.earlyStopping : earlyStopping ,displayName: freezed == displayName ? _self.displayName : displayName // ignore: cast_nullable_to_non_nullable -as String?,taskFunc: freezed == taskFunc ? _self.taskFunc : taskFunc // ignore: cast_nullable_to_non_nullable -as String?,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable +as String?,func: freezed == func ? _self.func : func // ignore: cast_nullable_to_non_nullable +as String?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable +as String?,sandboxParameters: freezed == sandboxParameters ? _self.sandboxParameters : sandboxParameters // ignore: cast_nullable_to_non_nullable +as Map?,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String?,version: null == version ? _self.version : version ,metadata: freezed == metadata ? _self.metadata : metadata // ignore: cast_nullable_to_non_nullable as Map?, )); @@ -221,10 +224,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'task_func') String? taskFunc, String? 
name, Object version, Map? metadata)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata)? 
$default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Task() when $default != null: -return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.taskFunc,_that.name,_that.version,_that.metadata);case _: +return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);case _: return orElse(); } @@ -242,10 +245,10 @@ return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.score /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? 
metadata) $default,) {final _that = this;
+@optionalTypeArgs TResult when(TResult Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata) $default,) {final _that = this;
 switch (_that) {
 case _Task():
-return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.taskFunc,_that.name,_that.version,_that.metadata);}
+return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);}
 }
 /// A variant of `when` that fallback to returning `null`
 ///
@@ -259,10 +262,10 @@ return $default(_that.dataset,_that.score
 /// }
 /// ```
-@optionalTypeArgs TResult? whenOrNull(TResult? Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? metadata)? $default,) {final _that = this;
+@optionalTypeArgs TResult? whenOrNull(TResult? Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata)? $default,) {final _that = this;
 switch (_that) {
 case _Task() when $default != null:
-return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.taskFunc,_that.name,_that.version,_that.metadata);case _:
+return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);case _:
 return null;
 
 }
@@ -274,7 +277,7 @@
 @JsonSerializable()
 class _Task implements Task {
- const _Task({this.dataset, this.setup, this.solver, this.cleanup, this.scorer, this.metrics, this.model, this.config, @JsonKey(name: 'model_roles') final Map? modelRoles, this.sandbox, this.approval, this.epochs, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'early_stopping') this.earlyStopping, @JsonKey(name: 'display_name') this.displayName, @JsonKey(name: 'task_func') this.taskFunc, this.name, this.version = 0, final Map? metadata}): _modelRoles = modelRoles,_metadata = metadata;
+ const _Task({this.dataset, this.setup, this.solver, this.cleanup, this.scorer, this.metrics, this.model, this.config, @JsonKey(name: 'model_roles') final Map? modelRoles, this.sandbox, this.approval, this.epochs, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'early_stopping') this.earlyStopping, @JsonKey(name: 'display_name') this.displayName, @JsonKey(name: 'func') this.func, @JsonKey(name: 'system_message') this.systemMessage, @JsonKey(name: 'sandbox_parameters') final Map? sandboxParameters, this.name, this.version = 0, final Map? metadata}): _modelRoles = modelRoles,_sandboxParameters = sandboxParameters,_metadata = metadata;
 factory _Task.fromJson(Map json) => _$TaskFromJson(json);
 
 /// Dataset to evaluate.
@@ -348,7 +351,20 @@ class _Task implements Task {
 /// `@task` function (e.g. `"flutter_code_gen"` or
 /// `"dash_evals.runner.tasks.flutter_code_gen"`).
 /// When absent, the runner hydrates directly from JSON (Mode 2 — future).
-@override@JsonKey(name: 'task_func') final String? taskFunc;
+@override@JsonKey(name: 'func') final String? func;
+/// System message override for this task.
+@override@JsonKey(name: 'system_message') final String? systemMessage;
+/// Pass-through dict for sandbox plugin configuration.
+ final Map? _sandboxParameters;
+/// Pass-through dict for sandbox plugin configuration.
+@override@JsonKey(name: 'sandbox_parameters') Map? get sandboxParameters {
+ final value = _sandboxParameters;
+ if (value == null) return null;
+ if (_sandboxParameters is EqualUnmodifiableMapView) return _sandboxParameters;
+ // ignore: implicit_dynamic_type
+ return EqualUnmodifiableMapView(value);
+}
+
 /// Task name.
 ///
 /// Automatically determined based on the registered name if not specified.
@@ -380,16 +396,16 @@ Map toJson() {
 @override
 bool operator ==(Object other) {
- return identical(this, other) || (other.runtimeType == runtimeType&&other is _Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.taskFunc, taskFunc) || other.taskFunc == taskFunc)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other._metadata, _metadata));
+ return identical(this, other) || (other.runtimeType == runtimeType&&other is _Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.func, func) || other.func == func)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other._sandboxParameters, _sandboxParameters)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other._metadata, _metadata));
 }
 
 @JsonKey(includeFromJson: false, includeToJson: false)
 @override
-int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,taskFunc,name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(_metadata)]);
+int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,func,systemMessage,const DeepCollectionEquality().hash(_sandboxParameters),name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(_metadata)]);
 
 @override
 String toString() {
- return 'Task(dataset: $dataset, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, taskFunc: $taskFunc, name: $name, version: $version, metadata: $metadata)';
+ return 'Task(dataset: $dataset, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, func: $func, systemMessage: $systemMessage, sandboxParameters: $sandboxParameters, name: $name, version: $version, metadata: $metadata)';
 }
 
 
@@ -400,7 +416,7 @@ abstract mixin class _$TaskCopyWith<$Res> implements $TaskCopyWith<$Res> {
 factory _$TaskCopyWith(_Task value, $Res Function(_Task) _then) = __$TaskCopyWithImpl;
 @override @useResult
 $Res call({
- Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? metadata
+ Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'func') String? func,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata
 });
@@ -417,7 +433,7 @@ class __$TaskCopyWithImpl<$Res>
 
 /// Create a copy of Task
 /// with the given fields replaced by the non-null parameter values.
-@override @pragma('vm:prefer-inline') $Res call({Object? dataset = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? taskFunc = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) {
+@override @pragma('vm:prefer-inline') $Res call({Object? dataset = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? func = freezed,Object? systemMessage = freezed,Object? sandboxParameters = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) {
 return _then(_Task(
 dataset: freezed == dataset ? _self.dataset : dataset // ignore: cast_nullable_to_non_nullable
 as Dataset?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable
@@ -429,8 +445,10 @@ as int?,timeLimit: freezed == timeLimit ? _self.timeLimit : timeLimit // ignore:
 as int?,workingLimit: freezed == workingLimit ? _self.workingLimit : workingLimit // ignore: cast_nullable_to_non_nullable
 as int?,costLimit: freezed == costLimit ? _self.costLimit : costLimit // ignore: cast_nullable_to_non_nullable
 as double?,earlyStopping: freezed == earlyStopping ? _self.earlyStopping : earlyStopping ,displayName: freezed == displayName ? _self.displayName : displayName // ignore: cast_nullable_to_non_nullable
-as String?,taskFunc: freezed == taskFunc ? _self.taskFunc : taskFunc // ignore: cast_nullable_to_non_nullable
-as String?,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable
+as String?,func: freezed == func ? _self.func : func // ignore: cast_nullable_to_non_nullable
+as String?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable
+as String?,sandboxParameters: freezed == sandboxParameters ? _self._sandboxParameters : sandboxParameters // ignore: cast_nullable_to_non_nullable
+as Map?,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable
 as String?,version: null == version ? _self.version : version ,metadata: freezed == metadata ? _self._metadata : metadata // ignore: cast_nullable_to_non_nullable
 as Map?,
 ));
diff --git a/packages/dataset_config_dart/lib/src/models/task.g.dart b/packages/dataset_config_dart/lib/src/models/task.g.dart
index 9906b3a..7752223 100644
--- a/packages/dataset_config_dart/lib/src/models/task.g.dart
+++ b/packages/dataset_config_dart/lib/src/models/task.g.dart
@@ -32,14 +32,16 @@ _Task _$TaskFromJson(Map json) => _Task(
   costLimit: (json['cost_limit'] as num?)?.toDouble(),
   earlyStopping: json['early_stopping'],
   displayName: json['display_name'] as String?,
-  taskFunc: json['task_func'] as String?,
+  func: json['func'] as String?,
+  systemMessage: json['system_message'] as String?,
+  sandboxParameters: json['sandbox_parameters'] as Map?,
   name: json['name'] as String?,
   version: json['version'] as Object? ?? 0,
   metadata: json['metadata'] as Map?,
 );
 
 Map _$TaskToJson(_Task instance) => {
-  'dataset': instance.dataset?.toJson(),
+  'dataset': instance.dataset,
   'setup': instance.setup,
   'solver': instance.solver,
   'cleanup': instance.cleanup,
@@ -60,7 +62,9 @@ Map _$TaskToJson(_Task instance) => {
   'cost_limit': instance.costLimit,
   'early_stopping': instance.earlyStopping,
   'display_name': instance.displayName,
-  'task_func': instance.taskFunc,
+  'func': instance.func,
+  'system_message': instance.systemMessage,
+  'sandbox_parameters': instance.sandboxParameters,
   'name': instance.name,
   'version': instance.version,
   'metadata': instance.metadata,
diff --git a/packages/dataset_config_dart/lib/src/models/variant.g.dart b/packages/dataset_config_dart/lib/src/models/variant.g.dart
index 3ed7ff4..a9a6d25 100644
--- a/packages/dataset_config_dart/lib/src/models/variant.g.dart
+++ b/packages/dataset_config_dart/lib/src/models/variant.g.dart
@@ -28,7 +28,7 @@ _Variant _$VariantFromJson(Map json) => _Variant(
 Map _$VariantToJson(_Variant instance) => {
   'name': instance.name,
-  'context_files': instance.contextFiles.map((e) => e.toJson()).toList(),
+  'context_files': instance.contextFiles,
   'mcp_servers': instance.mcpServers,
   'skill_paths': instance.skillPaths,
   'flutter_channel': instance.flutterChannel,
diff --git a/packages/dataset_config_dart/lib/src/parsed_task.dart b/packages/dataset_config_dart/lib/src/parsed_task.dart
index 21ce5e3..ef74d5e 100644
--- a/packages/dataset_config_dart/lib/src/parsed_task.dart
+++ b/packages/dataset_config_dart/lib/src/parsed_task.dart
@@ -13,7 +13,7 @@ const kDefaultSystemMessage =
 /// former `TaskConfig` model-package class.
 class ParsedTask {
   final String id;
-  final String taskFunc;
+  final String func;
   final List samples;
   final Variant variant;
   final String sandboxType;
@@ -22,6 +22,9 @@ class ParsedTask {
   final bool saveExamples;
   final String? examplesDir;
 
+  /// Tag filter for variant selection.
+  final TagFilter? variantFilters;
+
   // ------------------------------------------------------------------
   // Task-level settings (from task.yaml)
   // ------------------------------------------------------------------
@@ -79,7 +82,7 @@ class ParsedTask {
 
   const ParsedTask({
     required this.id,
-    required this.taskFunc,
+    required this.func,
     required this.samples,
     required this.variant,
     this.sandboxType = 'local',
@@ -87,6 +90,7 @@ class ParsedTask {
     this.allowedVariants,
     this.saveExamples = false,
     this.examplesDir,
+    this.variantFilters,
     // Task-level settings
     this.model,
     this.config,
@@ -110,7 +114,7 @@ class ParsedTask {
   /// Create a copy with overrides.
   ParsedTask copyWith({
     String? id,
-    String? taskFunc,
+    String? func,
     List? samples,
     Variant? variant,
     String? sandboxType,
@@ -118,6 +122,7 @@ class ParsedTask {
     List? allowedVariants,
     bool? saveExamples,
     String? examplesDir,
+    TagFilter? variantFilters,
    String? model,
     Map? config,
     Map? modelRoles,
@@ -138,7 +143,7 @@
   }) {
     return ParsedTask(
       id: id ?? this.id,
-      taskFunc: taskFunc ?? this.taskFunc,
+      func: func ?? this.func,
       samples: samples ?? this.samples,
       variant: variant ?? this.variant,
       sandboxType: sandboxType ?? this.sandboxType,
@@ -146,6 +151,7 @@
       allowedVariants: allowedVariants ?? this.allowedVariants,
       saveExamples: saveExamples ?? this.saveExamples,
       examplesDir: examplesDir ?? this.examplesDir,
+      variantFilters: variantFilters ?? this.variantFilters,
       model: model ?? this.model,
       config: config ?? this.config,
       modelRoles: modelRoles ?? this.modelRoles,
diff --git a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
index 89d9668..3175ffd 100644
--- a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
+++ b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
@@ -21,7 +21,7 @@ class JsonParser extends Parser {
   List parseTasksFromMaps(List> taskMaps) {
     return taskMaps.map((data) {
       final taskId = data['id'] as String;
-      final taskFunc = (data['func'] as String?) ?? taskId;
+      final func = (data['func'] as String?) ?? taskId;
       final systemMessage = data['system_message'] as String?;
       final allowedVariants = (data['allowed_variants'] as List?)
           ?.cast();
@@ -113,7 +113,7 @@ class JsonParser extends Parser {
       return ParsedTask(
         id: taskId,
-        taskFunc: taskFunc,
+        func: func,
         variant: const Variant(),
         samples: samples,
         systemMessage: systemMessage,
diff --git a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
index 3ea236c..edd4b03 100644
--- a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
+++ b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
@@ -50,7 +50,7 @@ class YamlParser extends Parser {
     final taskDir = p.dirname(taskPath);
     final taskId = (data['id'] as String?) ?? p.basename(taskDir);
-    final taskFunc = (data['func'] as String?) ?? taskId;
+    final func = (data['func'] as String?) ?? taskId;
 
     final taskWorkspaceRaw = data['workspace'];
     final taskTestsRaw = data['tests'];
@@ -102,7 +102,7 @@ class YamlParser extends Parser {
     return [
       ParsedTask(
         id: taskId,
-        taskFunc: taskFunc,
+        func: func,
         variant: const Variant(), // placeholder baseline
         samples: samples,
         systemMessage: systemMessage,
diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart
index d308d68..fe4861f 100644
--- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart
+++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart
@@ -239,7 +239,7 @@ class EvalSetResolver {
       inspectTasks.add(
         Task(
           name: '${tc.id}:${tc.variant.name}',
-          taskFunc: tc.taskFunc,
+          func: tc.func,
           dataset: dataset,
           sandbox: taskSandbox,
           metadata: metadata,
diff --git a/packages/dataset_config_dart/pubspec.yaml b/packages/dataset_config_dart/pubspec.yaml
index cc76a7a..61a386b 100644
--- a/packages/dataset_config_dart/pubspec.yaml
+++ b/packages/dataset_config_dart/pubspec.yaml
@@ -15,5 +15,8 @@ dependencies:
   yaml: ^3.1.0
 
 dev_dependencies:
+  build_runner: ^2.12.2
+  freezed: ^3.2.5
+  json_serializable: ^6.13.0
   lints: ^6.0.0
   test: any
diff --git a/packages/dataset_config_dart/test/eval_set_resolver_test.dart b/packages/dataset_config_dart/test/eval_set_resolver_test.dart
index d982b58..de32e88 100644
--- a/packages/dataset_config_dart/test/eval_set_resolver_test.dart
+++ b/packages/dataset_config_dart/test/eval_set_resolver_test.dart
@@ -7,7 +7,7 @@ void main() {
   /// Helper to create a minimal [ParsedTask] for testing.
   ParsedTask makeTask({
     String id = 'test_task',
-    String taskFunc = 'question_answer',
+    String func = 'question_answer',
     List? samples,
     Variant? variant,
     List? allowedVariants,
@@ -18,7 +18,7 @@ void main() {
   }) {
     return ParsedTask(
       id: id,
-      taskFunc: taskFunc,
+      func: func,
       samples: samples ?? [
@@ -278,14 +278,14 @@ void main() {
       expect(taskNames.first, contains('included'));
     });
 
-    test('taskFunc is propagated to output Task', () {
+    test('func is propagated to output Task', () {
       final results = resolver.resolve(
-        [makeTask(taskFunc: 'flutter_code_gen')],
+        [makeTask(func: 'flutter_code_gen')],
         makeJob(models: ['m']),
         '/tmp/dataset',
       );
 
-      expect(results.first.tasks.first.taskFunc, 'flutter_code_gen');
+      expect(results.first.tasks.first.func, 'flutter_code_gen');
     });
 
     test('system_message appears in task metadata', () {
diff --git a/packages/dataset_config_dart/test/eval_set_writer_test.dart b/packages/dataset_config_dart/test/eval_set_writer_test.dart
index 2ef58e5..ef377e6 100644
--- a/packages/dataset_config_dart/test/eval_set_writer_test.dart
+++ b/packages/dataset_config_dart/test/eval_set_writer_test.dart
@@ -25,7 +25,7 @@ void main() {
       taskCount,
       (i) => Task(
         name: 'task_$i:baseline',
-        taskFunc: 'func_$i',
+        func: 'func_$i',
         dataset: Dataset(
           samples: [
             Sample(id: 's$i', input: 'input $i', target: 'target $i'),
diff --git a/packages/dataset_config_dart/test/json_parser_test.dart b/packages/dataset_config_dart/test/json_parser_test.dart
index f09520c..3763af6 100644
--- a/packages/dataset_config_dart/test/json_parser_test.dart
+++ b/packages/dataset_config_dart/test/json_parser_test.dart
@@ -24,7 +24,7 @@ void main() {
       expect(tasks, hasLength(1));
       expect(tasks.first.id, 'my_task');
-      expect(tasks.first.taskFunc, 'question_answer');
+      expect(tasks.first.func, 'question_answer');
       expect(tasks.first.samples, hasLength(1));
       expect(tasks.first.samples.first.id, 's1');
       expect(tasks.first.samples.first.input, 'What is Dart?');
@@ -39,7 +39,7 @@ void main() {
         },
       ]);
 
-      expect(tasks.first.taskFunc, 'dart_qa');
+      expect(tasks.first.func, 'dart_qa');
     });
 
     test('throws FormatException when sample missing required field', () {
diff --git a/packages/dataset_config_dart/test/parsed_task_test.dart b/packages/dataset_config_dart/test/parsed_task_test.dart
index 4921e30..b6fb7c5 100644
--- a/packages/dataset_config_dart/test/parsed_task_test.dart
+++ b/packages/dataset_config_dart/test/parsed_task_test.dart
@@ -6,7 +6,7 @@ void main() {
     test('has correct defaults', () {
       const task = ParsedTask(
         id: 'test',
-        taskFunc: 'question_answer',
+        func: 'question_answer',
         samples: [],
         variant: Variant(),
       );
@@ -27,7 +27,7 @@ void main() {
     test('stores all constructor fields', () {
       const task = ParsedTask(
         id: 'my_task',
-        taskFunc: 'flutter_code_gen',
+        func: 'flutter_code_gen',
         samples: [Sample(id: 's1', input: 'q', target: 'a')],
         variant: Variant(name: 'full'),
         sandboxType: 'podman',
@@ -49,7 +49,7 @@ void main() {
       );
 
       expect(task.id, 'my_task');
-      expect(task.taskFunc, 'flutter_code_gen');
+      expect(task.func, 'flutter_code_gen');
       expect(task.samples, hasLength(1));
       expect(task.variant.name, 'full');
       expect(task.sandboxType, 'podman');
@@ -75,7 +75,7 @@ void main() {
     test('overrides specified fields', () {
       const original = ParsedTask(
         id: 'original',
-        taskFunc: 'func_a',
+        func: 'func_a',
         samples: [],
         variant: Variant(name: 'baseline'),
         timeLimit: 100,
@@ -93,7 +93,7 @@ void main() {
     test('preserves fields not overridden', () {
       const original = ParsedTask(
         id: 'task',
-        taskFunc: 'func',
+        func: 'func',
         samples: [],
         variant: Variant(name: 'full'),
         sandboxType: 'podman',
@@ -103,7 +103,7 @@ void main() {
 
       final copy = original.copyWith(id: 'new_id');
 
-      expect(copy.taskFunc, 'func');
+      expect(copy.func, 'func');
       expect(copy.variant.name, 'full');
       expect(copy.sandboxType, 'podman');
       expect(copy.systemMessage, 'Be helpful');
@@ -113,7 +113,7 @@ void main() {
     test('returns a new instance (not the same object)', () {
       const original = ParsedTask(
         id: 'a',
-        taskFunc: 'f',
+        func: 'f',
         samples: [],
         variant: Variant(),
       );
@@ -128,7 +128,7 @@ void main() {
     test('can override samples list', () {
       const original = ParsedTask(
         id: 'task',
-        taskFunc: 'func',
+        func: 'func',
         samples: [Sample(id: 's1', input: 'q', target: 'a')],
         variant: Variant(),
       );
diff --git a/packages/dataset_config_python/src/dataset_config_python/models/__init__.py b/packages/dataset_config_python/src/dataset_config_python/models/__init__.py
index a90aaad..f42caca 100644
--- a/packages/dataset_config_python/src/dataset_config_python/models/__init__.py
+++ b/packages/dataset_config_python/src/dataset_config_python/models/__init__.py
@@ -5,6 +5,7 @@
 from dataset_config_python.models.eval_set import EvalSet
 from dataset_config_python.models.job import Job, JobTask
 from dataset_config_python.models.sample import Sample
+from dataset_config_python.models.tag_filter import TagFilter, matches_tag_filter
 from dataset_config_python.models.task import Task
 from dataset_config_python.models.variant import Variant
 
@@ -16,6 +17,8 @@
     "Job",
     "JobTask",
     "Sample",
+    "TagFilter",
     "Task",
     "Variant",
+    "matches_tag_filter",
 ]
diff --git a/packages/dataset_config_python/src/dataset_config_python/models/job.py b/packages/dataset_config_python/src/dataset_config_python/models/job.py
index 683e09f..b259ed1 100644
--- a/packages/dataset_config_python/src/dataset_config_python/models/job.py
+++ b/packages/dataset_config_python/src/dataset_config_python/models/job.py
@@ -6,6 +6,8 @@
 
 from pydantic import BaseModel
 
+from dataset_config_python.models.tag_filter import TagFilter
+
 
 class JobTask(BaseModel):
     """Per-task configuration within a job."""
@@ -22,6 +24,9 @@ class JobTask(BaseModel):
     system_message: str | None = None
     """Override system message for this task."""
 
+    args: dict[str, Any] | None = None
+    """Per-task argument overrides passed to the task function."""
+
     @staticmethod
     def from_yaml(task_id: str, data: dict[str, Any] | None) -> JobTask:
         """Create from parsed YAML data."""
@@ -32,6 +37,7 @@ def from_yaml(task_id: str, data: dict[str, Any] | None) -> JobTask:
             include_samples=data.get("include-samples"),
             exclude_samples=data.get("exclude-samples"),
             system_message=data.get("system_message"),
+            args=data.get("args"),
         )
 
 
@@ -39,6 +45,8 @@ class Job(BaseModel):
     """A job configuration defining what to run and how to run it."""
 
     # Core settings
+    description: str | None = None
+    image_prefix: str | None = None
     log_dir: str
     sandbox_type: str = "local"
     max_connections: int = 10
@@ -100,3 +108,7 @@ class Job(BaseModel):
     # Pass-through overrides
     eval_set_overrides: dict[str, Any] | None = None
     task_defaults: dict[str, Any] | None = None
+
+    # Tag-based filtering
+    task_filters: TagFilter | None = None
+    sample_filters: TagFilter | None = None
diff --git a/packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py b/packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py
new file mode 100644
index 0000000..5d298e2
--- /dev/null
+++ b/packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py
@@ -0,0 +1,30 @@
+"""Tag-based filter for including/excluding items by their tags."""
+
+from __future__ import annotations
+
+from pydantic import BaseModel
+
+
+class TagFilter(BaseModel):
+    """Tag-based filter for including/excluding items."""
+
+    include_tags: list[str] | None = None
+    exclude_tags: list[str] | None = None
+
+
+def matches_tag_filter(item_tags: list[str], tag_filter: TagFilter) -> bool:
+    """Check whether a set of item_tags matches the given filter.
+
+    Returns True if:
+    - All include_tags (if any) are present in item_tags
+    - No exclude_tags (if any) are present in item_tags
+    """
+    if tag_filter.include_tags and not all(
+        t in item_tags for t in tag_filter.include_tags
+    ):
+        return False
+    if tag_filter.exclude_tags and any(
+        t in item_tags for t in tag_filter.exclude_tags
+    ):
+        return False
+    return True
diff --git a/packages/dataset_config_python/src/dataset_config_python/models/task.py b/packages/dataset_config_python/src/dataset_config_python/models/task.py
index cafbbe3..5623ab3 100644
--- a/packages/dataset_config_python/src/dataset_config_python/models/task.py
+++ b/packages/dataset_config_python/src/dataset_config_python/models/task.py
@@ -19,9 +19,15 @@ class Task(BaseModel):
     name: str = ""
     """Task name (e.g. ``"dart_qa:baseline"``)."""
 
-    task_func: str | None = None
+    func: str | None = None
     """Task function identifier for hydration (e.g. ``"question_answer"``)."""
 
+    system_message: str | None = None
+    """System message override for this task."""
+
+    sandbox_parameters: dict[str, Any] | None = None
+    """Pass-through dict for sandbox plugin configuration."""
+
     dataset: Dataset | None = None
     """Inline dataset with samples."""
diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py
index 0e9fc12..43a646e 100644
--- a/packages/dataset_config_python/src/dataset_config_python/parser.py
+++ b/packages/dataset_config_python/src/dataset_config_python/parser.py
@@ -29,7 +29,7 @@ def __init__(
         self,
         *,
         id: str,
-        task_func: str,
+        func: str,
         samples: list[Sample],
         variant: Variant | None = None,
         sandbox_type: str = "local",
@@ -57,7 +57,7 @@ def __init__(
         metadata: dict[str, Any] | None = None,
     ):
         self.id = id
-        self.task_func = task_func
+        self.func = func
         self.samples = samples
         self.variant = variant or Variant()
         self.sandbox_type = sandbox_type
@@ -89,7 +89,7 @@ def copy_with(
         self,
         *,
         id: str | None = _UNSET,
-        task_func: str | None = _UNSET,
+        func: str | None = _UNSET,
         samples: list[Sample] | None = _UNSET,
         variant: Variant | None = _UNSET,
         sandbox_type: str | None = _UNSET,
@@ -119,7 +119,7 @@ def copy_with(
         _U = ParsedTask._UNSET
         return ParsedTask(
             id=self.id if id is _U else id,  # type: ignore[arg-type]
-            task_func=self.task_func if task_func is _U else task_func,  # type: ignore[arg-type]
+            func=self.func if func is _U else func,  # type: ignore[arg-type]
             samples=self.samples if samples is _U else samples,  # type: ignore[arg-type]
             variant=self.variant if variant is _U else variant,
             sandbox_type=self.sandbox_type if sandbox_type is _U else sandbox_type,  # type: ignore[arg-type]
@@ -260,7 +260,7 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]:
     return [
         ParsedTask(
             id=task_id,
-            task_func=task_func,
+            func=task_func,
             variant=Variant(),
             samples=samples,
             system_message=system_message,
diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py
index 0801b3c..9f5589a 100644
--- a/packages/dataset_config_python/src/dataset_config_python/resolver.py
+++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py
@@ -213,7 +213,7 @@ def _build_eval_set(
         inspect_tasks.append(
             Task(
                 name=f"{tc.id}:{tc.variant.name}",
-                task_func=tc.task_func,
+                func=tc.func,
                 dataset=dataset,
                 sandbox=task_sandbox,
                 metadata=task_metadata,
diff --git a/packages/dataset_config_python/tests/test_config.py b/packages/dataset_config_python/tests/test_config.py
index b79f9a9..20890c3 100644
--- a/packages/dataset_config_python/tests/test_config.py
+++ b/packages/dataset_config_python/tests/test_config.py
@@ -178,7 +178,7 @@ def test_job_task_from_yaml_with_data(self):
 
     def test_eval_set_serialization(self):
         es = EvalSet(
-            tasks=[Task(name="test:baseline", task_func="qa")],
+            tasks=[Task(name="test:baseline", func="qa")],
             log_dir="/tmp/logs",
             model=["google/gemini-2.5-flash"],
         )
@@ -312,11 +312,11 @@ def test_write_single(self, dataset_dir, tmp_path):
 
     def test_write_multiple(self, tmp_path):
         es1 = EvalSet(
-            tasks=[Task(name="t1:baseline", task_func="qa")],
+            tasks=[Task(name="t1:baseline", func="qa")],
             log_dir="/tmp/logs1",
         )
         es2 = EvalSet(
-            tasks=[Task(name="t2:baseline", task_func="qa")],
+            tasks=[Task(name="t2:baseline", func="qa")],
             log_dir="/tmp/logs2",
         )
         output_dir = str(tmp_path / "output")
diff --git a/packages/devals_cli/lib/src/dataset/dry_run.dart b/packages/devals_cli/lib/src/dataset/dry_run.dart
index 891f700..856e172 100644
--- a/packages/devals_cli/lib/src/dataset/dry_run.dart
+++ b/packages/devals_cli/lib/src/dataset/dry_run.dart
@@ -32,9 +32,9 @@ bool _validateConfig(EvalSet config) {
   final taskSummaries = {};
 
   for (final task in config.tasks) {
-    final name = task.name ?? task.taskFunc ?? '(unknown)';
+    final name = task.name ?? task.func ?? '(unknown)';
 
-    if (task.taskFunc == null) {
+    if (task.func == null) {
       warnings.add(
         'Task "$name" has no task_func — Mode 2 hydration required',
       );
diff --git a/tool/config_parity/pubspec.lock b/tool/config_parity/pubspec.lock
deleted file mode 100644
index dd2733b..0000000
--- a/tool/config_parity/pubspec.lock
+++ /dev/null
@@ -1,108 +0,0 @@
-# Generated by pub
-# See https://dart.dev/tools/pub/glossary#lockfile
-packages:
-  async:
-    dependency: transitive
-    description:
-      name: async
-      sha256: "758e6d74e971c3e5aceb4110bfd6698efc7f501675bcfe0c775459a8140750eb"
-      url: "https://pub.dev"
-    source: hosted
-    version: "2.13.0"
-  collection:
-    dependency: transitive
-    description:
-      name: collection
-      sha256: "2f5709ae4d3d59dd8f7cd309b4e023046b57d8a6c82130785d2b0e5868084e76"
-      url: "https://pub.dev"
-    source: hosted
-    version: "1.19.1"
-  dataset_config_dart:
-    dependency: "direct main"
-    description:
-      path: "../../packages/dataset_config_dart"
-      relative: true
-    source: path
-    version: "0.0.1"
-  file:
-    dependency: transitive
-    description:
-      name:
file - sha256: a3b4f84adafef897088c160faf7dfffb7696046cb13ae90b508c2cbc95d3b8d4 - url: "https://pub.dev" - source: hosted - version: "7.0.1" - freezed_annotation: - dependency: transitive - description: - name: freezed_annotation - sha256: "7294967ff0a6d98638e7acb774aac3af2550777accd8149c90af5b014e6d44d8" - url: "https://pub.dev" - source: hosted - version: "3.1.0" - glob: - dependency: transitive - description: - name: glob - sha256: c3f1ee72c96f8f78935e18aa8cecced9ab132419e8625dc187e1c2408efc20de - url: "https://pub.dev" - source: hosted - version: "2.1.3" - json_annotation: - dependency: transitive - description: - name: json_annotation - sha256: cb09e7dac6210041fad964ed7fbee004f14258b4eca4040f72d1234062ace4c8 - url: "https://pub.dev" - source: hosted - version: "4.11.0" - meta: - dependency: transitive - description: - name: meta - sha256: "9f29b9bcc8ee287b1a31e0d01be0eae99a930dbffdaecf04b3f3d82a969f296f" - url: "https://pub.dev" - source: hosted - version: "1.18.1" - path: - dependency: "direct main" - description: - name: path - sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5" - url: "https://pub.dev" - source: hosted - version: "1.9.1" - source_span: - dependency: transitive - description: - name: source_span - sha256: "56a02f1f4cd1a2d96303c0144c93bd6d909eea6bee6bf5a0e0b685edbd4c47ab" - url: "https://pub.dev" - source: hosted - version: "1.10.2" - string_scanner: - dependency: transitive - description: - name: string_scanner - sha256: "921cd31725b72fe181906c6a94d987c78e3b98c2e205b397ea399d4054872b43" - url: "https://pub.dev" - source: hosted - version: "1.4.1" - term_glyph: - dependency: transitive - description: - name: term_glyph - sha256: "7f554798625ea768a7518313e58f83891c7f5024f88e46e7182a4558850a4b8e" - url: "https://pub.dev" - source: hosted - version: "1.2.2" - yaml: - dependency: transitive - description: - name: yaml - sha256: b9da305ac7c39faa3f030eccd175340f968459dae4af175130b3fc47e40d76ce - url: "https://pub.dev" - 
source: hosted - version: "3.1.3" -sdks: - dart: ">=3.10.0 <4.0.0" From 28fba88f9990ab2b245d74ee6354b0b6744a6f95 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 13 Mar 2026 16:51:08 -0700 Subject: [PATCH 04/21] adds task level fields and updates parser --- .../lib/src/parsed_task.dart | 6 ++++++ .../lib/src/parsers/yaml_parser.dart | 16 ++++++++++++++++ .../lib/src/resolvers/eval_set_resolver.dart | 18 +++++++++++++++++- .../src/dataset_config_python/parser.py | 10 ++++++++++ .../src/dataset_config_python/resolver.py | 18 +++++++++++++++++- 5 files changed, 66 insertions(+), 2 deletions(-) diff --git a/packages/dataset_config_dart/lib/src/parsed_task.dart b/packages/dataset_config_dart/lib/src/parsed_task.dart index ef74d5e..c5afaf7 100644 --- a/packages/dataset_config_dart/lib/src/parsed_task.dart +++ b/packages/dataset_config_dart/lib/src/parsed_task.dart @@ -25,6 +25,9 @@ class ParsedTask { /// Tag filter for variant selection. final TagFilter? variantFilters; + /// Pass-through dict for sandbox plugin configuration. + final Map? sandboxParameters; + // ------------------------------------------------------------------ // Task-level settings (from task.yaml) // ------------------------------------------------------------------ @@ -91,6 +94,7 @@ class ParsedTask { this.saveExamples = false, this.examplesDir, this.variantFilters, + this.sandboxParameters, // Task-level settings this.model, this.config, @@ -123,6 +127,7 @@ class ParsedTask { bool? saveExamples, String? examplesDir, TagFilter? variantFilters, + Map? sandboxParameters, String? model, Map? config, Map? modelRoles, @@ -152,6 +157,7 @@ class ParsedTask { saveExamples: saveExamples ?? this.saveExamples, examplesDir: examplesDir ?? this.examplesDir, variantFilters: variantFilters ?? this.variantFilters, + sandboxParameters: sandboxParameters ?? this.sandboxParameters, model: model ?? this.model, config: config ?? this.config, modelRoles: modelRoles ?? 
this.modelRoles, diff --git a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart index edd4b03..e8f0b69 100644 --- a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart +++ b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart @@ -98,6 +98,7 @@ class YamlParser extends Parser { final displayName = data['display_name'] as String?; final version = data['version']; final taskMetadata = _asMap(data['metadata']); + final sandboxParameters = _asMap(data['sandbox_parameters']); return [ ParsedTask( @@ -125,6 +126,7 @@ class YamlParser extends Parser { displayName: displayName, version: version, metadata: taskMetadata, + sandboxParameters: sandboxParameters, ), ]; } @@ -370,14 +372,28 @@ class YamlParser extends Parser { } } + // Parse tag filters + final taskFiltersRaw = data['task_filters']; + final sampleFiltersRaw = data['sample_filters']; + final TagFilter? taskFilters = taskFiltersRaw is Map + ? TagFilter.fromJson(Map.from(taskFiltersRaw)) + : null; + final TagFilter? sampleFilters = sampleFiltersRaw is Map + ? 
TagFilter.fromJson(Map.from(sampleFiltersRaw)) + : null; + return Job( logDir: logDir, sandboxType: sandboxType, maxConnections: maxConnections, + description: data['description'] as String?, + imagePrefix: data['image_prefix'] as String?, models: (data['models'] as List?)?.cast(), variants: variants, taskPaths: taskPaths, tasks: tasks, + taskFilters: taskFilters, + sampleFilters: sampleFilters, saveExamples: data['save_examples'] == true, // Promoted eval_set() fields retryAttempts: data['retry_attempts'] as int?, diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index fe4861f..ba5dddf 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -243,6 +243,8 @@ class EvalSetResolver { dataset: dataset, sandbox: taskSandbox, metadata: metadata, + systemMessage: tc.systemMessage, + sandboxParameters: tc.sandboxParameters, model: resolvedModel, config: resolvedConfig, modelRoles: resolvedModelRoles, @@ -427,9 +429,15 @@ class EvalSetResolver { for (final taskConfig in datasetTasks) { final taskId = taskConfig.id; - // Filter by job.tasks + // Filter by job.tasks (ID-based) if (job.tasks != null && !job.tasks!.containsKey(taskId)) continue; + // Filter by job.taskFilters (tag-based) + if (job.taskFilters != null) { + final taskTags = (taskConfig.metadata?['tags'] as List?)?.cast() ?? []; + if (!matchesTagFilter(taskTags, job.taskFilters!)) continue; + } + // Determine effective variants (intersection) final effectiveVariants = >{}; for (final entry in jobVariants.entries) { @@ -459,6 +467,14 @@ class EvalSetResolver { } } + // Apply sample tag filtering (job-level) + if (job.sampleFilters != null) { + samples = samples.where((s) { + final sampleTags = (s.metadata?['tags'] as List?)?.cast() ?? 
[]; + return matchesTagFilter(sampleTags, job.sampleFilters!); + }).toList(); + } + // Apply system_message override var systemMessage = taskConfig.systemMessage; if (jobTask?.systemMessage != null) { diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py index 43a646e..56dc89f 100644 --- a/packages/dataset_config_python/src/dataset_config_python/parser.py +++ b/packages/dataset_config_python/src/dataset_config_python/parser.py @@ -12,6 +12,7 @@ from dataset_config_python.models.job import Job, JobTask from dataset_config_python.models.sample import Sample +from dataset_config_python.models.tag_filter import TagFilter from dataset_config_python.models.variant import Variant # Default log directory (relative to dataset root). @@ -55,6 +56,7 @@ def __init__( display_name: str | None = None, version: Any | None = None, metadata: dict[str, Any] | None = None, + sandbox_parameters: dict[str, Any] | None = None, ): self.id = id self.func = func @@ -82,6 +84,7 @@ def __init__( self.display_name = display_name self.version = version self.metadata = metadata + self.sandbox_parameters = sandbox_parameters _UNSET: Any = object() @@ -97,6 +100,7 @@ def copy_with( allowed_variants: list[str] | None = _UNSET, save_examples: bool | None = _UNSET, examples_dir: str | None = _UNSET, + sandbox_parameters: dict[str, Any] | None = _UNSET, model: str | None = _UNSET, config: dict[str, Any] | None = _UNSET, model_roles: dict[str, str] | None = _UNSET, @@ -127,6 +131,7 @@ def copy_with( allowed_variants=self.allowed_variants if allowed_variants is _U else allowed_variants, save_examples=self.save_examples if save_examples is _U else save_examples, # type: ignore[arg-type] examples_dir=self.examples_dir if examples_dir is _U else examples_dir, + sandbox_parameters=self.sandbox_parameters if sandbox_parameters is _U else sandbox_parameters, model=self.model if model is _U else model, 
config=self.config if config is _U else config, model_roles=self.model_roles if model_roles is _U else model_roles, @@ -282,6 +287,7 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: display_name=data.get("display_name"), version=data.get("version"), metadata=data.get("metadata") if isinstance(data.get("metadata"), dict) else None, + sandbox_parameters=data.get("sandbox_parameters") if isinstance(data.get("sandbox_parameters"), dict) else None, ) ] @@ -542,6 +548,10 @@ def parse_job(job_path: str, dataset_root: str) -> Job: task_defaults=( data.get("task_defaults") if isinstance(data.get("task_defaults"), dict) else None ), + description=data.get("description"), + image_prefix=data.get("image_prefix"), + task_filters=TagFilter(**data["task_filters"]) if isinstance(data.get("task_filters"), dict) else None, + sample_filters=TagFilter(**data["sample_filters"]) if isinstance(data.get("sample_filters"), dict) else None, ) diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 9f5589a..1415e2d 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -10,6 +10,7 @@ from dataset_config_python.models.dataset import Dataset from dataset_config_python.models.eval_set import EvalSet from dataset_config_python.models.sample import Sample +from dataset_config_python.models.tag_filter import matches_tag_filter from dataset_config_python.models.task import Task from dataset_config_python.models.variant import Variant from dataset_config_python.parser import ParsedTask, find_job_file, parse_job, parse_tasks @@ -217,6 +218,8 @@ def _build_eval_set( dataset=dataset, sandbox=task_sandbox, metadata=task_metadata, + system_message=tc.system_message, + sandbox_parameters=tc.sandbox_parameters, model=tc.model or task_defaults.get("model"), config=tc.config 
or task_defaults.get("config"), model_roles=tc.model_roles or task_defaults.get("model_roles"), @@ -372,10 +375,16 @@ def _expand_task_configs( for tc in dataset_tasks: task_id = tc.id - # Filter by job.tasks + # Filter by job.tasks (ID-based) if job.tasks is not None and task_id not in job.tasks: continue + # Filter by job.task_filters (tag-based) + if job.task_filters is not None: + task_tags = (tc.metadata or {}).get("tags", []) + if not matches_tag_filter(task_tags, job.task_filters): + continue + # Determine effective variants (intersection) effective_variants: dict[str, dict[str, Any]] = {} for vname, vdef in job_variants.items(): @@ -393,6 +402,13 @@ def _expand_task_configs( if job_task.exclude_samples: samples = [s for s in samples if s.id not in job_task.exclude_samples] + # Apply sample tag filtering (job-level) + if job.sample_filters is not None: + samples = [ + s for s in samples + if matches_tag_filter((s.metadata or {}).get("tags", []), job.sample_filters) + ] + # Apply system_message override system_message = tc.system_message if job_task and job_task.system_message is not None: From fe24d910f9536313d043350250eff41f910b07e3 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 13 Mar 2026 16:58:14 -0700 Subject: [PATCH 05/21] feat: allow configurable sandbox and SDK channel mappings in dataset resolvers and support colon syntax for task function resolution. 
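The colon syntax introduced in this patch resolves a task function by importing the module before the colon and fetching the attribute after it. A minimal sketch of that lookup (illustrative only; the shipped `_resolve_task_func` also handles short names and dotted paths, as shown in the diff below):

```python
import importlib


def resolve_colon(name: str):
    """Resolve a "module.path:function_name" reference to the named callable."""
    module_path, func_name = name.split(":", 1)
    try:
        # Import only the part before the colon as a module.
        module = importlib.import_module(module_path)
    except ModuleNotFoundError:
        raise ValueError(
            f"Could not find module '{module_path}' for task function '{name}'."
        )
    # The part after the colon is looked up as an attribute of that module.
    func = getattr(module, func_name, None)
    if func is None:
        raise ValueError(f"Module '{module_path}' does not have a function '{func_name}'.")
    return func


# Works against any importable module, e.g. the standard library:
dumps = resolve_colon("json:dumps")
```

This mirrors the `module:attr` object-reference convention used by Python entry points, which is why it needs no package-specific registry.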
--- .../src/dash_evals/runner/json_runner.py | 16 ++++++++ .../lib/src/resolvers/eval_set_resolver.dart | 41 ++++++++++++++----- .../src/dataset_config_python/resolver.py | 37 +++++++++++------ 3 files changed, 70 insertions(+), 24 deletions(-) diff --git a/packages/dash_evals/src/dash_evals/runner/json_runner.py b/packages/dash_evals/src/dash_evals/runner/json_runner.py index 7db7e89..89e4eee 100644 --- a/packages/dash_evals/src/dash_evals/runner/json_runner.py +++ b/packages/dash_evals/src/dash_evals/runner/json_runner.py @@ -27,6 +27,7 @@ def _resolve_task_func(name: str): Supports: - Short names: "flutter_code_gen" → dash_evals.runner.tasks.flutter_code_gen + - Colon syntax: "my_package.tasks:my_task" → import my_package.tasks, get my_task - Dotted paths: "dash_evals.runner.tasks.flutter_code_gen.flutter_code_gen" For short names, first tries to import a module with the same name. @@ -36,6 +37,21 @@ def _resolve_task_func(name: str): Returns the callable task function. """ + # Colon syntax: "module.path:function_name" + if ":" in name: + module_path, func_name = name.split(":", 1) + try: + module = importlib.import_module(module_path) + except ModuleNotFoundError: + raise ValueError( + f"Could not find module '{module_path}' for task function '{name}'. " + f"Check that the module exists and is importable." + ) + func = getattr(module, func_name, None) + if func is None: + raise ValueError(f"Module '{module_path}' does not have a function '{func_name}'.") + return func + if "." 
not in name: # Short name: try module with the same name first module_path = f"dash_evals.runner.tasks.{name}" diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index ba5dddf..871e5e9 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -21,8 +21,10 @@ const List kDefaultModels = [ 'openai/gpt-5-pro', ]; -/// Available sandbox configurations. -const Map> kSandboxRegistry = { +/// Default sandbox configurations for Flutter evaluations. +/// +/// Consumers can pass these to [EvalSetResolver] or provide their own. +const Map> kDefaultSandboxRegistry = { 'podman': {'name': 'podman', 'path': './sandboxes/podman/compose.yaml'}, 'podman-beta': { 'name': 'podman', @@ -34,8 +36,10 @@ const Map> kSandboxRegistry = { }, }; -/// Maps Flutter SDK channel names to sandbox registry keys. -const Map kSdkChannels = { +/// Default Flutter SDK channel → sandbox registry key mapping. +/// +/// Consumers can pass these to [EvalSetResolver] or provide their own. +const Map kDefaultSdkChannels = { 'stable': 'podman', 'beta': 'podman-beta', 'main': 'podman-main', @@ -50,6 +54,22 @@ const Map kSdkChannels = { /// 3. Groups by flutter_channel (one [EvalSet] per group) /// 4. Propagates job-level and task-level settings to the output class EvalSetResolver { + /// Creates a resolver with optional sandbox configuration. + /// + /// If [sandboxRegistry] or [sdkChannels] are not provided, they default + /// to empty maps (no sandbox resolution). Pass [kDefaultSandboxRegistry] + /// and [kDefaultSdkChannels] for the Flutter-specific sandbox setup. + const EvalSetResolver({ + this.sandboxRegistry = const {}, + this.sdkChannels = const {}, + }); + + /// Named sandbox configurations (e.g. `'podman'` → compose file path). 
+ final Map> sandboxRegistry; + + /// SDK channel → sandbox registry key mapping. + final Map sdkChannels; + /// Resolve task configs and job into [EvalSet] objects. /// /// Groups by flutter_channel so each gets its own sandbox. @@ -136,7 +156,6 @@ class EvalSetResolver { if (workspace != null && isContainer) { files = {...?files, '/workspace': workspace}; - setup = setup ?? 'cd /workspace && flutter pub get'; enriched['workspace'] = '/workspace'; } if (workspaceGit != null) { @@ -387,10 +406,10 @@ class EvalSetResolver { if (sandboxType.isEmpty || sandboxType == 'local') return 'local'; // Channel override → look up channel-specific sandbox - if (flutterChannel != null && kSdkChannels.containsKey(flutterChannel)) { - final registryKey = kSdkChannels[flutterChannel]!; - if (kSandboxRegistry.containsKey(registryKey)) { - final def = kSandboxRegistry[registryKey]!; + if (flutterChannel != null && sdkChannels.containsKey(flutterChannel)) { + final registryKey = sdkChannels[flutterChannel]!; + if (sandboxRegistry.containsKey(registryKey)) { + final def = sandboxRegistry[registryKey]!; var sandboxPath = def['path']!; if (!p.isAbsolute(sandboxPath)) { sandboxPath = p.normalize(p.join(datasetRoot, sandboxPath)); @@ -400,8 +419,8 @@ class EvalSetResolver { } // Named sandbox from registry - if (kSandboxRegistry.containsKey(sandboxType)) { - final def = kSandboxRegistry[sandboxType]!; + if (sandboxRegistry.containsKey(sandboxType)) { + final def = sandboxRegistry[sandboxType]!; var sandboxPath = def['path']!; if (!p.isAbsolute(sandboxPath)) { sandboxPath = p.normalize(p.join(datasetRoot, sandboxPath)); diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 1415e2d..6c9116c 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -29,15 +29,16 @@ "openai/gpt-5-pro", ] -# 
Available sandbox configurations. -SANDBOX_REGISTRY: dict[str, dict[str, str]] = { +# Default sandbox configurations for Flutter evaluations. +# Consumers can pass these to resolve() or provide their own. +DEFAULT_SANDBOX_REGISTRY: dict[str, dict[str, str]] = { "podman": {"name": "podman", "path": "./sandboxes/podman/compose.yaml"}, "podman-beta": {"name": "podman", "path": "./sandboxes/podman/compose-beta.yaml"}, "podman-main": {"name": "podman", "path": "./sandboxes/podman/compose-main.yaml"}, } -# Maps Flutter SDK channel names to sandbox registry keys. -SDK_CHANNELS: dict[str, str] = { +# Default Flutter SDK channel → sandbox registry key mapping. +DEFAULT_SDK_CHANNELS: dict[str, str] = { "stable": "podman", "beta": "podman-beta", "main": "podman-main", @@ -51,6 +52,9 @@ def _is_glob(pattern: str) -> bool: def resolve( dataset_path: str, job_names: list[str], + *, + sandbox_registry: dict[str, dict[str, str]] | None = None, + sdk_channels: dict[str, str] | None = None, ) -> list[EvalSet]: """Resolve dataset + job(s) into EvalSet objects. @@ -59,17 +63,21 @@ def resolve( Args: dataset_path: Root directory containing ``tasks/`` and ``jobs/``. job_names: Job names (looked up in ``jobs/``) or paths. + sandbox_registry: Named sandbox configurations. Defaults to empty. + sdk_channels: SDK channel → sandbox registry key mapping. Defaults to empty. Returns: A list of EvalSet objects ready for JSON serialization. 
""" + registry = sandbox_registry or {} + channels = sdk_channels or {} task_configs = parse_tasks(dataset_path) results: list[EvalSet] = [] for job_name in job_names: job_path = find_job_file(dataset_path, job_name) job = parse_job(job_path, dataset_path) - results.extend(_resolve_job(task_configs, job, dataset_path)) + results.extend(_resolve_job(task_configs, job, dataset_path, registry, channels)) return results @@ -78,6 +86,8 @@ def _resolve_job( dataset_tasks: list[ParsedTask], job: Any, dataset_root: str, + sandbox_registry: dict[str, dict[str, str]], + sdk_channels: dict[str, str], ) -> list[EvalSet]: """Resolve task configs and job into EvalSet objects.""" models = job.models if job.models else list(DEFAULT_MODELS) @@ -96,7 +106,7 @@ def _resolve_job( task_configs=group, log_dir=job.log_dir, models=models, - sandbox=_resolve_sandbox(dataset_root, job, flutter_channel=channel), + sandbox=_resolve_sandbox(dataset_root, job, sandbox_registry, sdk_channels, flutter_channel=channel), job=job, ) for channel, group in groups.items() @@ -142,7 +152,6 @@ def _build_eval_set( if workspace is not None and is_container: files = {**(files or {}), "/workspace": workspace} - setup = setup or "cd /workspace && flutter pub get" enriched["workspace"] = "/workspace" if workspace_git is not None: enriched["workspace_git"] = workspace_git @@ -328,6 +337,8 @@ def _resolve_models(job: Any) -> list[str]: def _resolve_sandbox( dataset_root: str, job: Any, + sandbox_registry: dict[str, dict[str, str]], + sdk_channels: dict[str, str], *, flutter_channel: str | None = None, ) -> Any: @@ -337,18 +348,18 @@ def _resolve_sandbox( return "local" # Channel override - if flutter_channel and flutter_channel in SDK_CHANNELS: - registry_key = SDK_CHANNELS[flutter_channel] - if registry_key in SANDBOX_REGISTRY: - defn = SANDBOX_REGISTRY[registry_key] + if flutter_channel and flutter_channel in sdk_channels: + registry_key = sdk_channels[flutter_channel] + if registry_key in sandbox_registry: + 
defn = sandbox_registry[registry_key] sandbox_path = defn["path"] if not os.path.isabs(sandbox_path): sandbox_path = os.path.normpath(os.path.join(dataset_root, sandbox_path)) return {"type": defn["name"], "path": sandbox_path} # Named sandbox from registry - if sandbox_type in SANDBOX_REGISTRY: - defn = SANDBOX_REGISTRY[sandbox_type] + if sandbox_type in sandbox_registry: + defn = sandbox_registry[sandbox_type] sandbox_path = defn["path"] if not os.path.isabs(sandbox_path): sandbox_path = os.path.normpath(os.path.join(dataset_root, sandbox_path)) From 2eb11048b9512b82fb14e553e3811ae2fa1b17aa Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 13 Mar 2026 17:05:16 -0700 Subject: [PATCH 06/21] feat: Introduce tag-based filtering, refined task function references, and expanded sandbox configuration options in documentation and API. --- docs/guides/config.md | 48 +++++++++- .../dataset_config_dart.md | 87 ++++++++++++++++--- 2 files changed, 124 insertions(+), 11 deletions(-) diff --git a/docs/guides/config.md b/docs/guides/config.md index aef6aba..f3889b9 100644 --- a/docs/guides/config.md +++ b/docs/guides/config.md @@ -39,5 +39,51 @@ In evals, the definition of dataset is expanded to include all fixtures of runni This means you care about job files and task files. Job files might look like this: - job/main.yaml (runs the whole thing) -- job/ci.yaml (a job that runs as part of ci) +- job/ci.yaml (a job that is run as part of ci) - job/local_dev.yaml (a job that is .gitignored, used for quick iteration) + +## Tag-based filtering + +Jobs can filter which tasks and samples run using tags. 
Tasks and samples define tags in their `metadata`, and jobs reference them via `task_filters` and `sample_filters`: + +```yaml +# job.yaml +task_filters: + include_tags: [code_gen] # only tasks tagged "code_gen" + exclude_tags: [deprecated] # skip deprecated tasks +sample_filters: + include_tags: [flutter] # only samples tagged "flutter" +``` + +- **`include_tags`** — an item must have *all* listed tags to be included +- **`exclude_tags`** — an item is excluded if it has *any* listed tag + +Tag filters work alongside ID-based filtering (`tasks..include-samples` / `exclude-samples`). + +## Task function references + +The `func` field in task YAML identifies the Python `@task` function to run. Three formats are supported: + +| Format | Example | Resolution | +|---|---|---| +| Short name | `question_answer` | Looks up `dash_evals.runner.tasks.question_answer` | +| Colon syntax | `my_package.tasks:my_task` | Imports `my_package.tasks`, gets `my_task` | +| Dotted path | `my_package.tasks.my_task.my_task` | Last segment is the function name | + +## Sandbox configuration + +The sandbox registry is **configurable** — the resolver accepts a registry mapping names to compose files. The default registry is empty; the `devals_cli` passes the Flutter-specific registry: + +```yaml +# job.yaml +sandbox_type: podman # looks up "podman" in the registry +image_prefix: us-central1-docker.pkg.dev/my-project/repo/ +``` + +The `image_prefix` is prepended to image names during sandbox resolution (useful for private registries). + +## Workspace setup + +When `workspace` is specified on a sample and the sandbox is a container (`docker` or `podman`), the resolver maps it to `Sample.files['/workspace']`. The setup command (e.g. `cd /workspace && flutter pub get`) is **not** auto-generated — specify it explicitly in your sample or task YAML via the `setup` field. + +For the full field reference, see {doc}`/reference/yaml_config`. 
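The include/exclude semantics documented above reduce to a small predicate. This is an illustrative sketch of the matching rules (the shipped helper is `matches_tag_filter`, which takes a `TagFilter` model rather than bare lists):

```python
def matches(item_tags, include_tags=None, exclude_tags=None):
    """Return True when item_tags passes the filter.

    - include_tags: the item must carry ALL of these tags.
    - exclude_tags: the item is rejected if it carries ANY of these.
    """
    if include_tags and not all(t in item_tags for t in include_tags):
        return False
    if exclude_tags and any(t in item_tags for t in exclude_tags):
        return False
    return True


# Using the job filter from the YAML example above:
matches(["code_gen", "flutter"], include_tags=["code_gen"], exclude_tags=["deprecated"])   # True
matches(["code_gen", "deprecated"], include_tags=["code_gen"], exclude_tags=["deprecated"])  # False
```

Note that an empty or absent filter matches everything, which is why jobs without `task_filters` or `sample_filters` run the full dataset.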
diff --git a/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md b/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md index 460a6a4..7d3afaf 100644 --- a/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md +++ b/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md @@ -776,9 +776,25 @@ This is the resolution engine. It: #### `EvalSetResolver` ```dart -EvalSetResolver() +EvalSetResolver({Map> sandboxRegistry, Map sdkChannels}) ``` +Creates a resolver with optional sandbox configuration. + +If [sandboxRegistry] or [sdkChannels] are not provided, they default +to empty maps (no sandbox resolution). Pass [kDefaultSandboxRegistry] +and [kDefaultSdkChannels] for the Flutter-specific sandbox setup. + +### Properties + +- **`sandboxRegistry`** → `Map>` *(final)* + + Named sandbox configurations (e.g. `'podman'` → compose file path). + +- **`sdkChannels`** → `Map` *(final)* + + SDK channel → sandbox registry key mapping. + ### Methods #### `resolve` @@ -1000,7 +1016,7 @@ task_defaults: #### `Job` ```dart -Job({required String logDir, String sandboxType, int maxConnections, List? models, Map>? variants, List? taskPaths, Map? tasks, bool saveExamples, int? retryAttempts, int? maxRetries, double? retryWait, double? retryConnections, bool? retryCleanup, double? failOnError, bool? continueOnFail, int? retryOnError, bool? debugErrors, int? maxSamples, int? maxTasks, int? maxSubprocesses, int? maxSandboxes, String? logLevel, String? logLevelTranscript, String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, Object? sampleId, Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, bool? sandboxCleanup, String? modelBaseUrl, Map? modelArgs, Map? modelRoles, Map? taskArgs, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Map? modelCostConfig, bool? logSamples, bool? logRealtime, bool? logImages, int? logBuffer, int? 
logShared, String? bundleDir, bool? bundleOverwrite, bool? logDirAllowDirty, String? evalSetId, Map? evalSetOverrides, Map? taskDefaults}) +Job({String? description, String? imagePrefix, required String logDir, String sandboxType, int maxConnections, List? models, Map>? variants, List? taskPaths, Map? tasks, bool saveExamples, int? retryAttempts, int? maxRetries, double? retryWait, double? retryConnections, bool? retryCleanup, double? failOnError, bool? continueOnFail, int? retryOnError, bool? debugErrors, int? maxSamples, int? maxTasks, int? maxSubprocesses, int? maxSandboxes, String? logLevel, String? logLevelTranscript, String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, Object? sampleId, Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, bool? sandboxCleanup, String? modelBaseUrl, Map? modelArgs, Map? modelRoles, Map? taskArgs, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Map? modelCostConfig, bool? logSamples, bool? logRealtime, bool? logImages, int? logBuffer, int? logShared, String? bundleDir, bool? bundleOverwrite, bool? logDirAllowDirty, String? evalSetId, Map? evalSetOverrides, Map? taskDefaults, TagFilter? taskFilters, TagFilter? sampleFilters}) ``` #### `Job.fromJson` @@ -1025,7 +1041,7 @@ a custom system message. #### `JobTask` ```dart -JobTask({required String id, List? includeSamples, List? excludeSamples, String? systemMessage}) +JobTask({required String id, List? includeSamples, List? excludeSamples, String? systemMessage, Map? args}) ``` #### `JobTask.fromJson` @@ -1200,14 +1216,14 @@ former `TaskConfig` model-package class. #### `ParsedTask` ```dart -ParsedTask({required String id, required String taskFunc, required List samples, required Variant variant, String sandboxType, String? systemMessage, List? allowedVariants, bool saveExamples, String? examplesDir, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? 
approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) +ParsedTask({required String id, required String func, required List samples, required Variant variant, String sandboxType, String? systemMessage, List? allowedVariants, bool saveExamples, String? examplesDir, TagFilter? variantFilters, Map? sandboxParameters, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) ``` ### Properties - **`id`** → `String` *(final)* -- **`taskFunc`** → `String` *(final)* +- **`func`** → `String` *(final)* - **`samples`** → `List` *(final)* @@ -1223,6 +1239,14 @@ ParsedTask({required String id, required String taskFunc, required List - **`examplesDir`** → `String?` *(final)* +- **`variantFilters`** → `TagFilter?` *(final)* + + Tag filter for variant selection. + +- **`sandboxParameters`** → `Map?` *(final)* + + Pass-through dict for sandbox plugin configuration. + - **`model`** → `String?` *(final)* Default model for this task. @@ -1296,7 +1320,7 @@ ParsedTask({required String id, required String taskFunc, required List #### `copyWith` ```dart -ParsedTask copyWith({String? id, String? taskFunc, List? samples, Variant? variant, String? sandboxType, String? systemMessage, List? allowedVariants, bool? saveExamples, String? examplesDir, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? 
metadata}) +ParsedTask copyWith({String? id, String? func, List? samples, Variant? variant, String? sandboxType, String? systemMessage, List? allowedVariants, bool? saveExamples, String? examplesDir, TagFilter? variantFilters, Map? sandboxParameters, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) ``` Create a copy with overrides. @@ -1304,7 +1328,7 @@ Create a copy with overrides. **Parameters:** - `id` (`String?`) -- `taskFunc` (`String?`) +- `func` (`String?`) - `samples` (`List?`) - `variant` (`Variant?`) - `sandboxType` (`String?`) @@ -1312,6 +1336,8 @@ Create a copy with overrides. - `allowedVariants` (`List?`) - `saveExamples` (`bool?`) - `examplesDir` (`String?`) +- `variantFilters` (`TagFilter?`) +- `sandboxParameters` (`Map?`) - `model` (`String?`) - `config` (`Map?`) - `modelRoles` (`Map?`) @@ -1460,6 +1486,28 @@ Score.fromJson(Map json) --- +## abstract class `TagFilter` + +**Mixins:** `_$TagFilter` + +Tag-based filter for including/excluding items by their tags. + +### Constructors + +#### `TagFilter` + +```dart +TagFilter({List<String>? includeTags, List<String>? excludeTags}) +``` + +#### `TagFilter.fromJson` + +```dart +TagFilter.fromJson(Map json) +``` + +--- + ## abstract class `Task` **Mixins:** `_$Task` @@ -1475,7 +1523,7 @@ constructor. #### `Task` ```dart -Task({Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, String? taskFunc, String? name, Object version, Map?
metadata}) +Task({Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, String? func, String? systemMessage, Map? sandboxParameters, String? name, Object version, Map? metadata}) ``` #### `Task.fromJson` @@ -1519,12 +1567,12 @@ TaskInfo.fromJson(Map json) #### `TaskMetadata` ```dart -TaskMetadata(String taskFunc, Map additional) +TaskMetadata(String func, Map additional) ``` ### Properties -- **`taskFunc`** → `String` *(final)* +- **`func`** → `String` *(final)* - **`additional`** → `Map` *(final)* @@ -1721,6 +1769,25 @@ Throws [FileSystemException] if the job file is not found. --- +## `matchesTagFilter` + +```dart +bool matchesTagFilter(List<String> itemTags, TagFilter filter) +``` + +Check whether a set of [itemTags] matches the given [filter]. + +Returns `true` if: +- All include_tags (if any) are present in [itemTags] +- No exclude_tags (if any) are present in [itemTags] + +**Parameters:** + +- `itemTags` (`List<String>`) *(required)* +- `filter` (`TagFilter`) *(required)* + +--- + ## `readYamlFile` ```dart From 757acb7178fffe1c9283d04a559dc2299983b358 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 13 Mar 2026 17:26:36 -0700 Subject: [PATCH 07/21] feat: Add variant filtering and propagate image prefix and job task arguments to resolved task metadata, updating the config parity tool.
--- .../lib/src/parsers/yaml_parser.dart | 7 ++ .../lib/src/resolvers/eval_set_resolver.dart | 17 +++++ .../test/eval_set_resolver_test.dart | 69 +++++++++++++++++++ .../src/dataset_config_python/parser.py | 9 +++ .../src/dataset_config_python/resolver.py | 17 +++++ ...{config_partiy.dart => config_parity.dart} | 0 6 files changed, 119 insertions(+) rename tool/config_parity/bin/{config_partiy.dart => config_parity.dart} (100%) diff --git a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart index e8f0b69..d7c0b36 100644 --- a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart +++ b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart @@ -100,6 +100,12 @@ class YamlParser extends Parser { final taskMetadata = _asMap(data['metadata']); final sandboxParameters = _asMap(data['sandbox_parameters']); + // Parse variant_filters (tag-based variant restriction) + final variantFiltersRaw = _asMap(data['variant_filters']); + final variantFilters = variantFiltersRaw != null + ? 
TagFilter.fromJson(variantFiltersRaw) + : null; + return [ ParsedTask( id: taskId, @@ -127,6 +133,7 @@ class YamlParser extends Parser { version: version, metadata: taskMetadata, sandboxParameters: sandboxParameters, + variantFilters: variantFilters, ), ]; } diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index 871e5e9..928a908 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -211,6 +211,9 @@ class EvalSetResolver { if (tc.systemMessage != null) 'system_message': tc.systemMessage, if (tc.saveExamples) 'save_examples': true, if (tc.examplesDir != null) 'examples_dir': tc.examplesDir, + // Propagate image_prefix from job for container image resolution + if (job.imagePrefix != null && job.imagePrefix!.isNotEmpty) + 'image_prefix': job.imagePrefix, // Merge any task-level metadata from YAML ...?tc.metadata, }; @@ -466,6 +469,13 @@ class EvalSetResolver { } } + // Filter by task-level variant_filters (tag-based) + if (taskConfig.variantFilters != null) { + effectiveVariants.removeWhere((name, _) { + return !matchesTagFilter([name], taskConfig.variantFilters!); + }); + } + // Get job-level task overrides final jobTask = (job.tasks != null && job.tasks!.containsKey(taskId)) ? job.tasks![taskId] @@ -500,6 +510,12 @@ class EvalSetResolver { systemMessage = jobTask!.systemMessage; } + // Merge job-task args into metadata + Map? 
mergedMetadata = taskConfig.metadata; + if (jobTask?.args != null && jobTask!.args!.isNotEmpty) { + mergedMetadata = {...?mergedMetadata, 'args': jobTask.args}; + } + // Create one ParsedTask per effective variant for (final entry in effectiveVariants.entries) { final variant = _resolveVariant(entry.key, entry.value, datasetRoot); @@ -519,6 +535,7 @@ class EvalSetResolver { allowedVariants: null, saveExamples: job.saveExamples, examplesDir: examplesDir, + metadata: mergedMetadata, ), ); } diff --git a/packages/dataset_config_dart/test/eval_set_resolver_test.dart b/packages/dataset_config_dart/test/eval_set_resolver_test.dart index de32e88..48d4c6d 100644 --- a/packages/dataset_config_dart/test/eval_set_resolver_test.dart +++ b/packages/dataset_config_dart/test/eval_set_resolver_test.dart @@ -15,6 +15,8 @@ void main() { String? model, int? timeLimit, int? messageLimit, + TagFilter? variantFilters, + Map? metadata, }) { return ParsedTask( id: id, @@ -35,6 +37,8 @@ void main() { model: model, timeLimit: timeLimit, messageLimit: messageLimit, + variantFilters: variantFilters, + metadata: metadata, ); } @@ -47,6 +51,7 @@ void main() { Map? tasks, bool saveExamples = false, Map? taskDefaults, + String? 
imagePrefix, }) { return Job( logDir: logDir, @@ -56,6 +61,7 @@ void main() { tasks: tasks, saveExamples: saveExamples, taskDefaults: taskDefaults, + imagePrefix: imagePrefix, ); } @@ -366,5 +372,68 @@ void main() { final dataset = results.first.tasks.first.dataset!; expect(dataset.name, 'my_eval:baseline'); }); + + test('variant_filters restricts effective variants', () { + final results = resolver.resolve( + [ + makeTask( + variantFilters: const TagFilter( + includeTags: ['baseline'], + ), + ), + ], + makeJob( + models: ['m'], + variants: {'baseline': {}, 'full': {}, 'mcp_only': {}}, + ), + '/tmp/dataset', + ); + + final taskNames = results + .expand((e) => e.tasks) + .map((t) => t.name) + .toList(); + expect(taskNames, ['test_task:baseline']); + expect(taskNames, isNot(contains('test_task:full'))); + expect(taskNames, isNot(contains('test_task:mcp_only'))); + }); + + test('image_prefix from job appears in task metadata', () { + final results = resolver.resolve( + [makeTask()], + makeJob( + models: ['m'], + imagePrefix: 'us-central1-docker.pkg.dev/my-project/repo/', + ), + '/tmp/dataset', + ); + + final metadata = results.first.tasks.first.metadata!; + expect( + metadata['image_prefix'], + 'us-central1-docker.pkg.dev/my-project/repo/', + ); + }); + + test('JobTask.args appears in task metadata', () { + final results = resolver.resolve( + [makeTask(id: 'my_task')], + makeJob( + models: ['m'], + tasks: { + 'my_task': const JobTask( + id: 'my_task', + args: {'base_url': 'http://localhost', 'timeout': 30}, + ), + }, + ), + '/tmp/dataset', + ); + + final metadata = results.first.tasks.first.metadata!; + expect(metadata['args'], isA()); + expect(metadata['args']['base_url'], 'http://localhost'); + expect(metadata['args']['timeout'], 30); + }); }); } diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py index 56dc89f..4148e11 100644 --- 
a/packages/dataset_config_python/src/dataset_config_python/parser.py +++ b/packages/dataset_config_python/src/dataset_config_python/parser.py @@ -57,6 +57,7 @@ def __init__( version: Any | None = None, metadata: dict[str, Any] | None = None, sandbox_parameters: dict[str, Any] | None = None, + variant_filters: TagFilter | None = None, ): self.id = id self.func = func @@ -85,6 +86,7 @@ def __init__( self.version = version self.metadata = metadata self.sandbox_parameters = sandbox_parameters + self.variant_filters = variant_filters _UNSET: Any = object() @@ -118,6 +120,7 @@ def copy_with( display_name: str | None = _UNSET, version: Any = _UNSET, metadata: dict[str, Any] | None = _UNSET, + variant_filters: TagFilter | None = _UNSET, ) -> ParsedTask: """Create a copy with overrides.""" _U = ParsedTask._UNSET @@ -149,6 +152,7 @@ def copy_with( display_name=self.display_name if display_name is _U else display_name, version=self.version if version is _U else version, metadata=self.metadata if metadata is _U else metadata, + variant_filters=self.variant_filters if variant_filters is _U else variant_filters, ) @@ -262,6 +266,10 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: ) samples = _load_samples_section(samples_raw, dataset_root, task_workspace, task_tests, task_dir) + # Parse variant_filters (tag-based variant restriction) + variant_filters_raw = data.get("variant_filters") + variant_filters = TagFilter(**variant_filters_raw) if isinstance(variant_filters_raw, dict) else None + return [ ParsedTask( id=task_id, @@ -288,6 +296,7 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: version=data.get("version"), metadata=data.get("metadata") if isinstance(data.get("metadata"), dict) else None, sandbox_parameters=data.get("sandbox_parameters") if isinstance(data.get("sandbox_parameters"), dict) else None, + variant_filters=variant_filters, ) ] diff --git 
a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 6c9116c..9ecd4e7 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -202,6 +202,9 @@ def _build_eval_set( task_metadata["save_examples"] = True if tc.examples_dir is not None: task_metadata["examples_dir"] = tc.examples_dir + # Propagate image_prefix from job for container image resolution + if job.image_prefix: + task_metadata["image_prefix"] = job.image_prefix if tc.metadata: task_metadata.update(tc.metadata) @@ -402,6 +405,14 @@ def _expand_task_configs( if tc.allowed_variants is None or vname in tc.allowed_variants: effective_variants[vname] = vdef + # Filter by task-level variant_filters (tag-based) + if tc.variant_filters is not None: + effective_variants = { + vname: vdef + for vname, vdef in effective_variants.items() + if matches_tag_filter([vname], tc.variant_filters) + } + # Get job-level task overrides job_task = job.tasks.get(task_id) if job.tasks else None @@ -425,6 +436,11 @@ def _expand_task_configs( if job_task and job_task.system_message is not None: system_message = job_task.system_message + # Merge job-task args into metadata + merged_metadata = dict(tc.metadata) if tc.metadata else None + if job_task and job_task.args: + merged_metadata = {**(merged_metadata or {}), "args": job_task.args} + # Create one ParsedTask per effective variant for vname, vdef in effective_variants.items(): variant = _resolve_variant(vname, vdef, dataset_root) @@ -442,6 +458,7 @@ def _expand_task_configs( allowed_variants=None, save_examples=job.save_examples, examples_dir=examples_dir, + metadata=merged_metadata, ) ) diff --git a/tool/config_parity/bin/config_partiy.dart b/tool/config_parity/bin/config_parity.dart similarity index 100% rename from tool/config_parity/bin/config_partiy.dart rename to 
tool/config_parity/bin/config_parity.dart From 9c52fd1d26e623592e1621b8b3947bf8b4a72e9d Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 13 Mar 2026 17:55:49 -0700 Subject: [PATCH 08/21] feat: Generalize SDK channel to 'branch', consolidate sandbox configuration, and refine tag filtering logic. --- .../lib/src/models/tag_filter.dart | 2 + .../lib/src/models/variant.dart | 6 +-- .../lib/src/models/variant.freezed.dart | 50 +++++++++---------- .../lib/src/models/variant.g.dart | 4 +- .../lib/src/resolvers/eval_set_resolver.dart | 35 ++++++------- .../src/dataset_config_python/__init__.py | 4 +- .../dataset_config_python/models/variant.py | 4 +- .../src/dataset_config_python/parser.py | 8 +-- .../src/dataset_config_python/resolver.py | 43 ++++++++++------ .../tests/test_config.py | 2 +- .../devals_cli/lib/src/dataset/dry_run.dart | 2 +- 11 files changed, 87 insertions(+), 73 deletions(-) diff --git a/packages/dataset_config_dart/lib/src/models/tag_filter.dart b/packages/dataset_config_dart/lib/src/models/tag_filter.dart index f5a4ec1..3e112f4 100644 --- a/packages/dataset_config_dart/lib/src/models/tag_filter.dart +++ b/packages/dataset_config_dart/lib/src/models/tag_filter.dart @@ -22,10 +22,12 @@ sealed class TagFilter with _$TagFilter { /// - No exclude_tags (if any) are present in [itemTags] bool matchesTagFilter(List itemTags, TagFilter filter) { if (filter.includeTags != null && + filter.includeTags!.isNotEmpty && !filter.includeTags!.every((t) => itemTags.contains(t))) { return false; } if (filter.excludeTags != null && + filter.excludeTags!.isNotEmpty && filter.excludeTags!.any((t) => itemTags.contains(t))) { return false; } diff --git a/packages/dataset_config_dart/lib/src/models/variant.dart b/packages/dataset_config_dart/lib/src/models/variant.dart index 82afa37..bfa1542 100644 --- a/packages/dataset_config_dart/lib/src/models/variant.dart +++ b/packages/dataset_config_dart/lib/src/models/variant.dart @@ -43,9 +43,9 @@ sealed class Variant with 
_$Variant { /// Each directory must contain a `SKILL.md` file. @JsonKey(name: 'skill_paths') @Default([]) List skillPaths, - /// Flutter SDK channel to use (e.g., `'stable'`, `'beta'`, `'main'`). - /// `null` means use the default (stable) image from the job's sandbox. - @JsonKey(name: 'flutter_channel') String? flutterChannel, + /// SDK branch/channel to use (e.g., `'stable'`, `'beta'`, `'main'`). + /// `null` means use the default image from the job's sandbox. + @JsonKey(name: 'branch') String? branch, }) = _Variant; const Variant._(); diff --git a/packages/dataset_config_dart/lib/src/models/variant.freezed.dart b/packages/dataset_config_dart/lib/src/models/variant.freezed.dart index 9fe224c..5389724 100644 --- a/packages/dataset_config_dart/lib/src/models/variant.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/variant.freezed.dart @@ -20,9 +20,9 @@ mixin _$Variant { @JsonKey(name: 'context_files') List get contextFiles;/// MCP server keys to enable (e.g., `['dart']`). @JsonKey(name: 'mcp_servers') List get mcpServers;/// Resolved paths to agent skill directories. /// Each directory must contain a `SKILL.md` file. -@JsonKey(name: 'skill_paths') List get skillPaths;/// Flutter SDK channel to use (e.g., `'stable'`, `'beta'`, `'main'`). -/// `null` means use the default (stable) image from the job's sandbox. -@JsonKey(name: 'flutter_channel') String? get flutterChannel; +@JsonKey(name: 'skill_paths') List get skillPaths;/// SDK branch/channel to use (e.g., `'stable'`, `'beta'`, `'main'`). +/// `null` means use the default image from the job's sandbox. +@JsonKey(name: 'branch') String? get branch; /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. 
@JsonKey(includeFromJson: false, includeToJson: false) @@ -35,16 +35,16 @@ $VariantCopyWith get copyWith => _$VariantCopyWithImpl(this as @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.contextFiles, contextFiles)&&const DeepCollectionEquality().equals(other.mcpServers, mcpServers)&&const DeepCollectionEquality().equals(other.skillPaths, skillPaths)&&(identical(other.flutterChannel, flutterChannel) || other.flutterChannel == flutterChannel)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.contextFiles, contextFiles)&&const DeepCollectionEquality().equals(other.mcpServers, mcpServers)&&const DeepCollectionEquality().equals(other.skillPaths, skillPaths)&&(identical(other.branch, branch) || other.branch == branch)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(contextFiles),const DeepCollectionEquality().hash(mcpServers),const DeepCollectionEquality().hash(skillPaths),flutterChannel); +int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(contextFiles),const DeepCollectionEquality().hash(mcpServers),const DeepCollectionEquality().hash(skillPaths),branch); @override String toString() { - return 'Variant(name: $name, contextFiles: $contextFiles, mcpServers: $mcpServers, skillPaths: $skillPaths, flutterChannel: $flutterChannel)'; + return 'Variant(name: $name, contextFiles: $contextFiles, mcpServers: $mcpServers, skillPaths: $skillPaths, branch: $branch)'; } @@ -55,7 +55,7 @@ abstract mixin class $VariantCopyWith<$Res> { factory $VariantCopyWith(Variant value, $Res Function(Variant) _then) = _$VariantCopyWithImpl; 
@useResult $Res call({ - String name,@JsonKey(name: 'context_files') List contextFiles,@JsonKey(name: 'mcp_servers') List mcpServers,@JsonKey(name: 'skill_paths') List skillPaths,@JsonKey(name: 'flutter_channel') String? flutterChannel + String name,@JsonKey(name: 'context_files') List contextFiles,@JsonKey(name: 'mcp_servers') List mcpServers,@JsonKey(name: 'skill_paths') List skillPaths,@JsonKey(name: 'branch') String? branch }); @@ -72,13 +72,13 @@ class _$VariantCopyWithImpl<$Res> /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? name = null,Object? contextFiles = null,Object? mcpServers = null,Object? skillPaths = null,Object? flutterChannel = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? name = null,Object? contextFiles = null,Object? mcpServers = null,Object? skillPaths = null,Object? branch = freezed,}) { return _then(_self.copyWith( name: null == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String,contextFiles: null == contextFiles ? _self.contextFiles : contextFiles // ignore: cast_nullable_to_non_nullable as List,mcpServers: null == mcpServers ? _self.mcpServers : mcpServers // ignore: cast_nullable_to_non_nullable as List,skillPaths: null == skillPaths ? _self.skillPaths : skillPaths // ignore: cast_nullable_to_non_nullable -as List,flutterChannel: freezed == flutterChannel ? _self.flutterChannel : flutterChannel // ignore: cast_nullable_to_non_nullable +as List,branch: freezed == branch ? _self.branch : branch // ignore: cast_nullable_to_non_nullable as String?, )); } @@ -161,10 +161,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'flutter_channel') String? flutterChannel)? 
$default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'branch') String? branch)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Variant() when $default != null: -return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.flutterChannel);case _: +return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.branch);case _: return orElse(); } @@ -182,10 +182,10 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths, /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'flutter_channel') String? flutterChannel) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'branch') String? branch) $default,) {final _that = this; switch (_that) { case _Variant(): -return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.flutterChannel);} +return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.branch);} } /// A variant of `when` that fallback to returning `null` /// @@ -199,10 +199,10 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths, /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? 
Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'flutter_channel') String? flutterChannel)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'branch') String? branch)? $default,) {final _that = this; switch (_that) { case _Variant() when $default != null: -return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.flutterChannel);case _: +return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.branch);case _: return null; } @@ -214,7 +214,7 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths, @JsonSerializable() class _Variant extends Variant { - const _Variant({this.name = 'baseline', @JsonKey(name: 'context_files') final List contextFiles = const [], @JsonKey(name: 'mcp_servers') final List mcpServers = const [], @JsonKey(name: 'skill_paths') final List skillPaths = const [], @JsonKey(name: 'flutter_channel') this.flutterChannel}): _contextFiles = contextFiles,_mcpServers = mcpServers,_skillPaths = skillPaths,super._(); + const _Variant({this.name = 'baseline', @JsonKey(name: 'context_files') final List contextFiles = const [], @JsonKey(name: 'mcp_servers') final List mcpServers = const [], @JsonKey(name: 'skill_paths') final List skillPaths = const [], @JsonKey(name: 'branch') this.branch}): _contextFiles = contextFiles,_mcpServers = mcpServers,_skillPaths = skillPaths,super._(); factory _Variant.fromJson(Map json) => _$VariantFromJson(json); /// User-defined variant name from the job file. 
@@ -248,9 +248,9 @@ class _Variant extends Variant { return EqualUnmodifiableListView(_skillPaths); } -/// Flutter SDK channel to use (e.g., `'stable'`, `'beta'`, `'main'`). -/// `null` means use the default (stable) image from the job's sandbox. -@override@JsonKey(name: 'flutter_channel') final String? flutterChannel; +/// SDK branch/channel to use (e.g., `'stable'`, `'beta'`, `'main'`). +/// `null` means use the default image from the job's sandbox. +@override@JsonKey(name: 'branch') final String? branch; /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. @@ -265,16 +265,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other._contextFiles, _contextFiles)&&const DeepCollectionEquality().equals(other._mcpServers, _mcpServers)&&const DeepCollectionEquality().equals(other._skillPaths, _skillPaths)&&(identical(other.flutterChannel, flutterChannel) || other.flutterChannel == flutterChannel)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other._contextFiles, _contextFiles)&&const DeepCollectionEquality().equals(other._mcpServers, _mcpServers)&&const DeepCollectionEquality().equals(other._skillPaths, _skillPaths)&&(identical(other.branch, branch) || other.branch == branch)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(_contextFiles),const DeepCollectionEquality().hash(_mcpServers),const DeepCollectionEquality().hash(_skillPaths),flutterChannel); +int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(_contextFiles),const 
DeepCollectionEquality().hash(_mcpServers),const DeepCollectionEquality().hash(_skillPaths),branch); @override String toString() { - return 'Variant(name: $name, contextFiles: $contextFiles, mcpServers: $mcpServers, skillPaths: $skillPaths, flutterChannel: $flutterChannel)'; + return 'Variant(name: $name, contextFiles: $contextFiles, mcpServers: $mcpServers, skillPaths: $skillPaths, branch: $branch)'; } @@ -285,7 +285,7 @@ abstract mixin class _$VariantCopyWith<$Res> implements $VariantCopyWith<$Res> { factory _$VariantCopyWith(_Variant value, $Res Function(_Variant) _then) = __$VariantCopyWithImpl; @override @useResult $Res call({ - String name,@JsonKey(name: 'context_files') List contextFiles,@JsonKey(name: 'mcp_servers') List mcpServers,@JsonKey(name: 'skill_paths') List skillPaths,@JsonKey(name: 'flutter_channel') String? flutterChannel + String name,@JsonKey(name: 'context_files') List contextFiles,@JsonKey(name: 'mcp_servers') List mcpServers,@JsonKey(name: 'skill_paths') List skillPaths,@JsonKey(name: 'branch') String? branch }); @@ -302,13 +302,13 @@ class __$VariantCopyWithImpl<$Res> /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? name = null,Object? contextFiles = null,Object? mcpServers = null,Object? skillPaths = null,Object? flutterChannel = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? name = null,Object? contextFiles = null,Object? mcpServers = null,Object? skillPaths = null,Object? branch = freezed,}) { return _then(_Variant( name: null == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String,contextFiles: null == contextFiles ? _self._contextFiles : contextFiles // ignore: cast_nullable_to_non_nullable as List,mcpServers: null == mcpServers ? _self._mcpServers : mcpServers // ignore: cast_nullable_to_non_nullable as List,skillPaths: null == skillPaths ? 
_self._skillPaths : skillPaths // ignore: cast_nullable_to_non_nullable -as List,flutterChannel: freezed == flutterChannel ? _self.flutterChannel : flutterChannel // ignore: cast_nullable_to_non_nullable +as List,branch: freezed == branch ? _self.branch : branch // ignore: cast_nullable_to_non_nullable as String?, )); } diff --git a/packages/dataset_config_dart/lib/src/models/variant.g.dart b/packages/dataset_config_dart/lib/src/models/variant.g.dart index a9a6d25..09277ff 100644 --- a/packages/dataset_config_dart/lib/src/models/variant.g.dart +++ b/packages/dataset_config_dart/lib/src/models/variant.g.dart @@ -23,7 +23,7 @@ _Variant _$VariantFromJson(Map json) => _Variant( ?.map((e) => e as String) .toList() ?? const [], - flutterChannel: json['flutter_channel'] as String?, + branch: json['branch'] as String?, ); Map _$VariantToJson(_Variant instance) => { @@ -31,5 +31,5 @@ Map _$VariantToJson(_Variant instance) => { 'context_files': instance.contextFiles, 'mcp_servers': instance.mcpServers, 'skill_paths': instance.skillPaths, - 'flutter_channel': instance.flutterChannel, + 'branch': instance.branch, }; diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index 928a908..53d6b47 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -36,10 +36,10 @@ const Map> kDefaultSandboxRegistry = { }, }; -/// Default Flutter SDK channel → sandbox registry key mapping. +/// Default SDK branch → sandbox registry key mapping. /// /// Consumers can pass these to [EvalSetResolver] or provide their own. -const Map kDefaultSdkChannels = { +const Map kDefaultBranchChannels = { 'stable': 'podman', 'beta': 'podman-beta', 'main': 'podman-main', @@ -51,28 +51,28 @@ const Map kDefaultSdkChannels = { /// This is the resolution engine. It: /// 1. 
Resolves models, sandboxes, and variants /// 2. Expands task × variant combinations into [Task] entries -/// 3. Groups by flutter_channel (one [EvalSet] per group) +/// 3. Groups by branch (one [EvalSet] per group) /// 4. Propagates job-level and task-level settings to the output class EvalSetResolver { /// Creates a resolver with optional sandbox configuration. /// - /// If [sandboxRegistry] or [sdkChannels] are not provided, they default + /// If [sandboxRegistry] or [branchChannels] are not provided, they default /// to empty maps (no sandbox resolution). Pass [kDefaultSandboxRegistry] - /// and [kDefaultSdkChannels] for the Flutter-specific sandbox setup. + /// and [kDefaultBranchChannels] for the Flutter-specific sandbox setup. const EvalSetResolver({ this.sandboxRegistry = const {}, - this.sdkChannels = const {}, + this.branchChannels = const {}, }); /// Named sandbox configurations (e.g. `'podman'` → compose file path). final Map<String, Map<String, String>> sandboxRegistry; - /// SDK channel → sandbox registry key mapping. - final Map<String, String> sdkChannels; + /// SDK branch → sandbox registry key mapping. + final Map<String, String> branchChannels; /// Resolve task configs and job into [EvalSet] objects. /// - /// Groups by flutter_channel so each gets its own sandbox. + /// Groups by branch so each gets its own sandbox.
List resolve( List datasetTasks, Job job, @@ -87,10 +87,10 @@ class EvalSetResolver { datasetRoot, ); - // Group by flutter channel + // Group by branch final groups = >{}; for (final tc in expandedTasks) { - final key = tc.variant.flutterChannel; + final key = tc.variant.branch; (groups[key] ??= []).add(tc); } @@ -103,7 +103,7 @@ class EvalSetResolver { sandbox: _resolveSandbox( datasetRoot, job, - flutterChannel: entry.key, + branch: entry.key, ), job: job, ), @@ -156,6 +156,7 @@ class EvalSetResolver { if (workspace != null && isContainer) { files = {...?files, '/workspace': workspace}; + setup ??= 'cd /workspace && flutter pub get'; enriched['workspace'] = '/workspace'; } if (workspaceGit != null) { @@ -403,14 +404,14 @@ class EvalSetResolver { Object _resolveSandbox( String datasetRoot, Job job, { - String? flutterChannel, + String? branch, }) { final sandboxType = job.sandboxType; if (sandboxType.isEmpty || sandboxType == 'local') return 'local'; - // Channel override → look up channel-specific sandbox - if (flutterChannel != null && sdkChannels.containsKey(flutterChannel)) { - final registryKey = sdkChannels[flutterChannel]!; + // Branch override → look up branch-specific sandbox + if (branch != null && branchChannels.containsKey(branch)) { + final registryKey = branchChannels[branch]!; if (sandboxRegistry.containsKey(registryKey)) { final def = sandboxRegistry[registryKey]!; var sandboxPath = def['path']!; @@ -614,7 +615,7 @@ class EvalSetResolver { contextFiles: contextFiles, mcpServers: (vDef['mcp_servers'] as List?)?.cast() ?? 
[], skillPaths: skillPaths, - flutterChannel: vDef['flutter_channel'] as String?, + branch: vDef['branch'] as String?, ); } diff --git a/packages/dataset_config_python/src/dataset_config_python/__init__.py b/packages/dataset_config_python/src/dataset_config_python/__init__.py index 135b4cb..3a47fd0 100644 --- a/packages/dataset_config_python/src/dataset_config_python/__init__.py +++ b/packages/dataset_config_python/src/dataset_config_python/__init__.py @@ -6,7 +6,7 @@ No Dart SDK or Inspect AI dependency required. """ -from dataset_config_python.resolver import resolve +from dataset_config_python.resolver import DEFAULT_BRANCH_CHANNELS, DEFAULT_SANDBOX_REGISTRY, SandboxConfig, resolve from dataset_config_python.writer import write_eval_sets -__all__ = ["resolve", "write_eval_sets"] +__all__ = ["DEFAULT_BRANCH_CHANNELS", "DEFAULT_SANDBOX_REGISTRY", "SandboxConfig", "resolve", "write_eval_sets"] diff --git a/packages/dataset_config_python/src/dataset_config_python/models/variant.py b/packages/dataset_config_python/src/dataset_config_python/models/variant.py index 690e675..4fa39d6 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/variant.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/variant.py @@ -32,5 +32,5 @@ class Variant(BaseModel): skill_paths: list[str] = Field(default_factory=list) """Resolved paths to agent skill directories.""" - flutter_channel: str | None = None - """Flutter SDK channel to use (e.g. 'stable', 'beta', 'main').""" + branch: str | None = None + """SDK branch/channel to use (e.g. 
'stable', 'beta', 'main').""" diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py index 4148e11..dd8ed1d 100644 --- a/packages/dataset_config_python/src/dataset_config_python/parser.py +++ b/packages/dataset_config_python/src/dataset_config_python/parser.py @@ -246,7 +246,7 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: task_dir = os.path.dirname(task_path) task_id = data.get("id") or os.path.basename(task_dir) - task_func = data.get("func") or task_id + func_name = data.get("func") or task_id task_workspace_raw = data.get("workspace") task_tests_raw = data.get("tests") @@ -273,7 +273,7 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: return [ ParsedTask( id=task_id, - func=task_func, + func=func_name, variant=Variant(), samples=samples, system_message=system_message, @@ -559,8 +559,8 @@ def parse_job(job_path: str, dataset_root: str) -> Job: ), description=data.get("description"), image_prefix=data.get("image_prefix"), - task_filters=TagFilter(**data["task_filters"]) if isinstance(data.get("task_filters"), dict) else None, - sample_filters=TagFilter(**data["sample_filters"]) if isinstance(data.get("sample_filters"), dict) else None, + task_filters=data.get("task_filters"), + sample_filters=data.get("sample_filters"), ) diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 9ecd4e7..dffd95e 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -4,6 +4,7 @@ import glob as globmod import os +from dataclasses import dataclass, field from typing import Any from dataset_config_python.models.context_file import ContextFile @@ -37,14 +38,22 @@ "podman-main": {"name": "podman", "path": 
"./sandboxes/podman/compose-main.yaml"}, } -# Default Flutter SDK channel → sandbox registry key mapping. -DEFAULT_SDK_CHANNELS: dict[str, str] = { +# Default SDK branch → sandbox registry key mapping. +DEFAULT_BRANCH_CHANNELS: dict[str, str] = { "stable": "podman", "beta": "podman-beta", "main": "podman-main", } +@dataclass +class SandboxConfig: + """Sandbox registry and branch-channel mapping.""" + + registry: dict[str, dict[str, str]] = field(default_factory=dict) + branch_channels: dict[str, str] = field(default_factory=dict) + + def _is_glob(pattern: str) -> bool: return "*" in pattern or "?" in pattern or "[" in pattern @@ -53,8 +62,7 @@ def resolve( dataset_path: str, job_names: list[str], *, - sandbox_registry: dict[str, dict[str, str]] | None = None, - sdk_channels: dict[str, str] | None = None, + sandbox_config: SandboxConfig | None = None, ) -> list[EvalSet]: """Resolve dataset + job(s) into EvalSet objects. @@ -64,13 +72,14 @@ def resolve( dataset_path: Root directory containing ``tasks/`` and ``jobs/``. job_names: Job names (looked up in ``jobs/``) or paths. sandbox_registry: Named sandbox configurations. Defaults to empty. - sdk_channels: SDK channel → sandbox registry key mapping. Defaults to empty. + branch_channels: SDK branch → sandbox registry key mapping. Defaults to empty. Returns: A list of EvalSet objects ready for JSON serialization. 
""" - registry = sandbox_registry or {} - channels = sdk_channels or {} + sandbox_cfg = sandbox_config or SandboxConfig() + registry = sandbox_cfg.registry + channels = sandbox_cfg.branch_channels task_configs = parse_tasks(dataset_path) results: list[EvalSet] = [] @@ -87,7 +96,7 @@ def _resolve_job( job: Any, dataset_root: str, sandbox_registry: dict[str, dict[str, str]], - sdk_channels: dict[str, str], + branch_channels: dict[str, str], ) -> list[EvalSet]: """Resolve task configs and job into EvalSet objects.""" models = job.models if job.models else list(DEFAULT_MODELS) @@ -95,10 +104,10 @@ def _resolve_job( expanded_tasks = _expand_task_configs(dataset_tasks, job, sandbox_type_str, dataset_root) - # Group by flutter channel + # Group by branch groups: dict[str | None, list[ParsedTask]] = {} for tc in expanded_tasks: - key = tc.variant.flutter_channel + key = tc.variant.branch groups.setdefault(key, []).append(tc) return [ @@ -106,7 +115,7 @@ def _resolve_job( task_configs=group, log_dir=job.log_dir, models=models, - sandbox=_resolve_sandbox(dataset_root, job, sandbox_registry, sdk_channels, flutter_channel=channel), + sandbox=_resolve_sandbox(dataset_root, job, sandbox_registry, branch_channels, branch=channel), job=job, ) for channel, group in groups.items() @@ -152,6 +161,7 @@ def _build_eval_set( if workspace is not None and is_container: files = {**(files or {}), "/workspace": workspace} + setup = setup or "cd /workspace && flutter pub get" enriched["workspace"] = "/workspace" if workspace_git is not None: enriched["workspace_git"] = workspace_git @@ -341,9 +351,9 @@ def _resolve_sandbox( dataset_root: str, job: Any, sandbox_registry: dict[str, dict[str, str]], - sdk_channels: dict[str, str], + branch_channels: dict[str, str], *, - flutter_channel: str | None = None, + branch: str | None = None, ) -> Any: """Resolve sandbox spec for a given config.""" sandbox_type = job.sandbox_type @@ -351,8 +361,9 @@ def _resolve_sandbox( return "local" # Channel override 
- if flutter_channel and flutter_channel in sdk_channels: - registry_key = sdk_channels[flutter_channel] + # Branch override → look up branch-specific sandbox + if branch and branch in branch_channels: + registry_key = branch_channels[branch] if registry_key in sandbox_registry: defn = sandbox_registry[registry_key] sandbox_path = defn["path"] @@ -529,7 +540,7 @@ def _resolve_variant( context_files=context_files, mcp_servers=vdef.get("mcp_servers") or [], skill_paths=skill_paths, - flutter_channel=vdef.get("flutter_channel"), + branch=vdef.get("branch"), ) diff --git a/packages/dataset_config_python/tests/test_config.py b/packages/dataset_config_python/tests/test_config.py index 20890c3..89ea89c 100644 --- a/packages/dataset_config_python/tests/test_config.py +++ b/packages/dataset_config_python/tests/test_config.py @@ -165,7 +165,7 @@ def test_variant_defaults(self): assert v.context_files == [] assert v.mcp_servers == [] assert v.skill_paths == [] - assert v.flutter_channel is None + assert v.branch is None def test_job_task_from_yaml_none(self): jt = JobTask.from_yaml("my_task", None) diff --git a/packages/devals_cli/lib/src/dataset/dry_run.dart b/packages/devals_cli/lib/src/dataset/dry_run.dart index 856e172..1a61dcc 100644 --- a/packages/devals_cli/lib/src/dataset/dry_run.dart +++ b/packages/devals_cli/lib/src/dataset/dry_run.dart @@ -36,7 +36,7 @@ bool _validateConfig(EvalSet config) { if (task.func == null) { warnings.add( - 'Task "$name" has no task_func — Mode 2 hydration required', + 'Task "$name" has no func — Mode 2 hydration required', ); } From 254f6a1493836d23d33caa90a7c7e636cecec4d3 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Tue, 17 Mar 2026 11:45:42 -0700 Subject: [PATCH 09/21] feat: Introduce `resolve_from_parsed` for explicit configuration resolution and expose new parsing utilities. 
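
The new entry point enables a parse → inspect → resolve workflow. A sketch of
the intended usage, based on the docstring added below (the `eval/` dataset
path, `my_job` job name, and `run_001` suffix are illustrative, not part of
this change):

```python
from dataset_config_python import (
    find_job_file,
    parse_job,
    parse_tasks,
    resolve_from_parsed,
)

# Parse the dataset and the job file up front.
tasks = parse_tasks("eval/")
job = parse_job(find_job_file("eval/", "my_job"), "eval/")

# Patch values before resolution, e.g. give this run its own log dir.
job.log_dir = f"{job.log_dir}/run_001"

eval_sets = resolve_from_parsed(tasks, job, "eval/")
```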
--- .../src/dash_evals/runner/json_runner.py | 4 +- .../src/dataset_config_python/__init__.py | 22 +++++++- .../src/dataset_config_python/resolver.py | 55 ++++++++++++++++--- 3 files changed, 69 insertions(+), 12 deletions(-) diff --git a/packages/dash_evals/src/dash_evals/runner/json_runner.py b/packages/dash_evals/src/dash_evals/runner/json_runner.py index 89e4eee..2f395b7 100644 --- a/packages/dash_evals/src/dash_evals/runner/json_runner.py +++ b/packages/dash_evals/src/dash_evals/runner/json_runner.py @@ -167,9 +167,7 @@ def _run_single_manifest(manifest: dict) -> bool: if not task_func_name: # Mode 2: hydrate directly from JSON (future) - job_logger.warning( - f" ⚠ {task_name}: no func — Mode 2 hydration not yet supported" - ) + job_logger.warning(f" ⚠ {task_name}: no func — Mode 2 hydration not yet supported") continue try: diff --git a/packages/dataset_config_python/src/dataset_config_python/__init__.py b/packages/dataset_config_python/src/dataset_config_python/__init__.py index 3a47fd0..fc0cf83 100644 --- a/packages/dataset_config_python/src/dataset_config_python/__init__.py +++ b/packages/dataset_config_python/src/dataset_config_python/__init__.py @@ -6,7 +6,25 @@ No Dart SDK or Inspect AI dependency required. 
""" -from dataset_config_python.resolver import DEFAULT_BRANCH_CHANNELS, DEFAULT_SANDBOX_REGISTRY, SandboxConfig, resolve +from dataset_config_python.parser import ParsedTask, find_job_file, parse_job, parse_tasks +from dataset_config_python.resolver import ( + DEFAULT_BRANCH_CHANNELS, + DEFAULT_SANDBOX_REGISTRY, + SandboxConfig, + resolve, + resolve_from_parsed, +) from dataset_config_python.writer import write_eval_sets -__all__ = ["DEFAULT_BRANCH_CHANNELS", "DEFAULT_SANDBOX_REGISTRY", "SandboxConfig", "resolve", "write_eval_sets"] +__all__ = [ + "DEFAULT_BRANCH_CHANNELS", + "DEFAULT_SANDBOX_REGISTRY", + "ParsedTask", + "SandboxConfig", + "find_job_file", + "parse_job", + "parse_tasks", + "resolve", + "resolve_from_parsed", + "write_eval_sets", +] diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index dffd95e..1c67c4a 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -10,6 +10,7 @@ from dataset_config_python.models.context_file import ContextFile from dataset_config_python.models.dataset import Dataset from dataset_config_python.models.eval_set import EvalSet +from dataset_config_python.models.job import Job from dataset_config_python.models.sample import Sample from dataset_config_python.models.tag_filter import matches_tag_filter from dataset_config_python.models.task import Task @@ -66,31 +67,71 @@ def resolve( ) -> list[EvalSet]: """Resolve dataset + job(s) into EvalSet objects. - This is the main public API of the package. + This is a convenience wrapper around :func:`resolve_from_parsed` that + handles parsing automatically. Use ``resolve_from_parsed`` when you + need to inspect or mutate the parsed config before resolution. Args: dataset_path: Root directory containing ``tasks/`` and ``jobs/``. 
job_names: Job names (looked up in ``jobs/``) or paths. - sandbox_registry: Named sandbox configurations. Defaults to empty. - branch_channels: SDK branch → sandbox registry key mapping. Defaults to empty. + sandbox_config: Sandbox registry and branch-channel mapping. Returns: A list of EvalSet objects ready for JSON serialization. """ - sandbox_cfg = sandbox_config or SandboxConfig() - registry = sandbox_cfg.registry - channels = sandbox_cfg.branch_channels task_configs = parse_tasks(dataset_path) results: list[EvalSet] = [] for job_name in job_names: job_path = find_job_file(dataset_path, job_name) job = parse_job(job_path, dataset_path) - results.extend(_resolve_job(task_configs, job, dataset_path, registry, channels)) + results.extend( + resolve_from_parsed( + task_configs=task_configs, + job=job, + dataset_path=dataset_path, + sandbox_config=sandbox_config, + ) + ) return results +def resolve_from_parsed( + task_configs: list[ParsedTask], + job: Job, + dataset_path: str, + *, + sandbox_config: SandboxConfig | None = None, +) -> list[EvalSet]: + """Resolve pre-parsed task configs and a job into EvalSet objects. + + Use this instead of :func:`resolve` when you need to inspect or modify + the parsed configuration before resolution. A typical workflow:: + + tasks = parse_tasks(dataset_path) + job = parse_job(find_job_file(dataset_path, "my_job"), dataset_path) + + # Patch values before resolution + job.log_dir = f"{job.log_dir}/{execution_id}" + + eval_sets = resolve_from_parsed(tasks, job, dataset_path) + + Args: + task_configs: Parsed task configs (from :func:`parse_tasks`). + job: A parsed Job object (from :func:`parse_job`). + dataset_path: Root directory of the dataset (used for path resolution). + sandbox_config: Sandbox registry and branch-channel mapping. + + Returns: + A list of EvalSet objects ready for JSON serialization. 
+ """ + sandbox_cfg = sandbox_config or SandboxConfig() + registry = sandbox_cfg.registry + channels = sandbox_cfg.branch_channels + return _resolve_job(task_configs, job, dataset_path, registry, channels) + + def _resolve_job( dataset_tasks: list[ParsedTask], job: Any, From 4ce2824abae5e38004ae5b0b640855253df6c11e Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Tue, 17 Mar 2026 12:32:06 -0700 Subject: [PATCH 10/21] refactor: Consolidate sandbox configuration and Inspect AI eval_set arguments into single map fields, removing individual top-level parameters and `JobTask.systemMessage`. --- docs/guides/config.md | 7 +- docs/reference/configuration_reference.md | 73 +-- docs/reference/yaml_config.md | 511 ++++++++++-------- .../lib/src/models/job.dart | 194 +------ .../lib/src/models/job.freezed.dart | 448 +++------------ .../lib/src/models/job.g.dart | 110 +--- .../lib/src/parsers/json_parser.dart | 131 ++--- .../lib/src/parsers/yaml_parser.dart | 123 ++--- .../lib/src/resolvers/eval_set_resolver.dart | 161 +++--- .../test/eval_set_resolver_test.dart | 31 +- .../test/json_parser_test.dart | 103 ++-- .../src/dataset_config_python/models/job.py | 65 +-- .../src/dataset_config_python/parser.py | 140 ++--- .../src/dataset_config_python/resolver.py | 133 ++--- 14 files changed, 796 insertions(+), 1434 deletions(-) diff --git a/docs/guides/config.md b/docs/guides/config.md index f3889b9..7250b57 100644 --- a/docs/guides/config.md +++ b/docs/guides/config.md @@ -76,10 +76,13 @@ The sandbox registry is **configurable** — the resolver accepts a registry map ```yaml # job.yaml -sandbox_type: podman # looks up "podman" in the registry -image_prefix: us-central1-docker.pkg.dev/my-project/repo/ +sandbox: + environment: podman # looks up "podman" in the registry + image_prefix: us-central1-docker.pkg.dev/my-project/repo/ ``` +A string shorthand is also supported — `sandbox: podman` is equivalent to `sandbox: {environment: podman}`. 
+ The `image_prefix` is prepended to image names during sandbox resolution (useful for private registries). ## Workspace setup diff --git a/docs/reference/configuration_reference.md b/docs/reference/configuration_reference.md index a1f4d68..b0e457f 100644 --- a/docs/reference/configuration_reference.md +++ b/docs/reference/configuration_reference.md @@ -68,22 +68,24 @@ allowed_variants: [baseline, mcp_only] samples: inline: - id: flutter_bloc_cart_mutation_001 - difficulty: medium - tags: [bloc, state] input: | Fix the bug where adding items to cart doesn't update the total. target: | The fix should modify the BLoC to emit a new state instead of mutating. + metadata: + difficulty: medium + tags: [bloc, state] - id: navigation_crash - difficulty: hard - tags: [navigation] workspace: path: ./nav_project # Override task-level workspace input: | Fix the crash when navigating back from the detail screen. target: | The fix should handle the disposed controller properly. + metadata: + difficulty: hard + tags: [navigation] ``` For the complete list of task fields (including Inspect AI `Task` parameters), see the [Task fields table](yaml_config.md#task). @@ -121,14 +123,14 @@ A sample is a single test case containing an input prompt, expected output (grad samples: inline: - id: dart_async_await_001 - difficulty: medium - tags: [async, dart] input: | Explain the difference between Future.then() and async/await in Dart. target: | The answer should cover both approaches, explain that they are functionally equivalent, and note when each is preferred. metadata: + difficulty: medium + tags: [async, dart] added: 2025-02-04 category: language_fundamentals ``` @@ -171,10 +173,12 @@ Job files define **what to run** and can **override built-in runtime defaults**. 
# jobs/local_dev.yaml name: local_dev +# Sandbox configuration (string shorthand or object) +sandbox: + environment: podman + # Override runtime defaults -sandbox_type: podman max_connections: 15 -max_retries: 10 # Save the agent's final workspace output to logs//examples/ # save_examples: true @@ -191,21 +195,22 @@ variants: mcp_only: { mcp_servers: [dart] } full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] } -# Inspect AI eval_set() parameters (all optional) -retry_attempts: 20 -fail_on_error: 0.05 -log_level: info -tags: [nightly] - -# Default Task-level overrides applied to every task -task_defaults: - time_limit: 600 - message_limit: 50 - -# Additional eval_set() parameters not covered above -# eval_set_overrides: -# bundle_dir: ./bundle -# log_images: true +# Inspect AI eval_set() parameters (all optional, nested under inspect_eval_arguments) +inspect_eval_arguments: + retry_attempts: 20 + fail_on_error: 0.05 + log_level: info + tags: [nightly] + + # Default Task-level overrides applied to every task + task_defaults: + time_limit: 600 + message_limit: 50 + + # Additional eval_set() parameters not covered above + # eval_set_overrides: + # bundle_dir: ./bundle + # log_images: true ``` For the complete list of job fields (including all Inspect AI `eval_set()` parameters), see the [Job fields table](yaml_config.md#job). @@ -214,24 +219,26 @@ For the complete list of job fields (including all Inspect AI `eval_set()` param #### `task_defaults` -Default [Task parameters](yaml_config.md#task) applied to **every task** in this job. Per-task overrides from `task.yaml` take precedence. +Default [Task parameters](yaml_config.md#task) applied to **every task** in this job. Per-task overrides from `task.yaml` take precedence. 
Nested under `inspect_eval_arguments`: ```yaml -task_defaults: - time_limit: 600 - message_limit: 50 - cost_limit: 2.0 - epochs: 3 +inspect_eval_arguments: + task_defaults: + time_limit: 600 + message_limit: 50 + cost_limit: 2.0 + epochs: 3 ``` #### `eval_set_overrides` -Arbitrary `eval_set()` kwargs for parameters not covered by the named fields above. Top-level fields take precedence over overrides. +Arbitrary `eval_set()` kwargs for parameters not covered by the named fields above. Top-level `inspect_eval_arguments` fields take precedence over overrides. Nested under `inspect_eval_arguments`: ```yaml -eval_set_overrides: - bundle_dir: ./bundle - log_images: true +inspect_eval_arguments: + eval_set_overrides: + bundle_dir: ./bundle + log_images: true ``` ### Tasks Object diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md index 05d63b8..9004539 100644 --- a/docs/reference/yaml_config.md +++ b/docs/reference/yaml_config.md @@ -28,17 +28,32 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `logDir` - `log_dir` - Directory to write evaluation logs to -* - `sandbox_type` +* - `sandbox` + - string/object + - Y + - `sandbox` + - `sandbox` + - Sandbox configuration. String shorthand (e.g. `podman`) is equivalent to `{environment: podman}` +* - `sandbox`\ +   `.environment` - string - Y - - `sandboxType` - - `sandbox_type` + - + - - Sandbox type: `local`, `docker`, or `podman` (default: `local`) -* - `image_prefix` +* - `sandbox`\ +   `.parameters` + - object + - Y + - + - + - Pass-through parameters for sandbox plugin configuration +* - `sandbox`\ +   `.image_prefix` - string - Y - - `imagePrefix` - - `image_prefix` + - + - - Registry prefix prepended to image names during sandbox resolution (e.g.
`us-central1-docker.pkg.dev/project/repo/`) * - `max_connections` - int @@ -144,14 +159,6 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `JobTask.excludeSamples` - `JobTask.exclude_samples` - Exclude these sample IDs -* - `tasks`\ -   `.`\ -   `.system_message` - - string - - Y - - `JobTask.systemMessage` - - `JobTask.system_message` - - Override system message for this task * - `tasks`\   `.`\   `.args` @@ -166,299 +173,354 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `saveExamples` - `save_examples` - Copy final workspace to `/examples/` after each sample (default: `false`) -* - `retry_attempts` +* - `inspect_eval_arguments` + - object + - Y + - `inspectEvalArguments` + - `inspect_eval_arguments` + - All Inspect AI `eval_set()` parameters. See sub-fields below. +* - `inspect_eval_arguments`\ +   `.retry_attempts` - int - Y - - `retryAttempts` - - `retry_attempts` + - + - - Max retry attempts before giving up -* - `max_retries` +* - `inspect_eval_arguments`\ +   `.max_retries` - int - Y - - `maxRetries` - - `max_retries` + - + - - Max retry attempts for failed samples -* - `retry_wait` +* - `inspect_eval_arguments`\ +   `.retry_wait` - float - Y - - `retryWait` - - `retry_wait` + - + - - Seconds between retries (exponential backoff) -* - `retry_connections` +* - `inspect_eval_arguments`\ +   `.retry_connections` - float - Y - - `retryConnections` - - `retry_connections` + - + - - Reduce `max_connections` at this rate per retry -* - `retry_cleanup` +* - `inspect_eval_arguments`\ +   `.retry_cleanup` - bool - Y - - `retryCleanup` - - `retry_cleanup` + - + - - Cleanup failed log files after retries -* - `fail_on_error` +* - `inspect_eval_arguments`\ +   `.fail_on_error` - float - Y - - `failOnError` - - `fail_on_error` + - + - - Fail if error proportion exceeds threshold (`0.0–1.0`) -* - `continue_on_fail` +* - `inspect_eval_arguments`\ +   `.continue_on_fail` - bool - Y - - 
`continueOnFail` - - `continue_on_fail` + - + - - Continue running even if `fail_on_error` condition is met -* - `retry_on_error` +* - `inspect_eval_arguments`\ +   `.retry_on_error` - int - Y - - `retryOnError` - - `retry_on_error` + - + - - Retry samples on error (per-sample) -* - `debug_errors` +* - `inspect_eval_arguments`\ +   `.debug_errors` - bool - Y - - `debugErrors` - - `debug_errors` + - + - - Raise task errors for debugging -* - `max_samples` +* - `inspect_eval_arguments`\ +   `.max_samples` - int - Y - - `maxSamples` - - `max_samples` + - + - - Max concurrent samples per task -* - `max_tasks` +* - `inspect_eval_arguments`\ +   `.max_tasks` - int - Y - - `maxTasks` - - `max_tasks` + - + - - Max tasks to run in parallel -* - `max_subprocesses` +* - `inspect_eval_arguments`\ +   `.max_subprocesses` - int - Y - - `maxSubprocesses` - - `max_subprocesses` + - + - - Max subprocesses in parallel -* - `max_sandboxes` +* - `inspect_eval_arguments`\ +   `.max_sandboxes` - int - Y - - `maxSandboxes` - - `max_sandboxes` + - + - - Max sandboxes (per-provider) in parallel -* - `log_level` +* - `inspect_eval_arguments`\ +   `.log_level` - string - Y - - `logLevel` - - `log_level` + - + - - Console log level (`debug`, `info`, `warning`, `error`) -* - `log_level_transcript` +* - `inspect_eval_arguments`\ +   `.log_level_transcript` - string - Y - - `logLevelTranscript` - - `log_level_transcript` + - + - - Log file level -* - `log_format` +* - `inspect_eval_arguments`\ +   `.log_format` - string - Y - - `logFormat` - - `log_format` + - + - - Log format (`eval` or `json`) -* - `log_samples` +* - `inspect_eval_arguments`\ +   `.log_samples` - bool - Y - - `logSamples` - - `log_samples` + - + - - Log detailed samples and scores -* - `log_realtime` +* - `inspect_eval_arguments`\ +   `.log_realtime` - bool - Y - - `logRealtime` - - `log_realtime` + - + - - Log events in realtime -* - `log_images` +* - `inspect_eval_arguments`\ +   `.log_images` - bool - Y - - `logImages` - - 
`log_images` + - + - - Log base64-encoded images -* - `log_buffer` +* - `inspect_eval_arguments`\ +   `.log_buffer` - int - Y - - `logBuffer` - - `log_buffer` + - + - - Samples to buffer before log write -* - `log_shared` +* - `inspect_eval_arguments`\ +   `.log_shared` - int - Y - - `logShared` - - `log_shared` + - + - - Sync sample events for realtime viewing -* - `log_dir_allow_dirty` +* - `inspect_eval_arguments`\ +   `.log_dir_allow_dirty` - bool - Y - - `logDirAllowDirty` - - `log_dir_allow_dirty` + - + - - Allow log dir with unrelated logs -* - `model_base_url` +* - `inspect_eval_arguments`\ +   `.model_base_url` - string - Y - - `modelBaseUrl` - - `model_base_url` + - + - - Base URL for the model API -* - `model_args` +* - `inspect_eval_arguments`\ +   `.model_args` - object - Y - - `modelArgs` - - `model_args` + - + - - Model creation arguments -* - `model_roles` +* - `inspect_eval_arguments`\ +   `.model_roles` - object - Y - - `modelRoles` - - `model_roles` + - + - - Named roles for `get_model()` -* - `task_args` +* - `inspect_eval_arguments`\ +   `.task_args` - object - Y - - `taskArgs` - - `task_args` + - + - - Task creation arguments -* - `model_cost_config` +* - `inspect_eval_arguments`\ +   `.model_cost_config` - object - Y - - `modelCostConfig` - - `model_cost_config` + - + - - Model prices for cost tracking -* - `limit` +* - `inspect_eval_arguments`\ +   `.limit` - int/list - Y - - `limit` - - `limit` + - + - - Limit samples (count or `[start, end]` range) -* - `sample_id` +* - `inspect_eval_arguments`\ +   `.sample_id` - string/list - Y - - `sampleId` - - `sample_id` + - + - - Evaluate specific sample(s) -* - `sample_shuffle` +* - `inspect_eval_arguments`\ +   `.sample_shuffle` - bool/int - Y - - `sampleShuffle` - - `sample_shuffle` + - + - - Shuffle samples (pass seed for deterministic order) -* - `epochs` +* - `inspect_eval_arguments`\ +   `.epochs` - int/object - Y - - `epochs` - - `epochs` + - + - - Repeat samples and optional score reducer 
-* - `message_limit` +* - `inspect_eval_arguments`\ +   `.message_limit` - int - Y - - `messageLimit` - - `message_limit` + - + - - Max messages per sample -* - `token_limit` +* - `inspect_eval_arguments`\ +   `.token_limit` - int - Y - - `tokenLimit` - - `token_limit` + - + - - Max tokens per sample -* - `time_limit` +* - `inspect_eval_arguments`\ +   `.time_limit` - int - Y - - `timeLimit` - - `time_limit` + - + - - Max clock time (seconds) per sample -* - `working_limit` +* - `inspect_eval_arguments`\ +   `.working_limit` - int - Y - - `workingLimit` - - `working_limit` + - + - - Max working time (seconds) per sample -* - `cost_limit` +* - `inspect_eval_arguments`\ +   `.cost_limit` - float - Y - - `costLimit` - - `cost_limit` + - + - - Max cost (dollars) per sample -* - `tags` +* - `inspect_eval_arguments`\ +   `.tags` - list - Y - - `tags` - - `tags` + - + - - Tags for this evaluation run -* - `metadata` +* - `inspect_eval_arguments`\ +   `.metadata` - object - Y - - `metadata` - - `metadata` + - + - - Metadata for this evaluation run -* - `trace` +* - `inspect_eval_arguments`\ +   `.trace` - bool - Y - - `trace` - - `trace` + - + - - Trace model interactions to terminal -* - `display` +* - `inspect_eval_arguments`\ +   `.display` - string - Y - - `display` - - `display` + - + - - Task display type (default: `full`) -* - `score` +* - `inspect_eval_arguments`\ +   `.score` - bool - Y - - `score` - - `score` + - + - - Score output (default: `true`) -* - `approval` +* - `inspect_eval_arguments`\ +   `.approval` - string/object - Y - - `approval` - - `approval` + - + - - Tool use approval policies -* - `solver` +* - `inspect_eval_arguments`\ +   `.solver` - string/object - Y - - `solver` - - `solver` + - + - - Alternative solver(s) -* - `sandbox_cleanup` +* - `inspect_eval_arguments`\ +   `.sandbox_cleanup` - bool - Y - - `sandboxCleanup` - - `sandbox_cleanup` + - + - - Cleanup sandbox after task -* - `bundle_dir` +* - `inspect_eval_arguments`\ +   `.bundle_dir` - 
string - Y - - `bundleDir` - - `bundle_dir` + - + - - Directory for bundled logs + viewer -* - `bundle_overwrite` +* - `inspect_eval_arguments`\ +   `.bundle_overwrite` - bool - Y - - `bundleOverwrite` - - `bundle_overwrite` + - + - - Overwrite files in `bundle_dir` -* - `eval_set_id` +* - `inspect_eval_arguments`\ +   `.eval_set_id` - string - Y - - `evalSetId` - - `eval_set_id` + - + - - Custom ID for the eval set -* - `eval_set_overrides` +* - `inspect_eval_arguments`\ +   `.eval_set_overrides` - object - Y - - `evalSetOverrides` - - `eval_set_overrides` - - Additional `eval_set()` kwargs not covered by top-level fields -* - `task_defaults` + - + - + - Additional `eval_set()` kwargs not covered by named fields above +* - `inspect_eval_arguments`\ +   `.task_defaults` - object - Y - - `taskDefaults` - - `task_defaults` + - + - - Default `Task` kwargs applied to every task in this job ``` @@ -466,6 +528,8 @@ Job files define runtime settings for an evaluation run, including sandbox confi Task files define a single evaluation task with its samples, prompt configuration, and optional Inspect AI `Task` parameter overrides. Located in `eval/tasks//task.yaml`. +Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) are nested under `inspect_task_args`. 
+
 ```{list-table}
 :header-rows: 1
 :widths: 20 8 5 12 12 43
@@ -496,18 +560,18 @@ Task files define a single evaluation task with its samples, prompt configuratio
   - Human-readable description
 * - `samples`
   - object
-  - N
+  - Y
   -
   -
-  - Samples config with `inline` and/or `paths` keys
+  - Samples config with `inline` and/or `paths` keys (optional; a task can have no samples)
 * - `samples`\
   `.inline`
   - list
   - Y
   -
   -
   - Inline sample definitions (list of sample objects)
 * - `samples`\
   `.paths`
   - list
   - Y
@@ -532,12 +596,6 @@ Task files define a single evaluation task with its samples, prompt configuratio
   - `systemMessage`
   - `system_message`
   - Custom system prompt for this task
-* - `sandbox_parameters`
-  - object
-  - Y
-  - `sandboxParameters`
-  - `sandbox_parameters`
-  - Pass-through parameters for sandbox plugin configuration
 * - `workspace`
   - string/object
   - Y
@@ -550,113 +608,141 @@ Task files define a single evaluation task with its samples, prompt configuratio
   -
   -
   - Default test files for all samples
-* - `model`
+* - `display_name`
+  - string
+  - Y
+  - `displayName`
+  - `display_name`
+  - Task display name (e.g. for plotting)
+* - `version`
+  - int
+  - Y
+  - `version`
+  - `version`
+  - Version of task spec
+* - `metadata`
+  - object
+  - Y
+  - `metadata`
+  - `metadata`
+  - Additional metadata to associate with the task
+* - `inspect_task_args`
+  - object
+  - Y
+  -
+  -
+  - Inspect AI `Task` parameters. See sub-fields below.
+* - `inspect_task_args`\
+   `.model`
   - string
   - Y
   - `model`
   - `model`
   - Default model for this task
-* - `config`
+* - `inspect_task_args`\
+   `.config`
   - object
   - Y
   - `config`
   - `config`
   - Model generation config (e.g. `{temperature: 0.2}`)
-* - `model_roles`
+* - `inspect_task_args`\
+   `.model_roles`
   - object
   - Y
   - `modelRoles`
   - `model_roles`
   - Named roles for `get_model()`
-* - `sandbox`
+* - `inspect_task_args`\
+   `.sandbox`
   - string/object
   - Y
   - `sandbox`
   - `sandbox`
   - Sandbox environment type or config
+* - `inspect_task_args`\
+   `.sandbox_parameters`
+  - object
+  - Y
+  - `sandboxParameters`
+  - `sandbox_parameters`
+  - Pass-through parameters for sandbox plugin configuration
-* - `approval`
+* - `inspect_task_args`\
+   `.approval`
   - string/object
   - Y
   - `approval`
   - `approval`
   - Tool use approval policies
-* - `epochs`
+* - `inspect_task_args`\
+   `.epochs`
   - int/object
   - Y
   - `epochs`
   - `epochs`
   - Number of times to repeat each sample
-* - `fail_on_error`
+* - `inspect_task_args`\
+   `.fail_on_error`
   - number/bool
   - Y
   - `failOnError`
   - `fail_on_error`
   - Fail threshold for sample errors
-* - `continue_on_fail`
+* - `inspect_task_args`\
+   `.continue_on_fail`
   - bool
   - Y
   - `continueOnFail`
   - `continue_on_fail`
   - Continue running if `fail_on_error` condition is met
-* - `message_limit`
+* - `inspect_task_args`\
+   `.message_limit`
   - int
   - Y
   - `messageLimit`
   - `message_limit`
   - Max total messages per sample
-* - `token_limit`
+* - `inspect_task_args`\
+   `.token_limit`
   - int
   - Y
   - `tokenLimit`
   - `token_limit`
   - Max total tokens per sample
-* - `time_limit`
+* - `inspect_task_args`\
+   `.time_limit`
   - int
   - Y
   - `timeLimit`
   - `time_limit`
   - Max clock time (seconds) per sample
-* - `working_limit`
+* - `inspect_task_args`\
+   `.working_limit`
   - int
   - Y
   - `workingLimit`
   - `working_limit`
   - Max working time (seconds) per sample
-* - `cost_limit`
+* - `inspect_task_args`\
+   `.cost_limit`
   - float
   - Y
   - `costLimit`
   - `cost_limit`
   - Max cost (dollars) per sample
-* - `early_stopping`
+* - `inspect_task_args`\
+   `.early_stopping`
   - string/object
   - Y
   - `earlyStopping`
   - `early_stopping`
   - Early stopping callbacks
-* - `display_name`
-  - string
-  - Y
-  - `displayName`
-  - `display_name`
-  - Task display name (e.g. for plotting)
-* - `version`
-  - int
-  - Y
-  - `version`
-  - `version`
-  - Version of task spec
-* - `metadata`
-  - object
-  - Y
-  - `metadata`
-  - `metadata`
-  - Additional metadata to associate with the task
 ```
 
 ## Sample
 
-Samples are individual test cases defined either inline in `task.yaml` under `samples.inline`, or in external YAML files referenced via `samples.paths`. Fields like `difficulty`, `tags`, `workspace`, and `tests` are parsed from YAML and stored inside the sample's `metadata` dict.
+Samples are individual test cases defined either inline in `task.yaml` under `samples.inline`, or in external YAML files referenced via `samples.paths`. Fields like `difficulty`, `tags`, and `system_message` should be nested inside the sample's `metadata` dict.
 
 ```{list-table}
 :header-rows: 1
@@ -686,24 +772,27 @@ Samples are individual test cases defined either inline in `task.yaml` under `sa
   - `target`
   - `target`
   - Expected output or grading criteria
-* - `difficulty`
+* - `metadata`\
+   `.difficulty`
   - string
   - Y
   -
   -
-  - `easy`, `medium`, or `hard` (stored in `metadata["difficulty"]`)
+  - `easy`, `medium`, or `hard`
-* - `tags`
+* - `metadata`\
+   `.tags`
   - list
   - Y
   -
   -
-  - Categories for filtering (stored in `metadata["tags"]`)
+  - Categories for filtering
-* - `system_message`
+* - `metadata`\
+   `.system_message`
   - string
   - Y
   -
   -
-  - Override system prompt for this sample (stored in `metadata`)
+  - Override system prompt for this sample
 * - `workspace`
   - string/object
   - Y
diff --git a/packages/dataset_config_dart/lib/src/models/job.dart b/packages/dataset_config_dart/lib/src/models/job.dart
index 0d8f49d..0fa4599 100644
--- a/packages/dataset_config_dart/lib/src/models/job.dart
+++ b/packages/dataset_config_dart/lib/src/models/job.dart
@@ -17,7 +17,8 @@ part 'job.g.dart';
 /// Example YAML:
 /// ```yaml
 /// log_dir: ./logs/my_run
-/// sandbox: podman
+/// sandbox:
+///   environment: podman
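+///   # Illustrative only: the sandbox map also accepts `parameters` and
+///   # `image_prefix` keys; this value is an example, not a default.
+///   image_prefix: us-central1-docker.pkg.dev/project/repo/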
 /// max_connections: 10
 /// models:
 ///   - google/gemini-2.5-flash
@@ -29,15 +30,12 @@ part 'job.g.dart';
 ///   dart_qa:
 ///     include-samples: [sample_1]
 ///
-/// # Pass-through to eval_set()
-/// eval_set_overrides:
+/// # All Inspect AI eval_set() parameters
+/// inspect_eval_arguments:
 ///   retry_attempts: 20
 ///   log_level: debug
-///
-/// # Default Task-level overrides applied to every task
-/// task_defaults:
-///   time_limit: 600
-///   message_limit: 50
+///   task_defaults:
+///     time_limit: 600
 /// ```
 @freezed
 sealed class Job with _$Job {
@@ -49,17 +47,9 @@ sealed class Job with _$Job {
     /// Human-readable description of this job.
     String? description,
 
-    /// Registry URL prefix prepended to image names during sandbox resolution.
-    ///
-    /// Example: `us-central1-docker.pkg.dev/project/repo/`
-    @JsonKey(name: 'image_prefix') String? imagePrefix,
-
     /// Directory to write evaluation logs to.
     @JsonKey(name: 'log_dir') required String logDir,
 
-    /// Sandbox type: `'local'`, `'docker'`, or `'podman'`.
-    @JsonKey(name: 'sandbox_type') @Default('local') String sandboxType,
-
     /// Maximum concurrent API connections.
     @JsonKey(name: 'max_connections') @Default(10) int maxConnections,
@@ -81,167 +71,19 @@ sealed class Job with _$Job {
     @JsonKey(name: 'save_examples') @Default(false) bool saveExamples,
 
     // ------------------------------------------------------------------
-    // Promoted eval_set() parameters (convenience top-level keys)
+    // Sandbox configuration
     // ------------------------------------------------------------------
 
-    /// Maximum retry attempts before giving up (defaults to 10).
-    @JsonKey(name: 'retry_attempts') int? retryAttempts,
-
-    /// Maximum number of retry attempts for failed samples.
-    @JsonKey(name: 'max_retries') int? maxRetries,
-
-    /// Time in seconds to wait between retry attempts (exponential backoff).
-    @JsonKey(name: 'retry_wait') double? retryWait,
-
-    /// Reduce `max_connections` at this rate with each retry (default 1.0).
-    @JsonKey(name: 'retry_connections') double? retryConnections,
-
-    /// Cleanup failed log files after retries (defaults to true).
-    @JsonKey(name: 'retry_cleanup') bool? retryCleanup,
-
-    /// Fail on sample errors.
-    ///
-    /// `0.0–1.0` = fail if proportion exceeds threshold,
-    /// `>1` = fail if count exceeds threshold.
-    @JsonKey(name: 'fail_on_error') double? failOnError,
-
-    /// Continue running even if `fail_on_error` condition is met.
-    @JsonKey(name: 'continue_on_fail') bool? continueOnFail,
-
-    /// Number of times to retry samples on error (default: no retries).
-    @JsonKey(name: 'retry_on_error') int? retryOnError,
-
-    /// Raise task errors for debugging (defaults to false).
-    @JsonKey(name: 'debug_errors') bool? debugErrors,
-
-    /// Maximum samples to run in parallel (default is `max_connections`).
-    @JsonKey(name: 'max_samples') int? maxSamples,
-
-    /// Maximum tasks to run in parallel.
-    @JsonKey(name: 'max_tasks') int? maxTasks,
-
-    /// Maximum subprocesses to run in parallel.
-    @JsonKey(name: 'max_subprocesses') int? maxSubprocesses,
-
-    /// Maximum sandboxes (per-provider) to run in parallel.
-    @JsonKey(name: 'max_sandboxes') int? maxSandboxes,
-
-    /// Level for logging to the console (e.g. `"warning"`, `"info"`, `"debug"`).
-    @JsonKey(name: 'log_level') String? logLevel,
-
-    /// Level for logging to the log file (defaults to `"info"`).
-    @JsonKey(name: 'log_level_transcript') String? logLevelTranscript,
-
-    /// Format for writing log files (`"eval"` or `"json"`).
-    @JsonKey(name: 'log_format') String? logFormat,
-
-    /// Tags to associate with this evaluation run.
-    List? tags,
-
-    /// Metadata to associate with this evaluation run.
-    Map? metadata,
-
-    /// Trace message interactions with evaluated model to terminal.
-    bool? trace,
-
-    /// Task display type (defaults to `"full"`).
-    String? display,
-
-    /// Score output (defaults to true).
-    bool? score,
-
-    /// Limit evaluated samples (int count or `[start, end]` range).
-    Object? limit,
-
-    /// Evaluate specific sample(s) from the dataset.
-    @JsonKey(name: 'sample_id') Object? sampleId,
-
-    /// Shuffle order of samples (pass a seed to make order deterministic).
-    @JsonKey(name: 'sample_shuffle') Object? sampleShuffle,
-
-    /// Epochs to repeat samples for and optional score reducer function(s).
-    Object? epochs,
-
-    /// Tool use approval policies (string or config dict).
-    Object? approval,
-
-    /// Alternative solver(s) for evaluating task(s) (string or config dict).
-    Object? solver,
-
-    /// Sandbox cleanup after task completes (defaults to true).
-    @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,
-
-    /// Base URL for communicating with the model API.
-    @JsonKey(name: 'model_base_url') String? modelBaseUrl,
-
-    /// Model creation arguments.
-    @JsonKey(name: 'model_args') Map? modelArgs,
-
-    /// Named roles for use in `get_model()`.
-    @JsonKey(name: 'model_roles') Map? modelRoles,
-
-    /// Task creation arguments.
-    @JsonKey(name: 'task_args') Map? taskArgs,
-
-    /// Limit on total messages per sample.
-    @JsonKey(name: 'message_limit') int? messageLimit,
-
-    /// Limit on total tokens per sample.
-    @JsonKey(name: 'token_limit') int? tokenLimit,
-
-    /// Limit on clock time (in seconds) per sample.
-    @JsonKey(name: 'time_limit') int? timeLimit,
-
-    /// Limit on working time (in seconds) per sample.
-    @JsonKey(name: 'working_limit') int? workingLimit,
-
-    /// Limit on total cost (in dollars) per sample.
-    @JsonKey(name: 'cost_limit') double? costLimit,
-
-    /// JSON file with model prices for cost tracking.
-    @JsonKey(name: 'model_cost_config') Map? modelCostConfig,
-
-    /// Log detailed samples and scores (defaults to true).
-    @JsonKey(name: 'log_samples') bool? logSamples,
-
-    /// Log events in realtime (defaults to true).
-    @JsonKey(name: 'log_realtime') bool? logRealtime,
-
-    /// Log base64-encoded images (defaults to false).
-    @JsonKey(name: 'log_images') bool? logImages,
-
-    /// Number of samples to buffer before writing log file.
-    @JsonKey(name: 'log_buffer') int? logBuffer,
-
-    /// Sync sample events for realtime viewing.
-    @JsonKey(name: 'log_shared') int? logShared,
-
-    /// Directory to bundle logs and viewer into.
-    @JsonKey(name: 'bundle_dir') String? bundleDir,
-
-    /// Overwrite files in `bundle_dir` (defaults to false).
-    @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,
-
-    /// Allow log directory to contain unrelated logs (defaults to false).
-    @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,
-
-    /// ID for the eval set. Generated if not specified.
-    @JsonKey(name: 'eval_set_id') String? evalSetId,
+    /// Sandbox config with keys: environment, parameters, image_prefix.
+    Map? sandbox,
 
     // ------------------------------------------------------------------
-    // Pass-through overrides
+    // Inspect eval arguments (passed through to eval_set())
     // ------------------------------------------------------------------
 
-    /// Additional `eval_set()` kwargs not covered by top-level fields.
-    ///
-    /// Any valid `eval_set()` parameter can be specified here and will be
-    /// merged into the output JSON. Top-level fields take precedence.
-    @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,
-
-    /// Default `Task` kwargs applied to every task in this job.
-    ///
-    /// Per-task overrides (from `task.yaml`) take precedence.
-    @JsonKey(name: 'task_defaults') Map? taskDefaults,
+    /// All Inspect AI eval_set() parameters, nested under one key.
+    @JsonKey(name: 'inspect_eval_arguments')
+    Map? inspectEvalArguments,
 
     // ------------------------------------------------------------------
     // Tag-based filtering
@@ -259,8 +101,7 @@
 
 /// Per-task configuration within a job.
 ///
-/// Allows overriding which samples run for specific tasks and providing
-/// a custom system message.
+/// Allows overriding which samples run for specific tasks.
@freezed sealed class JobTask with _$JobTask { const factory JobTask({ @@ -273,9 +114,6 @@ sealed class JobTask with _$JobTask { /// Exclude these sample IDs. Mutually exclusive with [includeSamples]. @JsonKey(name: 'exclude_samples') List? excludeSamples, - /// Override system message for this task. - @JsonKey(name: 'system_message') String? systemMessage, - /// Per-task argument overrides passed to the task function. @JsonKey(name: 'args') Map? args, }) = _JobTask; @@ -284,9 +122,6 @@ sealed class JobTask with _$JobTask { _$JobTaskFromJson(json); /// Create a [JobTask] from parsed YAML data. - /// - /// The [taskId] is the map key from the job YAML `tasks:` section. - /// The [data] may be `null` for a simple task reference with no overrides. factory JobTask.fromYaml(String taskId, Map? data) { if (data == null) { return JobTask(id: taskId); @@ -295,7 +130,6 @@ sealed class JobTask with _$JobTask { id: taskId, includeSamples: (data['include-samples'] as List?)?.cast(), excludeSamples: (data['exclude-samples'] as List?)?.cast(), - systemMessage: data['system_message'] as String?, args: (data['args'] as Map?)?.cast(), ); } diff --git a/packages/dataset_config_dart/lib/src/models/job.freezed.dart b/packages/dataset_config_dart/lib/src/models/job.freezed.dart index 4b955bd..ed172a2 100644 --- a/packages/dataset_config_dart/lib/src/models/job.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/job.freezed.dart @@ -19,12 +19,8 @@ mixin _$Job { // Core job settings // ------------------------------------------------------------------ /// Human-readable description of this job. - String? get description;/// Registry URL prefix prepended to image names during sandbox resolution. -/// -/// Example: `us-central1-docker.pkg.dev/project/repo/` -@JsonKey(name: 'image_prefix') String? get imagePrefix;/// Directory to write evaluation logs to. -@JsonKey(name: 'log_dir') String get logDir;/// Sandbox type: `'local'`, `'docker'`, or `'podman'`. 
-@JsonKey(name: 'sandbox_type') String get sandboxType;/// Maximum concurrent API connections. + String? get description;/// Directory to write evaluation logs to. +@JsonKey(name: 'log_dir') String get logDir;/// Maximum concurrent API connections. @JsonKey(name: 'max_connections') int get maxConnections;/// Models to run. `null` means use defaults from registries. List? get models;/// Named variant map. Keys are variant names, values are config dicts. /// `null` means baseline only. @@ -33,69 +29,14 @@ mixin _$Job { /// `null` means run all tasks. Map? get tasks;/// If `true`, copy final workspace to `/examples/` after each sample. @JsonKey(name: 'save_examples') bool get saveExamples;// ------------------------------------------------------------------ -// Promoted eval_set() parameters (convenience top-level keys) +// Sandbox configuration // ------------------------------------------------------------------ -/// Maximum retry attempts before giving up (defaults to 10). -@JsonKey(name: 'retry_attempts') int? get retryAttempts;/// Maximum number of retry attempts for failed samples. -@JsonKey(name: 'max_retries') int? get maxRetries;/// Time in seconds to wait between retry attempts (exponential backoff). -@JsonKey(name: 'retry_wait') double? get retryWait;/// Reduce `max_connections` at this rate with each retry (default 1.0). -@JsonKey(name: 'retry_connections') double? get retryConnections;/// Cleanup failed log files after retries (defaults to true). -@JsonKey(name: 'retry_cleanup') bool? get retryCleanup;/// Fail on sample errors. -/// -/// `0.0–1.0` = fail if proportion exceeds threshold, -/// `>1` = fail if count exceeds threshold. -@JsonKey(name: 'fail_on_error') double? get failOnError;/// Continue running even if `fail_on_error` condition is met. -@JsonKey(name: 'continue_on_fail') bool? get continueOnFail;/// Number of times to retry samples on error (default: no retries). -@JsonKey(name: 'retry_on_error') int? 
get retryOnError;/// Raise task errors for debugging (defaults to false). -@JsonKey(name: 'debug_errors') bool? get debugErrors;/// Maximum samples to run in parallel (default is `max_connections`). -@JsonKey(name: 'max_samples') int? get maxSamples;/// Maximum tasks to run in parallel. -@JsonKey(name: 'max_tasks') int? get maxTasks;/// Maximum subprocesses to run in parallel. -@JsonKey(name: 'max_subprocesses') int? get maxSubprocesses;/// Maximum sandboxes (per-provider) to run in parallel. -@JsonKey(name: 'max_sandboxes') int? get maxSandboxes;/// Level for logging to the console (e.g. `"warning"`, `"info"`, `"debug"`). -@JsonKey(name: 'log_level') String? get logLevel;/// Level for logging to the log file (defaults to `"info"`). -@JsonKey(name: 'log_level_transcript') String? get logLevelTranscript;/// Format for writing log files (`"eval"` or `"json"`). -@JsonKey(name: 'log_format') String? get logFormat;/// Tags to associate with this evaluation run. - List? get tags;/// Metadata to associate with this evaluation run. - Map? get metadata;/// Trace message interactions with evaluated model to terminal. - bool? get trace;/// Task display type (defaults to `"full"`). - String? get display;/// Score output (defaults to true). - bool? get score;/// Limit evaluated samples (int count or `[start, end]` range). - Object? get limit;/// Evaluate specific sample(s) from the dataset. -@JsonKey(name: 'sample_id') Object? get sampleId;/// Shuffle order of samples (pass a seed to make order deterministic). -@JsonKey(name: 'sample_shuffle') Object? get sampleShuffle;/// Epochs to repeat samples for and optional score reducer function(s). - Object? get epochs;/// Tool use approval policies (string or config dict). - Object? get approval;/// Alternative solver(s) for evaluating task(s) (string or config dict). - Object? get solver;/// Sandbox cleanup after task completes (defaults to true). -@JsonKey(name: 'sandbox_cleanup') bool? 
get sandboxCleanup;/// Base URL for communicating with the model API. -@JsonKey(name: 'model_base_url') String? get modelBaseUrl;/// Model creation arguments. -@JsonKey(name: 'model_args') Map? get modelArgs;/// Named roles for use in `get_model()`. -@JsonKey(name: 'model_roles') Map? get modelRoles;/// Task creation arguments. -@JsonKey(name: 'task_args') Map? get taskArgs;/// Limit on total messages per sample. -@JsonKey(name: 'message_limit') int? get messageLimit;/// Limit on total tokens per sample. -@JsonKey(name: 'token_limit') int? get tokenLimit;/// Limit on clock time (in seconds) per sample. -@JsonKey(name: 'time_limit') int? get timeLimit;/// Limit on working time (in seconds) per sample. -@JsonKey(name: 'working_limit') int? get workingLimit;/// Limit on total cost (in dollars) per sample. -@JsonKey(name: 'cost_limit') double? get costLimit;/// JSON file with model prices for cost tracking. -@JsonKey(name: 'model_cost_config') Map? get modelCostConfig;/// Log detailed samples and scores (defaults to true). -@JsonKey(name: 'log_samples') bool? get logSamples;/// Log events in realtime (defaults to true). -@JsonKey(name: 'log_realtime') bool? get logRealtime;/// Log base64-encoded images (defaults to false). -@JsonKey(name: 'log_images') bool? get logImages;/// Number of samples to buffer before writing log file. -@JsonKey(name: 'log_buffer') int? get logBuffer;/// Sync sample events for realtime viewing. -@JsonKey(name: 'log_shared') int? get logShared;/// Directory to bundle logs and viewer into. -@JsonKey(name: 'bundle_dir') String? get bundleDir;/// Overwrite files in `bundle_dir` (defaults to false). -@JsonKey(name: 'bundle_overwrite') bool? get bundleOverwrite;/// Allow log directory to contain unrelated logs (defaults to false). -@JsonKey(name: 'log_dir_allow_dirty') bool? get logDirAllowDirty;/// ID for the eval set. Generated if not specified. -@JsonKey(name: 'eval_set_id') String? 
get evalSetId;// ------------------------------------------------------------------ -// Pass-through overrides +/// Sandbox config with keys: environment, parameters, image_prefix. + Map? get sandbox;// ------------------------------------------------------------------ +// Inspect eval arguments (passed through to eval_set()) // ------------------------------------------------------------------ -/// Additional `eval_set()` kwargs not covered by top-level fields. -/// -/// Any valid `eval_set()` parameter can be specified here and will be -/// merged into the output JSON. Top-level fields take precedence. -@JsonKey(name: 'eval_set_overrides') Map? get evalSetOverrides;/// Default `Task` kwargs applied to every task in this job. -/// -/// Per-task overrides (from `task.yaml`) take precedence. -@JsonKey(name: 'task_defaults') Map? get taskDefaults;// ------------------------------------------------------------------ +/// All Inspect AI eval_set() parameters, nested under one key. +@JsonKey(name: 'inspect_eval_arguments') Map? get inspectEvalArguments;// ------------------------------------------------------------------ // Tag-based filtering // ------------------------------------------------------------------ /// Tag filters applied to tasks. 
@@ -113,16 +54,16 @@ $JobCopyWith get copyWith => _$JobCopyWithImpl(this as Job, _$identity @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Job&&(identical(other.description, description) || other.description == description)&&(identical(other.imagePrefix, imagePrefix) || other.imagePrefix == imagePrefix)&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.sandboxType, sandboxType) || other.sandboxType == sandboxType)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other.models, models)&&const DeepCollectionEquality().equals(other.variants, variants)&&const DeepCollectionEquality().equals(other.taskPaths, taskPaths)&&const DeepCollectionEquality().equals(other.tasks, tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&(identical(other.retryAttempts, retryAttempts) || other.retryAttempts == retryAttempts)&&(identical(other.maxRetries, maxRetries) || other.maxRetries == maxRetries)&&(identical(other.retryWait, retryWait) || other.retryWait == retryWait)&&(identical(other.retryConnections, retryConnections) || other.retryConnections == retryConnections)&&(identical(other.retryCleanup, retryCleanup) || other.retryCleanup == retryCleanup)&&(identical(other.failOnError, failOnError) || other.failOnError == failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.retryOnError, retryOnError) || other.retryOnError == retryOnError)&&(identical(other.debugErrors, debugErrors) || other.debugErrors == debugErrors)&&(identical(other.maxSamples, maxSamples) || other.maxSamples == maxSamples)&&(identical(other.maxTasks, maxTasks) || other.maxTasks == maxTasks)&&(identical(other.maxSubprocesses, maxSubprocesses) || other.maxSubprocesses == maxSubprocesses)&&(identical(other.maxSandboxes, 
maxSandboxes) || other.maxSandboxes == maxSandboxes)&&(identical(other.logLevel, logLevel) || other.logLevel == logLevel)&&(identical(other.logLevelTranscript, logLevelTranscript) || other.logLevelTranscript == logLevelTranscript)&&(identical(other.logFormat, logFormat) || other.logFormat == logFormat)&&const DeepCollectionEquality().equals(other.tags, tags)&&const DeepCollectionEquality().equals(other.metadata, metadata)&&(identical(other.trace, trace) || other.trace == trace)&&(identical(other.display, display) || other.display == display)&&(identical(other.score, score) || other.score == score)&&const DeepCollectionEquality().equals(other.limit, limit)&&const DeepCollectionEquality().equals(other.sampleId, sampleId)&&const DeepCollectionEquality().equals(other.sampleShuffle, sampleShuffle)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.solver, solver)&&(identical(other.sandboxCleanup, sandboxCleanup) || other.sandboxCleanup == sandboxCleanup)&&(identical(other.modelBaseUrl, modelBaseUrl) || other.modelBaseUrl == modelBaseUrl)&&const DeepCollectionEquality().equals(other.modelArgs, modelArgs)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.taskArgs, taskArgs)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.modelCostConfig, modelCostConfig)&&(identical(other.logSamples, logSamples) || other.logSamples == logSamples)&&(identical(other.logRealtime, logRealtime) || other.logRealtime == 
logRealtime)&&(identical(other.logImages, logImages) || other.logImages == logImages)&&(identical(other.logBuffer, logBuffer) || other.logBuffer == logBuffer)&&(identical(other.logShared, logShared) || other.logShared == logShared)&&(identical(other.bundleDir, bundleDir) || other.bundleDir == bundleDir)&&(identical(other.bundleOverwrite, bundleOverwrite) || other.bundleOverwrite == bundleOverwrite)&&(identical(other.logDirAllowDirty, logDirAllowDirty) || other.logDirAllowDirty == logDirAllowDirty)&&(identical(other.evalSetId, evalSetId) || other.evalSetId == evalSetId)&&const DeepCollectionEquality().equals(other.evalSetOverrides, evalSetOverrides)&&const DeepCollectionEquality().equals(other.taskDefaults, taskDefaults)&&(identical(other.taskFilters, taskFilters) || other.taskFilters == taskFilters)&&(identical(other.sampleFilters, sampleFilters) || other.sampleFilters == sampleFilters)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Job&&(identical(other.description, description) || other.description == description)&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other.models, models)&&const DeepCollectionEquality().equals(other.variants, variants)&&const DeepCollectionEquality().equals(other.taskPaths, taskPaths)&&const DeepCollectionEquality().equals(other.tasks, tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.inspectEvalArguments, inspectEvalArguments)&&(identical(other.taskFilters, taskFilters) || other.taskFilters == taskFilters)&&(identical(other.sampleFilters, sampleFilters) || other.sampleFilters == sampleFilters)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => 
Object.hashAll([runtimeType,description,imagePrefix,logDir,sandboxType,maxConnections,const DeepCollectionEquality().hash(models),const DeepCollectionEquality().hash(variants),const DeepCollectionEquality().hash(taskPaths),const DeepCollectionEquality().hash(tasks),saveExamples,retryAttempts,maxRetries,retryWait,retryConnections,retryCleanup,failOnError,continueOnFail,retryOnError,debugErrors,maxSamples,maxTasks,maxSubprocesses,maxSandboxes,logLevel,logLevelTranscript,logFormat,const DeepCollectionEquality().hash(tags),const DeepCollectionEquality().hash(metadata),trace,display,score,const DeepCollectionEquality().hash(limit),const DeepCollectionEquality().hash(sampleId),const DeepCollectionEquality().hash(sampleShuffle),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(solver),sandboxCleanup,modelBaseUrl,const DeepCollectionEquality().hash(modelArgs),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(taskArgs),messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(modelCostConfig),logSamples,logRealtime,logImages,logBuffer,logShared,bundleDir,bundleOverwrite,logDirAllowDirty,evalSetId,const DeepCollectionEquality().hash(evalSetOverrides),const DeepCollectionEquality().hash(taskDefaults),taskFilters,sampleFilters]); +int get hashCode => Object.hash(runtimeType,description,logDir,maxConnections,const DeepCollectionEquality().hash(models),const DeepCollectionEquality().hash(variants),const DeepCollectionEquality().hash(taskPaths),const DeepCollectionEquality().hash(tasks),saveExamples,const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(inspectEvalArguments),taskFilters,sampleFilters); @override String toString() { - return 'Job(description: $description, imagePrefix: $imagePrefix, logDir: $logDir, sandboxType: $sandboxType, maxConnections: $maxConnections, models: $models, variants: $variants, 
taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, retryAttempts: $retryAttempts, maxRetries: $maxRetries, retryWait: $retryWait, retryConnections: $retryConnections, retryCleanup: $retryCleanup, failOnError: $failOnError, continueOnFail: $continueOnFail, retryOnError: $retryOnError, debugErrors: $debugErrors, maxSamples: $maxSamples, maxTasks: $maxTasks, maxSubprocesses: $maxSubprocesses, maxSandboxes: $maxSandboxes, logLevel: $logLevel, logLevelTranscript: $logLevelTranscript, logFormat: $logFormat, tags: $tags, metadata: $metadata, trace: $trace, display: $display, score: $score, limit: $limit, sampleId: $sampleId, sampleShuffle: $sampleShuffle, epochs: $epochs, approval: $approval, solver: $solver, sandboxCleanup: $sandboxCleanup, modelBaseUrl: $modelBaseUrl, modelArgs: $modelArgs, modelRoles: $modelRoles, taskArgs: $taskArgs, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, modelCostConfig: $modelCostConfig, logSamples: $logSamples, logRealtime: $logRealtime, logImages: $logImages, logBuffer: $logBuffer, logShared: $logShared, bundleDir: $bundleDir, bundleOverwrite: $bundleOverwrite, logDirAllowDirty: $logDirAllowDirty, evalSetId: $evalSetId, evalSetOverrides: $evalSetOverrides, taskDefaults: $taskDefaults, taskFilters: $taskFilters, sampleFilters: $sampleFilters)'; + return 'Job(description: $description, logDir: $logDir, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, sandbox: $sandbox, inspectEvalArguments: $inspectEvalArguments, taskFilters: $taskFilters, sampleFilters: $sampleFilters)'; } @@ -133,7 +74,7 @@ abstract mixin class $JobCopyWith<$Res> { factory $JobCopyWith(Job value, $Res Function(Job) _then) = _$JobCopyWithImpl; @useResult $Res call({ - String? description,@JsonKey(name: 'image_prefix') String? 
imagePrefix,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'sandbox_type') String sandboxType,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples,@JsonKey(name: 'retry_attempts') int? retryAttempts,@JsonKey(name: 'max_retries') int? maxRetries,@JsonKey(name: 'retry_wait') double? retryWait,@JsonKey(name: 'retry_connections') double? retryConnections,@JsonKey(name: 'retry_cleanup') bool? retryCleanup,@JsonKey(name: 'fail_on_error') double? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'retry_on_error') int? retryOnError,@JsonKey(name: 'debug_errors') bool? debugErrors,@JsonKey(name: 'max_samples') int? maxSamples,@JsonKey(name: 'max_tasks') int? maxTasks,@JsonKey(name: 'max_subprocesses') int? maxSubprocesses,@JsonKey(name: 'max_sandboxes') int? maxSandboxes,@JsonKey(name: 'log_level') String? logLevel,@JsonKey(name: 'log_level_transcript') String? logLevelTranscript,@JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit,@JsonKey(name: 'sample_id') Object? sampleId,@JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver,@JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,@JsonKey(name: 'model_base_url') String? modelBaseUrl,@JsonKey(name: 'model_args') Map? modelArgs,@JsonKey(name: 'model_roles') Map? modelRoles,@JsonKey(name: 'task_args') Map? taskArgs,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'model_cost_config') Map? modelCostConfig,@JsonKey(name: 'log_samples') bool? logSamples,@JsonKey(name: 'log_realtime') bool? 
logRealtime,@JsonKey(name: 'log_images') bool? logImages,@JsonKey(name: 'log_buffer') int? logBuffer,@JsonKey(name: 'log_shared') int? logShared,@JsonKey(name: 'bundle_dir') String? bundleDir,@JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,@JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,@JsonKey(name: 'eval_set_id') String? evalSetId,@JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,@JsonKey(name: 'task_defaults') Map? taskDefaults,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters + String? description,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox,@JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters }); @@ -150,61 +91,18 @@ class _$JobCopyWithImpl<$Res> /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? description = freezed,Object? imagePrefix = freezed,Object? logDir = null,Object? sandboxType = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? retryAttempts = freezed,Object? maxRetries = freezed,Object? retryWait = freezed,Object? retryConnections = freezed,Object? retryCleanup = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? retryOnError = freezed,Object? debugErrors = freezed,Object? maxSamples = freezed,Object? maxTasks = freezed,Object? maxSubprocesses = freezed,Object? maxSandboxes = freezed,Object? logLevel = freezed,Object? logLevelTranscript = freezed,Object? logFormat = freezed,Object? 
tags = freezed,Object? metadata = freezed,Object? trace = freezed,Object? display = freezed,Object? score = freezed,Object? limit = freezed,Object? sampleId = freezed,Object? sampleShuffle = freezed,Object? epochs = freezed,Object? approval = freezed,Object? solver = freezed,Object? sandboxCleanup = freezed,Object? modelBaseUrl = freezed,Object? modelArgs = freezed,Object? modelRoles = freezed,Object? taskArgs = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? modelCostConfig = freezed,Object? logSamples = freezed,Object? logRealtime = freezed,Object? logImages = freezed,Object? logBuffer = freezed,Object? logShared = freezed,Object? bundleDir = freezed,Object? bundleOverwrite = freezed,Object? logDirAllowDirty = freezed,Object? evalSetId = freezed,Object? evalSetOverrides = freezed,Object? taskDefaults = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? description = freezed,Object? logDir = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? sandbox = freezed,Object? inspectEvalArguments = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { return _then(_self.copyWith( description: freezed == description ? _self.description : description // ignore: cast_nullable_to_non_nullable -as String?,imagePrefix: freezed == imagePrefix ? _self.imagePrefix : imagePrefix // ignore: cast_nullable_to_non_nullable as String?,logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable -as String,sandboxType: null == sandboxType ? _self.sandboxType : sandboxType // ignore: cast_nullable_to_non_nullable as String,maxConnections: null == maxConnections ? 
_self.maxConnections : maxConnections // ignore: cast_nullable_to_non_nullable as int,models: freezed == models ? _self.models : models // ignore: cast_nullable_to_non_nullable as List?,variants: freezed == variants ? _self.variants : variants // ignore: cast_nullable_to_non_nullable as Map>?,taskPaths: freezed == taskPaths ? _self.taskPaths : taskPaths // ignore: cast_nullable_to_non_nullable as List?,tasks: freezed == tasks ? _self.tasks : tasks // ignore: cast_nullable_to_non_nullable as Map?,saveExamples: null == saveExamples ? _self.saveExamples : saveExamples // ignore: cast_nullable_to_non_nullable -as bool,retryAttempts: freezed == retryAttempts ? _self.retryAttempts : retryAttempts // ignore: cast_nullable_to_non_nullable -as int?,maxRetries: freezed == maxRetries ? _self.maxRetries : maxRetries // ignore: cast_nullable_to_non_nullable -as int?,retryWait: freezed == retryWait ? _self.retryWait : retryWait // ignore: cast_nullable_to_non_nullable -as double?,retryConnections: freezed == retryConnections ? _self.retryConnections : retryConnections // ignore: cast_nullable_to_non_nullable -as double?,retryCleanup: freezed == retryCleanup ? _self.retryCleanup : retryCleanup // ignore: cast_nullable_to_non_nullable -as bool?,failOnError: freezed == failOnError ? _self.failOnError : failOnError // ignore: cast_nullable_to_non_nullable -as double?,continueOnFail: freezed == continueOnFail ? _self.continueOnFail : continueOnFail // ignore: cast_nullable_to_non_nullable -as bool?,retryOnError: freezed == retryOnError ? _self.retryOnError : retryOnError // ignore: cast_nullable_to_non_nullable -as int?,debugErrors: freezed == debugErrors ? _self.debugErrors : debugErrors // ignore: cast_nullable_to_non_nullable -as bool?,maxSamples: freezed == maxSamples ? _self.maxSamples : maxSamples // ignore: cast_nullable_to_non_nullable -as int?,maxTasks: freezed == maxTasks ? 
_self.maxTasks : maxTasks // ignore: cast_nullable_to_non_nullable -as int?,maxSubprocesses: freezed == maxSubprocesses ? _self.maxSubprocesses : maxSubprocesses // ignore: cast_nullable_to_non_nullable -as int?,maxSandboxes: freezed == maxSandboxes ? _self.maxSandboxes : maxSandboxes // ignore: cast_nullable_to_non_nullable -as int?,logLevel: freezed == logLevel ? _self.logLevel : logLevel // ignore: cast_nullable_to_non_nullable -as String?,logLevelTranscript: freezed == logLevelTranscript ? _self.logLevelTranscript : logLevelTranscript // ignore: cast_nullable_to_non_nullable -as String?,logFormat: freezed == logFormat ? _self.logFormat : logFormat // ignore: cast_nullable_to_non_nullable -as String?,tags: freezed == tags ? _self.tags : tags // ignore: cast_nullable_to_non_nullable -as List?,metadata: freezed == metadata ? _self.metadata : metadata // ignore: cast_nullable_to_non_nullable -as Map?,trace: freezed == trace ? _self.trace : trace // ignore: cast_nullable_to_non_nullable -as bool?,display: freezed == display ? _self.display : display // ignore: cast_nullable_to_non_nullable -as String?,score: freezed == score ? _self.score : score // ignore: cast_nullable_to_non_nullable -as bool?,limit: freezed == limit ? _self.limit : limit ,sampleId: freezed == sampleId ? _self.sampleId : sampleId ,sampleShuffle: freezed == sampleShuffle ? _self.sampleShuffle : sampleShuffle ,epochs: freezed == epochs ? _self.epochs : epochs ,approval: freezed == approval ? _self.approval : approval ,solver: freezed == solver ? _self.solver : solver ,sandboxCleanup: freezed == sandboxCleanup ? _self.sandboxCleanup : sandboxCleanup // ignore: cast_nullable_to_non_nullable -as bool?,modelBaseUrl: freezed == modelBaseUrl ? _self.modelBaseUrl : modelBaseUrl // ignore: cast_nullable_to_non_nullable -as String?,modelArgs: freezed == modelArgs ? _self.modelArgs : modelArgs // ignore: cast_nullable_to_non_nullable -as Map?,modelRoles: freezed == modelRoles ? 
_self.modelRoles : modelRoles // ignore: cast_nullable_to_non_nullable -as Map?,taskArgs: freezed == taskArgs ? _self.taskArgs : taskArgs // ignore: cast_nullable_to_non_nullable -as Map?,messageLimit: freezed == messageLimit ? _self.messageLimit : messageLimit // ignore: cast_nullable_to_non_nullable -as int?,tokenLimit: freezed == tokenLimit ? _self.tokenLimit : tokenLimit // ignore: cast_nullable_to_non_nullable -as int?,timeLimit: freezed == timeLimit ? _self.timeLimit : timeLimit // ignore: cast_nullable_to_non_nullable -as int?,workingLimit: freezed == workingLimit ? _self.workingLimit : workingLimit // ignore: cast_nullable_to_non_nullable -as int?,costLimit: freezed == costLimit ? _self.costLimit : costLimit // ignore: cast_nullable_to_non_nullable -as double?,modelCostConfig: freezed == modelCostConfig ? _self.modelCostConfig : modelCostConfig // ignore: cast_nullable_to_non_nullable -as Map?,logSamples: freezed == logSamples ? _self.logSamples : logSamples // ignore: cast_nullable_to_non_nullable -as bool?,logRealtime: freezed == logRealtime ? _self.logRealtime : logRealtime // ignore: cast_nullable_to_non_nullable -as bool?,logImages: freezed == logImages ? _self.logImages : logImages // ignore: cast_nullable_to_non_nullable -as bool?,logBuffer: freezed == logBuffer ? _self.logBuffer : logBuffer // ignore: cast_nullable_to_non_nullable -as int?,logShared: freezed == logShared ? _self.logShared : logShared // ignore: cast_nullable_to_non_nullable -as int?,bundleDir: freezed == bundleDir ? _self.bundleDir : bundleDir // ignore: cast_nullable_to_non_nullable -as String?,bundleOverwrite: freezed == bundleOverwrite ? _self.bundleOverwrite : bundleOverwrite // ignore: cast_nullable_to_non_nullable -as bool?,logDirAllowDirty: freezed == logDirAllowDirty ? _self.logDirAllowDirty : logDirAllowDirty // ignore: cast_nullable_to_non_nullable -as bool?,evalSetId: freezed == evalSetId ? 
_self.evalSetId : evalSetId // ignore: cast_nullable_to_non_nullable -as String?,evalSetOverrides: freezed == evalSetOverrides ? _self.evalSetOverrides : evalSetOverrides // ignore: cast_nullable_to_non_nullable -as Map?,taskDefaults: freezed == taskDefaults ? _self.taskDefaults : taskDefaults // ignore: cast_nullable_to_non_nullable +as bool,sandbox: freezed == sandbox ? _self.sandbox : sandbox // ignore: cast_nullable_to_non_nullable +as Map?,inspectEvalArguments: freezed == inspectEvalArguments ? _self.inspectEvalArguments : inspectEvalArguments // ignore: cast_nullable_to_non_nullable as Map?,taskFilters: freezed == taskFilters ? _self.taskFilters : taskFilters // ignore: cast_nullable_to_non_nullable as TagFilter?,sampleFilters: freezed == sampleFilters ? _self.sampleFilters : sampleFilters // ignore: cast_nullable_to_non_nullable as TagFilter?, @@ -313,10 +211,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( String? description, @JsonKey(name: 'image_prefix') String? imagePrefix, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? 
maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? 
taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Job() when $default != null: -return $default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults,_that.taskFilters,_that.sampleFilters);case _: +return $default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);case _: return orElse(); } @@ -334,10 +232,10 @@ return $default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxTy /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( String? description, @JsonKey(name: 'image_prefix') String? 
imagePrefix, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? 
logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? 
sampleFilters) $default,) {final _that = this; switch (_that) { case _Job(): -return $default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults,_that.taskFilters,_that.sampleFilters);} +return $default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);} } /// A variant of `when` that fallback to returning `null` /// @@ -351,10 +249,10 @@ return $default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxTy /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( String? description, @JsonKey(name: 'image_prefix') String? imagePrefix, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? 
tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? 
bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,) {final _that = this; switch (_that) { case _Job() when $default != null: -return $default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults,_that.taskFilters,_that.sampleFilters);case _:
+return $default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);case _: return null; } @@ -366,7 +264,7 @@ return $default(_that.description,_that.imagePrefix,_that.logDir,_that.sandboxTy @JsonSerializable() class _Job implements Job { - const _Job({this.description, @JsonKey(name: 'image_prefix') this.imagePrefix, @JsonKey(name: 'log_dir') required this.logDir, @JsonKey(name: 'sandbox_type') this.sandboxType = 'local', @JsonKey(name: 'max_connections') this.maxConnections = 10, final List? models, final Map>? variants, @JsonKey(name: 'task_paths') final List? taskPaths, final Map? tasks, @JsonKey(name: 'save_examples') this.saveExamples = false, @JsonKey(name: 'retry_attempts') this.retryAttempts, @JsonKey(name: 'max_retries') this.maxRetries, @JsonKey(name: 'retry_wait') this.retryWait, @JsonKey(name: 'retry_connections') this.retryConnections, @JsonKey(name: 'retry_cleanup') this.retryCleanup, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'retry_on_error') this.retryOnError, @JsonKey(name: 'debug_errors') this.debugErrors, @JsonKey(name: 'max_samples') this.maxSamples, @JsonKey(name: 'max_tasks') this.maxTasks, @JsonKey(name: 'max_subprocesses') this.maxSubprocesses, @JsonKey(name: 'max_sandboxes') this.maxSandboxes, @JsonKey(name: 'log_level') this.logLevel, @JsonKey(name: 'log_level_transcript') this.logLevelTranscript, @JsonKey(name: 'log_format') this.logFormat, final List? tags, final Map?
metadata, this.trace, this.display, this.score, this.limit, @JsonKey(name: 'sample_id') this.sampleId, @JsonKey(name: 'sample_shuffle') this.sampleShuffle, this.epochs, this.approval, this.solver, @JsonKey(name: 'sandbox_cleanup') this.sandboxCleanup, @JsonKey(name: 'model_base_url') this.modelBaseUrl, @JsonKey(name: 'model_args') final Map? modelArgs, @JsonKey(name: 'model_roles') final Map? modelRoles, @JsonKey(name: 'task_args') final Map? taskArgs, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'model_cost_config') final Map? modelCostConfig, @JsonKey(name: 'log_samples') this.logSamples, @JsonKey(name: 'log_realtime') this.logRealtime, @JsonKey(name: 'log_images') this.logImages, @JsonKey(name: 'log_buffer') this.logBuffer, @JsonKey(name: 'log_shared') this.logShared, @JsonKey(name: 'bundle_dir') this.bundleDir, @JsonKey(name: 'bundle_overwrite') this.bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') this.logDirAllowDirty, @JsonKey(name: 'eval_set_id') this.evalSetId, @JsonKey(name: 'eval_set_overrides') final Map? evalSetOverrides, @JsonKey(name: 'task_defaults') final Map? taskDefaults, @JsonKey(name: 'task_filters') this.taskFilters, @JsonKey(name: 'sample_filters') this.sampleFilters}): _models = models,_variants = variants,_taskPaths = taskPaths,_tasks = tasks,_tags = tags,_metadata = metadata,_modelArgs = modelArgs,_modelRoles = modelRoles,_taskArgs = taskArgs,_modelCostConfig = modelCostConfig,_evalSetOverrides = evalSetOverrides,_taskDefaults = taskDefaults; + const _Job({this.description, @JsonKey(name: 'log_dir') required this.logDir, @JsonKey(name: 'max_connections') this.maxConnections = 10, final List? models, final Map>? variants, @JsonKey(name: 'task_paths') final List? taskPaths, final Map? 
tasks, @JsonKey(name: 'save_examples') this.saveExamples = false, final Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') final Map? inspectEvalArguments, @JsonKey(name: 'task_filters') this.taskFilters, @JsonKey(name: 'sample_filters') this.sampleFilters}): _models = models,_variants = variants,_taskPaths = taskPaths,_tasks = tasks,_sandbox = sandbox,_inspectEvalArguments = inspectEvalArguments; factory _Job.fromJson(Map json) => _$JobFromJson(json); // ------------------------------------------------------------------ @@ -374,14 +272,8 @@ class _Job implements Job { // ------------------------------------------------------------------ /// Human-readable description of this job. @override final String? description; -/// Registry URL prefix prepended to image names during sandbox resolution. -/// -/// Example: `us-central1-docker.pkg.dev/project/repo/` -@override@JsonKey(name: 'image_prefix') final String? imagePrefix; /// Directory to write evaluation logs to. @override@JsonKey(name: 'log_dir') final String logDir; -/// Sandbox type: `'local'`, `'docker'`, or `'podman'`. -@override@JsonKey(name: 'sandbox_type') final String sandboxType; /// Maximum concurrent API connections. @override@JsonKey(name: 'max_connections') final int maxConnections; /// Models to run. `null` means use defaults from registries. @@ -435,193 +327,35 @@ class _Job implements Job { /// If `true`, copy final workspace to `/examples/` after each sample. @override@JsonKey(name: 'save_examples') final bool saveExamples; // ------------------------------------------------------------------ -// Promoted eval_set() parameters (convenience top-level keys) +// Sandbox configuration // ------------------------------------------------------------------ -/// Maximum retry attempts before giving up (defaults to 10). -@override@JsonKey(name: 'retry_attempts') final int? retryAttempts; -/// Maximum number of retry attempts for failed samples. -@override@JsonKey(name: 'max_retries') final int? 
maxRetries; -/// Time in seconds to wait between retry attempts (exponential backoff). -@override@JsonKey(name: 'retry_wait') final double? retryWait; -/// Reduce `max_connections` at this rate with each retry (default 1.0). -@override@JsonKey(name: 'retry_connections') final double? retryConnections; -/// Cleanup failed log files after retries (defaults to true). -@override@JsonKey(name: 'retry_cleanup') final bool? retryCleanup; -/// Fail on sample errors. -/// -/// `0.0–1.0` = fail if proportion exceeds threshold, -/// `>1` = fail if count exceeds threshold. -@override@JsonKey(name: 'fail_on_error') final double? failOnError; -/// Continue running even if `fail_on_error` condition is met. -@override@JsonKey(name: 'continue_on_fail') final bool? continueOnFail; -/// Number of times to retry samples on error (default: no retries). -@override@JsonKey(name: 'retry_on_error') final int? retryOnError; -/// Raise task errors for debugging (defaults to false). -@override@JsonKey(name: 'debug_errors') final bool? debugErrors; -/// Maximum samples to run in parallel (default is `max_connections`). -@override@JsonKey(name: 'max_samples') final int? maxSamples; -/// Maximum tasks to run in parallel. -@override@JsonKey(name: 'max_tasks') final int? maxTasks; -/// Maximum subprocesses to run in parallel. -@override@JsonKey(name: 'max_subprocesses') final int? maxSubprocesses; -/// Maximum sandboxes (per-provider) to run in parallel. -@override@JsonKey(name: 'max_sandboxes') final int? maxSandboxes; -/// Level for logging to the console (e.g. `"warning"`, `"info"`, `"debug"`). -@override@JsonKey(name: 'log_level') final String? logLevel; -/// Level for logging to the log file (defaults to `"info"`). -@override@JsonKey(name: 'log_level_transcript') final String? logLevelTranscript; -/// Format for writing log files (`"eval"` or `"json"`). -@override@JsonKey(name: 'log_format') final String? logFormat; -/// Tags to associate with this evaluation run. - final List? 
_tags; -/// Tags to associate with this evaluation run. -@override List? get tags { - final value = _tags; - if (value == null) return null; - if (_tags is EqualUnmodifiableListView) return _tags; - // ignore: implicit_dynamic_type - return EqualUnmodifiableListView(value); -} - -/// Metadata to associate with this evaluation run. - final Map? _metadata; -/// Metadata to associate with this evaluation run. -@override Map? get metadata { - final value = _metadata; - if (value == null) return null; - if (_metadata is EqualUnmodifiableMapView) return _metadata; - // ignore: implicit_dynamic_type - return EqualUnmodifiableMapView(value); -} - -/// Trace message interactions with evaluated model to terminal. -@override final bool? trace; -/// Task display type (defaults to `"full"`). -@override final String? display; -/// Score output (defaults to true). -@override final bool? score; -/// Limit evaluated samples (int count or `[start, end]` range). -@override final Object? limit; -/// Evaluate specific sample(s) from the dataset. -@override@JsonKey(name: 'sample_id') final Object? sampleId; -/// Shuffle order of samples (pass a seed to make order deterministic). -@override@JsonKey(name: 'sample_shuffle') final Object? sampleShuffle; -/// Epochs to repeat samples for and optional score reducer function(s). -@override final Object? epochs; -/// Tool use approval policies (string or config dict). -@override final Object? approval; -/// Alternative solver(s) for evaluating task(s) (string or config dict). -@override final Object? solver; -/// Sandbox cleanup after task completes (defaults to true). -@override@JsonKey(name: 'sandbox_cleanup') final bool? sandboxCleanup; -/// Base URL for communicating with the model API. -@override@JsonKey(name: 'model_base_url') final String? modelBaseUrl; -/// Model creation arguments. - final Map? _modelArgs; -/// Model creation arguments. -@override@JsonKey(name: 'model_args') Map? 
get modelArgs { - final value = _modelArgs; - if (value == null) return null; - if (_modelArgs is EqualUnmodifiableMapView) return _modelArgs; - // ignore: implicit_dynamic_type - return EqualUnmodifiableMapView(value); -} - -/// Named roles for use in `get_model()`. - final Map? _modelRoles; -/// Named roles for use in `get_model()`. -@override@JsonKey(name: 'model_roles') Map? get modelRoles { - final value = _modelRoles; - if (value == null) return null; - if (_modelRoles is EqualUnmodifiableMapView) return _modelRoles; - // ignore: implicit_dynamic_type - return EqualUnmodifiableMapView(value); -} - -/// Task creation arguments. - final Map? _taskArgs; -/// Task creation arguments. -@override@JsonKey(name: 'task_args') Map? get taskArgs { - final value = _taskArgs; - if (value == null) return null; - if (_taskArgs is EqualUnmodifiableMapView) return _taskArgs; - // ignore: implicit_dynamic_type - return EqualUnmodifiableMapView(value); -} - -/// Limit on total messages per sample. -@override@JsonKey(name: 'message_limit') final int? messageLimit; -/// Limit on total tokens per sample. -@override@JsonKey(name: 'token_limit') final int? tokenLimit; -/// Limit on clock time (in seconds) per sample. -@override@JsonKey(name: 'time_limit') final int? timeLimit; -/// Limit on working time (in seconds) per sample. -@override@JsonKey(name: 'working_limit') final int? workingLimit; -/// Limit on total cost (in dollars) per sample. -@override@JsonKey(name: 'cost_limit') final double? costLimit; -/// JSON file with model prices for cost tracking. - final Map? _modelCostConfig; -/// JSON file with model prices for cost tracking. -@override@JsonKey(name: 'model_cost_config') Map? get modelCostConfig { - final value = _modelCostConfig; +/// Sandbox config with keys: environment, parameters, image_prefix. + final Map? 
_sandbox; +// ------------------------------------------------------------------ +// Sandbox configuration +// ------------------------------------------------------------------ +/// Sandbox config with keys: environment, parameters, image_prefix. +@override Map? get sandbox { + final value = _sandbox; if (value == null) return null; - if (_modelCostConfig is EqualUnmodifiableMapView) return _modelCostConfig; + if (_sandbox is EqualUnmodifiableMapView) return _sandbox; // ignore: implicit_dynamic_type return EqualUnmodifiableMapView(value); } -/// Log detailed samples and scores (defaults to true). -@override@JsonKey(name: 'log_samples') final bool? logSamples; -/// Log events in realtime (defaults to true). -@override@JsonKey(name: 'log_realtime') final bool? logRealtime; -/// Log base64-encoded images (defaults to false). -@override@JsonKey(name: 'log_images') final bool? logImages; -/// Number of samples to buffer before writing log file. -@override@JsonKey(name: 'log_buffer') final int? logBuffer; -/// Sync sample events for realtime viewing. -@override@JsonKey(name: 'log_shared') final int? logShared; -/// Directory to bundle logs and viewer into. -@override@JsonKey(name: 'bundle_dir') final String? bundleDir; -/// Overwrite files in `bundle_dir` (defaults to false). -@override@JsonKey(name: 'bundle_overwrite') final bool? bundleOverwrite; -/// Allow log directory to contain unrelated logs (defaults to false). -@override@JsonKey(name: 'log_dir_allow_dirty') final bool? logDirAllowDirty; -/// ID for the eval set. Generated if not specified. -@override@JsonKey(name: 'eval_set_id') final String? evalSetId; // ------------------------------------------------------------------ -// Pass-through overrides +// Inspect eval arguments (passed through to eval_set()) // ------------------------------------------------------------------ -/// Additional `eval_set()` kwargs not covered by top-level fields. 
-/// -/// Any valid `eval_set()` parameter can be specified here and will be -/// merged into the output JSON. Top-level fields take precedence. - final Map? _evalSetOverrides; +/// All Inspect AI eval_set() parameters, nested under one key. + final Map? _inspectEvalArguments; // ------------------------------------------------------------------ -// Pass-through overrides +// Inspect eval arguments (passed through to eval_set()) // ------------------------------------------------------------------ -/// Additional `eval_set()` kwargs not covered by top-level fields. -/// -/// Any valid `eval_set()` parameter can be specified here and will be -/// merged into the output JSON. Top-level fields take precedence. -@override@JsonKey(name: 'eval_set_overrides') Map? get evalSetOverrides { - final value = _evalSetOverrides; - if (value == null) return null; - if (_evalSetOverrides is EqualUnmodifiableMapView) return _evalSetOverrides; - // ignore: implicit_dynamic_type - return EqualUnmodifiableMapView(value); -} - -/// Default `Task` kwargs applied to every task in this job. -/// -/// Per-task overrides (from `task.yaml`) take precedence. - final Map? _taskDefaults; -/// Default `Task` kwargs applied to every task in this job. -/// -/// Per-task overrides (from `task.yaml`) take precedence. -@override@JsonKey(name: 'task_defaults') Map? get taskDefaults { - final value = _taskDefaults; +/// All Inspect AI eval_set() parameters, nested under one key. +@override@JsonKey(name: 'inspect_eval_arguments') Map? 
get inspectEvalArguments { + final value = _inspectEvalArguments; if (value == null) return null; - if (_taskDefaults is EqualUnmodifiableMapView) return _taskDefaults; + if (_inspectEvalArguments is EqualUnmodifiableMapView) return _inspectEvalArguments; // ignore: implicit_dynamic_type return EqualUnmodifiableMapView(value); } @@ -647,16 +381,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Job&&(identical(other.description, description) || other.description == description)&&(identical(other.imagePrefix, imagePrefix) || other.imagePrefix == imagePrefix)&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.sandboxType, sandboxType) || other.sandboxType == sandboxType)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other._models, _models)&&const DeepCollectionEquality().equals(other._variants, _variants)&&const DeepCollectionEquality().equals(other._taskPaths, _taskPaths)&&const DeepCollectionEquality().equals(other._tasks, _tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&(identical(other.retryAttempts, retryAttempts) || other.retryAttempts == retryAttempts)&&(identical(other.maxRetries, maxRetries) || other.maxRetries == maxRetries)&&(identical(other.retryWait, retryWait) || other.retryWait == retryWait)&&(identical(other.retryConnections, retryConnections) || other.retryConnections == retryConnections)&&(identical(other.retryCleanup, retryCleanup) || other.retryCleanup == retryCleanup)&&(identical(other.failOnError, failOnError) || other.failOnError == failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.retryOnError, retryOnError) || other.retryOnError == retryOnError)&&(identical(other.debugErrors, debugErrors) || other.debugErrors == 
debugErrors)&&(identical(other.maxSamples, maxSamples) || other.maxSamples == maxSamples)&&(identical(other.maxTasks, maxTasks) || other.maxTasks == maxTasks)&&(identical(other.maxSubprocesses, maxSubprocesses) || other.maxSubprocesses == maxSubprocesses)&&(identical(other.maxSandboxes, maxSandboxes) || other.maxSandboxes == maxSandboxes)&&(identical(other.logLevel, logLevel) || other.logLevel == logLevel)&&(identical(other.logLevelTranscript, logLevelTranscript) || other.logLevelTranscript == logLevelTranscript)&&(identical(other.logFormat, logFormat) || other.logFormat == logFormat)&&const DeepCollectionEquality().equals(other._tags, _tags)&&const DeepCollectionEquality().equals(other._metadata, _metadata)&&(identical(other.trace, trace) || other.trace == trace)&&(identical(other.display, display) || other.display == display)&&(identical(other.score, score) || other.score == score)&&const DeepCollectionEquality().equals(other.limit, limit)&&const DeepCollectionEquality().equals(other.sampleId, sampleId)&&const DeepCollectionEquality().equals(other.sampleShuffle, sampleShuffle)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.solver, solver)&&(identical(other.sandboxCleanup, sandboxCleanup) || other.sandboxCleanup == sandboxCleanup)&&(identical(other.modelBaseUrl, modelBaseUrl) || other.modelBaseUrl == modelBaseUrl)&&const DeepCollectionEquality().equals(other._modelArgs, _modelArgs)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other._taskArgs, _taskArgs)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == 
workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other._modelCostConfig, _modelCostConfig)&&(identical(other.logSamples, logSamples) || other.logSamples == logSamples)&&(identical(other.logRealtime, logRealtime) || other.logRealtime == logRealtime)&&(identical(other.logImages, logImages) || other.logImages == logImages)&&(identical(other.logBuffer, logBuffer) || other.logBuffer == logBuffer)&&(identical(other.logShared, logShared) || other.logShared == logShared)&&(identical(other.bundleDir, bundleDir) || other.bundleDir == bundleDir)&&(identical(other.bundleOverwrite, bundleOverwrite) || other.bundleOverwrite == bundleOverwrite)&&(identical(other.logDirAllowDirty, logDirAllowDirty) || other.logDirAllowDirty == logDirAllowDirty)&&(identical(other.evalSetId, evalSetId) || other.evalSetId == evalSetId)&&const DeepCollectionEquality().equals(other._evalSetOverrides, _evalSetOverrides)&&const DeepCollectionEquality().equals(other._taskDefaults, _taskDefaults)&&(identical(other.taskFilters, taskFilters) || other.taskFilters == taskFilters)&&(identical(other.sampleFilters, sampleFilters) || other.sampleFilters == sampleFilters)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Job&&(identical(other.description, description) || other.description == description)&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other._models, _models)&&const DeepCollectionEquality().equals(other._variants, _variants)&&const DeepCollectionEquality().equals(other._taskPaths, _taskPaths)&&const DeepCollectionEquality().equals(other._tasks, _tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&const DeepCollectionEquality().equals(other._sandbox, _sandbox)&&const 
DeepCollectionEquality().equals(other._inspectEvalArguments, _inspectEvalArguments)&&(identical(other.taskFilters, taskFilters) || other.taskFilters == taskFilters)&&(identical(other.sampleFilters, sampleFilters) || other.sampleFilters == sampleFilters)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hashAll([runtimeType,description,imagePrefix,logDir,sandboxType,maxConnections,const DeepCollectionEquality().hash(_models),const DeepCollectionEquality().hash(_variants),const DeepCollectionEquality().hash(_taskPaths),const DeepCollectionEquality().hash(_tasks),saveExamples,retryAttempts,maxRetries,retryWait,retryConnections,retryCleanup,failOnError,continueOnFail,retryOnError,debugErrors,maxSamples,maxTasks,maxSubprocesses,maxSandboxes,logLevel,logLevelTranscript,logFormat,const DeepCollectionEquality().hash(_tags),const DeepCollectionEquality().hash(_metadata),trace,display,score,const DeepCollectionEquality().hash(limit),const DeepCollectionEquality().hash(sampleId),const DeepCollectionEquality().hash(sampleShuffle),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(solver),sandboxCleanup,modelBaseUrl,const DeepCollectionEquality().hash(_modelArgs),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(_taskArgs),messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(_modelCostConfig),logSamples,logRealtime,logImages,logBuffer,logShared,bundleDir,bundleOverwrite,logDirAllowDirty,evalSetId,const DeepCollectionEquality().hash(_evalSetOverrides),const DeepCollectionEquality().hash(_taskDefaults),taskFilters,sampleFilters]); +int get hashCode => Object.hash(runtimeType,description,logDir,maxConnections,const DeepCollectionEquality().hash(_models),const DeepCollectionEquality().hash(_variants),const DeepCollectionEquality().hash(_taskPaths),const 
DeepCollectionEquality().hash(_tasks),saveExamples,const DeepCollectionEquality().hash(_sandbox),const DeepCollectionEquality().hash(_inspectEvalArguments),taskFilters,sampleFilters); @override String toString() { - return 'Job(description: $description, imagePrefix: $imagePrefix, logDir: $logDir, sandboxType: $sandboxType, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, retryAttempts: $retryAttempts, maxRetries: $maxRetries, retryWait: $retryWait, retryConnections: $retryConnections, retryCleanup: $retryCleanup, failOnError: $failOnError, continueOnFail: $continueOnFail, retryOnError: $retryOnError, debugErrors: $debugErrors, maxSamples: $maxSamples, maxTasks: $maxTasks, maxSubprocesses: $maxSubprocesses, maxSandboxes: $maxSandboxes, logLevel: $logLevel, logLevelTranscript: $logLevelTranscript, logFormat: $logFormat, tags: $tags, metadata: $metadata, trace: $trace, display: $display, score: $score, limit: $limit, sampleId: $sampleId, sampleShuffle: $sampleShuffle, epochs: $epochs, approval: $approval, solver: $solver, sandboxCleanup: $sandboxCleanup, modelBaseUrl: $modelBaseUrl, modelArgs: $modelArgs, modelRoles: $modelRoles, taskArgs: $taskArgs, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, modelCostConfig: $modelCostConfig, logSamples: $logSamples, logRealtime: $logRealtime, logImages: $logImages, logBuffer: $logBuffer, logShared: $logShared, bundleDir: $bundleDir, bundleOverwrite: $bundleOverwrite, logDirAllowDirty: $logDirAllowDirty, evalSetId: $evalSetId, evalSetOverrides: $evalSetOverrides, taskDefaults: $taskDefaults, taskFilters: $taskFilters, sampleFilters: $sampleFilters)'; + return 'Job(description: $description, logDir: $logDir, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, sandbox: $sandbox, 
inspectEvalArguments: $inspectEvalArguments, taskFilters: $taskFilters, sampleFilters: $sampleFilters)'; } @@ -667,7 +401,7 @@ abstract mixin class _$JobCopyWith<$Res> implements $JobCopyWith<$Res> { factory _$JobCopyWith(_Job value, $Res Function(_Job) _then) = __$JobCopyWithImpl; @override @useResult $Res call({ - String? description,@JsonKey(name: 'image_prefix') String? imagePrefix,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'sandbox_type') String sandboxType,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples,@JsonKey(name: 'retry_attempts') int? retryAttempts,@JsonKey(name: 'max_retries') int? maxRetries,@JsonKey(name: 'retry_wait') double? retryWait,@JsonKey(name: 'retry_connections') double? retryConnections,@JsonKey(name: 'retry_cleanup') bool? retryCleanup,@JsonKey(name: 'fail_on_error') double? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'retry_on_error') int? retryOnError,@JsonKey(name: 'debug_errors') bool? debugErrors,@JsonKey(name: 'max_samples') int? maxSamples,@JsonKey(name: 'max_tasks') int? maxTasks,@JsonKey(name: 'max_subprocesses') int? maxSubprocesses,@JsonKey(name: 'max_sandboxes') int? maxSandboxes,@JsonKey(name: 'log_level') String? logLevel,@JsonKey(name: 'log_level_transcript') String? logLevelTranscript,@JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit,@JsonKey(name: 'sample_id') Object? sampleId,@JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver,@JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,@JsonKey(name: 'model_base_url') String? modelBaseUrl,@JsonKey(name: 'model_args') Map? modelArgs,@JsonKey(name: 'model_roles') Map? modelRoles,@JsonKey(name: 'task_args') Map? taskArgs,@JsonKey(name: 'message_limit') int? 
messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'model_cost_config') Map? modelCostConfig,@JsonKey(name: 'log_samples') bool? logSamples,@JsonKey(name: 'log_realtime') bool? logRealtime,@JsonKey(name: 'log_images') bool? logImages,@JsonKey(name: 'log_buffer') int? logBuffer,@JsonKey(name: 'log_shared') int? logShared,@JsonKey(name: 'bundle_dir') String? bundleDir,@JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,@JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,@JsonKey(name: 'eval_set_id') String? evalSetId,@JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,@JsonKey(name: 'task_defaults') Map? taskDefaults,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters + String? description,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox,@JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters }); @@ -684,61 +418,18 @@ class __$JobCopyWithImpl<$Res> /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? description = freezed,Object? imagePrefix = freezed,Object? logDir = null,Object? sandboxType = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? retryAttempts = freezed,Object? maxRetries = freezed,Object? retryWait = freezed,Object? retryConnections = freezed,Object? retryCleanup = freezed,Object? 
failOnError = freezed,Object? continueOnFail = freezed,Object? retryOnError = freezed,Object? debugErrors = freezed,Object? maxSamples = freezed,Object? maxTasks = freezed,Object? maxSubprocesses = freezed,Object? maxSandboxes = freezed,Object? logLevel = freezed,Object? logLevelTranscript = freezed,Object? logFormat = freezed,Object? tags = freezed,Object? metadata = freezed,Object? trace = freezed,Object? display = freezed,Object? score = freezed,Object? limit = freezed,Object? sampleId = freezed,Object? sampleShuffle = freezed,Object? epochs = freezed,Object? approval = freezed,Object? solver = freezed,Object? sandboxCleanup = freezed,Object? modelBaseUrl = freezed,Object? modelArgs = freezed,Object? modelRoles = freezed,Object? taskArgs = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? modelCostConfig = freezed,Object? logSamples = freezed,Object? logRealtime = freezed,Object? logImages = freezed,Object? logBuffer = freezed,Object? logShared = freezed,Object? bundleDir = freezed,Object? bundleOverwrite = freezed,Object? logDirAllowDirty = freezed,Object? evalSetId = freezed,Object? evalSetOverrides = freezed,Object? taskDefaults = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? description = freezed,Object? logDir = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? sandbox = freezed,Object? inspectEvalArguments = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { return _then(_Job( description: freezed == description ? _self.description : description // ignore: cast_nullable_to_non_nullable -as String?,imagePrefix: freezed == imagePrefix ? 
_self.imagePrefix : imagePrefix // ignore: cast_nullable_to_non_nullable as String?,logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable -as String,sandboxType: null == sandboxType ? _self.sandboxType : sandboxType // ignore: cast_nullable_to_non_nullable as String,maxConnections: null == maxConnections ? _self.maxConnections : maxConnections // ignore: cast_nullable_to_non_nullable as int,models: freezed == models ? _self._models : models // ignore: cast_nullable_to_non_nullable as List?,variants: freezed == variants ? _self._variants : variants // ignore: cast_nullable_to_non_nullable as Map>?,taskPaths: freezed == taskPaths ? _self._taskPaths : taskPaths // ignore: cast_nullable_to_non_nullable as List?,tasks: freezed == tasks ? _self._tasks : tasks // ignore: cast_nullable_to_non_nullable as Map?,saveExamples: null == saveExamples ? _self.saveExamples : saveExamples // ignore: cast_nullable_to_non_nullable -as bool,retryAttempts: freezed == retryAttempts ? _self.retryAttempts : retryAttempts // ignore: cast_nullable_to_non_nullable -as int?,maxRetries: freezed == maxRetries ? _self.maxRetries : maxRetries // ignore: cast_nullable_to_non_nullable -as int?,retryWait: freezed == retryWait ? _self.retryWait : retryWait // ignore: cast_nullable_to_non_nullable -as double?,retryConnections: freezed == retryConnections ? _self.retryConnections : retryConnections // ignore: cast_nullable_to_non_nullable -as double?,retryCleanup: freezed == retryCleanup ? _self.retryCleanup : retryCleanup // ignore: cast_nullable_to_non_nullable -as bool?,failOnError: freezed == failOnError ? _self.failOnError : failOnError // ignore: cast_nullable_to_non_nullable -as double?,continueOnFail: freezed == continueOnFail ? _self.continueOnFail : continueOnFail // ignore: cast_nullable_to_non_nullable -as bool?,retryOnError: freezed == retryOnError ? 
_self.retryOnError : retryOnError // ignore: cast_nullable_to_non_nullable -as int?,debugErrors: freezed == debugErrors ? _self.debugErrors : debugErrors // ignore: cast_nullable_to_non_nullable -as bool?,maxSamples: freezed == maxSamples ? _self.maxSamples : maxSamples // ignore: cast_nullable_to_non_nullable -as int?,maxTasks: freezed == maxTasks ? _self.maxTasks : maxTasks // ignore: cast_nullable_to_non_nullable -as int?,maxSubprocesses: freezed == maxSubprocesses ? _self.maxSubprocesses : maxSubprocesses // ignore: cast_nullable_to_non_nullable -as int?,maxSandboxes: freezed == maxSandboxes ? _self.maxSandboxes : maxSandboxes // ignore: cast_nullable_to_non_nullable -as int?,logLevel: freezed == logLevel ? _self.logLevel : logLevel // ignore: cast_nullable_to_non_nullable -as String?,logLevelTranscript: freezed == logLevelTranscript ? _self.logLevelTranscript : logLevelTranscript // ignore: cast_nullable_to_non_nullable -as String?,logFormat: freezed == logFormat ? _self.logFormat : logFormat // ignore: cast_nullable_to_non_nullable -as String?,tags: freezed == tags ? _self._tags : tags // ignore: cast_nullable_to_non_nullable -as List?,metadata: freezed == metadata ? _self._metadata : metadata // ignore: cast_nullable_to_non_nullable -as Map?,trace: freezed == trace ? _self.trace : trace // ignore: cast_nullable_to_non_nullable -as bool?,display: freezed == display ? _self.display : display // ignore: cast_nullable_to_non_nullable -as String?,score: freezed == score ? _self.score : score // ignore: cast_nullable_to_non_nullable -as bool?,limit: freezed == limit ? _self.limit : limit ,sampleId: freezed == sampleId ? _self.sampleId : sampleId ,sampleShuffle: freezed == sampleShuffle ? _self.sampleShuffle : sampleShuffle ,epochs: freezed == epochs ? _self.epochs : epochs ,approval: freezed == approval ? _self.approval : approval ,solver: freezed == solver ? _self.solver : solver ,sandboxCleanup: freezed == sandboxCleanup ? 
_self.sandboxCleanup : sandboxCleanup // ignore: cast_nullable_to_non_nullable -as bool?,modelBaseUrl: freezed == modelBaseUrl ? _self.modelBaseUrl : modelBaseUrl // ignore: cast_nullable_to_non_nullable -as String?,modelArgs: freezed == modelArgs ? _self._modelArgs : modelArgs // ignore: cast_nullable_to_non_nullable -as Map?,modelRoles: freezed == modelRoles ? _self._modelRoles : modelRoles // ignore: cast_nullable_to_non_nullable -as Map?,taskArgs: freezed == taskArgs ? _self._taskArgs : taskArgs // ignore: cast_nullable_to_non_nullable -as Map?,messageLimit: freezed == messageLimit ? _self.messageLimit : messageLimit // ignore: cast_nullable_to_non_nullable -as int?,tokenLimit: freezed == tokenLimit ? _self.tokenLimit : tokenLimit // ignore: cast_nullable_to_non_nullable -as int?,timeLimit: freezed == timeLimit ? _self.timeLimit : timeLimit // ignore: cast_nullable_to_non_nullable -as int?,workingLimit: freezed == workingLimit ? _self.workingLimit : workingLimit // ignore: cast_nullable_to_non_nullable -as int?,costLimit: freezed == costLimit ? _self.costLimit : costLimit // ignore: cast_nullable_to_non_nullable -as double?,modelCostConfig: freezed == modelCostConfig ? _self._modelCostConfig : modelCostConfig // ignore: cast_nullable_to_non_nullable -as Map?,logSamples: freezed == logSamples ? _self.logSamples : logSamples // ignore: cast_nullable_to_non_nullable -as bool?,logRealtime: freezed == logRealtime ? _self.logRealtime : logRealtime // ignore: cast_nullable_to_non_nullable -as bool?,logImages: freezed == logImages ? _self.logImages : logImages // ignore: cast_nullable_to_non_nullable -as bool?,logBuffer: freezed == logBuffer ? _self.logBuffer : logBuffer // ignore: cast_nullable_to_non_nullable -as int?,logShared: freezed == logShared ? _self.logShared : logShared // ignore: cast_nullable_to_non_nullable -as int?,bundleDir: freezed == bundleDir ? 
_self.bundleDir : bundleDir // ignore: cast_nullable_to_non_nullable -as String?,bundleOverwrite: freezed == bundleOverwrite ? _self.bundleOverwrite : bundleOverwrite // ignore: cast_nullable_to_non_nullable -as bool?,logDirAllowDirty: freezed == logDirAllowDirty ? _self.logDirAllowDirty : logDirAllowDirty // ignore: cast_nullable_to_non_nullable -as bool?,evalSetId: freezed == evalSetId ? _self.evalSetId : evalSetId // ignore: cast_nullable_to_non_nullable -as String?,evalSetOverrides: freezed == evalSetOverrides ? _self._evalSetOverrides : evalSetOverrides // ignore: cast_nullable_to_non_nullable -as Map?,taskDefaults: freezed == taskDefaults ? _self._taskDefaults : taskDefaults // ignore: cast_nullable_to_non_nullable +as bool,sandbox: freezed == sandbox ? _self._sandbox : sandbox // ignore: cast_nullable_to_non_nullable +as Map?,inspectEvalArguments: freezed == inspectEvalArguments ? _self._inspectEvalArguments : inspectEvalArguments // ignore: cast_nullable_to_non_nullable as Map?,taskFilters: freezed == taskFilters ? _self.taskFilters : taskFilters // ignore: cast_nullable_to_non_nullable as TagFilter?,sampleFilters: freezed == sampleFilters ? _self.sampleFilters : sampleFilters // ignore: cast_nullable_to_non_nullable as TagFilter?, @@ -779,8 +470,7 @@ mixin _$JobTask { /// Task identifier matching a task directory name in `tasks/`. String get id;/// Only run these sample IDs. Mutually exclusive with [excludeSamples]. @JsonKey(name: 'include_samples') List? get includeSamples;/// Exclude these sample IDs. Mutually exclusive with [includeSamples]. -@JsonKey(name: 'exclude_samples') List? get excludeSamples;/// Override system message for this task. -@JsonKey(name: 'system_message') String? get systemMessage;/// Per-task argument overrides passed to the task function. +@JsonKey(name: 'exclude_samples') List? get excludeSamples;/// Per-task argument overrides passed to the task function. @JsonKey(name: 'args') Map? 
get args; /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. @@ -794,16 +484,16 @@ $JobTaskCopyWith get copyWith => _$JobTaskCopyWithImpl(this as @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other.includeSamples, includeSamples)&&const DeepCollectionEquality().equals(other.excludeSamples, excludeSamples)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other.args, args)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other.includeSamples, includeSamples)&&const DeepCollectionEquality().equals(other.excludeSamples, excludeSamples)&&const DeepCollectionEquality().equals(other.args, args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(includeSamples),const DeepCollectionEquality().hash(excludeSamples),systemMessage,const DeepCollectionEquality().hash(args)); +int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(includeSamples),const DeepCollectionEquality().hash(excludeSamples),const DeepCollectionEquality().hash(args)); @override String toString() { - return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, systemMessage: $systemMessage, args: $args)'; + return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, args: $args)'; } @@ -814,7 +504,7 @@ abstract mixin class $JobTaskCopyWith<$Res> { factory $JobTaskCopyWith(JobTask value, $Res Function(JobTask) _then) = _$JobTaskCopyWithImpl; @useResult $Res call({ - String id,@JsonKey(name: 'include_samples') 
List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'args') Map? args + String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'args') Map? args }); @@ -831,13 +521,12 @@ class _$JobTaskCopyWithImpl<$Res> /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? systemMessage = freezed,Object? args = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? args = freezed,}) { return _then(_self.copyWith( id: null == id ? _self.id : id // ignore: cast_nullable_to_non_nullable as String,includeSamples: freezed == includeSamples ? _self.includeSamples : includeSamples // ignore: cast_nullable_to_non_nullable as List?,excludeSamples: freezed == excludeSamples ? _self.excludeSamples : excludeSamples // ignore: cast_nullable_to_non_nullable -as List?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable -as String?,args: freezed == args ? _self.args : args // ignore: cast_nullable_to_non_nullable +as List?,args: freezed == args ? _self.args : args // ignore: cast_nullable_to_non_nullable as Map?, )); } @@ -920,10 +609,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'args') Map? args)? 
$default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'args') Map? args)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _JobTask() when $default != null: -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage,_that.args);case _: +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);case _: return orElse(); } @@ -941,10 +630,10 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'args') Map? args) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'args') Map? args) $default,) {final _that = this; switch (_that) { case _JobTask(): -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage,_that.args);} +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);} } /// A variant of `when` that fallback to returning `null` /// @@ -958,10 +647,10 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'args') Map? args)? $default,) {final _that = this; +@optionalTypeArgs TResult? 
whenOrNull(TResult? Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'args') Map? args)? $default,) {final _that = this; switch (_that) { case _JobTask() when $default != null: -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage,_that.args);case _: +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);case _: return null; } @@ -973,7 +662,7 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM @JsonSerializable() class _JobTask implements JobTask { - const _JobTask({required this.id, @JsonKey(name: 'include_samples') final List? includeSamples, @JsonKey(name: 'exclude_samples') final List? excludeSamples, @JsonKey(name: 'system_message') this.systemMessage, @JsonKey(name: 'args') final Map? args}): _includeSamples = includeSamples,_excludeSamples = excludeSamples,_args = args; + const _JobTask({required this.id, @JsonKey(name: 'include_samples') final List? includeSamples, @JsonKey(name: 'exclude_samples') final List? excludeSamples, @JsonKey(name: 'args') final Map? args}): _includeSamples = includeSamples,_excludeSamples = excludeSamples,_args = args; factory _JobTask.fromJson(Map json) => _$JobTaskFromJson(json); /// Task identifier matching a task directory name in `tasks/`. @@ -1000,8 +689,6 @@ class _JobTask implements JobTask { return EqualUnmodifiableListView(value); } -/// Override system message for this task. -@override@JsonKey(name: 'system_message') final String? systemMessage; /// Per-task argument overrides passed to the task function. final Map? _args; /// Per-task argument overrides passed to the task function. 
@@ -1027,16 +714,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other._includeSamples, _includeSamples)&&const DeepCollectionEquality().equals(other._excludeSamples, _excludeSamples)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other._args, _args)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other._includeSamples, _includeSamples)&&const DeepCollectionEquality().equals(other._excludeSamples, _excludeSamples)&&const DeepCollectionEquality().equals(other._args, _args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(_includeSamples),const DeepCollectionEquality().hash(_excludeSamples),systemMessage,const DeepCollectionEquality().hash(_args)); +int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(_includeSamples),const DeepCollectionEquality().hash(_excludeSamples),const DeepCollectionEquality().hash(_args)); @override String toString() { - return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, systemMessage: $systemMessage, args: $args)'; + return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, args: $args)'; } @@ -1047,7 +734,7 @@ abstract mixin class _$JobTaskCopyWith<$Res> implements $JobTaskCopyWith<$Res> { factory _$JobTaskCopyWith(_JobTask value, $Res Function(_JobTask) _then) = __$JobTaskCopyWithImpl; @override @useResult $Res call({ - String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? 
excludeSamples,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'args') Map? args + String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'args') Map? args }); @@ -1064,13 +751,12 @@ class __$JobTaskCopyWithImpl<$Res> /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? systemMessage = freezed,Object? args = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? args = freezed,}) { return _then(_JobTask( id: null == id ? _self.id : id // ignore: cast_nullable_to_non_nullable as String,includeSamples: freezed == includeSamples ? _self._includeSamples : includeSamples // ignore: cast_nullable_to_non_nullable as List?,excludeSamples: freezed == excludeSamples ? _self._excludeSamples : excludeSamples // ignore: cast_nullable_to_non_nullable -as List?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable -as String?,args: freezed == args ? _self._args : args // ignore: cast_nullable_to_non_nullable +as List?,args: freezed == args ? _self._args : args // ignore: cast_nullable_to_non_nullable as Map?, )); } diff --git a/packages/dataset_config_dart/lib/src/models/job.g.dart b/packages/dataset_config_dart/lib/src/models/job.g.dart index a3abef1..929b022 100644 --- a/packages/dataset_config_dart/lib/src/models/job.g.dart +++ b/packages/dataset_config_dart/lib/src/models/job.g.dart @@ -8,9 +8,7 @@ part of 'job.dart'; _Job _$JobFromJson(Map json) => _Job( description: json['description'] as String?, - imagePrefix: json['image_prefix'] as String?, logDir: json['log_dir'] as String, - sandboxType: json['sandbox_type'] as String? ?? 
'local', maxConnections: (json['max_connections'] as num?)?.toInt() ?? 10, models: (json['models'] as List?)?.map((e) => e as String).toList(), variants: (json['variants'] as Map?)?.map( @@ -23,57 +21,8 @@ _Job _$JobFromJson(Map json) => _Job( (k, e) => MapEntry(k, JobTask.fromJson(e as Map)), ), saveExamples: json['save_examples'] as bool? ?? false, - retryAttempts: (json['retry_attempts'] as num?)?.toInt(), - maxRetries: (json['max_retries'] as num?)?.toInt(), - retryWait: (json['retry_wait'] as num?)?.toDouble(), - retryConnections: (json['retry_connections'] as num?)?.toDouble(), - retryCleanup: json['retry_cleanup'] as bool?, - failOnError: (json['fail_on_error'] as num?)?.toDouble(), - continueOnFail: json['continue_on_fail'] as bool?, - retryOnError: (json['retry_on_error'] as num?)?.toInt(), - debugErrors: json['debug_errors'] as bool?, - maxSamples: (json['max_samples'] as num?)?.toInt(), - maxTasks: (json['max_tasks'] as num?)?.toInt(), - maxSubprocesses: (json['max_subprocesses'] as num?)?.toInt(), - maxSandboxes: (json['max_sandboxes'] as num?)?.toInt(), - logLevel: json['log_level'] as String?, - logLevelTranscript: json['log_level_transcript'] as String?, - logFormat: json['log_format'] as String?, - tags: (json['tags'] as List?)?.map((e) => e as String).toList(), - metadata: json['metadata'] as Map?, - trace: json['trace'] as bool?, - display: json['display'] as String?, - score: json['score'] as bool?, - limit: json['limit'], - sampleId: json['sample_id'], - sampleShuffle: json['sample_shuffle'], - epochs: json['epochs'], - approval: json['approval'], - solver: json['solver'], - sandboxCleanup: json['sandbox_cleanup'] as bool?, - modelBaseUrl: json['model_base_url'] as String?, - modelArgs: json['model_args'] as Map?, - modelRoles: (json['model_roles'] as Map?)?.map( - (k, e) => MapEntry(k, e as String), - ), - taskArgs: json['task_args'] as Map?, - messageLimit: (json['message_limit'] as num?)?.toInt(), - tokenLimit: (json['token_limit'] as 
num?)?.toInt(), - timeLimit: (json['time_limit'] as num?)?.toInt(), - workingLimit: (json['working_limit'] as num?)?.toInt(), - costLimit: (json['cost_limit'] as num?)?.toDouble(), - modelCostConfig: json['model_cost_config'] as Map?, - logSamples: json['log_samples'] as bool?, - logRealtime: json['log_realtime'] as bool?, - logImages: json['log_images'] as bool?, - logBuffer: (json['log_buffer'] as num?)?.toInt(), - logShared: (json['log_shared'] as num?)?.toInt(), - bundleDir: json['bundle_dir'] as String?, - bundleOverwrite: json['bundle_overwrite'] as bool?, - logDirAllowDirty: json['log_dir_allow_dirty'] as bool?, - evalSetId: json['eval_set_id'] as String?, - evalSetOverrides: json['eval_set_overrides'] as Map?, - taskDefaults: json['task_defaults'] as Map?, + sandbox: json['sandbox'] as Map?, + inspectEvalArguments: json['inspect_eval_arguments'] as Map?, taskFilters: json['task_filters'] == null ? null : TagFilter.fromJson(json['task_filters'] as Map), @@ -84,64 +33,15 @@ _Job _$JobFromJson(Map json) => _Job( Map _$JobToJson(_Job instance) => { 'description': instance.description, - 'image_prefix': instance.imagePrefix, 'log_dir': instance.logDir, - 'sandbox_type': instance.sandboxType, 'max_connections': instance.maxConnections, 'models': instance.models, 'variants': instance.variants, 'task_paths': instance.taskPaths, 'tasks': instance.tasks, 'save_examples': instance.saveExamples, - 'retry_attempts': instance.retryAttempts, - 'max_retries': instance.maxRetries, - 'retry_wait': instance.retryWait, - 'retry_connections': instance.retryConnections, - 'retry_cleanup': instance.retryCleanup, - 'fail_on_error': instance.failOnError, - 'continue_on_fail': instance.continueOnFail, - 'retry_on_error': instance.retryOnError, - 'debug_errors': instance.debugErrors, - 'max_samples': instance.maxSamples, - 'max_tasks': instance.maxTasks, - 'max_subprocesses': instance.maxSubprocesses, - 'max_sandboxes': instance.maxSandboxes, - 'log_level': instance.logLevel, - 
'log_level_transcript': instance.logLevelTranscript, - 'log_format': instance.logFormat, - 'tags': instance.tags, - 'metadata': instance.metadata, - 'trace': instance.trace, - 'display': instance.display, - 'score': instance.score, - 'limit': instance.limit, - 'sample_id': instance.sampleId, - 'sample_shuffle': instance.sampleShuffle, - 'epochs': instance.epochs, - 'approval': instance.approval, - 'solver': instance.solver, - 'sandbox_cleanup': instance.sandboxCleanup, - 'model_base_url': instance.modelBaseUrl, - 'model_args': instance.modelArgs, - 'model_roles': instance.modelRoles, - 'task_args': instance.taskArgs, - 'message_limit': instance.messageLimit, - 'token_limit': instance.tokenLimit, - 'time_limit': instance.timeLimit, - 'working_limit': instance.workingLimit, - 'cost_limit': instance.costLimit, - 'model_cost_config': instance.modelCostConfig, - 'log_samples': instance.logSamples, - 'log_realtime': instance.logRealtime, - 'log_images': instance.logImages, - 'log_buffer': instance.logBuffer, - 'log_shared': instance.logShared, - 'bundle_dir': instance.bundleDir, - 'bundle_overwrite': instance.bundleOverwrite, - 'log_dir_allow_dirty': instance.logDirAllowDirty, - 'eval_set_id': instance.evalSetId, - 'eval_set_overrides': instance.evalSetOverrides, - 'task_defaults': instance.taskDefaults, + 'sandbox': instance.sandbox, + 'inspect_eval_arguments': instance.inspectEvalArguments, 'task_filters': instance.taskFilters, 'sample_filters': instance.sampleFilters, }; @@ -154,7 +54,6 @@ _JobTask _$JobTaskFromJson(Map json) => _JobTask( excludeSamples: (json['exclude_samples'] as List?) 
?.map((e) => e as String) .toList(), - systemMessage: json['system_message'] as String?, args: json['args'] as Map?, ); @@ -162,6 +61,5 @@ Map _$JobTaskToJson(_JobTask instance) => { 'id': instance.id, 'include_samples': instance.includeSamples, 'exclude_samples': instance.excludeSamples, - 'system_message': instance.systemMessage, 'args': instance.args, }; diff --git a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart index 3175ffd..cbacb17 100644 --- a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart +++ b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart @@ -26,7 +26,7 @@ class JsonParser extends Parser { final allowedVariants = (data['allowed_variants'] as List?) ?.cast(); - // Parse samples from inline data (no file I/O) + // Parse samples from inline data (no file I/O) - optional final samplesRaw = data['samples']; final samples = []; if (samplesRaw is Map) { @@ -46,8 +46,13 @@ class JsonParser extends Parser { } } - // Normalize tags - final rawTags = def['tags']; + // Read metadata from the metadata dict + final metaRaw = Map.from( + def['metadata'] as Map? ?? {}, + ); + + // Normalize tags from metadata + final rawTags = metaRaw['tags']; final List tags; if (rawTags is String) { tags = rawTags.split(',').map((t) => t.trim()).toList(); @@ -71,10 +76,8 @@ class JsonParser extends Parser { input: def['input'] as String, target: def['target'] as String, metadata: { - ...Map.from( - def['metadata'] as Map? ?? {}, - ), - 'difficulty': def['difficulty'] as String? ?? 'medium', + ...metaRaw, + 'difficulty': metaRaw['difficulty'] as String? ?? 'medium', 'tags': tags, }, choices: choices, @@ -86,25 +89,28 @@ class JsonParser extends Parser { } } - // Parse Task-level settings - final model = data['model'] as String?; - final config = data['config'] is Map - ? 
Map.from(data['config'] as Map) + // Task-level Inspect AI args from inspect_task_args + final taskArgs = data['inspect_task_args'] is Map + ? Map.from(data['inspect_task_args'] as Map) + : {}; + final model = taskArgs['model'] as String?; + final config = taskArgs['config'] is Map + ? Map.from(taskArgs['config'] as Map) : null; - final modelRoles = data['model_roles'] is Map - ? Map.from(data['model_roles'] as Map) + final modelRoles = taskArgs['model_roles'] is Map + ? Map.from(taskArgs['model_roles'] as Map) : null; - final sandbox = data['sandbox']; - final approval = data['approval']; - final epochs = data['epochs']; - final failOnError = data['fail_on_error']; - final continueOnFail = data['continue_on_fail'] as bool?; - final messageLimit = data['message_limit'] as int?; - final tokenLimit = data['token_limit'] as int?; - final timeLimit = data['time_limit'] as int?; - final workingLimit = data['working_limit'] as int?; - final costLimit = (data['cost_limit'] as num?)?.toDouble(); - final earlyStopping = data['early_stopping']; + final sandbox = taskArgs['sandbox']; + final approval = taskArgs['approval']; + final epochs = taskArgs['epochs']; + final failOnError = taskArgs['fail_on_error']; + final continueOnFail = taskArgs['continue_on_fail'] as bool?; + final messageLimit = taskArgs['message_limit'] as int?; + final tokenLimit = taskArgs['token_limit'] as int?; + final timeLimit = taskArgs['time_limit'] as int?; + final workingLimit = taskArgs['working_limit'] as int?; + final costLimit = (taskArgs['cost_limit'] as num?)?.toDouble(); + final earlyStopping = taskArgs['early_stopping']; final displayName = data['display_name'] as String?; final version = data['version']; final taskMetadata = data['metadata'] is Map @@ -152,76 +158,23 @@ class JsonParser extends Parser { /// Parse a job from a pre-parsed map. Job parseJobFromMap(Map data) { + // Parse sandbox config + Map? 
sandbox; + final sandboxRaw = data['sandbox']; + if (sandboxRaw is Map) { + sandbox = Map.from(sandboxRaw); + } else if (sandboxRaw is String) { + sandbox = {'environment': sandboxRaw}; + } + return Job( logDir: (data['log_dir'] as String?) ?? '', - sandboxType: (data['sandbox_type'] as String?) ?? 'local', maxConnections: (data['max_connections'] as int?) ?? 10, models: (data['models'] as List?)?.cast(), saveExamples: data['save_examples'] == true, - // Promoted eval_set() fields - retryAttempts: data['retry_attempts'] as int?, - maxRetries: data['max_retries'] as int?, - retryWait: (data['retry_wait'] as num?)?.toDouble(), - retryConnections: (data['retry_connections'] as num?)?.toDouble(), - retryCleanup: data['retry_cleanup'] as bool?, - failOnError: (data['fail_on_error'] as num?)?.toDouble(), - continueOnFail: data['continue_on_fail'] as bool?, - retryOnError: data['retry_on_error'] as int?, - debugErrors: data['debug_errors'] as bool?, - maxSamples: data['max_samples'] as int?, - maxTasks: data['max_tasks'] as int?, - maxSubprocesses: data['max_subprocesses'] as int?, - maxSandboxes: data['max_sandboxes'] as int?, - logLevel: data['log_level'] as String?, - logLevelTranscript: data['log_level_transcript'] as String?, - logFormat: data['log_format'] as String?, - tags: (data['tags'] as List?)?.cast(), - metadata: data['metadata'] is Map - ? Map.from(data['metadata'] as Map) - : null, - trace: data['trace'] as bool?, - display: data['display'] as String?, - score: data['score'] as bool?, - limit: data['limit'], - sampleId: data['sample_id'], - sampleShuffle: data['sample_shuffle'], - epochs: data['epochs'], - approval: data['approval'], - solver: data['solver'], - sandboxCleanup: data['sandbox_cleanup'] as bool?, - modelBaseUrl: data['model_base_url'] as String?, - modelArgs: data['model_args'] is Map - ? Map.from(data['model_args'] as Map) - : null, - modelRoles: data['model_roles'] is Map - ? 
Map.from(data['model_roles'] as Map) - : null, - taskArgs: data['task_args'] is Map - ? Map.from(data['task_args'] as Map) - : null, - messageLimit: data['message_limit'] as int?, - tokenLimit: data['token_limit'] as int?, - timeLimit: data['time_limit'] as int?, - workingLimit: data['working_limit'] as int?, - costLimit: (data['cost_limit'] as num?)?.toDouble(), - modelCostConfig: data['model_cost_config'] is Map - ? Map.from(data['model_cost_config'] as Map) - : null, - logSamples: data['log_samples'] as bool?, - logRealtime: data['log_realtime'] as bool?, - logImages: data['log_images'] as bool?, - logBuffer: data['log_buffer'] as int?, - logShared: data['log_shared'] as int?, - bundleDir: data['bundle_dir'] as String?, - bundleOverwrite: data['bundle_overwrite'] as bool?, - logDirAllowDirty: data['log_dir_allow_dirty'] as bool?, - evalSetId: data['eval_set_id'] as String?, - // Pass-through sections - evalSetOverrides: data['eval_set_overrides'] is Map - ? Map.from(data['eval_set_overrides'] as Map) - : null, - taskDefaults: data['task_defaults'] is Map - ? Map.from(data['task_defaults'] as Map) + sandbox: sandbox, + inspectEvalArguments: data['inspect_eval_arguments'] is Map + ? 
Map.from(data['inspect_eval_arguments'] as Map) : null, ); } diff --git a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart index d7c0b36..b9b41c0 100644 --- a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart +++ b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart @@ -80,21 +80,22 @@ class YamlParser extends Parser { taskDir, ); - // Parse Task-level settings - final model = data['model'] as String?; - final config = _asMap(data['config']); - final modelRoles = _asStringMap(data['model_roles']); - final sandbox = data['sandbox']; - final approval = data['approval']; - final epochs = data['epochs']; - final failOnError = data['fail_on_error']; - final continueOnFail = data['continue_on_fail'] as bool?; - final messageLimit = data['message_limit'] as int?; - final tokenLimit = data['token_limit'] as int?; - final timeLimit = data['time_limit'] as int?; - final workingLimit = data['working_limit'] as int?; - final costLimit = (data['cost_limit'] as num?)?.toDouble(); - final earlyStopping = data['early_stopping']; + // Task-level Inspect AI args are nested under inspect_task_args + final taskArgs = _asMap(data['inspect_task_args']) ?? 
{}; + final model = taskArgs['model'] as String?; + final config = _asMap(taskArgs['config']); + final modelRoles = _asStringMap(taskArgs['model_roles']); + final sandbox = taskArgs['sandbox']; + final approval = taskArgs['approval']; + final epochs = taskArgs['epochs']; + final failOnError = taskArgs['fail_on_error']; + final continueOnFail = taskArgs['continue_on_fail'] as bool?; + final messageLimit = taskArgs['message_limit'] as int?; + final tokenLimit = taskArgs['token_limit'] as int?; + final timeLimit = taskArgs['time_limit'] as int?; + final workingLimit = taskArgs['working_limit'] as int?; + final costLimit = (taskArgs['cost_limit'] as num?)?.toDouble(); + final earlyStopping = taskArgs['early_stopping']; final displayName = data['display_name'] as String?; final version = data['version']; final taskMetadata = _asMap(data['metadata']); @@ -259,8 +260,11 @@ class YamlParser extends Parser { } } - final sampleWorkspace = doc['workspace']; - final sampleTests = doc['tests']; + // Read metadata fields from the metadata dict + final metaRaw = Map.from(doc['metadata'] as Map? ?? {}); + + final sampleWorkspace = metaRaw['workspace']; + final sampleTests = metaRaw['tests']; // Sample-level overrides task-level final effectiveWorkspace = sampleWorkspace ?? taskWorkspace; @@ -286,8 +290,8 @@ class YamlParser extends Parser { tests = _resolveResourcePath(taskTests, datasetRoot); } - // --- Normalize tags --- - final rawTags = doc['tags']; + // --- Normalize tags from metadata --- + final rawTags = metaRaw['tags']; final List tags; if (rawTags is String) { tags = rawTags.split(',').map((t) => t.trim()).toList(); @@ -299,8 +303,8 @@ class YamlParser extends Parser { // Build metadata with domain-specific fields final metadata = { - ...Map.from(doc['metadata'] as Map? ?? {}), - 'difficulty': doc['difficulty'] as String? ?? 'medium', + ...metaRaw, + 'difficulty': metaRaw['difficulty'] as String? ?? 
'medium', 'tags': tags, 'workspace': ?workspace, 'tests': ?tests, @@ -339,7 +343,6 @@ class YamlParser extends Parser { final data = readYamlFileAsMap(jobPath); final logsDir = (data['logs_dir'] as String?) ?? _kDefaultLogsDir; - final sandboxType = (data['sandbox_type'] as String?) ?? 'local'; final maxConnections = (data['max_connections'] as int?) ?? 10; // Resolve log directory with timestamp @@ -391,10 +394,8 @@ class YamlParser extends Parser { return Job( logDir: logDir, - sandboxType: sandboxType, maxConnections: maxConnections, description: data['description'] as String?, - imagePrefix: data['image_prefix'] as String?, models: (data['models'] as List?)?.cast(), variants: variants, taskPaths: taskPaths, @@ -402,66 +403,29 @@ class YamlParser extends Parser { taskFilters: taskFilters, sampleFilters: sampleFilters, saveExamples: data['save_examples'] == true, - // Promoted eval_set() fields - retryAttempts: data['retry_attempts'] as int?, - maxRetries: data['max_retries'] as int?, - retryWait: (data['retry_wait'] as num?)?.toDouble(), - retryConnections: (data['retry_connections'] as num?)?.toDouble(), - retryCleanup: data['retry_cleanup'] as bool?, - failOnError: (data['fail_on_error'] as num?)?.toDouble(), - continueOnFail: data['continue_on_fail'] as bool?, - retryOnError: data['retry_on_error'] as int?, - debugErrors: data['debug_errors'] as bool?, - maxSamples: data['max_samples'] as int?, - maxTasks: data['max_tasks'] as int?, - maxSubprocesses: data['max_subprocesses'] as int?, - maxSandboxes: data['max_sandboxes'] as int?, - logLevel: data['log_level'] as String?, - logLevelTranscript: data['log_level_transcript'] as String?, - logFormat: data['log_format'] as String?, - tags: (data['tags'] as List?)?.cast(), - metadata: _asMap(data['metadata']), - trace: data['trace'] as bool?, - display: data['display'] as String?, - score: data['score'] as bool?, - limit: data['limit'], - sampleId: data['sample_id'], - sampleShuffle: data['sample_shuffle'], - 
epochs: data['epochs'], - approval: data['approval'], - solver: data['solver'], - sandboxCleanup: data['sandbox_cleanup'] as bool?, - modelBaseUrl: data['model_base_url'] as String?, - modelArgs: _asObjectMap(data['model_args']), - modelRoles: _asStringMap(data['model_roles']), - taskArgs: _asObjectMap(data['task_args']), - messageLimit: data['message_limit'] as int?, - tokenLimit: data['token_limit'] as int?, - timeLimit: data['time_limit'] as int?, - workingLimit: data['working_limit'] as int?, - costLimit: (data['cost_limit'] as num?)?.toDouble(), - modelCostConfig: _asObjectMap(data['model_cost_config']), - logSamples: data['log_samples'] as bool?, - logRealtime: data['log_realtime'] as bool?, - logImages: data['log_images'] as bool?, - logBuffer: data['log_buffer'] as int?, - logShared: data['log_shared'] as int?, - bundleDir: data['bundle_dir'] as String?, - bundleOverwrite: data['bundle_overwrite'] as bool?, - logDirAllowDirty: data['log_dir_allow_dirty'] as bool?, - evalSetId: data['eval_set_id'] as String?, - // Pass-through sections - evalSetOverrides: _asMap(data['eval_set_overrides']), - taskDefaults: _asMap(data['task_defaults']), + // Sandbox configuration + sandbox: _parseSandbox(data['sandbox']), + // All inspect eval arguments + inspectEvalArguments: _asMap(data['inspect_eval_arguments']), ); } + /// Parse sandbox config from YAML value. + /// + /// Supports both string shorthand ('podman') and map form. + static Map? _parseSandbox(Object? value) { + if (value is Map) { + return Map.from(value); + } else if (value is String) { + return {'environment': value}; + } + return null; + } + /// Create a [Job] with default settings (when no job file is provided). Job createDefaultJob(String baseDir) { return Job( logDir: _resolveLogDir(_kDefaultLogsDir, baseDir), - sandboxType: 'local', - maxConnections: 10, ); } @@ -481,11 +445,6 @@ class YamlParser extends Parser { return null; } - /// Safely cast a YAML value to `Map?`. - static Map? 
_asObjectMap(Object? value) { - if (value is Map) return Map.from(value); - return null; - } // ------------------------------------------------------------------ // Path resolution helpers diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index 53d6b47..09b32ea 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -79,7 +79,8 @@ class EvalSetResolver { String datasetRoot, ) { final models = _resolveModels(job); - final sandboxTypeStr = job.sandboxType; + final sandboxCfg = job.sandbox ?? {}; + final sandboxTypeStr = (sandboxCfg['environment'] as String?) ?? 'local'; final expandedTasks = _expandTaskConfigs( datasetTasks, job, @@ -126,11 +127,14 @@ class EvalSetResolver { required Job job, }) { final inspectTasks = []; + final sandboxCfg = job.sandbox ?? {}; + final sandboxTypeStr = (sandboxCfg['environment'] as String?) ?? 'local'; final isContainer = - job.sandboxType.isNotEmpty && job.sandboxType != 'local'; + sandboxTypeStr.isNotEmpty && sandboxTypeStr != 'local'; - // Parse task_defaults from the job - final taskDefaults = job.taskDefaults ?? {}; + // Parse task_defaults from inspect_eval_arguments + final evalArgs = job.inspectEvalArguments ?? {}; + final taskDefaults = (evalArgs['task_defaults'] as Map?) ?? 
        {};

    for (final tc in taskConfigs) {
      // Enrich each sample with task-level metadata
@@ -212,9 +216,9 @@ class EvalSetResolver {
        if (tc.systemMessage != null) 'system_message': tc.systemMessage,
        if (tc.saveExamples) 'save_examples': true,
        if (tc.examplesDir != null) 'examples_dir': tc.examplesDir,
-        // Propagate image_prefix from job for container image resolution
-        if (job.imagePrefix != null && job.imagePrefix!.isNotEmpty)
-          'image_prefix': job.imagePrefix,
+        // Propagate image_prefix from sandbox for container image resolution
+        if (sandboxCfg['image_prefix'] != null)
+          'image_prefix': sandboxCfg['image_prefix'],
        // Merge any task-level metadata from YAML
        ...?tc.metadata,
      };
@@ -224,7 +228,7 @@ class EvalSetResolver {
      if (tc.sandbox != null) {
        // Task-level sandbox override
        taskSandbox = tc.sandbox;
-      } else if (tc.sandboxType.isNotEmpty && tc.sandboxType != 'local') {
+      } else if (sandboxTypeStr != 'local') {
        taskSandbox = _serializeSandbox(sandbox);
      }
@@ -233,7 +237,7 @@ class EvalSetResolver {
      final resolvedTimeLimit =
          tc.timeLimit ??
          taskDefaults['time_limit'] as int? ??
-          (job.sandboxType != 'local' ? 300 : null);
+          (sandboxTypeStr != 'local' ? 300 : null);
      final resolvedMessageLimit =
          tc.messageLimit ?? taskDefaults['message_limit'] as int?;
      final resolvedTokenLimit =
@@ -287,9 +291,17 @@ class EvalSetResolver {
      );
    }

-    // Build the EvalSet with all job-level parameters.
-    // Start with any eval_set_overrides, then apply explicit fields.
-    final overrides = job.evalSetOverrides ?? {};
+    // Build the EvalSet with all job-level parameters from inspect_eval_arguments.
+    final evalSetOverrides = (evalArgs['eval_set_overrides'] as Map<String, Object?>?) ?? {};
+
+    // Helper to get a value from evalArgs then overrides
+    T? getArg<T>(String key, [T? defaultVal]) {
+      final v = evalArgs[key] as T?;
+      if (v != null) return v;
+      final o = evalSetOverrides[key] as T?;
+      if (o != null) return o;
+      return defaultVal;
+    }

    return EvalSet(
      tasks: inspectTasks,
@@ -297,90 +309,73 @@ class EvalSetResolver {
      model: models,
      sandbox: _serializeSandbox(sandbox),
      // Retry settings
-      retryAttempts:
-          job.retryAttempts ?? overrides['retry_attempts'] as int? ?? 10,
-      retryWait:
-          job.retryWait ?? (overrides['retry_wait'] as num?)?.toDouble() ?? 60,
+      retryAttempts: getArg('retry_attempts', 10),
+      retryWait: (getArg<num>('retry_wait', 60))?.toDouble() ?? 60,
      retryConnections:
-          job.retryConnections ??
-          (overrides['retry_connections'] as num?)?.toDouble() ??
-          0.5,
-      retryCleanup: job.retryCleanup ?? overrides['retry_cleanup'] as bool?,
+          (getArg<num>('retry_connections', 0.5))?.toDouble() ?? 0.5,
+      retryCleanup: getArg('retry_cleanup'),
      retryOnError:
-          job.retryOnError ??
-          job.maxRetries ??
-          overrides['retry_on_error'] as int?,
+          getArg('retry_on_error') ?? getArg('max_retries'),
      // Error handling
-      failOnError:
-          job.failOnError ??
-          (overrides['fail_on_error'] as num?)?.toDouble() ??
-          0.05,
-      continueOnFail:
-          job.continueOnFail ?? overrides['continue_on_fail'] as bool?,
-      debugErrors: job.debugErrors ?? overrides['debug_errors'] as bool?,
+      failOnError: (getArg<num>('fail_on_error', 0.05))?.toDouble() ?? 0.05,
+      continueOnFail: getArg('continue_on_fail'),
+      debugErrors: getArg('debug_errors'),
      // Concurrency
-      maxSamples: job.maxSamples ?? overrides['max_samples'] as int?,
-      maxTasks: job.maxTasks ?? overrides['max_tasks'] as int?,
-      maxSubprocesses:
-          job.maxSubprocesses ?? overrides['max_subprocesses'] as int?,
-      maxSandboxes: job.maxSandboxes ?? overrides['max_sandboxes'] as int?,
+      maxSamples: getArg('max_samples'),
+      maxTasks: getArg('max_tasks'),
+      maxSubprocesses: getArg('max_subprocesses'),
+      maxSandboxes: getArg('max_sandboxes'),
      // Logging
-      logLevel: job.logLevel ?? overrides['log_level'] as String? ??
'info',
-      logLevelTranscript:
-          job.logLevelTranscript ??
-          overrides['log_level_transcript'] as String?,
-      logFormat: job.logFormat ?? overrides['log_format'] as String? ?? 'json',
-      logSamples: job.logSamples ?? overrides['log_samples'] as bool?,
-      logRealtime: job.logRealtime ?? overrides['log_realtime'] as bool?,
-      logImages: job.logImages ?? overrides['log_images'] as bool?,
-      logBuffer: job.logBuffer ?? overrides['log_buffer'] as int?,
-      logShared: job.logShared ?? overrides['log_shared'] as int?,
-      logDirAllowDirty:
-          job.logDirAllowDirty ?? overrides['log_dir_allow_dirty'] as bool?,
+      logLevel: getArg('log_level', 'info'),
+      logLevelTranscript: getArg('log_level_transcript'),
+      logFormat: getArg('log_format', 'json'),
+      logSamples: getArg('log_samples'),
+      logRealtime: getArg('log_realtime'),
+      logImages: getArg('log_images'),
+      logBuffer: getArg('log_buffer'),
+      logShared: getArg('log_shared'),
+      logDirAllowDirty: getArg('log_dir_allow_dirty'),
      // Model config
-      modelBaseUrl: job.modelBaseUrl ?? overrides['model_base_url'] as String?,
+      modelBaseUrl: getArg('model_base_url'),
      modelArgs:
-          job.modelArgs ??
-          (overrides['model_args'] as Map<String, Object?>?) ??
+          (evalArgs['model_args'] as Map<String, Object?>?) ??
+          (evalSetOverrides['model_args'] as Map<String, Object?>?) ??
          const {},
      modelRoles:
-          job.modelRoles ?? overrides['model_roles'] as Map<String, String>?,
+          (evalArgs['model_roles'] as Map<String, String>?) ??
+          evalSetOverrides['model_roles'] as Map<String, String>?,
      taskArgs:
-          job.taskArgs ??
-          (overrides['task_args'] as Map<String, Object?>?) ??
+          (evalArgs['task_args'] as Map<String, Object?>?) ??
+          (evalSetOverrides['task_args'] as Map<String, Object?>?) ??
          const {},
      modelCostConfig:
-          job.modelCostConfig ??
-          overrides['model_cost_config'] as Map<String, Object?>?,
+          (evalArgs['model_cost_config'] as Map<String, Object?>?) ??
+          evalSetOverrides['model_cost_config'] as Map<String, Object?>?,
      // Sandbox
-      sandboxCleanup:
-          job.sandboxCleanup ?? overrides['sandbox_cleanup'] as bool?,
+      sandboxCleanup: getArg('sandbox_cleanup'),
      // Sample control
-      limit: job.limit ?? overrides['limit'],
-      sampleId: job.sampleId ?? overrides['sample_id'],
-      sampleShuffle: job.sampleShuffle ?? overrides['sample_shuffle'],
-      epochs: job.epochs ?? overrides['epochs'],
+      limit: evalArgs['limit'] ?? evalSetOverrides['limit'],
+      sampleId: evalArgs['sample_id'] ?? evalSetOverrides['sample_id'],
+      sampleShuffle: evalArgs['sample_shuffle'] ?? evalSetOverrides['sample_shuffle'],
+      epochs: evalArgs['epochs'] ?? evalSetOverrides['epochs'],
      // Misc
-      tags: job.tags ?? (overrides['tags'] as List?)?.cast<String>(),
-      metadata: job.metadata ?? overrides['metadata'] as Map<String, Object?>?,
-      trace: job.trace ?? overrides['trace'] as bool?,
-      display: job.display ?? overrides['display'] as String?,
-      approval: job.approval ?? overrides['approval'],
-      solver: job.solver ?? overrides['solver'],
-      score: job.score ?? overrides['score'] as bool? ?? true,
+      tags: (evalArgs['tags'] as List?)?.cast<String>() ?? (evalSetOverrides['tags'] as List?)?.cast<String>(),
+      metadata: (evalArgs['metadata'] as Map<String, Object?>?) ?? evalSetOverrides['metadata'] as Map<String, Object?>?,
+      trace: getArg('trace'),
+      display: getArg('display'),
+      approval: evalArgs['approval'] ?? evalSetOverrides['approval'],
+      solver: evalArgs['solver'] ?? evalSetOverrides['solver'],
+      score: getArg('score', true) ?? true,
      // Limits
-      messageLimit: job.messageLimit ?? overrides['message_limit'] as int?,
-      tokenLimit: job.tokenLimit ?? overrides['token_limit'] as int?,
-      timeLimit: job.timeLimit ?? overrides['time_limit'] as int?,
-      workingLimit: job.workingLimit ?? overrides['working_limit'] as int?,
-      costLimit: job.costLimit ?? (overrides['cost_limit'] as num?)?.toDouble(),
+      messageLimit: getArg('message_limit'),
+      tokenLimit: getArg('token_limit'),
+      timeLimit: getArg('time_limit'),
+      workingLimit: getArg('working_limit'),
+      costLimit: (getArg<num>('cost_limit'))?.toDouble(),
      // Bundling
-      bundleDir: job.bundleDir ?? overrides['bundle_dir'] as String?,
-      bundleOverwrite:
-          job.bundleOverwrite ??
-          overrides['bundle_overwrite'] as bool? ??
-          false,
-      evalSetId: job.evalSetId ??
overrides['eval_set_id'] as String?,
+      bundleDir: getArg('bundle_dir'),
+      bundleOverwrite: getArg('bundle_overwrite', false) ?? false,
+      evalSetId: getArg('eval_set_id'),
    );
  }

@@ -406,7 +401,8 @@ class EvalSetResolver {
    Job job, {
    String? branch,
  }) {
-    final sandboxType = job.sandboxType;
+    final sandboxCfg = job.sandbox ?? {};
+    final sandboxType = (sandboxCfg['environment'] as String?) ?? 'local';
    if (sandboxType.isEmpty || sandboxType == 'local') return 'local';

    // Branch override → look up branch-specific sandbox
@@ -505,11 +501,8 @@ class EvalSetResolver {
    }).toList();
  }

-    // Apply system_message override
+    // Apply system_message from task (no longer overridden by job task)
    var systemMessage = taskConfig.systemMessage;
-    if (jobTask?.systemMessage != null) {
-      systemMessage = jobTask!.systemMessage;
-    }

    // Merge job-task args into metadata
    Map<String, Object?>? mergedMetadata = taskConfig.metadata;
diff --git a/packages/dataset_config_dart/test/eval_set_resolver_test.dart b/packages/dataset_config_dart/test/eval_set_resolver_test.dart
index 48d4c6d..e76e799 100644
--- a/packages/dataset_config_dart/test/eval_set_resolver_test.dart
+++ b/packages/dataset_config_dart/test/eval_set_resolver_test.dart
@@ -45,23 +45,21 @@ void main() {
  /// Helper to create a minimal [Job] for testing.
  Job makeJob({
    String logDir = '/tmp/logs',
-    String sandboxType = 'local',
+    Map<String, Object?>? sandbox,
    List<String>? models,
    Map<String, Map<String, Object?>>? variants,
    Map<String, JobTask>? tasks,
    bool saveExamples = false,
-    Map<String, Object?>? taskDefaults,
-    String? imagePrefix,
+    Map<String, Object?>? inspectEvalArguments,
  }) {
    return Job(
      logDir: logDir,
-      sandboxType: sandboxType,
+      sandbox: sandbox,
      models: models,
      variants: variants,
      tasks: tasks,
      saveExamples: saveExamples,
-      taskDefaults: taskDefaults,
-      imagePrefix: imagePrefix,
+      inspectEvalArguments: inspectEvalArguments,
    );
  }

@@ -237,7 +235,7 @@ void main() {
    test('local sandbox resolves to null in output', () {
      final results = resolver.resolve(
        [makeTask()],
-        makeJob(models: ['m'], sandboxType: 'local'),
+        makeJob(models: ['m'], sandbox: {'environment': 'local'}),
        '/tmp/dataset',
      );
@@ -323,7 +321,7 @@ void main() {
        [makeTask()],
        makeJob(
          models: ['m'],
-          taskDefaults: {'time_limit': 999, 'message_limit': 77},
+          inspectEvalArguments: {'task_defaults': {'time_limit': 999, 'message_limit': 77}},
        ),
        '/tmp/dataset',
      );
@@ -338,7 +336,7 @@ void main() {
        [makeTask(timeLimit: 100)],
        makeJob(
          models: ['m'],
-          taskDefaults: {'time_limit': 999},
+          inspectEvalArguments: {'task_defaults': {'time_limit': 999}},
        ),
        '/tmp/dataset',
      );
@@ -349,11 +347,13 @@ void main() {
    test('job-level eval_set fields propagate', () {
      final results = resolver.resolve(
        [makeTask()],
-        const Job(
+        Job(
          logDir: '/tmp/logs',
          models: ['m'],
-          retryAttempts: 42,
-          logLevel: 'debug',
+          inspectEvalArguments: {
+            'retry_attempts': 42,
+            'log_level': 'debug',
+          },
        ),
        '/tmp/dataset',
      );
@@ -398,12 +398,15 @@ void main() {
      expect(taskNames, isNot(contains('test_task:mcp_only')));
    });

-    test('image_prefix from job appears in task metadata', () {
+    test('image_prefix from sandbox appears in task metadata', () {
      final results = resolver.resolve(
        [makeTask()],
        makeJob(
          models: ['m'],
-          imagePrefix: 'us-central1-docker.pkg.dev/my-project/repo/',
+          sandbox: {
+            'environment': 'podman',
+            'image_prefix': 'us-central1-docker.pkg.dev/my-project/repo/',
+          },
        ),
        '/tmp/dataset',
      );
diff --git a/packages/dataset_config_dart/test/json_parser_test.dart b/packages/dataset_config_dart/test/json_parser_test.dart
index 3763af6..51dc2b0 100644
---
a/packages/dataset_config_dart/test/json_parser_test.dart +++ b/packages/dataset_config_dart/test/json_parser_test.dart @@ -58,7 +58,7 @@ void main() { ); }); - test('normalises tags from comma-separated string', () { + test('normalises tags from comma-separated string in metadata', () { final tasks = parser.parseTasksFromMaps([ { 'id': 'tagged_task', @@ -68,7 +68,9 @@ void main() { 'id': 's1', 'input': 'q', 'target': 'a', - 'tags': 'flutter, dart, widgets', + 'metadata': { + 'tags': 'flutter, dart, widgets', + }, }, ], }, @@ -79,7 +81,7 @@ void main() { expect(metadata['tags'], equals(['flutter', 'dart', 'widgets'])); }); - test('normalises tags from list', () { + test('normalises tags from list in metadata', () { final tasks = parser.parseTasksFromMaps([ { 'id': 'tagged_task', @@ -89,7 +91,9 @@ void main() { 'id': 's1', 'input': 'q', 'target': 'a', - 'tags': ['tag1', 'tag2'], + 'metadata': { + 'tags': ['tag1', 'tag2'], + }, }, ], }, @@ -157,21 +161,23 @@ void main() { expect(sample.files, {'main.dart': 'void main() {}'}); }); - test('parses all task-level settings', () { + test('parses all task-level settings from inspect_task_args', () { final tasks = parser.parseTasksFromMaps([ { 'id': 'full_task', 'func': 'my_func', 'system_message': 'Be helpful', 'allowed_variants': ['baseline', 'full'], - 'model': 'gemini-pro', - 'config': {'temperature': 0.5}, - 'model_roles': {'grader': 'gpt-4o'}, - 'message_limit': 50, - 'token_limit': 4096, - 'time_limit': 600, - 'working_limit': 300, - 'cost_limit': 1.5, + 'inspect_task_args': { + 'model': 'gemini-pro', + 'config': {'temperature': 0.5}, + 'model_roles': {'grader': 'gpt-4o'}, + 'message_limit': 50, + 'token_limit': 4096, + 'time_limit': 600, + 'working_limit': 300, + 'cost_limit': 1.5, + }, 'display_name': 'Full Task', 'version': 2, 'metadata': {'author': 'test'}, @@ -214,7 +220,6 @@ void main() { final job = parser.parseJobFromMap({}); expect(job.logDir, ''); - expect(job.sandboxType, 'local'); 
expect(job.maxConnections, 10); expect(job.models, isNull); expect(job.saveExamples, false); @@ -223,53 +228,67 @@ void main() { test('parses all core fields', () { final job = parser.parseJobFromMap({ 'log_dir': './logs/run1', - 'sandbox_type': 'podman', + 'sandbox': {'environment': 'podman'}, 'max_connections': 5, 'models': ['gemini-pro', 'gpt-4o'], 'save_examples': true, }); expect(job.logDir, './logs/run1'); - expect(job.sandboxType, 'podman'); + expect(job.sandbox, {'environment': 'podman'}); expect(job.maxConnections, 5); expect(job.models, ['gemini-pro', 'gpt-4o']); expect(job.saveExamples, true); }); - test('parses promoted eval_set fields', () { + test('parses sandbox string shorthand', () { + final job = parser.parseJobFromMap({ + 'sandbox': 'podman', + }); + + expect(job.sandbox, {'environment': 'podman'}); + }); + + test('parses inspect_eval_arguments', () { final job = parser.parseJobFromMap({ - 'retry_attempts': 20, - 'max_retries': 3, - 'retry_wait': 5.0, - 'fail_on_error': 0.5, - 'continue_on_fail': true, - 'max_samples': 100, - 'max_tasks': 4, - 'log_level': 'debug', - 'tags': ['ci', 'nightly'], - 'metadata': {'run_by': 'bot'}, + 'inspect_eval_arguments': { + 'retry_attempts': 20, + 'max_retries': 3, + 'retry_wait': 5.0, + 'fail_on_error': 0.5, + 'continue_on_fail': true, + 'max_samples': 100, + 'max_tasks': 4, + 'log_level': 'debug', + 'tags': ['ci', 'nightly'], + 'metadata': {'run_by': 'bot'}, + }, }); - expect(job.retryAttempts, 20); - expect(job.maxRetries, 3); - expect(job.retryWait, 5.0); - expect(job.failOnError, 0.5); - expect(job.continueOnFail, true); - expect(job.maxSamples, 100); - expect(job.maxTasks, 4); - expect(job.logLevel, 'debug'); - expect(job.tags, ['ci', 'nightly']); - expect(job.metadata, {'run_by': 'bot'}); + final evalArgs = job.inspectEvalArguments!; + expect(evalArgs['retry_attempts'], 20); + expect(evalArgs['max_retries'], 3); + expect(evalArgs['retry_wait'], 5.0); + expect(evalArgs['fail_on_error'], 0.5); + 
expect(evalArgs['continue_on_fail'], true); + expect(evalArgs['max_samples'], 100); + expect(evalArgs['max_tasks'], 4); + expect(evalArgs['log_level'], 'debug'); + expect(evalArgs['tags'], ['ci', 'nightly']); + expect(evalArgs['metadata'], {'run_by': 'bot'}); }); - test('parses pass-through overrides', () { + test('parses nested overrides in inspect_eval_arguments', () { final job = parser.parseJobFromMap({ - 'eval_set_overrides': {'custom_key': 'custom_value'}, - 'task_defaults': {'time_limit': 600}, + 'inspect_eval_arguments': { + 'eval_set_overrides': {'custom_key': 'custom_value'}, + 'task_defaults': {'time_limit': 600}, + }, }); - expect(job.evalSetOverrides, {'custom_key': 'custom_value'}); - expect(job.taskDefaults, {'time_limit': 600}); + final evalArgs = job.inspectEvalArguments!; + expect(evalArgs['eval_set_overrides'], {'custom_key': 'custom_value'}); + expect(evalArgs['task_defaults'], {'time_limit': 600}); }); }); diff --git a/packages/dataset_config_python/src/dataset_config_python/models/job.py b/packages/dataset_config_python/src/dataset_config_python/models/job.py index b259ed1..16f2568 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/job.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/job.py @@ -21,9 +21,6 @@ class JobTask(BaseModel): exclude_samples: list[str] | None = None """Exclude these sample IDs.""" - system_message: str | None = None - """Override system message for this task.""" - args: dict[str, Any] | None = None """Per-task argument overrides passed to the task function.""" @@ -36,7 +33,6 @@ def from_yaml(task_id: str, data: dict[str, Any] | None) -> JobTask: id=task_id, include_samples=data.get("include-samples"), exclude_samples=data.get("exclude-samples"), - system_message=data.get("system_message"), args=data.get("args"), ) @@ -46,9 +42,7 @@ class Job(BaseModel): # Core settings description: str | None = None - image_prefix: str | None = None log_dir: str - sandbox_type: str = 
"local" max_connections: int = 10 models: list[str] | None = None variants: dict[str, dict[str, Any]] | None = None @@ -56,58 +50,13 @@ class Job(BaseModel): tasks: dict[str, JobTask] | None = None save_examples: bool = False - # Promoted eval_set() parameters - retry_attempts: int | None = None - max_retries: int | None = None - retry_wait: float | None = None - retry_connections: float | None = None - retry_cleanup: bool | None = None - fail_on_error: float | None = None - continue_on_fail: bool | None = None - retry_on_error: int | None = None - debug_errors: bool | None = None - max_samples: int | None = None - max_tasks: int | None = None - max_subprocesses: int | None = None - max_sandboxes: int | None = None - log_level: str | None = None - log_level_transcript: str | None = None - log_format: str | None = None - tags: list[str] | None = None - metadata: dict[str, Any] | None = None - trace: bool | None = None - display: str | None = None - score: bool | None = None - limit: Any | None = None - sample_id: Any | None = None - sample_shuffle: Any | None = None - epochs: Any | None = None - approval: Any | None = None - solver: Any | None = None - sandbox_cleanup: bool | None = None - model_base_url: str | None = None - model_args: dict[str, Any] | None = None - model_roles: dict[str, str] | None = None - task_args: dict[str, Any] | None = None - message_limit: int | None = None - token_limit: int | None = None - time_limit: int | None = None - working_limit: int | None = None - cost_limit: float | None = None - model_cost_config: dict[str, Any] | None = None - log_samples: bool | None = None - log_realtime: bool | None = None - log_images: bool | None = None - log_buffer: int | None = None - log_shared: int | None = None - bundle_dir: str | None = None - bundle_overwrite: bool | None = None - log_dir_allow_dirty: bool | None = None - eval_set_id: str | None = None - - # Pass-through overrides - eval_set_overrides: dict[str, Any] | None = None - task_defaults: 
dict[str, Any] | None = None + # Sandbox configuration + sandbox: dict[str, Any] | None = None + """Sandbox config with keys: environment, parameters, image_prefix.""" + + # Inspect eval arguments (passed through to eval_set()) + inspect_eval_arguments: dict[str, Any] | None = None + """All Inspect AI eval_set() parameters, nested under one key.""" # Tag-based filtering task_filters: TagFilter | None = None diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py index dd8ed1d..e381985 100644 --- a/packages/dataset_config_python/src/dataset_config_python/parser.py +++ b/packages/dataset_config_python/src/dataset_config_python/parser.py @@ -257,19 +257,25 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: allowed_variants = data.get("allowed_variants") - # Parse samples section + # Parse samples section (optional) samples_raw = data.get("samples") - if not isinstance(samples_raw, dict): + if samples_raw is None: + samples: list[Sample] = [] + elif not isinstance(samples_raw, dict): raise ValueError( f"Task '{task_id}': 'samples' must be a dict with 'inline' and/or " f"'paths' keys, got {type(samples_raw).__name__}" ) - samples = _load_samples_section(samples_raw, dataset_root, task_workspace, task_tests, task_dir) + else: + samples = _load_samples_section(samples_raw, dataset_root, task_workspace, task_tests, task_dir) # Parse variant_filters (tag-based variant restriction) variant_filters_raw = data.get("variant_filters") variant_filters = TagFilter(**variant_filters_raw) if isinstance(variant_filters_raw, dict) else None + # Task-level Inspect AI args are nested under inspect_task_args + task_args = data.get("inspect_task_args") or {} + return [ ParsedTask( id=task_id, @@ -278,20 +284,20 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: samples=samples, system_message=system_message, allowed_variants=allowed_variants, 
- model=data.get("model"), - config=data.get("config") if isinstance(data.get("config"), dict) else None, - model_roles=data.get("model_roles") if isinstance(data.get("model_roles"), dict) else None, - sandbox=data.get("sandbox"), - approval=data.get("approval"), - epochs=data.get("epochs"), - fail_on_error=data.get("fail_on_error"), - continue_on_fail=data.get("continue_on_fail"), - message_limit=data.get("message_limit"), - token_limit=data.get("token_limit"), - time_limit=data.get("time_limit"), - working_limit=data.get("working_limit"), - cost_limit=float(data["cost_limit"]) if data.get("cost_limit") is not None else None, - early_stopping=data.get("early_stopping"), + model=task_args.get("model"), + config=task_args.get("config") if isinstance(task_args.get("config"), dict) else None, + model_roles=task_args.get("model_roles") if isinstance(task_args.get("model_roles"), dict) else None, + sandbox=task_args.get("sandbox"), + approval=task_args.get("approval"), + epochs=task_args.get("epochs"), + fail_on_error=task_args.get("fail_on_error"), + continue_on_fail=task_args.get("continue_on_fail"), + message_limit=task_args.get("message_limit"), + token_limit=task_args.get("token_limit"), + time_limit=task_args.get("time_limit"), + working_limit=task_args.get("working_limit"), + cost_limit=float(task_args["cost_limit"]) if task_args.get("cost_limit") is not None else None, + early_stopping=task_args.get("early_stopping"), display_name=data.get("display_name"), version=data.get("version"), metadata=data.get("metadata") if isinstance(data.get("metadata"), dict) else None, @@ -385,8 +391,11 @@ def _resolve_sample( f"Sample '{doc.get('id', 'unknown')}' missing required field: {field}" ) - sample_workspace = doc.get("workspace") - sample_tests = doc.get("tests") + # Read metadata fields from the metadata dict + meta_raw: dict[str, Any] = doc.get("metadata") or {} + + sample_workspace = meta_raw.get("workspace") + sample_tests = meta_raw.get("tests") effective_workspace = 
sample_workspace if sample_workspace is not None else task_workspace @@ -408,8 +417,8 @@ def _resolve_sample( elif task_tests is not None: tests = _resolve_resource_path(task_tests, dataset_root) - # Normalize tags - raw_tags = doc.get("tags") + # Normalize tags from metadata + raw_tags = meta_raw.get("tags") if isinstance(raw_tags, str): tags = [t.strip() for t in raw_tags.split(",")] elif isinstance(raw_tags, list): @@ -418,8 +427,8 @@ def _resolve_sample( tags = [] # Build metadata - meta: dict[str, Any] = {**(doc.get("metadata") or {})} - meta["difficulty"] = doc.get("difficulty", "medium") + meta: dict[str, Any] = {**meta_raw} + meta["difficulty"] = meta_raw.get("difficulty", "medium") meta["tags"] = tags if workspace is not None: meta["workspace"] = workspace @@ -457,7 +466,14 @@ def parse_job(job_path: str, dataset_root: str) -> Job: logs_dir = data.get("logs_dir") or _DEFAULT_LOGS_DIR log_dir = _resolve_log_dir(logs_dir, dataset_root) - sandbox_type = data.get("sandbox_type") or "local" + # Parse sandbox config + sandbox_raw = data.get("sandbox") + sandbox = None + if isinstance(sandbox_raw, dict): + sandbox = sandbox_raw + elif isinstance(sandbox_raw, str): + sandbox = {"environment": sandbox_raw} + max_connections = data.get("max_connections") or 10 # Parse task filters @@ -483,82 +499,24 @@ def parse_job(job_path: str, dataset_root: str) -> Job: else: variants[str(key)] = {} + # Parse inspect_eval_arguments + inspect_eval_arguments = data.get("inspect_eval_arguments") + if isinstance(inspect_eval_arguments, dict): + inspect_eval_arguments = dict(inspect_eval_arguments) + else: + inspect_eval_arguments = None + return Job( log_dir=log_dir, - sandbox_type=sandbox_type, max_connections=max_connections, models=data.get("models"), variants=variants, task_paths=task_paths, tasks=tasks, save_examples=data.get("save_examples") is True, - retry_attempts=data.get("retry_attempts"), - max_retries=data.get("max_retries"), - retry_wait=float(data["retry_wait"]) if 
data.get("retry_wait") is not None else None, - retry_connections=( - float(data["retry_connections"]) if data.get("retry_connections") is not None else None - ), - retry_cleanup=data.get("retry_cleanup"), - fail_on_error=( - float(data["fail_on_error"]) if data.get("fail_on_error") is not None else None - ), - continue_on_fail=data.get("continue_on_fail"), - retry_on_error=data.get("retry_on_error"), - debug_errors=data.get("debug_errors"), - max_samples=data.get("max_samples"), - max_tasks=data.get("max_tasks"), - max_subprocesses=data.get("max_subprocesses"), - max_sandboxes=data.get("max_sandboxes"), - log_level=data.get("log_level"), - log_level_transcript=data.get("log_level_transcript"), - log_format=data.get("log_format"), - tags=data.get("tags"), - metadata=data.get("metadata") if isinstance(data.get("metadata"), dict) else None, - trace=data.get("trace"), - display=data.get("display"), - score=data.get("score"), - limit=data.get("limit"), - sample_id=data.get("sample_id"), - sample_shuffle=data.get("sample_shuffle"), - epochs=data.get("epochs"), - approval=data.get("approval"), - solver=data.get("solver"), - sandbox_cleanup=data.get("sandbox_cleanup"), - model_base_url=data.get("model_base_url"), - model_args=data.get("model_args") if isinstance(data.get("model_args"), dict) else None, - model_roles=( - data.get("model_roles") if isinstance(data.get("model_roles"), dict) else None - ), - task_args=data.get("task_args") if isinstance(data.get("task_args"), dict) else None, - message_limit=data.get("message_limit"), - token_limit=data.get("token_limit"), - time_limit=data.get("time_limit"), - working_limit=data.get("working_limit"), - cost_limit=float(data["cost_limit"]) if data.get("cost_limit") is not None else None, - model_cost_config=( - data.get("model_cost_config") - if isinstance(data.get("model_cost_config"), dict) - else None - ), - log_samples=data.get("log_samples"), - log_realtime=data.get("log_realtime"), - log_images=data.get("log_images"), - 
log_buffer=data.get("log_buffer"), - log_shared=data.get("log_shared"), - bundle_dir=data.get("bundle_dir"), - bundle_overwrite=data.get("bundle_overwrite"), - log_dir_allow_dirty=data.get("log_dir_allow_dirty"), - eval_set_id=data.get("eval_set_id"), - eval_set_overrides=( - data.get("eval_set_overrides") - if isinstance(data.get("eval_set_overrides"), dict) - else None - ), - task_defaults=( - data.get("task_defaults") if isinstance(data.get("task_defaults"), dict) else None - ), description=data.get("description"), - image_prefix=data.get("image_prefix"), + sandbox=sandbox, + inspect_eval_arguments=inspect_eval_arguments, task_filters=data.get("task_filters"), sample_filters=data.get("sample_filters"), ) diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 1c67c4a..02e14f3 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -141,7 +141,8 @@ def _resolve_job( ) -> list[EvalSet]: """Resolve task configs and job into EvalSet objects.""" models = job.models if job.models else list(DEFAULT_MODELS) - sandbox_type_str = job.sandbox_type + sandbox_cfg = job.sandbox or {} + sandbox_type_str = sandbox_cfg.get("environment", "local") expanded_tasks = _expand_task_configs(dataset_tasks, job, sandbox_type_str, dataset_root) @@ -178,8 +179,11 @@ def _build_eval_set( ) -> EvalSet: """Build an EvalSet from resolved ParsedTasks.""" inspect_tasks: list[Task] = [] - is_container = job.sandbox_type and job.sandbox_type != "local" - task_defaults = job.task_defaults or {} + sandbox_cfg = job.sandbox or {} + sandbox_type_str = sandbox_cfg.get("environment", "local") + is_container = sandbox_type_str and sandbox_type_str != "local" + eval_args = job.inspect_eval_arguments or {} + task_defaults = eval_args.get("task_defaults") or {} for tc in task_configs: # Enrich each 
sample with task-level metadata @@ -254,8 +258,8 @@ def _build_eval_set( if tc.examples_dir is not None: task_metadata["examples_dir"] = tc.examples_dir # Propagate image_prefix from job for container image resolution - if job.image_prefix: - task_metadata["image_prefix"] = job.image_prefix + if (job.sandbox or {}).get("image_prefix"): + task_metadata["image_prefix"] = job.sandbox["image_prefix"] if tc.metadata: task_metadata.update(tc.metadata) @@ -263,7 +267,7 @@ def _build_eval_set( task_sandbox = None if tc.sandbox is not None: task_sandbox = tc.sandbox - elif tc.sandbox_type and tc.sandbox_type != "local": + elif sandbox_type_str != "local": task_sandbox = _serialize_sandbox(sandbox) # Resolve task-level settings with precedence: @@ -271,7 +275,7 @@ def _build_eval_set( resolved_time_limit = ( tc.time_limit or task_defaults.get("time_limit") - or (300 if job.sandbox_type != "local" else None) + or (300 if sandbox_type_str != "local" else None) ) inspect_tasks.append( @@ -303,8 +307,18 @@ def _build_eval_set( ) ) - # Build EvalSet with all job-level parameters - overrides = job.eval_set_overrides or {} + # Build EvalSet with all job-level parameters from inspect_eval_arguments + eval_set_overrides = eval_args.get("eval_set_overrides") or {} + + # Helper to get a value from eval_args then overrides + def _get(key, default=None): + v = eval_args.get(key) + if v is not None: + return v + v = eval_set_overrides.get(key) + if v is not None: + return v + return default return EvalSet( tasks=inspect_tasks, @@ -312,63 +326,61 @@ def _build_eval_set( model=models, sandbox=_serialize_sandbox(sandbox), # Retry - retry_attempts=job.retry_attempts or overrides.get("retry_attempts") or 10, - retry_wait=job.retry_wait or overrides.get("retry_wait") or 60.0, - retry_connections=job.retry_connections or overrides.get("retry_connections") or 0.5, - retry_cleanup=job.retry_cleanup if job.retry_cleanup is not None else overrides.get("retry_cleanup"), - 
retry_on_error=job.retry_on_error or job.max_retries or overrides.get("retry_on_error"), + retry_attempts=_get("retry_attempts", 10), + retry_wait=float(_get("retry_wait", 60.0)), + retry_connections=float(_get("retry_connections", 0.5)), + retry_cleanup=_get("retry_cleanup"), + retry_on_error=_get("retry_on_error") or _get("max_retries"), # Error handling - fail_on_error=job.fail_on_error if job.fail_on_error is not None else (overrides.get("fail_on_error") or 0.05), - continue_on_fail=job.continue_on_fail if job.continue_on_fail is not None else overrides.get("continue_on_fail"), - debug_errors=job.debug_errors if job.debug_errors is not None else overrides.get("debug_errors"), + fail_on_error=float(_get("fail_on_error", 0.05)), + continue_on_fail=_get("continue_on_fail"), + debug_errors=_get("debug_errors"), # Concurrency - max_samples=job.max_samples or overrides.get("max_samples"), - max_tasks=job.max_tasks or overrides.get("max_tasks"), - max_subprocesses=job.max_subprocesses or overrides.get("max_subprocesses"), - max_sandboxes=job.max_sandboxes or overrides.get("max_sandboxes"), + max_samples=_get("max_samples"), + max_tasks=_get("max_tasks"), + max_subprocesses=_get("max_subprocesses"), + max_sandboxes=_get("max_sandboxes"), # Logging - log_level=job.log_level or overrides.get("log_level") or "info", - log_level_transcript=job.log_level_transcript or overrides.get("log_level_transcript"), - log_format=job.log_format or overrides.get("log_format") or "json", - log_samples=job.log_samples if job.log_samples is not None else overrides.get("log_samples"), - log_realtime=job.log_realtime if job.log_realtime is not None else overrides.get("log_realtime"), - log_images=job.log_images if job.log_images is not None else overrides.get("log_images"), - log_buffer=job.log_buffer or overrides.get("log_buffer"), - log_shared=job.log_shared or overrides.get("log_shared"), - log_dir_allow_dirty=job.log_dir_allow_dirty if job.log_dir_allow_dirty is not None else 
overrides.get("log_dir_allow_dirty"), + log_level=_get("log_level", "info"), + log_level_transcript=_get("log_level_transcript"), + log_format=_get("log_format", "json"), + log_samples=_get("log_samples"), + log_realtime=_get("log_realtime"), + log_images=_get("log_images"), + log_buffer=_get("log_buffer"), + log_shared=_get("log_shared"), + log_dir_allow_dirty=_get("log_dir_allow_dirty"), # Model config - model_base_url=job.model_base_url or overrides.get("model_base_url"), - model_args=job.model_args or overrides.get("model_args") or {}, - model_roles=job.model_roles or overrides.get("model_roles"), - task_args=job.task_args or overrides.get("task_args") or {}, - model_cost_config=job.model_cost_config or overrides.get("model_cost_config"), + model_base_url=_get("model_base_url"), + model_args=_get("model_args", {}), + model_roles=_get("model_roles"), + task_args=_get("task_args", {}), + model_cost_config=_get("model_cost_config"), # Sandbox - sandbox_cleanup=job.sandbox_cleanup if job.sandbox_cleanup is not None else overrides.get("sandbox_cleanup"), + sandbox_cleanup=_get("sandbox_cleanup"), # Sample control - limit=job.limit or overrides.get("limit"), - sample_id=job.sample_id or overrides.get("sample_id"), - sample_shuffle=job.sample_shuffle or overrides.get("sample_shuffle"), - epochs=job.epochs or overrides.get("epochs"), + limit=_get("limit"), + sample_id=_get("sample_id"), + sample_shuffle=_get("sample_shuffle"), + epochs=_get("epochs"), # Misc - tags=job.tags or overrides.get("tags"), - metadata=job.metadata or overrides.get("metadata"), - trace=job.trace if job.trace is not None else overrides.get("trace"), - display=job.display or overrides.get("display"), - approval=job.approval or overrides.get("approval"), - solver=job.solver or overrides.get("solver"), - score=job.score if job.score is not None else (overrides.get("score") if overrides.get("score") is not None else True), + tags=_get("tags"), + metadata=_get("metadata"), + trace=_get("trace"), + 
display=_get("display"), + approval=_get("approval"), + solver=_get("solver"), + score=_get("score", True), # Limits - message_limit=job.message_limit or overrides.get("message_limit"), - token_limit=job.token_limit or overrides.get("token_limit"), - time_limit=job.time_limit or overrides.get("time_limit"), - working_limit=job.working_limit or overrides.get("working_limit"), - cost_limit=job.cost_limit if job.cost_limit is not None else ( - float(overrides["cost_limit"]) if overrides.get("cost_limit") is not None else None - ), + message_limit=_get("message_limit"), + token_limit=_get("token_limit"), + time_limit=_get("time_limit"), + working_limit=_get("working_limit"), + cost_limit=float(_get("cost_limit")) if _get("cost_limit") is not None else None, # Bundling - bundle_dir=job.bundle_dir or overrides.get("bundle_dir"), - bundle_overwrite=job.bundle_overwrite if job.bundle_overwrite is not None else (overrides.get("bundle_overwrite") or False), - eval_set_id=job.eval_set_id or overrides.get("eval_set_id"), + bundle_dir=_get("bundle_dir"), + bundle_overwrite=_get("bundle_overwrite", False), + eval_set_id=_get("eval_set_id"), ) @@ -397,7 +409,8 @@ def _resolve_sandbox( branch: str | None = None, ) -> Any: """Resolve sandbox spec for a given config.""" - sandbox_type = job.sandbox_type + sandbox_cfg = job.sandbox or {} + sandbox_type = sandbox_cfg.get("environment", "local") if not sandbox_type or sandbox_type == "local": return "local" @@ -483,10 +496,8 @@ def _expand_task_configs( if matches_tag_filter((s.metadata or {}).get("tags", []), job.sample_filters) ] - # Apply system_message override + # Apply system_message from task (no longer overridden by job task) system_message = tc.system_message - if job_task and job_task.system_message is not None: - system_message = job_task.system_message # Merge job-task args into metadata merged_metadata = dict(tc.metadata) if tc.metadata else None From 147319dd26b0abfd2445d31b28e33a42e15bf5f2 Mon Sep 17 00:00:00 2001 From: 
Eric Windmill Date: Wed, 18 Mar 2026 11:50:59 -0700 Subject: [PATCH 11/21] feat: Refactor variant configuration to use explicit include/exclude lists, rename context files to files, skill paths to skills, and add task parameters. --- docs/reference/yaml_config.md | 386 ++---------------- .../dash_evals/runner/tasks/task_helpers.py | 129 +++++- .../lib/src/models/job.dart | 12 +- .../lib/src/models/job.freezed.dart | 64 ++- .../lib/src/models/job.g.dart | 8 + .../lib/src/models/variant.dart | 31 +- .../lib/src/models/variant.freezed.dart | 95 +++-- .../lib/src/models/variant.g.dart | 20 +- .../lib/src/parsed_task.dart | 11 - .../lib/src/parsers/json_parser.dart | 3 - .../lib/src/parsers/yaml_parser.dart | 13 +- .../lib/src/resolvers/eval_set_resolver.dart | 181 ++++---- .../test/eval_set_resolver_test.dart | 28 +- .../test/json_parser_test.dart | 2 - .../test/parsed_task_test.dart | 4 +- .../src/dataset_config_python/__init__.py | 2 - .../dataset_config_python/models/__init__.py | 2 + .../src/dataset_config_python/models/job.py | 8 + .../models/mcp_server_config.py | 68 +++ .../dataset_config_python/models/variant.py | 19 +- .../src/dataset_config_python/parser.py | 33 +- .../src/dataset_config_python/resolver.py | 133 +++--- .../tests/test_config.py | 34 +- .../init_templates/init_job_template.dart | 13 +- .../dataset/file_templates/job_template.dart | 17 +- .../dataset/file_templates/task_template.dart | 5 +- .../lib/src/dataset/variant_defaults.dart | 6 +- .../test/dataset/task_template_test.dart | 5 +- 28 files changed, 568 insertions(+), 764 deletions(-) create mode 100644 packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md index 9004539..95e0a91 100644 --- a/docs/reference/yaml_config.md +++ b/docs/reference/yaml_config.md @@ -72,10 +72,10 @@ Job files define runtime settings for an evaluation run, including sandbox confi - Y - `variants` - `variants` - 
-  - Named variant definitions (keys are names, values are config maps)
+  - Named variant definitions (keys are names, values are config maps). Can also be a list of paths to external variant files.
 * - `variants`\
    `.<name>`\
-   `.context_files`
+   `.files`
  - list
  - Y
  -
@@ -88,7 +88,7 @@ Job files define runtime settings for an evaluation run, including sandbox confi
  - Y
  -
  -
-  - MCP server identifiers
+  - MCP server configurations (list of objects with `name`, `command`, `args`, `env`, `transport`; or a `ref` string in `module.path:variable` form pointing to a pre-built Python `MCPServer`)
 * - `variants`\
    `.<name>`\
    `.skills`
@@ -99,12 +99,12 @@ Job files define runtime settings for an evaluation run, including sandbox confi
  - Paths or glob patterns to skill directories
 * - `variants`\
    `.<name>`\
-   `.flutter_channel`
-  - string
+   `.task_parameters`
+  - object
  - Y
  -
  -
-  - Flutter SDK channel (`stable`, `beta`, `main`)
+  - Optional parameters merged into the task config dict at runtime
 * - `task_filters`
  - object
  - Y
@@ -167,6 +167,22 @@ Job files define runtime settings for an evaluation run, including sandbox confi
  - `JobTask.args`
  - `JobTask.args`
  - Per-task argument overrides passed to the task function
+* - `tasks`\
+   `.<task_id>`\
+   `.include-variants`
+  - list
+  - Y
+  - `JobTask.includeVariants`
+  - `JobTask.include_variants`
+  - Only run these variant names for this task
+* - `tasks`\
+   `.<task_id>`\
+   `.exclude-variants`
+  - list
+  - Y
+  - `JobTask.excludeVariants`
+  - `JobTask.exclude_variants`
+  - Exclude these variant names for this task
 * - `save_examples`
  - bool
  - Y
@@ -178,350 +194,7 @@ Job files define runtime settings for an evaluation run, including sandbox confi
  - Y
  - `inspectEvalArguments`
  - `inspect_eval_arguments`
-  - All Inspect AI `eval_set()` parameters. See sub-fields below.
-* - `inspect_eval_arguments`\ -   `.retry_attempts` - - int - - Y - - - - - - Max retry attempts before giving up -* - `inspect_eval_arguments`\ -   `.max_retries` - - int - - Y - - - - - - Max retry attempts for failed samples -* - `inspect_eval_arguments`\ -   `.retry_wait` - - float - - Y - - - - - - Seconds between retries (exponential backoff) -* - `inspect_eval_arguments`\ -   `.retry_connections` - - float - - Y - - - - - - Reduce `max_connections` at this rate per retry -* - `inspect_eval_arguments`\ -   `.retry_cleanup` - - bool - - Y - - - - - - Cleanup failed log files after retries -* - `inspect_eval_arguments`\ -   `.fail_on_error` - - float - - Y - - - - - - Fail if error proportion exceeds threshold (`0.0–1.0`) -* - `inspect_eval_arguments`\ -   `.continue_on_fail` - - bool - - Y - - - - - - Continue running even if `fail_on_error` condition is met -* - `inspect_eval_arguments`\ -   `.retry_on_error` - - int - - Y - - - - - - Retry samples on error (per-sample) -* - `inspect_eval_arguments`\ -   `.debug_errors` - - bool - - Y - - - - - - Raise task errors for debugging -* - `inspect_eval_arguments`\ -   `.max_samples` - - int - - Y - - - - - - Max concurrent samples per task -* - `inspect_eval_arguments`\ -   `.max_tasks` - - int - - Y - - - - - - Max tasks to run in parallel -* - `inspect_eval_arguments`\ -   `.max_subprocesses` - - int - - Y - - - - - - Max subprocesses in parallel -* - `inspect_eval_arguments`\ -   `.max_sandboxes` - - int - - Y - - - - - - Max sandboxes (per-provider) in parallel -* - `inspect_eval_arguments`\ -   `.log_level` - - string - - Y - - - - - - Console log level (`debug`, `info`, `warning`, `error`) -* - `inspect_eval_arguments`\ -   `.log_level_transcript` - - string - - Y - - - - - - Log file level -* - `inspect_eval_arguments`\ -   `.log_format` - - string - - Y - - - - - - Log format (`eval` or `json`) -* - `inspect_eval_arguments`\ -   `.log_samples` - - bool - - Y - - - - - - Log detailed samples and scores -* - 
`inspect_eval_arguments`\ -   `.log_realtime` - - bool - - Y - - - - - - Log events in realtime -* - `inspect_eval_arguments`\ -   `.log_images` - - bool - - Y - - - - - - Log base64-encoded images -* - `inspect_eval_arguments`\ -   `.log_buffer` - - int - - Y - - - - - - Samples to buffer before log write -* - `inspect_eval_arguments`\ -   `.log_shared` - - int - - Y - - - - - - Sync sample events for realtime viewing -* - `inspect_eval_arguments`\ -   `.log_dir_allow_dirty` - - bool - - Y - - - - - - Allow log dir with unrelated logs -* - `inspect_eval_arguments`\ -   `.model_base_url` - - string - - Y - - - - - - Base URL for the model API -* - `inspect_eval_arguments`\ -   `.model_args` - - object - - Y - - - - - - Model creation arguments -* - `inspect_eval_arguments`\ -   `.model_roles` - - object - - Y - - - - - - Named roles for `get_model()` -* - `inspect_eval_arguments`\ -   `.task_args` - - object - - Y - - - - - - Task creation arguments -* - `inspect_eval_arguments`\ -   `.model_cost_config` - - object - - Y - - - - - - Model prices for cost tracking -* - `inspect_eval_arguments`\ -   `.limit` - - int/list - - Y - - - - - - Limit samples (count or `[start, end]` range) -* - `inspect_eval_arguments`\ -   `.sample_id` - - string/list - - Y - - - - - - Evaluate specific sample(s) -* - `inspect_eval_arguments`\ -   `.sample_shuffle` - - bool/int - - Y - - - - - - Shuffle samples (pass seed for deterministic order) -* - `inspect_eval_arguments`\ -   `.epochs` - - int/object - - Y - - - - - - Repeat samples and optional score reducer -* - `inspect_eval_arguments`\ -   `.message_limit` - - int - - Y - - - - - - Max messages per sample -* - `inspect_eval_arguments`\ -   `.token_limit` - - int - - Y - - - - - - Max tokens per sample -* - `inspect_eval_arguments`\ -   `.time_limit` - - int - - Y - - - - - - Max clock time (seconds) per sample -* - `inspect_eval_arguments`\ -   `.working_limit` - - int - - Y - - - - - - Max working time (seconds) per sample -* - 
`inspect_eval_arguments`\ -   `.cost_limit` - - float - - Y - - - - - - Max cost (dollars) per sample -* - `inspect_eval_arguments`\ -   `.tags` - - list - - Y - - - - - - Tags for this evaluation run -* - `inspect_eval_arguments`\ -   `.metadata` - - object - - Y - - - - - - Metadata for this evaluation run -* - `inspect_eval_arguments`\ -   `.trace` - - bool - - Y - - - - - - Trace model interactions to terminal -* - `inspect_eval_arguments`\ -   `.display` - - string - - Y - - - - - - Task display type (default: `full`) -* - `inspect_eval_arguments`\ -   `.score` - - bool - - Y - - - - - - Score output (default: `true`) -* - `inspect_eval_arguments`\ -   `.approval` - - string/object - - Y - - - - - - Tool use approval policies -* - `inspect_eval_arguments`\ -   `.solver` - - string/object - - Y - - - - - - Alternative solver(s) -* - `inspect_eval_arguments`\ -   `.sandbox_cleanup` - - bool - - Y - - - - - - Cleanup sandbox after task -* - `inspect_eval_arguments`\ -   `.bundle_dir` - - string - - Y - - - - - - Directory for bundled logs + viewer -* - `inspect_eval_arguments`\ -   `.bundle_overwrite` - - bool - - Y - - - - - - Overwrite files in `bundle_dir` -* - `inspect_eval_arguments`\ -   `.eval_set_id` - - string - - Y - - - - - - Custom ID for the eval set -* - `inspect_eval_arguments`\ -   `.eval_set_overrides` - - object - - Y - - - - - - Additional `eval_set()` kwargs not covered by named fields above -* - `inspect_eval_arguments`\ -   `.task_defaults` - - object - - Y - - - - - - Default `Task` kwargs applied to every task in this job + - Pass-through dict of any valid Inspect AI `eval_set()` kwargs (e.g. `retry_attempts`, `log_level`, `max_tasks`, `tags`, `task_defaults`, `eval_set_overrides`, etc.). See [Inspect AI docs](https://inspect.ai-safety-institute.org.uk/) for the full list of supported parameters. ``` ## Task @@ -578,18 +251,6 @@ Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) 
are neste - - - Glob patterns for external sample YAML files (relative to task dir) -* - `allowed_variants` - - list - - Y - - - - - - Whitelist of variant names this task accepts -* - `variant_filters` - - object - - Y - - - - - - Tag-based variant filter (same schema as job-level `task_filters`) * - `system_message` - string - Y @@ -738,7 +399,6 @@ Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) are neste - `early_stopping` - Early stopping callbacks ``` -``` ## Sample diff --git a/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py b/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py index 9c2f61a..6f6d44d 100644 --- a/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py +++ b/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py @@ -1,7 +1,7 @@ """Shared helper functions for building task components. These helpers encapsulate common patterns used across tasks: -- Creating the Dart MCP server +- Creating MCP servers from variant config - Building task metadata - Appending variant-driven solvers (context injection, MCP tools, skills) @@ -11,11 +11,12 @@ from __future__ import annotations +import importlib from typing import Any, cast from inspect_ai.agent import react from inspect_ai.solver import Solver, generate -from inspect_ai.tool import MCPServer, Tool, mcp_server_stdio, skill +from inspect_ai.tool import MCPServer, Tool, mcp_server_sandbox, mcp_server_stdio, skill from dash_evals.runner.solvers import context_injector @@ -58,32 +59,111 @@ def validate_sandbox_tools(config: dict, tool_names: list[str]) -> None: ) -def create_mcp_server(config: dict | None = None): - """ - Create an MCP server tool from config. +def _resolve_mcp_ref(ref: str) -> MCPServer: + """Resolve a Python import reference to an MCPServer object. - Reads 'mcp_server_command' and 'mcp_server_args' from config. - Defaults to the Dart MCP server if not specified. + Supports ``"module.path:variable_name"`` format. 
+ + Args: + ref: Import reference (e.g. ``"my_package.mcp:staging_server"``). + + Returns: + The resolved MCPServer object. + """ + if ":" not in ref: + raise ValueError( + f"Invalid MCP server ref '{ref}'. " + "Expected format: 'module.path:variable_name'" + ) + module_path, attr_name = ref.rsplit(":", 1) + try: + module = importlib.import_module(module_path) + except ImportError as e: + raise ImportError( + f"Could not import module '{module_path}' for MCP server ref '{ref}': {e}" + ) from e + try: + server = getattr(module, attr_name) + except AttributeError as e: + raise AttributeError( + f"Module '{module_path}' has no attribute '{attr_name}' " + f"(referenced by MCP server ref '{ref}')" + ) from e + return server + + +def create_mcp_servers( + mcp_configs: list[dict], + sandbox_type: str = "local", +) -> list[MCPServer]: + """Create MCP server objects from variant config. + + Supports two modes per entry: + - **Declarative**: dict with ``name``, ``command``, ``args``, etc. + - **Python ref**: dict with ``ref`` key pointing to a pre-built MCPServer. + + Transport is auto-selected based on sandbox_type when not explicit: + - ``"local"`` → ``mcp_server_stdio`` + - anything else (docker, podman) → ``mcp_server_sandbox`` Args: - config: Task config with optional 'mcp_server_command' and - 'mcp_server_args' keys. + mcp_configs: List of MCP server config dicts from variant_config. + sandbox_type: The sandbox type for the current eval run. Returns: - MCP server stdio tool. + List of MCPServer objects. 
    """
-    config = config or {}
-    command = config.get("mcp_server_command", "dart")
-    args = config.get("mcp_server_args", ["mcp-server", "--force-roots-fallback"])
-    name = config.get("mcp_server_name", "Dart")
+    servers: list[MCPServer] = []
+    for cfg in mcp_configs:
+        if cfg.get("ref"):
+            servers.append(_resolve_mcp_ref(cfg["ref"]))
+            continue
+
+        command = cfg.get("command")
+        if not command:
+            raise ValueError(f"MCP server config missing 'command': {cfg}")
+
+        name = cfg.get("name", command)
+        args = cfg.get("args", [])
+        env = cfg.get("env")
+        cwd = cfg.get("cwd")
+
+        transport = cfg.get("transport")
+        if transport is None:
+            transport = "sandbox" if sandbox_type != "local" else "stdio"
+
+        if transport == "stdio":
+            servers.append(mcp_server_stdio(
+                name=name,
+                command=command,
+                args=args,
+                env=env,
+                cwd=cwd,
+            ))
+        elif transport == "sandbox":
+            servers.append(mcp_server_sandbox(
+                name=name,
+                command=command,
+                args=args,
+                env=env,
+                cwd=cwd,
+            ))
+        else:
+            raise ValueError(f"Unknown MCP transport '{transport}' for server '{name}'")
+
+    return servers
+
+
+# Backwards-compatible alias
+def create_mcp_server(config: dict | None = None):
+    """Create the default Dart MCP server (backwards-compatible alias; ``config`` is accepted for compatibility but ignored)."""
     return mcp_server_stdio(
-        name=name,
-        command=command,
-        args=args,
+        name="Dart",
+        command="dart",
+        args=["mcp-server", "--force-roots-fallback"],
    )
 
 
-# Backwards-compatible alias
 def create_dart_mcp_server():
     """Create the standard Dart MCP server tool (backwards-compatible alias)."""
     return create_mcp_server()
@@ -119,7 +199,8 @@ def append_context_injection(solver_chain: list, config: dict) -> None:
         config: Task manifest entry with 'variant' key.
""" variant = config.get("variant", {}) - context_files = variant.get("context_files", []) + # Support both old "context_files" and new "files" key + context_files = variant.get("files") or variant.get("context_files", []) if context_files: solver_chain.append(context_injector(context_files)) @@ -134,7 +215,8 @@ def get_skill_tool(config: dict) -> Tool | None: The skill Tool, or None if no skills are configured. """ variant = config.get("variant", {}) - skill_paths = variant.get("skill_paths", []) + # Support both old "skill_paths" and new "skills" key + skill_paths = variant.get("skills") or variant.get("skill_paths", []) if skill_paths: return skill(skill_paths) return None @@ -155,8 +237,11 @@ def append_model_interaction( """ tools: list[Tool | MCPServer] = [] variant = config.get("variant", {}) - if variant.get("mcp_servers"): - tools.append(create_mcp_server(config)) + mcp_servers_config = variant.get("mcp_servers", []) + + if mcp_servers_config: + sandbox_type = config.get("sandbox_type", "local") + tools.extend(create_mcp_servers(mcp_servers_config, sandbox_type)) skill_tool = get_skill_tool(config) if skill_tool: diff --git a/packages/dataset_config_dart/lib/src/models/job.dart b/packages/dataset_config_dart/lib/src/models/job.dart index 0fa4599..a7566aa 100644 --- a/packages/dataset_config_dart/lib/src/models/job.dart +++ b/packages/dataset_config_dart/lib/src/models/job.dart @@ -25,7 +25,7 @@ part 'job.g.dart'; /// variants: /// baseline: {} /// context_only: -/// context_files: [./context_files/flutter.md] +/// files: [./context_files/flutter.md] /// tasks: /// dart_qa: /// include-samples: [sample_1] @@ -101,7 +101,7 @@ sealed class Job with _$Job { /// Per-task configuration within a job. /// -/// Allows overriding which samples run for specific tasks. +/// Allows overriding which samples and variants run for specific tasks. 
@freezed sealed class JobTask with _$JobTask { const factory JobTask({ @@ -114,6 +114,12 @@ sealed class JobTask with _$JobTask { /// Exclude these sample IDs. Mutually exclusive with [includeSamples]. @JsonKey(name: 'exclude_samples') List? excludeSamples, + /// Only run these variant names for this task. + @JsonKey(name: 'include_variants') List? includeVariants, + + /// Exclude these variant names for this task. + @JsonKey(name: 'exclude_variants') List? excludeVariants, + /// Per-task argument overrides passed to the task function. @JsonKey(name: 'args') Map? args, }) = _JobTask; @@ -130,6 +136,8 @@ sealed class JobTask with _$JobTask { id: taskId, includeSamples: (data['include-samples'] as List?)?.cast(), excludeSamples: (data['exclude-samples'] as List?)?.cast(), + includeVariants: (data['include-variants'] as List?)?.cast(), + excludeVariants: (data['exclude-variants'] as List?)?.cast(), args: (data['args'] as Map?)?.cast(), ); } diff --git a/packages/dataset_config_dart/lib/src/models/job.freezed.dart b/packages/dataset_config_dart/lib/src/models/job.freezed.dart index ed172a2..b0de561 100644 --- a/packages/dataset_config_dart/lib/src/models/job.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/job.freezed.dart @@ -470,7 +470,9 @@ mixin _$JobTask { /// Task identifier matching a task directory name in `tasks/`. String get id;/// Only run these sample IDs. Mutually exclusive with [excludeSamples]. @JsonKey(name: 'include_samples') List? get includeSamples;/// Exclude these sample IDs. Mutually exclusive with [includeSamples]. -@JsonKey(name: 'exclude_samples') List? get excludeSamples;/// Per-task argument overrides passed to the task function. +@JsonKey(name: 'exclude_samples') List? get excludeSamples;/// Only run these variant names for this task. +@JsonKey(name: 'include_variants') List? get includeVariants;/// Exclude these variant names for this task. +@JsonKey(name: 'exclude_variants') List? 
get excludeVariants;/// Per-task argument overrides passed to the task function. @JsonKey(name: 'args') Map? get args; /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. @@ -484,16 +486,16 @@ $JobTaskCopyWith get copyWith => _$JobTaskCopyWithImpl(this as @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other.includeSamples, includeSamples)&&const DeepCollectionEquality().equals(other.excludeSamples, excludeSamples)&&const DeepCollectionEquality().equals(other.args, args)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other.includeSamples, includeSamples)&&const DeepCollectionEquality().equals(other.excludeSamples, excludeSamples)&&const DeepCollectionEquality().equals(other.includeVariants, includeVariants)&&const DeepCollectionEquality().equals(other.excludeVariants, excludeVariants)&&const DeepCollectionEquality().equals(other.args, args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(includeSamples),const DeepCollectionEquality().hash(excludeSamples),const DeepCollectionEquality().hash(args)); +int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(includeSamples),const DeepCollectionEquality().hash(excludeSamples),const DeepCollectionEquality().hash(includeVariants),const DeepCollectionEquality().hash(excludeVariants),const DeepCollectionEquality().hash(args)); @override String toString() { - return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, args: $args)'; + return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, 
includeVariants: $includeVariants, excludeVariants: $excludeVariants, args: $args)'; } @@ -504,7 +506,7 @@ abstract mixin class $JobTaskCopyWith<$Res> { factory $JobTaskCopyWith(JobTask value, $Res Function(JobTask) _then) = _$JobTaskCopyWithImpl; @useResult $Res call({ - String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'args') Map? args + String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'include_variants') List? includeVariants,@JsonKey(name: 'exclude_variants') List? excludeVariants,@JsonKey(name: 'args') Map? args }); @@ -521,11 +523,13 @@ class _$JobTaskCopyWithImpl<$Res> /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? args = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? includeVariants = freezed,Object? excludeVariants = freezed,Object? args = freezed,}) { return _then(_self.copyWith( id: null == id ? _self.id : id // ignore: cast_nullable_to_non_nullable as String,includeSamples: freezed == includeSamples ? _self.includeSamples : includeSamples // ignore: cast_nullable_to_non_nullable as List?,excludeSamples: freezed == excludeSamples ? _self.excludeSamples : excludeSamples // ignore: cast_nullable_to_non_nullable +as List?,includeVariants: freezed == includeVariants ? _self.includeVariants : includeVariants // ignore: cast_nullable_to_non_nullable +as List?,excludeVariants: freezed == excludeVariants ? _self.excludeVariants : excludeVariants // ignore: cast_nullable_to_non_nullable as List?,args: freezed == args ? 
_self.args : args // ignore: cast_nullable_to_non_nullable as Map?, )); @@ -609,10 +613,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'args') Map? args)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'include_variants') List? includeVariants, @JsonKey(name: 'exclude_variants') List? excludeVariants, @JsonKey(name: 'args') Map? args)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _JobTask() when $default != null: -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);case _: +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.includeVariants,_that.excludeVariants,_that.args);case _: return orElse(); } @@ -630,10 +634,10 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);c /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'args') Map? args) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'include_variants') List? includeVariants, @JsonKey(name: 'exclude_variants') List? excludeVariants, @JsonKey(name: 'args') Map? 
args) $default,) {final _that = this; switch (_that) { case _JobTask(): -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);} +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.includeVariants,_that.excludeVariants,_that.args);} } /// A variant of `when` that fallback to returning `null` /// @@ -647,10 +651,10 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);} /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'args') Map? args)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'include_variants') List? includeVariants, @JsonKey(name: 'exclude_variants') List? excludeVariants, @JsonKey(name: 'args') Map? args)? $default,) {final _that = this; switch (_that) { case _JobTask() when $default != null: -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);case _: +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.includeVariants,_that.excludeVariants,_that.args);case _: return null; } @@ -662,7 +666,7 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.args);c @JsonSerializable() class _JobTask implements JobTask { - const _JobTask({required this.id, @JsonKey(name: 'include_samples') final List? includeSamples, @JsonKey(name: 'exclude_samples') final List? excludeSamples, @JsonKey(name: 'args') final Map? args}): _includeSamples = includeSamples,_excludeSamples = excludeSamples,_args = args; + const _JobTask({required this.id, @JsonKey(name: 'include_samples') final List? includeSamples, @JsonKey(name: 'exclude_samples') final List? 
excludeSamples, @JsonKey(name: 'include_variants') final List? includeVariants, @JsonKey(name: 'exclude_variants') final List? excludeVariants, @JsonKey(name: 'args') final Map? args}): _includeSamples = includeSamples,_excludeSamples = excludeSamples,_includeVariants = includeVariants,_excludeVariants = excludeVariants,_args = args; factory _JobTask.fromJson(Map json) => _$JobTaskFromJson(json); /// Task identifier matching a task directory name in `tasks/`. @@ -689,6 +693,28 @@ class _JobTask implements JobTask { return EqualUnmodifiableListView(value); } +/// Only run these variant names for this task. + final List? _includeVariants; +/// Only run these variant names for this task. +@override@JsonKey(name: 'include_variants') List? get includeVariants { + final value = _includeVariants; + if (value == null) return null; + if (_includeVariants is EqualUnmodifiableListView) return _includeVariants; + // ignore: implicit_dynamic_type + return EqualUnmodifiableListView(value); +} + +/// Exclude these variant names for this task. + final List? _excludeVariants; +/// Exclude these variant names for this task. +@override@JsonKey(name: 'exclude_variants') List? get excludeVariants { + final value = _excludeVariants; + if (value == null) return null; + if (_excludeVariants is EqualUnmodifiableListView) return _excludeVariants; + // ignore: implicit_dynamic_type + return EqualUnmodifiableListView(value); +} + /// Per-task argument overrides passed to the task function. final Map? _args; /// Per-task argument overrides passed to the task function. 
@@ -714,16 +740,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other._includeSamples, _includeSamples)&&const DeepCollectionEquality().equals(other._excludeSamples, _excludeSamples)&&const DeepCollectionEquality().equals(other._args, _args)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other._includeSamples, _includeSamples)&&const DeepCollectionEquality().equals(other._excludeSamples, _excludeSamples)&&const DeepCollectionEquality().equals(other._includeVariants, _includeVariants)&&const DeepCollectionEquality().equals(other._excludeVariants, _excludeVariants)&&const DeepCollectionEquality().equals(other._args, _args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(_includeSamples),const DeepCollectionEquality().hash(_excludeSamples),const DeepCollectionEquality().hash(_args)); +int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(_includeSamples),const DeepCollectionEquality().hash(_excludeSamples),const DeepCollectionEquality().hash(_includeVariants),const DeepCollectionEquality().hash(_excludeVariants),const DeepCollectionEquality().hash(_args)); @override String toString() { - return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, args: $args)'; + return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, includeVariants: $includeVariants, excludeVariants: $excludeVariants, args: $args)'; } @@ -734,7 +760,7 @@ abstract mixin class _$JobTaskCopyWith<$Res> implements $JobTaskCopyWith<$Res> { factory _$JobTaskCopyWith(_JobTask value, $Res 
Function(_JobTask) _then) = __$JobTaskCopyWithImpl; @override @useResult $Res call({ - String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'args') Map? args + String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'include_variants') List? includeVariants,@JsonKey(name: 'exclude_variants') List? excludeVariants,@JsonKey(name: 'args') Map? args }); @@ -751,11 +777,13 @@ class __$JobTaskCopyWithImpl<$Res> /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? args = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? includeVariants = freezed,Object? excludeVariants = freezed,Object? args = freezed,}) { return _then(_JobTask( id: null == id ? _self.id : id // ignore: cast_nullable_to_non_nullable as String,includeSamples: freezed == includeSamples ? _self._includeSamples : includeSamples // ignore: cast_nullable_to_non_nullable as List?,excludeSamples: freezed == excludeSamples ? _self._excludeSamples : excludeSamples // ignore: cast_nullable_to_non_nullable +as List?,includeVariants: freezed == includeVariants ? _self._includeVariants : includeVariants // ignore: cast_nullable_to_non_nullable +as List?,excludeVariants: freezed == excludeVariants ? _self._excludeVariants : excludeVariants // ignore: cast_nullable_to_non_nullable as List?,args: freezed == args ? 
_self._args : args // ignore: cast_nullable_to_non_nullable
as Map<String, Object?>?,
   ));
 }
 
diff --git a/packages/dataset_config_dart/lib/src/models/job.g.dart b/packages/dataset_config_dart/lib/src/models/job.g.dart
index 929b022..e2996b3 100644
--- a/packages/dataset_config_dart/lib/src/models/job.g.dart
+++ b/packages/dataset_config_dart/lib/src/models/job.g.dart
@@ -54,6 +54,12 @@ _JobTask _$JobTaskFromJson(Map<String, dynamic> json) => _JobTask(
   excludeSamples: (json['exclude_samples'] as List<dynamic>?)
       ?.map((e) => e as String)
       .toList(),
+  includeVariants: (json['include_variants'] as List<dynamic>?)
+      ?.map((e) => e as String)
+      .toList(),
+  excludeVariants: (json['exclude_variants'] as List<dynamic>?)
+      ?.map((e) => e as String)
+      .toList(),
   args: json['args'] as Map<String, Object?>?,
 );
 
@@ -61,5 +67,7 @@ Map<String, dynamic> _$JobTaskToJson(_JobTask instance) => <String, dynamic>{
   'id': instance.id,
   'include_samples': instance.includeSamples,
   'exclude_samples': instance.excludeSamples,
+  'include_variants': instance.includeVariants,
+  'exclude_variants': instance.excludeVariants,
   'args': instance.args,
 };
diff --git a/packages/dataset_config_dart/lib/src/models/variant.dart b/packages/dataset_config_dart/lib/src/models/variant.dart
index bfa1542..15a3bbb 100644
--- a/packages/dataset_config_dart/lib/src/models/variant.dart
+++ b/packages/dataset_config_dart/lib/src/models/variant.dart
@@ -11,9 +11,10 @@ part 'variant.g.dart';
 /// performance with and without specific tooling or context.
 ///
 /// Features are implied by field presence — no explicit feature list needed:
-/// - [contextFiles] populated → context injection enabled
+/// - [files] populated → context injection enabled
 /// - [mcpServers] populated → MCP tools enabled
-/// - [skillPaths] populated → agent skills enabled
+/// - [skills] populated → agent skills enabled
+/// - [taskParameters] populated → extra parameters passed to the task
 /// - all empty → baseline variant
 ///
 /// Example YAML:
@@ -21,10 +22,13 @@ part 'variant.g.dart';
 /// variants:
 ///   baseline: {}
 ///   context_only:
-///     context_files: [./context_files/flutter.md]
+///     files: [./context_files/flutter.md]
 ///   full:
-///     context_files: [./context_files/flutter.md]
-///     mcp_servers: [dart]
+///     files: [./context_files/flutter.md]
+///     mcp_servers:
+///       - name: dart
+///         command: dart
+///         args: [mcp-server]
 ///     skills: [./skills/flutter_docs_ui]
 /// ```
 @freezed
@@ -34,18 +38,21 @@ sealed class Variant with _$Variant {
     @Default('baseline') String name,
 
     /// Loaded context files (paths resolved by config resolver).
-    @JsonKey(name: 'context_files') @Default([]) List<ContextFile> contextFiles,
+    @JsonKey(name: 'files') @Default([]) List<ContextFile> files,
 
-    /// MCP server keys to enable (e.g., `['dart']`).
-    @JsonKey(name: 'mcp_servers') @Default([]) List<String> mcpServers,
+    /// MCP server configurations (list of config maps or ref strings).
+    @JsonKey(name: 'mcp_servers')
+    @Default([])
+    List<Map<String, Object?>> mcpServers,
 
     /// Resolved paths to agent skill directories.
     /// Each directory must contain a `SKILL.md` file.
-    @JsonKey(name: 'skill_paths') @Default([]) List<String> skillPaths,
+    @JsonKey(name: 'skills') @Default([]) List<String> skills,
 
-    /// SDK branch/channel to use (e.g., `'stable'`, `'beta'`, `'main'`).
-    /// `null` means use the default image from the job's sandbox.
-    @JsonKey(name: 'branch') String? branch,
+    /// Optional parameters merged into the task config dict at runtime.
+    @JsonKey(name: 'task_parameters')
+    @Default({})
+    Map<String, Object?> taskParameters,
   }) = _Variant;
 
   const Variant._();
diff --git a/packages/dataset_config_dart/lib/src/models/variant.freezed.dart b/packages/dataset_config_dart/lib/src/models/variant.freezed.dart
index 5389724..f322f5a 100644
--- a/packages/dataset_config_dart/lib/src/models/variant.freezed.dart
+++ b/packages/dataset_config_dart/lib/src/models/variant.freezed.dart
@@ -17,12 +17,11 @@ mixin _$Variant {
 
 /// User-defined variant name from the job file.
  String get name;/// Loaded context files (paths resolved by config resolver).
-@JsonKey(name: 'context_files') List<ContextFile> get contextFiles;/// MCP server keys to enable (e.g., `['dart']`).
-@JsonKey(name: 'mcp_servers') List<String> get mcpServers;/// Resolved paths to agent skill directories.
+@JsonKey(name: 'files') List<ContextFile> get files;/// MCP server configurations (list of config maps or ref strings).
+@JsonKey(name: 'mcp_servers') List<Map<String, Object?>> get mcpServers;/// Resolved paths to agent skill directories.
 /// Each directory must contain a `SKILL.md` file.
-@JsonKey(name: 'skill_paths') List<String> get skillPaths;/// SDK branch/channel to use (e.g., `'stable'`, `'beta'`, `'main'`).
-/// `null` means use the default image from the job's sandbox.
-@JsonKey(name: 'branch') String? get branch;
+@JsonKey(name: 'skills') List<String> get skills;/// Optional parameters merged into the task config dict at runtime.
+@JsonKey(name: 'task_parameters') Map<String, Object?> get taskParameters;
 
 /// Create a copy of Variant
 /// with the given fields replaced by the non-null parameter values.
@JsonKey(includeFromJson: false, includeToJson: false) @@ -35,16 +34,16 @@ $VariantCopyWith get copyWith => _$VariantCopyWithImpl(this as @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.contextFiles, contextFiles)&&const DeepCollectionEquality().equals(other.mcpServers, mcpServers)&&const DeepCollectionEquality().equals(other.skillPaths, skillPaths)&&(identical(other.branch, branch) || other.branch == branch)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.files, files)&&const DeepCollectionEquality().equals(other.mcpServers, mcpServers)&&const DeepCollectionEquality().equals(other.skills, skills)&&const DeepCollectionEquality().equals(other.taskParameters, taskParameters)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(contextFiles),const DeepCollectionEquality().hash(mcpServers),const DeepCollectionEquality().hash(skillPaths),branch); +int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(files),const DeepCollectionEquality().hash(mcpServers),const DeepCollectionEquality().hash(skills),const DeepCollectionEquality().hash(taskParameters)); @override String toString() { - return 'Variant(name: $name, contextFiles: $contextFiles, mcpServers: $mcpServers, skillPaths: $skillPaths, branch: $branch)'; + return 'Variant(name: $name, files: $files, mcpServers: $mcpServers, skills: $skills, taskParameters: $taskParameters)'; } @@ -55,7 +54,7 @@ abstract mixin class $VariantCopyWith<$Res> { factory $VariantCopyWith(Variant value, $Res Function(Variant) _then) = _$VariantCopyWithImpl; @useResult $Res call({ - String 
name,@JsonKey(name: 'context_files') List contextFiles,@JsonKey(name: 'mcp_servers') List mcpServers,@JsonKey(name: 'skill_paths') List skillPaths,@JsonKey(name: 'branch') String? branch + String name,@JsonKey(name: 'files') List files,@JsonKey(name: 'mcp_servers') List> mcpServers,@JsonKey(name: 'skills') List skills,@JsonKey(name: 'task_parameters') Map taskParameters }); @@ -72,14 +71,14 @@ class _$VariantCopyWithImpl<$Res> /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? name = null,Object? contextFiles = null,Object? mcpServers = null,Object? skillPaths = null,Object? branch = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? name = null,Object? files = null,Object? mcpServers = null,Object? skills = null,Object? taskParameters = null,}) { return _then(_self.copyWith( name: null == name ? _self.name : name // ignore: cast_nullable_to_non_nullable -as String,contextFiles: null == contextFiles ? _self.contextFiles : contextFiles // ignore: cast_nullable_to_non_nullable +as String,files: null == files ? _self.files : files // ignore: cast_nullable_to_non_nullable as List,mcpServers: null == mcpServers ? _self.mcpServers : mcpServers // ignore: cast_nullable_to_non_nullable -as List,skillPaths: null == skillPaths ? _self.skillPaths : skillPaths // ignore: cast_nullable_to_non_nullable -as List,branch: freezed == branch ? _self.branch : branch // ignore: cast_nullable_to_non_nullable -as String?, +as List>,skills: null == skills ? _self.skills : skills // ignore: cast_nullable_to_non_nullable +as List,taskParameters: null == taskParameters ? 
_self.taskParameters : taskParameters // ignore: cast_nullable_to_non_nullable +as Map, )); } @@ -161,10 +160,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'branch') String? branch)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String name, @JsonKey(name: 'files') List files, @JsonKey(name: 'mcp_servers') List> mcpServers, @JsonKey(name: 'skills') List skills, @JsonKey(name: 'task_parameters') Map taskParameters)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Variant() when $default != null: -return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.branch);case _: +return $default(_that.name,_that.files,_that.mcpServers,_that.skills,_that.taskParameters);case _: return orElse(); } @@ -182,10 +181,10 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths, /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'branch') String? 
branch) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String name, @JsonKey(name: 'files') List files, @JsonKey(name: 'mcp_servers') List> mcpServers, @JsonKey(name: 'skills') List skills, @JsonKey(name: 'task_parameters') Map taskParameters) $default,) {final _that = this; switch (_that) { case _Variant(): -return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.branch);} +return $default(_that.name,_that.files,_that.mcpServers,_that.skills,_that.taskParameters);} } /// A variant of `when` that fallback to returning `null` /// @@ -199,10 +198,10 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths, /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'branch') String? branch)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( String name, @JsonKey(name: 'files') List files, @JsonKey(name: 'mcp_servers') List> mcpServers, @JsonKey(name: 'skills') List skills, @JsonKey(name: 'task_parameters') Map taskParameters)? 
$default,) {final _that = this; switch (_that) { case _Variant() when $default != null: -return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.branch);case _: +return $default(_that.name,_that.files,_that.mcpServers,_that.skills,_that.taskParameters);case _: return null; } @@ -214,24 +213,24 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths, @JsonSerializable() class _Variant extends Variant { - const _Variant({this.name = 'baseline', @JsonKey(name: 'context_files') final List contextFiles = const [], @JsonKey(name: 'mcp_servers') final List mcpServers = const [], @JsonKey(name: 'skill_paths') final List skillPaths = const [], @JsonKey(name: 'branch') this.branch}): _contextFiles = contextFiles,_mcpServers = mcpServers,_skillPaths = skillPaths,super._(); + const _Variant({this.name = 'baseline', @JsonKey(name: 'files') final List files = const [], @JsonKey(name: 'mcp_servers') final List> mcpServers = const [], @JsonKey(name: 'skills') final List skills = const [], @JsonKey(name: 'task_parameters') final Map taskParameters = const {}}): _files = files,_mcpServers = mcpServers,_skills = skills,_taskParameters = taskParameters,super._(); factory _Variant.fromJson(Map json) => _$VariantFromJson(json); /// User-defined variant name from the job file. @override@JsonKey() final String name; /// Loaded context files (paths resolved by config resolver). - final List _contextFiles; + final List _files; /// Loaded context files (paths resolved by config resolver). -@override@JsonKey(name: 'context_files') List get contextFiles { - if (_contextFiles is EqualUnmodifiableListView) return _contextFiles; +@override@JsonKey(name: 'files') List get files { + if (_files is EqualUnmodifiableListView) return _files; // ignore: implicit_dynamic_type - return EqualUnmodifiableListView(_contextFiles); + return EqualUnmodifiableListView(_files); } -/// MCP server keys to enable (e.g., `['dart']`). 
- final List _mcpServers; -/// MCP server keys to enable (e.g., `['dart']`). -@override@JsonKey(name: 'mcp_servers') List get mcpServers { +/// MCP server configurations (list of config maps or ref strings). + final List> _mcpServers; +/// MCP server configurations (list of config maps or ref strings). +@override@JsonKey(name: 'mcp_servers') List> get mcpServers { if (_mcpServers is EqualUnmodifiableListView) return _mcpServers; // ignore: implicit_dynamic_type return EqualUnmodifiableListView(_mcpServers); @@ -239,18 +238,24 @@ class _Variant extends Variant { /// Resolved paths to agent skill directories. /// Each directory must contain a `SKILL.md` file. - final List _skillPaths; + final List _skills; /// Resolved paths to agent skill directories. /// Each directory must contain a `SKILL.md` file. -@override@JsonKey(name: 'skill_paths') List get skillPaths { - if (_skillPaths is EqualUnmodifiableListView) return _skillPaths; +@override@JsonKey(name: 'skills') List get skills { + if (_skills is EqualUnmodifiableListView) return _skills; // ignore: implicit_dynamic_type - return EqualUnmodifiableListView(_skillPaths); + return EqualUnmodifiableListView(_skills); +} + +/// Optional parameters merged into the task config dict at runtime. + final Map _taskParameters; +/// Optional parameters merged into the task config dict at runtime. +@override@JsonKey(name: 'task_parameters') Map get taskParameters { + if (_taskParameters is EqualUnmodifiableMapView) return _taskParameters; + // ignore: implicit_dynamic_type + return EqualUnmodifiableMapView(_taskParameters); } -/// SDK branch/channel to use (e.g., `'stable'`, `'beta'`, `'main'`). -/// `null` means use the default image from the job's sandbox. -@override@JsonKey(name: 'branch') final String? branch; /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. 
@@ -265,16 +270,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other._contextFiles, _contextFiles)&&const DeepCollectionEquality().equals(other._mcpServers, _mcpServers)&&const DeepCollectionEquality().equals(other._skillPaths, _skillPaths)&&(identical(other.branch, branch) || other.branch == branch)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other._files, _files)&&const DeepCollectionEquality().equals(other._mcpServers, _mcpServers)&&const DeepCollectionEquality().equals(other._skills, _skills)&&const DeepCollectionEquality().equals(other._taskParameters, _taskParameters)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(_contextFiles),const DeepCollectionEquality().hash(_mcpServers),const DeepCollectionEquality().hash(_skillPaths),branch); +int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(_files),const DeepCollectionEquality().hash(_mcpServers),const DeepCollectionEquality().hash(_skills),const DeepCollectionEquality().hash(_taskParameters)); @override String toString() { - return 'Variant(name: $name, contextFiles: $contextFiles, mcpServers: $mcpServers, skillPaths: $skillPaths, branch: $branch)'; + return 'Variant(name: $name, files: $files, mcpServers: $mcpServers, skills: $skills, taskParameters: $taskParameters)'; } @@ -285,7 +290,7 @@ abstract mixin class _$VariantCopyWith<$Res> implements $VariantCopyWith<$Res> { factory _$VariantCopyWith(_Variant value, $Res Function(_Variant) _then) = __$VariantCopyWithImpl; @override @useResult $Res call({ - String name,@JsonKey(name: 'context_files') 
List contextFiles,@JsonKey(name: 'mcp_servers') List mcpServers,@JsonKey(name: 'skill_paths') List skillPaths,@JsonKey(name: 'branch') String? branch + String name,@JsonKey(name: 'files') List files,@JsonKey(name: 'mcp_servers') List> mcpServers,@JsonKey(name: 'skills') List skills,@JsonKey(name: 'task_parameters') Map taskParameters }); @@ -302,14 +307,14 @@ class __$VariantCopyWithImpl<$Res> /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? name = null,Object? contextFiles = null,Object? mcpServers = null,Object? skillPaths = null,Object? branch = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? name = null,Object? files = null,Object? mcpServers = null,Object? skills = null,Object? taskParameters = null,}) { return _then(_Variant( name: null == name ? _self.name : name // ignore: cast_nullable_to_non_nullable -as String,contextFiles: null == contextFiles ? _self._contextFiles : contextFiles // ignore: cast_nullable_to_non_nullable +as String,files: null == files ? _self._files : files // ignore: cast_nullable_to_non_nullable as List,mcpServers: null == mcpServers ? _self._mcpServers : mcpServers // ignore: cast_nullable_to_non_nullable -as List,skillPaths: null == skillPaths ? _self._skillPaths : skillPaths // ignore: cast_nullable_to_non_nullable -as List,branch: freezed == branch ? _self.branch : branch // ignore: cast_nullable_to_non_nullable -as String?, +as List>,skills: null == skills ? _self._skills : skills // ignore: cast_nullable_to_non_nullable +as List,taskParameters: null == taskParameters ? 
_self._taskParameters : taskParameters // ignore: cast_nullable_to_non_nullable
+as Map<String, Object?>,
   ));
 }
 
diff --git a/packages/dataset_config_dart/lib/src/models/variant.g.dart b/packages/dataset_config_dart/lib/src/models/variant.g.dart
index 09277ff..35e3d0c 100644
--- a/packages/dataset_config_dart/lib/src/models/variant.g.dart
+++ b/packages/dataset_config_dart/lib/src/models/variant.g.dart
@@ -8,28 +8,26 @@ part of 'variant.dart';
 
 _Variant _$VariantFromJson(Map<String, dynamic> json) => _Variant(
   name: json['name'] as String? ?? 'baseline',
-  contextFiles:
-      (json['context_files'] as List<dynamic>?)
+  files:
+      (json['files'] as List<dynamic>?)
           ?.map((e) => ContextFile.fromJson(e as Map<String, dynamic>))
           .toList() ??
       const [],
   mcpServers:
       (json['mcp_servers'] as List<dynamic>?)
-          ?.map((e) => e as String)
+          ?.map((e) => e as Map<String, Object?>)
           .toList() ??
       const [],
-  skillPaths:
-      (json['skill_paths'] as List<dynamic>?)
-          ?.map((e) => e as String)
-          .toList() ??
+  skills:
+      (json['skills'] as List<dynamic>?)?.map((e) => e as String).toList() ??
       const [],
-  branch: json['branch'] as String?,
+  taskParameters: json['task_parameters'] as Map<String, Object?>? ?? const {},
 );
 
 Map<String, dynamic> _$VariantToJson(_Variant instance) => <String, dynamic>{
   'name': instance.name,
-  'context_files': instance.contextFiles,
+  'files': instance.files,
   'mcp_servers': instance.mcpServers,
-  'skill_paths': instance.skillPaths,
-  'branch': instance.branch,
+  'skills': instance.skills,
+  'task_parameters': instance.taskParameters,
 };
diff --git a/packages/dataset_config_dart/lib/src/parsed_task.dart b/packages/dataset_config_dart/lib/src/parsed_task.dart
index c5afaf7..64816d8 100644
--- a/packages/dataset_config_dart/lib/src/parsed_task.dart
+++ b/packages/dataset_config_dart/lib/src/parsed_task.dart
@@ -18,13 +18,9 @@ class ParsedTask {
   final Variant variant;
   final String sandboxType;
   final String? systemMessage;
-  final List<String>? allowedVariants;
   final bool saveExamples;
   final String? examplesDir;
 
-  /// Tag filter for variant selection.
-  final TagFilter? variantFilters;
-
   /// Pass-through dict for sandbox plugin configuration.
   final Map<String, Object?>? sandboxParameters;
@@ -90,12 +86,9 @@ class ParsedTask {
     required this.variant,
     this.sandboxType = 'local',
     this.systemMessage,
-    this.allowedVariants,
     this.saveExamples = false,
     this.examplesDir,
-    this.variantFilters,
     this.sandboxParameters,
-    // Task-level settings
     this.model,
     this.config,
     this.modelRoles,
@@ -123,10 +116,8 @@ class ParsedTask {
     Variant? variant,
     String? sandboxType,
     String? systemMessage,
-    List<String>? allowedVariants,
     bool? saveExamples,
     String? examplesDir,
-    TagFilter? variantFilters,
     Map<String, Object?>? sandboxParameters,
     String? model,
     Map<String, Object?>? config,
@@ -153,10 +144,8 @@ class ParsedTask {
       variant: variant ?? this.variant,
       sandboxType: sandboxType ?? this.sandboxType,
       systemMessage: systemMessage ?? this.systemMessage,
-      allowedVariants: allowedVariants ?? this.allowedVariants,
      saveExamples: saveExamples ?? this.saveExamples,
       examplesDir: examplesDir ?? this.examplesDir,
-      variantFilters: variantFilters ?? this.variantFilters,
       sandboxParameters: sandboxParameters ?? this.sandboxParameters,
       model: model ?? this.model,
       config: config ?? this.config,
diff --git a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
index cbacb17..b6e40d3 100644
--- a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
+++ b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
@@ -23,8 +23,6 @@ class JsonParser extends Parser {
     final taskId = data['id'] as String;
     final func = (data['func'] as String?) ?? taskId;
     final systemMessage = data['system_message'] as String?;
-    final allowedVariants = (data['allowed_variants'] as List?)
-        ?.cast<String>();
 
     // Parse samples from inline data (no file I/O) - optional
     final samplesRaw = data['samples'];
@@ -123,7 +121,6 @@ class JsonParser extends Parser {
       variant: const Variant(),
       samples: samples,
       systemMessage: systemMessage,
-      allowedVariants: allowedVariants,
       // Task-level settings
       model: model,
       config: config,
diff --git a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
index b9b41c0..66f8ac7 100644
--- a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
+++ b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
@@ -60,9 +60,6 @@ class YamlParser extends Parser {
     final taskWorkspace = _preResolveToAbs(taskWorkspaceRaw, taskDir);
     final taskTests = _preResolveToAbs(taskTestsRaw, taskDir);
 
-    // Optional whitelist of variant names
-    final allowedVariants = (data['allowed_variants'] as List?)?.cast<String>();
-
     // Parse samples section
     final samplesRaw = data['samples'];
     if (samplesRaw is! Map) {
@@ -101,12 +98,6 @@ class YamlParser extends Parser {
     final taskMetadata = _asMap(data['metadata']);
     final sandboxParameters = _asMap(data['sandbox_parameters']);
 
-    // Parse variant_filters (tag-based variant restriction)
-    final variantFiltersRaw = _asMap(data['variant_filters']);
-    final variantFilters = variantFiltersRaw != null
-        ? TagFilter.fromJson(variantFiltersRaw)
-        : null;
-
     return [
       ParsedTask(
         id: taskId,
@@ -114,7 +105,7 @@ class YamlParser extends Parser {
         variant: const Variant(), // placeholder baseline
         samples: samples,
         systemMessage: systemMessage,
-        allowedVariants: allowedVariants,
+        sandboxParameters: sandboxParameters,
         // Task-level settings
         model: model,
         config: config,
@@ -133,8 +124,6 @@ class YamlParser extends Parser {
         displayName: displayName,
         version: version,
         metadata: taskMetadata,
-        sandboxParameters: sandboxParameters,
-        variantFilters: variantFilters,
       ),
     ];
   }
diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart
index 09b32ea..d73925c 100644
--- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart
+++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart
@@ -36,14 +36,7 @@ const Map<String, Map<String, String>> kDefaultSandboxRegistry = {
   },
 };
 
-/// Default SDK branch → sandbox registry key mapping.
-///
-/// Consumers can pass these to [EvalSetResolver] or provide their own.
-const Map<String, String> kDefaultBranchChannels = {
-  'stable': 'podman',
-  'beta': 'podman-beta',
-  'main': 'podman-main',
-};
+
 
 /// Resolves parsed task configs and job into fully-resolved
 /// [EvalSet] objects ready for JSON serialization.
@@ -51,28 +44,21 @@ const Map<String, String> kDefaultBranchChannels = {
 /// This is the resolution engine. It:
 /// 1. Resolves models, sandboxes, and variants
 /// 2. Expands task × variant combinations into [Task] entries
-/// 3. Groups by branch (one [EvalSet] per group)
-/// 4. Propagates job-level and task-level settings to the output
+/// 3. Propagates job-level and task-level settings to the output
 class EvalSetResolver {
   /// Creates a resolver with optional sandbox configuration.
   ///
-  /// If [sandboxRegistry] or [branchChannels] are not provided, they default
-  /// to empty maps (no sandbox resolution). Pass [kDefaultSandboxRegistry]
-  /// and [kDefaultBranchChannels] for the Flutter-specific sandbox setup.
+  /// If [sandboxRegistry] is not provided, it defaults to an empty map
+  /// (no sandbox resolution). Pass [kDefaultSandboxRegistry] for the
+  /// Flutter-specific sandbox setup.
   const EvalSetResolver({
     this.sandboxRegistry = const {},
-    this.branchChannels = const {},
   });
 
   /// Named sandbox configurations (e.g. `'podman'` → compose file path).
   final Map<String, Map<String, String>> sandboxRegistry;
 
-  /// SDK branch → sandbox registry key mapping.
-  final Map<String, String> branchChannels;
-
   /// Resolve task configs and job into [EvalSet] objects.
-  ///
-  /// Groups by branch so each gets its own sandbox.
   List<EvalSet> resolve(
     List<ParsedTask> datasetTasks,
     Job job,
@@ -88,26 +74,16 @@ class EvalSetResolver {
       datasetRoot,
     );
 
-    // Group by branch
-    final groups = <String?, List<ParsedTask>>{};
-    for (final tc in expandedTasks) {
-      final key = tc.variant.branch;
-      (groups[key] ??= []).add(tc);
-    }
+    final sandbox = _resolveSandbox(datasetRoot, job);
 
     return [
-      for (final entry in groups.entries)
-        _buildEvalSet(
-          taskConfigs: entry.value,
-          logDir: job.logDir,
-          models: models,
-          sandbox: _resolveSandbox(
-            datasetRoot,
-            job,
-            branch: entry.key,
-          ),
-          job: job,
-        ),
+      _buildEvalSet(
+        taskConfigs: expandedTasks,
+        logDir: job.logDir,
+        models: models,
+        sandbox: sandbox,
+        job: job,
+      ),
     ];
   }
 
@@ -192,26 +168,27 @@ class EvalSetResolver {
 
     // Build task metadata (variant config, system message, etc.)
     final metadata = <String, Object?>{
       'variant': tc.variant.name,
-      if (tc.variant.contextFiles.isNotEmpty)
+      if (tc.variant.files.isNotEmpty ||
+          tc.variant.mcpServers.isNotEmpty ||
+          tc.variant.skills.isNotEmpty ||
+          tc.variant.taskParameters.isNotEmpty)
         'variant_config': {
-          'context_files': tc.variant.contextFiles
-              .map(
-                (cf) => {
-                  'title': cf.metadata.title,
-                  'version': cf.metadata.version,
-                  'content': cf.content,
-                },
-              )
-              .toList(),
-          'mcp_servers': tc.variant.mcpServers,
-          'skill_paths': tc.variant.skillPaths,
-        },
-      if (tc.variant.contextFiles.isEmpty &&
-          (tc.variant.mcpServers.isNotEmpty ||
-              tc.variant.skillPaths.isNotEmpty))
-        'variant_config': {
-          'mcp_servers': tc.variant.mcpServers,
-          'skill_paths': tc.variant.skillPaths,
+          if (tc.variant.files.isNotEmpty)
+            'files': tc.variant.files
+                .map(
+                  (cf) => {
+                    'title': cf.metadata.title,
+                    'version': cf.metadata.version,
+                    'content': cf.content,
+                  },
+                )
+                .toList(),
+          if (tc.variant.mcpServers.isNotEmpty)
+            'mcp_servers': tc.variant.mcpServers,
+          if (tc.variant.skills.isNotEmpty)
+            'skills': tc.variant.skills,
+          if (tc.variant.taskParameters.isNotEmpty)
+            'task_parameters': tc.variant.taskParameters,
         },
       if (tc.systemMessage != null) 'system_message': tc.systemMessage,
       if (tc.saveExamples) 'save_examples': true,
@@ -398,26 +375,12 @@ class EvalSetResolver {
   /// Returns either `"local"` or a `Map` with `type` and `path` keys.
   Object _resolveSandbox(
     String datasetRoot,
-    Job job, {
-    String? branch,
-  }) {
+    Job job,
+  ) {
     final sandboxCfg = job.sandbox ?? {};
     final sandboxType = (sandboxCfg['environment'] as String?) ?? 'local';
     if (sandboxType.isEmpty || sandboxType == 'local') return 'local';
 
-    // Branch override → look up branch-specific sandbox
-    if (branch != null && branchChannels.containsKey(branch)) {
-      final registryKey = branchChannels[branch]!;
-      if (sandboxRegistry.containsKey(registryKey)) {
-        final def = sandboxRegistry[registryKey]!;
-        var sandboxPath = def['path']!;
-        if (!p.isAbsolute(sandboxPath)) {
-          sandboxPath = p.normalize(p.join(datasetRoot, sandboxPath));
-        }
-        return {'type': def['name']!, 'path': sandboxPath};
-      }
-    }
-
     // Named sandbox from registry
     if (sandboxRegistry.containsKey(sandboxType)) {
       final def = sandboxRegistry[sandboxType]!;
@@ -457,27 +420,30 @@ class EvalSetResolver {
         if (!matchesTagFilter(taskTags, job.taskFilters!)) continue;
       }
 
-      // Determine effective variants (intersection)
+      // Get job-level task overrides
+      final jobTask = (job.tasks != null && job.tasks!.containsKey(taskId))
+          ? job.tasks![taskId]
+          : null;
+
+      // Determine effective variants using job-level include/exclude
       final effectiveVariants = <String, Map<String, Object?>>{};
       for (final entry in jobVariants.entries) {
-        if (taskConfig.allowedVariants == null ||
-            taskConfig.allowedVariants!.contains(entry.key)) {
-          effectiveVariants[entry.key] = entry.value;
+        final vName = entry.key;
+
+        // Job-task level include_variants filter
+        if (jobTask?.includeVariants != null &&
+            !jobTask!.includeVariants!.contains(vName)) {
+          continue;
+        }
+        // Job-task level exclude_variants filter
+        if (jobTask?.excludeVariants != null &&
+            jobTask!.excludeVariants!.contains(vName)) {
+          continue;
         }
-      }
 
-      // Filter by task-level variant_filters (tag-based)
-      if (taskConfig.variantFilters != null) {
-        effectiveVariants.removeWhere((name, _) {
-          return !matchesTagFilter([name], taskConfig.variantFilters!);
-        });
+        effectiveVariants[vName] = entry.value;
       }
 
-      // Get job-level task overrides
-      final jobTask = (job.tasks != null && job.tasks!.containsKey(taskId))
-          ? job.tasks![taskId]
-          : null;
-
       // Apply sample filtering
       var samples = taskConfig.samples;
       if (jobTask != null) {
@@ -526,7 +492,6 @@ class EvalSetResolver {
             variant: variant,
             sandboxType: sandboxType,
             systemMessage: systemMessage,
-            allowedVariants: null,
             saveExamples: job.saveExamples,
             examplesDir: examplesDir,
             metadata: mergedMetadata,
@@ -551,9 +516,9 @@ class EvalSetResolver {
     if (vDef.isEmpty) return Variant(name: name);
 
     // Load context files (with glob support)
-    final contextFiles = <ContextFile>[];
+    final files = <ContextFile>[];
     final cfPaths =
-        (vDef['context_files'] as List?)?.cast<String>() ?? const [];
+        (vDef['files'] as List?)?.cast<String>() ?? const [];
     for (final cfPath in cfPaths) {
       if (_isGlob(cfPath)) {
         final matched = _expandGlobFiles(datasetRoot, cfPath);
@@ -563,19 +528,18 @@ class EvalSetResolver {
           );
         }
         for (final f in matched) {
-          contextFiles.add(ContextFile.load(f));
+          files.add(ContextFile.load(f));
         }
       } else {
         final fullPath = p.normalize(p.join(datasetRoot, cfPath));
-        contextFiles.add(ContextFile.load(fullPath));
+        files.add(ContextFile.load(fullPath));
       }
     }
 
     // Resolve skill paths (with glob support)
-    final skillPaths = <String>[];
+    final skills = <String>[];
     final rawSkills =
-        ((vDef['skills'] as List?) ?? (vDef['skill_paths'] as List?) ?? [])
-            .cast<String>();
+        (vDef['skills'] as List?)?.cast<String>() ?? const [];
     for (final skillPathStr in rawSkills) {
       if (_isGlob(skillPathStr)) {
         final matched = _expandGlobDirs(datasetRoot, skillPathStr);
@@ -587,7 +551,7 @@ class EvalSetResolver {
             'No skill directories matched pattern: $skillPathStr',
           );
         }
-        skillPaths.addAll(validDirs);
+        skills.addAll(validDirs);
       } else {
         final skillDir = p.normalize(p.join(datasetRoot, skillPathStr));
         if (!Directory(skillDir).existsSync()) {
@@ -599,16 +563,31 @@ class EvalSetResolver {
             'Each skill directory must contain a SKILL.md file.',
           );
         }
-        skillPaths.add(skillDir);
+        skills.add(skillDir);
       }
     }
 
+    // Parse MCP servers as config objects
+    final mcpServers = <Map<String, Object?>>[];
+    final rawMcpServers = vDef['mcp_servers'] as List? ?? [];
+    for (final srv in rawMcpServers) {
+      if (srv is Map) {
+        mcpServers.add(Map<String, Object?>.from(srv));
+      } else if (srv is String) {
+        // Legacy string format: treat as name
+        mcpServers.add({'name': srv});
+      }
+    }
+
+    // Parse task_parameters
+    final taskParameters =
+        (vDef['task_parameters'] as Map?)?.cast<String, Object?>() ?? {};
+
     return Variant(
       name: name,
-      contextFiles: contextFiles,
-      mcpServers: (vDef['mcp_servers'] as List?)?.cast<String>() ?? [],
-      skillPaths: skillPaths,
-      branch: vDef['branch'] as String?,
+      files: files,
+      mcpServers: mcpServers,
+      skills: skills,
+      taskParameters: taskParameters,
     );
   }
diff --git a/packages/dataset_config_dart/test/eval_set_resolver_test.dart b/packages/dataset_config_dart/test/eval_set_resolver_test.dart
index e76e799..8765930 100644
--- a/packages/dataset_config_dart/test/eval_set_resolver_test.dart
+++ b/packages/dataset_config_dart/test/eval_set_resolver_test.dart
@@ -10,12 +10,10 @@ void main() {
     String func = 'question_answer',
     List<Sample>? samples,
     Variant? variant,
-    List<String>? allowedVariants,
     String? systemMessage,
     String? model,
     int? timeLimit,
     int? messageLimit,
-    TagFilter? variantFilters,
     Map<String, Object?>? metadata,
   }) {
     return ParsedTask(
@@ -32,12 +30,10 @@ void main() {
         ),
       ],
      variant: variant ??
const Variant(), - allowedVariants: allowedVariants, systemMessage: systemMessage, model: model, timeLimit: timeLimit, messageLimit: messageLimit, - variantFilters: variantFilters, metadata: metadata, ); } @@ -242,14 +238,20 @@ void main() { expect(results.first.sandbox, isNull); }); - test('respects allowedVariants on tasks', () { + test('respects includeVariants on job tasks', () { final results = resolver.resolve( [ - makeTask(allowedVariants: ['baseline']), + makeTask(), ], makeJob( models: ['m'], variants: {'baseline': {}, 'full': {}}, + tasks: { + 'test_task': const JobTask( + id: 'test_task', + includeVariants: ['baseline'], + ), + }, ), '/tmp/dataset', ); @@ -373,18 +375,20 @@ void main() { expect(dataset.name, 'my_eval:baseline'); }); - test('variant_filters restricts effective variants', () { + test('excludeVariants restricts effective variants', () { final results = resolver.resolve( [ - makeTask( - variantFilters: const TagFilter( - includeTags: ['baseline'], - ), - ), + makeTask(), ], makeJob( models: ['m'], variants: {'baseline': {}, 'full': {}, 'mcp_only': {}}, + tasks: { + 'test_task': const JobTask( + id: 'test_task', + excludeVariants: ['full', 'mcp_only'], + ), + }, ), '/tmp/dataset', ); diff --git a/packages/dataset_config_dart/test/json_parser_test.dart b/packages/dataset_config_dart/test/json_parser_test.dart index 51dc2b0..a95c994 100644 --- a/packages/dataset_config_dart/test/json_parser_test.dart +++ b/packages/dataset_config_dart/test/json_parser_test.dart @@ -167,7 +167,6 @@ void main() { 'id': 'full_task', 'func': 'my_func', 'system_message': 'Be helpful', - 'allowed_variants': ['baseline', 'full'], 'inspect_task_args': { 'model': 'gemini-pro', 'config': {'temperature': 0.5}, @@ -187,7 +186,6 @@ void main() { final task = tasks.first; expect(task.systemMessage, 'Be helpful'); - expect(task.allowedVariants, ['baseline', 'full']); expect(task.model, 'gemini-pro'); expect(task.config, {'temperature': 0.5}); expect(task.modelRoles, 
{'grader': 'gpt-4o'}); diff --git a/packages/dataset_config_dart/test/parsed_task_test.dart b/packages/dataset_config_dart/test/parsed_task_test.dart index b6fb7c5..cd3c75c 100644 --- a/packages/dataset_config_dart/test/parsed_task_test.dart +++ b/packages/dataset_config_dart/test/parsed_task_test.dart @@ -14,7 +14,7 @@ void main() { expect(task.sandboxType, 'local'); expect(task.saveExamples, false); expect(task.systemMessage, isNull); - expect(task.allowedVariants, isNull); + expect(task.examplesDir, isNull); expect(task.examplesDir, isNull); expect(task.model, isNull); expect(task.config, isNull); @@ -32,7 +32,6 @@ void main() { variant: Variant(name: 'full'), sandboxType: 'podman', systemMessage: 'Be helpful', - allowedVariants: ['baseline', 'full'], saveExamples: true, examplesDir: '/tmp/examples', model: 'gemini-pro', @@ -54,7 +53,6 @@ void main() { expect(task.variant.name, 'full'); expect(task.sandboxType, 'podman'); expect(task.systemMessage, 'Be helpful'); - expect(task.allowedVariants, ['baseline', 'full']); expect(task.saveExamples, true); expect(task.examplesDir, '/tmp/examples'); expect(task.model, 'gemini-pro'); diff --git a/packages/dataset_config_python/src/dataset_config_python/__init__.py b/packages/dataset_config_python/src/dataset_config_python/__init__.py index fc0cf83..e6dd675 100644 --- a/packages/dataset_config_python/src/dataset_config_python/__init__.py +++ b/packages/dataset_config_python/src/dataset_config_python/__init__.py @@ -8,7 +8,6 @@ from dataset_config_python.parser import ParsedTask, find_job_file, parse_job, parse_tasks from dataset_config_python.resolver import ( - DEFAULT_BRANCH_CHANNELS, DEFAULT_SANDBOX_REGISTRY, SandboxConfig, resolve, @@ -17,7 +16,6 @@ from dataset_config_python.writer import write_eval_sets __all__ = [ - "DEFAULT_BRANCH_CHANNELS", "DEFAULT_SANDBOX_REGISTRY", "ParsedTask", "SandboxConfig", diff --git a/packages/dataset_config_python/src/dataset_config_python/models/__init__.py 
b/packages/dataset_config_python/src/dataset_config_python/models/__init__.py index f42caca..3afc978 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/__init__.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/__init__.py @@ -4,6 +4,7 @@ from dataset_config_python.models.dataset import Dataset from dataset_config_python.models.eval_set import EvalSet from dataset_config_python.models.job import Job, JobTask +from dataset_config_python.models.mcp_server_config import McpServerConfig from dataset_config_python.models.sample import Sample from dataset_config_python.models.tag_filter import TagFilter, matches_tag_filter from dataset_config_python.models.task import Task @@ -16,6 +17,7 @@ "EvalSet", "Job", "JobTask", + "McpServerConfig", "Sample", "TagFilter", "Task", diff --git a/packages/dataset_config_python/src/dataset_config_python/models/job.py b/packages/dataset_config_python/src/dataset_config_python/models/job.py index 16f2568..ee0801a 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/job.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/job.py @@ -24,6 +24,12 @@ class JobTask(BaseModel): args: dict[str, Any] | None = None """Per-task argument overrides passed to the task function.""" + include_variants: list[str] | None = None + """Only run these variant names for this task.""" + + exclude_variants: list[str] | None = None + """Exclude these variant names for this task.""" + @staticmethod def from_yaml(task_id: str, data: dict[str, Any] | None) -> JobTask: """Create from parsed YAML data.""" @@ -34,6 +40,8 @@ def from_yaml(task_id: str, data: dict[str, Any] | None) -> JobTask: include_samples=data.get("include-samples"), exclude_samples=data.get("exclude-samples"), args=data.get("args"), + include_variants=data.get("include-variants"), + exclude_variants=data.get("exclude-variants"), ) diff --git 
a/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py b/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py new file mode 100644 index 0000000..42414fb --- /dev/null +++ b/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py @@ -0,0 +1,68 @@ +"""MCP server configuration model — declarative or Python import ref.""" + +from __future__ import annotations + +from typing import Any + +from pydantic import BaseModel, Field, model_validator + + +class McpServerConfig(BaseModel): + """MCP server configuration. + + Supports two modes: + 1. **Declarative** — specify command, args, env, etc. directly. + 2. **Python ref** — point to a pre-built MCPServer object via + ``ref: "my_package.module:variable_name"``. + + When ``ref`` is set, all other fields are ignored. + """ + + # Declarative fields + name: str | None = None + """Human-readable server name (e.g. ``"dart"``).""" + + command: str | None = None + """Executable to run (e.g. ``"dart"``).""" + + args: list[str] = Field(default_factory=list) + """Command-line arguments (e.g. ``["mcp-server"]``).""" + + env: dict[str, str] | None = None + """Extra environment variables for the server process.""" + + cwd: str | None = None + """Working directory for the server process.""" + + transport: str | None = None + """Transport type: ``"stdio"``, ``"sandbox"``, or ``None`` (auto-select).""" + + # Python import escape hatch + ref: str | None = None + """Python import path to a pre-built MCPServer object. + + Format: ``"module.path:variable_name"`` or ``"module.path:factory()"``. + When set, all declarative fields above are ignored. + """ + + @model_validator(mode="after") + def _validate_mode(self) -> McpServerConfig: + if self.ref is None and self.command is None: + raise ValueError( + "McpServerConfig requires either 'ref' (Python import) " + "or 'command' (declarative). Neither was provided." 
+ ) + return self + + @staticmethod + def from_yaml(raw: Any) -> McpServerConfig: + """Parse from YAML — accepts a dict or a string shorthand. + + String shorthand is treated as a ref: + ``"my_package.mcp:server"`` → ``McpServerConfig(ref=...)`` + """ + if isinstance(raw, str): + return McpServerConfig(ref=raw) + if isinstance(raw, dict): + return McpServerConfig(**raw) + raise ValueError(f"Invalid MCP server config: {raw!r}") diff --git a/packages/dataset_config_python/src/dataset_config_python/models/variant.py b/packages/dataset_config_python/src/dataset_config_python/models/variant.py index 4fa39d6..81eb40c 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/variant.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/variant.py @@ -2,9 +2,12 @@ from __future__ import annotations +from typing import Any + from pydantic import BaseModel, Field from dataset_config_python.models.context_file import ContextFile +from dataset_config_python.models.mcp_server_config import McpServerConfig class Variant(BaseModel): @@ -14,23 +17,23 @@ class Variant(BaseModel): performance with and without specific tooling or context. Features are implied by field presence: - - context_files populated → context injection enabled + - files populated → context injection enabled - mcp_servers populated → MCP tools enabled - - skill_paths populated → agent skills enabled + - skills populated → agent skills enabled - all empty → baseline variant """ name: str = "baseline" """User-defined variant name.""" - context_files: list[ContextFile] = Field(default_factory=list) + files: list[ContextFile] = Field(default_factory=list) """Loaded context files (paths resolved by config resolver).""" - mcp_servers: list[str] = Field(default_factory=list) - """MCP server keys to enable (e.g. 
``['dart']``).""" + mcp_servers: list[McpServerConfig] = Field(default_factory=list) + """MCP server configurations (declarative or Python import refs).""" - skill_paths: list[str] = Field(default_factory=list) + skills: list[str] = Field(default_factory=list) """Resolved paths to agent skill directories.""" - branch: str | None = None - """SDK branch/channel to use (e.g. 'stable', 'beta', 'main').""" + task_parameters: dict[str, Any] = Field(default_factory=dict) + """Optional parameters merged into the task config dict at runtime.""" diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py index e381985..4d07ecd 100644 --- a/packages/dataset_config_python/src/dataset_config_python/parser.py +++ b/packages/dataset_config_python/src/dataset_config_python/parser.py @@ -12,7 +12,6 @@ from dataset_config_python.models.job import Job, JobTask from dataset_config_python.models.sample import Sample -from dataset_config_python.models.tag_filter import TagFilter from dataset_config_python.models.variant import Variant # Default log directory (relative to dataset root). 
@@ -35,7 +34,6 @@ def __init__( variant: Variant | None = None, sandbox_type: str = "local", system_message: str | None = None, - allowed_variants: list[str] | None = None, save_examples: bool = False, examples_dir: str | None = None, # Task-level settings @@ -57,7 +55,6 @@ def __init__( version: Any | None = None, metadata: dict[str, Any] | None = None, sandbox_parameters: dict[str, Any] | None = None, - variant_filters: TagFilter | None = None, ): self.id = id self.func = func @@ -65,7 +62,6 @@ def __init__( self.variant = variant or Variant() self.sandbox_type = sandbox_type self.system_message = system_message - self.allowed_variants = allowed_variants self.save_examples = save_examples self.examples_dir = examples_dir self.model = model @@ -86,7 +82,6 @@ def __init__( self.version = version self.metadata = metadata self.sandbox_parameters = sandbox_parameters - self.variant_filters = variant_filters _UNSET: Any = object() @@ -99,7 +94,6 @@ def copy_with( variant: Variant | None = _UNSET, sandbox_type: str | None = _UNSET, system_message: str | None = _UNSET, - allowed_variants: list[str] | None = _UNSET, save_examples: bool | None = _UNSET, examples_dir: str | None = _UNSET, sandbox_parameters: dict[str, Any] | None = _UNSET, @@ -120,7 +114,6 @@ def copy_with( display_name: str | None = _UNSET, version: Any = _UNSET, metadata: dict[str, Any] | None = _UNSET, - variant_filters: TagFilter | None = _UNSET, ) -> ParsedTask: """Create a copy with overrides.""" _U = ParsedTask._UNSET @@ -131,7 +124,6 @@ def copy_with( variant=self.variant if variant is _U else variant, sandbox_type=self.sandbox_type if sandbox_type is _U else sandbox_type, # type: ignore[arg-type] system_message=self.system_message if system_message is _U else system_message, - allowed_variants=self.allowed_variants if allowed_variants is _U else allowed_variants, save_examples=self.save_examples if save_examples is _U else save_examples, # type: ignore[arg-type] examples_dir=self.examples_dir if 
examples_dir is _U else examples_dir, sandbox_parameters=self.sandbox_parameters if sandbox_parameters is _U else sandbox_parameters, @@ -152,7 +144,6 @@ def copy_with( display_name=self.display_name if display_name is _U else display_name, version=self.version if version is _U else version, metadata=self.metadata if metadata is _U else metadata, - variant_filters=self.variant_filters if variant_filters is _U else variant_filters, ) @@ -255,8 +246,6 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: task_workspace = _pre_resolve_to_abs(task_workspace_raw, task_dir) task_tests = _pre_resolve_to_abs(task_tests_raw, task_dir) - allowed_variants = data.get("allowed_variants") - # Parse samples section (optional) samples_raw = data.get("samples") if samples_raw is None: @@ -269,10 +258,6 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: else: samples = _load_samples_section(samples_raw, dataset_root, task_workspace, task_tests, task_dir) - # Parse variant_filters (tag-based variant restriction) - variant_filters_raw = data.get("variant_filters") - variant_filters = TagFilter(**variant_filters_raw) if isinstance(variant_filters_raw, dict) else None - # Task-level Inspect AI args are nested under inspect_task_args task_args = data.get("inspect_task_args") or {} @@ -283,7 +268,6 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: variant=Variant(), samples=samples, system_message=system_message, - allowed_variants=allowed_variants, model=task_args.get("model"), config=task_args.get("config") if isinstance(task_args.get("config"), dict) else None, model_roles=task_args.get("model_roles") if isinstance(task_args.get("model_roles"), dict) else None, @@ -302,7 +286,6 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: version=data.get("version"), metadata=data.get("metadata") if isinstance(data.get("metadata"), dict) else None, 
sandbox_parameters=data.get("sandbox_parameters") if isinstance(data.get("sandbox_parameters"), dict) else None, - variant_filters=variant_filters, ) ] @@ -488,7 +471,7 @@ def parse_job(job_path: str, dataset_root: str) -> Job: for tid, tdata in inline_tasks.items(): tasks[tid] = JobTask.from_yaml(tid, tdata) - # Parse variants + # Parse variants — supports inline dict or list of file paths variants = None variants_raw = data.get("variants") if isinstance(variants_raw, dict): @@ -498,6 +481,20 @@ def parse_job(job_path: str, dataset_root: str) -> Job: variants[str(key)] = dict(value) else: variants[str(key)] = {} + elif isinstance(variants_raw, list): + # List of relative paths to variant definition files + job_dir = os.path.dirname(job_path) + variants = {} + for rel_path in variants_raw: + variant_file = os.path.normpath(os.path.join(job_dir, str(rel_path))) + if not os.path.isfile(variant_file): + raise FileNotFoundError( + f"Variant file not found: {variant_file} " + f"(referenced from {job_path})" + ) + file_data = _read_yaml_file(variant_file) + for vname, vdef in file_data.items(): + variants[str(vname)] = dict(vdef) if isinstance(vdef, dict) else {} # Parse inspect_eval_arguments inspect_eval_arguments = data.get("inspect_eval_arguments") diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 02e14f3..83b2b2e 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -11,6 +11,7 @@ from dataset_config_python.models.dataset import Dataset from dataset_config_python.models.eval_set import EvalSet from dataset_config_python.models.job import Job +from dataset_config_python.models.mcp_server_config import McpServerConfig from dataset_config_python.models.sample import Sample from dataset_config_python.models.tag_filter import matches_tag_filter from 
dataset_config_python.models.task import Task @@ -39,20 +40,12 @@ "podman-main": {"name": "podman", "path": "./sandboxes/podman/compose-main.yaml"}, } -# Default SDK branch → sandbox registry key mapping. -DEFAULT_BRANCH_CHANNELS: dict[str, str] = { - "stable": "podman", - "beta": "podman-beta", - "main": "podman-main", -} - @dataclass class SandboxConfig: - """Sandbox registry and branch-channel mapping.""" + """Sandbox registry for named sandbox definitions.""" registry: dict[str, dict[str, str]] = field(default_factory=dict) - branch_channels: dict[str, str] = field(default_factory=dict) def _is_glob(pattern: str) -> bool: @@ -128,8 +121,7 @@ def resolve_from_parsed( """ sandbox_cfg = sandbox_config or SandboxConfig() registry = sandbox_cfg.registry - channels = sandbox_cfg.branch_channels - return _resolve_job(task_configs, job, dataset_path, registry, channels) + return _resolve_job(task_configs, job, dataset_path, registry) def _resolve_job( @@ -137,7 +129,6 @@ def _resolve_job( job: Any, dataset_root: str, sandbox_registry: dict[str, dict[str, str]], - branch_channels: dict[str, str], ) -> list[EvalSet]: """Resolve task configs and job into EvalSet objects.""" models = job.models if job.models else list(DEFAULT_MODELS) @@ -146,21 +137,14 @@ def _resolve_job( expanded_tasks = _expand_task_configs(dataset_tasks, job, sandbox_type_str, dataset_root) - # Group by branch - groups: dict[str | None, list[ParsedTask]] = {} - for tc in expanded_tasks: - key = tc.variant.branch - groups.setdefault(key, []).append(tc) - return [ _build_eval_set( - task_configs=group, + task_configs=expanded_tasks, log_dir=job.log_dir, models=models, - sandbox=_resolve_sandbox(dataset_root, job, sandbox_registry, branch_channels, branch=channel), + sandbox=_resolve_sandbox(dataset_root, job, sandbox_registry), job=job, ) - for channel, group in groups.items() ] @@ -233,24 +217,26 @@ def _build_eval_set( # Task metadata (variant config, system message, etc.) 
task_metadata: dict[str, Any] = {"variant": tc.variant.name} - if tc.variant.context_files: - task_metadata["variant_config"] = { - "context_files": [ - { - "title": cf.metadata.title, - "version": cf.metadata.version, - "content": cf.content, - } - for cf in tc.variant.context_files - ], - "mcp_servers": tc.variant.mcp_servers, - "skill_paths": tc.variant.skill_paths, - } - elif tc.variant.mcp_servers or tc.variant.skill_paths: - task_metadata["variant_config"] = { - "mcp_servers": tc.variant.mcp_servers, - "skill_paths": tc.variant.skill_paths, - } + variant_config: dict[str, Any] = {} + if tc.variant.files: + variant_config["files"] = [ + { + "title": cf.metadata.title, + "version": cf.metadata.version, + "content": cf.content, + } + for cf in tc.variant.files + ] + if tc.variant.mcp_servers: + variant_config["mcp_servers"] = [ + s.model_dump(exclude_none=True) for s in tc.variant.mcp_servers + ] + if tc.variant.skills: + variant_config["skills"] = tc.variant.skills + if tc.variant.task_parameters: + variant_config["task_parameters"] = tc.variant.task_parameters + if variant_config: + task_metadata["variant_config"] = variant_config if tc.system_message is not None: task_metadata["system_message"] = tc.system_message if tc.save_examples: @@ -311,7 +297,7 @@ def _build_eval_set( eval_set_overrides = eval_args.get("eval_set_overrides") or {} # Helper to get a value from eval_args then overrides - def _get(key, default=None): + def _get(key: str, default: Any = None) -> Any: v = eval_args.get(key) if v is not None: return v @@ -404,9 +390,6 @@ def _resolve_sandbox( dataset_root: str, job: Any, sandbox_registry: dict[str, dict[str, str]], - branch_channels: dict[str, str], - *, - branch: str | None = None, ) -> Any: """Resolve sandbox spec for a given config.""" sandbox_cfg = job.sandbox or {} @@ -414,17 +397,6 @@ def _resolve_sandbox( if not sandbox_type or sandbox_type == "local": return "local" - # Channel override - # Branch override → look up branch-specific 
sandbox - if branch and branch in branch_channels: - registry_key = branch_channels[branch] - if registry_key in sandbox_registry: - defn = sandbox_registry[registry_key] - sandbox_path = defn["path"] - if not os.path.isabs(sandbox_path): - sandbox_path = os.path.normpath(os.path.join(dataset_root, sandbox_path)) - return {"type": defn["name"], "path": sandbox_path} - # Named sandbox from registry if sandbox_type in sandbox_registry: defn = sandbox_registry[sandbox_type] @@ -464,22 +436,21 @@ def _expand_task_configs( if not matches_tag_filter(task_tags, job.task_filters): continue - # Determine effective variants (intersection) - effective_variants: dict[str, dict[str, Any]] = {} - for vname, vdef in job_variants.items(): - if tc.allowed_variants is None or vname in tc.allowed_variants: - effective_variants[vname] = vdef + # Start with all job-level variants + effective_variants: dict[str, dict[str, Any]] = dict(job_variants) - # Filter by task-level variant_filters (tag-based) - if tc.variant_filters is not None: + # Apply per-task include/exclude variants from job.tasks. 
+ job_task = job.tasks.get(task_id) if job.tasks else None + if job_task and job_task.include_variants: effective_variants = { - vname: vdef - for vname, vdef in effective_variants.items() - if matches_tag_filter([vname], tc.variant_filters) + k: v for k, v in effective_variants.items() + if k in job_task.include_variants + } + if job_task and job_task.exclude_variants: + effective_variants = { + k: v for k, v in effective_variants.items() + if k not in job_task.exclude_variants } - - # Get job-level task overrides - job_task = job.tasks.get(task_id) if job.tasks else None # Apply sample filtering samples = tc.samples @@ -496,7 +467,7 @@ def _expand_task_configs( if matches_tag_filter((s.metadata or {}).get("tags", []), job.sample_filters) ] - # Apply system_message from task (no longer overridden by job task) + # Apply system_message from task system_message = tc.system_message # Merge job-task args into metadata @@ -518,7 +489,6 @@ def _expand_task_configs( variant=variant, sandbox_type=sandbox_type, system_message=system_message, - allowed_variants=None, save_examples=job.save_examples, examples_dir=examples_dir, metadata=merged_metadata, @@ -542,9 +512,9 @@ def _resolve_variant( if not vdef: return Variant(name=name) - # Load context files (with glob support) + # Load context files (with glob support) — YAML key is "files" context_files: list[ContextFile] = [] - cf_paths: list[str] = vdef.get("context_files") or [] + cf_paths: list[str] = vdef.get("files") or [] for cf_path in cf_paths: if _is_glob(cf_path): full_pattern = os.path.join(dataset_root, cf_path) @@ -561,9 +531,9 @@ def _resolve_variant( full_path = os.path.normpath(os.path.join(dataset_root, cf_path)) context_files.append(ContextFile.load(full_path)) - # Resolve skill paths (with glob support) + # Resolve skill paths (with glob support) — YAML key is "skills" skill_paths: list[str] = [] - raw_skills: list[str] = vdef.get("skills") or vdef.get("skill_paths") or [] + raw_skills: list[str] = 
vdef.get("skills") or [] for skill_path_str in raw_skills: if _is_glob(skill_path_str): full_pattern = os.path.join(dataset_root, skill_path_str) @@ -587,12 +557,21 @@ def _resolve_variant( ) skill_paths.append(skill_dir) + # Resolve MCP servers + mcp_servers: list[McpServerConfig] = [] + raw_mcp: list[Any] = vdef.get("mcp_servers") or [] + for raw in raw_mcp: + mcp_servers.append(McpServerConfig.from_yaml(raw)) + + # Task parameters + task_parameters: dict[str, Any] = vdef.get("task_parameters") or {} + return Variant( name=name, - context_files=context_files, - mcp_servers=vdef.get("mcp_servers") or [], - skill_paths=skill_paths, - branch=vdef.get("branch"), + files=context_files, + mcp_servers=mcp_servers, + skills=skill_paths, + task_parameters=task_parameters, ) diff --git a/packages/dataset_config_python/tests/test_config.py b/packages/dataset_config_python/tests/test_config.py index 89ea89c..3d79e90 100644 --- a/packages/dataset_config_python/tests/test_config.py +++ b/packages/dataset_config_python/tests/test_config.py @@ -45,8 +45,11 @@ def dataset_dir(tmp_path): - id: sample_2 input: "What is Flutter?" target: "A UI framework." 
- difficulty: medium - tags: ui, framework + metadata: + difficulty: medium + tags: + - ui + - framework """ ) @@ -55,13 +58,10 @@ def dataset_dir(tmp_path): code_gen_dir.mkdir(parents=True) code_gen_yaml = code_gen_dir / "task.yaml" code_gen_yaml.write_text( - """ -id: code_gen + """id: code_gen func: flutter_code_gen -time_limit: 600 -allowed_variants: - - baseline - - context_only +inspect_task_args: + time_limit: 600 samples: inline: - id: sample_1 @@ -75,16 +75,14 @@ def dataset_dir(tmp_path): jobs_dir.mkdir() job_yaml = jobs_dir / "local_dev.yaml" job_yaml.write_text( - """ -logs_dir: ./logs -sandbox_type: local + """logs_dir: ./logs max_connections: 5 models: - google/gemini-2.5-flash variants: baseline: {} context_only: - context_files: [] + files: [] """ ) @@ -162,10 +160,10 @@ def test_dataset_creation(self): def test_variant_defaults(self): v = Variant() assert v.name == "baseline" - assert v.context_files == [] + assert v.files == [] assert v.mcp_servers == [] - assert v.skill_paths == [] - assert v.branch is None + assert v.skills == [] + assert v.task_parameters == {} def test_job_task_from_yaml_none(self): jt = JobTask.from_yaml("my_task", None) @@ -216,11 +214,6 @@ def test_parse_tasks_metadata(self, dataset_dir): assert s2.metadata["tags"] == ["ui", "framework"] assert s2.metadata["difficulty"] == "medium" - def test_parse_tasks_allowed_variants(self, dataset_dir): - tasks = parse_tasks(str(dataset_dir)) - code_gen = next(t for t in tasks if t.id == "code_gen") - assert code_gen.allowed_variants == ["baseline", "context_only"] - def test_parse_tasks_time_limit(self, dataset_dir): tasks = parse_tasks(str(dataset_dir)) code_gen = next(t for t in tasks if t.id == "code_gen") @@ -229,7 +222,6 @@ def test_parse_tasks_time_limit(self, dataset_dir): def test_parse_job(self, dataset_dir): job_path = os.path.join(str(dataset_dir), "jobs", "local_dev.yaml") job = parse_job(job_path, str(dataset_dir)) - assert job.sandbox_type == "local" assert 
job.max_connections == 5 assert job.models == ["google/gemini-2.5-flash"] diff --git a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_job_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_job_template.dart index b3007c5..e4fbb72 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_job_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_job_template.dart @@ -63,8 +63,12 @@ $modelsList # Example: # variants: # baseline: {} # no extra features -# context_only: { context_files: [../../context/flutter.md] } -# mcp_only: { mcp_servers: [dart] } +# context_only: { files: [../../context/flutter.md] } +# mcp_only: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } +# +# Variants can also be loaded from separate files: +# variants: +# - ./variants/common.yaml # ============================================================================= # TASKS @@ -80,11 +84,10 @@ $modelsList # tasks: # inline: # task_id: -# # (use allowed_variants in task.yaml to whitelist variants) # include-samples: [sample1] # Only run specific samples (mutually exclusive with exclude) # exclude-samples: [sample2] # Skip specific samples (mutually exclusive with include) -# system_message: | # Override system prompt for this task -# Custom instructions... 
+# include-variants: [baseline] # Only run these variants for this task +# exclude-variants: [with_mcp] # Skip these variants for this task # # Simple format (run all samples with job-level settings): # tasks: diff --git a/packages/devals_cli/lib/src/dataset/file_templates/job_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/job_template.dart index b76a758..32b3ed6 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/job_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/job_template.dart @@ -80,13 +80,14 @@ $modelsList # Each variant defines what tools/context the agent has access to. # # Format: variant_name: { config } -# baseline: {} # no extra features -# context_only: { context_files: [./path/to.md] } # injects context files -# mcp_only: { mcp_servers: [dart] } # enables MCP servers -# full: { context_files: [...], mcp_servers: [...] } +# baseline: {} # no extra features +# context_only: { files: [./path/to.md] } # injects context files +# mcp_only: { mcp_servers: [{name: dart, command: dart, args: [...]}] } # enables MCP servers +# full: { files: [...], mcp_servers: [...] } # -# Tasks can optionally restrict which variants they support -# via `allowed_variants:` in their task.yaml. +# Variants can also be loaded from separate files: +# variants: +# - ./variants/common.yaml variants: ${variantsMap.toString().trimRight()} @@ -105,8 +106,8 @@ ${variantsMap.toString().trimRight()} # task_id: # include-samples: [sample1] # Only run specific samples (mutually exclusive with exclude) # exclude-samples: [sample2] # Skip specific samples (mutually exclusive with include) -# system_message: | # Override system prompt for this task -# Custom instructions... 
+# include-variants: [baseline] # Only run these variants for this task +# exclude-variants: [with_mcp] # Exclude these variants for this task # # Simple format (run all samples with job-level settings): # tasks: diff --git a/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart index 4aa092d..56eff1a 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart @@ -18,9 +18,8 @@ String taskTemplate({ workspaceValue: workspaceValue, ); - final variantsLine = variants.isNotEmpty - ? 'allowed_variants: [${variants.join(', ')}]\n' - : ''; + final variantsLine = ''; + final systemMessageBlock = systemMessage != null && systemMessage.isNotEmpty ? 'system_message: |\n ${systemMessage.replaceAll('\n', '\n ')}\n' diff --git a/packages/devals_cli/lib/src/dataset/variant_defaults.dart b/packages/devals_cli/lib/src/dataset/variant_defaults.dart index 0f475c9..41a42fe 100644 --- a/packages/devals_cli/lib/src/dataset/variant_defaults.dart +++ b/packages/devals_cli/lib/src/dataset/variant_defaults.dart @@ -11,17 +11,17 @@ enum DefaultVariants { flutterRules( 'flutter_rules', 'Run with Flutter rules context files.', - 'flutter_rules: { context_files: [./context_files/flutter.md] }', + 'flutter_rules: { files: [./context_files/flutter.md] }', ), withSkills( 'with_skills', 'Run with skills files.', - 'with_skills: { skill_paths: [./skills/*] }', + 'with_skills: { skills: [./skills/*] }', ), withMCP( 'with_mcp', 'Run with Dart MCP server available.', - 'with_mcp: { mcp_servers: [dart] }', + 'with_mcp: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] }', ) ; diff --git a/packages/devals_cli/test/dataset/task_template_test.dart b/packages/devals_cli/test/dataset/task_template_test.dart index 463ae62..bfe86a0 100644 --- a/packages/devals_cli/test/dataset/task_template_test.dart +++ 
b/packages/devals_cli/test/dataset/task_template_test.dart @@ -17,12 +17,13 @@ void main() { expect(result, contains('target: |')); }); - test('includes variants when provided', () { + test('does not include variants (variants are job-level)', () { final result = taskTemplate( taskFunc: 'flutter_code_gen', variants: ['baseline', 'mcp_only'], ); - expect(result, contains('variants: [baseline, mcp_only]')); + // Variants are now configured at the job level, not task level + expect(result, isNot(contains('variants:'))); }); test('omits variants line when list is empty', () { From 9d522c19fc072319559b88c4cddcb4af5e137deb Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Wed, 18 Mar 2026 12:53:47 -0700 Subject: [PATCH 12/21] feat: Replace task and sample `workspace` and `tests` fields with `files` and `setup` for more flexible resource management. --- docs/guides/config.md | 10 +- docs/reference/configuration_reference.md | 92 ++++++------- docs/reference/yaml_config.md | 34 ++--- .../lib/src/models/task.dart | 6 + .../lib/src/models/task.freezed.dart | 63 ++++++--- .../lib/src/models/task.g.dart | 4 + .../lib/src/parsed_task.dart | 12 ++ .../lib/src/parsers/yaml_parser.dart | 122 +++--------------- .../lib/src/resolvers/eval_set_resolver.dart | 27 +--- .../src/dataset_config_python/models/task.py | 3 + .../src/dataset_config_python/parser.py | 116 +++++++---------- .../src/dataset_config_python/resolver.py | 23 +--- .../init_templates/init_sample_template.dart | 11 +- .../file_templates/sample_template.dart | 28 ++-- .../dataset/file_templates/task_template.dart | 40 ++---- .../test/dataset/sample_template_test.dart | 49 ++++--- .../test/dataset/task_template_test.dart | 71 ++++------ 17 files changed, 281 insertions(+), 430 deletions(-) diff --git a/docs/guides/config.md b/docs/guides/config.md index 7250b57..54140c3 100644 --- a/docs/guides/config.md +++ b/docs/guides/config.md @@ -1,6 +1,12 @@ -# Config guide +# Configuring jobs + + + +Evals consists of two broad 
pieces: the framework (an InspectAI-based Python framework) and a layered YAML configuration system. + +You define **what** to evaluate (tasks and samples), **how** to run it (jobs), and **where** code executes (sandboxes). The CLI resolves these files into a single manifest and hands it to the Python runner — so most of the time you're just editing YAML. + -Evals uses a layered YAML configuration system. You define **what** to evaluate (tasks and samples), **how** to run it (jobs), and **where** code executes (sandboxes). The CLI resolves these files into a single manifest and hands it to the Python runner — so most of the time you're just editing YAML. This page walks through the main concepts and how they connect. diff --git a/docs/reference/configuration_reference.md b/docs/reference/configuration_reference.md index b0e457f..baeae69 100644 --- a/docs/reference/configuration_reference.md +++ b/docs/reference/configuration_reference.md @@ -8,7 +8,7 @@ The evaluation framework uses the `eval/` directory as its entry point. It conta - Task definitions autodiscovered from `tasks/*/task.yaml` - Job files in `jobs/` that control what to run -- Shared resources (context files, sandboxes, workspaces) +- Shared resources (context files, sandboxes) Configuration is parsed and resolved by the Dart `dataset_config_dart` package, which produces an EvalSet JSON manifest consumed by the Python `dash_evals`.
@@ -24,7 +24,7 @@ eval/ ├── tasks/ # Task definitions (autodiscovered) │ ├── flutter_bug_fix/ │ │ ├── task.yaml # Task config with inline samples -│ │ └── project/ # Workspace files (if applicable) +│ │ └── project/ # Project files (if applicable) │ ├── dart_question_answer/ │ │ └── task.yaml │ └── generate_flutter_app/ @@ -32,14 +32,10 @@ eval/ │ └── todo_tests/ # Test files for a sample ├── context_files/ # Context files injected into prompts │ └── flutter.md -├── sandboxes/ # Container configurations -│ └── podman/ -│ ├── Containerfile -│ └── compose.yaml -└── workspaces/ # Reusable project templates - ├── dart_package/ - ├── flutter_app/ - └── jaspr_app/ +└── sandboxes/ # Container configurations + └── podman/ + ├── Containerfile + └── compose.yaml ``` --- @@ -54,16 +50,10 @@ func: flutter_bug_fix system_message: | You are an expert Flutter developer. Fix the bug and explain your changes. -# Task-level workspace (inherited by all samples) -workspace: - path: ./project - -# Task-level tests (inherited by all samples) -tests: - path: ./tests - -# Restrict which job-level variants apply to this task (optional) -allowed_variants: [baseline, mcp_only] +# Task-level files copied into sandbox (inherited by all samples) +files: + /workspace: ./project +setup: "cd /workspace && flutter pub get" samples: inline: @@ -77,8 +67,8 @@ samples: tags: [bloc, state] - id: navigation_crash - workspace: - path: ./nav_project # Override task-level workspace + files: + /workspace: ./nav_project # Override task-level files input: | Fix the crash when navigating back from the detail screen. target: | @@ -90,27 +80,22 @@ samples: For the complete list of task fields (including Inspect AI `Task` parameters), see the [Task fields table](yaml_config.md#task). 
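The `files`/`setup` inheritance described above can be sketched as a single `task.yaml` fragment (sample ids and paths below are illustrative, not taken from the repo):

```yaml
# task.yaml — task-level files and setup are inherited by every sample
files:
  /workspace: ./project                 # copied into the sandbox for all samples
setup: "cd /workspace && flutter pub get"

samples:
  inline:
    - id: inherits_defaults             # gets /workspace: ./project and the task setup script
      input: |
        Fix the bug in the counter widget.
    - id: custom_project
      files:
        /workspace: ./nav_project       # stacks on task-level files; sample wins on /workspace
      setup: "cd /workspace && dart pub get"   # sample-level setup replaces the task-level script
```

The merge direction matters: `files` maps are combined key-by-key with the sample winning on conflicts, while `setup` is replaced wholesale rather than concatenated.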
-### Workspace/Tests References +### Files and Setup ```yaml -# Reference a reusable template -workspace: - template: flutter_app - -# Reference a path relative to task directory -workspace: - path: ./project - -# Clone from git -workspace: - git: https://github.com/example/repo.git - -# Shorthand (equivalent to path:) -workspace: ./project +# Copy a local directory into the sandbox +files: + /workspace: ./project +setup: "cd /workspace && flutter pub get" + +# Copy individual files +files: + /workspace/lib/main.dart: ./fixtures/main.dart + /workspace/test/widget_test.dart: ./fixtures/test.dart ``` > [!NOTE] -> Paths in `workspace` and `tests` are resolved **relative to the task directory** (e.g., `tasks/flutter_bug_fix/`). +> Paths in `files` values are resolved **relative to the task directory** (e.g., `tasks/flutter_bug_fix/`). Task-level `files` and `setup` are inherited by all samples. Sample-level `files` stack on top (sample wins on key conflict). Sample-level `setup` overrides task-level `setup`. --- @@ -191,9 +176,9 @@ models: # Each key is a variant name; the value is the variant configuration. 
variants: baseline: {} - context_only: { context_files: [./context_files/flutter.md] } - mcp_only: { mcp_servers: [dart] } - full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] } + context_only: { files: [./context_files/flutter.md] } + mcp_only: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } + full: { files: [./context_files/flutter.md], mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } # Inspect AI eval_set() parameters (all optional, nested under inspect_eval_arguments) inspect_eval_arguments: @@ -250,9 +235,9 @@ tasks: # Per-task overrides (keys must match directory names in tasks/) inline: flutter_bug_fix: - allowed_variants: [baseline] # Override variants for this task - include-samples: [sample_001] # Only run these samples - exclude-samples: [slow_test] # Exclude these samples + include-variants: [baseline] # Only run these variants for this task + include-samples: [sample_001] # Only run these samples + exclude-samples: [slow_test] # Exclude these samples ``` --- @@ -264,18 +249,21 @@ Variants modify how tasks execute, controlling context injection, tool availabil ```yaml variants: baseline: {} - context_only: { context_files: [./context_files/flutter.md] } - mcp_only: { mcp_servers: [dart] } - full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] } + context_only: { files: [./context_files/flutter.md] } + mcp_only: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } + full: { files: [./context_files/flutter.md], mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } ``` -Variant sub-fields (`context_files`, `mcp_servers`, `skills`, `flutter_channel`) are documented in the [Job fields table](yaml_config.md#job). +Variant sub-fields (`files`, `mcp_servers`, `skills`, `task_parameters`) are documented in the [Job fields table](yaml_config.md#job). 
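As a sketch, a single variant can combine several of these sub-fields at once (the variant name and paths below are illustrative; see the Job fields table for the exact semantics of each field):

```yaml
variants:
  baseline: {}                            # no extra features
  full_stack:
    files: [./context_files/flutter.md]   # context injected into the prompt
    mcp_servers:
      - name: dart
        command: dart
        args: [mcp-server]
    skills: [./skills/*]                  # glob; only directories containing SKILL.md are kept
```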
-Tasks can optionally restrict which variants apply to them via `allowed_variants` in their `task.yaml`: +Jobs can restrict which variants apply to specific tasks via `include-variants` and `exclude-variants` on the `tasks.` object: ```yaml -# task.yaml — only run baseline and mcp_only variants for this task -allowed_variants: [baseline, mcp_only] +# job.yaml — only run baseline and mcp_only variants for flutter_bug_fix +tasks: + inline: + flutter_bug_fix: + include-variants: [baseline, mcp_only] ``` Glob patterns (containing `*`, `?`, or `[`) are expanded automatically. For skills, only directories containing `SKILL.md` are included. @@ -302,7 +290,7 @@ updated: "2025-12-24" ## Flutter Best Practices Content here is injected into the model's context when the variant -has context_files pointing to this file. +has files pointing to this file. ``` | Field | Type | Required | Description | diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md index 95e0a91..68ab0bb 100644 --- a/docs/reference/yaml_config.md +++ b/docs/reference/yaml_config.md @@ -257,18 +257,18 @@ Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) are neste - `systemMessage` - `system_message` - Custom system prompt for this task -* - `workspace` - - string/object +* - `files` + - object - Y - - - - - - Default workspace for all samples (resolved into `Sample.files` and `Sample.setup`) -* - `tests` - - string/object + - `files` + - `files` + - Files to copy into sandbox for all samples (`{destination: source}`). Task-level files stack with sample-level files (sample wins on key conflict). +* - `setup` + - string - Y - - - - - - Default test files for all samples + - `setup` + - `setup` + - Setup script to run in sandbox before evaluation (overridden by sample-level `setup`) * - `display_name` - string - Y @@ -402,7 +402,7 @@ Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) 
are neste ## Sample -Samples are individual test cases defined either inline in `task.yaml` under `samples.inline`, or in external YAML files referenced via `samples.paths`. Fields like `difficulty`, `tags`, `workspace`, and `tests` should be nested inside the sample's `metadata` dict. +Samples are individual test cases defined either inline in `task.yaml` under `samples.inline`, or in external YAML files referenced via `samples.paths`. Fields like `difficulty` and `tags` should be nested inside the sample's `metadata` dict. ```{list-table} :header-rows: 1 @@ -453,18 +453,6 @@ Samples are individual test cases defined either inline in `task.yaml` under `sa - - - Override system prompt for this sample -* - `workspace` - - string/object - - Y - - - - - - Override task-level workspace (resolved path stored in `metadata["workspace"]`) -* - `tests` - - string/object - - Y - - - - - - Override task-level tests (resolved path stored in `metadata["tests"]`) * - `choices` - list - Y diff --git a/packages/dataset_config_dart/lib/src/models/task.dart b/packages/dataset_config_dart/lib/src/models/task.dart index 19e4f02..1ba0380 100644 --- a/packages/dataset_config_dart/lib/src/models/task.dart +++ b/packages/dataset_config_dart/lib/src/models/task.dart @@ -17,6 +17,12 @@ sealed class Task with _$Task { /// A `Dataset`, a sequence of `Sample` objects, or `null`. Dataset? dataset, + /// Files to copy into sandbox (inherited by all samples). + /// + /// Keys are destination paths, values are source paths, inline text, + /// or inline binary (base64-encoded data URLs). + Map? files, + /// Setup step (always run even when the main solver is replaced). Object? 
setup, diff --git a/packages/dataset_config_dart/lib/src/models/task.freezed.dart b/packages/dataset_config_dart/lib/src/models/task.freezed.dart index b38f7d9..bbf94e6 100644 --- a/packages/dataset_config_dart/lib/src/models/task.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/task.freezed.dart @@ -18,7 +18,11 @@ mixin _$Task { /// Dataset to evaluate. /// /// A `Dataset`, a sequence of `Sample` objects, or `null`. - Dataset? get dataset;/// Setup step (always run even when the main solver is replaced). + Dataset? get dataset;/// Files to copy into sandbox (inherited by all samples). +/// +/// Keys are destination paths, values are source paths, inline text, +/// or inline binary (base64-encoded data URLs). + Map? get files;/// Setup step (always run even when the main solver is replaced). Object? get setup;/// Solver or list of solvers. Defaults to `generate()`. Object? get solver;/// Optional cleanup function for task. /// @@ -76,16 +80,16 @@ $TaskCopyWith get copyWith => _$TaskCopyWithImpl(this as Task, _$ide @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, 
failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.func, func) || other.func == func)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other.sandboxParameters, sandboxParameters)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other.metadata, metadata)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.files, files)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const 
DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.func, func) || other.func == func)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other.sandboxParameters, sandboxParameters)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other.metadata, metadata)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,func,systemMessage,const DeepCollectionEquality().hash(sandboxParameters),name,const 
DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(metadata)]); +int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(files),const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,func,systemMessage,const DeepCollectionEquality().hash(sandboxParameters),name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(metadata)]); @override String toString() { - return 'Task(dataset: $dataset, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, func: $func, systemMessage: $systemMessage, sandboxParameters: $sandboxParameters, name: $name, version: $version, metadata: $metadata)'; + return 'Task(dataset: $dataset, files: $files, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: 
$tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, func: $func, systemMessage: $systemMessage, sandboxParameters: $sandboxParameters, name: $name, version: $version, metadata: $metadata)'; } @@ -96,7 +100,7 @@ abstract mixin class $TaskCopyWith<$Res> { factory $TaskCopyWith(Task value, $Res Function(Task) _then) = _$TaskCopyWithImpl; @useResult $Res call({ - Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'func') String? func,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata + Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? 
earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'func') String? func,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata }); @@ -113,10 +117,11 @@ class _$TaskCopyWithImpl<$Res> /// Create a copy of Task /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? dataset = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? func = freezed,Object? systemMessage = freezed,Object? sandboxParameters = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? dataset = freezed,Object? files = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? func = freezed,Object? systemMessage = freezed,Object? sandboxParameters = freezed,Object? name = freezed,Object? 
version = null,Object? metadata = freezed,}) { return _then(_self.copyWith( dataset: freezed == dataset ? _self.dataset : dataset // ignore: cast_nullable_to_non_nullable -as Dataset?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable +as Dataset?,files: freezed == files ? _self.files : files // ignore: cast_nullable_to_non_nullable +as Map?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable as String?,config: freezed == config ? _self.config : config ,modelRoles: freezed == modelRoles ? _self.modelRoles : modelRoles // ignore: cast_nullable_to_non_nullable as Map?,sandbox: freezed == sandbox ? _self.sandbox : sandbox ,approval: freezed == approval ? _self.approval : approval ,epochs: freezed == epochs ? _self.epochs : epochs ,failOnError: freezed == failOnError ? _self.failOnError : failOnError ,continueOnFail: freezed == continueOnFail ? _self.continueOnFail : continueOnFail // ignore: cast_nullable_to_non_nullable as bool?,messageLimit: freezed == messageLimit ? _self.messageLimit : messageLimit // ignore: cast_nullable_to_non_nullable @@ -224,10 +229,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? 
epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata)? 
$default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Task() when $default != null: -return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);case _: +return $default(_that.dataset,_that.files,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);case _: return orElse(); } @@ -245,10 +250,10 @@ return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.score /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? 
systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? 
metadata) $default,) {final _that = this; switch (_that) { case _Task(): -return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);} +return $default(_that.dataset,_that.files,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);} } /// A variant of `when` that fallback to returning `null` /// @@ -262,10 +267,10 @@ return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.score /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? 
systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata)? 
$default,) {final _that = this; switch (_that) { case _Task() when $default != null: -return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);case _: +return $default(_that.dataset,_that.files,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);case _: return null; } @@ -277,13 +282,30 @@ return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.score @JsonSerializable() class _Task implements Task { - const _Task({this.dataset, this.setup, this.solver, this.cleanup, this.scorer, this.metrics, this.model, this.config, @JsonKey(name: 'model_roles') final Map? modelRoles, this.sandbox, this.approval, this.epochs, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'early_stopping') this.earlyStopping, @JsonKey(name: 'display_name') this.displayName, @JsonKey(name: 'func') this.func, @JsonKey(name: 'system_message') this.systemMessage, @JsonKey(name: 'sandbox_parameters') final Map? 
sandboxParameters, this.name, this.version = 0, final Map? metadata}): _modelRoles = modelRoles,_sandboxParameters = sandboxParameters,_metadata = metadata; + const _Task({this.dataset, final Map? files, this.setup, this.solver, this.cleanup, this.scorer, this.metrics, this.model, this.config, @JsonKey(name: 'model_roles') final Map? modelRoles, this.sandbox, this.approval, this.epochs, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'early_stopping') this.earlyStopping, @JsonKey(name: 'display_name') this.displayName, @JsonKey(name: 'func') this.func, @JsonKey(name: 'system_message') this.systemMessage, @JsonKey(name: 'sandbox_parameters') final Map? sandboxParameters, this.name, this.version = 0, final Map? metadata}): _files = files,_modelRoles = modelRoles,_sandboxParameters = sandboxParameters,_metadata = metadata; factory _Task.fromJson(Map json) => _$TaskFromJson(json); /// Dataset to evaluate. /// /// A `Dataset`, a sequence of `Sample` objects, or `null`. @override final Dataset? dataset; +/// Files to copy into sandbox (inherited by all samples). +/// +/// Keys are destination paths, values are source paths, inline text, +/// or inline binary (base64-encoded data URLs). + final Map? _files; +/// Files to copy into sandbox (inherited by all samples). +/// +/// Keys are destination paths, values are source paths, inline text, +/// or inline binary (base64-encoded data URLs). +@override Map? 
get files { + final value = _files; + if (value == null) return null; + if (_files is EqualUnmodifiableMapView) return _files; + // ignore: implicit_dynamic_type + return EqualUnmodifiableMapView(value); +} + /// Setup step (always run even when the main solver is replaced). @override final Object? setup; /// Solver or list of solvers. Defaults to `generate()`. @@ -396,16 +418,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == 
displayName)&&(identical(other.func, func) || other.func == func)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other._sandboxParameters, _sandboxParameters)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other._metadata, _metadata)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other._files, _files)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || 
other.displayName == displayName)&&(identical(other.func, func) || other.func == func)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other._sandboxParameters, _sandboxParameters)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other._metadata, _metadata)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,func,systemMessage,const DeepCollectionEquality().hash(_sandboxParameters),name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(_metadata)]); +int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(_files),const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const 
DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,func,systemMessage,const DeepCollectionEquality().hash(_sandboxParameters),name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(_metadata)]); @override String toString() { - return 'Task(dataset: $dataset, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, func: $func, systemMessage: $systemMessage, sandboxParameters: $sandboxParameters, name: $name, version: $version, metadata: $metadata)'; + return 'Task(dataset: $dataset, files: $files, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, func: $func, systemMessage: $systemMessage, sandboxParameters: $sandboxParameters, name: $name, version: $version, metadata: $metadata)'; } @@ -416,7 +438,7 @@ abstract mixin class _$TaskCopyWith<$Res> implements $TaskCopyWith<$Res> { factory _$TaskCopyWith(_Task value, $Res Function(_Task) _then) = __$TaskCopyWithImpl; @override @useResult $Res call({ - Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? 
config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'func') String? func,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata + Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'func') String? func,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata }); @@ -433,10 +455,11 @@ class __$TaskCopyWithImpl<$Res> /// Create a copy of Task /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? dataset = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? 
metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? func = freezed,Object? systemMessage = freezed,Object? sandboxParameters = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? dataset = freezed,Object? files = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? func = freezed,Object? systemMessage = freezed,Object? sandboxParameters = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { return _then(_Task( dataset: freezed == dataset ? _self.dataset : dataset // ignore: cast_nullable_to_non_nullable -as Dataset?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable +as Dataset?,files: freezed == files ? 
_self._files : files // ignore: cast_nullable_to_non_nullable +as Map?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable as String?,config: freezed == config ? _self.config : config ,modelRoles: freezed == modelRoles ? _self._modelRoles : modelRoles // ignore: cast_nullable_to_non_nullable as Map?,sandbox: freezed == sandbox ? _self.sandbox : sandbox ,approval: freezed == approval ? _self.approval : approval ,epochs: freezed == epochs ? _self.epochs : epochs ,failOnError: freezed == failOnError ? _self.failOnError : failOnError ,continueOnFail: freezed == continueOnFail ? _self.continueOnFail : continueOnFail // ignore: cast_nullable_to_non_nullable as bool?,messageLimit: freezed == messageLimit ? _self.messageLimit : messageLimit // ignore: cast_nullable_to_non_nullable diff --git a/packages/dataset_config_dart/lib/src/models/task.g.dart b/packages/dataset_config_dart/lib/src/models/task.g.dart index 7752223..0ad2491 100644 --- a/packages/dataset_config_dart/lib/src/models/task.g.dart +++ b/packages/dataset_config_dart/lib/src/models/task.g.dart @@ -10,6 +10,9 @@ _Task _$TaskFromJson(Map json) => _Task( dataset: json['dataset'] == null ? 
null : Dataset.fromJson(json['dataset'] as Map<String, dynamic>), + files: (json['files'] as Map<String, dynamic>?)?.map( + (k, e) => MapEntry(k, e as String), + ), setup: json['setup'], solver: json['solver'], cleanup: json['cleanup'], @@ -42,6 +45,7 @@ _Task _$TaskFromJson(Map<String, dynamic> json) => _Task( Map<String, dynamic> _$TaskToJson(_Task instance) => <String, dynamic>{ 'dataset': instance.dataset, + 'files': instance.files, 'setup': instance.setup, 'solver': instance.solver, 'cleanup': instance.cleanup, diff --git a/packages/dataset_config_dart/lib/src/parsed_task.dart b/packages/dataset_config_dart/lib/src/parsed_task.dart index 64816d8..c2d54f6 100644 --- a/packages/dataset_config_dart/lib/src/parsed_task.dart +++ b/packages/dataset_config_dart/lib/src/parsed_task.dart @@ -24,6 +24,12 @@ class ParsedTask { /// Pass-through dict for sandbox plugin configuration. final Map<String, dynamic>? sandboxParameters; + /// Task-level files to copy into sandbox. + final Map<String, String>? taskFiles; + + /// Task-level setup script. + final String? taskSetup; + // ------------------------------------------------------------------ // Task-level settings (from task.yaml) // ------------------------------------------------------------------ @@ -89,6 +95,8 @@ class ParsedTask { this.saveExamples = false, this.examplesDir, this.sandboxParameters, + this.taskFiles, + this.taskSetup, this.model, this.config, this.modelRoles, @@ -119,6 +127,8 @@ class ParsedTask { bool? saveExamples, String? examplesDir, Map<String, dynamic>? sandboxParameters, + Map<String, String>? taskFiles, + String? taskSetup, String? model, Map<String, dynamic>? config, Map<String, String>? modelRoles, @@ -147,6 +157,8 @@ class ParsedTask { saveExamples: saveExamples ?? this.saveExamples, examplesDir: examplesDir ?? this.examplesDir, sandboxParameters: sandboxParameters ?? this.sandboxParameters, + taskFiles: taskFiles ?? this.taskFiles, + taskSetup: taskSetup ?? this.taskSetup, model: model ?? this.model, config: config ?? this.config, modelRoles: modelRoles ??
this.modelRoles, diff --git a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart index 66f8ac7..f57d4a4 100644 --- a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart +++ b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart @@ -52,13 +52,11 @@ class YamlParser extends Parser { final taskId = (data['id'] as String?) ?? p.basename(taskDir); final func = (data['func'] as String?) ?? taskId; - final taskWorkspaceRaw = data['workspace']; - final taskTestsRaw = data['tests']; final systemMessage = data['system_message'] as String?; - // Pre-resolve task-level paths to absolute - final taskWorkspace = _preResolveToAbs(taskWorkspaceRaw, taskDir); - final taskTests = _preResolveToAbs(taskTestsRaw, taskDir); + // Parse task-level files and setup + final taskFiles = _asStringMap(data['files']); + final taskSetup = data['setup'] as String?; // Parse samples section final samplesRaw = data['samples']; @@ -72,8 +70,7 @@ class YamlParser extends Parser { final samples = _loadSamplesSection( samplesMap, datasetRoot, - taskWorkspace, - taskTests, + taskFiles, taskDir, ); @@ -106,6 +103,8 @@ class YamlParser extends Parser { samples: samples, systemMessage: systemMessage, sandboxParameters: sandboxParameters, + taskFiles: taskFiles, + taskSetup: taskSetup, // Task-level settings model: model, config: config, @@ -136,8 +135,7 @@ class YamlParser extends Parser { List _loadSamplesSection( Map samplesMap, String datasetRoot, - Object? taskWorkspace, - Object? taskTests, + Map? 
taskFiles, String taskDir, ) { final pathPatterns = @@ -168,8 +166,7 @@ class YamlParser extends Parser { _loadSamplesFromFiles( matchedFiles, datasetRoot, - taskWorkspace, - taskTests, + taskFiles, ), ); } @@ -178,7 +175,7 @@ class YamlParser extends Parser { for (final def in inlineDefs) { if (def.isEmpty) continue; samples.add( - _resolveSample(def, taskDir, datasetRoot, taskWorkspace, taskTests), + _resolveSample(def, taskDir, datasetRoot, taskFiles), ); } @@ -189,8 +186,7 @@ class YamlParser extends Parser { List _loadSamplesFromFiles( List sampleFiles, String datasetRoot, - Object? taskWorkspace, - Object? taskTests, + Map? taskFiles, ) { final samples = []; @@ -215,8 +211,7 @@ class YamlParser extends Parser { data, sampleDir, datasetRoot, - taskWorkspace, - taskTests, + taskFiles, ), ); } @@ -237,8 +232,7 @@ class YamlParser extends Parser { Map doc, String baseDir, String datasetRoot, - Object? taskWorkspace, - Object? taskTests, + Map? taskFiles, ) { // --- Validate required fields --- for (final field in ['id', 'input', 'target']) { @@ -252,33 +246,6 @@ class YamlParser extends Parser { // Read metadata fields from the metadata dict final metaRaw = Map.from(doc['metadata'] as Map? ?? {}); - final sampleWorkspace = metaRaw['workspace']; - final sampleTests = metaRaw['tests']; - - // Sample-level overrides task-level - final effectiveWorkspace = sampleWorkspace ?? taskWorkspace; - - String? workspace; - String? workspaceGit; - String? workspaceGitRef; - - if (effectiveWorkspace != null) { - if (effectiveWorkspace is Map && effectiveWorkspace.containsKey('git')) { - workspaceGit = effectiveWorkspace['git'] as String?; - workspaceGitRef = effectiveWorkspace['ref'] as String?; - } else { - final resolveDir = sampleWorkspace != null ? baseDir : datasetRoot; - workspace = _resolveResourcePath(effectiveWorkspace, resolveDir); - } - } - - String? 
tests; - if (sampleTests != null) { - tests = _resolveResourcePath(sampleTests, baseDir); - } else if (taskTests != null) { - tests = _resolveResourcePath(taskTests, datasetRoot); - } - // --- Normalize tags from metadata --- final rawTags = metaRaw['tags']; final List tags; @@ -295,17 +262,19 @@ class YamlParser extends Parser { ...metaRaw, 'difficulty': metaRaw['difficulty'] as String? ?? 'medium', 'tags': tags, - 'workspace': ?workspace, - 'tests': ?tests, - 'workspace_git': ?workspaceGit, - 'workspace_git_ref': ?workspaceGitRef, }; // Parse sample-level fields final choices = (doc['choices'] as List?)?.cast(); final sampleSandbox = doc['sandbox']; final setup = doc['setup'] as String?; - final files = _asStringMap(doc['files']); + final sampleFiles = _asStringMap(doc['files']); + + // Stack files: task-level files + sample-level files (sample wins on conflict) + Map? mergedFiles; + if (taskFiles != null || sampleFiles != null) { + mergedFiles = {...?taskFiles, ...?sampleFiles}; + } return Sample( id: doc['id'] as String, @@ -314,7 +283,7 @@ class YamlParser extends Parser { metadata: metadata, choices: choices, sandbox: sampleSandbox, - files: files, + files: mergedFiles, setup: setup, ); } @@ -435,58 +404,7 @@ class YamlParser extends Parser { } - // ------------------------------------------------------------------ - // Path resolution helpers - // ------------------------------------------------------------------ - - /// Pre-resolve a task-level resource to an absolute path. - Object? _preResolveToAbs(Object? 
resource, String taskDir) { - if (resource == null) return null; - if (resource is String) { - if (resource.startsWith('./') || - resource.startsWith('../') || - resource.startsWith('/')) { - return {'path': p.normalize(p.join(taskDir, resource))}; - } - return resource; - } - - if (resource is Map) { - if (resource.containsKey('path')) { - final pathVal = resource['path'] as String; - return { - ...resource, - 'path': p.normalize(p.join(taskDir, pathVal)), - }; - } - return resource; - } - - return resource; - } - - /// Resolve a workspace/tests resource reference to an absolute path string. - String? _resolveResourcePath(Object? resource, String baseDir) { - if (resource == null) return null; - - if (resource is String) { - if (resource.startsWith('./') || - resource.startsWith('../') || - resource.startsWith('/')) { - return p.normalize(p.join(baseDir, resource)); - } - return null; - } - - if (resource is Map) { - if (resource.containsKey('path')) { - return p.normalize(p.join(baseDir, resource['path'] as String)); - } - } - - return null; - } // ------------------------------------------------------------------ // Log dir helpers diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index d73925c..af23470 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -105,8 +105,6 @@ class EvalSetResolver { final inspectTasks = []; final sandboxCfg = job.sandbox ?? {}; final sandboxTypeStr = (sandboxCfg['environment'] as String?) ?? 'local'; - final isContainer = - sandboxTypeStr.isNotEmpty && sandboxTypeStr != 'local'; // Parse task_defaults from inspect_eval_arguments final evalArgs = job.inspectEvalArguments ?? {}; @@ -126,26 +124,15 @@ class EvalSetResolver { } } - // Build files + setup for sandbox provisioning - Map? files = sample.files; - String? 
setup = sample.setup; - final workspace = sample.metadata?['workspace'] as String?; - final workspaceGit = sample.metadata?['workspace_git'] as String?; - final workspaceGitRef = - sample.metadata?['workspace_git_ref'] as String?; - - if (workspace != null && isContainer) { - files = {...?files, '/workspace': workspace}; - setup ??= 'cd /workspace && flutter pub get'; - enriched['workspace'] = '/workspace'; - } - if (workspaceGit != null) { - enriched['workspace_git'] = workspaceGit; - if (workspaceGitRef != null) { - enriched['workspace_git_ref'] = workspaceGitRef; - } + // Stack files: task-level + sample-level (sample wins on conflict) + Map<String, String>? files; + if (tc.taskFiles != null || sample.files != null) { + files = {...?tc.taskFiles, ...?sample.files}; } + // Setup: sample overrides task + final setup = sample.setup ?? tc.taskSetup; + inspectSamples.add( Sample( id: sample.id, diff --git a/packages/dataset_config_python/src/dataset_config_python/models/task.py b/packages/dataset_config_python/src/dataset_config_python/models/task.py index 5623ab3..bfa0c4d 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/task.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/task.py @@ -31,6 +31,9 @@ class Task(BaseModel): dataset: Dataset | None = None """Inline dataset with samples.""" + files: dict[str, str] | None = None + """Files to copy into sandbox (inherited by all samples).""" + setup: Any | None = None """Setup step (always run even when the main solver is replaced).""" diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py index 4d07ecd..eecd3b4 100644 --- a/packages/dataset_config_python/src/dataset_config_python/parser.py +++ b/packages/dataset_config_python/src/dataset_config_python/parser.py @@ -55,6 +55,9 @@ def __init__( version: Any | None = None, metadata: dict[str, Any] | None = None, sandbox_parameters: dict[str, Any]
| None = None, + # Task-level files and setup + task_files: dict[str, str] | None = None, + task_setup: str | None = None, ): self.id = id self.func = func @@ -82,6 +85,8 @@ def __init__( self.version = version self.metadata = metadata self.sandbox_parameters = sandbox_parameters + self.task_files = task_files + self.task_setup = task_setup _UNSET: Any = object() @@ -97,6 +102,8 @@ def copy_with( save_examples: bool | None = _UNSET, examples_dir: str | None = _UNSET, sandbox_parameters: dict[str, Any] | None = _UNSET, + task_files: dict[str, str] | None = _UNSET, + task_setup: str | None = _UNSET, model: str | None = _UNSET, config: dict[str, Any] | None = _UNSET, model_roles: dict[str, str] | None = _UNSET, @@ -127,6 +134,8 @@ def copy_with( save_examples=self.save_examples if save_examples is _U else save_examples, # type: ignore[arg-type] examples_dir=self.examples_dir if examples_dir is _U else examples_dir, sandbox_parameters=self.sandbox_parameters if sandbox_parameters is _U else sandbox_parameters, + task_files=self.task_files if task_files is _U else task_files, + task_setup=self.task_setup if task_setup is _U else task_setup, model=self.model if model is _U else model, config=self.config if config is _U else config, model_roles=self.model_roles if model_roles is _U else model_roles, @@ -180,32 +189,8 @@ def _resolve_log_dir(logs_dir: str, base_dir: str) -> str: return os.path.normpath(os.path.join(base_dir, logs_dir, timestamp)) -def _pre_resolve_to_abs(resource: Any, task_dir: str) -> Any: - """Pre-resolve a task-level resource to an absolute path.""" - if resource is None: - return None - if isinstance(resource, str): - if resource.startswith("./") or resource.startswith("../") or resource.startswith("/"): - return {"path": os.path.normpath(os.path.join(task_dir, resource))} - return resource - if isinstance(resource, dict): - if "path" in resource: - return {**resource, "path": os.path.normpath(os.path.join(task_dir, resource["path"]))} - return 
resource - return resource - - -def _resolve_resource_path(resource: Any, base_dir: str) -> str | None: - """Resolve a workspace/tests resource reference to an absolute path.""" - if resource is None: - return None - if isinstance(resource, str): - if resource.startswith("./") or resource.startswith("../") or resource.startswith("/"): - return os.path.normpath(os.path.join(base_dir, resource)) - return None - if isinstance(resource, dict) and "path" in resource: - return os.path.normpath(os.path.join(base_dir, resource["path"])) - return None + + # --------------------------------------------------------------------------- @@ -239,12 +224,19 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: task_id = data.get("id") or os.path.basename(task_dir) func_name = data.get("func") or task_id - task_workspace_raw = data.get("workspace") - task_tests_raw = data.get("tests") system_message = data.get("system_message") - task_workspace = _pre_resolve_to_abs(task_workspace_raw, task_dir) - task_tests = _pre_resolve_to_abs(task_tests_raw, task_dir) + # Parse task-level files and setup + task_files = data.get("files") + if isinstance(task_files, dict): + task_files = {str(k): str(v) for k, v in task_files.items()} + else: + task_files = None + task_setup = data.get("setup") + if isinstance(task_setup, str): + pass # already a string + else: + task_setup = None # Parse samples section (optional) samples_raw = data.get("samples") @@ -256,7 +248,7 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: f"'paths' keys, got {type(samples_raw).__name__}" ) else: - samples = _load_samples_section(samples_raw, dataset_root, task_workspace, task_tests, task_dir) + samples = _load_samples_section(samples_raw, dataset_root, task_files, task_dir) # Task-level Inspect AI args are nested under inspect_task_args task_args = data.get("inspect_task_args") or {} @@ -286,6 +278,8 @@ def _load_task_file(task_path: str, dataset_root: str) -> 
list[ParsedTask]: version=data.get("version"), metadata=data.get("metadata") if isinstance(data.get("metadata"), dict) else None, sandbox_parameters=data.get("sandbox_parameters") if isinstance(data.get("sandbox_parameters"), dict) else None, + task_files=task_files, + task_setup=task_setup, ) ] @@ -298,8 +292,7 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: def _load_samples_section( samples_map: dict[str, Any], dataset_root: str, - task_workspace: Any, - task_tests: Any, + task_files: dict[str, str] | None, task_dir: str, ) -> list[Sample]: """Load samples from 'paths' and 'inline' subsections.""" @@ -318,12 +311,12 @@ def _load_samples_section( if not matched: raise FileNotFoundError(f"No sample files matched pattern: {pattern}") - samples.extend(_load_samples_from_files(matched, dataset_root, task_workspace, task_tests)) + samples.extend(_load_samples_from_files(matched, dataset_root, task_files)) for defn in inline_defs: if not defn: continue - samples.append(_resolve_sample(defn, task_dir, dataset_root, task_workspace, task_tests)) + samples.append(_resolve_sample(defn, task_dir, dataset_root, task_files)) return samples @@ -331,8 +324,7 @@ def _load_samples_section( def _load_samples_from_files( sample_files: list[str], dataset_root: str, - task_workspace: Any, - task_tests: Any, + task_files: dict[str, str] | None, ) -> list[Sample]: """Load samples from external YAML files.""" samples: list[Sample] = [] @@ -354,7 +346,7 @@ def _load_samples_from_files( data = yaml.safe_load(doc) if isinstance(data, dict): samples.append( - _resolve_sample(data, sample_dir, dataset_root, task_workspace, task_tests) + _resolve_sample(data, sample_dir, dataset_root, task_files) ) return samples @@ -364,8 +356,7 @@ def _resolve_sample( doc: dict[str, Any], base_dir: str, dataset_root: str, - task_workspace: Any, - task_tests: Any, + task_files: dict[str, str] | None, ) -> Sample: """Resolve a single sample dict into a Sample.""" for field in 
("id", "input", "target"): @@ -377,29 +368,6 @@ def _resolve_sample( # Read metadata fields from the metadata dict meta_raw: dict[str, Any] = doc.get("metadata") or {} - sample_workspace = meta_raw.get("workspace") - sample_tests = meta_raw.get("tests") - - effective_workspace = sample_workspace if sample_workspace is not None else task_workspace - - workspace = None - workspace_git = None - workspace_git_ref = None - - if effective_workspace is not None: - if isinstance(effective_workspace, dict) and "git" in effective_workspace: - workspace_git = effective_workspace.get("git") - workspace_git_ref = effective_workspace.get("ref") - else: - resolve_dir = base_dir if sample_workspace is not None else dataset_root - workspace = _resolve_resource_path(effective_workspace, resolve_dir) - - tests = None - if sample_tests is not None: - tests = _resolve_resource_path(sample_tests, base_dir) - elif task_tests is not None: - tests = _resolve_resource_path(task_tests, dataset_root) - # Normalize tags from metadata raw_tags = meta_raw.get("tags") if isinstance(raw_tags, str): @@ -413,14 +381,18 @@ def _resolve_sample( meta: dict[str, Any] = {**meta_raw} meta["difficulty"] = meta_raw.get("difficulty", "medium") meta["tags"] = tags - if workspace is not None: - meta["workspace"] = workspace - if tests is not None: - meta["tests"] = tests - if workspace_git is not None: - meta["workspace_git"] = workspace_git - if workspace_git_ref is not None: - meta["workspace_git_ref"] = workspace_git_ref + + # Parse sample-level files + sample_files = doc.get("files") + if isinstance(sample_files, dict): + sample_files = {str(k): str(v) for k, v in sample_files.items()} + else: + sample_files = None + + # Stack files: task-level + sample-level (sample wins on conflict) + merged_files: dict[str, str] | None = None + if task_files is not None or sample_files is not None: + merged_files = {**(task_files or {}), **(sample_files or {})} return Sample( id=doc["id"], @@ -429,7 +401,7 @@ def 
_resolve_sample( metadata=meta, choices=doc.get("choices"), sandbox=doc.get("sandbox"), - files=doc.get("files"), + files=merged_files, setup=doc.get("setup"), ) diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 83b2b2e..62db329 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -165,7 +165,6 @@ def _build_eval_set( inspect_tasks: list[Task] = [] sandbox_cfg = job.sandbox or {} sandbox_type_str = sandbox_cfg.get("environment", "local") - is_container = sandbox_type_str and sandbox_type_str != "local" eval_args = job.inspect_eval_arguments or {} task_defaults = eval_args.get("task_defaults") or {} @@ -181,21 +180,13 @@ def _build_eval_set( enriched["examples_dir"] = tc.examples_dir enriched["task_variant"] = f"{tc.id}:{tc.variant.name}" - # Build files + setup for sandbox provisioning - files = dict(sample.files) if sample.files else None - setup = sample.setup - workspace = (sample.metadata or {}).get("workspace") - workspace_git = (sample.metadata or {}).get("workspace_git") - workspace_git_ref = (sample.metadata or {}).get("workspace_git_ref") - - if workspace is not None and is_container: - files = {**(files or {}), "/workspace": workspace} - setup = setup or "cd /workspace && flutter pub get" - enriched["workspace"] = "/workspace" - if workspace_git is not None: - enriched["workspace_git"] = workspace_git - if workspace_git_ref is not None: - enriched["workspace_git_ref"] = workspace_git_ref + # Stack files: task-level + sample-level (sample wins on conflict) + files: dict[str, str] | None = None + if tc.task_files is not None or sample.files is not None: + files = {**(tc.task_files or {}), **(sample.files or {})} + + # Setup: sample overrides task + setup = sample.setup or tc.task_setup inspect_samples.append( Sample( diff --git 
a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart index 589a123..dd6a75a 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart @@ -1,20 +1,21 @@ /// Template for the starter task created by `devals init`. /// /// Creates a task.yaml at tasks/get_started/task.yaml that points at -/// the parent project as its workspace. +/// the parent project via files. String initTaskTemplate() { return ''' # ============================================================================= # Starter Task # ============================================================================= -# This task points at your project root as its workspace and runs a simple +# This task copies your project root into the sandbox and runs a simple # codebase analysis evaluation. func: analyze_codebase -# Workspace: points to the project root containing pubspec.yaml -workspace: - path: ../../ +# Files: copies the project root into /workspace in the sandbox +files: + /workspace: ../../ +setup: "cd /workspace && flutter pub get" samples: inline: diff --git a/packages/devals_cli/lib/src/dataset/file_templates/sample_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/sample_template.dart index 0a2b40e..582f9be 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/sample_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/sample_template.dart @@ -11,7 +11,7 @@ String sampleTemplate({ TemplatePackage? templatePackage, String? 
workspaceValue, }) { - final workspaceSection = _buildSampleWorkspaceSection( + final filesSection = _buildSampleFilesSection( workspaceType, templatePackage: templatePackage, workspaceValue: workspaceValue, @@ -20,7 +20,7 @@ String sampleTemplate({ return ''' - id: $id difficulty: $difficulty - tags: []$workspaceSection + tags: []$filesSection input: | # Write prompt here target: | @@ -28,35 +28,25 @@ String sampleTemplate({ '''; } -/// Builds workspace/tests lines for an inline sample block. +/// Builds files lines for an inline sample block. /// -/// Only needed if the sample overrides the task-level workspace. -String _buildSampleWorkspaceSection( +/// Only needed if the sample overrides the task-level files. +String _buildSampleFilesSection( WorkspaceType? workspaceType, { TemplatePackage? templatePackage, String? workspaceValue, }) { return switch (workspaceType) { - WorkspaceType.git => - ''' - - workspace: - git: ${workspaceValue ?? ''}''', WorkspaceType.path => ''' - workspace: - path: ${workspaceValue ?? ''}''', - WorkspaceType.template => - ''' - - workspace: - template: ${templatePackage?.yamlValue ?? ''}''', + files: + /workspace: ${workspaceValue ?? ''}''', WorkspaceType.create => ''' - workspace: - path: ./project''', + files: + /workspace: ./project''', _ => '', }; } diff --git a/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart index 56eff1a..c3370c9 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart @@ -12,7 +12,7 @@ String taskTemplate({ List variants = const [], String? 
systemMessage, }) { - final workspaceSection = _buildTaskWorkspaceSection( + final filesSection = _buildTaskFilesSection( workspaceType, templatePackage: templatePackage, workspaceValue: workspaceValue, @@ -29,7 +29,7 @@ String taskTemplate({ # Task configuration # See docs/configuration_reference.md for full schema reference. func: $taskFunc -$variantsLine$systemMessageBlock$workspaceSection +$variantsLine$systemMessageBlock$filesSection samples: inline: - id: sample_1 @@ -41,43 +41,31 @@ samples: '''; } -/// Builds the workspace section for a task-level definition. -String _buildTaskWorkspaceSection( +/// Builds the files/setup section for a task-level definition. +String _buildTaskFilesSection( WorkspaceType? workspaceType, { TemplatePackage? templatePackage, String? workspaceValue, }) { return switch (workspaceType) { - WorkspaceType.git => - ''' -workspace: - git: ${workspaceValue ?? ''} - # ref: # Optional -''', WorkspaceType.path => ''' -workspace: - path: ${workspaceValue ?? './project'} -''', - WorkspaceType.template => - ''' -workspace: - template: ${templatePackage?.yamlValue ?? ''} +files: + /workspace: ${workspaceValue ?? 
'./project'} +setup: "cd /workspace && flutter pub get" ''', WorkspaceType.create => ''' -workspace: - path: ./project +files: + /workspace: ./project +setup: "cd /workspace && flutter pub get" ''', _ => ''' -# Workspace configuration (uncomment one): -# workspace: -# template: flutter_app # OR dart_package OR jaspr_app -# workspace: -# path: ./project -# workspace: -# git: +# Files to copy into the sandbox (uncomment as needed): +# files: +# /workspace: ./project +# setup: "cd /workspace && flutter pub get" ''', }; } diff --git a/packages/devals_cli/test/dataset/sample_template_test.dart b/packages/devals_cli/test/dataset/sample_template_test.dart index 51257b6..6ba3140 100644 --- a/packages/devals_cli/test/dataset/sample_template_test.dart +++ b/packages/devals_cli/test/dataset/sample_template_test.dart @@ -21,41 +21,30 @@ void main() { expect(result, contains('tags: []')); }); - test('with git workspace includes git section', () { - final result = sampleTemplate( - id: 'test', - difficulty: 'easy', - workspaceType: WorkspaceType.git, - workspaceValue: 'https://github.com/example/repo.git', - ); - expect(result, contains('git:')); - expect(result, contains('https://github.com/example/repo.git')); - }); - - test('with path workspace includes path section', () { + test('with path workspace includes files section', () { final result = sampleTemplate( id: 'test', difficulty: 'easy', workspaceType: WorkspaceType.path, workspaceValue: './project', ); - expect(result, contains('path:')); - expect(result, contains('./project')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./project')); }); - test('with template workspace includes template section', () { + test('with create workspace includes files section', () { final result = sampleTemplate( id: 'test', difficulty: 'easy', - workspaceType: WorkspaceType.template, - templatePackage: TemplatePackage.flutterApp, + workspaceType: WorkspaceType.create, ); - expect(result, 
contains('flutter_app')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./project')); }); - test('without workspace type has no workspace section', () { + test('without workspace type has no files section', () { final result = sampleTemplate(id: 'test', difficulty: 'easy'); - expect(result, isNot(contains('workspace:'))); + expect(result, isNot(contains('files:'))); }); test('generates indented block for appending to task file', () { @@ -64,22 +53,32 @@ void main() { expect(result, contains(' - id: test')); }); - test('git type with null value uses placeholder', () { + test('path type with null value uses placeholder', () { + final result = sampleTemplate( + id: 'test', + difficulty: 'easy', + workspaceType: WorkspaceType.path, + ); + expect(result, contains('')); + }); + + test('git type falls through to no files section', () { final result = sampleTemplate( id: 'test', difficulty: 'easy', workspaceType: WorkspaceType.git, ); - expect(result, contains('')); + expect(result, isNot(contains('files:'))); }); - test('path type with null value uses placeholder', () { + test('template type falls through to no files section', () { final result = sampleTemplate( id: 'test', difficulty: 'easy', - workspaceType: WorkspaceType.path, + workspaceType: WorkspaceType.template, + templatePackage: TemplatePackage.flutterApp, ); - expect(result, contains('')); + expect(result, isNot(contains('files:'))); }); }); } diff --git a/packages/devals_cli/test/dataset/task_template_test.dart b/packages/devals_cli/test/dataset/task_template_test.dart index bfe86a0..2ac92a9 100644 --- a/packages/devals_cli/test/dataset/task_template_test.dart +++ b/packages/devals_cli/test/dataset/task_template_test.dart @@ -53,83 +53,58 @@ void main() { expect(result, isNot(contains('system_message:'))); }); - group('workspace section', () { - test('generates git workspace', () { - final result = taskTemplate( - taskFunc: 'flutter_bug_fix', - workspaceType: WorkspaceType.git, - 
workspaceValue: 'https://github.com/example/repo', - ); - expect(result, contains('workspace:')); - expect(result, contains('git: https://github.com/example/repo')); - }); - - test('generates path workspace', () { + group('files section', () { + test('generates path files with workspace value', () { final result = taskTemplate( taskFunc: 'flutter_bug_fix', workspaceType: WorkspaceType.path, workspaceValue: './my_project', ); - expect(result, contains('workspace:')); - expect(result, contains('path: ./my_project')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./my_project')); + expect(result, contains('setup:')); }); - test('generates template workspace', () { + test('generates path files with default when value is null', () { final result = taskTemplate( - taskFunc: 'flutter_code_gen', - workspaceType: WorkspaceType.template, - templatePackage: TemplatePackage.flutterApp, + taskFunc: 'flutter_bug_fix', + workspaceType: WorkspaceType.path, ); - expect(result, contains('workspace:')); - expect(result, contains('template: flutter_app')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./project')); }); - test('generates create workspace as path', () { + test('generates create workspace as files', () { final result = taskTemplate( taskFunc: 'flutter_bug_fix', workspaceType: WorkspaceType.create, ); - expect(result, contains('workspace:')); - expect(result, contains('path: ./project')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./project')); + expect(result, contains('setup:')); }); - test('generates commented workspace section when type is null', () { + test('generates commented files section when type is null', () { final result = taskTemplate(taskFunc: 'question_answer'); - expect(result, contains('# Workspace configuration')); - expect(result, contains('# template: flutter_app')); + expect(result, contains('# files:')); + expect(result, contains('# /workspace: 
./project')); }); - test('generates git with default URL when workspaceValue is null', () { + test('git type falls through to commented section', () { final result = taskTemplate( taskFunc: 'flutter_bug_fix', workspaceType: WorkspaceType.git, ); - expect(result, contains('git: ')); + expect(result, contains('# files:')); }); - test('generates path with default when workspaceValue is null', () { + test('template type falls through to commented section', () { final result = taskTemplate( - taskFunc: 'flutter_bug_fix', - workspaceType: WorkspaceType.path, + taskFunc: 'flutter_code_gen', + workspaceType: WorkspaceType.template, ); - expect(result, contains('path: ./project')); + expect(result, contains('# files:')); }); - - test( - 'generates template with placeholder when templatePackage is null', - () { - final result = taskTemplate( - taskFunc: 'flutter_code_gen', - workspaceType: WorkspaceType.template, - ); - expect( - result, - contains( - 'template: ', - ), - ); - }, - ); }); }); } From 87e053e0357e31a073053d4497174bb29f2b77d6 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Wed, 18 Mar 2026 13:03:13 -0700 Subject: [PATCH 13/21] docs: simplify `inspect_task_args` documentation by replacing a detailed list of sub-fields with a concise summary and external reference. --- docs/reference/yaml_config.md | 163 ++++++---------------------------- 1 file changed, 29 insertions(+), 134 deletions(-) diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md index 68ab0bb..625f363 100644 --- a/docs/reference/yaml_config.md +++ b/docs/reference/yaml_config.md @@ -34,21 +34,21 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `sandbox` - `sandbox` - Sandbox configuration. String shorthand (e.g. 
`podman`) is equivalent to `{environment: podman}` -* - `sandbox`\\ +* - `sandbox` \   `.environment` - string - Y - - - Sandbox type: `local`, `docker`, or `podman` (default: `local`) -* - `sandbox`\\ +* - `sandbox` \   `.parameters` - object - Y - - - Pass-through parameters for sandbox plugin configuration -* - `sandbox`\\ +* - `sandbox` \   `.image_prefix` - string - Y @@ -73,32 +73,32 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `variants` - `variants` - Named variant definitions (keys are names, values are config maps). Can also be a list of paths to external variant files. -* - `variants`\ -   `.<name>`\ +* - `variants` \ +   `.<name>` \   `.files` - list - Y - - - Paths or glob patterns to context files -* - `variants`\ -   `.<name>`\ +* - `variants` \ +   `.<name>` \   `.mcp_servers` - list - Y - - - MCP server configurations (list of objects with `name`, `command`, `args`, `env`, `transport`; or a `ref:` string to a Python package) -* - `variants`\ -   `.<name>`\ +* - `variants` \ +   `.<name>` \   `.skills` - list - Y - - - Paths or glob patterns to skill directories -* - `variants`\ -   `.<name>`\ +* - `variants` \ +   `.<name>` \   `.task_parameters` - object - Y @@ -111,14 +111,14 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `taskFilters` - `task_filters` - Tag-based task selection filter -* - `task_filters`\ +* - `task_filters` \   `.include_tags` - list - Y - `TagFilter.includeTags` - `TagFilter.include_tags` - Only run tasks whose metadata tags include **all** of these -* - `task_filters`\ +* - `task_filters` \   `.exclude_tags` - list - Y @@ -143,40 +143,40 @@ Job files define runtime settings for an evaluation run, including sandbox confi - `tasks` - `tasks` - Per-task configurations with inline overrides -* - `tasks`\ -   `.<task_id>`\ +* - `tasks` \ +   `.<task_id>` \   `.include-samples` - list - Y - `JobTask.includeSamples` - `JobTask.include_samples` - Only run these sample IDs -* - `tasks`\ -   `.<task_id>`\ +* - `tasks` \ +   `.<task_id>` \
 `.exclude-samples` - list - Y - `JobTask.excludeSamples` - `JobTask.exclude_samples` - Exclude these sample IDs -* - `tasks`\ -   `.<task_id>`\ +* - `tasks` \ +   `.<task_id>` \   `.args` - object - Y - `JobTask.args` - `JobTask.args` - Per-task argument overrides passed to the task function -* - `tasks`\\ -   `.<task_id>`\\ +* - `tasks` \ +   `.<task_id>` \   `.include-variants` - list - Y - `JobTask.includeVariants` - `JobTask.include_variants` - Only run these variant names for this task -* - `tasks`\\ -   `.<task_id>`\\ +* - `tasks` \ +   `.<task_id>` \   `.exclude-variants` - list - Y @@ -237,14 +237,14 @@ Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) are neste - - - Samples config with `inline` and/or `paths` keys (optional — task can have no samples) -* - `samples`\\ +* - `samples` \   `.inline` - list - Y - - - Inline sample definitions (list of sample objects) -* - `samples`\\ +* - `samples` \   `.paths` - list - Y @@ -292,112 +292,7 @@ Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) are neste - Y - - - - Inspect AI `Task` parameters. See sub-fields below. -* - `inspect_task_args`\\ -   `.model` - - string - - Y - - `model` - - `model` - - Default model for this task -* - `inspect_task_args`\\ -   `.config` - - object - - Y - - `config` - - `config` - - Model generation config (e.g.
`{temperature: 0.2}`) -* - `inspect_task_args`\\ -   `.model_roles` - - object - - Y - - `modelRoles` - - `model_roles` - - Named roles for `get_model()` -* - `inspect_task_args`\\ -   `.sandbox` - - string/object - - Y - - `sandbox` - - `sandbox` - - Sandbox environment type or config -* - `inspect_task_args`\\ -   `.sandbox_parameters` - - object - - Y - - `sandboxParameters` - - `sandbox_parameters` - - Pass-through parameters for sandbox plugin configuration -* - `inspect_task_args`\\ -   `.approval` - - string/object - - Y - - `approval` - - `approval` - - Tool use approval policies -* - `inspect_task_args`\\ -   `.epochs` - - int/object - - Y - - `epochs` - - `epochs` - - Number of times to repeat each sample -* - `inspect_task_args`\\ -   `.fail_on_error` - - number/bool - - Y - - `failOnError` - - `fail_on_error` - - Fail threshold for sample errors -* - `inspect_task_args`\\ -   `.continue_on_fail` - - bool - - Y - - `continueOnFail` - - `continue_on_fail` - - Continue running if `fail_on_error` condition is met -* - `inspect_task_args`\\ -   `.message_limit` - - int - - Y - - `messageLimit` - - `message_limit` - - Max total messages per sample -* - `inspect_task_args`\\ -   `.token_limit` - - int - - Y - - `tokenLimit` - - `token_limit` - - Max total tokens per sample -* - `inspect_task_args`\\ -   `.time_limit` - - int - - Y - - `timeLimit` - - `time_limit` - - Max clock time (seconds) per sample -* - `inspect_task_args`\\ -   `.working_limit` - - int - - Y - - `workingLimit` - - `working_limit` - - Max working time (seconds) per sample -* - `inspect_task_args`\\ -   `.cost_limit` - - float - - Y - - `costLimit` - - `cost_limit` - - Max cost (dollars) per sample -* - `inspect_task_args`\\ -   `.early_stopping` - - string/object - - Y - - `earlyStopping` - - `early_stopping` - - Early stopping callbacks + - Pass-through dict of any valid Inspect AI `Task()` kwargs (e.g. `model`, `time_limit`, `message_limit`, `epochs`, `sandbox`, etc.). 
See [Inspect AI docs](https://inspect.ai-safety-institute.org.uk/) for the full list. ``` ## Sample @@ -432,21 +327,21 @@ Samples are individual test cases defined either inline in `task.yaml` under `sa - `target` - `target` - Expected output or grading criteria -* - `metadata`\\ +* - `metadata` \   `.difficulty` - string - Y - - - `easy`, `medium`, or `hard` -* - `metadata`\\ +* - `metadata` \   `.tags` - list - Y - - - Categories for filtering -* - `metadata`\\ +* - `metadata` \   `.system_message` - string - Y From fc234213661b8a2bdcc6f19561cbc8ac90e0cfaf Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Wed, 18 Mar 2026 13:12:24 -0700 Subject: [PATCH 14/21] docs: Update changelog to detail new job and task configuration options, breaking changes, and documentation improvements. --- .github/workflows/config_parity.yml | 2 +- CHANGELOG.md | 26 ++++++++++++++++++++------ 2 files changed, 21 insertions(+), 7 deletions(-) diff --git a/.github/workflows/config_parity.yml b/.github/workflows/config_parity.yml index a2338af..4ef71af 100644 --- a/.github/workflows/config_parity.yml +++ b/.github/workflows/config_parity.yml @@ -43,4 +43,4 @@ jobs: run: pip install -e packages/dataset_config_python - name: Verify config parity - run: dart run tool/config_parity/bin/config_partiy.dart + run: dart run tool/config_parity/bin/config_parity.dart diff --git a/CHANGELOG.md b/CHANGELOG.md index 13b3b70..891c523 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,17 +8,20 @@ - **`Job.imagePrefix` / `Job.image_prefix`.** Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files. 
-- **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at three levels: +- **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at two levels: - `Job.taskFilters` / `Job.task_filters` — select tasks by metadata tags - `Job.sampleFilters` / `Job.sample_filters` — select samples by metadata tags - - `variant_filters` on task YAML — restrict which variants apply to a task (supplements `allowed_variants`) - **`JobTask.args`.** Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks. -- **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level. Previously only available as a job-level override via `JobTask`. +- **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level. - **`Task.sandboxParameters` / `Task.sandbox_parameters`.** Pass-through dictionary for sandbox plugin configuration. +- **`Task.files` / `Task.setup`.** Task-level file and setup declarations. Task-level `files` stack with sample-level `files` (sample wins on key conflict). Sample-level `setup` overrides task-level `setup`. + +- **Variant `task_parameters`.** Variants can now declare `task_parameters`, an arbitrary dict merged into the task config at runtime. + - **`module:task` syntax.** Task function references can now use `module.path:function_name` format for Python tasks. ### Breaking Changes @@ -27,12 +30,23 @@ - **Sandbox registry is now configurable.** The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations. -- **Workspace resolution uses native Inspect fields.** The `workspace` YAML key remains as parser-level sugar but resolves into Inspect AI's native `Sample.files` and `Sample.setup` fields. 
The `Sample.setup` command is no longer hardcoded to `cd /workspace && flutter pub get`; it is configurable or omitted for non-Flutter tasks. +- **Removed `workspace` and `tests` from task and sample YAML.** Replaced by `files` (a `{destination: source}` map) and `setup` (a shell command string). These are Inspect AI's native `Sample` fields. The old `workspace:` / `tests:` keys and their path/git/template sub-formats are no longer supported. + +- **Consolidated sandbox config.** `Job.sandboxEnvironment`, `Job.sandboxParameters`, `Job.imagePrefix` collapsed into a single `Job.sandbox` map (keys: `environment`, `parameters`, `image_prefix`). + +- **Consolidated Inspect AI eval arguments.** Individual top-level Job fields (`retryAttempts`, `failOnError`, `logLevel`, `maxTasks`, etc.) collapsed into a single `Job.inspectEvalArguments` / `Job.inspect_eval_arguments` pass-through dict. + +- **`inspect_task_args` is now a pass-through dict.** Individual sub-fields (`model`, `epochs`, `time_limit`, etc.) are no longer typed on the `Task` model. The entire `inspect_task_args` section is passed through as-is to Inspect AI's `Task()` constructor. + +- **Removed `JobTask.systemMessage`.** System message is now set at the task level via `Task.systemMessage`. + +- **Variant field renames.** `context_files` → `files`, `skill_paths` → `skills`. Variant-level task restriction uses `include-variants` / `exclude-variants` on the job's `tasks.<task_id>` object instead of task-level `allowed_variants`. ### Documentation -- Updated `docs/reference/yaml_config.md` with all new fields and updated descriptions. -- Updated `docs/guides/config.md` (pending — after implementation). +- Added `docs/reference/yaml_config.md` with complete field-by-field reference tables. +- Updated `docs/reference/configuration_reference.md` with new examples and directory structure. +- Updated `docs/guides/config.md`.
## 11 March, 2025 From be6c4c51aee2bc5662e64b840b130ae96ed03a73 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Wed, 18 Mar 2026 14:23:09 -0700 Subject: [PATCH 15/21] refactor: adjust config parsing for `mcp_servers` string shorthand, rename `context_files` to `files`, and remove `allowed_variants`. --- .../lib/src/resolvers/eval_set_resolver.dart | 4 ++-- tool/config_parity/fixtures/multi_variant/jobs/dev.yaml | 2 +- .../fixtures/multi_variant/tasks/code_gen/task.yaml | 3 --- 3 files changed, 3 insertions(+), 6 deletions(-) diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index af23470..9f0e817 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -561,8 +561,8 @@ class EvalSetResolver { if (srv is Map) { mcpServers.add(Map.from(srv)); } else if (srv is String) { - // Legacy string format: treat as name - mcpServers.add({'name': srv}); + // String shorthand: treat as a ref (Python import path) + mcpServers.add({'ref': srv}); } } diff --git a/tool/config_parity/fixtures/multi_variant/jobs/dev.yaml b/tool/config_parity/fixtures/multi_variant/jobs/dev.yaml index e0e884b..5ec75d4 100644 --- a/tool/config_parity/fixtures/multi_variant/jobs/dev.yaml +++ b/tool/config_parity/fixtures/multi_variant/jobs/dev.yaml @@ -5,7 +5,7 @@ models: variants: baseline: {} context_only: - context_files: [] + files: [] full_mcp: mcp_servers: - my_server diff --git a/tool/config_parity/fixtures/multi_variant/tasks/code_gen/task.yaml b/tool/config_parity/fixtures/multi_variant/tasks/code_gen/task.yaml index d004f01..fb1872a 100644 --- a/tool/config_parity/fixtures/multi_variant/tasks/code_gen/task.yaml +++ b/tool/config_parity/fixtures/multi_variant/tasks/code_gen/task.yaml @@ -1,9 +1,6 @@ id: code_gen func: flutter_code_gen time_limit: 600 -allowed_variants: - - baseline - - 
context_only samples: inline: - id: sample_1 From 506b9f4810fcd16103bc0444029cb1e7c4fe0c49 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Wed, 18 Mar 2026 14:30:16 -0700 Subject: [PATCH 16/21] address code review comment --- .../dash_evals/runner/tasks/task_helpers.py | 39 +++++++++++-------- 1 file changed, 22 insertions(+), 17 deletions(-) diff --git a/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py b/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py index 6f6d44d..5a32f8e 100644 --- a/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py +++ b/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py @@ -72,8 +72,7 @@ def _resolve_mcp_ref(ref: str) -> MCPServer: """ if ":" not in ref: raise ValueError( - f"Invalid MCP server ref '{ref}'. " - "Expected format: 'module.path:variable_name'" + f"Invalid MCP server ref '{ref}'. Expected format: 'module.path:variable_name'" ) module_path, attr_name = ref.rsplit(":", 1) try: @@ -121,7 +120,9 @@ def create_mcp_servers( command = cfg.get("command") if not command: - raise ValueError(f"MCP server config missing 'command': {cfg}") + raise ValueError( + f"MCP server config missing 'command' for server '{cfg.get('name', 'unknown')}' : {cfg}" + ) name = cfg.get("name", command) args = cfg.get("args", []) @@ -133,21 +134,25 @@ def create_mcp_servers( transport = "sandbox" if sandbox_type != "local" else "stdio" if transport == "stdio": - servers.append(mcp_server_stdio( - name=name, - command=command, - args=args, - env=env, - cwd=cwd, - )) + servers.append( + mcp_server_stdio( + name=name, + command=command, + args=args, + env=env, + cwd=cwd, + ) + ) elif transport == "sandbox": - servers.append(mcp_server_sandbox( - name=name, - command=command, - args=args, - env=env, - cwd=cwd, - )) + servers.append( + mcp_server_sandbox( + name=name, + command=command, + args=args, + env=env, + cwd=cwd, + ) + ) else: raise ValueError(f"Unknown MCP transport '{transport}' for server 
'{name}'") From f9c4273fca2070d5e0e1f21e2989699d74c67be4 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Thu, 19 Mar 2026 08:56:21 -0700 Subject: [PATCH 17/21] docs: Overhaul and reorganize documentation guides, replacing quick start and tutorial content with new getting started, evaluation, and configuration guides. --- docs/guides/about_the_framework.md | 238 +++++++++++++ docs/guides/config.md | 98 ----- docs/guides/configuring_jobs.md | 334 ++++++++++++++++++ docs/guides/get_started.md | 228 ++++++++++++ docs/guides/index.md | 10 +- docs/guides/quick_start.md | 136 ------- docs/guides/tutorial.md | 287 --------------- docs/guides/using_the_cli.md | 216 +++++++++++ docs/guides/write_your_first_eval.md | 323 +++++++++++++++++ .../dataset_config_dart.md | 83 ++--- 10 files changed, 1383 insertions(+), 570 deletions(-) create mode 100644 docs/guides/about_the_framework.md delete mode 100644 docs/guides/config.md create mode 100644 docs/guides/configuring_jobs.md create mode 100644 docs/guides/get_started.md delete mode 100644 docs/guides/quick_start.md delete mode 100644 docs/guides/tutorial.md create mode 100644 docs/guides/using_the_cli.md create mode 100644 docs/guides/write_your_first_eval.md diff --git a/docs/guides/about_the_framework.md b/docs/guides/about_the_framework.md new file mode 100644 index 0000000..7eed13e --- /dev/null +++ b/docs/guides/about_the_framework.md @@ -0,0 +1,238 @@ +# About the framework + +You've been using built-in task functions like `bug_fix` and `question_answer`. +This page explains how they work — useful if you want to write custom eval logic +or just understand what happens when you run `devals run`. 
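As a concrete preview of the resolution step covered below, here is a simplified, hypothetical sketch of how a `func` string from a task.yaml could be mapped to an import target. The real resolver lives in `dash_evals`' JSON runner and actually imports the module; this sketch only shows the string handling:

```python
def resolve_func_ref(ref: str) -> tuple[str, str]:
    """Map a task.yaml `func` string to (module_path, function_name).

    Illustrative sketch only; covers the three formats the runner
    accepts: short name, colon syntax, and dotted path.
    """
    if ":" in ref:
        # Colon syntax: "my_package.tasks:my_task"
        module_path, func_name = ref.rsplit(":", 1)
    elif "." in ref:
        # Dotted path: the last segment is the function name
        module_path, func_name = ref.rsplit(".", 1)
    else:
        # Short name: a built-in task under dash_evals.runner.tasks
        module_path, func_name = "dash_evals.runner.tasks", ref
    return module_path, func_name
```

For example, `resolve_func_ref("question_answer")` yields `("dash_evals.runner.tasks", "question_answer")`, while the colon and dotted forms point at external packages.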
+ +--- + +## Architecture overview + +When you run an eval, data flows through three layers: + +``` +YAML config → Dart resolver → JSON manifest → Python runner → Inspect AI +``` + +| Layer | Package | What it does | +|-------|---------|-------------| +| **YAML config** | — | Your `task.yaml` and `job.yaml` files | +| **Dart resolver** | `dataset_config_dart` | Parses YAML, resolves globs and references, produces a JSON manifest | +| **Python runner** | `dash_evals` | Reads the manifest, builds Inspect AI `Task` objects, calls `eval_set()` | +| **Inspect AI** | `inspect_ai` | Runs solver chains, sends prompts, collects responses, runs scorers | + +The `devals` CLI (Dart) orchestrates steps 1–2, then hands off to `run-evals` +(Python) for steps 3–4. + +--- + +## The `dash_evals` package + +### Entry point + +The Python CLI entry point is `run-evals`, defined in +`dash_evals/main.py`. It supports two modes: + +```bash +# Mode 1: From a JSON manifest (what devals uses) +run-evals --json ./eval_set.json + +# Mode 2: Direct CLI arguments (what you used in Part 1) +run-evals --task question_answer --model google/gemini-2.0-flash --dataset samples.json +``` + +### JSON runner + +When using `--json` mode, `json_runner.py` does the heavy lifting: + +1. Reads the manifest file +2. For each task definition, resolves the task function by name +3. Builds an Inspect AI `MemoryDataset` from the inline samples +4. Calls the task function with the dataset and config +5. Collects all `Task` objects and calls `inspect_ai.eval_set()` + +### Task resolution + +The `func` field in your `task.yaml` is resolved to a Python function. 
Three +formats are supported: + +| Format | Example | How it resolves | +|--------|---------|----------------| +| **Short name** | `question_answer` | Looks up `dash_evals.runner.tasks.question_answer` | +| **Colon syntax** | `my_package.tasks:my_task` | Imports `my_package.tasks`, gets `my_task` | +| **Dotted path** | `my_package.tasks.my_task.my_task` | Last segment is the function name | + +Short names work for all built-in tasks. Use colon syntax or dotted paths for +custom tasks in external packages. + +--- + +## Anatomy of a task function + +Every task function follows the same pattern. Here's `question_answer` — +the simplest built-in task: + +```python +from inspect_ai import Task, task +from inspect_ai.dataset import Dataset +from inspect_ai.scorer import model_graded_fact +from inspect_ai.solver import chain_of_thought + +@task +def question_answer(dataset: Dataset, config: dict) -> Task: + system_msg = config.get("system_message") or DEFAULT_QA_SYSTEM_MESSAGE + + solver_chain = [ + add_system_message(system_msg), # 1. Set the system prompt + # context_injector(...) # 2. Inject context files (if variant has them) + chain_of_thought(), # 3. Ask for step-by-step reasoning + # generate() or react(tools=...) # 4. Get the model's response + ] + + return Task( + name=config["task_name"], + dataset=dataset, + solver=solver_chain, + scorer=model_graded_fact(), + time_limit=300, + ) +``` + +**Key ingredients:** + +| Part | Purpose | +|------|---------| +| `@task` | Decorator that registers this function with Inspect AI | +| `dataset` | An Inspect `Dataset` built from your samples | +| `config` | A dict with everything from the JSON manifest — variant, system_message, sandbox_type, etc. 
| +| **Solver chain** | A list of steps that process the prompt and generate a response | +| **Scorer** | Evaluates the model's output against the `target` | + +### Solver chain patterns + +Most tasks build their solver chain from shared helpers in `task_helpers.py`: + +```python +def _build_solver(config, system_msg): + chain = [add_system_message(system_msg)] + + # Inject context files from the variant + append_context_injection(chain, config) + + # Add chain-of-thought reasoning + chain.append(chain_of_thought()) + + # If the variant has MCP servers → use react() agent + # Otherwise → use plain generate() + append_model_interaction(chain, config) + + return chain +``` + +This means that variants automatically affect the solver chain — if a variant +defines `mcp_servers`, the task switches from a simple generate call to a +full ReAct agent loop with tool access. + +### Agentic vs. non-agentic tasks + +| Pattern | Tasks that use it | What happens | +|---------|-------------------|-------------| +| **Non-agentic** | `question_answer`, `code_gen` | System message → chain of thought → single generate | +| **Agentic** | `bug_fix`, `analyze_codebase`, `mcp_tool` | System message → ReAct loop with tools (bash, text editor, MCP) | + +Agentic tasks give the model tools (`bash_session()`, `text_editor()`, MCP servers) +and run in a `react()` loop where the model can take multiple actions before +calling `submit()`. 
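The generate-vs-react branching described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the actual `task_helpers` implementation; the field names are assumptions based on the variant config shown in these docs:

```python
def choose_final_step(config: dict) -> str:
    """Decide the last step of a solver chain (sketch only).

    Mirrors the rule described above: variants that provide tools
    (MCP servers) get a ReAct agent loop, while plain variants get
    a single generate call.
    """
    variant = config.get("variant", {})
    has_tools = bool(variant.get("mcp_servers"))
    return "react" if has_tools else "generate"
```

A baseline variant (`{}`) therefore ends in a plain generate, while a variant that lists `mcp_servers` ends in a ReAct loop, without the task author writing any extra code.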
+ +--- + +## Shared helpers + +The `task_helpers.py` module contains functions used across all tasks: + +| Helper | What it does | +|--------|-------------| +| `append_context_injection(chain, config)` | Adds a `context_injector` solver if the variant has `files` | +| `append_model_interaction(chain, config)` | Adds `react()` (if tools exist) or `generate()` (if not) | +| `get_skill_tool(config)` | Creates a skill tool if the variant has `skills` configured | +| `build_task_metadata(config)` | Builds the metadata dict for the `Task` object | +| `create_mcp_servers(configs, sandbox_type)` | Creates MCP server objects from variant config | +| `validate_sandbox_tools(config, tool_names)` | Checks that sandbox-requiring tools aren't used on local | + +These helpers mean that most of the variant logic (context injection, MCP tools, +skills) is handled **automatically**. You just need to define the core solver +pattern for your task. + +--- + +## Writing your own task + +1. **Create a file** at `packages/dash_evals/src/dash_evals/runner/tasks/your_task.py` + +2. **Write the task function:** + + ```python + from inspect_ai import Task, task + from inspect_ai.dataset import Dataset + from inspect_ai.scorer import model_graded_fact + + from .task_helpers import ( + append_context_injection, + append_model_interaction, + build_task_metadata, + ) + from ..solvers import add_system_message + + @task + def your_task(dataset: Dataset, config: dict) -> Task: + chain = [add_system_message("You are a helpful assistant.")] + append_context_injection(chain, config) + append_model_interaction(chain, config) + + return Task( + name=config["task_name"], + dataset=dataset, + solver=chain, + scorer=model_graded_fact(), + metadata=build_task_metadata(config), + ) + ``` + +3. **Export it** from `runner/tasks/__init__.py`: + + ```python + from .your_task import your_task + ``` + +4. 
**Reference it** in `task.yaml`: + + ```yaml + func: your_task + ``` + + That's it — the JSON runner resolves the short name automatically. + +--- + +## Built-in tasks + +| Task function | Type | What it evaluates | +|--------------|------|-------------------| +| `question_answer` | Non-agentic | Q&A knowledge and reasoning | +| `code_gen` | Non-agentic | Code generation with structured output | +| `flutter_code_gen` | Non-agentic | Flutter-specific code gen (wraps `code_gen`) | +| `bug_fix` | Agentic | Diagnosing and fixing bugs with bash + editor | +| `flutter_bug_fix` | Agentic | Flutter-specific bug fix (wraps `bug_fix`) | +| `analyze_codebase` | Agentic | Exploring and answering questions about code | +| `mcp_tool` | Agentic | Testing MCP tool usage | +| `skill_test` | Agentic | Testing skill file usage in sandboxes | + +--- + +## Further reading + +- {doc}`/reference/yaml_config` — complete field-by-field YAML reference +- {doc}`/reference/configuration_reference` — directory structure and examples +- {doc}`/reference/cli` — full CLI command reference +- [Inspect AI documentation](https://inspect.aisi.org.uk/) — the underlying + evaluation framework diff --git a/docs/guides/config.md b/docs/guides/config.md deleted file mode 100644 index 54140c3..0000000 --- a/docs/guides/config.md +++ /dev/null @@ -1,98 +0,0 @@ -# Configuring jobs - - - -Evals consists of two broad pieces: the framework (InspectAI-based Python framework) - -define **what** to evaluate (tasks and samples), **how** to run it (jobs), and **where** code executes (sandboxes). The CLI resolves these files into a single manifest and hands it to the Python runner — so most of the time you're just editing YAML. - - - -This page walks through the main concepts and how they connect. - -## **Dataset** - -The Dataset is the collection of Tasks and Samples that are run through the python tool. A -Sample is, at a minimum, an input and target. These are essentially test cases. 
- -In evals, the definition of dataset is expanded to include all fixtures of running evals, and all of these definitions exist in the dataset directory of the github. - -| 🗒️ Note! The following diagrams provide a mental model. (They also provide a literal representation of how it works, but…) A lot of this is hidden from you, the user or sample author, so don’t let it overwhelm! | -| :---- | - -![A](/_static/images/evals-dataset.png) - -* **Samples** - individual eval case -* **Models** we run against -* **Variants** - Different configurations for the agent being evaluated, e.g. with Dart MCP, with or without skills, with and without rules files, and every combination of those things. -* **Tasks** - A task is a Python function entrypoint for one “type” of evals. For example, “question_answer”, “code_gen”, “mcp_create_project” are a few of the tasks we support. Each task generally takes a list of specific samples that are configured to run for that task. -* **Workspaces** (The codebase that the agent is tinkering with in an eval) -* **Sandbox definitions** (host machine, podman, docker) -* **Default runtime configurations** - -### **Tasks are the basic unit of defining eval runs.** - -![A](/_static/images/task.png) - -### **Job files are run configuration** - -![A](/_static/images/job.png) - -### **Then evals run based on that job file:** - -![A](/_static/images/eval-set.png) - -This means you care about job files and task files. Job files might look like this: - -- job/main.yaml (runs the whole thing) -- job/ci.yaml (a job that is run as part of ci) -- job/local_dev.yaml (a job that is .gitignored, used for quick iteration) - -## Tag-based filtering - -Jobs can filter which tasks and samples run using tags. 
Tasks and samples define tags in their `metadata`, and jobs reference them via `task_filters` and `sample_filters`: - -```yaml -# job.yaml -task_filters: - include_tags: [code_gen] # only tasks tagged "code_gen" - exclude_tags: [deprecated] # skip deprecated tasks -sample_filters: - include_tags: [flutter] # only samples tagged "flutter" -``` - -- **`include_tags`** — an item must have *all* listed tags to be included -- **`exclude_tags`** — an item is excluded if it has *any* listed tag - -Tag filters work alongside ID-based filtering (`tasks..include-samples` / `exclude-samples`). - -## Task function references - -The `func` field in task YAML identifies the Python `@task` function to run. Three formats are supported: - -| Format | Example | Resolution | -|---|---|---| -| Short name | `question_answer` | Looks up `dash_evals.runner.tasks.question_answer` | -| Colon syntax | `my_package.tasks:my_task` | Imports `my_package.tasks`, gets `my_task` | -| Dotted path | `my_package.tasks.my_task.my_task` | Last segment is the function name | - -## Sandbox configuration - -The sandbox registry is **configurable** — the resolver accepts a registry mapping names to compose files. The default registry is empty; the `devals_cli` passes the Flutter-specific registry: - -```yaml -# job.yaml -sandbox: - environment: podman # looks up "podman" in the registry - image_prefix: us-central1-docker.pkg.dev/my-project/repo/ -``` - -A string shorthand is also supported — `sandbox: podman` is equivalent to `sandbox: {environment: podman}`. - -The `image_prefix` is prepended to image names during sandbox resolution (useful for private registries). - -## Workspace setup - -When `workspace` is specified on a sample and the sandbox is a container (`docker` or `podman`), the resolver maps it to `Sample.files['/workspace']`. The setup command (e.g. `cd /workspace && flutter pub get`) is **not** auto-generated — specify it explicitly in your sample or task YAML via the `setup` field. 
- -For the full field reference, see {doc}`/reference/yaml_config`. diff --git a/docs/guides/configuring_jobs.md b/docs/guides/configuring_jobs.md new file mode 100644 index 0000000..51ed48f --- /dev/null +++ b/docs/guides/configuring_jobs.md @@ -0,0 +1,334 @@ +# Configure jobs + +In {doc}`Part 1 ` and {doc}`Part 2 ` you +wrote tasks and jobs by following a recipe. Now let's understand the full +configuration model so you can build your own from scratch. + +This page walks through every piece of the YAML configuration — building +each file up incrementally. + +--- + +## The three config files + +Everything lives under your `evals/` directory: + +| File | Purpose | +|------|---------| +| `tasks//task.yaml` | Defines **what** to evaluate — the task function, samples, workspace, prompt | +| `jobs/.yaml` | Defines **how** to run it — models, variants, filters, sandbox, limits | +| Context files (optional) | Markdown files injected into the model's prompt via variants | + +The `devals` CLI resolves these into a single JSON manifest and hands it to the +Python runner. Most of the time, you're just editing YAML. + +--- + +## Building a task.yaml + +Let's build a task file from scratch, adding one concept at a time. + +### Start minimal + +The only required field is the task function: + +```yaml +func: question_answer +``` + +This is enough to define a task, but it has no samples — nothing to evaluate yet. + +### Add a sample + +Samples go under `samples.inline`. Each sample needs at minimum an `id`, +`input` (the prompt), and `target` (grading criteria): + +```yaml +func: question_answer + +samples: + inline: + - id: explain_null_safety + input: | + Explain Dart's sound null safety. How does it prevent + null reference errors at compile time? + target: | + Should explain nullable vs non-nullable types, the `?` + suffix, null-aware operators, and how the analyzer enforces + null checks at compile time. 
+``` + +### Add a system message + +A `system_message` customizes the prompt sent to the model before your sample input: + +```yaml +func: question_answer +system_message: | + You are an expert Dart developer. Answer questions with code + examples where appropriate. Be concise. + +samples: + inline: + - id: explain_null_safety + # ... +``` + +### Add files and setup + +For agentic tasks that run code in a sandbox, use `files` and `setup`: + +```yaml +func: bug_fix + +# Copy a project into the sandbox — key = destination, value = source +files: + /workspace: ../../workspaces/my_dart_package +setup: "cd /workspace && dart pub get" + +samples: + inline: + - id: fix_the_bug + input: | + The tests are failing. Find and fix the bug. + target: | + All tests should pass after the fix. +``` + +`files` and `setup` at the task level are **inherited by all samples**. A sample +can override them: + +```yaml +samples: + inline: + - id: fix_the_bug + files: + /workspace: ./custom_project # overrides task-level files + setup: "cd /workspace && pub get" # overrides task-level setup + input: ... +``` + +> [!NOTE] +> File paths in `files` values are resolved **relative to the task directory**. +> Task-level `files` stack with sample-level `files` — on a key conflict, the +> sample wins. + +### Add metadata for filtering + +Samples can carry `metadata` with `tags` and `difficulty`. Jobs use these for filtering: + +```yaml +samples: + inline: + - id: fix_the_bug + metadata: + difficulty: medium + tags: [dart, bug-fix, async] + input: ... + target: ... +``` + +### Use external sample files + +For large datasets, you can keep samples in separate files and reference them +with glob patterns: + +```yaml +func: question_answer + +samples: + paths: + - samples/*.yaml # loads every .yaml in the samples/ subdirectory +``` + +Each external file contains a list of sample objects in the same format as +`samples.inline`. + +--- + +## Building a job.yaml + +Jobs control **how** tasks run. 
Let's build one up. + +### Start with models and tasks + +The bare minimum — which models and which tasks: + +```yaml +models: + - google/gemini-2.5-flash + +tasks: + inline: + explain_null_safety: {} # run all samples with default settings +``` + +### Add variants + +Variants let you test the same task under different conditions. Each variant is a named +map — an empty map `{}` means "no extras" (the baseline): + +```yaml +models: + - google/gemini-2.5-flash + +variants: + baseline: {} + context_only: + files: [./context_files/dart_docs.md] + mcp_only: + mcp_servers: + - name: dart + command: dart + args: [mcp-server] + full: + files: [./context_files/dart_docs.md] + mcp_servers: + - name: dart + command: dart + args: [mcp-server] + +tasks: + inline: + explain_null_safety: {} +``` + +This produces 4 runs per sample (one per variant) × however many models you list. + +**Variant sub-fields:** + +| Field | What it does | +|-------|-------------| +| `files` | Context files injected into the prompt | +| `mcp_servers` | MCP tool servers the model can call | +| `skills` | Skill directories copied into the sandbox | +| `task_parameters` | Extra parameters merged into the task config at runtime | + +### Filter tasks and samples + +Use `task_filters` and `sample_filters` to select subsets by tag: + +```yaml +task_filters: + include_tags: [dart] # only tasks tagged "dart" + exclude_tags: [deprecated] # skip deprecated tasks + +sample_filters: + include_tags: [bug-fix] # only samples tagged "bug-fix" +``` + +- **`include_tags`** — an item must have *all* listed tags to be included +- **`exclude_tags`** — an item is excluded if it has *any* listed tag + +You can also filter per-task using `include-samples` and `exclude-samples`: + +```yaml +tasks: + inline: + fix_math_utils: + include-samples: [fix_factorial] # only run this sample + include-variants: [baseline] # only run this variant +``` + +### Add sandbox configuration + +For tasks that need container execution: + +```yaml 
+sandbox: podman # or "docker" +``` + +You can also pass additional sandbox parameters: + +```yaml +sandbox: + environment: podman + image_prefix: us-central1-docker.pkg.dev/my-project/repo/ +``` + +### Add Inspect AI parameters + +The `inspect_eval_arguments` section passes settings through to Inspect AI's +`eval_set()`: + +```yaml +inspect_eval_arguments: + retry_attempts: 20 + fail_on_error: 0.05 + log_level: info + + # Defaults applied to every task in this job + task_defaults: + time_limit: 600 + message_limit: 50 +``` + +--- + +## Putting it all together + +Here's a complete job file using everything above: + +```{code-block} yaml +--- +caption: evals/jobs/full_example.yaml +--- +models: + - google/gemini-2.5-flash + - anthropic/claude-sonnet-4-20250514 + +sandbox: podman +max_connections: 15 + +variants: + baseline: {} + context_only: + files: [./context_files/dart_docs.md] + with_mcp: + mcp_servers: + - name: dart + command: dart + args: [mcp-server] + +task_filters: + include_tags: [dart] + +tasks: + inline: + fix_math_utils: + exclude-variants: [with_mcp] # MCP not relevant for this task + dart_question_answer: {} + +inspect_eval_arguments: + retry_attempts: 10 + task_defaults: + time_limit: 300 + message_limit: 30 +``` + +This will run: + +- 2 models × 2 applicable variants × all matching samples in `fix_math_utils` +- 2 models × 3 variants × all matching samples in `dart_question_answer` + +--- + +## Summary + +| Concept | Where it lives | What it controls | +|---------|---------------|-----------------| +| **Task** | `tasks//task.yaml` | What to evaluate: function, prompt, workspace, samples | +| **Job** | `jobs/.yaml` | How to run: models, variants, filters, sandbox, limits | +| **Variant** | Inside job YAML | Different configurations for the agent being evaluated | +| **Sample** | Inside task YAML (or external files) | Individual test cases with input/target pairs | +| **Context file** | Referenced by variants | Extra information injected into the 
model's prompt |

For the complete field-by-field reference, see {doc}`/reference/yaml_config`.

---

## Next steps

Now that you understand the configuration model, {doc}`Part 4 `
shows how the `devals` CLI can **generate** most of this config for you — and
what you need to customize in the output.
diff --git a/docs/guides/get_started.md b/docs/guides/get_started.md
new file mode 100644
index 0000000..20f9031
--- /dev/null
+++ b/docs/guides/get_started.md
@@ -0,0 +1,228 @@
# Install and run evals

By the end of this page you'll have installed everything, run an evaluation
with Python and Inspect AI, and (optionally) seen how the `devals` CLI wraps
all of that into a single workflow.

## Prerequisites

| Tool | Version | Notes |
|------|---------|-------|
| [Dart SDK*](https://dart.dev/get-dart) | 3.10+ | Runs the `devals` CLI |
| [Python](https://www.python.org/) | 3.13+ | Runs the `dash_evals` evaluation runner |
| API keys | `GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY` | You need at least one model provider key |

\*Dart isn't required. It powers the CLI, which assists in authoring the
various YAML files and hides some of the framework's complexity. However, the
framework is entirely usable without the CLI.

---

## 1. Set up your project

Create a directory for your evals work and set up a Python virtual environment:

```bash
mkdir my-evals && cd my-evals
python3 -m venv .venv
source .venv/bin/activate
```

### Install `dash_evals` (Python — required)

Install the evaluation runner and its config library from git:

```bash
pip install "dash-evals @ git+https://github.com/flutter/evals.git#subdirectory=packages/dash_evals"
pip install "dataset-config-python @ git+https://github.com/flutter/evals.git#subdirectory=packages/dataset_config_python"
```

This gives you **`dash_evals`** — the runtime that drives
[Inspect AI](https://inspect.aisi.org.uk/) to run evaluations. Its CLI entry
Its CLI entry +point is `run-evals`. + +### Install `devals` CLI (Dart — optional) + +If you have the Dart SDK installed, you can also install the CLI, which automates some of the configuration and eval authoring. + +```bash +dart pub global activate devals --source git https://github.com/flutter/evals.git --git-path packages/devals_cli +``` + +**`devals`** resolves YAML configuration, scaffolds new tasks and jobs, and +wraps `run-evals` and `inspect view` into a single workflow. It reduces the +learning curve but is entirely optional — everything it does can be done with +vanilla Python and Inspect AI commands. + +## 2. Configure an API key + +```bash +export GEMINI_API_KEY=your_key_here +``` + +Set at least one provider key. You can also add it to a `.env` file in your +project directory — `dash_evals` loads it automatically. + +--- + +## 3. Create a minimal dataset + +The basic unit of evals is the [*sample*](). A sample is a single 'test case', which includes, at a minimum, an prompt (called 'input') and a target, which is used to grade the evaluation. + +Create a file called `my_first_sample.json` in your project directory: + +```{code-block} json +--- +caption: my-evals/my_first_sample.json +--- +[ + { + "input": "Explain the difference between `Future.then()` and `async/await` in Dart. When should you prefer one over the other?", + "target": "The answer should explain that both are mechanisms for handling asynchronous code; async/await is syntactic sugar over Futures. It should note that async/await is generally preferred for readability, while .then() can be useful for simple one-off transformations." + } +] +``` + +### 3.2 Run it + +```bash +run-evals \ + --task question_answer \ + --model google/gemini-2.5-flash \ + --dataset ./my_first_sample.json +``` + +This runs that one sample. Here's what just happened: + +1. `run-evals` loaded the `question_answer` [*task*]() function from `dash_evals`. 
A task + is a Python function that desribes the logic required to run a sample. Tasks are the generic, + reusable logic that know how to run your bespoke samples. Some other task examples are + [`generate_code`][] and [`bug_fix`][]. +2. Your dataset, or collection of samples, was passed to the task, which executes its *solver chain*, the + instructions given to the agent being evaluated. +3. Inspect AI drives the agent, collects the response, and scores it with a *scorer*. Scorers vary by task. + In this case, the scorer is [`model_graded_fact`][], a scorer provided by Inspect that asks a second agent + to compare the generated response to our target response. +4. Finally, A log file was written to `./logs/` + +### 3.3 View the results + +Inspect AI ships with a robust log viewer. Launch it: + +```bash +inspect view +``` + +This opens a local web UI where you can browse the run, see the full +conversation transcript, and check how the response was scored. + +> [!TIP] +> `inspect view` finds logs in the current directory by default. Pass a path +> to point it elsewhere: `inspect view ./path/to/logs`. + +--- + +## 4. That's what `devals` wraps + +The commands above — `run-evals`, `inspect view` — are the raw building blocks. +The **`devals` CLI** wraps all of them, helps manage your runtime environment, +and manages the YAML configuration layer we've put on top of Inspect AI, +which replaces the Samples JSON and *many* configuration +options that are otherwise be passed in as CLI flags. All of the quality-of-life +improvements provided by the CLI are described in the [Using the CLI guide][]. + +Importantly, you can still use the Yaml configuration layer without Dart and the CLI, +it's just less automated and requires you writing a bit of python glue code. + +Let's try the `devals` workflow now. 
+ +As a reminder, the install script is: + +```bash +dart pub global activate devals --source git https://github.com/flutter/evals.git --git-path packages/devals_cli +``` + +### 4.1 Check your environment + +```bash +devals doctor +``` + +This verifies Dart, Python, `dash_evals`, API keys, and optional tools like +Podman and Flutter. Fix any errors it reports; warnings are safe to ignore for now. + +### 4.2 Initialize a dataset + +Run `devals init` from your project directory (the `my-evals` directory you +created in step 1): + +```bash +devals init +``` + +This creates: + +``` +my-evals/ +├── devals.yaml # marker file +└── evals/ + ├── tasks/ + │ └── get_started/ + │ └── task.yaml # starter task + sample + └── jobs/ + └── local_dev.yaml # ready-to-run job +``` + +The starter task uses the `analyze_codebase` task function — it asks the model +to explore your project and suggest an improvement. It's a good smoke test that +doesn't require a sandbox. + +### 4.3 Run the eval + +```bash +devals run local_dev +``` + +Behind the scenes, this: + +1. Resolves your YAML config (job + tasks + samples) into a JSON manifest +2. Passes the manifest to `run-evals` (the Python `dash_evals` runner) +3. `dash_evals` calls Inspect AI's `eval_set()`, which sends prompts, scores results, + and writes logs + +To preview the resolved config without making API calls: + +```bash +devals run local_dev --dry-run +``` + +### 4.4 View results + +```bash +devals view +``` + +This is the same Inspect AI log viewer from before, but `devals` automatically +finds your `logs/` directory based on `devals.yaml`. + +--- + +## Recap + +You've now seen the two layers of the system: + +| Layer | What it does | +|-------|-------------| +| **`dash_evals` + Inspect AI** | The engine. Runs tasks, sends prompts, scores responses. | +| **`devals` CLI** | The convenience layer. YAML config, scaffolding, log discovery. | + +Everything `devals` does eventually calls down to `dash_evals` and Inspect AI. 
+Understanding this makes debugging much easier. + +--- + +## Next steps + +You're set up and you've seen the framework in action. In +{doc}`Part 2 `, you'll author a more complex, agentic +evaluation from scratch. diff --git a/docs/guides/index.md b/docs/guides/index.md index 73e04a8..13e1f66 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -1,11 +1,13 @@ # Guides -Get started with evals — learn how to author and run your own evaluations. +Learn how to install and use the evals framework. ```{toctree} :maxdepth: 1 -quick_start -tutorial -config +get_started +write_your_first_eval +configuring_jobs +using_the_cli +about_the_framework ``` diff --git a/docs/guides/quick_start.md b/docs/guides/quick_start.md deleted file mode 100644 index ed93d0e..0000000 --- a/docs/guides/quick_start.md +++ /dev/null @@ -1,136 +0,0 @@ -# Get started - -A guide to using evals as a framework for the local development of your own evals. - -## Prerequisites - -| Tool | Version | Purpose | -|------|---------|---------| -| [Dart SDK](https://dart.dev/get-dart) | 3.10+ | Runs the `devals` CLI | -| [Python](https://www.python.org/) | 3.13+ | Runs the `dash_evals` runner | - -You'll also need an API key for at least one model provider (`GOOGLE_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`). - -## 1. Install the packages - -```bash -git clone https://github.com/flutter/evals.git && cd evals -python3 -m venv .venv -source .venv/bin/activate -pip install -e "packages/dash_evals[dev]" -pip install -e "packages/dataset_config_python[dev]" -dart pub global activate devals --source path packages/devals_cli -``` - -This installs two things: - -- **`devals`** (Dart) — the CLI you'll use for every command. It resolves YAML configuration into a JSON manifest and delegates execution. -- **`dash_evals`** (Python) — the runtime that receives the manifest and drives [Inspect AI](https://inspect.aisi.org.uk/)'s `eval_set()` to actually run evaluations. - -## 2. 
Check your environment - -```bash -devals doctor -``` - -This runs a series of prerequisite checks — Dart SDK, Python version, whether `dash_evals` is installed, API keys, and optional tools like Podman and Flutter. Fix any errors it reports before continuing; warnings are safe to ignore for now. - -## 3. Set up Podman (optional) - -If your evals use containerized execution (`sandbox_type: podman` in a job YAML), you need Podman installed and a container image built. You can skip this step for basic evals that run locally. - -**Install Podman** (macOS): - -```bash -brew install podman -podman machine init -podman machine start -``` - -**Build the Flutter sandbox image:** - -```bash -cd /examples/evals-dataset/evals/sandboxes/podman -podman build -t flutter-sandbox:latest . -``` - -This builds `localhost/flutter-sandbox:latest`, which includes Ubuntu 24.04 and the Flutter SDK. The build takes a few minutes. - -> **Tip:** To target a different Flutter channel, pass `--build-arg FLUTTER_CHANNEL=beta` (or `main`). - -## 4. Configure API keys - -Make sure you have at least one model provider API key set as an environment variable. You can set them in your shell profile or in a `.env` file in your project root. - -```bash -export GEMINI_API_KEY=your_key_here -``` - -## 5. Initialize your dataset - -Run `devals init` from the root of the project you want to evaluate. This is typically a Dart or Flutter project — the scaffolded starter task will point back at your project as its workspace. - -```bash -cd ~/my-flutter-app -devals init -``` - -This creates two things: - -- **`devals.yaml`** in your project root — a marker file that tells the CLI where your eval dataset lives (defaults to `./evals`). 
-- **`evals/`** directory with the following structure: - -``` -my-flutter-app/ -├── devals.yaml # ← marker file -└── evals/ - ├── tasks/ - │ └── get_started/ - │ └── task.yaml # starter task + sample - └── jobs/ - └── local_dev.yaml # job ready to run -``` - -The starter task uses the `analyze_codebase` task function, which asks the model to -explore your project and suggest an improvement. It's a good smoke-test that -doesn't require a sandbox or any extra setup. - - -## 6. Run your first eval - -```bash -devals run local_dev -``` - -Behind the scenes, this: - -1. Resolves your YAML config (job + tasks + samples) into an EvalSet JSON manifest -2. Passes the manifest to the Python `dash_evals` runner -3. `dash_evals` calls Inspect AI's `eval_set()`, which sends prompts, collects responses, and scores results -4. Logs are written to a `logs/` directory (a sibling of `evals/`) - -To preview the resolved configuration without actually making API calls: - -```bash -devals run local_dev --dry-run -``` - -This prints every task × model × variant combination that would execute, so you can verify your setup before spending API credits. - -## 7. View results - -```bash -devals view -``` - -This launches the [Inspect AI log viewer](https://inspect.aisi.org.uk/log-viewer.html) — a local web UI where you can browse runs, inspect individual samples, view scores, and read full conversation transcripts. It automatically finds your `logs/` directory based on `devals.yaml`. 
- ---- - -## Next steps - -- **Add more samples** — `devals create sample` -- **Add tasks** — `devals create task` -- **Create targeted jobs** — `devals create job` -- **Interactive walkthrough** — `devals create pipeline` guides you through creating a sample, task, and job in one go -- **[Follow the tutorial](tutorial.md)** — a hands-on walkthrough of authoring a code-generation task from scratch diff --git a/docs/guides/tutorial.md b/docs/guides/tutorial.md deleted file mode 100644 index fcf8b19..0000000 --- a/docs/guides/tutorial.md +++ /dev/null @@ -1,287 +0,0 @@ -# Author evals - -This tutorial picks up where [Get Started](quick_start.md) left off. -By the end, you'll have: - -1. Authored a task file with two **code-generation** samples -2. Created a job file that targets your new task -3. Run the job and watched Inspect AI execute it -4. Opened the Inspect log viewer to review results - -> [!NOTE] -> This guide assumes you've already completed the [Get Started](quick_start.md) guide and -> have a working `devals` installation with at least one model API key configured. - ---- - -## 1. Create the task - -A **task** tells the framework *what* to evaluate. Each task lives in its own subdirectory -under `evals/tasks/` and contains a `task.yaml` file. - -### 1.1 Set up a workspace - -Code-generation tasks need a **workspace** — a starter project the model writes code into -and where tests run. Create a minimal Dart package to use as a template: - -``` -evals/ -└── workspaces/ - └── dart_package/ - ├── pubspec.yaml - └── lib/ - └── main.dart -``` - -```{code-block} yaml ---- -caption: evals/workspaces/dart_package/pubspec.yaml ---- -name: dart_package_template -description: Minimal Dart package template -version: 1.0.0 -publish_to: none - -environment: - sdk: '>=3.0.0 <4.0.0' - -dev_dependencies: - test: ^1.24.0 -``` - -```{code-block} dart ---- -caption: evals/workspaces/dart_package/lib/main.dart ---- -// Starter file — the model will overwrite this. 
-``` - -> [!TIP] -> You can also point `workspace` at your existing project root, a Flutter app, -> or any directory that already has a `pubspec.yaml`. - -### 1.2 Write a test file - -Each sample can have its own test file that the scorer runs automatically. Create a -test for the first sample: - -``` -evals/ -└── tasks/ - └── dart_code_gen/ - ├── task.yaml ← (you'll create this next) - └── tests/ - └── fizzbuzz_test.dart -``` - -```{code-block} dart ---- -caption: evals/tasks/dart_code_gen/tests/fizzbuzz_test.dart ---- -import 'package:test/test.dart'; -import 'package:dart_package_template/main.dart'; - -void main() { - test('fizzBuzz returns correct values', () { - expect(fizzBuzz(3), 'Fizz'); - expect(fizzBuzz(5), 'Buzz'); - expect(fizzBuzz(15), 'FizzBuzz'); - expect(fizzBuzz(7), '7'); - }); - - test('fizzBuzz handles 1', () { - expect(fizzBuzz(1), '1'); - }); -} -``` - -### 1.3 Write the task.yaml - -Now create the task definition with two inline samples: - -```{code-block} yaml ---- -caption: evals/tasks/dart_code_gen/task.yaml ---- -# ============================================================ -# Task: Dart Code Generation -# ============================================================ -# Uses the built-in `code_gen` task function which: -# 1. Sends the prompt to the model -# 2. Parses the structured code response -# 3. Writes the code into the sandbox workspace -# 4. Runs tests and scores the result - -func: code_gen -workspace: ../../workspaces/dart_package - -samples: - inline: - # ── Sample 1: FizzBuzz ────────────────────────────────── - - id: fizzbuzz - difficulty: easy - tags: [dart, functions] - input: | - Write a top-level function called `fizzBuzz` that takes an - integer `n` and returns a String: - - "Fizz" if n is divisible by 3 - - "Buzz" if n is divisible by 5 - - "FizzBuzz" if divisible by both - - The number as a string otherwise - - Write the complete lib/main.dart file. 
- target: | - The code must define a top-level `String fizzBuzz(int n)` function - that returns the correct value for all cases. - It must pass the tests in test/. - tests: - path: ./tests/fizzbuzz_test.dart - - # ── Sample 2: Stack implementation ────────────────────── - - id: stack_class - difficulty: medium - tags: [dart, data-structures, classes] - input: | - Implement a generic Stack class in Dart with the - following methods: - - push(T item) — adds an item to the top - - T pop() — removes and returns the top item, - throws StateError if empty - - T peek() — returns the top item without removing it, - throws StateError if empty - - bool get isEmpty - - int get length - - Write the complete lib/main.dart file. - target: | - The code must define a generic Stack class with push, - pop, peek, isEmpty, and length. pop and peek must throw - StateError when the stack is empty. -``` - -**Key fields explained:** - -| Field | What it does | -|-------|-------------| -| `func` | The Python `@task` function that runs the evaluation. `code_gen` is a built-in generic code-generation task. | -| `workspace` | Path to the starter project (relative to the task directory). | -| `samples.inline` | A list of test cases, each with an `input` prompt and a `target` grading criteria. | -| `tests.path` | Path to test files the scorer runs against the generated code. | - -> [!NOTE] -See [Tasks](../reference/configuration_reference.md#task-files) and [Samples](../reference/configuration_reference.md#sample-files) for the -> complete field reference. - ---- - -## 2. Create the job - -A **job** controls *how* to run your tasks — which models to use, how many -connections, and which tasks/variants to include. 
- -Create `evals/jobs/tutorial.yaml`: - -```{code-block} yaml ---- -caption: evals/jobs/tutorial.yaml ---- -# ============================================================ -# Job: tutorial -# ============================================================ -# A focused job for the tutorial walkthrough. - -# Which model(s) to evaluate -models: - - google/gemini-2.5-flash - -# Only run the code-gen task we just created -tasks: - inline: - dart_code_gen: {} -``` - -That's the minimal job — it will: - -- Evaluate `google/gemini-2.5-flash` -- Run every sample in the `dart_code_gen` task -- Use the default `baseline` variant (no extra tools or context) - -> [!TIP] -> You can add **variants** to test the model with additional context or tools. -> For example: -> ```yaml -> variants: -> baseline: {} -> with_context: -> context_files: [./context_files/dart_docs.md] -> ``` -> See [Configuration Overview](../reference/configuration_reference.md#variants) for details. - ---- - -## 3. Run the job - -Make sure you're in your project directory (the one containing `devals.yaml`), then run: - -```bash -devals run tutorial -``` - -What happens behind the scenes: - -1. The Dart `dataset_config_dart` package resolves your YAML into an EvalSet JSON manifest -2. The Python `dash_evals` reads the manifest and calls Inspect AI's `eval_set()` -3. Inspect AI creates a sandbox, sets up the workspace, sends prompts, runs tests, and scores results -4. Logs are written to the `logs/` directory - -### Dry run first - -To preview the resolved configuration without making any API calls: - -```bash -devals run tutorial --dry-run -``` - -This prints a summary of every task × model × variant combination that would -execute, so you can verify everything looks right before spending API credits. - -### What to expect - -When the eval runs, you'll see Inspect AI's interactive terminal display showing -progress for each sample. 
A typical run with two samples against one model takes -1–3 minutes, depending on the model's response time. - ---- - -## 4. View the results - -After the run completes, launch the Inspect AI log viewer: - -```bash -devals view -``` - -This opens a local web UI (powered by Inspect AI) where you can: - -- **Browse runs** — see each task × model × variant combination -- **Inspect samples** — view the model's generated code, scores, and any test output -- **Compare variants** — if you defined multiple variants, compare how they performed side-by-side - -The viewer automatically points at your `logs/` directory. To view logs from a -specific directory: - -```bash -devals view path/to/logs -``` - ---- - -## Next steps - -Now that you've run your first custom evaluation, here are some things to try: - -- **Add more samples** to your task: `devals create sample` -- **Try different task types** — `question_answer`, `bug_fix`, or `flutter_code_gen`. See [all available task functions](../contributing/packages/dash_evals.md). -- **Add variants** to test how context files or MCP tools affect performance. See [Variants](config/about.md#variants). -- **Run multiple models** by adding more entries to the `models` list in your job file -- **Read the config reference** for [Jobs](../reference/configuration_reference.md#job-files), [Tasks](../reference/configuration_reference.md#task-files), and [Samples](../reference/configuration_reference.md#sample-files) \ No newline at end of file diff --git a/docs/guides/using_the_cli.md b/docs/guides/using_the_cli.md new file mode 100644 index 0000000..b105d91 --- /dev/null +++ b/docs/guides/using_the_cli.md @@ -0,0 +1,216 @@ +# Use the CLI + +You've written tasks and jobs by hand. The `devals` CLI can generate most of +that configuration for you — this page shows how, and what you'll want to +customize afterward. 
+ +--- + +## Scaffolding commands + +### `devals init` + +Initializes a fresh project for evals: + +```bash +cd ~/my-project +devals init +``` + +**What it creates:** + +``` +my-project/ +├── devals.yaml # marker file +└── evals/ + ├── tasks/ + │ └── get_started/ + │ └── task.yaml # starter task + └── jobs/ + └── local_dev.yaml # ready-to-run job +``` + +**What to customize:** + +- The starter task uses `func: analyze_codebase` — fine for a smoke test, but + you'll want to change `func` to match your eval type (`question_answer`, + `bug_fix`, `code_gen`, etc.) +- The job defaults to `google/gemini-2.0-flash`. Update `models:` to the + provider(s) you want to test. +- `files` points at `../../` (your project root). Update if your workspace + lives elsewhere. + +### `devals create pipeline` + +An interactive walkthrough that creates a sample, task, and job in one go. +Great for first-timers: + +```bash +devals create pipeline +``` + +It prompts you for: +1. A sample ID and prompt +2. Which task function to use +3. A job name and model selection + +The result is a fully wired-up set of YAML files ready to `devals run`. 
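+The exact output depends on your answers. As a rough sketch, assuming a
+hypothetical sample ID (`future_wait`), task function (`question_answer`), and
+job name (`smoke_test`), the generated files might look something like:
+
+```yaml
+# evals/tasks/explain_futures/task.yaml (all names hypothetical)
+func: question_answer
+
+samples:
+  inline:
+    - id: future_wait
+      input: |
+        What does Future.wait do in Dart?
+      target: |
+        The answer should explain that Future.wait runs the given
+        futures concurrently and completes with a list of their results.
+```
+
+```yaml
+# evals/jobs/smoke_test.yaml (hypothetical name)
+models:
+  - google/gemini-2.5-flash
+
+tasks:
+  inline:
+    explain_futures: {}
+```
+
+Treat these as a starting point: the prompts produce skeletons, and you refine
+the `input`, `target`, and model selection by hand afterward.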
+ +### `devals create task` + +Creates a new task directory with a starter `task.yaml`: + +```bash +devals create task +``` + +**Prompts for:** +- Task ID (becomes the directory name under `tasks/`) +- Task function (selected from the Python registry) +- Optional system message + +**What to customize after:** +- Add your `samples` — the generated file is a skeleton +- Add `files` and `setup` if your task needs a workspace +- Add `metadata` with tags for filtering + +### `devals create sample` + +Adds a new sample interactively: + +```bash +devals create sample +``` + +**Prompts for:** +- Sample ID (snake_case) +- Difficulty level +- Whether a workspace is needed + +**What to customize after:** +- Write a specific `input` prompt — the generated placeholder is generic +- Write grading criteria in `target` +- Add `metadata.tags` for filtering + +### `devals create job` + +Creates a new job YAML file: + +```bash +devals create job +``` + +**Prompts for:** +- Job name +- Which models, variants, and tasks to include + +**What to customize after:** +- Add or refine `variants` — the generated file may only include `baseline: {}` +- Add `task_filters` or `sample_filters` if you want to target a subset +- Configure `inspect_eval_arguments` for retry, timeout, and limit settings + +--- + +## Running evals + +### Basic run + +```bash +devals run +``` + +The CLI: +1. Reads `devals.yaml` to find the `evals/` directory +2. Resolves your YAML config into a JSON manifest +3. Passes the manifest to `run-evals` (the Python `dash_evals` runner) +4. `dash_evals` calls Inspect AI's `eval_set()` +5. Logs are written to `logs/` + +### Dry run + +Preview the resolved configuration without making API calls: + +```bash +devals run --dry-run +``` + +This prints every task × model × variant combination that would execute. +Use it to verify your setup before spending API credits. + +> [!TIP] +> Always dry-run after editing YAML config. 
It catches typos, missing files, +> and bad task references before they cost you money. + +--- + +## Viewing results + +```bash +devals view +``` + +Launches the [Inspect AI log viewer](https://inspect.aisi.org.uk/log-viewer.html) +— a local web UI. `devals` automatically finds your `logs/` directory from +`devals.yaml`. + +To view logs from a specific location: + +```bash +devals view /path/to/logs +``` + +**What to look for in the viewer:** + +| Section | What it shows | +|---------|--------------| +| **Runs** | Each task × model × variant combination | +| **Transcript** | The full conversation, including every tool call | +| **Score** | Pass/fail, model-graded scores, test results | +| **Metadata** | Timing, token usage, cost | + +--- + +## Troubleshooting + +### `devals doctor` + +Checks all prerequisites: + +```bash +devals doctor +``` + +It verifies: +- **Dart SDK** — required for the CLI itself +- **Python 3.13+** — required for `dash_evals` +- **`dash_evals`** — the Python evaluation package +- **Podman/Docker** — container runtime for sandboxed tasks +- **Flutter SDK** — needed for Flutter-based eval tasks +- **API Keys** — checks for configured provider keys + +Fix any errors before running evals. Warnings (like a missing Flutter SDK) +are safe to ignore if your evals don't need that tool. + +--- + +## Quick reference + +| Command | What it does | +|---------|-------------| +| `devals init` | Initialize a new dataset in the current directory | +| `devals doctor` | Check prerequisites | +| `devals create pipeline` | Interactive walkthrough: sample → task → job | +| `devals create task` | Create a new task directory | +| `devals create sample` | Create a new sample | +| `devals create job` | Create a new job file | +| `devals run ` | Run an evaluation | +| `devals run --dry-run` | Preview without executing | +| `devals view [path]` | Launch the Inspect AI log viewer | + +--- + +## Next steps + +You now know the full CLI workflow. 
{doc}`Part 5 ` looks +under the hood at the `dash_evals` Python package — useful if you ever want +to write custom task logic. \ No newline at end of file diff --git a/docs/guides/write_your_first_eval.md b/docs/guides/write_your_first_eval.md new file mode 100644 index 0000000..1185877 --- /dev/null +++ b/docs/guides/write_your_first_eval.md @@ -0,0 +1,323 @@ +# Author your first eval + +In {doc}`Part 1 ` you installed the tools and ran a pre-built eval. +Now you'll write one from scratch — an **agentic** evaluation where the model +explores a codebase, diagnoses a bug, and fixes it. + +By the end of this page you'll have: + +1. Created a workspace with a deliberate bug +2. Written a task file that uses the `bug_fix` task function +3. Run the eval and reviewed the model's fix +4. Added a **variant** to see how extra context changes the result + +> [!NOTE] +> This guide assumes you've completed {doc}`Part 1 ` and have +> a working installation with at least one model API key configured. + +--- + +## 1. Set up a workspace + +Agentic tasks need a **workspace** — a project that gets copied into a sandbox +for the model to work with. Let's create a small Dart package with a deliberate bug. + +Inside your project (the directory with `devals.yaml`), create: + +``` +evals/ +└── workspaces/ + └── buggy_dart_package/ + ├── pubspec.yaml + ├── lib/ + │ └── math_utils.dart + └── test/ + └── math_utils_test.dart +``` + +```{code-block} yaml +--- +caption: evals/workspaces/buggy_dart_package/pubspec.yaml +--- +name: buggy_dart_package +description: A Dart package with a deliberate bug for eval testing. +version: 1.0.0 +publish_to: none + +environment: + sdk: '>=3.0.0 <4.0.0' + +dev_dependencies: + test: ^1.24.0 +``` + +```{code-block} dart +--- +caption: evals/workspaces/buggy_dart_package/lib/math_utils.dart +--- +/// Returns the factorial of [n]. +/// +/// Throws [ArgumentError] if [n] is negative. 
+int factorial(int n) { + if (n < 0) throw ArgumentError('n must be non-negative'); + if (n <= 1) return 1; + // BUG: should be n * factorial(n - 1) + return n + factorial(n - 1); +} + +/// Returns true if [n] is a prime number. +bool isPrime(int n) { + if (n < 2) return false; + for (var i = 2; i * i <= n; i++) { + if (n % i == 0) return false; + } + return true; +} +``` + +```{code-block} dart +--- +caption: evals/workspaces/buggy_dart_package/test/math_utils_test.dart +--- +import 'package:test/test.dart'; +import 'package:buggy_dart_package/math_utils.dart'; + +void main() { + group('factorial', () { + test('factorial(0) = 1', () => expect(factorial(0), 1)); + test('factorial(1) = 1', () => expect(factorial(1), 1)); + test('factorial(5) = 120', () => expect(factorial(5), 120)); + test('factorial(10) = 3628800', () => expect(factorial(10), 3628800)); + test('negative throws', () { + expect(() => factorial(-1), throwsArgumentError); + }); + }); + + group('isPrime', () { + test('2 is prime', () => expect(isPrime(2), true)); + test('4 is not prime', () => expect(isPrime(4), false)); + test('17 is prime', () => expect(isPrime(17), true)); + }); +} +``` + +The bug is in `factorial` — it uses `+` instead of `*`. The tests will catch it. + +--- + +## 2. Write the task + +Create a task directory with a `task.yaml`: + +``` +evals/ +└── tasks/ + └── fix_math_utils/ + └── task.yaml +``` + +```{code-block} yaml +--- +caption: evals/tasks/fix_math_utils/task.yaml +--- +# Task: Fix a buggy Dart package +# +# Uses the built-in `bug_fix` task function, which: +# 1. Copies the workspace into a sandbox +# 2. Gives the model bash and text-editor access +# 3. Lets it explore, edit, and test until it calls submit() +# 4. 
Scores based on test results and code quality + +func: bug_fix + +# Copy the workspace into /workspace in the sandbox +files: + /workspace: ../../workspaces/buggy_dart_package +setup: "cd /workspace && dart pub get" + +samples: + inline: + - id: fix_factorial + metadata: + difficulty: easy + tags: [dart, math, bug-fix] + input: | + The `factorial` function in `lib/math_utils.dart` is returning + wrong values. Tests are failing. Find and fix the bug. + + Run the tests with `dart test` to verify your fix. + target: | + The fix should change the `+` operator to `*` in the factorial + function's recursive case. All tests should pass after the fix. +``` + +**What's new here compared to Part 1:** + +| Field | What it does | +|-------|-------------| +| `func: bug_fix` | An *agentic* task. The model gets `bash_session()` and `text_editor()` tools and runs in a `react()` loop — it can explore, edit, and test code autonomously. | +| `files` | Copies a local directory into the sandbox filesystem. The key (`/workspace`) is the destination path inside the sandbox. | +| `setup` | A shell command run *before* the model gets control. Use it to install dependencies. | + +> [!IMPORTANT] +> The `bug_fix` task requires a container sandbox (Docker or Podman) because +> `bash_session()` and `text_editor()` inject helper scripts that only work on +> Linux. We'll configure this in the job file. + +--- + +## 3. Create a job + +```{code-block} yaml +--- +caption: evals/jobs/tutorial_bugfix.yaml +--- +# Job: tutorial bug fix +# +# Runs our fix_math_utils task in a Podman sandbox. + +models: + - google/gemini-2.5-flash + +sandbox: podman + +tasks: + inline: + fix_math_utils: {} +``` + +If you don't have Podman set up yet: + +```bash +brew install podman +podman machine init +podman machine start +``` + +> [!TIP] +> If you'd rather use Docker, change `sandbox: podman` to `sandbox: docker`. +> The task functions work identically with either runtime. + +--- + +## 4. 
Run it + +Dry run first to check your config: + +```bash +devals run tutorial_bugfix --dry-run +``` + +Then run for real: + +```bash +devals run tutorial_bugfix +``` + +The `bug_fix` task uses a ReAct agent loop. You'll see the model: + +1. Explore the project structure (`ls`, `cat`) +2. Read the failing test output (`dart test`) +3. Edit `math_utils.dart` to fix the bug +4. Re-run tests to verify the fix +5. Call `submit()` with an explanation + +A typical run takes 1–3 minutes. + +--- + +## 5. View results + +```bash +devals view +``` + +In the Inspect log viewer, open the run and look at: + +- **Transcript** — the full conversation, including every tool call the model made +- **Score** — whether the fix passed `dart analyze` and `dart test` +- **Metadata** — timing, token usage, and tool call counts + +--- + +## 6. Add a variant + +What if we gave the model some context about Dart best practices? Would it +produce a better fix, or fix it faster? **Variants** let you test this. + +First, create a context file: + +```{code-block} markdown +--- +caption: evals/context_files/dart_best_practices.md +--- +--- +title: "Dart Best Practices" +version: "1.0.0" +description: "Common Dart patterns and debugging tips" +--- + +## Debugging Tips + +- Always run `dart test` after making changes to verify your fix. +- Use `dart analyze` to catch static errors. +- Read test expectations carefully — they tell you what the correct behavior should be. +- Check operator precedence when arithmetic results look wrong. 
+``` + +Now update your job to define two variants: + +```{code-block} yaml +--- +caption: evals/jobs/tutorial_bugfix.yaml (updated) +--- +models: + - google/gemini-2.5-flash + +sandbox: podman + +# Test with and without context +variants: + baseline: {} + with_context: + files: [./context_files/dart_best_practices.md] + +tasks: + inline: + fix_math_utils: {} +``` + +Run again: + +```bash +devals run tutorial_bugfix +``` + +This time, the framework runs *two* evaluations: + +- `fix_math_utils` × `baseline` — no extra context +- `fix_math_utils` × `with_context` — the context file is injected into the prompt + +In `devals view`, you can compare the two runs side by side. Did the context +help? Did the model find the bug faster? + +--- + +## Recap + +You've now written an agentic eval from scratch. Here's what you learned: + +| Concept | What it means | +|---------|---------------| +| **Workspace** | A project directory copied into the sandbox for the model to work with | +| **`files` + `setup`** | How to get code into the sandbox and prepare it | +| **`bug_fix` (agentic task)** | A task where the model gets tools and runs in a ReAct loop | +| **Variants** | Different configurations for the *same* task — great for A/B testing | + +--- + +## Next steps + +Now that you've written tasks and jobs by hand, {doc}`Part 3 ` +dives deeper into the configuration model — every field in `task.yaml` and +`job.yaml`, and how they all fit together. diff --git a/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md b/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md index 7d3afaf..fc7e1e9 100644 --- a/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md +++ b/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md @@ -768,22 +768,21 @@ Resolves parsed task configs and job into fully-resolved This is the resolution engine. It: 1. Resolves models, sandboxes, and variants 2. 
Expands task × variant combinations into [Task] entries -3. Groups by flutter_channel (one [EvalSet] per group) -4. Propagates job-level and task-level settings to the output +3. Propagates job-level and task-level settings to the output ### Constructors #### `EvalSetResolver` ```dart -EvalSetResolver({Map> sandboxRegistry, Map sdkChannels}) +EvalSetResolver({Map> sandboxRegistry}) ``` Creates a resolver with optional sandbox configuration. -If [sandboxRegistry] or [sdkChannels] are not provided, they default -to empty maps (no sandbox resolution). Pass [kDefaultSandboxRegistry] -and [kDefaultSdkChannels] for the Flutter-specific sandbox setup. +If [sandboxRegistry] is not provided, it defaults to an empty map +(no sandbox resolution). Pass [kDefaultSandboxRegistry] for the +Flutter-specific sandbox setup. ### Properties @@ -791,10 +790,6 @@ and [kDefaultSdkChannels] for the Flutter-specific sandbox setup. Named sandbox configurations (e.g. `'podman'` → compose file path). -- **`sdkChannels`** → `Map` *(final)* - - SDK channel → sandbox registry key mapping. - ### Methods #### `resolve` @@ -805,8 +800,6 @@ List resolve(List datasetTasks, Job job, String datasetRoot Resolve task configs and job into [EvalSet] objects. -Groups by flutter_channel so each gets its own sandbox. - **Parameters:** - `datasetTasks` (`List`) *(required)* @@ -988,27 +981,25 @@ be specified there and will be passed through to the Python runner. 
Example YAML: ```yaml log_dir: ./logs/my_run -sandbox: podman +sandbox: + environment: podman max_connections: 10 models: - google/gemini-2.5-flash variants: baseline: {} context_only: - context_files: [./context_files/flutter.md] + files: [./context_files/flutter.md] tasks: dart_qa: include-samples: [sample_1] -# Pass-through to eval_set() -eval_set_overrides: +# All Inspect AI eval_set() parameters +inspect_eval_arguments: retry_attempts: 20 log_level: debug - -# Default Task-level overrides applied to every task -task_defaults: - time_limit: 600 - message_limit: 50 + task_defaults: + time_limit: 600 ``` ### Constructors @@ -1016,7 +1007,7 @@ task_defaults: #### `Job` ```dart -Job({String? description, String? imagePrefix, required String logDir, String sandboxType, int maxConnections, List? models, Map>? variants, List? taskPaths, Map? tasks, bool saveExamples, int? retryAttempts, int? maxRetries, double? retryWait, double? retryConnections, bool? retryCleanup, double? failOnError, bool? continueOnFail, int? retryOnError, bool? debugErrors, int? maxSamples, int? maxTasks, int? maxSubprocesses, int? maxSandboxes, String? logLevel, String? logLevelTranscript, String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, Object? sampleId, Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, bool? sandboxCleanup, String? modelBaseUrl, Map? modelArgs, Map? modelRoles, Map? taskArgs, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Map? modelCostConfig, bool? logSamples, bool? logRealtime, bool? logImages, int? logBuffer, int? logShared, String? bundleDir, bool? bundleOverwrite, bool? logDirAllowDirty, String? evalSetId, Map? evalSetOverrides, Map? taskDefaults, TagFilter? taskFilters, TagFilter? sampleFilters}) +Job({String? description, required String logDir, int maxConnections, List? models, Map>? variants, List? taskPaths, Map? tasks, bool saveExamples, Map? 
sandbox, Map? inspectEvalArguments, TagFilter? taskFilters, TagFilter? sampleFilters}) ``` #### `Job.fromJson` @@ -1033,15 +1024,14 @@ Job.fromJson(Map json) Per-task configuration within a job. -Allows overriding which samples run for specific tasks and providing -a custom system message. +Allows overriding which samples and variants run for specific tasks. ### Constructors #### `JobTask` ```dart -JobTask({required String id, List? includeSamples, List? excludeSamples, String? systemMessage, Map? args}) +JobTask({required String id, List? includeSamples, List? excludeSamples, List? includeVariants, List? excludeVariants, Map? args}) ``` #### `JobTask.fromJson` @@ -1058,9 +1048,6 @@ JobTask.fromYaml(String taskId, Map? data) Create a [JobTask] from parsed YAML data. -The [taskId] is the map key from the job YAML `tasks:` section. -The [data] may be `null` for a simple task reference with no overrides. - --- ## class `JsonParser` @@ -1216,7 +1203,7 @@ former `TaskConfig` model-package class. #### `ParsedTask` ```dart -ParsedTask({required String id, required String func, required List samples, required Variant variant, String sandboxType, String? systemMessage, List? allowedVariants, bool saveExamples, String? examplesDir, TagFilter? variantFilters, Map? sandboxParameters, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) +ParsedTask({required String id, required String func, required List samples, required Variant variant, String sandboxType, String? systemMessage, bool saveExamples, String? examplesDir, Map? sandboxParameters, Map? taskFiles, String? taskSetup, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? 
messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) ``` ### Properties @@ -1233,20 +1220,22 @@ ParsedTask({required String id, required String func, required List samp - **`systemMessage`** → `String?` *(final)* -- **`allowedVariants`** → `List?` *(final)* - - **`saveExamples`** → `bool` *(final)* - **`examplesDir`** → `String?` *(final)* -- **`variantFilters`** → `TagFilter?` *(final)* - - Tag filter for variant selection. - - **`sandboxParameters`** → `Map?` *(final)* Pass-through dict for sandbox plugin configuration. +- **`taskFiles`** → `Map?` *(final)* + + Task-level files to copy into sandbox. + +- **`taskSetup`** → `String?` *(final)* + + Task-level setup script. + - **`model`** → `String?` *(final)* Default model for this task. @@ -1320,7 +1309,7 @@ ParsedTask({required String id, required String func, required List samp #### `copyWith` ```dart -ParsedTask copyWith({String? id, String? func, List? samples, Variant? variant, String? sandboxType, String? systemMessage, List? allowedVariants, bool? saveExamples, String? examplesDir, TagFilter? variantFilters, Map? sandboxParameters, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) +ParsedTask copyWith({String? id, String? func, List? samples, Variant? variant, String? sandboxType, String? systemMessage, bool? saveExamples, String? examplesDir, Map? sandboxParameters, Map? taskFiles, String? taskSetup, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? 
earlyStopping, String? displayName, Object? version, Map? metadata}) ``` Create a copy with overrides. @@ -1333,11 +1322,11 @@ Create a copy with overrides. - `variant` (`Variant?`) - `sandboxType` (`String?`) - `systemMessage` (`String?`) -- `allowedVariants` (`List?`) - `saveExamples` (`bool?`) - `examplesDir` (`String?`) -- `variantFilters` (`TagFilter?`) - `sandboxParameters` (`Map?`) +- `taskFiles` (`Map?`) +- `taskSetup` (`String?`) - `model` (`String?`) - `config` (`Map?`) - `modelRoles` (`Map?`) @@ -1523,7 +1512,7 @@ constructor. #### `Task` ```dart -Task({Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, String? func, String? systemMessage, Map? sandboxParameters, String? name, Object version, Map? metadata}) +Task({Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, String? func, String? systemMessage, Map? sandboxParameters, String? name, Object version, Map? metadata}) ``` #### `Task.fromJson` @@ -1644,9 +1633,10 @@ Variants define different testing configurations to compare model performance with and without specific tooling or context. 
Features are implied by field presence — no explicit feature list needed: -- [contextFiles] populated → context injection enabled +- [files] populated → context injection enabled - [mcpServers] populated → MCP tools enabled -- [skillPaths] populated → agent skills enabled +- [skills] populated → agent skills enabled +- [taskParameters] populated → extra parameters passed to the task - all empty → baseline variant Example YAML: @@ -1654,10 +1644,13 @@ Example YAML: variants: baseline: {} context_only: - context_files: [./context_files/flutter.md] + files: [./context_files/flutter.md] full: - context_files: [./context_files/flutter.md] - mcp_servers: [dart] + files: [./context_files/flutter.md] + mcp_servers: + - name: dart + command: dart + args: [mcp-server] skills: [./skills/flutter_docs_ui] ``` @@ -1666,7 +1659,7 @@ variants: #### `Variant` ```dart -Variant({String name, List contextFiles, List mcpServers, List skillPaths, String? flutterChannel}) +Variant({String name, List files, List> mcpServers, List skills, Map taskParameters}) ``` #### `Variant.fromJson` From 868cce376fa1ec633a9e0a3e4f6cb92078cc971f Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Thu, 19 Mar 2026 14:01:50 -0700 Subject: [PATCH 18/21] feat: Add HTTP transport support for MCP servers, update configuration model, and related documentation. 
---
 docs/guides/configuring_jobs.md             |  2 +-
 docs/guides/get_started.md                  |  2 -
 docs/reference/configuration_reference.md   | 28 ++++++++++++
 docs/reference/yaml_config.md               |  2 +-
 .../dash_evals/runner/tasks/task_helpers.py | 42 ++++++++++++++---
 .../models/mcp_server_config.py             | 45 +++++++++++++++----
 .../src/dataset_config_python/resolver.py   | 40 ++++++++---------
 7 files changed, 120 insertions(+), 41 deletions(-)

diff --git a/docs/guides/configuring_jobs.md b/docs/guides/configuring_jobs.md
index 51ed48f..5d6ed97 100644
--- a/docs/guides/configuring_jobs.md
+++ b/docs/guides/configuring_jobs.md
@@ -200,7 +200,7 @@ This produces 4 runs per sample (one per variant) × however many models you lis
 | Field | What it does |
 |-------|-------------|
 | `files` | Context files injected into the prompt |
-| `mcp_servers` | MCP tool servers the model can call |
+| `mcp_servers` | MCP tool servers the model can call (stdio, HTTP, or Python ref) |
 | `skills` | Skill directories copied into the sandbox |
 | `task_parameters` | Extra parameters merged into the task config at runtime |

diff --git a/docs/guides/get_started.md b/docs/guides/get_started.md
index 20f9031..5250a96 100644
--- a/docs/guides/get_started.md
+++ b/docs/guides/get_started.md
@@ -216,8 +216,6 @@ You've now seen the two layers of the system:
 | **`dash_evals` + Inspect AI** | The engine. Runs tasks, sends prompts, scores responses. |
 | **`devals` CLI** | The convenience layer. YAML config, scaffolding, log discovery. |

-Everything `devals` does eventually calls down to `dash_evals` and Inspect AI.
-Understanding this makes debugging much easier.

 ---

diff --git a/docs/reference/configuration_reference.md b/docs/reference/configuration_reference.md
index baeae69..e7f6420 100644
--- a/docs/reference/configuration_reference.md
+++ b/docs/reference/configuration_reference.md
@@ -268,6 +268,34 @@ tasks:

 Glob patterns (containing `*`, `?`, or `[`) are expanded automatically.
 For skills, only directories containing `SKILL.md` are included.

+### MCP Server Modes
+
+MCP servers in variants support three modes:
+
+```yaml
+variants:
+  # 1. Declarative stdio/sandbox — command-based
+  with_dart_mcp:
+    mcp_servers:
+      - name: dart
+        command: dart
+        args: [mcp-server]
+
+  # 2. Declarative HTTP — url-based
+  with_http_mcp:
+    mcp_servers:
+      - name: my-api
+        url: https://mcp.example.com/api
+        authorization: "bearer-token-here"  # optional OAuth Bearer token
+        headers:  # optional extra headers
+          X-Custom-Header: value
+
+  # 3. Python ref — import a pre-built MCPServer
+  with_custom_mcp:
+    mcp_servers:
+      - ref: "my_package.mcp:staging_server"
+```
+
 > [!IMPORTANT]
 > The `skills` feature requires a sandbox (docker/podman). Skill directories are copied into the sandbox filesystem by Inspect AI's built-in `skill()` tool. Each skill directory must contain a `SKILL.md` file.

diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md
index 625f363..e7bb643 100644
--- a/docs/reference/yaml_config.md
+++ b/docs/reference/yaml_config.md
@@ -88,7 +88,7 @@ Job files define runtime settings for an evaluation run, including sandbox confi
   - Y
   -
   -
-  - MCP server configurations (list of objects with `name`, `command`, `args`, `env`, `transport`; or a `ref:` string to a Python package)
+  - MCP server configurations. Each entry is one of: (1) an object with `command`/`args` for stdio/sandbox, (2) an object with `url` for HTTP, or (3) a `ref:` string pointing to a Python MCPServer object. Common sub-fields: `name`, `transport`. Stdio sub-fields: `command`, `args`, `env`, `cwd`. HTTP sub-fields: `url`, `authorization`, `headers`.
 * - `variants` \
     `.` \
     `.skills`
diff --git a/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py b/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py
index 5a32f8e..bca2517 100644
--- a/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py
+++ b/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py
@@ -16,7 +16,14 @@
 from inspect_ai.agent import react
 from inspect_ai.solver import Solver, generate
-from inspect_ai.tool import MCPServer, Tool, mcp_server_sandbox, mcp_server_stdio, skill
+from inspect_ai.tool import (
+    MCPServer,
+    Tool,
+    mcp_server_http,
+    mcp_server_sandbox,
+    mcp_server_stdio,
+    skill,
+)

 from dash_evals.runner.solvers import context_injector

@@ -97,13 +104,15 @@ def create_mcp_servers(
 ) -> list[MCPServer]:
     """Create MCP server objects from variant config.

-    Supports two modes per entry:
-    - **Declarative**: dict with ``name``, ``command``, ``args``, etc.
+    Supports three modes per entry:
+    - **Declarative stdio/sandbox**: dict with ``command``, ``args``, etc.
+    - **Declarative HTTP**: dict with ``url``, and optionally ``authorization``/``headers``.
     - **Python ref**: dict with ``ref`` key pointing to a pre-built MCPServer.

-    Transport is auto-selected based on sandbox_type when not explicit:
-    - ``"local"`` → ``mcp_server_stdio``
-    - anything else (docker, podman) → ``mcp_server_sandbox``
+    Transport is auto-selected when not explicit:
+    - If ``url`` is present → ``mcp_server_http``
+    - If sandbox is non-local → ``mcp_server_sandbox``
+    - Otherwise → ``mcp_server_stdio``

     Args:
         mcp_configs: List of MCP server config dicts from variant_config.
@@ -114,14 +123,33 @@ def create_mcp_servers(
     """
     servers: list[MCPServer] = []
     for cfg in mcp_configs:
+        # Ref mode — import a pre-built MCPServer from Python
         if cfg.get("ref"):
             servers.append(_resolve_mcp_ref(cfg["ref"]))
             continue

+        # HTTP mode — url-based server
+        url = cfg.get("url")
+        if url:
+            name = cfg.get("name", url)
+            authorization = cfg.get("authorization") or cfg.get("auth")
+            headers = cfg.get("headers")
+            servers.append(
+                mcp_server_http(
+                    url=url,
+                    name=name,
+                    authorization=authorization,
+                    headers=headers,
+                )
+            )
+            continue
+
+        # Stdio / sandbox mode — command-based server
         command = cfg.get("command")
         if not command:
             raise ValueError(
-                f"MCP server config missing 'command' for server '{cfg.get('name', 'unknown')}' : {cfg}"
+                f"MCP server config missing 'command' or 'url' for server "
+                f"'{cfg.get('name', 'unknown')}': {cfg}"
             )

         name = cfg.get("name", command)
diff --git a/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py b/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py
index 42414fb..598eb44 100644
--- a/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py
+++ b/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py
@@ -10,20 +10,21 @@ class McpServerConfig(BaseModel):
     """MCP server configuration.

-    Supports two modes:
-    1. **Declarative** — specify command, args, env, etc. directly.
-    2. **Python ref** — point to a pre-built MCPServer object via
+    Supports three modes:
+    1. **Declarative stdio/sandbox** — specify command, args, env, etc.
+    2. **Declarative HTTP** — specify url, and optionally headers/auth.
+    3. **Python ref** — point to a pre-built MCPServer object via
        ``ref: "my_package.module:variable_name"``.

     When ``ref`` is set, all other fields are ignored.
     """

-    # Declarative fields
+    # Declarative fields (stdio / sandbox)
     name: str | None = None
     """Human-readable server name (e.g. ``"dart"``)."""

     command: str | None = None
-    """Executable to run (e.g. ``"dart"``)."""
+    """Executable to run (e.g. ``"dart"``). Required for stdio/sandbox transport."""

     args: list[str] = Field(default_factory=list)
     """Command-line arguments (e.g. ``["mcp-server"]``)."""

@@ -34,8 +35,28 @@
     cwd: str | None = None
     """Working directory for the server process."""

+    # Declarative fields (HTTP)
+    url: str | None = None
+    """URL endpoint for HTTP transport (e.g. ``"https://mcp.example.com/api"``)."""
+
+    headers: dict[str, str] | None = None
+    """HTTP headers to send with requests (e.g. for authentication)."""
+
+    authorization: str | None = None
+    """OAuth Bearer token for HTTP authentication.
+
+    Maps to Inspect AI's ``authorization`` parameter on ``mcp_server_http``.
+    """
+
+    # Common
     transport: str | None = None
-    """Transport type: ``"stdio"``, ``"sandbox"``, or ``None`` (auto-select)."""
+    """Transport type: ``"stdio"``, ``"sandbox"``, ``"http"``, or ``None`` (auto).
+
+    Auto-selection logic:
+    - If ``url`` is set → ``"http"``
+    - If ``command`` is set and sandbox is non-local → ``"sandbox"``
+    - If ``command`` is set and sandbox is local → ``"stdio"``
+    """

     # Python import escape hatch
     ref: str | None = None
@@ -47,10 +68,16 @@
     @model_validator(mode="after")
     def _validate_mode(self) -> McpServerConfig:
-        if self.ref is None and self.command is None:
+        if self.ref is None and self.command is None and self.url is None:
+            raise ValueError(
+                "McpServerConfig requires one of: 'ref' (Python import), "
+                "'command' (stdio/sandbox), or 'url' (HTTP). "
+                "None was provided."
+            )
+        if self.command is not None and self.url is not None:
             raise ValueError(
-                "McpServerConfig requires either 'ref' (Python import) "
-                "or 'command' (declarative). Neither was provided."
+                "McpServerConfig cannot have both 'command' (stdio/sandbox) "
+                "and 'url' (HTTP). Use one or the other."
) return self diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 62db329..18b1e91 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -20,16 +20,8 @@ # Default models when a job doesn't specify its own. DEFAULT_MODELS: list[str] = [ - "anthropic/claude-haiku-4-5", - "anthropic/claude-sonnet-4-5", - "anthropic/claude-opus-4-6", "google/gemini-2.5-flash", - "google/gemini-3-pro-preview", "google/gemini-3-flash-preview", - "openai/gpt-5-mini", - "openai/gpt-5-nano", - "openai/gpt-5", - "openai/gpt-5-pro", ] # Default sandbox configurations for Flutter evaluations. @@ -270,17 +262,25 @@ def _build_eval_set( approval=tc.approval or task_defaults.get("approval"), epochs=tc.epochs or task_defaults.get("epochs"), fail_on_error=tc.fail_on_error or task_defaults.get("fail_on_error"), - continue_on_fail=tc.continue_on_fail if tc.continue_on_fail is not None else task_defaults.get("continue_on_fail"), + continue_on_fail=tc.continue_on_fail + if tc.continue_on_fail is not None + else task_defaults.get("continue_on_fail"), message_limit=tc.message_limit or task_defaults.get("message_limit"), token_limit=tc.token_limit or task_defaults.get("token_limit"), time_limit=resolved_time_limit, working_limit=tc.working_limit or task_defaults.get("working_limit"), - cost_limit=tc.cost_limit if tc.cost_limit is not None else ( - float(task_defaults["cost_limit"]) if task_defaults.get("cost_limit") is not None else None + cost_limit=tc.cost_limit + if tc.cost_limit is not None + else ( + float(task_defaults["cost_limit"]) + if task_defaults.get("cost_limit") is not None + else None ), early_stopping=tc.early_stopping or task_defaults.get("early_stopping"), display_name=tc.display_name or task_defaults.get("display_name"), - version=tc.version if tc.version is not None else 
(task_defaults.get("version") or 0), + version=tc.version + if tc.version is not None + else (task_defaults.get("version") or 0), ) ) @@ -434,13 +434,11 @@ def _expand_task_configs( job_task = job.tasks.get(task_id) if job.tasks else None if job_task and job_task.include_variants: effective_variants = { - k: v for k, v in effective_variants.items() - if k in job_task.include_variants + k: v for k, v in effective_variants.items() if k in job_task.include_variants } if job_task and job_task.exclude_variants: effective_variants = { - k: v for k, v in effective_variants.items() - if k not in job_task.exclude_variants + k: v for k, v in effective_variants.items() if k not in job_task.exclude_variants } # Apply sample filtering @@ -454,7 +452,8 @@ def _expand_task_configs( # Apply sample tag filtering (job-level) if job.sample_filters is not None: samples = [ - s for s in samples + s + for s in samples if matches_tag_filter((s.metadata or {}).get("tags", []), job.sample_filters) ] @@ -512,7 +511,8 @@ def _resolve_variant( matched = sorted( f for f in globmod.glob(full_pattern, recursive=True) - if os.path.isfile(f) and (f.endswith(".yaml") or f.endswith(".yml") or f.endswith(".md")) + if os.path.isfile(f) + and (f.endswith(".yaml") or f.endswith(".yml") or f.endswith(".md")) ) if not matched: raise FileNotFoundError(f"No context files matched pattern: {cf_path}") @@ -529,9 +529,7 @@ def _resolve_variant( if _is_glob(skill_path_str): full_pattern = os.path.join(dataset_root, skill_path_str) matched_dirs = sorted( - d - for d in globmod.glob(full_pattern, recursive=True) - if os.path.isdir(d) + d for d in globmod.glob(full_pattern, recursive=True) if os.path.isdir(d) ) valid_dirs = [d for d in matched_dirs if os.path.isfile(os.path.join(d, "SKILL.md"))] if not valid_dirs: From 03926f5b0f93c12e83cf44a00bb86e4ab0527df9 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 20 Mar 2026 11:50:52 -0700 Subject: [PATCH 19/21] feat: Introduce flexible dataset configuration 
supporting inline, JSON, and CSV formats with new `dataset` key and `json_runner`. --- docs/guides/configuring_jobs.md | 88 +++---- docs/guides/write_your_first_eval.md | 29 +-- docs/reference/configuration_reference.md | 72 +++--- docs/reference/yaml_config.md | 44 +++- .../src/dash_evals/runner/json_runner.py | 93 ++++++-- packages/dash_evals/tests/test_json_runner.py | 217 ++++++++++++++++++ .../lib/src/models/dataset.dart | 11 +- .../lib/src/models/dataset.freezed.dart | 70 ++++-- .../lib/src/models/dataset.g.dart | 6 + .../lib/src/models/job.dart | 4 +- .../lib/src/models/job.freezed.dart | 40 ++-- .../lib/src/models/job.g.dart | 2 +- .../lib/src/parsed_task.dart | 15 ++ .../lib/src/parsers/json_parser.dart | 151 +++++++----- .../lib/src/parsers/yaml_parser.dart | 95 ++++++-- .../lib/src/resolvers/eval_set_resolver.dart | 34 +-- .../test/eval_set_resolver_test.dart | 17 +- .../test/json_parser_test.dart | 51 ++-- .../dataset_config_python/models/dataset.py | 21 +- .../src/dataset_config_python/models/job.py | 2 +- .../src/dataset_config_python/parser.py | 81 ++++++- .../src/dataset_config_python/resolver.py | 25 +- .../tests/test_config.py | 125 ++++++++-- .../lib/src/commands/create_job_command.dart | 20 +- .../src/commands/create_pipeline_command.dart | 12 +- .../init_templates/init_sample_template.dart | 29 +-- .../dataset/file_templates/task_template.dart | 17 +- 27 files changed, 995 insertions(+), 376 deletions(-) create mode 100644 packages/dash_evals/tests/test_json_runner.py diff --git a/docs/guides/configuring_jobs.md b/docs/guides/configuring_jobs.md index 5d6ed97..c3def62 100644 --- a/docs/guides/configuring_jobs.md +++ b/docs/guides/configuring_jobs.md @@ -40,22 +40,23 @@ This is enough to define a task, but it has no samples — nothing to evaluate y ### Add a sample -Samples go under `samples.inline`. Each sample needs at minimum an `id`, +Samples go under `dataset.samples.inline`. 
Each sample needs at minimum an `id`, `input` (the prompt), and `target` (grading criteria): ```yaml func: question_answer -samples: - inline: - - id: explain_null_safety - input: | - Explain Dart's sound null safety. How does it prevent - null reference errors at compile time? - target: | - Should explain nullable vs non-nullable types, the `?` - suffix, null-aware operators, and how the analyzer enforces - null checks at compile time. +dataset: + samples: + inline: + - id: explain_null_safety + input: | + Explain Dart's sound null safety. How does it prevent + null reference errors at compile time? + target: | + Should explain nullable vs non-nullable types, the `?` + suffix, null-aware operators, and how the analyzer enforces + null checks at compile time. ``` ### Add a system message @@ -68,10 +69,11 @@ system_message: | You are an expert Dart developer. Answer questions with code examples where appropriate. Be concise. -samples: - inline: - - id: explain_null_safety - # ... +dataset: + samples: + inline: + - id: explain_null_safety + # ... ``` ### Add files and setup @@ -86,26 +88,28 @@ files: /workspace: ../../workspaces/my_dart_package setup: "cd /workspace && dart pub get" -samples: - inline: - - id: fix_the_bug - input: | - The tests are failing. Find and fix the bug. - target: | - All tests should pass after the fix. +dataset: + samples: + inline: + - id: fix_the_bug + input: | + The tests are failing. Find and fix the bug. + target: | + All tests should pass after the fix. ``` `files` and `setup` at the task level are **inherited by all samples**. A sample can override them: ```yaml -samples: - inline: - - id: fix_the_bug - files: - /workspace: ./custom_project # overrides task-level files - setup: "cd /workspace && pub get" # overrides task-level setup - input: ... 
+dataset: + samples: + inline: + - id: fix_the_bug + files: + /workspace: ./custom_project # overrides task-level files + setup: "cd /workspace && pub get" # overrides task-level setup + input: ... ``` > [!NOTE] @@ -118,14 +122,15 @@ samples: Samples can carry `metadata` with `tags` and `difficulty`. Jobs use these for filtering: ```yaml -samples: - inline: - - id: fix_the_bug - metadata: - difficulty: medium - tags: [dart, bug-fix, async] - input: ... - target: ... +dataset: + samples: + inline: + - id: fix_the_bug + metadata: + difficulty: medium + tags: [dart, bug-fix, async] + input: ... + target: ... ``` ### Use external sample files @@ -136,13 +141,14 @@ with glob patterns: ```yaml func: question_answer -samples: - paths: - - samples/*.yaml # loads every .yaml in the samples/ subdirectory +dataset: + samples: + paths: + - samples/*.yaml # loads every .yaml in the samples/ subdirectory ``` Each external file contains a list of sample objects in the same format as -`samples.inline`. +`dataset.samples.inline`. --- diff --git a/docs/guides/write_your_first_eval.md b/docs/guides/write_your_first_eval.md index 1185877..36e22fd 100644 --- a/docs/guides/write_your_first_eval.md +++ b/docs/guides/write_your_first_eval.md @@ -135,20 +135,21 @@ files: /workspace: ../../workspaces/buggy_dart_package setup: "cd /workspace && dart pub get" -samples: - inline: - - id: fix_factorial - metadata: - difficulty: easy - tags: [dart, math, bug-fix] - input: | - The `factorial` function in `lib/math_utils.dart` is returning - wrong values. Tests are failing. Find and fix the bug. - - Run the tests with `dart test` to verify your fix. - target: | - The fix should change the `+` operator to `*` in the factorial - function's recursive case. All tests should pass after the fix. +dataset: + samples: + inline: + - id: fix_factorial + metadata: + difficulty: easy + tags: [dart, math, bug-fix] + input: | + The `factorial` function in `lib/math_utils.dart` is returning + wrong values. 
Tests are failing. Find and fix the bug. + + Run the tests with `dart test` to verify your fix. + target: | + The fix should change the `+` operator to `*` in the factorial + function's recursive case. All tests should pass after the fix. ``` **What's new here compared to Part 1:** diff --git a/docs/reference/configuration_reference.md b/docs/reference/configuration_reference.md index e7f6420..288caf5 100644 --- a/docs/reference/configuration_reference.md +++ b/docs/reference/configuration_reference.md @@ -55,27 +55,28 @@ files: /workspace: ./project setup: "cd /workspace && flutter pub get" -samples: - inline: - - id: flutter_bloc_cart_mutation_001 - input: | - Fix the bug where adding items to cart doesn't update the total. - target: | - The fix should modify the BLoC to emit a new state instead of mutating. - metadata: - difficulty: medium - tags: [bloc, state] - - - id: navigation_crash - files: - /workspace: ./nav_project # Override task-level files - input: | - Fix the crash when navigating back from the detail screen. - target: | - The fix should handle the disposed controller properly. - metadata: - difficulty: hard - tags: [navigation] +dataset: + samples: + inline: + - id: flutter_bloc_cart_mutation_001 + input: | + Fix the bug where adding items to cart doesn't update the total. + target: | + The fix should modify the BLoC to emit a new state instead of mutating. + metadata: + difficulty: medium + tags: [bloc, state] + + - id: navigation_crash + files: + /workspace: ./nav_project # Override task-level files + input: | + Fix the crash when navigating back from the detail screen. + target: | + The fix should handle the disposed controller properly. + metadata: + difficulty: hard + tags: [navigation] ``` For the complete list of task fields (including Inspect AI `Task` parameters), see the [Task fields table](yaml_config.md#task). 
@@ -105,19 +106,20 @@ A sample is a single test case containing an input prompt, expected output (grad ```yaml # Inline in task.yaml -samples: - inline: - - id: dart_async_await_001 - input: | - Explain the difference between Future.then() and async/await in Dart. - target: | - The answer should cover both approaches, explain that they are - functionally equivalent, and note when each is preferred. - metadata: - difficulty: medium - tags: [async, dart] - added: 2025-02-04 - category: language_fundamentals +dataset: + samples: + inline: + - id: dart_async_await_001 + input: | + Explain the difference between Future.then() and async/await in Dart. + target: | + The answer should cover both approaches, explain that they are + functionally equivalent, and note when each is preferred. + metadata: + difficulty: medium + tags: [async, dart] + added: 2025-02-04 + category: language_fundamentals ``` For the complete list of sample fields, see the [Sample fields table](yaml_config.md#sample). @@ -168,7 +170,7 @@ max_connections: 15 # Save the agent's final workspace output to logs//examples/ # save_examples: true -# Filter what to run (optional - omit to run all) +# Filter what to run (required) models: - google/gemini-2.5-flash diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md index e7bb643..4eeaf2c 100644 --- a/docs/reference/yaml_config.md +++ b/docs/reference/yaml_config.md @@ -63,10 +63,10 @@ Job files define runtime settings for an evaluation run, including sandbox confi - Maximum concurrent API connections (default: `10`) * - `models` - list - - Y + - N - `models` - `models` - - Filter to specific models — omit to use defaults + - List of model identifiers to evaluate (required — at least one model must be specified) * - `variants` - map - Y @@ -231,26 +231,56 @@ Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) 
are neste - `description` - `description` - Human-readable description -* - `samples` +* - `dataset` - object - Y - - - - Samples config with `inline` and/or `paths` keys (optional — task can have no samples) -* - `samples` \ + - Dataset configuration. Must contain exactly one of `samples`, `json`, or `csv`. +* - `dataset` \ +   `.samples` + - object + - Y + - + - + - Inline/file-based sample definitions (see `samples.inline` and `samples.paths` below) +* - `dataset` \ +   `.samples` \   `.inline` - list - Y - - - Inline sample definitions (list of sample objects) -* - `samples` \ +* - `dataset` \ +   `.samples` \   `.paths` - list - Y - - - Glob patterns for external sample YAML files (relative to task dir) +* - `dataset` \ +   `.json` + - string + - Y + - + - + - Path or URL to a JSON/JSONL dataset file (maps to Inspect's `json_dataset()`) +* - `dataset` \ +   `.csv` + - string + - Y + - + - + - Path to a CSV dataset file (maps to Inspect's `csv_dataset()`) +* - `dataset` \ +   `.args` + - object + - Y + - `Dataset.args` + - `Dataset.args` + - Additional arguments passed through to the dataset constructor (e.g. `auto_id`, `shuffle`, `delimiter`) * - `system_message` - string - Y @@ -297,7 +327,7 @@ Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) are neste ## Sample -Samples are individual test cases defined either inline in `task.yaml` under `samples.inline`, or in external YAML files referenced via `samples.paths`. Fields like `difficulty` and `tags` should be nested inside the sample's `metadata` dict. +Samples are individual test cases defined either inline in `task.yaml` under `dataset.samples.inline`, or in external YAML files referenced via `dataset.samples.paths`. Fields like `difficulty` and `tags` should be nested inside the sample's `metadata` dict. 
```{list-table} :header-rows: 1 diff --git a/packages/dash_evals/src/dash_evals/runner/json_runner.py b/packages/dash_evals/src/dash_evals/runner/json_runner.py index 2f395b7..828d048 100644 --- a/packages/dash_evals/src/dash_evals/runner/json_runner.py +++ b/packages/dash_evals/src/dash_evals/runner/json_runner.py @@ -11,7 +11,7 @@ from pathlib import Path import inspect_ai -from inspect_ai.dataset import MemoryDataset, Sample +from inspect_ai.dataset import MemoryDataset, Sample, csv_dataset, json_dataset from dash_evals.utils.logging import capture_output, setup_logging @@ -94,32 +94,73 @@ def _resolve_task_func(name: str): return func -def _build_dataset_from_inline(task_def: dict) -> MemoryDataset: - """Build an Inspect AI MemoryDataset from inline dataset in the task def. +def _build_dataset(task_def: dict): + """Build an Inspect AI dataset from a task definition. - The task_def["dataset"]["samples"] contains a list of InspectSample dicts. + Dispatches on ``task_def["dataset"]["format"]``: + + - ``"memory"`` (default): builds a ``MemoryDataset`` from inline samples. + - ``"json"``: delegates to ``inspect_ai.dataset.json_dataset(source, **args)``. + - ``"csv"``: delegates to ``inspect_ai.dataset.csv_dataset(source, **args)``. + + Args: + task_def: A task entry from the EvalSet JSON manifest. + + Returns: + An Inspect AI dataset object. + + Raises: + ValueError: If the dataset format is unrecognized or required fields + (e.g. ``source`` for json/csv) are missing. 
    """
     dataset_def = task_def.get("dataset")
+    task_name = task_def.get("name", "")
+
     if not dataset_def:
-        return MemoryDataset([], name=task_def.get("name", ""))
-
-    raw_samples = dataset_def.get("samples", [])
-    samples = []
-    for raw in raw_samples:
-        sample = Sample(
-            input=raw["input"],
-            target=raw.get("target", ""),
-            id=raw.get("id"),
-            metadata=raw.get("metadata"),
-            files=raw.get("files"),
-            setup=raw.get("setup"),
-            sandbox=raw.get("sandbox"),
+        return MemoryDataset([], name=task_name)
+
+    fmt = dataset_def.get("format", "memory")
+    extra_args: dict = dataset_def.get("args") or {}
+
+    if fmt == "json":
+        source = dataset_def.get("source")
+        if not source:
+            raise ValueError(
+                f"Task '{task_name}': dataset format 'json' requires a 'source' field."
+            )
+        return json_dataset(source, **extra_args)
+
+    if fmt == "csv":
+        source = dataset_def.get("source")
+        if not source:
+            raise ValueError(
+                f"Task '{task_name}': dataset format 'csv' requires a 'source' field."
+            )
+        return csv_dataset(source, **extra_args)
+
+    if fmt == "memory":
+        raw_samples = dataset_def.get("samples", [])
+        samples = []
+        for raw in raw_samples:
+            sample = Sample(
+                input=raw["input"],
+                target=raw.get("target", ""),
+                id=raw.get("id"),
+                metadata=raw.get("metadata"),
+                files=raw.get("files"),
+                setup=raw.get("setup"),
+                sandbox=raw.get("sandbox"),
+            )
+            samples.append(sample)
+
+        return MemoryDataset(
+            samples,
+            name=dataset_def.get("name", task_name),
         )
-        samples.append(sample)

-    return MemoryDataset(
-        samples,
-        name=dataset_def.get("name", task_def.get("name", "")),
+    raise ValueError(
+        f"Task '{task_name}': unknown dataset format '{fmt}'. "
+        f"Expected one of: 'memory', 'json', 'csv'."
     )
@@ -157,7 +198,7 @@ def _run_single_manifest(manifest: dict) -> bool:
     Path(log_dir).mkdir(parents=True, exist_ok=True)
     job_logger, log_file_path = setup_logging(Path(log_dir), name="dash_evals")
 
-    # Build Task objects from inline datasets
+    # Build Task objects from task definitions
     task_defs = manifest["tasks"]
     task_instances: list[inspect_ai.Task] = []
 
@@ -176,8 +217,12 @@ def _run_single_manifest(manifest: dict) -> bool:
             job_logger.warning(f"  ✗ {task_name}: {e}")
             continue
 
-        # Build inline dataset
-        dataset = _build_dataset_from_inline(task_def)
+        # Build dataset (dispatches on format: memory | json | csv)
+        try:
+            dataset = _build_dataset(task_def)
+        except ValueError as e:
+            job_logger.warning(f"  ✗ {task_name}: {e}")
+            continue
 
         # Inject task_name into the config for task functions that expect it.
         # The Dart CLI emits "name" but task functions use "task_name".
diff --git a/packages/dash_evals/tests/test_json_runner.py b/packages/dash_evals/tests/test_json_runner.py
new file mode 100644
index 0000000..067f30c
--- /dev/null
+++ b/packages/dash_evals/tests/test_json_runner.py
@@ -0,0 +1,217 @@
+"""Tests for json_runner._build_dataset() — dataset format dispatch."""
+
+from __future__ import annotations
+
+from unittest.mock import MagicMock, patch
+
+import pytest
+from inspect_ai.dataset import MemoryDataset
+
+from dash_evals.runner.json_runner import _build_dataset
+
+
+class TestBuildDatasetMemoryFormat:
+    """Tests for inline MemoryDataset (format='memory')."""
+
+    def test_no_dataset_returns_empty_memory_dataset(self):
+        """Tasks without a dataset key produce an empty MemoryDataset."""
+        task_def = {"name": "my_task:baseline", "func": "question_answer"}
+        result = _build_dataset(task_def)
+        assert isinstance(result, MemoryDataset)
+        assert len(result) == 0
+
+    def test_empty_dataset_dict_returns_empty_memory_dataset(self):
+        """An empty dataset dict produces an empty MemoryDataset."""
+        task_def = {"name": "my_task:baseline", "dataset": {}}
+        result = _build_dataset(task_def)
+        assert isinstance(result, MemoryDataset)
+        assert len(result) == 0
+
+    def test_memory_format_explicit(self):
+        """Explicit format='memory' builds a MemoryDataset from inline samples."""
+        task_def = {
+            "name": "my_task:baseline",
+            "dataset": {
+                "format": "memory",
+                "samples": [
+                    {"id": "s1", "input": "What is Dart?", "target": "A language"},
+                ],
+            },
+        }
+        result = _build_dataset(task_def)
+        assert isinstance(result, MemoryDataset)
+        assert len(result) == 1
+        assert result[0].input == "What is Dart?"
+        assert result[0].target == "A language"
+        assert result[0].id == "s1"
+
+    def test_memory_format_default_when_format_absent(self):
+        """Omitting 'format' defaults to memory format."""
+        task_def = {
+            "name": "my_task:baseline",
+            "dataset": {
+                "samples": [
+                    {"id": "s1", "input": "q", "target": "a"},
+                ],
+            },
+        }
+        result = _build_dataset(task_def)
+        assert isinstance(result, MemoryDataset)
+        assert len(result) == 1
+
+    def test_memory_format_preserves_optional_sample_fields(self):
+        """Optional sample fields (metadata, files, setup, sandbox) are passed through."""
+        task_def = {
+            "name": "t:v",
+            "dataset": {
+                "samples": [
+                    {
+                        "id": "s1",
+                        "input": "q",
+                        "target": "a",
+                        "metadata": {"difficulty": "hard"},
+                        "files": {"/workspace": "./proj"},
+                        "setup": "dart pub get",
+                        "sandbox": "docker",
+                    }
+                ],
+            },
+        }
+        result = _build_dataset(task_def)
+        sample = result[0]
+        assert sample.metadata == {"difficulty": "hard"}
+        assert sample.files == {"/workspace": "./proj"}
+        assert sample.setup == "dart pub get"
+        # Inspect AI normalises string sandbox values to SandboxEnvironmentSpec
+        sandbox = sample.sandbox
+        sandbox_type = sandbox.type if hasattr(sandbox, "type") else sandbox
+        assert sandbox_type == "docker"
+
+    def test_memory_format_dataset_name(self):
+        """Dataset name falls back to task name when not set in dataset dict."""
+        task_def = {
+            "name": "dart_qa:baseline",
+            "dataset": {
+                "samples": [],
+            },
+        }
+        result = _build_dataset(task_def)
+        assert isinstance(result, MemoryDataset)
+        # Name is set (MemoryDataset stores it)
+        assert result.name == "dart_qa:baseline"
+
+    def test_memory_format_explicit_dataset_name_wins(self):
+        """Explicit dataset name takes precedence over task name."""
+        task_def = {
+            "name": "dart_qa:baseline",
+            "dataset": {
+                "name": "custom_name",
+                "samples": [],
+            },
+        }
+        result = _build_dataset(task_def)
+        assert result.name == "custom_name"
+
+
+class TestBuildDatasetJsonFormat:
+    """Tests for JSON file-backed dataset (format='json')."""
+
+    def test_json_format_calls_json_dataset(self):
+        """format='json' calls inspect_ai.dataset.json_dataset(source)."""
+        task_def = {
+            "name": "my_task:baseline",
+            "dataset": {
+                "format": "json",
+                "source": "gs://bucket/data.jsonl",
+            },
+        }
+        mock_ds = MagicMock(name="json_dataset_result")
+        with patch("dash_evals.runner.json_runner.json_dataset", return_value=mock_ds) as mock_fn:
+            result = _build_dataset(task_def)
+
+        mock_fn.assert_called_once_with("gs://bucket/data.jsonl")
+        assert result is mock_ds
+
+    def test_json_format_passes_extra_args(self):
+        """Extra args from dataset.args are passed as kwargs to json_dataset()."""
+        task_def = {
+            "name": "t:v",
+            "dataset": {
+                "format": "json",
+                "source": "./data.jsonl",
+                "args": {"auto_id": True, "shuffle": True},
+            },
+        }
+        with patch("dash_evals.runner.json_runner.json_dataset") as mock_fn:
+            _build_dataset(task_def)
+
+        mock_fn.assert_called_once_with("./data.jsonl", auto_id=True, shuffle=True)
+
+    def test_json_format_missing_source_raises(self):
+        """format='json' without a source raises ValueError."""
+        task_def = {
+            "name": "my_task:baseline",
+            "dataset": {"format": "json"},
+        }
+        with pytest.raises(ValueError, match="requires a 'source' field"):
+            _build_dataset(task_def)
+
+
+class TestBuildDatasetCsvFormat:
+    """Tests for CSV file-backed dataset (format='csv')."""
+
+    def test_csv_format_calls_csv_dataset(self):
+        """format='csv' calls inspect_ai.dataset.csv_dataset(source)."""
+        task_def = {
+            "name": "my_task:baseline",
+            "dataset": {
+                "format": "csv",
+                "source": "./data.csv",
+            },
+        }
+        mock_ds = MagicMock(name="csv_dataset_result")
+        with patch("dash_evals.runner.json_runner.csv_dataset", return_value=mock_ds) as mock_fn:
+            result = _build_dataset(task_def)
+
+        mock_fn.assert_called_once_with("./data.csv")
+        assert result is mock_ds
+
+    def test_csv_format_passes_extra_args(self):
+        """Extra args from dataset.args are passed as kwargs to csv_dataset()."""
+        task_def = {
+            "name": "t:v",
+            "dataset": {
+                "format": "csv",
+                "source": "./data.csv",
+                "args": {"delimiter": "\t", "encoding": "utf-8"},
+            },
+        }
+        with patch("dash_evals.runner.json_runner.csv_dataset") as mock_fn:
+            _build_dataset(task_def)
+
+        mock_fn.assert_called_once_with("./data.csv", delimiter="\t", encoding="utf-8")
+
+    def test_csv_format_missing_source_raises(self):
+        """format='csv' without a source raises ValueError."""
+        task_def = {
+            "name": "my_task:baseline",
+            "dataset": {"format": "csv"},
+        }
+        with pytest.raises(ValueError, match="requires a 'source' field"):
+            _build_dataset(task_def)
+
+
+class TestBuildDatasetUnknownFormat:
+    """Tests for unknown dataset formats."""
+
+    def test_unknown_format_raises(self):
+        """An unrecognized format string raises ValueError."""
+        task_def = {
+            "name": "my_task:baseline",
+            "dataset": {
+                "format": "parquet",
+                "source": "./data.parquet",
+            },
+        }
+        with pytest.raises(ValueError, match="unknown dataset format 'parquet'"):
+            _build_dataset(task_def)
diff --git a/packages/dataset_config_dart/lib/src/models/dataset.dart b/packages/dataset_config_dart/lib/src/models/dataset.dart
index 874080e..0bd9970 100644
--- a/packages/dataset_config_dart/lib/src/models/dataset.dart
+++ b/packages/dataset_config_dart/lib/src/models/dataset.dart
@@ -17,7 +17,7 @@ part 'dataset.g.dart';
 @freezed
 sealed class Dataset with _$Dataset {
   const factory Dataset({
-    /// The list of sample objects.
+    /// The list of sample objects (only used when format is 'memory').
     @Default([]) List samples,
 
     /// Dataset name.
@@ -28,6 +28,15 @@ sealed class Dataset with _$Dataset {
 
     /// Whether the dataset was shuffled after reading.
     @Default(false) bool shuffled,
+
+    /// Dataset format: 'memory' (inline samples), 'json', or 'csv'.
+    @Default('memory') String format,
+
+    /// File path or URL for json/csv datasets.
+    String? source,
+
+    /// Extra kwargs passed to json_dataset() or csv_dataset().
+    Map? args,
   }) = _Dataset;
 
   factory Dataset.fromJson(Map json) =>
diff --git a/packages/dataset_config_dart/lib/src/models/dataset.freezed.dart b/packages/dataset_config_dart/lib/src/models/dataset.freezed.dart
index fdd77dc..8c0c2d2 100644
--- a/packages/dataset_config_dart/lib/src/models/dataset.freezed.dart
+++ b/packages/dataset_config_dart/lib/src/models/dataset.freezed.dart
@@ -15,11 +15,14 @@ T _$identity(T value) => value;
 
 /// @nodoc
 mixin _$Dataset {
 
-/// The list of sample objects.
+/// The list of sample objects (only used when format is 'memory').
 List get samples;/// Dataset name.
 String? get name;/// Dataset location (file path or remote URL).
 String? get location;/// Whether the dataset was shuffled after reading.
- bool get shuffled;
+ bool get shuffled;/// Dataset format: 'memory' (inline samples), 'json', or 'csv'.
+ String get format;/// File path or URL for json/csv datasets.
+ String? get source;/// Extra kwargs passed to json_dataset() or csv_dataset().
+ Map? get args;
 /// Create a copy of Dataset
 /// with the given fields replaced by the non-null parameter values.
@JsonKey(includeFromJson: false, includeToJson: false) @@ -32,16 +35,16 @@ $DatasetCopyWith get copyWith => _$DatasetCopyWithImpl(this as @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Dataset&&const DeepCollectionEquality().equals(other.samples, samples)&&(identical(other.name, name) || other.name == name)&&(identical(other.location, location) || other.location == location)&&(identical(other.shuffled, shuffled) || other.shuffled == shuffled)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Dataset&&const DeepCollectionEquality().equals(other.samples, samples)&&(identical(other.name, name) || other.name == name)&&(identical(other.location, location) || other.location == location)&&(identical(other.shuffled, shuffled) || other.shuffled == shuffled)&&(identical(other.format, format) || other.format == format)&&(identical(other.source, source) || other.source == source)&&const DeepCollectionEquality().equals(other.args, args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(samples),name,location,shuffled); +int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(samples),name,location,shuffled,format,source,const DeepCollectionEquality().hash(args)); @override String toString() { - return 'Dataset(samples: $samples, name: $name, location: $location, shuffled: $shuffled)'; + return 'Dataset(samples: $samples, name: $name, location: $location, shuffled: $shuffled, format: $format, source: $source, args: $args)'; } @@ -52,7 +55,7 @@ abstract mixin class $DatasetCopyWith<$Res> { factory $DatasetCopyWith(Dataset value, $Res Function(Dataset) _then) = _$DatasetCopyWithImpl; @useResult $Res call({ - List samples, String? name, String? location, bool shuffled + List samples, String? name, String? location, bool shuffled, String format, String? 
source, Map? args }); @@ -69,13 +72,16 @@ class _$DatasetCopyWithImpl<$Res> /// Create a copy of Dataset /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? samples = null,Object? name = freezed,Object? location = freezed,Object? shuffled = null,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? samples = null,Object? name = freezed,Object? location = freezed,Object? shuffled = null,Object? format = null,Object? source = freezed,Object? args = freezed,}) { return _then(_self.copyWith( samples: null == samples ? _self.samples : samples // ignore: cast_nullable_to_non_nullable as List,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String?,location: freezed == location ? _self.location : location // ignore: cast_nullable_to_non_nullable as String?,shuffled: null == shuffled ? _self.shuffled : shuffled // ignore: cast_nullable_to_non_nullable -as bool, +as bool,format: null == format ? _self.format : format // ignore: cast_nullable_to_non_nullable +as String,source: freezed == source ? _self.source : source // ignore: cast_nullable_to_non_nullable +as String?,args: freezed == args ? _self.args : args // ignore: cast_nullable_to_non_nullable +as Map?, )); } @@ -157,10 +163,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( List samples, String? name, String? location, bool shuffled)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( List samples, String? name, String? location, bool shuffled, String format, String? source, Map? args)? 
$default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Dataset() when $default != null: -return $default(_that.samples,_that.name,_that.location,_that.shuffled);case _: +return $default(_that.samples,_that.name,_that.location,_that.shuffled,_that.format,_that.source,_that.args);case _: return orElse(); } @@ -178,10 +184,10 @@ return $default(_that.samples,_that.name,_that.location,_that.shuffled);case _: /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( List samples, String? name, String? location, bool shuffled) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( List samples, String? name, String? location, bool shuffled, String format, String? source, Map? args) $default,) {final _that = this; switch (_that) { case _Dataset(): -return $default(_that.samples,_that.name,_that.location,_that.shuffled);} +return $default(_that.samples,_that.name,_that.location,_that.shuffled,_that.format,_that.source,_that.args);} } /// A variant of `when` that fallback to returning `null` /// @@ -195,10 +201,10 @@ return $default(_that.samples,_that.name,_that.location,_that.shuffled);} /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( List samples, String? name, String? location, bool shuffled)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( List samples, String? name, String? location, bool shuffled, String format, String? source, Map? args)? 
$default,) {final _that = this; switch (_that) { case _Dataset() when $default != null: -return $default(_that.samples,_that.name,_that.location,_that.shuffled);case _: +return $default(_that.samples,_that.name,_that.location,_that.shuffled,_that.format,_that.source,_that.args);case _: return null; } @@ -210,12 +216,12 @@ return $default(_that.samples,_that.name,_that.location,_that.shuffled);case _: @JsonSerializable() class _Dataset implements Dataset { - const _Dataset({final List samples = const [], this.name, this.location, this.shuffled = false}): _samples = samples; + const _Dataset({final List samples = const [], this.name, this.location, this.shuffled = false, this.format = 'memory', this.source, final Map? args}): _samples = samples,_args = args; factory _Dataset.fromJson(Map json) => _$DatasetFromJson(json); -/// The list of sample objects. +/// The list of sample objects (only used when format is 'memory'). final List _samples; -/// The list of sample objects. +/// The list of sample objects (only used when format is 'memory'). @override@JsonKey() List get samples { if (_samples is EqualUnmodifiableListView) return _samples; // ignore: implicit_dynamic_type @@ -228,6 +234,21 @@ class _Dataset implements Dataset { @override final String? location; /// Whether the dataset was shuffled after reading. @override@JsonKey() final bool shuffled; +/// Dataset format: 'memory' (inline samples), 'json', or 'csv'. +@override@JsonKey() final String format; +/// File path or URL for json/csv datasets. +@override final String? source; +/// Extra kwargs passed to json_dataset() or csv_dataset(). + final Map? _args; +/// Extra kwargs passed to json_dataset() or csv_dataset(). +@override Map? 
get args { + final value = _args; + if (value == null) return null; + if (_args is EqualUnmodifiableMapView) return _args; + // ignore: implicit_dynamic_type + return EqualUnmodifiableMapView(value); +} + /// Create a copy of Dataset /// with the given fields replaced by the non-null parameter values. @@ -242,16 +263,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Dataset&&const DeepCollectionEquality().equals(other._samples, _samples)&&(identical(other.name, name) || other.name == name)&&(identical(other.location, location) || other.location == location)&&(identical(other.shuffled, shuffled) || other.shuffled == shuffled)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Dataset&&const DeepCollectionEquality().equals(other._samples, _samples)&&(identical(other.name, name) || other.name == name)&&(identical(other.location, location) || other.location == location)&&(identical(other.shuffled, shuffled) || other.shuffled == shuffled)&&(identical(other.format, format) || other.format == format)&&(identical(other.source, source) || other.source == source)&&const DeepCollectionEquality().equals(other._args, _args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(_samples),name,location,shuffled); +int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(_samples),name,location,shuffled,format,source,const DeepCollectionEquality().hash(_args)); @override String toString() { - return 'Dataset(samples: $samples, name: $name, location: $location, shuffled: $shuffled)'; + return 'Dataset(samples: $samples, name: $name, location: $location, shuffled: $shuffled, format: $format, source: $source, args: $args)'; } @@ -262,7 +283,7 @@ abstract mixin class _$DatasetCopyWith<$Res> implements $DatasetCopyWith<$Res> { factory 
_$DatasetCopyWith(_Dataset value, $Res Function(_Dataset) _then) = __$DatasetCopyWithImpl; @override @useResult $Res call({ - List samples, String? name, String? location, bool shuffled + List samples, String? name, String? location, bool shuffled, String format, String? source, Map? args }); @@ -279,13 +300,16 @@ class __$DatasetCopyWithImpl<$Res> /// Create a copy of Dataset /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? samples = null,Object? name = freezed,Object? location = freezed,Object? shuffled = null,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? samples = null,Object? name = freezed,Object? location = freezed,Object? shuffled = null,Object? format = null,Object? source = freezed,Object? args = freezed,}) { return _then(_Dataset( samples: null == samples ? _self._samples : samples // ignore: cast_nullable_to_non_nullable as List,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String?,location: freezed == location ? _self.location : location // ignore: cast_nullable_to_non_nullable as String?,shuffled: null == shuffled ? _self.shuffled : shuffled // ignore: cast_nullable_to_non_nullable -as bool, +as bool,format: null == format ? _self.format : format // ignore: cast_nullable_to_non_nullable +as String,source: freezed == source ? _self.source : source // ignore: cast_nullable_to_non_nullable +as String?,args: freezed == args ? 
_self._args : args // ignore: cast_nullable_to_non_nullable +as Map?, )); } diff --git a/packages/dataset_config_dart/lib/src/models/dataset.g.dart b/packages/dataset_config_dart/lib/src/models/dataset.g.dart index a3c87a3..f7ff71a 100644 --- a/packages/dataset_config_dart/lib/src/models/dataset.g.dart +++ b/packages/dataset_config_dart/lib/src/models/dataset.g.dart @@ -15,6 +15,9 @@ _Dataset _$DatasetFromJson(Map json) => _Dataset( name: json['name'] as String?, location: json['location'] as String?, shuffled: json['shuffled'] as bool? ?? false, + format: json['format'] as String? ?? 'memory', + source: json['source'] as String?, + args: json['args'] as Map?, ); Map _$DatasetToJson(_Dataset instance) => { @@ -22,4 +25,7 @@ Map _$DatasetToJson(_Dataset instance) => { 'name': instance.name, 'location': instance.location, 'shuffled': instance.shuffled, + 'format': instance.format, + 'source': instance.source, + 'args': instance.args, }; diff --git a/packages/dataset_config_dart/lib/src/models/job.dart b/packages/dataset_config_dart/lib/src/models/job.dart index a7566aa..793946b 100644 --- a/packages/dataset_config_dart/lib/src/models/job.dart +++ b/packages/dataset_config_dart/lib/src/models/job.dart @@ -53,8 +53,8 @@ sealed class Job with _$Job { /// Maximum concurrent API connections. @JsonKey(name: 'max_connections') @Default(10) int maxConnections, - /// Models to run. `null` means use defaults from registries. - List? models, + /// Models to run (required). + required List models, /// Named variant map. Keys are variant names, values are config dicts. /// `null` means baseline only. diff --git a/packages/dataset_config_dart/lib/src/models/job.freezed.dart b/packages/dataset_config_dart/lib/src/models/job.freezed.dart index b0de561..14cd2bb 100644 --- a/packages/dataset_config_dart/lib/src/models/job.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/job.freezed.dart @@ -21,8 +21,8 @@ mixin _$Job { /// Human-readable description of this job. String? 
get description;/// Directory to write evaluation logs to. @JsonKey(name: 'log_dir') String get logDir;/// Maximum concurrent API connections. -@JsonKey(name: 'max_connections') int get maxConnections;/// Models to run. `null` means use defaults from registries. - List? get models;/// Named variant map. Keys are variant names, values are config dicts. +@JsonKey(name: 'max_connections') int get maxConnections;/// Models to run (required). + List get models;/// Named variant map. Keys are variant names, values are config dicts. /// `null` means baseline only. Map>? get variants;/// Glob patterns for discovering task directories (relative to dataset root). @JsonKey(name: 'task_paths') List? get taskPaths;/// Per-task configurations with inline overrides. @@ -74,7 +74,7 @@ abstract mixin class $JobCopyWith<$Res> { factory $JobCopyWith(Job value, $Res Function(Job) _then) = _$JobCopyWithImpl; @useResult $Res call({ - String? description,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox,@JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters + String? description,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'max_connections') int maxConnections, List models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox,@JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters }); @@ -91,13 +91,13 @@ class _$JobCopyWithImpl<$Res> /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. 
-@pragma('vm:prefer-inline') @override $Res call({Object? description = freezed,Object? logDir = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? sandbox = freezed,Object? inspectEvalArguments = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? description = freezed,Object? logDir = null,Object? maxConnections = null,Object? models = null,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? sandbox = freezed,Object? inspectEvalArguments = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { return _then(_self.copyWith( description: freezed == description ? _self.description : description // ignore: cast_nullable_to_non_nullable as String?,logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable as String,maxConnections: null == maxConnections ? _self.maxConnections : maxConnections // ignore: cast_nullable_to_non_nullable -as int,models: freezed == models ? _self.models : models // ignore: cast_nullable_to_non_nullable -as List?,variants: freezed == variants ? _self.variants : variants // ignore: cast_nullable_to_non_nullable +as int,models: null == models ? _self.models : models // ignore: cast_nullable_to_non_nullable +as List,variants: freezed == variants ? _self.variants : variants // ignore: cast_nullable_to_non_nullable as Map>?,taskPaths: freezed == taskPaths ? _self.taskPaths : taskPaths // ignore: cast_nullable_to_non_nullable as List?,tasks: freezed == tasks ? _self.tasks : tasks // ignore: cast_nullable_to_non_nullable as Map?,saveExamples: null == saveExamples ? 
_self.saveExamples : saveExamples // ignore: cast_nullable_to_non_nullable @@ -211,7 +211,7 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Job() when $default != null: return $default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);case _: @@ -232,7 +232,7 @@ return $default(_that.description,_that.logDir,_that.maxConnections,_that.models /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? 
sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters) $default,) {final _that = this; switch (_that) { case _Job(): return $default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);} @@ -249,7 +249,7 @@ return $default(_that.description,_that.logDir,_that.maxConnections,_that.models /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? 
sandbox, @JsonKey(name: 'inspect_eval_arguments') Map<String, dynamic>? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,) {final _that = this; switch (_that) { case _Job() when $default != null: return $default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);case _: @@ -264,7 +264,7 @@ return $default(_that.description,_that.logDir,_that.maxConnections,_that.models @JsonSerializable() class _Job implements Job { - const _Job({this.description, @JsonKey(name: 'log_dir') required this.logDir, @JsonKey(name: 'max_connections') this.maxConnections = 10, final List<String>? models, final Map<String, Map<String, dynamic>>? variants, @JsonKey(name: 'task_paths') final List<String>? taskPaths, final Map<String, JobTask>? tasks, @JsonKey(name: 'save_examples') this.saveExamples = false, final Map<String, dynamic>? sandbox, @JsonKey(name: 'inspect_eval_arguments') final Map<String, dynamic>? inspectEvalArguments, @JsonKey(name: 'task_filters') this.taskFilters, @JsonKey(name: 'sample_filters') this.sampleFilters}): _models = models,_variants = variants,_taskPaths = taskPaths,_tasks = tasks,_sandbox = sandbox,_inspectEvalArguments = inspectEvalArguments; + const _Job({this.description, @JsonKey(name: 'log_dir') required this.logDir, @JsonKey(name: 'max_connections') this.maxConnections = 10, required final List<String> models, final Map<String, Map<String, dynamic>>? variants, @JsonKey(name: 'task_paths') final List<String>? taskPaths, final Map<String, JobTask>? tasks, @JsonKey(name: 'save_examples') this.saveExamples = false, final Map<String, dynamic>? sandbox, @JsonKey(name: 'inspect_eval_arguments') final Map<String, dynamic>? 
inspectEvalArguments, @JsonKey(name: 'task_filters') this.taskFilters, @JsonKey(name: 'sample_filters') this.sampleFilters}): _models = models,_variants = variants,_taskPaths = taskPaths,_tasks = tasks,_sandbox = sandbox,_inspectEvalArguments = inspectEvalArguments; factory _Job.fromJson(Map<String, dynamic> json) => _$JobFromJson(json); // ------------------------------------------------------------------ @@ -276,15 +276,13 @@ class _Job implements Job { @override@JsonKey(name: 'log_dir') final String logDir; /// Maximum concurrent API connections. @override@JsonKey(name: 'max_connections') final int maxConnections; -/// Models to run. `null` means use defaults from registries. - final List<String>? _models; -/// Models to run. `null` means use defaults from registries. -@override List<String>? get models { - final value = _models; - if (value == null) return null; +/// Models to run (required). + final List<String> _models; +/// Models to run (required). +@override List<String> get models { if (_models is EqualUnmodifiableListView) return _models; // ignore: implicit_dynamic_type - return EqualUnmodifiableListView(value); + return EqualUnmodifiableListView(_models); } /// Named variant map. Keys are variant names, values are config dicts. @@ -401,7 +399,7 @@ abstract mixin class _$JobCopyWith<$Res> implements $JobCopyWith<$Res> { factory _$JobCopyWith(_Job value, $Res Function(_Job) _then) = __$JobCopyWithImpl; @override @useResult $Res call({ - String? description,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'max_connections') int maxConnections, List<String>? models, Map<String, Map<String, dynamic>>? variants,@JsonKey(name: 'task_paths') List<String>? taskPaths, Map<String, JobTask>? tasks,@JsonKey(name: 'save_examples') bool saveExamples, Map<String, dynamic>? sandbox,@JsonKey(name: 'inspect_eval_arguments') Map<String, dynamic>? inspectEvalArguments,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters + String? 
description,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'max_connections') int maxConnections, List<String> models, Map<String, Map<String, dynamic>>? variants,@JsonKey(name: 'task_paths') List<String>? taskPaths, Map<String, JobTask>? tasks,@JsonKey(name: 'save_examples') bool saveExamples, Map<String, dynamic>? sandbox,@JsonKey(name: 'inspect_eval_arguments') Map<String, dynamic>? inspectEvalArguments,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters }); @@ -418,13 +416,13 @@ class __$JobCopyWithImpl<$Res> /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? description = freezed,Object? logDir = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? sandbox = freezed,Object? inspectEvalArguments = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? description = freezed,Object? logDir = null,Object? maxConnections = null,Object? models = null,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? sandbox = freezed,Object? inspectEvalArguments = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { return _then(_Job( description: freezed == description ? _self.description : description // ignore: cast_nullable_to_non_nullable as String?,logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable as String,maxConnections: null == maxConnections ? _self.maxConnections : maxConnections // ignore: cast_nullable_to_non_nullable -as int,models: freezed == models ? _self._models : models // ignore: cast_nullable_to_non_nullable -as List<String>?,variants: freezed == variants ? _self._variants : variants // ignore: cast_nullable_to_non_nullable +as int,models: null == models ? 
_self._models : models // ignore: cast_nullable_to_non_nullable +as List<String>,variants: freezed == variants ? _self._variants : variants // ignore: cast_nullable_to_non_nullable as Map<String, Map<String, dynamic>>?,taskPaths: freezed == taskPaths ? _self._taskPaths : taskPaths // ignore: cast_nullable_to_non_nullable as List<String>?,tasks: freezed == tasks ? _self._tasks : tasks // ignore: cast_nullable_to_non_nullable as Map<String, JobTask>?,saveExamples: null == saveExamples ? _self.saveExamples : saveExamples // ignore: cast_nullable_to_non_nullable diff --git a/packages/dataset_config_dart/lib/src/models/job.g.dart b/packages/dataset_config_dart/lib/src/models/job.g.dart index e2996b3..c5aad96 100644 --- a/packages/dataset_config_dart/lib/src/models/job.g.dart +++ b/packages/dataset_config_dart/lib/src/models/job.g.dart @@ -10,7 +10,7 @@ _Job _$JobFromJson(Map<String, dynamic> json) => _Job( description: json['description'] as String?, logDir: json['log_dir'] as String, maxConnections: (json['max_connections'] as num?)?.toInt() ?? 10, - models: (json['models'] as List<dynamic>?)?.map((e) => e as String).toList(), + models: (json['models'] as List<dynamic>).map((e) => e as String).toList(), variants: (json['variants'] as Map<String, dynamic>?)?.map( (k, e) => MapEntry(k, e as Map<String, dynamic>), ), diff --git a/packages/dataset_config_dart/lib/src/parsed_task.dart b/packages/dataset_config_dart/lib/src/parsed_task.dart index c2d54f6..465d9cb 100644 --- a/packages/dataset_config_dart/lib/src/parsed_task.dart +++ b/packages/dataset_config_dart/lib/src/parsed_task.dart @@ -85,6 +85,15 @@ class ParsedTask { /// Additional metadata to associate with the task. final Map<String, dynamic>? metadata; + /// Dataset format: 'memory' (inline samples), 'json', or 'csv'. + final String datasetFormat; + + /// File path or URL for json/csv datasets. + final String? datasetSource; + + /// Extra kwargs passed to json_dataset() or csv_dataset(). + final Map<String, dynamic>? 
datasetArgs; + const ParsedTask({ required this.id, required this.func, @@ -114,6 +123,9 @@ class ParsedTask { this.displayName, this.version, this.metadata, + this.datasetFormat = 'memory', + this.datasetSource, + this.datasetArgs, }); /// Create a copy with overrides. @@ -176,6 +188,9 @@ class ParsedTask { displayName: displayName ?? this.displayName, version: version ?? this.version, metadata: metadata ?? this.metadata, + datasetFormat: this.datasetFormat, + datasetSource: this.datasetSource, + datasetArgs: this.datasetArgs, ); } } diff --git a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart index b6e40d3..74ad76d 100644 --- a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart +++ b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart @@ -24,66 +24,90 @@ class JsonParser extends Parser { final func = (data['func'] as String?) ?? taskId; final systemMessage = data['system_message'] as String?; - // Parse samples from inline data (no file I/O) - optional - final samplesRaw = data['samples']; + // Parse dataset section (matches YAML parser's dataset key structure) + final datasetRaw = data['dataset']; final samples = <Sample>[]; - if (samplesRaw is Map) { - final inlineDefs = - (samplesRaw['inline'] as List?)?.cast<Map<String, dynamic>>() ?? - const []; - for (final def in inlineDefs) { - if (def.isEmpty) continue; - - // Validate required fields - for (final field in ['id', 'input', 'target']) { - if (!def.containsKey(field)) { - throw FormatException( - "Sample '${def['id'] ?? 'unknown'}' missing required " - "field: $field", + var datasetFormat = 'memory'; + String? datasetSource; + Map<String, dynamic>? 
datasetArgs; + + if (datasetRaw is Map) { + final datasetMap = Map<String, dynamic>.from(datasetRaw); + + // Parse optional args + if (datasetMap['args'] is Map) { + datasetArgs = Map<String, dynamic>.from(datasetMap['args'] as Map); + } + + if (datasetMap.containsKey('json')) { + datasetFormat = 'json'; + datasetSource = datasetMap['json'].toString(); + } else if (datasetMap.containsKey('csv')) { + datasetFormat = 'csv'; + datasetSource = datasetMap['csv'].toString(); + } else if (datasetMap.containsKey('samples')) { + // Inline samples — same as before + final samplesSection = datasetMap['samples']; + if (samplesSection is Map) { + final inlineDefs = + (samplesSection['inline'] as List?) + ?.cast<Map<String, dynamic>>() ?? + const []; + for (final def in inlineDefs) { + if (def.isEmpty) continue; + + // Validate required fields + for (final field in ['id', 'input', 'target']) { + if (!def.containsKey(field)) { + throw FormatException( + "Sample '${def['id'] ?? 'unknown'}' missing required " + "field: $field", ); + } + } + + // Read metadata from the metadata dict + final metaRaw = Map<String, dynamic>.from( + def['metadata'] as Map? ?? {}, ); - } - } - // Read metadata from the metadata dict - final metaRaw = Map<String, dynamic>.from( - def['metadata'] as Map? ?? {}, - ); - - // Normalize tags from metadata - final rawTags = metaRaw['tags']; - final List<String> tags; - if (rawTags is String) { - tags = rawTags.split(',').map((t) => t.trim()).toList(); - } else if (rawTags is List) { - tags = rawTags.cast<String>(); - } else { - tags = []; - } + // Normalize tags from metadata + final rawTags = metaRaw['tags']; + final List<String> tags; + if (rawTags is String) { + tags = rawTags.split(',').map((t) => t.trim()).toList(); + } else if (rawTags is List) { + tags = rawTags.cast<String>(); + } else { + tags = []; + } + + // Parse sample-level fields + final choices = (def['choices'] as List?)?.cast<String>(); + final sampleSandbox = def['sandbox']; + final setup = def['setup'] as String?; + final files = def['files'] is Map + ? 
Map<String, String>.from(def['files'] as Map) + : null; - // Parse sample-level fields - final choices = (def['choices'] as List?)?.cast<String>(); - final sampleSandbox = def['sandbox']; - final setup = def['setup'] as String?; - final files = def['files'] is Map - ? Map<String, String>.from(def['files'] as Map) - : null; - - samples.add( - Sample( - id: def['id'] as String, - input: def['input'] as String, - target: def['target'] as String, - metadata: { - ...metaRaw, - 'difficulty': metaRaw['difficulty'] as String? ?? 'medium', - 'tags': tags, - }, - choices: choices, - sandbox: sampleSandbox, - setup: setup, - files: files, - ), - ); + samples.add( + Sample( + id: def['id'] as String, + input: def['input'] as String, + target: def['target'] as String, + metadata: { + ...metaRaw, + 'difficulty': metaRaw['difficulty'] as String? ?? 'medium', + 'tags': tags, + }, + choices: choices, + sandbox: sampleSandbox, + setup: setup, + files: files, + ), + ); + } + } } } @@ -139,6 +163,9 @@ class JsonParser extends Parser { displayName: displayName, version: version, metadata: taskMetadata, + datasetFormat: datasetFormat, + datasetSource: datasetSource, + datasetArgs: datasetArgs, ); }).toList(); } @@ -164,10 +191,20 @@ class JsonParser extends Parser { sandbox = {'environment': sandboxRaw}; } + // Parse models (required) + final modelsRaw = data['models'] as List?; + if (modelsRaw == null || modelsRaw.isEmpty) { + throw FormatException( + "Job data is missing required 'models' field. " + 'Specify at least one model.', + ); + } + final models = modelsRaw.cast<String>(); + return Job( logDir: (data['log_dir'] as String?) ?? '', maxConnections: (data['max_connections'] as int?) ?? 
10, - models: (data['models'] as List?)?.cast<String>(), + models: models, saveExamples: data['save_examples'] == true, sandbox: sandbox, inspectEvalArguments: data['inspect_eval_arguments'] is Map diff --git a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart index f57d4a4..a8d2e33 100644 --- a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart +++ b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart @@ -58,21 +58,71 @@ class YamlParser extends Parser { final taskFiles = _asStringMap(data['files']); final taskSetup = data['setup'] as String?; - // Parse samples section - final samplesRaw = data['samples']; - if (samplesRaw is! Map) { + // Parse dataset section (replaces the old top-level 'samples' key) + final datasetRaw = data['dataset']; + var datasetFormat = 'memory'; + String? datasetSource; + Map<String, dynamic>? datasetArgs; + List<Sample> samples; + + if (datasetRaw == null) { + samples = []; + } else if (datasetRaw is! Map) { throw FormatException( - "Task '$taskId': 'samples' must be a dict with 'inline' and/or " - "'paths' keys, got ${samplesRaw.runtimeType}", + "Task '$taskId': 'dataset' must be a dict with one of " + "'samples', 'json', or 'csv' keys, got ${datasetRaw.runtimeType}", ); + } else { + final datasetMap = Map<String, dynamic>.from(datasetRaw); + final formatKeys = {'samples', 'json', 'csv'}; + final presentKeys = + formatKeys.intersection(datasetMap.keys.toSet().cast<String>()); + if (presentKeys.length > 1) { + throw FormatException( + "Task '$taskId': 'dataset' must have exactly one of " + "'samples', 'json', or 'csv', found: $presentKeys", + ); + } + + // Parse optional args + final argsRaw = datasetMap['args']; + if (argsRaw != null) { + if (argsRaw is! 
Map) { + throw FormatException( + "Task '$taskId': 'dataset.args' must be a dict, " + 'got ${argsRaw.runtimeType}', + ); + } + datasetArgs = Map<String, dynamic>.from(argsRaw); + } + + if (datasetMap.containsKey('samples')) { + // Inline/path-based samples (existing MemoryDataset behavior) + final samplesSection = datasetMap['samples']; + if (samplesSection is! Map) { + throw FormatException( + "Task '$taskId': 'dataset.samples' must be a dict with " + "'inline' and/or 'paths' keys, got ${samplesSection.runtimeType}", + ); + } + samples = _loadSamplesSection( + Map<String, dynamic>.from(samplesSection), + datasetRoot, + taskFiles, + taskDir, + ); + } else if (datasetMap.containsKey('json')) { + datasetFormat = 'json'; + datasetSource = datasetMap['json'].toString(); + samples = []; + } else if (datasetMap.containsKey('csv')) { + datasetFormat = 'csv'; + datasetSource = datasetMap['csv'].toString(); + samples = []; + } else { + samples = []; + } } - final samplesMap = Map<String, dynamic>.from(samplesRaw); - final samples = _loadSamplesSection( - samplesMap, - datasetRoot, - taskFiles, - taskDir, - ); // Task-level Inspect AI args are nested under inspect_task_args final taskArgs = _asMap(data['inspect_task_args']) ?? {}; @@ -105,6 +155,9 @@ class YamlParser extends Parser { sandboxParameters: sandboxParameters, taskFiles: taskFiles, taskSetup: taskSetup, + datasetFormat: datasetFormat, + datasetSource: datasetSource, + datasetArgs: datasetArgs, // Task-level settings model: model, config: config, @@ -350,11 +403,22 @@ class YamlParser extends Parser { ? TagFilter.fromJson(Map<String, dynamic>.from(sampleFiltersRaw)) : null; + // Parse models (required) + final modelsRaw = data['models'] as List?; + if (modelsRaw == null || modelsRaw.isEmpty) { + throw FormatException( + "Job file '$jobPath' is missing required 'models' field. 
" + "Specify at least one model, e.g.:\n" + ' models:\n - google/gemini-2.5-flash', + ); + } + final models = modelsRaw.cast<String>(); + return Job( logDir: logDir, maxConnections: maxConnections, description: data['description'] as String?, - models: (data['models'] as List?)?.cast<String>(), + models: models, variants: variants, taskPaths: taskPaths, tasks: tasks, @@ -381,9 +445,14 @@ } /// Create a [Job] with default settings (when no job file is provided). + /// + /// Note: The caller must specify models, as there are no defaults. + /// This method creates a job with an empty models list; the resolver + /// will raise an error if models is empty at resolution time. Job createDefaultJob(String baseDir) { return Job( logDir: _resolveLogDir(_kDefaultLogsDir, baseDir), + models: [], ); } diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index 9f0e817..ec9b36c 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -7,19 +7,7 @@ import 'package:path/path.dart' as p; import '../parsed_task.dart'; -/// Default models used when a job doesn't specify its own. -const List<String> kDefaultModels = [ - 'anthropic/claude-haiku-4-5', - 'anthropic/claude-sonnet-4-5', - 'anthropic/claude-opus-4-6', - 'google/gemini-2.5-flash', - 'google/gemini-3-pro-preview', - 'google/gemini-3-flash-preview', - 'openai/gpt-5-mini', - 'openai/gpt-5-nano', - 'openai/gpt-5', - 'openai/gpt-5-pro', -]; + /// Default sandbox configurations for Flutter evaluations. /// @@ -64,7 +52,14 @@ class EvalSetResolver { Job job, String datasetRoot, ) { - final models = _resolveModels(job); + if (job.models.isEmpty) { + throw ArgumentError( + 'job.models is required and must contain at least one model. 
' 'Specify models in your job YAML, e.g.:\n' ' models:\n - google/gemini-2.5-flash', + ); + } + final models = job.models; final sandboxCfg = job.sandbox ?? {}; final sandboxTypeStr = (sandboxCfg['environment'] as String?) ?? 'local'; final expandedTasks = _expandTaskConfigs( @@ -150,6 +145,9 @@ final dataset = Dataset( samples: inspectSamples, name: '${tc.id}:${tc.variant.name}', + format: tc.datasetFormat, + source: tc.datasetSource, + args: tc.datasetArgs, ); // Build task metadata (variant config, system message, etc.) @@ -343,15 +341,7 @@ ); } - // ------------------------------------------------------------------ - // Model resolution - // ------------------------------------------------------------------ - /// Resolve which models to run. Job overrides default. - List<String> _resolveModels(Job job) { - if (job.models != null && job.models!.isNotEmpty) return job.models!; - return List.of(kDefaultModels); - } // ------------------------------------------------------------------ // Sandbox resolution diff --git a/packages/dataset_config_dart/test/eval_set_resolver_test.dart b/packages/dataset_config_dart/test/eval_set_resolver_test.dart index 8765930..bd4eb82 100644 --- a/packages/dataset_config_dart/test/eval_set_resolver_test.dart +++ b/packages/dataset_config_dart/test/eval_set_resolver_test.dart @@ -42,7 +42,7 @@ void main() { Job makeJob({ String logDir = '/tmp/logs', Map<String, dynamic>? sandbox, - List<String>? models, + List<String> models = const ['test-model'], Map<String, Map<String, dynamic>>? variants, Map<String, JobTask>? 
tasks, bool saveExamples = false, @@ -148,14 +148,15 @@ void main() { expect(results.first.model, ['model_a', 'model_b']); }); - test('uses default models when job has none', () { - final results = resolver.resolve( - [makeTask()], - makeJob(models: null), - '/tmp/dataset', + test('throws when job has empty models', () { + expect( + () => resolver.resolve( + [makeTask()], + makeJob(models: []), + '/tmp/dataset', + ), + throwsArgumentError, ); - - expect(results.first.model, kDefaultModels); }); test('job with include_samples filters to only matching samples', () { diff --git a/packages/dataset_config_dart/test/json_parser_test.dart b/packages/dataset_config_dart/test/json_parser_test.dart index a95c994..9583e65 100644 --- a/packages/dataset_config_dart/test/json_parser_test.dart +++ b/packages/dataset_config_dart/test/json_parser_test.dart @@ -14,11 +14,11 @@ void main() { { 'id': 'my_task', 'func': 'question_answer', - 'samples': { + 'dataset': {'samples': { 'inline': [ {'id': 's1', 'input': 'What is Dart?', 'target': 'A language'}, ], - }, + }}, }, ]); @@ -35,7 +35,7 @@ void main() { final tasks = parser.parseTasksFromMaps([ { 'id': 'dart_qa', - 'samples': {'inline': <Map<String, dynamic>>[]}, + 'dataset': {'samples': {'inline': <Map<String, dynamic>>[]}}, }, ]); @@ -47,11 +47,11 @@ () => parser.parseTasksFromMaps([ { 'id': 'bad_task', - 'samples': { + 'dataset': {'samples': { 'inline': [ {'id': 's1', 'input': 'hello'}, // missing 'target' ], - }, + }}, }, ]), throwsA(isA<FormatException>()), @@ -62,7 +62,7 @@ final tasks = parser.parseTasksFromMaps([ { 'id': 'tagged_task', - 'samples': { + 'dataset': {'samples': { 'inline': [ { 'id': 's1', @@ -73,7 +73,7 @@ }, }, ], - }, + }}, }, ]); @@ -85,7 +85,7 @@ final tasks = parser.parseTasksFromMaps([ { 'id': 'tagged_task', - 'samples': { + 'dataset': {'samples': { 'inline': [ { 'id': 's1', @@ -96,7 +96,7 @@ }, }, ], - }, + }}, }, ]); @@ -108,11 +108,11 @@ final tasks = 
parser.parseTasksFromMaps([ { 'id': 'no_tags', - 'samples': { + 'dataset': {'samples': { 'inline': [ {'id': 's1', 'input': 'q', 'target': 'a'}, ], - }, + }}, }, ]); @@ -124,11 +124,11 @@ final tasks = parser.parseTasksFromMaps([ { 'id': 'task', - 'samples': { + 'dataset': {'samples': { 'inline': [ {'id': 's1', 'input': 'q', 'target': 'a'}, ], - }, + }}, }, ]); @@ -140,7 +140,7 @@ final tasks = parser.parseTasksFromMaps([ { 'id': 'task', - 'samples': { + 'dataset': {'samples': { 'inline': [ { 'id': 's1', @@ -151,7 +151,7 @@ 'files': {'main.dart': 'void main() {}'}, }, ], - }, + }}, }, ]); @@ -180,7 +180,7 @@ 'display_name': 'Full Task', 'version': 2, 'metadata': {'author': 'test'}, - 'samples': {'inline': <Map<String, dynamic>>[]}, + 'dataset': {'samples': {'inline': <Map<String, dynamic>>[]}}, }, ]); @@ -203,9 +203,9 @@ final tasks = parser.parseTasksFromMaps([ { 'id': 'task', - 'samples': { + 'dataset': {'samples': { 'inline': [{}], - }, + }}, }, ]); @@ -214,13 +214,11 @@ }); group('parseJobFromMap()', () { - test('parses minimal job with defaults', () { - final job = parser.parseJobFromMap({}); - - expect(job.logDir, ''); - expect(job.maxConnections, 10); - expect(job.models, isNull); - expect(job.saveExamples, false); + test('throws when models is missing', () { + expect( + () => parser.parseJobFromMap({}), + throwsA(isA<FormatException>()), + ); }); test('parses all core fields', () { @@ -242,6 +240,7 @@ test('parses sandbox string shorthand', () { final job = parser.parseJobFromMap({ 'sandbox': 'podman', + 'models': ['test-model'], }); expect(job.sandbox, {'environment': 'podman'}); }); test('parses inspect_eval_arguments', () { final job = parser.parseJobFromMap({ + 'models': ['test-model'], 'inspect_eval_arguments': { 'retry_attempts': 20, 'max_retries': 3, }, }); @@ -278,6 +278,7 @@ test('parses nested overrides in inspect_eval_arguments', () { final job = parser.parseJobFromMap({ + 
'models': ['test-model'], 'inspect_eval_arguments': { 'eval_set_overrides': {'custom_key': 'custom_value'}, 'task_defaults': {'time_limit': 600}, diff --git a/packages/dataset_config_python/src/dataset_config_python/models/dataset.py b/packages/dataset_config_python/src/dataset_config_python/models/dataset.py index b04ceb5..fe363ee 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/dataset.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/dataset.py @@ -2,16 +2,24 @@ from __future__ import annotations +from typing import Any + from pydantic import BaseModel from dataset_config_python.models.sample import Sample class Dataset(BaseModel): - """A named collection of samples.""" + """A named collection of samples, or a reference to a file-backed dataset. + + Supports three dataset formats: + - ``format="memory"`` (default): inline samples via ``samples`` list. + - ``format="json"``: loads via Inspect AI's ``json_dataset(source, **args)``. + - ``format="csv"``: loads via Inspect AI's ``csv_dataset(source, **args)``. 
+ """ samples: list[Sample] = [] - """The sample records in this dataset.""" + """The sample records (only used when format is 'memory').""" name: str = "" """Display name for the dataset.""" @@ -21,3 +29,12 @@ class Dataset(BaseModel): shuffled: bool = False """Whether the dataset was shuffled after reading.""" + + format: str = "memory" + """Dataset format: 'memory' (inline samples), 'json', or 'csv'.""" + + source: str | None = None + """File path or URL for json/csv datasets.""" + + args: dict[str, Any] | None = None + """Extra kwargs passed to json_dataset() or csv_dataset().""" diff --git a/packages/dataset_config_python/src/dataset_config_python/models/job.py b/packages/dataset_config_python/src/dataset_config_python/models/job.py index ee0801a..2049b91 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/job.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/job.py @@ -52,7 +52,7 @@ class Job(BaseModel): description: str | None = None log_dir: str max_connections: int = 10 - models: list[str] | None = None + models: list[str] variants: dict[str, dict[str, Any]] | None = None task_paths: list[str] | None = None tasks: dict[str, JobTask] | None = None diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py index eecd3b4..218b840 100644 --- a/packages/dataset_config_python/src/dataset_config_python/parser.py +++ b/packages/dataset_config_python/src/dataset_config_python/parser.py @@ -58,6 +58,10 @@ def __init__( # Task-level files and setup task_files: dict[str, str] | None = None, task_setup: str | None = None, + # Dataset format metadata + dataset_format: str = "memory", + dataset_source: str | None = None, + dataset_args: dict[str, Any] | None = None, ): self.id = id self.func = func @@ -87,6 +91,9 @@ def __init__( self.sandbox_parameters = sandbox_parameters self.task_files = task_files self.task_setup = task_setup + 
self.dataset_format = dataset_format + self.dataset_source = dataset_source + self.dataset_args = dataset_args _UNSET: Any = object() @@ -153,6 +160,9 @@ def copy_with( display_name=self.display_name if display_name is _U else display_name, version=self.version if version is _U else version, metadata=self.metadata if metadata is _U else metadata, + dataset_format=self.dataset_format, + dataset_source=self.dataset_source, + dataset_args=self.dataset_args, ) @@ -238,17 +248,51 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: else: task_setup = None - # Parse samples section (optional) - samples_raw = data.get("samples") - if samples_raw is None: - samples: list[Sample] = [] - elif not isinstance(samples_raw, dict): - raise ValueError( - f"Task '{task_id}': 'samples' must be a dict with 'inline' and/or " - f"'paths' keys, got {type(samples_raw).__name__}" - ) - else: - samples = _load_samples_section(samples_raw, dataset_root, task_files, task_dir) + # Parse dataset section (replaces the old top-level 'samples' key) + dataset_raw = data.get("dataset") + samples: list[Sample] = [] + dataset_format = "memory" + dataset_source: str | None = None + dataset_args: dict[str, Any] | None = None + + if dataset_raw is not None: + if not isinstance(dataset_raw, dict): + raise ValueError( + f"Task '{task_id}': 'dataset' must be a dict with one of " + f"'samples', 'json', or 'csv' keys, got {type(dataset_raw).__name__}" + ) + + # Check for mutually exclusive format keys + format_keys = {'samples', 'json', 'csv'} + present_keys = format_keys & set(dataset_raw.keys()) + if len(present_keys) > 1: + raise ValueError( + f"Task '{task_id}': 'dataset' must have exactly one of " + f"'samples', 'json', or 'csv', found: {present_keys}" + ) + + dataset_args = dataset_raw.get("args") + if dataset_args is not None and not isinstance(dataset_args, dict): + raise ValueError( + f"Task '{task_id}': 'dataset.args' must be a dict, " + f"got 
{type(dataset_args).__name__}" + ) + + if "samples" in dataset_raw: + # Inline/path-based samples (existing MemoryDataset behavior) + samples_section = dataset_raw["samples"] + if not isinstance(samples_section, dict): + raise ValueError( + f"Task '{task_id}': 'dataset.samples' must be a dict with " + f"'inline' and/or 'paths' keys, got {type(samples_section).__name__}" + ) + samples = _load_samples_section(samples_section, dataset_root, task_files, task_dir) + elif "json" in dataset_raw: + dataset_format = "json" + dataset_source = str(dataset_raw["json"]) + elif "csv" in dataset_raw: + dataset_format = "csv" + dataset_source = str(dataset_raw["csv"]) # Task-level Inspect AI args are nested under inspect_task_args task_args = data.get("inspect_task_args") or {} @@ -280,6 +324,9 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: sandbox_parameters=data.get("sandbox_parameters") if isinstance(data.get("sandbox_parameters"), dict) else None, task_files=task_files, task_setup=task_setup, + dataset_format=dataset_format, + dataset_source=dataset_source, + dataset_args=dataset_args, ) ] @@ -475,10 +522,20 @@ def parse_job(job_path: str, dataset_root: str) -> Job: else: inspect_eval_arguments = None + # Parse models (required) + models_raw = data.get("models") + if not models_raw or not isinstance(models_raw, list) or len(models_raw) == 0: + raise ValueError( + f"Job file '{job_path}' is missing required 'models' field. 
" + "Specify at least one model, e.g.:\n" + " models:\n - google/gemini-2.5-flash" + ) + models: list[str] = [str(m) for m in models_raw] + return Job( log_dir=log_dir, max_connections=max_connections, - models=data.get("models"), + models=models, variants=variants, task_paths=task_paths, tasks=tasks, diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 18b1e91..21469c3 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -18,11 +18,6 @@ from dataset_config_python.models.variant import Variant from dataset_config_python.parser import ParsedTask, find_job_file, parse_job, parse_tasks -# Default models when a job doesn't specify its own. -DEFAULT_MODELS: list[str] = [ - "google/gemini-2.5-flash", - "google/gemini-3-flash-preview", -] # Default sandbox configurations for Flutter evaluations. # Consumers can pass these to resolve() or provide their own. @@ -123,7 +118,13 @@ def _resolve_job( sandbox_registry: dict[str, dict[str, str]], ) -> list[EvalSet]: """Resolve task configs and job into EvalSet objects.""" - models = job.models if job.models else list(DEFAULT_MODELS) + if not job.models: + raise ValueError( + "job.models is required and must contain at least one model. " + "Specify models in your job YAML, e.g.:\n" + " models:\n - google/gemini-2.5-flash" + ) + models = job.models sandbox_cfg = job.sandbox or {} sandbox_type_str = sandbox_cfg.get("environment", "local") @@ -196,6 +197,9 @@ def _build_eval_set( dataset = Dataset( samples=inspect_samples, name=f"{tc.id}:{tc.variant.name}", + format=tc.dataset_format, + source=tc.dataset_source, + args=tc.dataset_args, ) # Task metadata (variant config, system message, etc.) 
@@ -361,15 +365,6 @@ def _get(key: str, default: Any = None) -> Any: ) -# --------------------------------------------------------------------------- -# Model resolution -# --------------------------------------------------------------------------- - - -def _resolve_models(job: Any) -> list[str]: - if job.models: - return job.models - return list(DEFAULT_MODELS) # --------------------------------------------------------------------------- diff --git a/packages/dataset_config_python/tests/test_config.py b/packages/dataset_config_python/tests/test_config.py index 3d79e90..865b7bd 100644 --- a/packages/dataset_config_python/tests/test_config.py +++ b/packages/dataset_config_python/tests/test_config.py @@ -32,24 +32,25 @@ def dataset_dir(tmp_path): task_dir.mkdir(parents=True) task_yaml = task_dir / "task.yaml" task_yaml.write_text( - """ + """\ id: dart_qa func: question_answer system_message: "You are an expert." -samples: - inline: - - id: sample_1 - input: "What is Dart?" - target: "A programming language." - difficulty: easy - - id: sample_2 - input: "What is Flutter?" - target: "A UI framework." - metadata: - difficulty: medium - tags: - - ui - - framework +dataset: + samples: + inline: + - id: sample_1 + input: "What is Dart?" + target: "A programming language." + difficulty: easy + - id: sample_2 + input: "What is Flutter?" + target: "A UI framework." + metadata: + difficulty: medium + tags: + - ui + - framework """ ) @@ -62,11 +63,12 @@ def dataset_dir(tmp_path): func: flutter_code_gen inspect_task_args: time_limit: 600 -samples: - inline: - - id: sample_1 - input: "Create a counter app." - target: "A working counter app." +dataset: + samples: + inline: + - id: sample_1 + input: "Create a counter app." + target: "A working counter app." 
""" ) @@ -116,9 +118,10 @@ def dataset_dir_with_sample_files(tmp_path): """ id: qa func: question_answer -samples: - paths: - - samples/basics.yaml +dataset: + samples: + paths: + - samples/basics.yaml """ ) @@ -127,6 +130,8 @@ def dataset_dir_with_sample_files(tmp_path): (jobs_dir / "default.yaml").write_text( """ logs_dir: ./logs +models: + - test/model """ ) @@ -247,6 +252,80 @@ def test_parse_tasks_empty_dir(self, tmp_path): tasks = parse_tasks(str(tmp_path)) assert tasks == [] + def test_parse_task_json_dataset(self, tmp_path): + """Test parsing a task with a json dataset format.""" + task_dir = tmp_path / "tasks" / "json_ds" + task_dir.mkdir(parents=True) + (task_dir / "task.yaml").write_text( + """\ +id: json_ds +func: question_answer +dataset: + json: gs://bucket/data.jsonl + args: + auto_id: true + shuffle: true +""" + ) + tasks = parse_tasks(str(tmp_path)) + assert len(tasks) == 1 + assert tasks[0].dataset_format == "json" + assert tasks[0].dataset_source == "gs://bucket/data.jsonl" + assert tasks[0].dataset_args == {"auto_id": True, "shuffle": True} + assert tasks[0].samples == [] + + def test_parse_task_csv_dataset(self, tmp_path): + """Test parsing a task with a csv dataset format.""" + task_dir = tmp_path / "tasks" / "csv_ds" + task_dir.mkdir(parents=True) + (task_dir / "task.yaml").write_text( + """\ +id: csv_ds +func: question_answer +dataset: + csv: ./data.csv + args: + delimiter: "\\t" +""" + ) + tasks = parse_tasks(str(tmp_path)) + assert len(tasks) == 1 + assert tasks[0].dataset_format == "csv" + assert tasks[0].dataset_source == "./data.csv" + + def test_parse_task_mutually_exclusive_dataset_keys(self, tmp_path): + """Test that specifying both json and csv in dataset raises error.""" + task_dir = tmp_path / "tasks" / "bad_ds" + task_dir.mkdir(parents=True) + (task_dir / "task.yaml").write_text( + """\ +id: bad_ds +func: question_answer +dataset: + json: ./data.jsonl + csv: ./data.csv +""" + ) + with pytest.raises(ValueError, match="exactly 
one"): + parse_tasks(str(tmp_path)) + + def test_parse_job_missing_models(self, tmp_path): + """Test that a job without models raises a validation error.""" + jobs_dir = tmp_path / "jobs" + jobs_dir.mkdir() + (jobs_dir / "bad.yaml").write_text( + """\ +logs_dir: ./logs +""" + ) + job_path = str(jobs_dir / "bad.yaml") + with pytest.raises(ValueError, match="models"): + parse_job(job_path, str(tmp_path)) + + +# Runner integration tests for json/csv datasets are in: +# packages/dash_evals/tests/test_json_runner.py + # --------------------------------------------------------------------------- # Resolver tests diff --git a/packages/devals_cli/lib/src/commands/create_job_command.dart b/packages/devals_cli/lib/src/commands/create_job_command.dart index ba4dc03..f9acf7b 100644 --- a/packages/devals_cli/lib/src/commands/create_job_command.dart +++ b/packages/devals_cli/lib/src/commands/create_job_command.dart @@ -1,5 +1,5 @@ import 'package:args/command_runner.dart'; -import 'package:dataset_config_dart/dataset_config_dart.dart'; + import 'package:devals/src/dataset/dataset_reader.dart'; import 'package:devals/src/dataset/eval_writer.dart'; import 'package:devals/src/dataset/file_templates/job_template.dart'; @@ -19,7 +19,14 @@ class CreateJobCommand extends Command { terminal.writeln(); // Get available options from the generated registries and filesystem - final models = List.of(kDefaultModels); + // Suggested models for model selection prompt + final models = [ + 'google/gemini-2.5-flash', + 'google/gemini-3-flash-preview', + 'google/gemini-3-pro-preview', + 'anthropic/claude-sonnet-4-5', + 'openai/gpt-5-mini', + ]; final variants = datasetReader.getVariants(); final tasks = datasetReader.getTasks(); @@ -65,9 +72,14 @@ class CreateJobCommand extends Command { 'Select models', help: 'Tasks will run against each of these', options: models.map((m) => Option(label: m, value: m)).toList(), - key: 'models', + validator: (List? 
selection) { + if (selection == null || selection.isEmpty) { + return 'You must select at least one model.'; + } + return null; + }, defaultValue: models - .where((String name) => name.contains('gemini')) + .where((name) => name.contains('gemini')) .toList(), ), Multiselect( diff --git a/packages/devals_cli/lib/src/commands/create_pipeline_command.dart b/packages/devals_cli/lib/src/commands/create_pipeline_command.dart index 22b1e61..17942da 100644 --- a/packages/devals_cli/lib/src/commands/create_pipeline_command.dart +++ b/packages/devals_cli/lib/src/commands/create_pipeline_command.dart @@ -1,7 +1,7 @@ import 'dart:io'; import 'package:args/command_runner.dart'; -import 'package:dataset_config_dart/dataset_config_dart.dart'; + import 'package:devals/src/cli_exception.dart'; import 'package:devals/src/dataset/eval_writer.dart'; import 'package:devals/src/dataset/file_templates/job_template.dart'; @@ -34,7 +34,13 @@ class CreatePipelineCommand extends Command { } final availableVariants = datasetReader.getVariants(); - final models = List.of(kDefaultModels); + final models = [ + 'google/gemini-2.5-flash', + 'google/gemini-3-flash-preview', + 'google/gemini-3-pro-preview', + 'anthropic/claude-sonnet-4-5', + 'openai/gpt-5-mini', + ]; if (models.isEmpty) { throw CliException( 'No models configured.', @@ -212,7 +218,7 @@ class CreatePipelineCommand extends Command { 'Models', help: 'Choose which models to evaluate. 
You need API keys for each provider.', - options: models.map((m) => Option(label: m, value: m)).toList(), + options: models.map((m) => Option(label: m, value: m)).toList(), defaultValue: [defaultModel], key: 'models', ), diff --git a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart index dd6a75a..1621d35 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart @@ -17,19 +17,20 @@ files: /workspace: ../../ setup: "cd /workspace && flutter pub get" -samples: - inline: - - id: get_started - difficulty: easy - tags: [] - # Input: The prompt given to the model - input: | - Explore this codebase and suggest one improvement - to the code quality, readability, or architecture. - # Target: Expected output or grading criteria - target: | - The suggestion should be specific, actionable, and reference - actual code in the project. It should explain why the change - improves the codebase. +dataset: + samples: + inline: + - id: get_started + difficulty: easy + tags: [] + # Input: The prompt given to the model + input: | + Explore this codebase and suggest one improvement + to the code quality, readability, or architecture. + # Target: Expected output or grading criteria + target: | + The suggestion should be specific, actionable, and reference + actual code in the project. It should explain why the change + improves the codebase. 
'''; } diff --git a/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart index c3370c9..16e6a75 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart @@ -30,14 +30,15 @@ String taskTemplate({ # See docs/configuration_reference.md for full schema reference. func: $taskFunc $variantsLine$systemMessageBlock$filesSection -samples: - inline: - - id: sample_1 - difficulty: medium - input: | - # Write prompt here - target: | - # Write target here +dataset: + samples: + inline: + - id: sample_1 + difficulty: medium + input: | + # Write prompt here + target: | + # Write target here '''; } From 3cf7fb6a3e4bc0fb4083974961838a2dcd79d450 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 20 Mar 2026 13:01:18 -0700 Subject: [PATCH 20/21] dart format --- packages/dataset_config_dart/lib/src/parsed_task.dart | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/packages/dataset_config_dart/lib/src/parsed_task.dart b/packages/dataset_config_dart/lib/src/parsed_task.dart index 465d9cb..40fa790 100644 --- a/packages/dataset_config_dart/lib/src/parsed_task.dart +++ b/packages/dataset_config_dart/lib/src/parsed_task.dart @@ -188,9 +188,9 @@ class ParsedTask { displayName: displayName ?? this.displayName, version: version ?? this.version, metadata: metadata ?? 
this.metadata, - datasetFormat: this.datasetFormat, - datasetSource: this.datasetSource, - datasetArgs: this.datasetArgs, + datasetFormat: datasetFormat, + datasetSource: datasetSource, + datasetArgs: datasetArgs, ); } } From 58a21c337e4ae211b082fc03234e7212e9f51fc7 Mon Sep 17 00:00:00 2001 From: Eric Windmill Date: Fri, 20 Mar 2026 13:17:03 -0700 Subject: [PATCH 21/21] remove old meta doc --- IMPLEMENTATION_PLAN.md | 315 ----------------------------------------- 1 file changed, 315 deletions(-) delete mode 100644 IMPLEMENTATION_PLAN.md diff --git a/IMPLEMENTATION_PLAN.md b/IMPLEMENTATION_PLAN.md deleted file mode 100644 index 74441ea..0000000 --- a/IMPLEMENTATION_PLAN.md +++ /dev/null @@ -1,315 +0,0 @@ -# Config Improvements — Implementation Plan - -This document details the implementation steps for all decided config improvements. Each section includes the specific files to modify in both Dart and Python packages, what to change, and relevant context. - -> **Branch:** `yardstick-config-updates` -> **Related docs:** `CHANGELOG.md`, `docs/reference/yaml_config.md` -> **Design analysis:** The original design doc (`config_improvements.md`) has been deleted. The finalized decisions are captured in `CHANGELOG.md`. - ---- - -## Table of Contents - -1. [Model Changes](#1-model-changes) -2. [Parser/Resolver Changes](#2-parserresolver-changes) -3. [Tag-Based Filtering](#3-tag-based-filtering) -4. [File Index](#4-file-index) -5. [Verification](#5-verification) - ---- - -## 1. Model Changes - -### 1.1 Add `description` to Job - -Simple optional string field. - -**Dart** — `packages/dataset_config_dart/lib/src/models/job.dart` -```dart -String? 
description, // Add to Job freezed class -``` - -**Python** — `packages/dataset_config_python/src/dataset_config_python/models/job.py` -```python -description: str | None = None -``` - -**Parser** — `packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart` -```dart -final description = data['description'] as String?; -// Pass to Job constructor -``` - ---- - -### 1.2 Add `image_prefix` to Job - -Registry URL prefix prepended to image names during sandbox resolution (e.g. `us-central1-docker.pkg.dev/project/repo/`). - -**Dart** — `packages/dataset_config_dart/lib/src/models/job.dart` -```dart -String? imagePrefix, -``` - -**Python** — `packages/dataset_config_python/src/dataset_config_python/models/job.py` -```python -image_prefix: str | None = None -``` - -**Parser** — read `image_prefix` from YAML, pass to Job. - -**Resolver** — `packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart` -- In `_resolveSandbox()`, prepend `job.imagePrefix` to image names when constructing sandbox specs. - ---- - -### 1.3 Add `args` to JobTask - -Per-task argument overrides passed to the task function. - -**Dart** — `packages/dataset_config_dart/lib/src/models/job.dart` (on `JobTask` class) -```dart -@JsonKey(name: 'args') Map? args, -``` - -**Python** — `packages/dataset_config_python/src/dataset_config_python/models/job.py` (on `JobTask` class) -```python -args: dict[str, Any] | None = None -``` - -**Parser** — In `JobTask.fromYaml()` (both Dart and Python), read `args` from the per-task map. - ---- - -### 1.4 Add `system_message` to Task model - -Currently exists on `ParsedTask` but not the output `Task` model. Promote it. - -**Dart** — `packages/dataset_config_dart/lib/src/models/task.dart` -```dart -@JsonKey(name: 'system_message') String? 
systemMessage, -``` - -**Python** — `packages/dataset_config_python/src/dataset_config_python/models/task.py` -```python -system_message: str | None = None -``` - -**Resolver** — `eval_set_resolver.dart` already puts `system_message` into Task metadata. After this change, set it as a first-class field on the Task object instead. - ---- - -### 1.5 Add `sandbox_parameters` to Task - -Pass-through dict for sandbox plugin configuration. - -**Dart** — `packages/dataset_config_dart/lib/src/models/task.dart` -```dart -@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, -``` - -**Python** — `packages/dataset_config_python/src/dataset_config_python/models/task.py` -```python -sandbox_parameters: dict[str, Any] | None = None -``` - -**Parser** — read `sandbox_parameters` from task.yaml. - ---- - -### 1.6 Rename `task_func` → `func` - -The YAML parser already aliases `func` → `task_func`. This renames the model field to match. - -**Dart** — `packages/dataset_config_dart/lib/src/models/task.dart` -- Rename `taskFunc` → `func` -- Update `@JsonKey(name: 'task_func')` → `@JsonKey(name: 'func')` -- Regenerate `.freezed.dart` / `.g.dart` - -**Python** — `packages/dataset_config_python/src/dataset_config_python/models/task.py` -- Rename `task_func` → `func` - -**Other files to update:** -- `packages/dataset_config_dart/lib/src/parsed_task.dart` — `taskFunc` field and `copyWith` -- `packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart` — variable names referencing `taskFunc` -- `packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart` — `tc.taskFunc` -- `packages/devals_cli/lib/src/dataset/dry_run.dart` — references `task_func` -- `packages/dash_evals/src/dash_evals/runner/json_runner.py` — `task_def.get("task_func")` -- `packages/dataset_config_python/tests/test_config.py` — Task construction with `task_func=` -- `tool/config_parity/` — both `resolve_dart.dart` and `resolve_python.py` - ---- - -## 2. 
Parser/Resolver Changes - -### 2.1 Support `module:task` syntax - -Task function references can use `module.path:function_name` format. - -**Python** — `packages/dash_evals/src/dash_evals/runner/json_runner.py` -- Update `_resolve_task_func()` to split on `:` and import the module, then get the function by attribute name. - -**Dart parser** — `yaml_parser.dart` L53 already reads `func` as a string. No Dart change needed — the module resolution happens in the Python runner. - ---- - -### 2.2 Make sandbox registry configurable - -The hardcoded `kSandboxRegistry` and `kSdkChannels` in `eval_set_resolver.dart` (lines 25-42) need to become data-driven. - -**Approach:** -1. Move `kSandboxRegistry` and `kSdkChannels` out of the resolver -2. Add an optional `sandbox_registry` parameter to `EvalSetResolver.resolve()`, or make it a field on the resolver -3. The consuming project (dash_evals CLI) passes its sandbox registry when calling the resolver -4. Default to an empty registry if none provided (no sandbox resolution) - -**Files:** -- `packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart` — extract constants, add parameter -- `packages/devals_cli/` — pass the Flutter-specific registry when calling the resolver -- Python resolver (`packages/dataset_config_python/src/dataset_config_python/resolver.py`) — mirror the same approach - ---- - -### 2.3 Workspace: use native Inspect fields - -The `workspace` YAML key stays as parser sugar but resolves into Inspect's native `Sample.files` and `Sample.setup`. - -**Current behavior** (`eval_set_resolver.dart` L132-141): -```dart -if (workspace != null && isContainer) { - files = {...?files, '/workspace': workspace}; - setup = setup ?? 'cd /workspace && flutter pub get'; - enriched['workspace'] = '/workspace'; -} -``` - -**Change:** -- Make the auto-generated `setup` command configurable. Options: - - Add a `workspace_setup` field to Task YAML (e.g. 
`workspace_setup: "cd /workspace && npm install"`) - - Or: only auto-generate setup for tasks that have a Flutter-specific tag/metadata - - Or: remove auto-generation entirely; require the task author to specify `setup` if needed -- The resolver should still map `workspace` → `Sample.files['/workspace']`, but not assume Flutter. - -**Files:** -- `packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart` — update workspace → files mapping -- `packages/dataset_config_python/src/dataset_config_python/resolver.py` — mirror - ---- - -## 3. Tag-Based Filtering - -### 3.1 New `TagFilter` model - -**Dart** — new file `packages/dataset_config_dart/lib/src/models/tag_filter.dart` -```dart -@freezed -sealed class TagFilter with _$TagFilter { - const factory TagFilter({ - @JsonKey(name: 'include_tags') List? includeTags, - @JsonKey(name: 'exclude_tags') List? excludeTags, - }) = _TagFilter; - - factory TagFilter.fromJson(Map json) => - _$TagFilterFromJson(json); -} -``` - -**Python** — new file or add to `packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py` -```python -class TagFilter(BaseModel): - include_tags: list[str] | None = None - exclude_tags: list[str] | None = None -``` - -**Shared matching function** (add to both languages): -```python -def matches_filter(item_tags: list[str], filter: TagFilter) -> bool: - if filter.include_tags and not all(t in item_tags for t in filter.include_tags): - return False - if filter.exclude_tags and any(t in item_tags for t in filter.exclude_tags): - return False - return True -``` - -### 3.2 Add filters to Job and Task - -**Job model:** -- `taskFilters: TagFilter?` / `task_filters: TagFilter | None` -- `sampleFilters: TagFilter?` / `sample_filters: TagFilter | None` - -**Task YAML (parser-level, not model):** -- `variant_filters: TagFilter?` — parsed from task.yaml, stored on `ParsedTask` - -### 3.3 Apply filters in resolver - -In `_expandTaskConfigs()` (`eval_set_resolver.dart` L418-493), add 
filtering steps: - -1. **Task filtering** (after L431): if `job.taskFilters` is set, check `taskConfig.metadata['tags']` against the filter -2. **Sample filtering** (after L460): if `job.sampleFilters` is set, filter samples by `sample.metadata['tags']` -3. **Variant filtering** (after L440): if `taskConfig.variantFilters` is set, check variant metadata tags - -These run alongside (not replacing) the existing ID-based filters. - ---- - -## 4. File Index - -All files that need modification, grouped by package: - -### `dataset_config_dart` -| File | Changes | -|---|---| -| `lib/src/models/job.dart` | Add `description`, `imagePrefix`, `taskFilters`, `sampleFilters` | -| `lib/src/models/job.dart` (JobTask) | Add `args` | -| `lib/src/models/task.dart` | Rename `taskFunc` → `func`, add `systemMessage`, `sandboxParameters` | -| `lib/src/models/tag_filter.dart` | **New file** — `TagFilter` model | -| `lib/src/models/models.dart` | Export `tag_filter.dart` | -| `lib/src/parsed_task.dart` | Rename `taskFunc` → `func`, add `variantFilters` | -| `lib/src/parsers/yaml_parser.dart` | Read new fields from YAML | -| `lib/src/resolvers/eval_set_resolver.dart` | Configurable sandbox registry, tag filtering, workspace setup | -| `test/` | Update tests for renamed fields and new features | - -### `dataset_config_python` -| File | Changes | -|---|---| -| `models/job.py` | Add `description`, `image_prefix`, `task_filters`, `sample_filters` | -| `models/job.py` (JobTask) | Add `args` | -| `models/task.py` | Rename `task_func` → `func`, add `system_message`, `sandbox_parameters` | -| `models/tag_filter.py` | **New file** — `TagFilter` model | -| `models/__init__.py` | Export `TagFilter` | -| `parser.py` | Read new fields from YAML | -| `resolver.py` | Configurable sandbox registry, tag filtering, workspace setup | -| `tests/test_config.py` | Update tests | - -### `dash_evals` (Python runner) -| File | Changes | -|---|---| -| `runner/json_runner.py` | `task_func` → `func`, `module:task` 
syntax support | - -### `devals_cli` (Dart CLI) -| File | Changes | -|---|---| -| `lib/src/dataset/dry_run.dart` | `task_func` → `func` references | - -### Other -| File | Changes | -|---|---| -| `tool/config_parity/` | Update both resolve scripts for renamed fields | -| `docs/reference/yaml_config.md` | Already updated | -| `CHANGELOG.md` | Already updated | -| `docs/guides/config.md` | Update after implementation | - ---- - -## 5. Verification - -### Automated -- Run `dart test` in `dataset_config_dart` -- Run `pytest` in `dataset_config_python` -- Run `tool/config_parity` to verify Dart/Python output parity -- Run `dart analyze` across workspace - -### Manual -- Verify `make html` in `docs/` builds without new errors -- Verify a sample job YAML with the new fields parses correctly -- Verify tag filtering produces expected task/sample subsets
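Editor's note appended after the patch series: section 2.1 of the deleted plan says the Python runner's `_resolve_task_func()` should split a `module.path:function_name` reference on `:`, import the module, and fetch the function by attribute name. A minimal sketch of that behavior follows; `TASK_REGISTRY` is a hypothetical stand-in for the runner's bare-name lookup, not the actual `dash_evals` API.

```python
import importlib


# Hypothetical registry for bare-name task functions; the real runner
# resolves these from its generated registries.
TASK_REGISTRY = {"question_answer": lambda: "qa"}


def resolve_task_func(ref: str):
    """Resolve a task-function reference.

    Accepts either a bare registry name ("question_answer") or the
    "module.path:function_name" form described in plan section 2.1.
    """
    if ":" in ref:
        module_path, func_name = ref.split(":", 1)
        module = importlib.import_module(module_path)
        return getattr(module, func_name)
    return TASK_REGISTRY[ref]
```

For example, `resolve_task_func("os.path:join")` imports `os.path` and returns its `join` function, while a bare name falls through to the registry lookup.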