# Yardstick config updates #14

Merged · 21 commits
Commits (all by ericwindmill):

- `4e75faf` feat: Introduce a dedicated `yaml_config.md` for detailed configurati…
- `3cce708` updates in flight
- `b9af6a2` rename func
- `28fba88` adds task level fields and updates parser
- `fe24d91` feat: allow configurable sandbox and SDK channel mappings in dataset …
- `2eb1104` feat: Introduce tag-based filtering, refined task function references…
- `757acb7` feat: Add variant filtering and propagate image prefix and job task a…
- `9c52fd1` feat: Generalize SDK channel to 'branch', consolidate sandbox configu…
- `254f6a1` feat: Introduce `resolve_from_parsed` for explicit configuration reso…
- `4ce2824` refactor: Consolidate sandbox configuration and Inspect AI eval_set a…
- `147319d` feat: Refactor variant configuration to use explicit include/exclude …
- `9d522c1` feat: Replace task and sample `workspace` and `tests` fields with `fi…
- `87e053e` docs: simplify `inspect_task_args` documentation by replacing a detai…
- `fc23421` docs: Update changelog to detail new job and task configuration optio…
- `be6c4c5` refactor: adjust config parsing for `mcp_servers` string shorthand, r…
- `506b9f4` address code review comment
- `f9c4273` docs: Overhaul and reorganize documentation guides, replacing quick s…
- `868cce3` feat: Add HTTP transport support for MCP servers, update configuratio…
- `03926f5` feat: Introduce flexible dataset configuration supporting inline, JSO…
- `3cf7fb6` dart format
- `58a21c3` remove old meta doc
# Changelog

## Unreleased

### New

- **`Job.description`.** Optional human-readable description field on `Job`.

- **`Job.imagePrefix` / `Job.image_prefix`.** Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files.

- **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at two levels:
  - `Job.taskFilters` / `Job.task_filters` — select tasks by metadata tags
  - `Job.sampleFilters` / `Job.sample_filters` — select samples by metadata tags
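As a sketch, a job might combine both filter levels like this (the tag values here are made up for illustration):

```yaml
# job.yaml (hypothetical tags)
task_filters:
  include_tags: [agentic]
  exclude_tags: [slow]
sample_filters:
  include_tags: [smoke]
```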
- **`JobTask.args`.** Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks.

- **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level.

- **`Task.sandboxParameters` / `Task.sandbox_parameters`.** Pass-through dictionary for sandbox plugin configuration.

- **`Task.files` / `Task.setup`.** Task-level file and setup declarations. Task-level `files` stack with sample-level `files` (sample wins on key conflict). Sample-level `setup` overrides task-level `setup`.
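The stacking rule amounts to a dict merge plus an override. A minimal sketch (not the actual resolver code; file paths are illustrative):

```python
def merge_files(task_files: dict, sample_files: dict) -> dict:
    # Task-level files form the base; sample-level entries win on key conflicts.
    return {**task_files, **sample_files}

def resolve_setup(task_setup, sample_setup):
    # Sample-level setup, when present, replaces task-level setup entirely.
    return sample_setup if sample_setup is not None else task_setup

merged = merge_files(
    {"lib/a.dart": "./task/a.dart", "test/t.dart": "./task/t.dart"},
    {"test/t.dart": "./sample/t.dart"},
)
# merged == {"lib/a.dart": "./task/a.dart", "test/t.dart": "./sample/t.dart"}
```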
- **Variant `task_parameters`.** Variants can now declare `task_parameters`, an arbitrary dict merged into the task config at runtime.

- **`module:task` syntax.** Task function references can now use `module.path:function_name` format for Python tasks.

### Breaking Changes

- **`Task.taskFunc` → `Task.func`.** Renamed model field to match the YAML key name. JSON serialization key changes from `"task_func"` to `"func"`. Both Dart and Python packages must update in lockstep.

- **Sandbox registry is now configurable.** The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations.

- **Removed `workspace` and `tests` from task and sample YAML.** Replaced by `files` (a `{destination: source}` map) and `setup` (a shell command string). These are Inspect AI's native `Sample` fields. The old `workspace:` / `tests:` keys and their path/git/template sub-formats are no longer supported.

- **Consolidated sandbox config.** `Job.sandboxEnvironment`, `Job.sandboxParameters`, and `Job.imagePrefix` collapsed into a single `Job.sandbox` map (keys: `environment`, `parameters`, `image_prefix`).
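In YAML, the consolidated shape looks something like this (the three keys come from the change above; the values are illustrative):

```yaml
# job.yaml
sandbox:
  environment: docker
  parameters:
    memory: 4Gi
  image_prefix: us-docker.pkg.dev/my-project/evals
```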
- **Consolidated Inspect AI eval arguments.** Individual top-level Job fields (`retryAttempts`, `failOnError`, `logLevel`, `maxTasks`, etc.) collapsed into a single `Job.inspectEvalArguments` / `Job.inspect_eval_arguments` pass-through dict.

- **`inspect_task_args` is now a pass-through dict.** Individual sub-fields (`model`, `epochs`, `time_limit`, etc.) are no longer typed on the `Task` model. The entire `inspect_task_args` section is passed through as-is to Inspect AI's `Task()` constructor.
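For example, a task can now forward arbitrary constructor arguments (field names are from the change above; the values are illustrative):

```yaml
# task.yaml
inspect_task_args:
  epochs: 3
  time_limit: 600
```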
- **Removed `JobTask.systemMessage`.** System message is now set at the task level via `Task.systemMessage`.

- **Variant field renames.** `context_files` → `files`, `skill_paths` → `skills`. Variant-level task restriction uses `include-variants` / `exclude-variants` on the job's `tasks.<id>` object instead of task-level `allowed_variants`.

### Documentation

- Added `docs/reference/yaml_config.md` with complete field-by-field reference tables.
- Updated `docs/reference/configuration_reference.md` with new examples and directory structure.
- Updated `docs/guides/config.md`.

## 11 March, 2025

### New

- **`dataset_config_python` package.** Python port of the Dart config package (`dataset_config_dart`), providing full parity for YAML parsing, resolution, and JSON output. Includes Pydantic models for `Job`, `Task`, `Sample`, `EvalSet`, `Variant`, `Dataset`, and `ContextFile`. Exposes `resolve()` and `write_eval_sets()` as the public API. No Dart SDK or Inspect AI dependency is required, so the package can be installed standalone by any team that needs to parse eval config YAML.

### Breaking Changes

- **Renamed `dataset_config` → `dataset_config_dart`.** The Dart config package was renamed for clarity alongside the new Python package.

- **Renamed `dash_evals_config` → `dataset_config_python`.** The Python config package was renamed for consistency with the Dart package.

## 28 February, 2025

### New

- **`eval_config` Dart package.** New package with a layered Parser → Resolver → Writer architecture that converts dataset YAML into EvalSet JSON for the Python runner. Provides a `ConfigResolver` facade plus direct access to `YamlParser`, `JsonParser`, `EvalSetResolver`, and `EvalSetWriter`.

- **Dual-mode eval runner.** The Python runner now supports two invocation modes:
  - `run-evals --json ./eval_set.json` — consume a JSON manifest produced by the Dart CLI
  - `run-evals --task <name> --model <model>` — run a single task directly from CLI arguments

- **Generalized task functions.** Task implementations are now language-agnostic by default. Flutter-specific tasks (`flutter_bug_fix`, `flutter_code_gen`) are thin wrappers around the generic `bug_fix` and `code_gen` tasks. New tasks: `analyze_codebase`, `mcp_tool`, `skill_test`.

- **New Dart domain models.** `EvalSet`, `Task`, `Sample`, `Variant`, and `TaskInfo` models in the `models` package map directly to the Inspect AI evaluation structure.

### Breaking Changes

- **Removed Python `registries.py`.** Task/model/sandbox registries are removed. Task functions are now discovered dynamically via `importlib` (short names like `"flutter_code_gen"` resolve automatically).

- **Removed `TaskConfig` and `SampleConfig`.** Replaced by `ParsedTask` (an intermediate parsing type in `eval_config`) and `Sample` (the Inspect AI domain model).

- **Removed legacy Python config parsing.** The `config/parsers/` directory, the `load_yaml` utility, and associated model definitions have been removed from `eval_runner`. Configuration is now handled by the Dart `eval_config` package.

- **Models package reorganized.** Report-app models (used by the Flutter results viewer) moved to `models/lib/src/report_app/`. The top-level `models/lib/src/` now contains inspect-domain models.

- **Dataset utilities moved.** `DatasetReader`, `filesystem_utils`, and discovery helpers moved from `eval_config` to `eval_cli`.

## 25 February, 2025

### Breaking Changes

- **Variant format changed from list to named map.** Job YAML files now define variants as a named map instead of a list. Tasks can optionally restrict applicable variants via `allowed_variants` in their `task.yaml`.

  **Before (list format):**

  ```yaml
  variants:
    - baseline
    - { mcp_servers: [dart] }
  ```

  **After (named map format):**

  ```yaml
  # job.yaml
  variants:
    baseline: {}
    mcp_only: { mcp_servers: [dart] }
    context_only: { context_files: [./context_files/flutter.md] }
    full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] }
  ```

  ```yaml
  # task.yaml (optional — omit to accept all job variants)
  allowed_variants: [baseline, mcp_only]
  ```

- **Removed `DEFAULT_VARIANTS` registry.** Variants are no longer defined globally in `registries.py`. Each job file defines its own variants.

- **Removed `variants` from `JobTask`.** Per-task variant overrides (`job.tasks.<id>.variants`) are replaced by task-level `allowed_variants` whitelists.
# About the framework

You've been using built-in task functions like `bug_fix` and `question_answer`.
This page explains how they work — useful if you want to write custom eval logic
or just understand what happens when you run `devals run`.

---

## Architecture overview

When you run an eval, data flows through four layers:

```
YAML config → Dart resolver → JSON manifest → Python runner → Inspect AI
```

| Layer | Package | What it does |
|-------|---------|-------------|
| **YAML config** | — | Your `task.yaml` and `job.yaml` files |
| **Dart resolver** | `dataset_config_dart` | Parses YAML, resolves globs and references, produces a JSON manifest |
| **Python runner** | `dash_evals` | Reads the manifest, builds Inspect AI `Task` objects, calls `eval_set()` |
| **Inspect AI** | `inspect_ai` | Runs solver chains, sends prompts, collects responses, runs scorers |

The `devals` CLI (Dart) orchestrates steps 1–2, then hands off to `run-evals`
(Python) for steps 3–4.

---

## The `dash_evals` package

### Entry point

The Python CLI entry point is `run-evals`, defined in
`dash_evals/main.py`. It supports two modes:

```bash
# Mode 1: From a JSON manifest (what devals uses)
run-evals --json ./eval_set.json

# Mode 2: Direct CLI arguments (what you used in Part 1)
run-evals --task question_answer --model google/gemini-2.0-flash --dataset samples.json
```

### JSON runner

When using `--json` mode, `json_runner.py` does the heavy lifting:

1. Reads the manifest file
2. For each task definition, resolves the task function by name
3. Builds an Inspect AI `MemoryDataset` from the inline samples
4. Calls the task function with the dataset and config
5. Collects all `Task` objects and calls `inspect_ai.eval_set()`
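The five steps above can be sketched as a small skeleton. This is not the real `json_runner.py`; the manifest key names (`tasks`, `func`, `samples`) and the injected callables are assumptions made for illustration:

```python
import json

def run_from_manifest(path, resolve_func, build_dataset, eval_set):
    """Hypothetical skeleton of the --json flow."""
    with open(path) as f:
        manifest = json.load(f)                       # 1. read the manifest file
    tasks = []
    for task_def in manifest["tasks"]:
        func = resolve_func(task_def["func"])         # 2. resolve the task function by name
        dataset = build_dataset(task_def["samples"])  # 3. build a dataset from inline samples
        tasks.append(func(dataset, task_def))         # 4. call the task function
    return eval_set(tasks)                            # 5. hand all Task objects to eval_set()
```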
### Task resolution

The `func` field in your `task.yaml` is resolved to a Python function. Three
formats are supported:

| Format | Example | How it resolves |
|--------|---------|----------------|
| **Short name** | `question_answer` | Looks up `dash_evals.runner.tasks.question_answer` |
| **Colon syntax** | `my_package.tasks:my_task` | Imports `my_package.tasks`, gets `my_task` |
| **Dotted path** | `my_package.tasks.my_task.my_task` | Last segment is the function name |

Short names work for all built-in tasks. Use colon syntax or dotted paths for
custom tasks in external packages.
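A rough sketch of how the three formats could be resolved with `importlib` (the built-in package path is taken from the table above; the real implementation may differ):

```python
import importlib

BUILTINS = "dash_evals.runner.tasks"  # built-in tasks package, per the table above

def resolve_task_func(ref: str):
    if ":" in ref:
        # Colon syntax: "module.path:function_name"
        module_path, func_name = ref.split(":", 1)
    elif "." in ref:
        # Dotted path: the last segment is the function name
        module_path, _, func_name = ref.rpartition(".")
    else:
        # Short name: look up the attribute on the built-in tasks package
        module_path, func_name = BUILTINS, ref
    return getattr(importlib.import_module(module_path), func_name)
```

Under this scheme, short names resolve through the package's `__init__` exports, while the two explicit forms work for any importable module.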
---

## Anatomy of a task function

Every task function follows the same pattern. Here's `question_answer` —
the simplest built-in task:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought

from ..solvers import add_system_message

# DEFAULT_QA_SYSTEM_MESSAGE is a module-level constant defined alongside the task.

@task
def question_answer(dataset: Dataset, config: dict) -> Task:
    system_msg = config.get("system_message") or DEFAULT_QA_SYSTEM_MESSAGE

    solver_chain = [
        add_system_message(system_msg),   # 1. Set the system prompt
        # context_injector(...)           # 2. Inject context files (if variant has them)
        chain_of_thought(),               # 3. Ask for step-by-step reasoning
        # generate() or react(tools=...)  # 4. Get the model's response
    ]

    return Task(
        name=config["task_name"],
        dataset=dataset,
        solver=solver_chain,
        scorer=model_graded_fact(),
        time_limit=300,
    )
```

**Key ingredients:**

| Part | Purpose |
|------|---------|
| `@task` | Decorator that registers this function with Inspect AI |
| `dataset` | An Inspect `Dataset` built from your samples |
| `config` | A dict with everything from the JSON manifest — variant, system_message, sandbox_type, etc. |
| **Solver chain** | A list of steps that process the prompt and generate a response |
| **Scorer** | Evaluates the model's output against the `target` |

### Solver chain patterns

Most tasks build their solver chain from shared helpers in `task_helpers.py`:

```python
def _build_solver(config, system_msg):
    chain = [add_system_message(system_msg)]

    # Inject context files from the variant
    append_context_injection(chain, config)

    # Add chain-of-thought reasoning
    chain.append(chain_of_thought())

    # If the variant has MCP servers → use react() agent
    # Otherwise → use plain generate()
    append_model_interaction(chain, config)

    return chain
```

This means that variants automatically affect the solver chain — if a variant
defines `mcp_servers`, the task switches from a simple generate call to a
full ReAct agent loop with tool access.

### Agentic vs. non-agentic tasks

| Pattern | Tasks that use it | What happens |
|---------|-------------------|-------------|
| **Non-agentic** | `question_answer`, `code_gen` | System message → chain of thought → single generate |
| **Agentic** | `bug_fix`, `analyze_codebase`, `mcp_tool` | System message → ReAct loop with tools (bash, text editor, MCP) |

Agentic tasks give the model tools (`bash_session()`, `text_editor()`, MCP servers)
and run in a `react()` loop where the model can take multiple actions before
calling `submit()`.

---

## Shared helpers

The `task_helpers.py` module contains functions used across all tasks:

| Helper | What it does |
|--------|-------------|
| `append_context_injection(chain, config)` | Adds a `context_injector` solver if the variant has `files` |
| `append_model_interaction(chain, config)` | Adds `react()` (if tools exist) or `generate()` (if not) |
| `get_skill_tool(config)` | Creates a skill tool if the variant has `skills` configured |
| `build_task_metadata(config)` | Builds the metadata dict for the `Task` object |
| `create_mcp_servers(configs, sandbox_type)` | Creates MCP server objects from variant config |
| `validate_sandbox_tools(config, tool_names)` | Checks that sandbox-requiring tools aren't used on local |

These helpers mean that most of the variant logic (context injection, MCP tools,
skills) is handled **automatically**. You just need to define the core solver
pattern for your task.

---

## Writing your own task

1. **Create a file** at `packages/dash_evals/src/dash_evals/runner/tasks/your_task.py`

2. **Write the task function:**

   ```python
   from inspect_ai import Task, task
   from inspect_ai.dataset import Dataset
   from inspect_ai.scorer import model_graded_fact

   from .task_helpers import (
       append_context_injection,
       append_model_interaction,
       build_task_metadata,
   )
   from ..solvers import add_system_message

   @task
   def your_task(dataset: Dataset, config: dict) -> Task:
       chain = [add_system_message("You are a helpful assistant.")]
       append_context_injection(chain, config)
       append_model_interaction(chain, config)

       return Task(
           name=config["task_name"],
           dataset=dataset,
           solver=chain,
           scorer=model_graded_fact(),
           metadata=build_task_metadata(config),
       )
   ```

3. **Export it** from `runner/tasks/__init__.py`:

   ```python
   from .your_task import your_task
   ```

4. **Reference it** in `task.yaml`:

   ```yaml
   func: your_task
   ```

That's it — the JSON runner resolves the short name automatically.

---

## Built-in tasks

| Task function | Type | What it evaluates |
|--------------|------|-------------------|
| `question_answer` | Non-agentic | Q&A knowledge and reasoning |
| `code_gen` | Non-agentic | Code generation with structured output |
| `flutter_code_gen` | Non-agentic | Flutter-specific code gen (wraps `code_gen`) |
| `bug_fix` | Agentic | Diagnosing and fixing bugs with bash + editor |
| `flutter_bug_fix` | Agentic | Flutter-specific bug fix (wraps `bug_fix`) |
| `analyze_codebase` | Agentic | Exploring and answering questions about code |
| `mcp_tool` | Agentic | Testing MCP tool usage |
| `skill_test` | Agentic | Testing skill file usage in sandboxes |

---

## Further reading

- {doc}`/reference/yaml_config` — complete field-by-field YAML reference
- {doc}`/reference/configuration_reference` — directory structure and examples
- {doc}`/reference/cli` — full CLI command reference
- [Inspect AI documentation](https://inspect.aisi.org.uk/) — the underlying
  evaluation framework