Merged
21 commits
4e75faf
feat: Introduce a dedicated `yaml_config.md` for detailed configurati…
ericwindmill Mar 12, 2026
3cce708
updates in flight
ericwindmill Mar 13, 2026
b9af6a2
rename func
ericwindmill Mar 13, 2026
28fba88
adds task level fields and updates parser
ericwindmill Mar 13, 2026
fe24d91
feat: allow configurable sandbox and SDK channel mappings in dataset …
ericwindmill Mar 13, 2026
2eb1104
feat: Introduce tag-based filtering, refined task function references…
ericwindmill Mar 14, 2026
757acb7
feat: Add variant filtering and propagate image prefix and job task a…
ericwindmill Mar 14, 2026
9c52fd1
feat: Generalize SDK channel to 'branch', consolidate sandbox configu…
ericwindmill Mar 14, 2026
254f6a1
feat: Introduce `resolve_from_parsed` for explicit configuration reso…
ericwindmill Mar 17, 2026
4ce2824
refactor: Consolidate sandbox configuration and Inspect AI eval_set a…
ericwindmill Mar 17, 2026
147319d
feat: Refactor variant configuration to use explicit include/exclude …
ericwindmill Mar 18, 2026
9d522c1
feat: Replace task and sample `workspace` and `tests` fields with `fi…
ericwindmill Mar 18, 2026
87e053e
docs: simplify `inspect_task_args` documentation by replacing a detai…
ericwindmill Mar 18, 2026
fc23421
docs: Update changelog to detail new job and task configuration optio…
ericwindmill Mar 18, 2026
be6c4c5
refactor: adjust config parsing for `mcp_servers` string shorthand, r…
ericwindmill Mar 18, 2026
506b9f4
address code review comment
ericwindmill Mar 18, 2026
f9c4273
docs: Overhaul and reorganize documentation guides, replacing quick s…
ericwindmill Mar 19, 2026
868cce3
feat: Add HTTP transport support for MCP servers, update configuratio…
ericwindmill Mar 19, 2026
03926f5
feat: Introduce flexible dataset configuration supporting inline, JSO…
ericwindmill Mar 20, 2026
3cf7fb6
dart format
ericwindmill Mar 20, 2026
58a21c3
remove old meta doc
ericwindmill Mar 20, 2026
2 changes: 1 addition & 1 deletion .github/workflows/config_parity.yml
@@ -43,4 +43,4 @@ jobs:
run: pip install -e packages/dataset_config_python

- name: Verify config parity
run: dart run tool/config_parity/bin/config_partiy.dart
run: dart run tool/config_parity/bin/config_parity.dart
119 changes: 119 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,119 @@
# Changelog

## Unreleased

### New

- **`Job.description`.** Optional human-readable description field on Job.

- **`Job.imagePrefix` / `Job.image_prefix`.** Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files.

- **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at two levels:
- `Job.taskFilters` / `Job.task_filters` — select tasks by metadata tags
- `Job.sampleFilters` / `Job.sample_filters` — select samples by metadata tags
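
  A hedged sketch of how these filters might appear in a job file (the keys are the ones named above; the tag values are purely illustrative):

  ```yaml
  task_filters:
    include_tags: [flutter]
  sample_filters:
    exclude_tags: [slow]
  ```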

- **`JobTask.args`.** Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks.

- **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level.

- **`Task.sandboxParameters` / `Task.sandbox_parameters`.** Pass-through dictionary for sandbox plugin configuration.

- **`Task.files` / `Task.setup`.** Task-level file and setup declarations. Task-level `files` stack with sample-level `files` (sample wins on key conflict). Sample-level `setup` overrides task-level `setup`.
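
  The stated merge rules can be sketched in a few lines — this is an illustration of the semantics described above, not the package's actual code:

  ```python
  def merge_files(task_files, sample_files):
      # Task-level files stack with sample-level files;
      # on a key conflict the sample's entry wins (later ** overrides earlier).
      return {**(task_files or {}), **(sample_files or {})}

  def effective_setup(task_setup, sample_setup):
      # Sample-level setup, when present, replaces task-level setup entirely.
      return sample_setup if sample_setup is not None else task_setup
  ```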

- **Variant `task_parameters`.** Variants can now declare `task_parameters`, an arbitrary dict merged into the task config at runtime.

- **`module:task` syntax.** Task function references can now use `module.path:function_name` format for Python tasks.

### Breaking Changes

- **`Task.taskFunc` → `Task.func`.** Renamed model field to match the YAML key name. JSON serialization key changes from `"task_func"` to `"func"`. Both Dart and Python packages must update in lockstep.

- **Sandbox registry is now configurable.** The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations.

- **Removed `workspace` and `tests` from task and sample YAML.** Replaced by `files` (a `{destination: source}` map) and `setup` (a shell command string). These are Inspect AI's native `Sample` fields. The old `workspace:` / `tests:` keys and their path/git/template sub-formats are no longer supported.

- **Consolidated sandbox config.** `Job.sandboxEnvironment`, `Job.sandboxParameters`, `Job.imagePrefix` collapsed into a single `Job.sandbox` map (keys: `environment`, `parameters`, `image_prefix`).
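
  For illustration, the consolidated shape might look like this (the keys are from this entry; the environment name, parameter, and registry URL are hypothetical):

  ```yaml
  sandbox:
    environment: docker
    parameters: { some_plugin_option: true }
    image_prefix: us-docker.pkg.dev/my-project/evals
  ```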

- **Consolidated Inspect AI eval arguments.** Individual top-level Job fields (`retryAttempts`, `failOnError`, `logLevel`, `maxTasks`, etc.) collapsed into a single `Job.inspectEvalArguments` / `Job.inspect_eval_arguments` pass-through dict.

- **`inspect_task_args` is now a pass-through dict.** Individual sub-fields (`model`, `epochs`, `time_limit`, etc.) are no longer typed on the `Task` model. The entire `inspect_task_args` section is passed through as-is to Inspect AI's `Task()` constructor.
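
  For example (the sub-keys are the ones named above; the values are illustrative):

  ```yaml
  inspect_task_args:
    model: google/gemini-2.0-flash
    epochs: 3
    time_limit: 600
  ```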

- **Removed `JobTask.systemMessage`.** System message is now set at the task level via `Task.systemMessage`.

- **Variant field renames.** `context_files` → `files`, `skill_paths` → `skills`. Per-task variant restriction now uses `include-variants` / `exclude-variants` on the job's `tasks.<id>` object instead of task-level `allowed_variants`.
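
  A hedged sketch of the new job-side restriction (the task id is hypothetical; the variant names reuse examples from elsewhere in this changelog):

  ```yaml
  tasks:
    my_task:
      include-variants: [baseline, mcp_only]
  ```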

### Documentation

- Added `docs/reference/yaml_config.md` with complete field-by-field reference tables.
- Updated `docs/reference/configuration_reference.md` with new examples and directory structure.
- Updated `docs/guides/config.md`.

## 11 March, 2025
### New

- **`dataset_config_python` package.** Python port of the Dart config package (`dataset_config_dart`), providing full parity for YAML parsing, resolution, and JSON output. Includes Pydantic models for `Job`, `Task`, `Sample`, `EvalSet`, `Variant`, `Dataset`, and `ContextFile`. Exposes `resolve()` and `write_eval_sets()` as the public API. No Dart SDK or Inspect AI dependency required — can be installed standalone by any team that needs to parse eval config YAML.

### Breaking Changes

- **Renamed `dataset_config` → `dataset_config_dart`.** The Dart config package was renamed for clarity alongside the new Python package.

- **Renamed `dash_evals_config` → `dataset_config_python`.** The Python config package was renamed from its original name for consistency with the Dart package.

## 28 February, 2025

### New

- **`eval_config` Dart package.** New package with a layered Parser → Resolver → Writer architecture that converts dataset YAML into EvalSet JSON for the Python runner. Provides `ConfigResolver` facade plus direct access to `YamlParser`, `JsonParser`, `EvalSetResolver`, and `EvalSetWriter`.

- **Dual-mode eval runner.** The Python runner now supports two invocation modes:
- `run-evals --json ./eval_set.json` — consume a JSON manifest produced by the Dart CLI
- `run-evals --task <name> --model <model>` — run a single task directly from CLI arguments

- **Generalized task functions.** Task implementations are now language-agnostic by default. Flutter-specific tasks (`flutter_bug_fix`, `flutter_code_gen`) are thin wrappers around the generic `bug_fix` and `code_gen` tasks. New tasks: `analyze_codebase`, `mcp_tool`, `skill_test`.

- **New Dart domain models.** `EvalSet`, `Task`, `Sample`, `Variant`, and `TaskInfo` models in the `models` package map directly to the Inspect AI evaluation structure.

### Breaking Changes

- **Removed Python `registries.py`.** Task/model/sandbox registries are removed. Task functions are now discovered dynamically via `importlib` (short names like `"flutter_code_gen"` resolve automatically).

- **Removed `TaskConfig` and `SampleConfig`.** Replaced by `ParsedTask` (intermediate parsing type in `eval_config`) and `Sample` (Inspect AI domain model).

- **Removed legacy Python config parsing.** The `config/parsers/` directory, `load_yaml` utility, and associated model definitions have been removed from `eval_runner`. Configuration is now handled by the Dart `eval_config` package.

- **Models package reorganized.** Report-app models (used by the Flutter results viewer) moved to `models/lib/src/report_app/`. The top-level `models/lib/src/` now contains inspect-domain models.

- **Dataset utilities moved.** `DatasetReader`, `filesystem_utils`, and discovery helpers moved from `eval_config` to `eval_cli`.

## 25 February, 2025

### Breaking Changes

- **Variant format changed from list to named map.** Job YAML files now define variants as a named map instead of a list. Tasks can optionally restrict applicable variants via `allowed_variants` in their `task.yaml`.

**Before (list format):**
```yaml
variants:
  - baseline
  - { mcp_servers: [dart] }
```

**After (named map format):**
```yaml
# job.yaml
variants:
  baseline: {}
  mcp_only: { mcp_servers: [dart] }
  context_only: { context_files: [./context_files/flutter.md] }
  full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] }
```

```yaml
# task.yaml (optional — omit to accept all job variants)
allowed_variants: [baseline, mcp_only]
```

- **Removed `DEFAULT_VARIANTS` registry.** Variants are no longer defined globally in `registries.py`. Each job file defines its own variants.

- **Removed `variants` from `JobTask`.** Per-task variant overrides (`job.tasks.<id>.variants`) are replaced by task-level `allowed_variants` whitelists.
19 changes: 19 additions & 0 deletions docs/_static/custom.css
@@ -418,3 +418,22 @@ html[data-theme="dark"] .sig > span.pre:not(:first-child) {
html[data-theme="dark"] .sig-paren {
color: #888888;
}


/* ============================================
COLLAPSIBLE SIDEBARS ON WIDE SCREENS
============================================ */

.bd-sidebar-primary {
padding-right: 30px;
width: auto !important;
}


.bd-sidebar-secondary {
width: auto !important;
}

.bd-article-container {
max-width: none !important;
}
238 changes: 238 additions & 0 deletions docs/guides/about_the_framework.md
@@ -0,0 +1,238 @@
# About the framework

You've been using built-in task functions like `bug_fix` and `question_answer`.
This page explains how they work — useful if you want to write custom eval logic
or just understand what happens when you run `devals run`.

---

## Architecture overview

When you run an eval, data flows through three layers:

```
YAML config → Dart resolver → JSON manifest → Python runner → Inspect AI
```

| Layer | Package | What it does |
|-------|---------|-------------|
| **YAML config** | — | Your `task.yaml` and `job.yaml` files |
| **Dart resolver** | `dataset_config_dart` | Parses YAML, resolves globs and references, produces a JSON manifest |
| **Python runner** | `dash_evals` | Reads the manifest, builds Inspect AI `Task` objects, calls `eval_set()` |
| **Inspect AI** | `inspect_ai` | Runs solver chains, sends prompts, collects responses, runs scorers |

The `devals` CLI (Dart) orchestrates steps 1–2, then hands off to `run-evals`
(Python) for steps 3–4.

---

## The `dash_evals` package

### Entry point

The Python CLI entry point is `run-evals`, defined in
`dash_evals/main.py`. It supports two modes:

```bash
# Mode 1: From a JSON manifest (what devals uses)
run-evals --json ./eval_set.json

# Mode 2: Direct CLI arguments (what you used in Part 1)
run-evals --task question_answer --model google/gemini-2.0-flash --dataset samples.json
```

### JSON runner

When using `--json` mode, `json_runner.py` does the heavy lifting:

1. Reads the manifest file
2. For each task definition, resolves the task function by name
3. Builds an Inspect AI `MemoryDataset` from the inline samples
4. Calls the task function with the dataset and config
5. Collects all `Task` objects and calls `inspect_ai.eval_set()`
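
The loop can be sketched roughly as follows. The manifest keys (`tasks`, `func`, `samples`, `config`) and the injected helper names are assumptions for illustration, not the actual `json_runner.py` code:

```python
import json

def run_from_manifest(path, resolve_func, build_dataset, eval_set):
    """Illustrative flow only; the real helpers live in dash_evals."""
    with open(path) as f:
        manifest = json.load(f)                        # 1. read the manifest
    tasks = []
    for task_def in manifest["tasks"]:
        fn = resolve_func(task_def["func"])            # 2. resolve task function
        dataset = build_dataset(task_def["samples"])   # 3. build the dataset
        tasks.append(fn(dataset, task_def["config"]))  # 4. call the task function
    return eval_set(tasks)                             # 5. hand everything to eval_set()
```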

### Task resolution

The `func` field in your `task.yaml` is resolved to a Python function. Three
formats are supported:

| Format | Example | How it resolves |
|--------|---------|----------------|
| **Short name** | `question_answer` | Looks up `dash_evals.runner.tasks.question_answer` |
| **Colon syntax** | `my_package.tasks:my_task` | Imports `my_package.tasks`, gets `my_task` |
| **Dotted path** | `my_package.tasks.my_task.my_task` | Last segment is the function name |

Short names work for all built-in tasks. Use colon syntax or dotted paths for
custom tasks in external packages.
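
The three formats can be sketched with `importlib` — an illustrative reimplementation, not the actual resolver in `dash_evals` (in particular, the assumption that a short name maps to a same-named module under the built-in prefix is ours):

```python
import importlib

# Stand-in for the built-in task package path described above.
BUILTIN_PREFIX = "dash_evals.runner.tasks"

def resolve_task_func(ref: str, builtin_prefix: str = BUILTIN_PREFIX):
    if ":" in ref:
        # Colon syntax: "my_package.tasks:my_task"
        module_path, func_name = ref.split(":", 1)
    elif "." in ref:
        # Dotted path: last segment is the function name
        module_path, _, func_name = ref.rpartition(".")
    else:
        # Short name: assumed to live in a same-named module under the prefix
        module_path, func_name = f"{builtin_prefix}.{ref}", ref
    return getattr(importlib.import_module(module_path), func_name)
```

For instance, `resolve_task_func("my_package.tasks:my_task")` imports `my_package.tasks` and returns its `my_task` attribute.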

---

## Anatomy of a task function

Every task function follows the same pattern. Here's `question_answer` —
the simplest built-in task:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought

from ..solvers import add_system_message

# DEFAULT_QA_SYSTEM_MESSAGE is the task's fallback prompt (definition elided)

@task
def question_answer(dataset: Dataset, config: dict) -> Task:
system_msg = config.get("system_message") or DEFAULT_QA_SYSTEM_MESSAGE

solver_chain = [
add_system_message(system_msg), # 1. Set the system prompt
# context_injector(...) # 2. Inject context files (if variant has them)
chain_of_thought(), # 3. Ask for step-by-step reasoning
# generate() or react(tools=...) # 4. Get the model's response
]

return Task(
name=config["task_name"],
dataset=dataset,
solver=solver_chain,
scorer=model_graded_fact(),
time_limit=300,
)
```

**Key ingredients:**

| Part | Purpose |
|------|---------|
| `@task` | Decorator that registers this function with Inspect AI |
| `dataset` | An Inspect `Dataset` built from your samples |
| `config` | A dict with everything from the JSON manifest — variant, system_message, sandbox_type, etc. |
| **Solver chain** | A list of steps that process the prompt and generate a response |
| **Scorer** | Evaluates the model's output against the `target` |

### Solver chain patterns

Most tasks build their solver chain from shared helpers in `task_helpers.py`:

```python
def _build_solver(config, system_msg):
chain = [add_system_message(system_msg)]

# Inject context files from the variant
append_context_injection(chain, config)

# Add chain-of-thought reasoning
chain.append(chain_of_thought())

# If the variant has MCP servers → use react() agent
# Otherwise → use plain generate()
append_model_interaction(chain, config)

return chain
```

This means that variants automatically affect the solver chain — if a variant
defines `mcp_servers`, the task switches from a simple generate call to a
full ReAct agent loop with tool access.

### Agentic vs. non-agentic tasks

| Pattern | Tasks that use it | What happens |
|---------|-------------------|-------------|
| **Non-agentic** | `question_answer`, `code_gen` | System message → chain of thought → single generate |
| **Agentic** | `bug_fix`, `analyze_codebase`, `mcp_tool` | System message → ReAct loop with tools (bash, text editor, MCP) |

Agentic tasks give the model tools (`bash_session()`, `text_editor()`, MCP servers)
and run in a `react()` loop where the model can take multiple actions before
calling `submit()`.

---

## Shared helpers

The `task_helpers.py` module contains functions used across all tasks:

| Helper | What it does |
|--------|-------------|
| `append_context_injection(chain, config)` | Adds a `context_injector` solver if the variant has `files` |
| `append_model_interaction(chain, config)` | Adds `react()` (if tools exist) or `generate()` (if not) |
| `get_skill_tool(config)` | Creates a skill tool if the variant has `skills` configured |
| `build_task_metadata(config)` | Builds the metadata dict for the `Task` object |
| `create_mcp_servers(configs, sandbox_type)` | Creates MCP server objects from variant config |
| `validate_sandbox_tools(config, tool_names)` | Checks that sandbox-requiring tools aren't used on local |

These helpers mean that most of the variant logic (context injection, MCP tools,
skills) is handled **automatically**. You just need to define the core solver
pattern for your task.

---

## Writing your own task

1. **Create a file** at `packages/dash_evals/src/dash_evals/runner/tasks/your_task.py`

2. **Write the task function:**

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Dataset
from inspect_ai.scorer import model_graded_fact

from .task_helpers import (
append_context_injection,
append_model_interaction,
build_task_metadata,
)
from ..solvers import add_system_message

@task
def your_task(dataset: Dataset, config: dict) -> Task:
chain = [add_system_message("You are a helpful assistant.")]
append_context_injection(chain, config)
append_model_interaction(chain, config)

return Task(
name=config["task_name"],
dataset=dataset,
solver=chain,
scorer=model_graded_fact(),
metadata=build_task_metadata(config),
)
```

3. **Export it** from `runner/tasks/__init__.py`:

```python
from .your_task import your_task
```

4. **Reference it** in `task.yaml`:

```yaml
func: your_task
```

That's it — the JSON runner resolves the short name automatically.

---

## Built-in tasks

| Task function | Type | What it evaluates |
|--------------|------|-------------------|
| `question_answer` | Non-agentic | Q&A knowledge and reasoning |
| `code_gen` | Non-agentic | Code generation with structured output |
| `flutter_code_gen` | Non-agentic | Flutter-specific code gen (wraps `code_gen`) |
| `bug_fix` | Agentic | Diagnosing and fixing bugs with bash + editor |
| `flutter_bug_fix` | Agentic | Flutter-specific bug fix (wraps `bug_fix`) |
| `analyze_codebase` | Agentic | Exploring and answering questions about code |
| `mcp_tool` | Agentic | Testing MCP tool usage |
| `skill_test` | Agentic | Testing skill file usage in sandboxes |

---

## Further reading

- {doc}`/reference/yaml_config` — complete field-by-field YAML reference
- {doc}`/reference/configuration_reference` — directory structure and examples
- {doc}`/reference/cli` — full CLI command reference
- [Inspect AI documentation](https://inspect.aisi.org.uk/) — the underlying
evaluation framework