# Yardstick config updates #14

Merged · 21 commits
Commits (all by ericwindmill):

- `4e75faf` feat: Introduce a dedicated `yaml_config.md` for detailed configurati…
- `3cce708` updates in flight
- `b9af6a2` rename func
- `28fba88` adds task level fields and updates parser
- `fe24d91` feat: allow configurable sandbox and SDK channel mappings in dataset …
- `2eb1104` feat: Introduce tag-based filtering, refined task function references…
- `757acb7` feat: Add variant filtering and propagate image prefix and job task a…
- `9c52fd1` feat: Generalize SDK channel to 'branch', consolidate sandbox configu…
- `254f6a1` feat: Introduce `resolve_from_parsed` for explicit configuration reso…
- `4ce2824` refactor: Consolidate sandbox configuration and Inspect AI eval_set a…
- `147319d` feat: Refactor variant configuration to use explicit include/exclude …
- `9d522c1` feat: Replace task and sample `workspace` and `tests` fields with `fi…
- `87e053e` docs: simplify `inspect_task_args` documentation by replacing a detai…
- `fc23421` docs: Update changelog to detail new job and task configuration optio…
- `be6c4c5` refactor: adjust config parsing for `mcp_servers` string shorthand, r…
- `506b9f4` address code review comment
- `f9c4273` docs: Overhaul and reorganize documentation guides, replacing quick s…
- `868cce3` feat: Add HTTP transport support for MCP servers, update configuratio…
- `03926f5` feat: Introduce flexible dataset configuration supporting inline, JSO…
- `3cf7fb6` dart format
- `58a21c3` remove old meta doc
# Changelog

## Unreleased

### New

- **`Job.description`.** Optional human-readable description field on `Job`.

- **`Job.imagePrefix` / `Job.image_prefix`.** Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files.

- **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at two levels:
  - `Job.taskFilters` / `Job.task_filters` — select tasks by metadata tags
  - `Job.sampleFilters` / `Job.sample_filters` — select samples by metadata tags
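As a sketch, a job might combine both filter levels like this (the tag values here are made up for illustration):

```yaml
# job.yaml (hypothetical tags)
task_filters:
  include_tags: [agentic]
  exclude_tags: [slow]
sample_filters:
  include_tags: [smoke]
```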
- **`JobTask.args`.** Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks.

- **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level.

- **`Task.sandboxParameters` / `Task.sandbox_parameters`.** Pass-through dictionary for sandbox plugin configuration.

- **`Task.files` / `Task.setup`.** Task-level file and setup declarations. Task-level `files` stack with sample-level `files` (sample wins on key conflict). Sample-level `setup` overrides task-level `setup`.
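The stacking rule amounts to a dict merge plus an override. A minimal sketch (not the actual resolver code; file paths are illustrative):

```python
def merge_files(task_files: dict, sample_files: dict) -> dict:
    # Task-level files form the base; sample-level entries win on key conflicts.
    return {**task_files, **sample_files}

def resolve_setup(task_setup, sample_setup):
    # Sample-level setup, when present, replaces task-level setup entirely.
    return sample_setup if sample_setup is not None else task_setup

merged = merge_files(
    {"lib/a.dart": "./task/a.dart", "test/t.dart": "./task/t.dart"},
    {"test/t.dart": "./sample/t.dart"},
)
# merged == {"lib/a.dart": "./task/a.dart", "test/t.dart": "./sample/t.dart"}
```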
- **Variant `task_parameters`.** Variants can now declare `task_parameters`, an arbitrary dict merged into the task config at runtime.

- **`module:task` syntax.** Task function references can now use `module.path:function_name` format for Python tasks.

### Breaking Changes

- **`Task.taskFunc` → `Task.func`.** Renamed model field to match the YAML key name. JSON serialization key changes from `"task_func"` to `"func"`. Both Dart and Python packages must update in lockstep.

- **Sandbox registry is now configurable.** The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations.

- **Removed `workspace` and `tests` from task and sample YAML.** Replaced by `files` (a `{destination: source}` map) and `setup` (a shell command string). These are Inspect AI's native `Sample` fields. The old `workspace:` / `tests:` keys and their path/git/template sub-formats are no longer supported.

- **Consolidated sandbox config.** `Job.sandboxEnvironment`, `Job.sandboxParameters`, and `Job.imagePrefix` collapsed into a single `Job.sandbox` map (keys: `environment`, `parameters`, `image_prefix`).
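In YAML, the consolidated shape looks something like this (the three keys come from the change above; the values are illustrative):

```yaml
# job.yaml
sandbox:
  environment: docker
  parameters:
    memory: 4Gi
  image_prefix: us-docker.pkg.dev/my-project/evals
```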
- **Consolidated Inspect AI eval arguments.** Individual top-level Job fields (`retryAttempts`, `failOnError`, `logLevel`, `maxTasks`, etc.) collapsed into a single `Job.inspectEvalArguments` / `Job.inspect_eval_arguments` pass-through dict.

- **`inspect_task_args` is now a pass-through dict.** Individual sub-fields (`model`, `epochs`, `time_limit`, etc.) are no longer typed on the `Task` model. The entire `inspect_task_args` section is passed through as-is to Inspect AI's `Task()` constructor.
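For example, a task can now forward arbitrary constructor arguments (field names are from the change above; the values are illustrative):

```yaml
# task.yaml
inspect_task_args:
  epochs: 3
  time_limit: 600
```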
- **Removed `JobTask.systemMessage`.** System message is now set at the task level via `Task.systemMessage`.

- **Variant field renames.** `context_files` → `files`, `skill_paths` → `skills`. Variant-level task restriction uses `include-variants` / `exclude-variants` on the job's `tasks.<id>` object instead of task-level `allowed_variants`.

### Documentation

- Added `docs/reference/yaml_config.md` with complete field-by-field reference tables.
- Updated `docs/reference/configuration_reference.md` with new examples and directory structure.
- Updated `docs/guides/config.md`.

## 11 March, 2025

### New

- **`dataset_config_python` package.** Python port of the Dart config package (`dataset_config_dart`), providing full parity for YAML parsing, resolution, and JSON output. Includes Pydantic models for `Job`, `Task`, `Sample`, `EvalSet`, `Variant`, `Dataset`, and `ContextFile`. Exposes `resolve()` and `write_eval_sets()` as the public API. No Dart SDK or Inspect AI dependency is required, so the package can be installed standalone by any team that needs to parse eval config YAML.

### Breaking Changes

- **Renamed `dataset_config` → `dataset_config_dart`.** The Dart config package was renamed for clarity alongside the new Python package.

- **Renamed `dash_evals_config` → `dataset_config_python`.** The Python config package was renamed for consistency with the Dart package.

## 28 February, 2025

### New

- **`eval_config` Dart package.** New package with a layered Parser → Resolver → Writer architecture that converts dataset YAML into EvalSet JSON for the Python runner. Provides a `ConfigResolver` facade plus direct access to `YamlParser`, `JsonParser`, `EvalSetResolver`, and `EvalSetWriter`.

- **Dual-mode eval runner.** The Python runner now supports two invocation modes:
  - `run-evals --json ./eval_set.json` — consume a JSON manifest produced by the Dart CLI
  - `run-evals --task <name> --model <model>` — run a single task directly from CLI arguments

- **Generalized task functions.** Task implementations are now language-agnostic by default. Flutter-specific tasks (`flutter_bug_fix`, `flutter_code_gen`) are thin wrappers around the generic `bug_fix` and `code_gen` tasks. New tasks: `analyze_codebase`, `mcp_tool`, `skill_test`.

- **New Dart domain models.** `EvalSet`, `Task`, `Sample`, `Variant`, and `TaskInfo` models in the `models` package map directly to the Inspect AI evaluation structure.

### Breaking Changes

- **Removed Python `registries.py`.** Task/model/sandbox registries are removed. Task functions are now discovered dynamically via `importlib` (short names like `"flutter_code_gen"` resolve automatically).

- **Removed `TaskConfig` and `SampleConfig`.** Replaced by `ParsedTask` (an intermediate parsing type in `eval_config`) and `Sample` (the Inspect AI domain model).

- **Removed legacy Python config parsing.** The `config/parsers/` directory, the `load_yaml` utility, and associated model definitions have been removed from `eval_runner`. Configuration is now handled by the Dart `eval_config` package.

- **Models package reorganized.** Report-app models (used by the Flutter results viewer) moved to `models/lib/src/report_app/`. The top-level `models/lib/src/` now contains inspect-domain models.

- **Dataset utilities moved.** `DatasetReader`, `filesystem_utils`, and discovery helpers moved from `eval_config` to `eval_cli`.

## 25 February, 2025

### Breaking Changes

- **Variant format changed from list to named map.** Job YAML files now define variants as a named map instead of a list. Tasks can optionally restrict applicable variants via `allowed_variants` in their `task.yaml`.

  **Before (list format):**

  ```yaml
  variants:
    - baseline
    - { mcp_servers: [dart] }
  ```

  **After (named map format):**

  ```yaml
  # job.yaml
  variants:
    baseline: {}
    mcp_only: { mcp_servers: [dart] }
    context_only: { context_files: [./context_files/flutter.md] }
    full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] }
  ```

  ```yaml
  # task.yaml (optional — omit to accept all job variants)
  allowed_variants: [baseline, mcp_only]
  ```

- **Removed `DEFAULT_VARIANTS` registry.** Variants are no longer defined globally in `registries.py`. Each job file defines its own variants.

- **Removed `variants` from `JobTask`.** Per-task variant overrides (`job.tasks.<id>.variants`) are replaced by task-level `allowed_variants` whitelists.
# About the framework

You've been using built-in task functions like `bug_fix` and `question_answer`.
This page explains how they work — useful if you want to write custom eval logic
or just understand what happens when you run `devals run`.

---

## Architecture overview

When you run an eval, data flows through four layers:

```
YAML config → Dart resolver → JSON manifest → Python runner → Inspect AI
```

| Layer | Package | What it does |
|-------|---------|-------------|
| **YAML config** | — | Your `task.yaml` and `job.yaml` files |
| **Dart resolver** | `dataset_config_dart` | Parses YAML, resolves globs and references, produces a JSON manifest |
| **Python runner** | `dash_evals` | Reads the manifest, builds Inspect AI `Task` objects, calls `eval_set()` |
| **Inspect AI** | `inspect_ai` | Runs solver chains, sends prompts, collects responses, runs scorers |

The `devals` CLI (Dart) orchestrates steps 1–2, then hands off to `run-evals`
(Python) for steps 3–4.

---

## The `dash_evals` package

### Entry point

The Python CLI entry point is `run-evals`, defined in
`dash_evals/main.py`. It supports two modes:

```bash
# Mode 1: From a JSON manifest (what devals uses)
run-evals --json ./eval_set.json

# Mode 2: Direct CLI arguments (what you used in Part 1)
run-evals --task question_answer --model google/gemini-2.0-flash --dataset samples.json
```

### JSON runner

When using `--json` mode, `json_runner.py` does the heavy lifting:

1. Reads the manifest file
2. For each task definition, resolves the task function by name
3. Builds an Inspect AI `MemoryDataset` from the inline samples
4. Calls the task function with the dataset and config
5. Collects all `Task` objects and calls `inspect_ai.eval_set()`
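The five steps above can be sketched as a small skeleton. This is not the real `json_runner.py`; the manifest key names (`tasks`, `func`, `samples`) and the injected callables are assumptions made for illustration:

```python
import json

def run_from_manifest(path, resolve_func, build_dataset, eval_set):
    """Hypothetical skeleton of the --json flow."""
    with open(path) as f:
        manifest = json.load(f)                       # 1. read the manifest file
    tasks = []
    for task_def in manifest["tasks"]:
        func = resolve_func(task_def["func"])         # 2. resolve the task function by name
        dataset = build_dataset(task_def["samples"])  # 3. build a dataset from inline samples
        tasks.append(func(dataset, task_def))         # 4. call the task function
    return eval_set(tasks)                            # 5. hand all Task objects to eval_set()
```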
### Task resolution

The `func` field in your `task.yaml` is resolved to a Python function. Three
formats are supported:

| Format | Example | How it resolves |
|--------|---------|----------------|
| **Short name** | `question_answer` | Looks up `dash_evals.runner.tasks.question_answer` |
| **Colon syntax** | `my_package.tasks:my_task` | Imports `my_package.tasks`, gets `my_task` |
| **Dotted path** | `my_package.tasks.my_task.my_task` | Last segment is the function name |

Short names work for all built-in tasks. Use colon syntax or dotted paths for
custom tasks in external packages.
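A rough sketch of how the three formats could be resolved with `importlib` (the built-in package path is taken from the table above; the real implementation may differ):

```python
import importlib

BUILTINS = "dash_evals.runner.tasks"  # built-in tasks package, per the table above

def resolve_task_func(ref: str):
    if ":" in ref:
        # Colon syntax: "module.path:function_name"
        module_path, func_name = ref.split(":", 1)
    elif "." in ref:
        # Dotted path: the last segment is the function name
        module_path, _, func_name = ref.rpartition(".")
    else:
        # Short name: look up the attribute on the built-in tasks package
        module_path, func_name = BUILTINS, ref
    return getattr(importlib.import_module(module_path), func_name)
```

Under this scheme, short names resolve through the package's `__init__` exports, while the two explicit forms work for any importable module.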
---

## Anatomy of a task function

Every task function follows the same pattern. Here's `question_answer` —
the simplest built-in task:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought

from ..solvers import add_system_message

# DEFAULT_QA_SYSTEM_MESSAGE is a module-level constant defined alongside the task.

@task
def question_answer(dataset: Dataset, config: dict) -> Task:
    system_msg = config.get("system_message") or DEFAULT_QA_SYSTEM_MESSAGE

    solver_chain = [
        add_system_message(system_msg),   # 1. Set the system prompt
        # context_injector(...)           # 2. Inject context files (if variant has them)
        chain_of_thought(),               # 3. Ask for step-by-step reasoning
        # generate() or react(tools=...)  # 4. Get the model's response
    ]

    return Task(
        name=config["task_name"],
        dataset=dataset,
        solver=solver_chain,
        scorer=model_graded_fact(),
        time_limit=300,
    )
```

**Key ingredients:**

| Part | Purpose |
|------|---------|
| `@task` | Decorator that registers this function with Inspect AI |
| `dataset` | An Inspect `Dataset` built from your samples |
| `config` | A dict with everything from the JSON manifest — variant, system_message, sandbox_type, etc. |
| **Solver chain** | A list of steps that process the prompt and generate a response |
| **Scorer** | Evaluates the model's output against the `target` |

### Solver chain patterns

Most tasks build their solver chain from shared helpers in `task_helpers.py`:

```python
def _build_solver(config, system_msg):
    chain = [add_system_message(system_msg)]

    # Inject context files from the variant
    append_context_injection(chain, config)

    # Add chain-of-thought reasoning
    chain.append(chain_of_thought())

    # If the variant has MCP servers → use react() agent
    # Otherwise → use plain generate()
    append_model_interaction(chain, config)

    return chain
```

This means that variants automatically affect the solver chain — if a variant
defines `mcp_servers`, the task switches from a simple generate call to a
full ReAct agent loop with tool access.

### Agentic vs. non-agentic tasks

| Pattern | Tasks that use it | What happens |
|---------|-------------------|-------------|
| **Non-agentic** | `question_answer`, `code_gen` | System message → chain of thought → single generate |
| **Agentic** | `bug_fix`, `analyze_codebase`, `mcp_tool` | System message → ReAct loop with tools (bash, text editor, MCP) |

Agentic tasks give the model tools (`bash_session()`, `text_editor()`, MCP servers)
and run in a `react()` loop where the model can take multiple actions before
calling `submit()`.

---

## Shared helpers

The `task_helpers.py` module contains functions used across all tasks:

| Helper | What it does |
|--------|-------------|
| `append_context_injection(chain, config)` | Adds a `context_injector` solver if the variant has `files` |
| `append_model_interaction(chain, config)` | Adds `react()` (if tools exist) or `generate()` (if not) |
| `get_skill_tool(config)` | Creates a skill tool if the variant has `skills` configured |
| `build_task_metadata(config)` | Builds the metadata dict for the `Task` object |
| `create_mcp_servers(configs, sandbox_type)` | Creates MCP server objects from variant config |
| `validate_sandbox_tools(config, tool_names)` | Checks that sandbox-requiring tools aren't used on local |

These helpers mean that most of the variant logic (context injection, MCP tools,
skills) is handled **automatically**. You just need to define the core solver
pattern for your task.

---

## Writing your own task

1. **Create a file** at `packages/dash_evals/src/dash_evals/runner/tasks/your_task.py`

2. **Write the task function:**

   ```python
   from inspect_ai import Task, task
   from inspect_ai.dataset import Dataset
   from inspect_ai.scorer import model_graded_fact

   from .task_helpers import (
       append_context_injection,
       append_model_interaction,
       build_task_metadata,
   )
   from ..solvers import add_system_message

   @task
   def your_task(dataset: Dataset, config: dict) -> Task:
       chain = [add_system_message("You are a helpful assistant.")]
       append_context_injection(chain, config)
       append_model_interaction(chain, config)

       return Task(
           name=config["task_name"],
           dataset=dataset,
           solver=chain,
           scorer=model_graded_fact(),
           metadata=build_task_metadata(config),
       )
   ```

3. **Export it** from `runner/tasks/__init__.py`:

   ```python
   from .your_task import your_task
   ```

4. **Reference it** in `task.yaml`:

   ```yaml
   func: your_task
   ```

That's it — the JSON runner resolves the short name automatically.

---

## Built-in tasks

| Task function | Type | What it evaluates |
|--------------|------|-------------------|
| `question_answer` | Non-agentic | Q&A knowledge and reasoning |
| `code_gen` | Non-agentic | Code generation with structured output |
| `flutter_code_gen` | Non-agentic | Flutter-specific code gen (wraps `code_gen`) |
| `bug_fix` | Agentic | Diagnosing and fixing bugs with bash + editor |
| `flutter_bug_fix` | Agentic | Flutter-specific bug fix (wraps `bug_fix`) |
| `analyze_codebase` | Agentic | Exploring and answering questions about code |
| `mcp_tool` | Agentic | Testing MCP tool usage |
| `skill_test` | Agentic | Testing skill file usage in sandboxes |

---

## Further reading

- {doc}`/reference/yaml_config` — complete field-by-field YAML reference
- {doc}`/reference/configuration_reference` — directory structure and examples
- {doc}`/reference/cli` — full CLI command reference
- [Inspect AI documentation](https://inspect.aisi.org.uk/) — the underlying
  evaluation framework