diff --git a/.github/workflows/config_parity.yml b/.github/workflows/config_parity.yml index a2338af..4ef71af 100644 --- a/.github/workflows/config_parity.yml +++ b/.github/workflows/config_parity.yml @@ -43,4 +43,4 @@ jobs: run: pip install -e packages/dataset_config_python - name: Verify config parity - run: dart run tool/config_parity/bin/config_partiy.dart + run: dart run tool/config_parity/bin/config_parity.dart diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..891c523 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,119 @@ +# Changelog + +## Unreleased + +### New + +- **`Job.description`.** Optional human-readable description field on Job. + +- **`Job.imagePrefix` / `Job.image_prefix`.** Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files. + +- **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at two levels: + - `Job.taskFilters` / `Job.task_filters` — select tasks by metadata tags + - `Job.sampleFilters` / `Job.sample_filters` — select samples by metadata tags + +- **`JobTask.args`.** Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks. + +- **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level. + +- **`Task.sandboxParameters` / `Task.sandbox_parameters`.** Pass-through dictionary for sandbox plugin configuration. + +- **`Task.files` / `Task.setup`.** Task-level file and setup declarations. Task-level `files` stack with sample-level `files` (sample wins on key conflict). Sample-level `setup` overrides task-level `setup`. + +- **Variant `task_parameters`.** Variants can now declare `task_parameters`, an arbitrary dict merged into the task config at runtime. 
+ +- **`module:task` syntax.** Task function references can now use `module.path:function_name` format for Python tasks. + +### Breaking Changes + +- **`Task.taskFunc` → `Task.func`.** Renamed model field to match the YAML key name. JSON serialization key changes from `"task_func"` to `"func"`. Both Dart and Python packages must update in lockstep. + +- **Sandbox registry is now configurable.** The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations. + +- **Removed `workspace` and `tests` from task and sample YAML.** Replaced by `files` (a `{destination: source}` map) and `setup` (a shell command string). These are Inspect AI's native `Sample` fields. The old `workspace:` / `tests:` keys and their path/git/template sub-formats are no longer supported. + +- **Consolidated sandbox config.** `Job.sandboxEnvironment`, `Job.sandboxParameters`, `Job.imagePrefix` collapsed into a single `Job.sandbox` map (keys: `environment`, `parameters`, `image_prefix`). + +- **Consolidated Inspect AI eval arguments.** Individual top-level Job fields (`retryAttempts`, `failOnError`, `logLevel`, `maxTasks`, etc.) collapsed into a single `Job.inspectEvalArguments` / `Job.inspect_eval_arguments` pass-through dict. + +- **`inspect_task_args` is now a pass-through dict.** Individual sub-fields (`model`, `epochs`, `time_limit`, etc.) are no longer typed on the `Task` model. The entire `inspect_task_args` section is passed through as-is to Inspect AI's `Task()` constructor. + +- **Removed `JobTask.systemMessage`.** System message is now set at the task level via `Task.systemMessage`. + +- **Variant field renames.** `context_files` → `files`, `skill_paths` → `skills`. Variant-level task restriction uses `include-variants` / `exclude-variants` on the job's `tasks.` object instead of task-level `allowed_variants`. 
+ +### Documentation + +- Added `docs/reference/yaml_config.md` with complete field-by-field reference tables. +- Updated `docs/reference/configuration_reference.md` with new examples and directory structure. +- Updated `docs/guides/config.md`. + +## 11 March, 2025 + +### New + +- **`dataset_config_python` package.** Python port of the Dart config package (`dataset_config_dart`), providing full parity for YAML parsing, resolution, and JSON output. Includes Pydantic models for `Job`, `Task`, `Sample`, `EvalSet`, `Variant`, `Dataset`, and `ContextFile`. Exposes `resolve()` and `write_eval_sets()` as the public API. No Dart SDK or Inspect AI dependency required — can be installed standalone by any team that needs to parse eval config YAML. + +### Breaking Changes + +- **Renamed `dataset_config` → `dataset_config_dart`.** The Dart config package was renamed for clarity alongside the new Python package. + +- **Renamed `dash_evals_config` → `dataset_config_python`.** The Python config package was renamed from its original name for consistency with the Dart package. + +## 28 February, 2025 + +### New + +- **`eval_config` Dart package.** New package with a layered Parser → Resolver → Writer architecture that converts dataset YAML into EvalSet JSON for the Python runner. Provides `ConfigResolver` facade plus direct access to `YamlParser`, `JsonParser`, `EvalSetResolver`, and `EvalSetWriter`. + +- **Dual-mode eval runner.** The Python runner now supports two invocation modes: + - `run-evals --json ./eval_set.json` — consume a JSON manifest produced by the Dart CLI + - `run-evals --task --model ` — run a single task directly from CLI arguments + +- **Generalized task functions.** Task implementations are now language-agnostic by default. Flutter-specific tasks (`flutter_bug_fix`, `flutter_code_gen`) are thin wrappers around the generic `bug_fix` and `code_gen` tasks. New tasks: `analyze_codebase`, `mcp_tool`, `skill_test`. 
+ +- **New Dart domain models.** `EvalSet`, `Task`, `Sample`, `Variant`, and `TaskInfo` models in the `models` package map directly to the Inspect AI evaluation structure. + +### Breaking Changes + +- **Removed Python `registries.py`.** Task/model/sandbox registries are removed. Task functions are now discovered dynamically via `importlib` (short names like `"flutter_code_gen"` resolve automatically). + +- **Removed `TaskConfig` and `SampleConfig`.** Replaced by `ParsedTask` (intermediate parsing type in `eval_config`) and `Sample` (Inspect AI domain model). + +- **Removed legacy Python config parsing.** The `config/parsers/` directory, `load_yaml` utility, and associated model definitions have been removed from `eval_runner`. Configuration is now handled by the Dart `eval_config` package. + +- **Models package reorganized.** Report-app models (used by the Flutter results viewer) moved to `models/lib/src/report_app/`. The top-level `models/lib/src/` now contains inspect-domain models. + +- **Dataset utilities moved.** `DatasetReader`, `filesystem_utils`, and discovery helpers moved from `eval_config` to `eval_cli`. + +## 25 February, 2025 + +### Breaking Changes + +- **Variant format changed from list to named map.** Job YAML files now define variants as a named map instead of a list. Tasks can optionally restrict applicable variants via `allowed_variants` in their `task.yaml`. + + **Before (list format):** + ```yaml + variants: + - baseline + - { mcp_servers: [dart] } + ``` + + **After (named map format):** + ```yaml + # job.yaml + variants: + baseline: {} + mcp_only: { mcp_servers: [dart] } + context_only: { context_files: [./context_files/flutter.md] } + full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] } + ``` + + ```yaml + # task.yaml (optional — omit to accept all job variants) + allowed_variants: [baseline, mcp_only] + ``` + +- **Removed `DEFAULT_VARIANTS` registry.** Variants are no longer defined globally in `registries.py`. 
Each job file defines its own variants. + +- **Removed `variants` from `JobTask`.** Per-task variant overrides (`job.tasks..variants`) are replaced by task-level `allowed_variants` whitelists. \ No newline at end of file diff --git a/docs/_static/custom.css b/docs/_static/custom.css index 9243a6b..a0d57ff 100644 --- a/docs/_static/custom.css +++ b/docs/_static/custom.css @@ -418,3 +418,22 @@ html[data-theme="dark"] .sig > span.pre:not(:first-child) { html[data-theme="dark"] .sig-paren { color: #888888; } + + +/* ============================================ + COLLAPSIBLE SIDEBARS ON WIDE SCREENS + ============================================ */ + +.bd-sidebar-primary { + padding-right: 30px; + width: auto !important; +} + + +.bd-sidebar-secondary { + width: auto !important; +} + +.bd-article-container { + max-width: none !important; +} \ No newline at end of file diff --git a/docs/guides/about_the_framework.md b/docs/guides/about_the_framework.md new file mode 100644 index 0000000..7eed13e --- /dev/null +++ b/docs/guides/about_the_framework.md @@ -0,0 +1,238 @@ +# About the framework + +You've been using built-in task functions like `bug_fix` and `question_answer`. +This page explains how they work — useful if you want to write custom eval logic +or just understand what happens when you run `devals run`. 
+ +--- + +## Architecture overview + +When you run an eval, data flows through four layers: + +``` +YAML config → Dart resolver → JSON manifest → Python runner → Inspect AI +``` + +| Layer | Package | What it does | +|-------|---------|-------------| +| **YAML config** | — | Your `task.yaml` and `job.yaml` files | +| **Dart resolver** | `dataset_config_dart` | Parses YAML, resolves globs and references, produces a JSON manifest | +| **Python runner** | `dash_evals` | Reads the manifest, builds Inspect AI `Task` objects, calls `eval_set()` | +| **Inspect AI** | `inspect_ai` | Runs solver chains, sends prompts, collects responses, runs scorers | + +The `devals` CLI (Dart) orchestrates the first two layers, then hands off to `run-evals` +(Python) for the last two. + +--- + +## The `dash_evals` package + +### Entry point + +The Python CLI entry point is `run-evals`, defined in +`dash_evals/main.py`. It supports two modes: + +```bash +# Mode 1: From a JSON manifest (what devals uses) +run-evals --json ./eval_set.json + +# Mode 2: Direct CLI arguments (what you used in Part 1) +run-evals --task question_answer --model google/gemini-2.0-flash --dataset samples.json +``` + +### JSON runner + +When using `--json` mode, `json_runner.py` does the heavy lifting: + +1. Reads the manifest file +2. For each task definition, resolves the task function by name +3. Builds an Inspect AI `MemoryDataset` from the inline samples +4. Calls the task function with the dataset and config +5. Collects all `Task` objects and calls `inspect_ai.eval_set()` + +### Task resolution + +The `func` field in your `task.yaml` is resolved to a Python function. 
Three +formats are supported: + +| Format | Example | How it resolves | +|--------|---------|----------------| +| **Short name** | `question_answer` | Looks up `dash_evals.runner.tasks.question_answer` | +| **Colon syntax** | `my_package.tasks:my_task` | Imports `my_package.tasks`, gets `my_task` | +| **Dotted path** | `my_package.tasks.my_task.my_task` | Last segment is the function name | + +Short names work for all built-in tasks. Use colon syntax or dotted paths for +custom tasks in external packages. + +--- + +## Anatomy of a task function + +Every task function follows the same pattern. Here's `question_answer` — +the simplest built-in task: + +```python +from inspect_ai import Task, task +from inspect_ai.dataset import Dataset +from inspect_ai.scorer import model_graded_fact +from inspect_ai.solver import chain_of_thought + +@task +def question_answer(dataset: Dataset, config: dict) -> Task: + system_msg = config.get("system_message") or DEFAULT_QA_SYSTEM_MESSAGE + + solver_chain = [ + add_system_message(system_msg), # 1. Set the system prompt + # context_injector(...) # 2. Inject context files (if variant has them) + chain_of_thought(), # 3. Ask for step-by-step reasoning + # generate() or react(tools=...) # 4. Get the model's response + ] + + return Task( + name=config["task_name"], + dataset=dataset, + solver=solver_chain, + scorer=model_graded_fact(), + time_limit=300, + ) +``` + +**Key ingredients:** + +| Part | Purpose | +|------|---------| +| `@task` | Decorator that registers this function with Inspect AI | +| `dataset` | An Inspect `Dataset` built from your samples | +| `config` | A dict with everything from the JSON manifest — variant, system_message, sandbox_type, etc. 
| +| **Solver chain** | A list of steps that process the prompt and generate a response | +| **Scorer** | Evaluates the model's output against the `target` | + +### Solver chain patterns + +Most tasks build their solver chain from shared helpers in `task_helpers.py`: + +```python +def _build_solver(config, system_msg): + chain = [add_system_message(system_msg)] + + # Inject context files from the variant + append_context_injection(chain, config) + + # Add chain-of-thought reasoning + chain.append(chain_of_thought()) + + # If the variant has MCP servers → use react() agent + # Otherwise → use plain generate() + append_model_interaction(chain, config) + + return chain +``` + +This means that variants automatically affect the solver chain — if a variant +defines `mcp_servers`, the task switches from a simple generate call to a +full ReAct agent loop with tool access. + +### Agentic vs. non-agentic tasks + +| Pattern | Tasks that use it | What happens | +|---------|-------------------|-------------| +| **Non-agentic** | `question_answer`, `code_gen` | System message → chain of thought → single generate | +| **Agentic** | `bug_fix`, `analyze_codebase`, `mcp_tool` | System message → ReAct loop with tools (bash, text editor, MCP) | + +Agentic tasks give the model tools (`bash_session()`, `text_editor()`, MCP servers) +and run in a `react()` loop where the model can take multiple actions before +calling `submit()`. 
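The switch between the two patterns above can be pictured with a small, dependency-free sketch. It is not the actual `append_model_interaction` implementation: the function name `choose_interaction` is hypothetical, and the assumption that the resolved variant sits under `config["variant"]` with an `mcp_servers` list is ours, not something this page confirms. It also only covers the variant-driven case; tasks such as `bug_fix` are described above as agentic regardless of the variant.

```python
from typing import Any


def choose_interaction(config: dict[str, Any]) -> str:
    """Name the interaction step a task would append to its solver chain.

    Hypothetical helper. Assumption: the resolved variant lives under
    config["variant"] and lists its MCP servers under "mcp_servers".
    """
    variant = config.get("variant") or {}
    # Any configured MCP server means the model needs tool access, so the
    # task would run a ReAct agent loop instead of a single generate call.
    return "react" if variant.get("mcp_servers") else "generate"


# Illustrative configs; the keys mirror the variant examples in the job YAML.
baseline = {"task_name": "qa", "variant": {}}
with_mcp = {
    "task_name": "qa",
    "variant": {
        "mcp_servers": [
            {"name": "dart", "command": "dart", "args": ["mcp-server"]}
        ]
    },
}

print(choose_interaction(baseline))
print(choose_interaction(with_mcp))
```

Returning a plain string keeps the sketch runnable without Inspect AI installed; a real helper would append the corresponding solver to the chain instead.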
+ +--- + +## Shared helpers + +The `task_helpers.py` module contains functions used across all tasks: + +| Helper | What it does | +|--------|-------------| +| `append_context_injection(chain, config)` | Adds a `context_injector` solver if the variant has `files` | +| `append_model_interaction(chain, config)` | Adds `react()` (if tools exist) or `generate()` (if not) | +| `get_skill_tool(config)` | Creates a skill tool if the variant has `skills` configured | +| `build_task_metadata(config)` | Builds the metadata dict for the `Task` object | +| `create_mcp_servers(configs, sandbox_type)` | Creates MCP server objects from variant config | +| `validate_sandbox_tools(config, tool_names)` | Checks that sandbox-requiring tools aren't used on local | + +These helpers mean that most of the variant logic (context injection, MCP tools, +skills) is handled **automatically**. You just need to define the core solver +pattern for your task. + +--- + +## Writing your own task + +1. **Create a file** at `packages/dash_evals/src/dash_evals/runner/tasks/your_task.py` + +2. **Write the task function:** + + ```python + from inspect_ai import Task, task + from inspect_ai.dataset import Dataset + from inspect_ai.scorer import model_graded_fact + + from .task_helpers import ( + append_context_injection, + append_model_interaction, + build_task_metadata, + ) + from ..solvers import add_system_message + + @task + def your_task(dataset: Dataset, config: dict) -> Task: + chain = [add_system_message("You are a helpful assistant.")] + append_context_injection(chain, config) + append_model_interaction(chain, config) + + return Task( + name=config["task_name"], + dataset=dataset, + solver=chain, + scorer=model_graded_fact(), + metadata=build_task_metadata(config), + ) + ``` + +3. **Export it** from `runner/tasks/__init__.py`: + + ```python + from .your_task import your_task + ``` + +4. 
**Reference it** in `task.yaml`: + + ```yaml + func: your_task + ``` + + That's it — the JSON runner resolves the short name automatically. + +--- + +## Built-in tasks + +| Task function | Type | What it evaluates | +|--------------|------|-------------------| +| `question_answer` | Non-agentic | Q&A knowledge and reasoning | +| `code_gen` | Non-agentic | Code generation with structured output | +| `flutter_code_gen` | Non-agentic | Flutter-specific code gen (wraps `code_gen`) | +| `bug_fix` | Agentic | Diagnosing and fixing bugs with bash + editor | +| `flutter_bug_fix` | Agentic | Flutter-specific bug fix (wraps `bug_fix`) | +| `analyze_codebase` | Agentic | Exploring and answering questions about code | +| `mcp_tool` | Agentic | Testing MCP tool usage | +| `skill_test` | Agentic | Testing skill file usage in sandboxes | + +--- + +## Further reading + +- {doc}`/reference/yaml_config` — complete field-by-field YAML reference +- {doc}`/reference/configuration_reference` — directory structure and examples +- {doc}`/reference/cli` — full CLI command reference +- [Inspect AI documentation](https://inspect.aisi.org.uk/) — the underlying + evaluation framework diff --git a/docs/guides/config.md b/docs/guides/config.md deleted file mode 100644 index aef6aba..0000000 --- a/docs/guides/config.md +++ /dev/null @@ -1,43 +0,0 @@ -# Config guide - -Evals uses a layered YAML configuration system. You define **what** to evaluate (tasks and samples), **how** to run it (jobs), and **where** code executes (sandboxes). The CLI resolves these files into a single manifest and hands it to the Python runner — so most of the time you're just editing YAML. - -This page walks through the main concepts and how they connect. - -## **Dataset** - -The Dataset is the collection of Tasks and Samples that are run through the python tool. A -Sample is, at a minimum, an input and target. These are essentially test cases. 
- -In evals, the definition of dataset is expanded to include all fixtures of running evals, and all of these definitions exist in the dataset directory of the github. - -| 🗒️ Note! The following diagrams provide a mental model. (They also provide a literal representation of how it works, but…) A lot of this is hidden from you, the user or sample author, so don’t let it overwhelm! | -| :---- | - -![A](/_static/images/evals-dataset.png) - -* **Samples** - individual eval case -* **Models** we run against -* **Variants** - Different configurations for the agent being evaluated, e.g. with Dart MCP, with or without skills, with and without rules files, and every combination of those things. -* **Tasks** - A task is a Python function entrypoint for one “type” of evals. For example, “question_answer”, “code_gen”, “mcp_create_project” are a few of the tasks we support. Each task generally takes a list of specific samples that are configured to run for that task. -* **Workspaces** (The codebase that the agent is tinkering with in an eval) -* **Sandbox definitions** (host machine, podman, docker) -* **Default runtime configurations** - -### **Tasks are the basic unit of defining eval runs.** - -![A](/_static/images/task.png) - -### **Job files are run configuration** - -![A](/_static/images/job.png) - -### **Then evals run based on that job file:** - -![A](/_static/images/eval-set.png) - -This means you care about job files and task files. Job files might look like this: - -- job/main.yaml (runs the whole thing) -- job/ci.yaml (a job that runs as part of ci) -- job/local_dev.yaml (a job that is .gitignored, used for quick iteration) diff --git a/docs/guides/configuring_jobs.md b/docs/guides/configuring_jobs.md new file mode 100644 index 0000000..c3def62 --- /dev/null +++ b/docs/guides/configuring_jobs.md @@ -0,0 +1,340 @@ +# Configure jobs + +In {doc}`Part 1 ` and {doc}`Part 2 ` you +wrote tasks and jobs by following a recipe. 
Now let's understand the full +configuration model so you can build your own from scratch. + +This page walks through every piece of the YAML configuration — building +each file up incrementally. + +--- + +## The three config files + +Everything lives under your `evals/` directory: + +| File | Purpose | +|------|---------| +| `tasks//task.yaml` | Defines **what** to evaluate — the task function, samples, workspace, prompt | +| `jobs/.yaml` | Defines **how** to run it — models, variants, filters, sandbox, limits | +| Context files (optional) | Markdown files injected into the model's prompt via variants | + +The `devals` CLI resolves these into a single JSON manifest and hands it to the +Python runner. Most of the time, you're just editing YAML. + +--- + +## Building a task.yaml + +Let's build a task file from scratch, adding one concept at a time. + +### Start minimal + +The only required field is the task function: + +```yaml +func: question_answer +``` + +This is enough to define a task, but it has no samples — nothing to evaluate yet. + +### Add a sample + +Samples go under `dataset.samples.inline`. Each sample needs at minimum an `id`, +`input` (the prompt), and `target` (grading criteria): + +```yaml +func: question_answer + +dataset: + samples: + inline: + - id: explain_null_safety + input: | + Explain Dart's sound null safety. How does it prevent + null reference errors at compile time? + target: | + Should explain nullable vs non-nullable types, the `?` + suffix, null-aware operators, and how the analyzer enforces + null checks at compile time. +``` + +### Add a system message + +A `system_message` customizes the prompt sent to the model before your sample input: + +```yaml +func: question_answer +system_message: | + You are an expert Dart developer. Answer questions with code + examples where appropriate. Be concise. + +dataset: + samples: + inline: + - id: explain_null_safety + # ... 
+``` + +### Add files and setup + +For agentic tasks that run code in a sandbox, use `files` and `setup`: + +```yaml +func: bug_fix + +# Copy a project into the sandbox — key = destination, value = source +files: + /workspace: ../../workspaces/my_dart_package +setup: "cd /workspace && dart pub get" + +dataset: + samples: + inline: + - id: fix_the_bug + input: | + The tests are failing. Find and fix the bug. + target: | + All tests should pass after the fix. +``` + +`files` and `setup` at the task level are **inherited by all samples**. A sample +can override them: + +```yaml +dataset: + samples: + inline: + - id: fix_the_bug + files: + /workspace: ./custom_project # overrides task-level files + setup: "cd /workspace && pub get" # overrides task-level setup + input: ... +``` + +> [!NOTE] +> File paths in `files` values are resolved **relative to the task directory**. +> Task-level `files` stack with sample-level `files` — on a key conflict, the +> sample wins. + +### Add metadata for filtering + +Samples can carry `metadata` with `tags` and `difficulty`. Jobs use these for filtering: + +```yaml +dataset: + samples: + inline: + - id: fix_the_bug + metadata: + difficulty: medium + tags: [dart, bug-fix, async] + input: ... + target: ... +``` + +### Use external sample files + +For large datasets, you can keep samples in separate files and reference them +with glob patterns: + +```yaml +func: question_answer + +dataset: + samples: + paths: + - samples/*.yaml # loads every .yaml in the samples/ subdirectory +``` + +Each external file contains a list of sample objects in the same format as +`dataset.samples.inline`. + +--- + +## Building a job.yaml + +Jobs control **how** tasks run. Let's build one up. 
+ +### Start with models and tasks + +The bare minimum — which models and which tasks: + +```yaml +models: + - google/gemini-2.5-flash + +tasks: + inline: + explain_null_safety: {} # run all samples with default settings +``` + +### Add variants + +Variants let you test the same task under different conditions. Each variant is a named +map — an empty map `{}` means "no extras" (the baseline): + +```yaml +models: + - google/gemini-2.5-flash + +variants: + baseline: {} + context_only: + files: [./context_files/dart_docs.md] + mcp_only: + mcp_servers: + - name: dart + command: dart + args: [mcp-server] + full: + files: [./context_files/dart_docs.md] + mcp_servers: + - name: dart + command: dart + args: [mcp-server] + +tasks: + inline: + explain_null_safety: {} +``` + +This produces 4 runs per sample (one per variant) × however many models you list. + +**Variant sub-fields:** + +| Field | What it does | +|-------|-------------| +| `files` | Context files injected into the prompt | +| `mcp_servers` | MCP tool servers the model can call (stdio, HTTP, or Python ref) | +| `skills` | Skill directories copied into the sandbox | +| `task_parameters` | Extra parameters merged into the task config at runtime | + +### Filter tasks and samples + +Use `task_filters` and `sample_filters` to select subsets by tag: + +```yaml +task_filters: + include_tags: [dart] # only tasks tagged "dart" + exclude_tags: [deprecated] # skip deprecated tasks + +sample_filters: + include_tags: [bug-fix] # only samples tagged "bug-fix" +``` + +- **`include_tags`** — an item must have *all* listed tags to be included +- **`exclude_tags`** — an item is excluded if it has *any* listed tag + +You can also filter per-task using `include-samples` and `exclude-samples`: + +```yaml +tasks: + inline: + fix_math_utils: + include-samples: [fix_factorial] # only run this sample + include-variants: [baseline] # only run this variant +``` + +### Add sandbox configuration + +For tasks that need container execution: + 
+```yaml +sandbox: podman # or "docker" +``` + +You can also pass additional sandbox parameters: + +```yaml +sandbox: + environment: podman + image_prefix: us-central1-docker.pkg.dev/my-project/repo/ +``` + +### Add Inspect AI parameters + +The `inspect_eval_arguments` section passes settings through to Inspect AI's +`eval_set()`: + +```yaml +inspect_eval_arguments: + retry_attempts: 20 + fail_on_error: 0.05 + log_level: info + + # Defaults applied to every task in this job + task_defaults: + time_limit: 600 + message_limit: 50 +``` + +--- + +## Putting it all together + +Here's a complete job file using everything above: + +```{code-block} yaml +--- +caption: evals/jobs/full_example.yaml +--- +models: + - google/gemini-2.5-flash + - anthropic/claude-sonnet-4-20250514 + +sandbox: podman +max_connections: 15 + +variants: + baseline: {} + context_only: + files: [./context_files/dart_docs.md] + with_mcp: + mcp_servers: + - name: dart + command: dart + args: [mcp-server] + +task_filters: + include_tags: [dart] + +tasks: + inline: + fix_math_utils: + exclude-variants: [with_mcp] # MCP not relevant for this task + dart_question_answer: {} + +inspect_eval_arguments: + retry_attempts: 10 + task_defaults: + time_limit: 300 + message_limit: 30 +``` + +This will run: + +- 2 models × 2 applicable variants × all matching samples in `fix_math_utils` +- 2 models × 3 variants × all matching samples in `dart_question_answer` + +--- + +## Summary + +| Concept | Where it lives | What it controls | +|---------|---------------|-----------------| +| **Task** | `tasks//task.yaml` | What to evaluate: function, prompt, workspace, samples | +| **Job** | `jobs/.yaml` | How to run: models, variants, filters, sandbox, limits | +| **Variant** | Inside job YAML | Different configurations for the agent being evaluated | +| **Sample** | Inside task YAML (or external files) | Individual test cases with input/target pairs | +| **Context file** | Referenced by variants | Extra information injected 
into the model's prompt | + +For the complete field-by-field reference, see {doc}`/reference/yaml_config`. + +--- + +## Next steps + +Now that you understand the configuration model, {doc}`Part 4 ` +shows how the `devals` CLI can **generate** most of this config for you — and +what you need to customize in the output. diff --git a/docs/guides/get_started.md b/docs/guides/get_started.md new file mode 100644 index 0000000..5250a96 --- /dev/null +++ b/docs/guides/get_started.md @@ -0,0 +1,226 @@ +# Install and run evals + +By the end of this page you'll have installed everything, +run an evaluation with Python and Inspect AI, and (optionally) seen how the +`devals` CLI wraps all of that into a single workflow. + +## Prerequisites + +| Tool | Details | Notes | +|------|---------|---------| +| [Dart SDK*](https://dart.dev/get-dart) | Ver. 3.10+ | Runs the `devals` CLI | +| [Python](https://www.python.org/) | Ver. 3.13+ | Runs the `dash_evals` evaluation runner | +| API keys | `GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY` | Requires at least one model provider key. | + +\*Dart isn't required. It powers the CLI, which assists in authoring various YAML files +and hides some complexity of the framework. However, the framework is entirely useable +without the CLI. + +--- + +## 1. Set up your project + +Create a directory for your evals work and set up a Python virtual environment: + +```bash +mkdir my-evals && cd my-evals +python3 -m venv .venv +source .venv/bin/activate +``` + +### Install `dash_evals` (Python — required) + +Install the evaluation runner and its config library from git: + +```bash +pip install "dash-evals @ git+https://github.com/flutter/evals.git#subdirectory=packages/dash_evals" +pip install "dataset-config-python @ git+https://github.com/flutter/evals.git#subdirectory=packages/dataset_config_python" +``` + +This gives you **`dash_evals`** — the runtime that drives +[Inspect AI](https://inspect.aisi.org.uk/) to run evaluations. 
Its CLI entry +point is `run-evals`. + +### Install `devals` CLI (Dart — optional) + +If you have the Dart SDK installed, you can also install the CLI, which automates some of the configuration and eval authoring. + +```bash +dart pub global activate devals --source git https://github.com/flutter/evals.git --git-path packages/devals_cli +``` + +**`devals`** resolves YAML configuration, scaffolds new tasks and jobs, and +wraps `run-evals` and `inspect view` into a single workflow. It reduces the +learning curve but is entirely optional — everything it does can be done with +vanilla Python and Inspect AI commands. + +## 2. Configure an API key + +```bash +export GEMINI_API_KEY=your_key_here +``` + +Set at least one provider key. You can also add it to a `.env` file in your +project directory — `dash_evals` loads it automatically. + +--- + +## 3. Create a minimal dataset + +The basic unit of evals is the [*sample*](). A sample is a single 'test case', which includes, at a minimum, a prompt (called 'input') and a target, which is used to grade the evaluation. + +Create a file called `my_first_sample.json` in your project directory: + +```{code-block} json +--- +caption: my-evals/my_first_sample.json +--- +[ + { + "input": "Explain the difference between `Future.then()` and `async/await` in Dart. When should you prefer one over the other?", + "target": "The answer should explain that both are mechanisms for handling asynchronous code; async/await is syntactic sugar over Futures. It should note that async/await is generally preferred for readability, while .then() can be useful for simple one-off transformations." + } +] +``` + +### 3.2 Run it + +```bash +run-evals \ + --task question_answer \ + --model google/gemini-2.5-flash \ + --dataset ./my_first_sample.json +``` + +This runs that one sample. Here's what just happened: + +1. `run-evals` loaded the `question_answer` [*task*]() function from `dash_evals`. 
A task + is a Python function that describes the logic required to run a sample. Tasks are the generic, + reusable logic that knows how to run your bespoke samples. Some other task examples are + [`generate_code`][] and [`bug_fix`][]. +2. Your dataset, or collection of samples, was passed to the task, which executed its *solver chain*, the + instructions given to the agent being evaluated. +3. Inspect AI drove the agent, collected the response, and scored it with a *scorer*. Scorers vary by task. + In this case, the scorer is [`model_graded_fact`][], a scorer provided by Inspect that asks a second agent + to compare the generated response to our target response. +4. Finally, a log file was written to `./logs/`. + +### 3.3 View the results + +Inspect AI ships with a robust log viewer. Launch it: + +```bash +inspect view +``` + +This opens a local web UI where you can browse the run, see the full +conversation transcript, and check how the response was scored. + +> [!TIP] +> `inspect view` finds logs in the current directory by default. Pass a path +> to point it elsewhere: `inspect view ./path/to/logs`. + +--- + +## 4. That's what `devals` wraps + +The commands above — `run-evals`, `inspect view` — are the raw building blocks. +The **`devals` CLI** wraps all of them, helps manage your runtime environment, +and manages the YAML configuration layer we've put on top of Inspect AI, +which replaces the Samples JSON and *many* configuration +options that would otherwise be passed in as CLI flags. All of the quality-of-life +improvements provided by the CLI are described in the [Using the CLI guide][]. + +Importantly, you can still use the YAML configuration layer without Dart and the CLI; +it's just less automated and requires you to write a bit of Python glue code. + +Let's try the `devals` workflow now. 
+ +As a reminder, the install script is: + +```bash +dart pub global activate devals --source git https://github.com/flutter/evals.git --git-path packages/devals_cli +``` + +### 4.1 Check your environment + +```bash +devals doctor +``` + +This verifies Dart, Python, `dash_evals`, API keys, and optional tools like +Podman and Flutter. Fix any errors it reports; warnings are safe to ignore for now. + +### 4.2 Initialize a dataset + +Run `devals init` from your project directory (the `my-evals` directory you +created in step 1): + +```bash +devals init +``` + +This creates: + +``` +my-evals/ +├── devals.yaml # marker file +└── evals/ + ├── tasks/ + │ └── get_started/ + │ └── task.yaml # starter task + sample + └── jobs/ + └── local_dev.yaml # ready-to-run job +``` + +The starter task uses the `analyze_codebase` task function — it asks the model +to explore your project and suggest an improvement. It's a good smoke test that +doesn't require a sandbox. + +### 4.3 Run the eval + +```bash +devals run local_dev +``` + +Behind the scenes, this: + +1. Resolves your YAML config (job + tasks + samples) into a JSON manifest +2. Passes the manifest to `run-evals` (the Python `dash_evals` runner) +3. `dash_evals` calls Inspect AI's `eval_set()`, which sends prompts, scores results, + and writes logs + +To preview the resolved config without making API calls: + +```bash +devals run local_dev --dry-run +``` + +### 4.4 View results + +```bash +devals view +``` + +This is the same Inspect AI log viewer from before, but `devals` automatically +finds your `logs/` directory based on `devals.yaml`. + +--- + +## Recap + +You've now seen the two layers of the system: + +| Layer | What it does | +|-------|-------------| +| **`dash_evals` + Inspect AI** | The engine. Runs tasks, sends prompts, scores responses. | +| **`devals` CLI** | The convenience layer. YAML config, scaffolding, log discovery. | + + +--- + +## Next steps + +You're set up and you've seen the framework in action. 
In +{doc}`Part 2 `, you'll author a more complex, agentic +evaluation from scratch. diff --git a/docs/guides/index.md b/docs/guides/index.md index 73e04a8..13e1f66 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -1,11 +1,13 @@ # Guides -Get started with evals — learn how to author and run your own evaluations. +Learn how to install and use the evals framework. ```{toctree} :maxdepth: 1 -quick_start -tutorial -config +get_started +write_your_first_eval +configuring_jobs +using_the_cli +about_the_framework ``` diff --git a/docs/guides/quick_start.md b/docs/guides/quick_start.md deleted file mode 100644 index ed93d0e..0000000 --- a/docs/guides/quick_start.md +++ /dev/null @@ -1,136 +0,0 @@ -# Get started - -A guide to using evals as a framework for the local development of your own evals. - -## Prerequisites - -| Tool | Version | Purpose | -|------|---------|---------| -| [Dart SDK](https://dart.dev/get-dart) | 3.10+ | Runs the `devals` CLI | -| [Python](https://www.python.org/) | 3.13+ | Runs the `dash_evals` runner | - -You'll also need an API key for at least one model provider (`GOOGLE_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`). - -## 1. Install the packages - -```bash -git clone https://github.com/flutter/evals.git && cd evals -python3 -m venv .venv -source .venv/bin/activate -pip install -e "packages/dash_evals[dev]" -pip install -e "packages/dataset_config_python[dev]" -dart pub global activate devals --source path packages/devals_cli -``` - -This installs two things: - -- **`devals`** (Dart) — the CLI you'll use for every command. It resolves YAML configuration into a JSON manifest and delegates execution. -- **`dash_evals`** (Python) — the runtime that receives the manifest and drives [Inspect AI](https://inspect.aisi.org.uk/)'s `eval_set()` to actually run evaluations. - -## 2. 
Check your environment - -```bash -devals doctor -``` - -This runs a series of prerequisite checks — Dart SDK, Python version, whether `dash_evals` is installed, API keys, and optional tools like Podman and Flutter. Fix any errors it reports before continuing; warnings are safe to ignore for now. - -## 3. Set up Podman (optional) - -If your evals use containerized execution (`sandbox_type: podman` in a job YAML), you need Podman installed and a container image built. You can skip this step for basic evals that run locally. - -**Install Podman** (macOS): - -```bash -brew install podman -podman machine init -podman machine start -``` - -**Build the Flutter sandbox image:** - -```bash -cd /examples/evals-dataset/evals/sandboxes/podman -podman build -t flutter-sandbox:latest . -``` - -This builds `localhost/flutter-sandbox:latest`, which includes Ubuntu 24.04 and the Flutter SDK. The build takes a few minutes. - -> **Tip:** To target a different Flutter channel, pass `--build-arg FLUTTER_CHANNEL=beta` (or `main`). - -## 4. Configure API keys - -Make sure you have at least one model provider API key set as an environment variable. You can set them in your shell profile or in a `.env` file in your project root. - -```bash -export GEMINI_API_KEY=your_key_here -``` - -## 5. Initialize your dataset - -Run `devals init` from the root of the project you want to evaluate. This is typically a Dart or Flutter project — the scaffolded starter task will point back at your project as its workspace. - -```bash -cd ~/my-flutter-app -devals init -``` - -This creates two things: - -- **`devals.yaml`** in your project root — a marker file that tells the CLI where your eval dataset lives (defaults to `./evals`). 
-- **`evals/`** directory with the following structure: - -``` -my-flutter-app/ -├── devals.yaml # ← marker file -└── evals/ - ├── tasks/ - │ └── get_started/ - │ └── task.yaml # starter task + sample - └── jobs/ - └── local_dev.yaml # job ready to run -``` - -The starter task uses the `analyze_codebase` task function, which asks the model to -explore your project and suggest an improvement. It's a good smoke-test that -doesn't require a sandbox or any extra setup. - - -## 6. Run your first eval - -```bash -devals run local_dev -``` - -Behind the scenes, this: - -1. Resolves your YAML config (job + tasks + samples) into an EvalSet JSON manifest -2. Passes the manifest to the Python `dash_evals` runner -3. `dash_evals` calls Inspect AI's `eval_set()`, which sends prompts, collects responses, and scores results -4. Logs are written to a `logs/` directory (a sibling of `evals/`) - -To preview the resolved configuration without actually making API calls: - -```bash -devals run local_dev --dry-run -``` - -This prints every task × model × variant combination that would execute, so you can verify your setup before spending API credits. - -## 7. View results - -```bash -devals view -``` - -This launches the [Inspect AI log viewer](https://inspect.aisi.org.uk/log-viewer.html) — a local web UI where you can browse runs, inspect individual samples, view scores, and read full conversation transcripts. It automatically finds your `logs/` directory based on `devals.yaml`. 
- ---- - -## Next steps - -- **Add more samples** — `devals create sample` -- **Add tasks** — `devals create task` -- **Create targeted jobs** — `devals create job` -- **Interactive walkthrough** — `devals create pipeline` guides you through creating a sample, task, and job in one go -- **[Follow the tutorial](tutorial.md)** — a hands-on walkthrough of authoring a code-generation task from scratch diff --git a/docs/guides/tutorial.md b/docs/guides/tutorial.md deleted file mode 100644 index fcf8b19..0000000 --- a/docs/guides/tutorial.md +++ /dev/null @@ -1,287 +0,0 @@ -# Author evals - -This tutorial picks up where [Get Started](quick_start.md) left off. -By the end, you'll have: - -1. Authored a task file with two **code-generation** samples -2. Created a job file that targets your new task -3. Run the job and watched Inspect AI execute it -4. Opened the Inspect log viewer to review results - -> [!NOTE] -> This guide assumes you've already completed the [Get Started](quick_start.md) guide and -> have a working `devals` installation with at least one model API key configured. - ---- - -## 1. Create the task - -A **task** tells the framework *what* to evaluate. Each task lives in its own subdirectory -under `evals/tasks/` and contains a `task.yaml` file. - -### 1.1 Set up a workspace - -Code-generation tasks need a **workspace** — a starter project the model writes code into -and where tests run. Create a minimal Dart package to use as a template: - -``` -evals/ -└── workspaces/ - └── dart_package/ - ├── pubspec.yaml - └── lib/ - └── main.dart -``` - -```{code-block} yaml ---- -caption: evals/workspaces/dart_package/pubspec.yaml ---- -name: dart_package_template -description: Minimal Dart package template -version: 1.0.0 -publish_to: none - -environment: - sdk: '>=3.0.0 <4.0.0' - -dev_dependencies: - test: ^1.24.0 -``` - -```{code-block} dart ---- -caption: evals/workspaces/dart_package/lib/main.dart ---- -// Starter file — the model will overwrite this. 
-``` - -> [!TIP] -> You can also point `workspace` at your existing project root, a Flutter app, -> or any directory that already has a `pubspec.yaml`. - -### 1.2 Write a test file - -Each sample can have its own test file that the scorer runs automatically. Create a -test for the first sample: - -``` -evals/ -└── tasks/ - └── dart_code_gen/ - ├── task.yaml ← (you'll create this next) - └── tests/ - └── fizzbuzz_test.dart -``` - -```{code-block} dart ---- -caption: evals/tasks/dart_code_gen/tests/fizzbuzz_test.dart ---- -import 'package:test/test.dart'; -import 'package:dart_package_template/main.dart'; - -void main() { - test('fizzBuzz returns correct values', () { - expect(fizzBuzz(3), 'Fizz'); - expect(fizzBuzz(5), 'Buzz'); - expect(fizzBuzz(15), 'FizzBuzz'); - expect(fizzBuzz(7), '7'); - }); - - test('fizzBuzz handles 1', () { - expect(fizzBuzz(1), '1'); - }); -} -``` - -### 1.3 Write the task.yaml - -Now create the task definition with two inline samples: - -```{code-block} yaml ---- -caption: evals/tasks/dart_code_gen/task.yaml ---- -# ============================================================ -# Task: Dart Code Generation -# ============================================================ -# Uses the built-in `code_gen` task function which: -# 1. Sends the prompt to the model -# 2. Parses the structured code response -# 3. Writes the code into the sandbox workspace -# 4. Runs tests and scores the result - -func: code_gen -workspace: ../../workspaces/dart_package - -samples: - inline: - # ── Sample 1: FizzBuzz ────────────────────────────────── - - id: fizzbuzz - difficulty: easy - tags: [dart, functions] - input: | - Write a top-level function called `fizzBuzz` that takes an - integer `n` and returns a String: - - "Fizz" if n is divisible by 3 - - "Buzz" if n is divisible by 5 - - "FizzBuzz" if divisible by both - - The number as a string otherwise - - Write the complete lib/main.dart file. 
- target: | - The code must define a top-level `String fizzBuzz(int n)` function - that returns the correct value for all cases. - It must pass the tests in test/. - tests: - path: ./tests/fizzbuzz_test.dart - - # ── Sample 2: Stack implementation ────────────────────── - - id: stack_class - difficulty: medium - tags: [dart, data-structures, classes] - input: | - Implement a generic Stack class in Dart with the - following methods: - - push(T item) — adds an item to the top - - T pop() — removes and returns the top item, - throws StateError if empty - - T peek() — returns the top item without removing it, - throws StateError if empty - - bool get isEmpty - - int get length - - Write the complete lib/main.dart file. - target: | - The code must define a generic Stack class with push, - pop, peek, isEmpty, and length. pop and peek must throw - StateError when the stack is empty. -``` - -**Key fields explained:** - -| Field | What it does | -|-------|-------------| -| `func` | The Python `@task` function that runs the evaluation. `code_gen` is a built-in generic code-generation task. | -| `workspace` | Path to the starter project (relative to the task directory). | -| `samples.inline` | A list of test cases, each with an `input` prompt and a `target` grading criteria. | -| `tests.path` | Path to test files the scorer runs against the generated code. | - -> [!NOTE] -See [Tasks](../reference/configuration_reference.md#task-files) and [Samples](../reference/configuration_reference.md#sample-files) for the -> complete field reference. - ---- - -## 2. Create the job - -A **job** controls *how* to run your tasks — which models to use, how many -connections, and which tasks/variants to include. 
- -Create `evals/jobs/tutorial.yaml`: - -```{code-block} yaml ---- -caption: evals/jobs/tutorial.yaml ---- -# ============================================================ -# Job: tutorial -# ============================================================ -# A focused job for the tutorial walkthrough. - -# Which model(s) to evaluate -models: - - google/gemini-2.5-flash - -# Only run the code-gen task we just created -tasks: - inline: - dart_code_gen: {} -``` - -That's the minimal job — it will: - -- Evaluate `google/gemini-2.5-flash` -- Run every sample in the `dart_code_gen` task -- Use the default `baseline` variant (no extra tools or context) - -> [!TIP] -> You can add **variants** to test the model with additional context or tools. -> For example: -> ```yaml -> variants: -> baseline: {} -> with_context: -> context_files: [./context_files/dart_docs.md] -> ``` -> See [Configuration Overview](../reference/configuration_reference.md#variants) for details. - ---- - -## 3. Run the job - -Make sure you're in your project directory (the one containing `devals.yaml`), then run: - -```bash -devals run tutorial -``` - -What happens behind the scenes: - -1. The Dart `dataset_config_dart` package resolves your YAML into an EvalSet JSON manifest -2. The Python `dash_evals` reads the manifest and calls Inspect AI's `eval_set()` -3. Inspect AI creates a sandbox, sets up the workspace, sends prompts, runs tests, and scores results -4. Logs are written to the `logs/` directory - -### Dry run first - -To preview the resolved configuration without making any API calls: - -```bash -devals run tutorial --dry-run -``` - -This prints a summary of every task × model × variant combination that would -execute, so you can verify everything looks right before spending API credits. - -### What to expect - -When the eval runs, you'll see Inspect AI's interactive terminal display showing -progress for each sample. 
A typical run with two samples against one model takes -1–3 minutes, depending on the model's response time. - ---- - -## 4. View the results - -After the run completes, launch the Inspect AI log viewer: - -```bash -devals view -``` - -This opens a local web UI (powered by Inspect AI) where you can: - -- **Browse runs** — see each task × model × variant combination -- **Inspect samples** — view the model's generated code, scores, and any test output -- **Compare variants** — if you defined multiple variants, compare how they performed side-by-side - -The viewer automatically points at your `logs/` directory. To view logs from a -specific directory: - -```bash -devals view path/to/logs -``` - ---- - -## Next steps - -Now that you've run your first custom evaluation, here are some things to try: - -- **Add more samples** to your task: `devals create sample` -- **Try different task types** — `question_answer`, `bug_fix`, or `flutter_code_gen`. See [all available task functions](../contributing/packages/dash_evals.md). -- **Add variants** to test how context files or MCP tools affect performance. See [Variants](config/about.md#variants). -- **Run multiple models** by adding more entries to the `models` list in your job file -- **Read the config reference** for [Jobs](../reference/configuration_reference.md#job-files), [Tasks](../reference/configuration_reference.md#task-files), and [Samples](../reference/configuration_reference.md#sample-files) \ No newline at end of file diff --git a/docs/guides/using_the_cli.md b/docs/guides/using_the_cli.md new file mode 100644 index 0000000..b105d91 --- /dev/null +++ b/docs/guides/using_the_cli.md @@ -0,0 +1,216 @@ +# Use the CLI + +You've written tasks and jobs by hand. The `devals` CLI can generate most of +that configuration for you — this page shows how, and what you'll want to +customize afterward. 
+ +--- + +## Scaffolding commands + +### `devals init` + +Initializes a fresh project for evals: + +```bash +cd ~/my-project +devals init +``` + +**What it creates:** + +``` +my-project/ +├── devals.yaml # marker file +└── evals/ + ├── tasks/ + │ └── get_started/ + │ └── task.yaml # starter task + └── jobs/ + └── local_dev.yaml # ready-to-run job +``` + +**What to customize:** + +- The starter task uses `func: analyze_codebase` — fine for a smoke test, but + you'll want to change `func` to match your eval type (`question_answer`, + `bug_fix`, `code_gen`, etc.) +- The job defaults to `google/gemini-2.0-flash`. Update `models:` to the + provider(s) you want to test. +- `files` points at `../../` (your project root). Update if your workspace + lives elsewhere. + +### `devals create pipeline` + +An interactive walkthrough that creates a sample, task, and job in one go. +Great for first-timers: + +```bash +devals create pipeline +``` + +It prompts you for: +1. A sample ID and prompt +2. Which task function to use +3. A job name and model selection + +The result is a fully wired-up set of YAML files ready to `devals run`. 
+ +### `devals create task` + +Creates a new task directory with a starter `task.yaml`: + +```bash +devals create task +``` + +**Prompts for:** +- Task ID (becomes the directory name under `tasks/`) +- Task function (selected from the Python registry) +- Optional system message + +**What to customize after:** +- Add your `samples` — the generated file is a skeleton +- Add `files` and `setup` if your task needs a workspace +- Add `metadata` with tags for filtering + +### `devals create sample` + +Adds a new sample interactively: + +```bash +devals create sample +``` + +**Prompts for:** +- Sample ID (snake_case) +- Difficulty level +- Whether a workspace is needed + +**What to customize after:** +- Write a specific `input` prompt — the generated placeholder is generic +- Write grading criteria in `target` +- Add `metadata.tags` for filtering + +### `devals create job` + +Creates a new job YAML file: + +```bash +devals create job +``` + +**Prompts for:** +- Job name +- Which models, variants, and tasks to include + +**What to customize after:** +- Add or refine `variants` — the generated file may only include `baseline: {}` +- Add `task_filters` or `sample_filters` if you want to target a subset +- Configure `inspect_eval_arguments` for retry, timeout, and limit settings + +--- + +## Running evals + +### Basic run + +```bash +devals run +``` + +The CLI: +1. Reads `devals.yaml` to find the `evals/` directory +2. Resolves your YAML config into a JSON manifest +3. Passes the manifest to `run-evals` (the Python `dash_evals` runner) +4. `dash_evals` calls Inspect AI's `eval_set()` +5. Logs are written to `logs/` + +### Dry run + +Preview the resolved configuration without making API calls: + +```bash +devals run --dry-run +``` + +This prints every task × model × variant combination that would execute. +Use it to verify your setup before spending API credits. + +> [!TIP] +> Always dry-run after editing YAML config. 
It catches typos, missing files, +> and bad task references before they cost you money. + +--- + +## Viewing results + +```bash +devals view +``` + +Launches the [Inspect AI log viewer](https://inspect.aisi.org.uk/log-viewer.html) +— a local web UI. `devals` automatically finds your `logs/` directory from +`devals.yaml`. + +To view logs from a specific location: + +```bash +devals view /path/to/logs +``` + +**What to look for in the viewer:** + +| Section | What it shows | +|---------|--------------| +| **Runs** | Each task × model × variant combination | +| **Transcript** | The full conversation, including every tool call | +| **Score** | Pass/fail, model-graded scores, test results | +| **Metadata** | Timing, token usage, cost | + +--- + +## Troubleshooting + +### `devals doctor` + +Checks all prerequisites: + +```bash +devals doctor +``` + +It verifies: +- **Dart SDK** — required for the CLI itself +- **Python 3.13+** — required for `dash_evals` +- **`dash_evals`** — the Python evaluation package +- **Podman/Docker** — container runtime for sandboxed tasks +- **Flutter SDK** — needed for Flutter-based eval tasks +- **API Keys** — checks for configured provider keys + +Fix any errors before running evals. Warnings (like a missing Flutter SDK) +are safe to ignore if your evals don't need that tool. + +--- + +## Quick reference + +| Command | What it does | +|---------|-------------| +| `devals init` | Initialize a new dataset in the current directory | +| `devals doctor` | Check prerequisites | +| `devals create pipeline` | Interactive walkthrough: sample → task → job | +| `devals create task` | Create a new task directory | +| `devals create sample` | Create a new sample | +| `devals create job` | Create a new job file | +| `devals run ` | Run an evaluation | +| `devals run --dry-run` | Preview without executing | +| `devals view [path]` | Launch the Inspect AI log viewer | + +--- + +## Next steps + +You now know the full CLI workflow. 
{doc}`Part 5 ` looks +under the hood at the `dash_evals` Python package — useful if you ever want +to write custom task logic. \ No newline at end of file diff --git a/docs/guides/write_your_first_eval.md b/docs/guides/write_your_first_eval.md new file mode 100644 index 0000000..36e22fd --- /dev/null +++ b/docs/guides/write_your_first_eval.md @@ -0,0 +1,324 @@ +# Author your first eval + +In {doc}`Part 1 ` you installed the tools and ran a pre-built eval. +Now you'll write one from scratch — an **agentic** evaluation where the model +explores a codebase, diagnoses a bug, and fixes it. + +By the end of this page you'll have: + +1. Created a workspace with a deliberate bug +2. Written a task file that uses the `bug_fix` task function +3. Run the eval and reviewed the model's fix +4. Added a **variant** to see how extra context changes the result + +> [!NOTE] +> This guide assumes you've completed {doc}`Part 1 ` and have +> a working installation with at least one model API key configured. + +--- + +## 1. Set up a workspace + +Agentic tasks need a **workspace** — a project that gets copied into a sandbox +for the model to work with. Let's create a small Dart package with a deliberate bug. + +Inside your project (the directory with `devals.yaml`), create: + +``` +evals/ +└── workspaces/ + └── buggy_dart_package/ + ├── pubspec.yaml + ├── lib/ + │ └── math_utils.dart + └── test/ + └── math_utils_test.dart +``` + +```{code-block} yaml +--- +caption: evals/workspaces/buggy_dart_package/pubspec.yaml +--- +name: buggy_dart_package +description: A Dart package with a deliberate bug for eval testing. +version: 1.0.0 +publish_to: none + +environment: + sdk: '>=3.0.0 <4.0.0' + +dev_dependencies: + test: ^1.24.0 +``` + +```{code-block} dart +--- +caption: evals/workspaces/buggy_dart_package/lib/math_utils.dart +--- +/// Returns the factorial of [n]. +/// +/// Throws [ArgumentError] if [n] is negative. 
+int factorial(int n) { + if (n < 0) throw ArgumentError('n must be non-negative'); + if (n <= 1) return 1; + // BUG: should be n * factorial(n - 1) + return n + factorial(n - 1); +} + +/// Returns true if [n] is a prime number. +bool isPrime(int n) { + if (n < 2) return false; + for (var i = 2; i * i <= n; i++) { + if (n % i == 0) return false; + } + return true; +} +``` + +```{code-block} dart +--- +caption: evals/workspaces/buggy_dart_package/test/math_utils_test.dart +--- +import 'package:test/test.dart'; +import 'package:buggy_dart_package/math_utils.dart'; + +void main() { + group('factorial', () { + test('factorial(0) = 1', () => expect(factorial(0), 1)); + test('factorial(1) = 1', () => expect(factorial(1), 1)); + test('factorial(5) = 120', () => expect(factorial(5), 120)); + test('factorial(10) = 3628800', () => expect(factorial(10), 3628800)); + test('negative throws', () { + expect(() => factorial(-1), throwsArgumentError); + }); + }); + + group('isPrime', () { + test('2 is prime', () => expect(isPrime(2), true)); + test('4 is not prime', () => expect(isPrime(4), false)); + test('17 is prime', () => expect(isPrime(17), true)); + }); +} +``` + +The bug is in `factorial` — it uses `+` instead of `*`. The tests will catch it. + +--- + +## 2. Write the task + +Create a task directory with a `task.yaml`: + +``` +evals/ +└── tasks/ + └── fix_math_utils/ + └── task.yaml +``` + +```{code-block} yaml +--- +caption: evals/tasks/fix_math_utils/task.yaml +--- +# Task: Fix a buggy Dart package +# +# Uses the built-in `bug_fix` task function, which: +# 1. Copies the workspace into a sandbox +# 2. Gives the model bash and text-editor access +# 3. Lets it explore, edit, and test until it calls submit() +# 4. 
Scores based on test results and code quality + +func: bug_fix + +# Copy the workspace into /workspace in the sandbox +files: + /workspace: ../../workspaces/buggy_dart_package +setup: "cd /workspace && dart pub get" + +dataset: + samples: + inline: + - id: fix_factorial + metadata: + difficulty: easy + tags: [dart, math, bug-fix] + input: | + The `factorial` function in `lib/math_utils.dart` is returning + wrong values. Tests are failing. Find and fix the bug. + + Run the tests with `dart test` to verify your fix. + target: | + The fix should change the `+` operator to `*` in the factorial + function's recursive case. All tests should pass after the fix. +``` + +**What's new here compared to Part 1:** + +| Field | What it does | +|-------|-------------| +| `func: bug_fix` | An *agentic* task. The model gets `bash_session()` and `text_editor()` tools and runs in a `react()` loop — it can explore, edit, and test code autonomously. | +| `files` | Copies a local directory into the sandbox filesystem. The key (`/workspace`) is the destination path inside the sandbox. | +| `setup` | A shell command run *before* the model gets control. Use it to install dependencies. | + +> [!IMPORTANT] +> The `bug_fix` task requires a container sandbox (Docker or Podman) because +> `bash_session()` and `text_editor()` inject helper scripts that only work on +> Linux. We'll configure this in the job file. + +--- + +## 3. Create a job + +```{code-block} yaml +--- +caption: evals/jobs/tutorial_bugfix.yaml +--- +# Job: tutorial bug fix +# +# Runs our fix_math_utils task in a Podman sandbox. + +models: + - google/gemini-2.5-flash + +sandbox: podman + +tasks: + inline: + fix_math_utils: {} +``` + +If you don't have Podman set up yet: + +```bash +brew install podman +podman machine init +podman machine start +``` + +> [!TIP] +> If you'd rather use Docker, change `sandbox: podman` to `sandbox: docker`. +> The task functions work identically with either runtime. + +--- + +## 4. 
Run it + +Dry run first to check your config: + +```bash +devals run tutorial_bugfix --dry-run +``` + +Then run for real: + +```bash +devals run tutorial_bugfix +``` + +The `bug_fix` task uses a ReAct agent loop. You'll see the model: + +1. Explore the project structure (`ls`, `cat`) +2. Read the failing test output (`dart test`) +3. Edit `math_utils.dart` to fix the bug +4. Re-run tests to verify the fix +5. Call `submit()` with an explanation + +A typical run takes 1–3 minutes. + +--- + +## 5. View results + +```bash +devals view +``` + +In the Inspect log viewer, open the run and look at: + +- **Transcript** — the full conversation, including every tool call the model made +- **Score** — whether the fix passed `dart analyze` and `dart test` +- **Metadata** — timing, token usage, and tool call counts + +--- + +## 6. Add a variant + +What if we gave the model some context about Dart best practices? Would it +produce a better fix, or fix it faster? **Variants** let you test this. + +First, create a context file: + +```{code-block} markdown +--- +caption: evals/context_files/dart_best_practices.md +--- +--- +title: "Dart Best Practices" +version: "1.0.0" +description: "Common Dart patterns and debugging tips" +--- + +## Debugging Tips + +- Always run `dart test` after making changes to verify your fix. +- Use `dart analyze` to catch static errors. +- Read test expectations carefully — they tell you what the correct behavior should be. +- Check operator precedence when arithmetic results look wrong. 
+``` + +Now update your job to define two variants: + +```{code-block} yaml +--- +caption: evals/jobs/tutorial_bugfix.yaml (updated) +--- +models: + - google/gemini-2.5-flash + +sandbox: podman + +# Test with and without context +variants: + baseline: {} + with_context: + files: [./context_files/dart_best_practices.md] + +tasks: + inline: + fix_math_utils: {} +``` + +Run again: + +```bash +devals run tutorial_bugfix +``` + +This time, the framework runs *two* evaluations: + +- `fix_math_utils` × `baseline` — no extra context +- `fix_math_utils` × `with_context` — the context file is injected into the prompt + +In `devals view`, you can compare the two runs side by side. Did the context +help? Did the model find the bug faster? + +--- + +## Recap + +You've now written an agentic eval from scratch. Here's what you learned: + +| Concept | What it means | +|---------|---------------| +| **Workspace** | A project directory copied into the sandbox for the model to work with | +| **`files` + `setup`** | How to get code into the sandbox and prepare it | +| **`bug_fix` (agentic task)** | A task where the model gets tools and runs in a ReAct loop | +| **Variants** | Different configurations for the *same* task — great for A/B testing | + +--- + +## Next steps + +Now that you've written tasks and jobs by hand, {doc}`Part 3 ` +dives deeper into the configuration model — every field in `task.yaml` and +`job.yaml`, and how they all fit together. diff --git a/docs/reference/configuration_reference.md b/docs/reference/configuration_reference.md index deb2193..288caf5 100644 --- a/docs/reference/configuration_reference.md +++ b/docs/reference/configuration_reference.md @@ -8,10 +8,12 @@ The evaluation framework uses the `eval/` directory as its entry point. 
It conta - Task definitions autodiscovered from `tasks/*/task.yaml` - Job files in `jobs/` that control what to run -- Shared resources (context files, sandboxes, workspaces) +- Shared resources (context files, sandboxes) Configuration is parsed and resolved by the Dart `dataset_config_dart` package, which produces an EvalSet JSON manifest consumed by the Python `dash_evals`. +> **See also:** [YAML Configuration Fields](yaml_config.md) for a complete field-by-field reference with Dart and Python cross-references. + ## Directory Structure ``` @@ -22,7 +24,7 @@ eval/ ├── tasks/ # Task definitions (autodiscovered) │ ├── flutter_bug_fix/ │ │ ├── task.yaml # Task config with inline samples -│ │ └── project/ # Workspace files (if applicable) +│ │ └── project/ # Project files (if applicable) │ ├── dart_question_answer/ │ │ └── task.yaml │ └── generate_flutter_app/ @@ -30,14 +32,10 @@ eval/ │ └── todo_tests/ # Test files for a sample ├── context_files/ # Context files injected into prompts │ └── flutter.md -├── sandboxes/ # Container configurations -│ └── podman/ -│ ├── Containerfile -│ └── compose.yaml -└── workspaces/ # Reusable project templates - ├── dart_package/ - ├── flutter_app/ - └── jaspr_app/ +└── sandboxes/ # Container configurations + └── podman/ + ├── Containerfile + └── compose.yaml ``` --- @@ -52,131 +50,53 @@ func: flutter_bug_fix system_message: | You are an expert Flutter developer. Fix the bug and explain your changes. -# Task-level workspace (inherited by all samples) -workspace: - path: ./project - -# Task-level tests (inherited by all samples) -tests: - path: ./tests - -# Restrict which job-level variants apply to this task (optional) -allowed_variants: [baseline, mcp_only] - -samples: - inline: - - id: flutter_bloc_cart_mutation_001 - difficulty: medium - tags: [bloc, state] - input: | - Fix the bug where adding items to cart doesn't update the total. - target: | - The fix should modify the BLoC to emit a new state instead of mutating. 
- - - id: navigation_crash - difficulty: hard - tags: [navigation] - workspace: - path: ./nav_project # Override task-level workspace - input: | - Fix the crash when navigating back from the detail screen. - target: | - The fix should handle the disposed controller properly. +# Task-level files copied into sandbox (inherited by all samples) +files: + /workspace: ./project +setup: "cd /workspace && flutter pub get" + +dataset: + samples: + inline: + - id: flutter_bloc_cart_mutation_001 + input: | + Fix the bug where adding items to cart doesn't update the total. + target: | + The fix should modify the BLoC to emit a new state instead of mutating. + metadata: + difficulty: medium + tags: [bloc, state] + + - id: navigation_crash + files: + /workspace: ./nav_project # Override task-level files + input: | + Fix the crash when navigating back from the detail screen. + target: | + The fix should handle the disposed controller properly. + metadata: + difficulty: hard + tags: [navigation] ``` -### Task-Level Fields - -#### Core Fields - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `func` | string | Yes | Name of the `@task` function (resolved dynamically via `importlib`) | -| `description` | string | No | Human-readable description | -| `samples` | object | Yes | Samples config with `inline` and/or `paths` keys | -| `allowed_variants` | list | No | Whitelist of variant names this task accepts (omit to accept all) | -| `system_message` | string | No | Custom system prompt for this task | -| `workspace` | object | No | Default workspace for all samples | -| `tests` | object | No | Default test files for all samples | - -#### Inspect AI Task Parameters - -These map directly to [Inspect AI's `Task` constructor](https://inspect.aisi.org.uk/reference/inspect_ai.html#task). All are optional and override any `task_defaults` set in the job file. 
- -| Field | Type | Description | -|-------|------|-------------| -| `model` | string | Default model for this task (overrides the eval model) | -| `config` | object | Model generation config (e.g., `{temperature: 0.2, max_tokens: 4096}`) | -| `model_roles` | object | Named roles for use in `get_model()` | -| `sandbox` | string/object | Sandbox environment type or `[type, config_path]` | -| `approval` | string/object | Tool use approval policies | -| `epochs` | int/object | Number of times to repeat each sample (optionally with score reducer) | -| `fail_on_error` | number/bool | `true` = fail on first error, `0.0–1.0` = fail if proportion exceeds threshold | -| `continue_on_fail` | bool | Continue running if `fail_on_error` condition is met | -| `message_limit` | int | Max total messages per sample | -| `token_limit` | int | Max total tokens per sample | -| `time_limit` | int | Max clock time (seconds) per sample | -| `working_limit` | int | Max working time (seconds) per sample (excludes wait time) | -| `cost_limit` | float | Max cost (dollars) per sample | -| `early_stopping` | string/object | Early stopping callbacks | -| `display_name` | string | Task display name (e.g., for plotting) | -| `version` | int | Version of task spec (to distinguish evolutions) | -| `metadata` | object | Additional metadata to associate with the task | - -### Samples Object - -| Field | Type | Description | -|-------|------|-------------| -| `inline` | list | Inline sample definitions | -| `paths` | list | Glob patterns for external sample YAML files (relative to task dir) | - -### Sample Fields (inline in task.yaml) - -#### Core Fields - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `id` | string | Yes | Unique sample identifier | -| `input` | string | Yes | The prompt given to the model | -| `target` | string | Yes | Expected output or grading criteria | -| `difficulty` | string | No | `easy`, `medium`, or `hard` | -| `tags` | list | No | 
Categories for filtering | -| `system_message` | string | No | Override system prompt for this sample | -| `metadata` | object | No | Arbitrary metadata | -| `workspace` | object | No | Override task-level workspace | -| `tests` | object | No | Override task-level tests | - -#### Inspect AI Sample Parameters - -These map directly to [Inspect AI's `Sample`](https://inspect.aisi.org.uk/reference/inspect_ai.dataset.html#sample). +For the complete list of task fields (including Inspect AI `Task` parameters), see the [Task fields table](yaml_config.md#task). -| Field | Type | Description | -|-------|------|-------------| -| `choices` | list | Answer choices for multiple-choice evaluations | -| `sandbox` | string/object | Override sandbox environment for this sample | -| `files` | object | Files to copy into the sandbox (`{destination: source}`) | -| `setup` | string | Setup script to run in the sandbox before evaluation | - -### Workspace/Tests References +### Files and Setup ```yaml -# Reference a reusable template -workspace: - template: flutter_app - -# Reference a path relative to task directory -workspace: - path: ./project - -# Clone from git -workspace: - git: https://github.com/example/repo.git - -# Shorthand (equivalent to path:) -workspace: ./project +# Copy a local directory into the sandbox +files: + /workspace: ./project +setup: "cd /workspace && flutter pub get" + +# Copy individual files +files: + /workspace/lib/main.dart: ./fixtures/main.dart + /workspace/test/widget_test.dart: ./fixtures/test.dart ``` > [!NOTE] -> Paths in `workspace` and `tests` are resolved **relative to the task directory** (e.g., `tasks/flutter_bug_fix/`). +> Paths in `files` values are resolved **relative to the task directory** (e.g., `tasks/flutter_bug_fix/`). Task-level `files` and `setup` are inherited by all samples. Sample-level `files` stack on top (sample wins on key conflict). Sample-level `setup` overrides task-level `setup`. 
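The inheritance rules in the note above can be sketched in a few lines of Python. This is only an illustration of the stated behavior (task-level `files` are layered under sample-level `files`, and a sample-level `setup` replaces the task-level one); the helper name `resolve_sample_sandbox` and its signature are hypothetical and are not taken from `dataset_config_dart` or `dash_evals`.

```python
from __future__ import annotations

from typing import Optional


def resolve_sample_sandbox(
    task_files: Optional[dict[str, str]],
    task_setup: Optional[str],
    sample_files: Optional[dict[str, str]] = None,
    sample_setup: Optional[str] = None,
) -> tuple[dict[str, str], Optional[str]]:
    """Hypothetical helper: combine task-level and sample-level settings."""
    # Start from the task-level {destination: source} map, then layer the
    # sample-level map on top, so the sample wins on a destination conflict.
    files = {**(task_files or {}), **(sample_files or {})}
    # A sample-level setup command replaces the task-level one outright;
    # otherwise the task-level command is inherited.
    setup = sample_setup if sample_setup is not None else task_setup
    return files, setup


# Mirrors the `navigation_crash` sample: it overrides `/workspace`
# but inherits the task-level setup command.
files, setup = resolve_sample_sandbox(
    task_files={"/workspace": "./project"},
    task_setup="cd /workspace && flutter pub get",
    sample_files={"/workspace": "./nav_project"},
)
print(files)
print(setup)
```

Whether the real resolver treats an empty sample-level `setup` string as an override is not stated here; the sketch assumes only an omitted value falls back to the task level.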
--- @@ -186,49 +106,23 @@ A sample is a single test case containing an input prompt, expected output (grad ```yaml # Inline in task.yaml -samples: - inline: - - id: dart_async_await_001 - difficulty: medium - tags: [async, dart] - input: | - Explain the difference between Future.then() and async/await in Dart. - target: | - The answer should cover both approaches, explain that they are - functionally equivalent, and note when each is preferred. - metadata: - added: 2025-02-04 - category: language_fundamentals +dataset: + samples: + inline: + - id: dart_async_await_001 + input: | + Explain the difference between Future.then() and async/await in Dart. + target: | + The answer should cover both approaches, explain that they are + functionally equivalent, and note when each is preferred. + metadata: + difficulty: medium + tags: [async, dart] + added: 2025-02-04 + category: language_fundamentals ``` ---- - -### Core Fields - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `id` | string | Yes | Unique sample identifier | -| `input` | string | Yes | The prompt given to the model | -| `target` | string | Yes | Expected output or grading criteria | -| `difficulty` | string | No | `easy`, `medium`, or `hard` | -| `tags` | list | No | Categories for filtering | -| `system_message` | string | No | Override system prompt for this sample | -| `metadata` | object | No | Arbitrary metadata | -| `workspace` | object | No | Override task-level workspace | -| `tests` | object | No | Override task-level tests | - ---- - -### Inspect AI Sample Parameters - -These map directly to [Inspect AI's `Sample`](https://inspect.aisi.org.uk/reference/inspect_ai.dataset.html#sample). 
- -| Field | Type | Description | -|-------|------|-------------| -| `choices` | list | Answer choices for multiple-choice evaluations | -| `sandbox` | string/object | Override sandbox environment for this sample | -| `files` | object | Files to copy into the sandbox (`{destination: source}`) | -| `setup` | string | Setup script to run in the sandbox before evaluation | +For the complete list of sample fields, see the [Sample fields table](yaml_config.md#sample). ### Multiple Choice Example @@ -256,33 +150,6 @@ These map directly to [Inspect AI's `Sample`](https://inspect.aisi.org.uk/refere setup: "cd /workspace && flutter pub get" ``` ---- - -### Workspace & Tests References - -Workspaces and test paths can be specified at task level (inherited by all samples) or per-sample (overrides task level). - -```yaml -# Reference a reusable template -workspace: - template: flutter_app - -# Reference a path relative to task directory -workspace: - path: ./project - -# Clone from git -workspace: - git: https://github.com/example/repo.git - -# Shorthand (equivalent to path:) -workspace: ./project -``` - -> [!NOTE] -> Paths in `workspace` and `tests` are resolved **relative to the task directory** (e.g., `tasks/flutter_bug_fix/`). - - --- ## Job files @@ -293,15 +160,17 @@ Job files define **what to run** and can **override built-in runtime defaults**. # jobs/local_dev.yaml name: local_dev +# Sandbox configuration (string shorthand or object) +sandbox: + environment: podman + # Override runtime defaults -sandbox_type: podman max_connections: 15 -max_retries: 10 # Save the agent's final workspace output to logs//examples/ # save_examples: true -# Filter what to run (optional - omit to run all) +# Filter what to run (required) models: - google/gemini-2.5-flash @@ -309,148 +178,54 @@ models: # Each key is a variant name; the value is the variant configuration. 
variants: baseline: {} - context_only: { context_files: [./context_files/flutter.md] } - mcp_only: { mcp_servers: [dart] } - full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] } - -# Inspect AI eval_set() parameters (all optional) -retry_attempts: 20 -fail_on_error: 0.05 -log_level: info -tags: [nightly] - -# Default Task-level overrides applied to every task -task_defaults: - time_limit: 600 - message_limit: 50 - -# Additional eval_set() parameters not covered above -# eval_set_overrides: -# bundle_dir: ./bundle -# log_images: true + context_only: { files: [./context_files/flutter.md] } + mcp_only: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } + full: { files: [./context_files/flutter.md], mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } + +# Inspect AI eval_set() parameters (all optional, nested under inspect_eval_arguments) +inspect_eval_arguments: + retry_attempts: 20 + fail_on_error: 0.05 + log_level: info + tags: [nightly] + + # Default Task-level overrides applied to every task + task_defaults: + time_limit: 600 + message_limit: 50 + + # Additional eval_set() parameters not covered above + # eval_set_overrides: + # bundle_dir: ./bundle + # log_images: true ``` - -### Core Job Fields - -| Field | Type | Description | -|-------|------|-------------| -| `logs_dir` | string | Override logs directory (default: `../logs`) | -| `sandbox_type` | string | Sandbox type: `local`, `docker`, or `podman` (default: `local`) | -| `max_connections` | int | Max concurrent API connections (default: `10`) | -| `max_retries` | int | Max retry attempts for failed samples (default: `3`) | -| `save_examples` | bool | If `true`, copies the agent's final workspace to `//examples/` after each sample. 
(default: `false`) | -| `models` | list | Filter to specific models — omit to run all | -| `variants` | map | Named variant definitions (see Variants section) — omit to run all defined in task files | -| `tasks` | object | Task discovery and overrides (see below) | - -### Inspect AI eval_set() Parameters - -All [Inspect AI `eval_set()` parameters](https://inspect.aisi.org.uk/reference/inspect_ai.html#eval_set) are available as top-level keys in the job file. These control retry behavior, concurrency, logging, and more. - -#### Retry & Error Handling - -| Field | Type | Default | Description | -|-------|------|---------|-------------| -| `retry_attempts` | int | `10` | Max retry attempts before giving up | -| `retry_wait` | float | `60` | Seconds between retries (exponential backoff) | -| `retry_connections` | float | `0.5` | Reduce max_connections at this rate per retry | -| `retry_cleanup` | bool | `true` | Cleanup failed log files after retries | -| `retry_on_error` | int | — | Retry samples on error (per-sample) | -| `fail_on_error` | float | `0.05` | Fail if error proportion exceeds threshold | -| `continue_on_fail` | bool | — | Continue running even if fail_on_error is met | -| `debug_errors` | bool | `false` | Raise task errors for debugging | - -#### Concurrency - -| Field | Type | Default | Description | -|-------|------|---------|-------------| -| `max_samples` | int | `max_connections` | Max concurrent samples per task | -| `max_tasks` | int | `max(4, models)` | Max tasks to run in parallel | -| `max_subprocesses` | int | `cpu_count` | Max subprocesses in parallel | -| `max_sandboxes` | int | — | Max sandboxes per-provider in parallel | - -#### Logging - -| Field | Type | Default | Description | -|-------|------|---------|-------------| -| `log_level` | string | `info` | Console log level (`debug`, `info`, `warning`, `error`) | -| `log_level_transcript` | string | `info` | Log file level | -| `log_format` | string | `json` | Log format (`eval` or `json`) 
| -| `log_samples` | bool | `true` | Log detailed samples and scores | -| `log_realtime` | bool | `true` | Log events in realtime | -| `log_images` | bool | `false` | Log base64-encoded images | -| `log_buffer` | int | — | Samples to buffer before log write | -| `log_shared` | int | — | Sync sample events for realtime viewing | -| `log_dir_allow_dirty` | bool | `false` | Allow log dir with unrelated logs | - -#### Model Configuration - -| Field | Type | Description | -|-------|------|-------------| -| `model_base_url` | string | Base URL for the model API | -| `model_args` | object | Model creation arguments | -| `model_roles` | object | Named roles for `get_model()` | -| `task_args` | object | Task creation arguments | -| `model_cost_config` | object | Model prices for cost tracking | - -#### Sample Control - -| Field | Type | Description | -|-------|------|-------------| -| `limit` | int/list | Limit samples (count or `[start, end]` range) | -| `sample_id` | string/list | Evaluate specific sample(s) | -| `sample_shuffle` | bool/int | Shuffle samples (pass seed for deterministic order) | -| `epochs` | int/object | Repeat samples and optional score reducer | - -#### Limits (Applied to All Samples) - -| Field | Type | Description | -|-------|------|-------------| -| `message_limit` | int | Max messages per sample | -| `token_limit` | int | Max tokens per sample | -| `time_limit` | int | Max clock time (seconds) per sample | -| `working_limit` | int | Max working time (seconds) per sample | -| `cost_limit` | float | Max cost (dollars) per sample | - -#### Miscellaneous - -| Field | Type | Description | -|-------|------|-------------| -| `tags` | list | Tags for this evaluation run | -| `metadata` | object | Metadata for this evaluation run | -| `trace` | bool | Trace model interactions to terminal | -| `display` | string | Task display type (default: `full`) | -| `score` | bool | Score output (default: `true`) | -| `approval` | string/object | Tool use approval 
policies | -| `solver` | string/object | Alternative solver(s) | -| `sandbox_cleanup` | bool | Cleanup sandbox after task (default: `true`) | -| `bundle_dir` | string | Directory for bundled logs + viewer | -| `bundle_overwrite` | bool | Overwrite files in bundle_dir | -| `eval_set_id` | string | Custom ID for the eval set | +For the complete list of job fields (including all Inspect AI `eval_set()` parameters), see the [Job fields table](yaml_config.md#job). ### Pass-Through Sections #### `task_defaults` -Default [Task parameters](#inspect-ai-task-parameters) applied to **every task** in this job. Per-task overrides from `task.yaml` take precedence. +Default [Task parameters](yaml_config.md#task) applied to **every task** in this job. Per-task overrides from `task.yaml` take precedence. Nested under `inspect_eval_arguments`: ```yaml -task_defaults: - time_limit: 600 - message_limit: 50 - cost_limit: 2.0 - epochs: 3 +inspect_eval_arguments: + task_defaults: + time_limit: 600 + message_limit: 50 + cost_limit: 2.0 + epochs: 3 ``` #### `eval_set_overrides` -Arbitrary `eval_set()` kwargs for parameters not covered by the named fields above. Top-level fields take precedence over overrides. +Arbitrary `eval_set()` kwargs for parameters not covered by the named fields above. Top-level `inspect_eval_arguments` fields take precedence over overrides. 
Nested under `inspect_eval_arguments`: ```yaml -eval_set_overrides: - bundle_dir: ./bundle - log_images: true +inspect_eval_arguments: + eval_set_overrides: + bundle_dir: ./bundle + log_images: true ``` ### Tasks Object @@ -462,16 +237,11 @@ tasks: # Per-task overrides (keys must match directory names in tasks/) inline: flutter_bug_fix: - allowed_variants: [baseline] # Override variants for this task - include-samples: [sample_001] # Only run these samples - exclude-samples: [slow_test] # Exclude these samples + include-variants: [baseline] # Only run these variants for this task + include-samples: [sample_001] # Only run these samples + exclude-samples: [slow_test] # Exclude these samples ``` -| Field | Type | Description | -|-------|------|-------------| -| `paths` | list | Glob patterns for discovering task directories | -| `inline` | object | Per-task configuration overrides | - --- ## Variants @@ -481,26 +251,53 @@ Variants modify how tasks execute, controlling context injection, tool availabil ```yaml variants: baseline: {} - context_only: { context_files: [./context_files/flutter.md] } - mcp_only: { mcp_servers: [dart] } - full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] } + context_only: { files: [./context_files/flutter.md] } + mcp_only: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } + full: { files: [./context_files/flutter.md], mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } ``` -| Field | Type | Default | Description | -|-------|------|---------|-------------| -| `context_files` | list | `[]` | Paths or glob patterns to context files (relative to task dir) | -| `skills` | list | `[]` | Paths or glob patterns to skill directories (relative to task dir) | -| `mcp_servers` | list | `[]` | MCP server identifiers | +Variant sub-fields (`files`, `mcp_servers`, `skills`, `task_parameters`) are documented in the [Job fields table](yaml_config.md#job). 
-Tasks can optionally restrict which variants apply to them via `allowed_variants` in their `task.yaml`: +Jobs can restrict which variants apply to specific tasks via `include-variants` and `exclude-variants` on the `tasks.` object: ```yaml -# task.yaml — only run baseline and mcp_only variants for this task -allowed_variants: [baseline, mcp_only] +# job.yaml — only run baseline and mcp_only variants for flutter_bug_fix +tasks: + inline: + flutter_bug_fix: + include-variants: [baseline, mcp_only] ``` Glob patterns (containing `*`, `?`, or `[`) are expanded automatically. For skills, only directories containing `SKILL.md` are included. +### MCP Server Modes + +MCP servers in variants support three modes: + +```yaml +variants: + # 1. Declarative stdio/sandbox — command-based + with_dart_mcp: + mcp_servers: + - name: dart + command: dart + args: [mcp-server] + + # 2. Declarative HTTP — url-based + with_http_mcp: + mcp_servers: + - name: my-api + url: https://mcp.example.com/api + authorization: "bearer-token-here" # optional OAuth Bearer token + headers: # optional extra headers + X-Custom-Header: value + + # 3. Python ref — import a pre-built MCPServer + with_custom_mcp: + mcp_servers: + - ref: "my_package.mcp:staging_server" +``` + > [!IMPORTANT] > The `skills` feature requires a sandbox (docker/podman). Skill directories are copied into the sandbox filesystem by Inspect AI's built-in `skill()` tool. Each skill directory must contain a `SKILL.md` file. @@ -523,7 +320,7 @@ updated: "2025-12-24" ## Flutter Best Practices Content here is injected into the model's context when the variant -has context_files pointing to this file. +has files pointing to this file. 
``` | Field | Type | Required | Description | diff --git a/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md b/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md index 460a6a4..fc7e1e9 100644 --- a/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md +++ b/docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md @@ -768,17 +768,28 @@ Resolves parsed task configs and job into fully-resolved This is the resolution engine. It: 1. Resolves models, sandboxes, and variants 2. Expands task × variant combinations into [Task] entries -3. Groups by flutter_channel (one [EvalSet] per group) -4. Propagates job-level and task-level settings to the output +3. Propagates job-level and task-level settings to the output ### Constructors #### `EvalSetResolver` ```dart -EvalSetResolver() +EvalSetResolver({Map> sandboxRegistry}) ``` +Creates a resolver with optional sandbox configuration. + +If [sandboxRegistry] is not provided, it defaults to an empty map +(no sandbox resolution). Pass [kDefaultSandboxRegistry] for the +Flutter-specific sandbox setup. + +### Properties + +- **`sandboxRegistry`** → `Map>` *(final)* + + Named sandbox configurations (e.g. `'podman'` → compose file path). + ### Methods #### `resolve` @@ -789,8 +800,6 @@ List resolve(List datasetTasks, Job job, String datasetRoot Resolve task configs and job into [EvalSet] objects. -Groups by flutter_channel so each gets its own sandbox. - **Parameters:** - `datasetTasks` (`List`) *(required)* @@ -972,27 +981,25 @@ be specified there and will be passed through to the Python runner. 
Example YAML: ```yaml log_dir: ./logs/my_run -sandbox: podman +sandbox: + environment: podman max_connections: 10 models: - google/gemini-2.5-flash variants: baseline: {} context_only: - context_files: [./context_files/flutter.md] + files: [./context_files/flutter.md] tasks: dart_qa: include-samples: [sample_1] -# Pass-through to eval_set() -eval_set_overrides: +# All Inspect AI eval_set() parameters +inspect_eval_arguments: retry_attempts: 20 log_level: debug - -# Default Task-level overrides applied to every task -task_defaults: - time_limit: 600 - message_limit: 50 + task_defaults: + time_limit: 600 ``` ### Constructors @@ -1000,7 +1007,7 @@ task_defaults: #### `Job` ```dart -Job({required String logDir, String sandboxType, int maxConnections, List? models, Map>? variants, List? taskPaths, Map? tasks, bool saveExamples, int? retryAttempts, int? maxRetries, double? retryWait, double? retryConnections, bool? retryCleanup, double? failOnError, bool? continueOnFail, int? retryOnError, bool? debugErrors, int? maxSamples, int? maxTasks, int? maxSubprocesses, int? maxSandboxes, String? logLevel, String? logLevelTranscript, String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, Object? sampleId, Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, bool? sandboxCleanup, String? modelBaseUrl, Map? modelArgs, Map? modelRoles, Map? taskArgs, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Map? modelCostConfig, bool? logSamples, bool? logRealtime, bool? logImages, int? logBuffer, int? logShared, String? bundleDir, bool? bundleOverwrite, bool? logDirAllowDirty, String? evalSetId, Map? evalSetOverrides, Map? taskDefaults}) +Job({String? description, required String logDir, int maxConnections, List? models, Map>? variants, List? taskPaths, Map? tasks, bool saveExamples, Map? sandbox, Map? inspectEvalArguments, TagFilter? taskFilters, TagFilter? 
sampleFilters}) ``` #### `Job.fromJson` @@ -1017,15 +1024,14 @@ Job.fromJson(Map json) Per-task configuration within a job. -Allows overriding which samples run for specific tasks and providing -a custom system message. +Allows overriding which samples and variants run for specific tasks. ### Constructors #### `JobTask` ```dart -JobTask({required String id, List? includeSamples, List? excludeSamples, String? systemMessage}) +JobTask({required String id, List? includeSamples, List? excludeSamples, List? includeVariants, List? excludeVariants, Map? args}) ``` #### `JobTask.fromJson` @@ -1042,9 +1048,6 @@ JobTask.fromYaml(String taskId, Map? data) Create a [JobTask] from parsed YAML data. -The [taskId] is the map key from the job YAML `tasks:` section. -The [data] may be `null` for a simple task reference with no overrides. - --- ## class `JsonParser` @@ -1200,14 +1203,14 @@ former `TaskConfig` model-package class. #### `ParsedTask` ```dart -ParsedTask({required String id, required String taskFunc, required List samples, required Variant variant, String sandboxType, String? systemMessage, List? allowedVariants, bool saveExamples, String? examplesDir, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) +ParsedTask({required String id, required String func, required List samples, required Variant variant, String sandboxType, String? systemMessage, bool saveExamples, String? examplesDir, Map? sandboxParameters, Map? taskFiles, String? taskSetup, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? 
version, Map? metadata}) ``` ### Properties - **`id`** → `String` *(final)* -- **`taskFunc`** → `String` *(final)* +- **`func`** → `String` *(final)* - **`samples`** → `List` *(final)* @@ -1217,12 +1220,22 @@ ParsedTask({required String id, required String taskFunc, required List - **`systemMessage`** → `String?` *(final)* -- **`allowedVariants`** → `List?` *(final)* - - **`saveExamples`** → `bool` *(final)* - **`examplesDir`** → `String?` *(final)* +- **`sandboxParameters`** → `Map?` *(final)* + + Pass-through dict for sandbox plugin configuration. + +- **`taskFiles`** → `Map?` *(final)* + + Task-level files to copy into sandbox. + +- **`taskSetup`** → `String?` *(final)* + + Task-level setup script. + - **`model`** → `String?` *(final)* Default model for this task. @@ -1296,7 +1309,7 @@ ParsedTask({required String id, required String taskFunc, required List #### `copyWith` ```dart -ParsedTask copyWith({String? id, String? taskFunc, List? samples, Variant? variant, String? sandboxType, String? systemMessage, List? allowedVariants, bool? saveExamples, String? examplesDir, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) +ParsedTask copyWith({String? id, String? func, List? samples, Variant? variant, String? sandboxType, String? systemMessage, bool? saveExamples, String? examplesDir, Map? sandboxParameters, Map? taskFiles, String? taskSetup, String? model, Map? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map? metadata}) ``` Create a copy with overrides. 
@@ -1304,14 +1317,16 @@ Create a copy with overrides. **Parameters:** - `id` (`String?`) -- `taskFunc` (`String?`) +- `func` (`String?`) - `samples` (`List?`) - `variant` (`Variant?`) - `sandboxType` (`String?`) - `systemMessage` (`String?`) -- `allowedVariants` (`List?`) - `saveExamples` (`bool?`) - `examplesDir` (`String?`) +- `sandboxParameters` (`Map?`) +- `taskFiles` (`Map?`) +- `taskSetup` (`String?`) - `model` (`String?`) - `config` (`Map?`) - `modelRoles` (`Map?`) @@ -1460,6 +1475,28 @@ Score.fromJson(Map json) --- +## abstract class `TagFilter` + +**Mixins:** `_$TagFilter` + +Tag-based filter for including/excluding items by their tags. + +### Constructors + +#### `TagFilter` + +```dart +TagFilter({List? includeTags, List? excludeTags}) +``` + +#### `TagFilter.fromJson` + +```dart +TagFilter.fromJson(Map json) +``` + +--- + ## abstract class `Task` **Mixins:** `_$Task` @@ -1475,7 +1512,7 @@ constructor. #### `Task` ```dart -Task({Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, String? taskFunc, String? name, Object version, Map? metadata}) +Task({Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, String? func, String? systemMessage, Map? sandboxParameters, String? name, Object version, Map? 
metadata}) ``` #### `Task.fromJson` @@ -1519,12 +1556,12 @@ TaskInfo.fromJson(Map json) #### `TaskMetadata` ```dart -TaskMetadata(String taskFunc, Map additional) +TaskMetadata(String func, Map additional) ``` ### Properties -- **`taskFunc`** → `String` *(final)* +- **`func`** → `String` *(final)* - **`additional`** → `Map` *(final)* @@ -1596,9 +1633,10 @@ Variants define different testing configurations to compare model performance with and without specific tooling or context. Features are implied by field presence — no explicit feature list needed: -- [contextFiles] populated → context injection enabled +- [files] populated → context injection enabled - [mcpServers] populated → MCP tools enabled -- [skillPaths] populated → agent skills enabled +- [skills] populated → agent skills enabled +- [taskParameters] populated → extra parameters passed to the task - all empty → baseline variant Example YAML: @@ -1606,10 +1644,13 @@ Example YAML: variants: baseline: {} context_only: - context_files: [./context_files/flutter.md] + files: [./context_files/flutter.md] full: - context_files: [./context_files/flutter.md] - mcp_servers: [dart] + files: [./context_files/flutter.md] + mcp_servers: + - name: dart + command: dart + args: [mcp-server] skills: [./skills/flutter_docs_ui] ``` @@ -1618,7 +1659,7 @@ variants: #### `Variant` ```dart -Variant({String name, List contextFiles, List mcpServers, List skillPaths, String? flutterChannel}) +Variant({String name, List files, List> mcpServers, List skills, Map taskParameters}) ``` #### `Variant.fromJson` @@ -1721,6 +1762,25 @@ Throws [FileSystemException] if the job file is not found. --- +## `matchesTagFilter` + +```dart +bool matchesTagFilter(List itemTags, TagFilter filter) +``` + +Check whether a set of [itemTags] matches the given [filter]. 
+ +Returns `true` if: +- All include_tags (if any) are present in [itemTags] +- No exclude_tags (if any) are present in [itemTags] + +**Parameters:** + +- `itemTags` (`List`) *(required)* +- `filter` (`TagFilter`) *(required)* + +--- + ## `readYamlFile` ```dart diff --git a/docs/reference/index.md b/docs/reference/index.md index 1576729..86879cb 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -8,6 +8,7 @@ API documentation, CLI usage, and other reference material. glossary cli configuration_reference +yaml_config ``` ```{toctree} diff --git a/docs/reference/yaml_config.md b/docs/reference/yaml_config.md new file mode 100644 index 0000000..4eeaf2c --- /dev/null +++ b/docs/reference/yaml_config.md @@ -0,0 +1,411 @@ +# YAML Configuration Fields + +This page provides a complete field-by-field reference for each YAML configuration file type, cross-referenced with the corresponding Dart and Python object field names. + +## Job + +Job files define runtime settings for an evaluation run, including sandbox configuration, rate limits, model selection, variant definitions, tag-based filtering, and pass-through parameters for Inspect AI's `eval_set()` and `Task` constructors. Located in `eval/jobs/`. + +```{list-table} +:header-rows: 1 +:widths: 20 8 5 12 12 43 + +* - Field name + - YAML type + - Optional + - Dart field + - Python field + - Description +* - `description` + - string + - Y + - `description` + - `description` + - Human-readable description of the job +* - `log_dir` + - string + - N + - `logDir` + - `log_dir` + - Directory to write evaluation logs to +* - `sandbox` + - string/object + - Y + - `sandbox` + - `sandbox` + - Sandbox configuration. String shorthand (e.g. 
`podman`) is equivalent to `{environment: podman}` +* - `sandbox` \ +   `.environment` + - string + - Y + - + - + - Sandbox type: `local`, `docker`, or `podman` (default: `local`) +* - `sandbox` \ +   `.parameters` + - object + - Y + - + - + - Pass-through parameters for sandbox plugin configuration +* - `sandbox` \ +   `.image_prefix` + - string + - Y + - + - + - Registry prefix prepended to image names during sandbox resolution (e.g. `us-central1-docker.pkg.dev/project/repo/`) +* - `max_connections` + - int + - Y + - `maxConnections` + - `max_connections` + - Maximum concurrent API connections (default: `10`) +* - `models` + - list + - N + - `models` + - `models` + - List of model identifiers to evaluate (required — at least one model must be specified) +* - `variants` + - map + - Y + - `variants` + - `variants` + - Named variant definitions (keys are names, values are config maps). Can also be a list of paths to external variant files. +* - `variants` \ +   `.` \ +   `.files` + - list + - Y + - + - + - Paths or glob patterns to context files +* - `variants` \ +   `.` \ +   `.mcp_servers` + - list + - Y + - + - + - MCP server configurations. Each entry is one of: (1) an object with `command`/`args` for stdio/sandbox, (2) an object with `url` for HTTP, or (3) a `ref:` string pointing to a Python MCPServer object. Common sub-fields: `name`, `transport`. Stdio sub-fields: `command`, `args`, `env`, `cwd`. HTTP sub-fields: `url`, `authorization`, `headers`. 
+* - `variants` \ +   `.` \ +   `.skills` + - list + - Y + - + - + - Paths or glob patterns to skill directories +* - `variants` \ +   `.` \ +   `.task_parameters` + - object + - Y + - + - + - Optional parameters merged into the task config dict at runtime +* - `task_filters` + - object + - Y + - `taskFilters` + - `task_filters` + - Tag-based task selection filter +* - `task_filters` \ +   `.include_tags` + - list + - Y + - `TagFilter.includeTags` + - `TagFilter.include_tags` + - Only run tasks whose metadata tags include **all** of these +* - `task_filters` \ +   `.exclude_tags` + - list + - Y + - `TagFilter.excludeTags` + - `TagFilter.exclude_tags` + - Exclude tasks whose metadata tags include **any** of these +* - `sample_filters` + - object + - Y + - `sampleFilters` + - `sample_filters` + - Tag-based sample selection filter (same schema as `task_filters`) +* - `task_paths` + - list + - Y + - `taskPaths` + - `task_paths` + - Glob patterns for discovering task directories (relative to dataset root) +* - `tasks` + - object + - Y + - `tasks` + - `tasks` + - Per-task configurations with inline overrides +* - `tasks` \ +   `.` \ +   `.include-samples` + - list + - Y + - `JobTask.includeSamples` + - `JobTask.include_samples` + - Only run these sample IDs +* - `tasks` \ +   `.` \ +   `.exclude-samples` + - list + - Y + - `JobTask.excludeSamples` + - `JobTask.exclude_samples` + - Exclude these sample IDs +* - `tasks` \ +   `.` \ +   `.args` + - object + - Y + - `JobTask.args` + - `JobTask.args` + - Per-task argument overrides passed to the task function +* - `tasks` \ +   `.` \ +   `.include-variants` + - list + - Y + - `JobTask.includeVariants` + - `JobTask.include_variants` + - Only run these variant names for this task +* - `tasks` \ +   `.` \ +   `.exclude-variants` + - list + - Y + - `JobTask.excludeVariants` + - `JobTask.exclude_variants` + - Exclude these variant names for this task +* - `save_examples` + - bool + - Y + - `saveExamples` + - `save_examples` + - 
Copy final workspace to `/examples/` after each sample (default: `false`) +* - `inspect_eval_arguments` + - object + - Y + - `inspectEvalArguments` + - `inspect_eval_arguments` + - Pass-through dict of any valid Inspect AI `eval_set()` kwargs (e.g. `retry_attempts`, `log_level`, `max_tasks`, `tags`, `task_defaults`, `eval_set_overrides`, etc.). See [Inspect AI docs](https://inspect.ai-safety-institute.org.uk/) for the full list of supported parameters. +``` + +## Task + +Task files define a single evaluation task with its samples, prompt configuration, and optional Inspect AI `Task` parameter overrides. Located in `eval/tasks//task.yaml`. + +Task-level Inspect AI `Task` parameters (model, limits, sandbox, etc.) are nested under `inspect_task_args`. + +```{list-table} +:header-rows: 1 +:widths: 20 8 5 12 12 43 + +* - Field name + - YAML type + - Optional + - Dart field + - Python field + - Description +* - `func` + - string + - Y + - `func` + - `func` + - Name of the `@task` function or `module:function` reference (defaults to directory name) +* - `id` + - string + - Y + - + - + - Task identifier (defaults to directory name) +* - `description` + - string + - Y + - `description` + - `description` + - Human-readable description +* - `dataset` + - object + - Y + - + - + - Dataset configuration. Must contain exactly one of `samples`, `json`, or `csv`. 
+* - `dataset` \ +   `.samples` + - object + - Y + - + - + - Inline/file-based sample definitions (see `samples.inline` and `samples.paths` below) +* - `dataset` \ +   `.samples` \ +   `.inline` + - list + - Y + - + - + - Inline sample definitions (list of sample objects) +* - `dataset` \ +   `.samples` \ +   `.paths` + - list + - Y + - + - + - Glob patterns for external sample YAML files (relative to task dir) +* - `dataset` \ +   `.json` + - string + - Y + - + - + - Path or URL to a JSON/JSONL dataset file (maps to Inspect's `json_dataset()`) +* - `dataset` \ +   `.csv` + - string + - Y + - + - + - Path to a CSV dataset file (maps to Inspect's `csv_dataset()`) +* - `dataset` \ +   `.args` + - object + - Y + - `Dataset.args` + - `Dataset.args` + - Additional arguments passed through to the dataset constructor (e.g. `auto_id`, `shuffle`, `delimiter`) +* - `system_message` + - string + - Y + - `systemMessage` + - `system_message` + - Custom system prompt for this task +* - `files` + - object + - Y + - `files` + - `files` + - Files to copy into sandbox for all samples (`{destination: source}`). Task-level files stack with sample-level files (sample wins on key conflict). +* - `setup` + - string + - Y + - `setup` + - `setup` + - Setup script to run in sandbox before evaluation (overridden by sample-level `setup`) +* - `display_name` + - string + - Y + - `displayName` + - `display_name` + - Task display name (e.g. for plotting) +* - `version` + - int + - Y + - `version` + - `version` + - Version of task spec +* - `metadata` + - object + - Y + - `metadata` + - `metadata` + - Additional metadata to associate with the task +* - `inspect_task_args` + - object + - Y + - + - + - Pass-through dict of any valid Inspect AI `Task()` kwargs (e.g. `model`, `time_limit`, `message_limit`, `epochs`, `sandbox`, etc.). See [Inspect AI docs](https://inspect.ai-safety-institute.org.uk/) for the full list. 
+``` + +## Sample + +Samples are individual test cases defined either inline in `task.yaml` under `dataset.samples.inline`, or in external YAML files referenced via `dataset.samples.paths`. Fields like `difficulty` and `tags` should be nested inside the sample's `metadata` dict. + +```{list-table} +:header-rows: 1 +:widths: 20 8 5 12 12 43 + +* - Field name + - YAML type + - Optional + - Dart field + - Python field + - Description +* - `id` + - string + - N + - `id` + - `id` + - Unique sample identifier +* - `input` + - string + - N + - `input` + - `input` + - The prompt given to the model +* - `target` + - string + - N + - `target` + - `target` + - Expected output or grading criteria +* - `metadata` \ +   `.difficulty` + - string + - Y + - + - + - `easy`, `medium`, or `hard` +* - `metadata` \ +   `.tags` + - list + - Y + - + - + - Categories for filtering +* - `metadata` \ +   `.system_message` + - string + - Y + - + - + - Override system prompt for this sample +* - `choices` + - list + - Y + - `choices` + - `choices` + - Answer choices for multiple-choice evaluations +* - `metadata` + - object + - Y + - `metadata` + - `metadata` + - Arbitrary metadata +* - `sandbox` + - string/object + - Y + - `sandbox` + - `sandbox` + - Override sandbox environment for this sample +* - `files` + - object + - Y + - `files` + - `files` + - Files to copy into sandbox (`{destination: source}`) +* - `setup` + - string + - Y + - `setup` + - `setup` + - Setup script to run in sandbox before evaluation +``` diff --git a/packages/dash_evals/src/dash_evals/runner/json_runner.py b/packages/dash_evals/src/dash_evals/runner/json_runner.py index a5d7a5b..828d048 100644 --- a/packages/dash_evals/src/dash_evals/runner/json_runner.py +++ b/packages/dash_evals/src/dash_evals/runner/json_runner.py @@ -11,7 +11,7 @@ from pathlib import Path import inspect_ai -from inspect_ai.dataset import MemoryDataset, Sample +from inspect_ai.dataset import MemoryDataset, Sample, csv_dataset, json_dataset from 
dash_evals.utils.logging import capture_output, setup_logging @@ -27,6 +27,7 @@ def _resolve_task_func(name: str): Supports: - Short names: "flutter_code_gen" → dash_evals.runner.tasks.flutter_code_gen + - Colon syntax: "my_package.tasks:my_task" → import my_package.tasks, get my_task - Dotted paths: "dash_evals.runner.tasks.flutter_code_gen.flutter_code_gen" For short names, first tries to import a module with the same name. @@ -36,6 +37,21 @@ def _resolve_task_func(name: str): Returns the callable task function. """ + # Colon syntax: "module.path:function_name" + if ":" in name: + module_path, func_name = name.split(":", 1) + try: + module = importlib.import_module(module_path) + except ModuleNotFoundError: + raise ValueError( + f"Could not find module '{module_path}' for task function '{name}'. " + f"Check that the module exists and is importable." + ) + func = getattr(module, func_name, None) + if func is None: + raise ValueError(f"Module '{module_path}' does not have a function '{func_name}'.") + return func + if "." not in name: # Short name: try module with the same name first module_path = f"dash_evals.runner.tasks.{name}" @@ -78,32 +94,73 @@ def _resolve_task_func(name: str): return func -def _build_dataset_from_inline(task_def: dict) -> MemoryDataset: - """Build an Inspect AI MemoryDataset from inline dataset in the task def. +def _build_dataset(task_def: dict): + """Build an Inspect AI dataset from a task definition. + + Dispatches on ``task_def["dataset"]["format"]``: - The task_def["dataset"]["samples"] contains a list of InspectSample dicts. + - ``"memory"`` (default): builds a ``MemoryDataset`` from inline samples. + - ``"json"``: delegates to ``inspect_ai.dataset.json_dataset(source, **args)``. + - ``"csv"``: delegates to ``inspect_ai.dataset.csv_dataset(source, **args)``. + + Args: + task_def: A task entry from the EvalSet JSON manifest. + + Returns: + An Inspect AI dataset object. 
+ + Raises: + ValueError: If the dataset format is unrecognized or required fields + (e.g. ``source`` for json/csv) are missing. """ dataset_def = task_def.get("dataset") + task_name = task_def.get("name", "") + if not dataset_def: - return MemoryDataset([], name=task_def.get("name", "")) - - raw_samples = dataset_def.get("samples", []) - samples = [] - for raw in raw_samples: - sample = Sample( - input=raw["input"], - target=raw.get("target", ""), - id=raw.get("id"), - metadata=raw.get("metadata"), - files=raw.get("files"), - setup=raw.get("setup"), - sandbox=raw.get("sandbox"), + return MemoryDataset([], name=task_name) + + fmt = dataset_def.get("format", "memory") + extra_args: dict = dataset_def.get("args") or {} + + if fmt == "json": + source = dataset_def.get("source") + if not source: + raise ValueError( + f"Task '{task_name}': dataset format 'json' requires a 'source' field." + ) + return json_dataset(source, **extra_args) + + if fmt == "csv": + source = dataset_def.get("source") + if not source: + raise ValueError( + f"Task '{task_name}': dataset format 'csv' requires a 'source' field." + ) + return csv_dataset(source, **extra_args) + + if fmt == "memory": + raw_samples = dataset_def.get("samples", []) + samples = [] + for raw in raw_samples: + sample = Sample( + input=raw["input"], + target=raw.get("target", ""), + id=raw.get("id"), + metadata=raw.get("metadata"), + files=raw.get("files"), + setup=raw.get("setup"), + sandbox=raw.get("sandbox"), + ) + samples.append(sample) + + return MemoryDataset( + samples, + name=dataset_def.get("name", task_name), ) - samples.append(sample) - return MemoryDataset( - samples, - name=dataset_def.get("name", task_def.get("name", "")), + raise ValueError( + f"Task '{task_name}': unknown dataset format '{fmt}'. " + f"Expected one of: 'memory', 'json', 'csv'." 
) @@ -141,19 +198,17 @@ def _run_single_manifest(manifest: dict) -> bool: Path(log_dir).mkdir(parents=True, exist_ok=True) job_logger, log_file_path = setup_logging(Path(log_dir), name="dash_evals") - # Build Task objects from inline datasets + # Build Task objects from task definitions task_defs = manifest["tasks"] task_instances: list[inspect_ai.Task] = [] for task_def in task_defs: - task_func_name = task_def.get("task_func") + task_func_name = task_def.get("func") task_name = task_def.get("name", task_func_name or "(unknown)") if not task_func_name: # Mode 2: hydrate directly from JSON (future) - job_logger.warning( - f" ⚠ {task_name}: no task_func — Mode 2 hydration not yet supported" - ) + job_logger.warning(f" ⚠ {task_name}: no func — Mode 2 hydration not yet supported") continue try: @@ -162,8 +217,12 @@ def _run_single_manifest(manifest: dict) -> bool: job_logger.warning(f" ✗ {task_name}: {e}") continue - # Build inline dataset - dataset = _build_dataset_from_inline(task_def) + # Build dataset (dispatches on format: memory | json | csv) + try: + dataset = _build_dataset(task_def) + except ValueError as e: + job_logger.warning(f" ✗ {task_name}: {e}") + continue # Inject task_name into the config for task functions that expect it. # The Dart CLI emits "name" but task functions use "task_name". diff --git a/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py b/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py index 9c2f61a..bca2517 100644 --- a/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py +++ b/packages/dash_evals/src/dash_evals/runner/tasks/task_helpers.py @@ -1,7 +1,7 @@ """Shared helper functions for building task components. 
These helpers encapsulate common patterns used across tasks: -- Creating the Dart MCP server +- Creating MCP servers from variant config - Building task metadata - Appending variant-driven solvers (context injection, MCP tools, skills) @@ -11,11 +11,19 @@ from __future__ import annotations +import importlib from typing import Any, cast from inspect_ai.agent import react from inspect_ai.solver import Solver, generate -from inspect_ai.tool import MCPServer, Tool, mcp_server_stdio, skill +from inspect_ai.tool import ( + MCPServer, + Tool, + mcp_server_http, + mcp_server_sandbox, + mcp_server_stdio, + skill, +) from dash_evals.runner.solvers import context_injector @@ -58,32 +66,137 @@ def validate_sandbox_tools(config: dict, tool_names: list[str]) -> None: ) -def create_mcp_server(config: dict | None = None): - """ - Create an MCP server tool from config. +def _resolve_mcp_ref(ref: str) -> MCPServer: + """Resolve a Python import reference to an MCPServer object. - Reads 'mcp_server_command' and 'mcp_server_args' from config. - Defaults to the Dart MCP server if not specified. + Supports ``"module.path:variable_name"`` format. + + Args: + ref: Import reference (e.g. ``"my_package.mcp:staging_server"``). + + Returns: + The resolved MCPServer object. + """ + if ":" not in ref: + raise ValueError( + f"Invalid MCP server ref '{ref}'. 
Expected format: 'module.path:variable_name'" + ) + module_path, attr_name = ref.rsplit(":", 1) + try: + module = importlib.import_module(module_path) + except ImportError as e: + raise ImportError( + f"Could not import module '{module_path}' for MCP server ref '{ref}': {e}" + ) from e + try: + server = getattr(module, attr_name) + except AttributeError as e: + raise AttributeError( + f"Module '{module_path}' has no attribute '{attr_name}' " + f"(referenced by MCP server ref '{ref}')" + ) from e + return server + + +def create_mcp_servers( + mcp_configs: list[dict], + sandbox_type: str = "local", +) -> list[MCPServer]: + """Create MCP server objects from variant config. + + Supports three modes per entry: + - **Declarative stdio/sandbox**: dict with ``command``, ``args``, etc. + - **Declarative HTTP**: dict with ``url``, and optionally ``authorization``/``headers``. + - **Python ref**: dict with ``ref`` key pointing to a pre-built MCPServer. + + Transport is auto-selected when not explicit: + - If ``url`` is present → ``mcp_server_http`` + - If sandbox is non-local → ``mcp_server_sandbox`` + - Otherwise → ``mcp_server_stdio`` Args: - config: Task config with optional 'mcp_server_command' and - 'mcp_server_args' keys. + mcp_configs: List of MCP server config dicts from variant_config. + sandbox_type: The sandbox type for the current eval run. Returns: - MCP server stdio tool. + List of MCPServer objects. 
""" - config = config or {} - command = config.get("mcp_server_command", "dart") - args = config.get("mcp_server_args", ["mcp-server", "--force-roots-fallback"]) - name = config.get("mcp_server_name", "Dart") + servers: list[MCPServer] = [] + for cfg in mcp_configs: + # Ref mode — import a pre-built MCPServer from Python + if cfg.get("ref"): + servers.append(_resolve_mcp_ref(cfg["ref"])) + continue + + # HTTP mode — url-based server + url = cfg.get("url") + if url: + name = cfg.get("name", url) + authorization = cfg.get("authorization") or cfg.get("auth") + headers = cfg.get("headers") + servers.append( + mcp_server_http( + url=url, + name=name, + authorization=authorization, + headers=headers, + ) + ) + continue + + # Stdio / sandbox mode — command-based server + command = cfg.get("command") + if not command: + raise ValueError( + f"MCP server config missing 'command' or 'url' for server " + f"'{cfg.get('name', 'unknown')}': {cfg}" + ) + + name = cfg.get("name", command) + args = cfg.get("args", []) + env = cfg.get("env") + cwd = cfg.get("cwd") + + transport = cfg.get("transport") + if transport is None: + transport = "sandbox" if sandbox_type != "local" else "stdio" + + if transport == "stdio": + servers.append( + mcp_server_stdio( + name=name, + command=command, + args=args, + env=env, + cwd=cwd, + ) + ) + elif transport == "sandbox": + servers.append( + mcp_server_sandbox( + name=name, + command=command, + args=args, + env=env, + cwd=cwd, + ) + ) + else: + raise ValueError(f"Unknown MCP transport '{transport}' for server '{name}'") + + return servers + + +# Backwards-compatible alias +def create_mcp_server(config: dict | None = None): + """Create the default Dart MCP server (backwards-compatible alias).""" return mcp_server_stdio( - name=name, - command=command, - args=args, + name="Dart", + command="dart", + args=["mcp-server", "--force-roots-fallback"], ) -# Backwards-compatible alias def create_dart_mcp_server(): """Create the standard Dart MCP server tool 
(backwards-compatible alias).""" return create_mcp_server() @@ -119,7 +232,8 @@ def append_context_injection(solver_chain: list, config: dict) -> None: config: Task manifest entry with 'variant' key. """ variant = config.get("variant", {}) - context_files = variant.get("context_files", []) + # Support both old "context_files" and new "files" key + context_files = variant.get("files") or variant.get("context_files", []) if context_files: solver_chain.append(context_injector(context_files)) @@ -134,7 +248,8 @@ def get_skill_tool(config: dict) -> Tool | None: The skill Tool, or None if no skills are configured. """ variant = config.get("variant", {}) - skill_paths = variant.get("skill_paths", []) + # Support both old "skill_paths" and new "skills" key + skill_paths = variant.get("skills") or variant.get("skill_paths", []) if skill_paths: return skill(skill_paths) return None @@ -155,8 +270,11 @@ def append_model_interaction( """ tools: list[Tool | MCPServer] = [] variant = config.get("variant", {}) - if variant.get("mcp_servers"): - tools.append(create_mcp_server(config)) + mcp_servers_config = variant.get("mcp_servers", []) + + if mcp_servers_config: + sandbox_type = config.get("sandbox_type", "local") + tools.extend(create_mcp_servers(mcp_servers_config, sandbox_type)) skill_tool = get_skill_tool(config) if skill_tool: diff --git a/packages/dash_evals/tests/test_json_runner.py b/packages/dash_evals/tests/test_json_runner.py new file mode 100644 index 0000000..067f30c --- /dev/null +++ b/packages/dash_evals/tests/test_json_runner.py @@ -0,0 +1,217 @@ +"""Tests for json_runner._build_dataset() — dataset format dispatch.""" + +from __future__ import annotations + +from unittest.mock import MagicMock, patch + +import pytest +from inspect_ai.dataset import MemoryDataset + +from dash_evals.runner.json_runner import _build_dataset + + +class TestBuildDatasetMemoryFormat: + """Tests for inline MemoryDataset (format='memory').""" + + def 
test_no_dataset_returns_empty_memory_dataset(self): + """Tasks without a dataset key produce an empty MemoryDataset.""" + task_def = {"name": "my_task:baseline", "func": "question_answer"} + result = _build_dataset(task_def) + assert isinstance(result, MemoryDataset) + assert len(result) == 0 + + def test_empty_dataset_dict_returns_empty_memory_dataset(self): + """An empty dataset dict produces an empty MemoryDataset.""" + task_def = {"name": "my_task:baseline", "dataset": {}} + result = _build_dataset(task_def) + assert isinstance(result, MemoryDataset) + assert len(result) == 0 + + def test_memory_format_explicit(self): + """Explicit format='memory' builds a MemoryDataset from inline samples.""" + task_def = { + "name": "my_task:baseline", + "dataset": { + "format": "memory", + "samples": [ + {"id": "s1", "input": "What is Dart?", "target": "A language"}, + ], + }, + } + result = _build_dataset(task_def) + assert isinstance(result, MemoryDataset) + assert len(result) == 1 + assert result[0].input == "What is Dart?" 
+ assert result[0].target == "A language" + assert result[0].id == "s1" + + def test_memory_format_default_when_format_absent(self): + """Omitting 'format' defaults to memory format.""" + task_def = { + "name": "my_task:baseline", + "dataset": { + "samples": [ + {"id": "s1", "input": "q", "target": "a"}, + ], + }, + } + result = _build_dataset(task_def) + assert isinstance(result, MemoryDataset) + assert len(result) == 1 + + def test_memory_format_preserves_optional_sample_fields(self): + """Optional sample fields (metadata, files, setup, sandbox) are passed through.""" + task_def = { + "name": "t:v", + "dataset": { + "samples": [ + { + "id": "s1", + "input": "q", + "target": "a", + "metadata": {"difficulty": "hard"}, + "files": {"/workspace": "./proj"}, + "setup": "dart pub get", + "sandbox": "docker", + } + ], + }, + } + result = _build_dataset(task_def) + sample = result[0] + assert sample.metadata == {"difficulty": "hard"} + assert sample.files == {"/workspace": "./proj"} + assert sample.setup == "dart pub get" + # Inspect AI normalises string sandbox values to SandboxEnvironmentSpec + sandbox = sample.sandbox + sandbox_type = sandbox.type if hasattr(sandbox, "type") else sandbox + assert sandbox_type == "docker" + + def test_memory_format_dataset_name(self): + """Dataset name falls back to task name when not set in dataset dict.""" + task_def = { + "name": "dart_qa:baseline", + "dataset": { + "samples": [], + }, + } + result = _build_dataset(task_def) + assert isinstance(result, MemoryDataset) + # Name is set (MemoryDataset stores it) + assert result.name == "dart_qa:baseline" + + def test_memory_format_explicit_dataset_name_wins(self): + """Explicit dataset name takes precedence over task name.""" + task_def = { + "name": "dart_qa:baseline", + "dataset": { + "name": "custom_name", + "samples": [], + }, + } + result = _build_dataset(task_def) + assert result.name == "custom_name" + + +class TestBuildDatasetJsonFormat: + """Tests for JSON file-backed dataset 
(format='json').""" + + def test_json_format_calls_json_dataset(self): + """format='json' calls inspect_ai.dataset.json_dataset(source).""" + task_def = { + "name": "my_task:baseline", + "dataset": { + "format": "json", + "source": "gs://bucket/data.jsonl", + }, + } + mock_ds = MagicMock(name="json_dataset_result") + with patch("dash_evals.runner.json_runner.json_dataset", return_value=mock_ds) as mock_fn: + result = _build_dataset(task_def) + + mock_fn.assert_called_once_with("gs://bucket/data.jsonl") + assert result is mock_ds + + def test_json_format_passes_extra_args(self): + """Extra args from dataset.args are passed as kwargs to json_dataset().""" + task_def = { + "name": "t:v", + "dataset": { + "format": "json", + "source": "./data.jsonl", + "args": {"auto_id": True, "shuffle": True}, + }, + } + with patch("dash_evals.runner.json_runner.json_dataset") as mock_fn: + _build_dataset(task_def) + + mock_fn.assert_called_once_with("./data.jsonl", auto_id=True, shuffle=True) + + def test_json_format_missing_source_raises(self): + """format='json' without a source raises ValueError.""" + task_def = { + "name": "my_task:baseline", + "dataset": {"format": "json"}, + } + with pytest.raises(ValueError, match="requires a 'source' field"): + _build_dataset(task_def) + + +class TestBuildDatasetCsvFormat: + """Tests for CSV file-backed dataset (format='csv').""" + + def test_csv_format_calls_csv_dataset(self): + """format='csv' calls inspect_ai.dataset.csv_dataset(source).""" + task_def = { + "name": "my_task:baseline", + "dataset": { + "format": "csv", + "source": "./data.csv", + }, + } + mock_ds = MagicMock(name="csv_dataset_result") + with patch("dash_evals.runner.json_runner.csv_dataset", return_value=mock_ds) as mock_fn: + result = _build_dataset(task_def) + + mock_fn.assert_called_once_with("./data.csv") + assert result is mock_ds + + def test_csv_format_passes_extra_args(self): + """Extra args from dataset.args are passed as kwargs to csv_dataset().""" + task_def = { 
+ "name": "t:v", + "dataset": { + "format": "csv", + "source": "./data.csv", + "args": {"delimiter": "\t", "encoding": "utf-8"}, + }, + } + with patch("dash_evals.runner.json_runner.csv_dataset") as mock_fn: + _build_dataset(task_def) + + mock_fn.assert_called_once_with("./data.csv", delimiter="\t", encoding="utf-8") + + def test_csv_format_missing_source_raises(self): + """format='csv' without a source raises ValueError.""" + task_def = { + "name": "my_task:baseline", + "dataset": {"format": "csv"}, + } + with pytest.raises(ValueError, match="requires a 'source' field"): + _build_dataset(task_def) + + +class TestBuildDatasetUnknownFormat: + """Tests for unknown dataset formats.""" + + def test_unknown_format_raises(self): + """An unrecognized format string raises ValueError.""" + task_def = { + "name": "my_task:baseline", + "dataset": { + "format": "parquet", + "source": "./data.parquet", + }, + } + with pytest.raises(ValueError, match="unknown dataset format 'parquet'"): + _build_dataset(task_def) diff --git a/packages/dataset_config_dart/lib/src/models/context_file.g.dart b/packages/dataset_config_dart/lib/src/models/context_file.g.dart index fcea90e..7489275 100644 --- a/packages/dataset_config_dart/lib/src/models/context_file.g.dart +++ b/packages/dataset_config_dart/lib/src/models/context_file.g.dart @@ -37,7 +37,7 @@ _ContextFile _$ContextFileFromJson(Map json) => _ContextFile( Map _$ContextFileToJson(_ContextFile instance) => { - 'metadata': instance.metadata.toJson(), + 'metadata': instance.metadata, 'content': instance.content, 'file_path': instance.filePath, }; diff --git a/packages/dataset_config_dart/lib/src/models/dataset.dart b/packages/dataset_config_dart/lib/src/models/dataset.dart index 874080e..0bd9970 100644 --- a/packages/dataset_config_dart/lib/src/models/dataset.dart +++ b/packages/dataset_config_dart/lib/src/models/dataset.dart @@ -17,7 +17,7 @@ part 'dataset.g.dart'; @freezed sealed class Dataset with _$Dataset { const factory Dataset({ - 
/// The list of sample objects. + /// The list of sample objects (only used when format is 'memory'). @Default([]) List samples, /// Dataset name. @@ -28,6 +28,15 @@ sealed class Dataset with _$Dataset { /// Whether the dataset was shuffled after reading. @Default(false) bool shuffled, + + /// Dataset format: 'memory' (inline samples), 'json', or 'csv'. + @Default('memory') String format, + + /// File path or URL for json/csv datasets. + String? source, + + /// Extra kwargs passed to json_dataset() or csv_dataset(). + Map? args, }) = _Dataset; factory Dataset.fromJson(Map json) => diff --git a/packages/dataset_config_dart/lib/src/models/dataset.freezed.dart b/packages/dataset_config_dart/lib/src/models/dataset.freezed.dart index fdd77dc..8c0c2d2 100644 --- a/packages/dataset_config_dart/lib/src/models/dataset.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/dataset.freezed.dart @@ -15,11 +15,14 @@ T _$identity(T value) => value; /// @nodoc mixin _$Dataset { -/// The list of sample objects. +/// The list of sample objects (only used when format is 'memory'). List get samples;/// Dataset name. String? get name;/// Dataset location (file path or remote URL). String? get location;/// Whether the dataset was shuffled after reading. - bool get shuffled; + bool get shuffled;/// Dataset format: 'memory' (inline samples), 'json', or 'csv'. + String get format;/// File path or URL for json/csv datasets. + String? get source;/// Extra kwargs passed to json_dataset() or csv_dataset(). + Map? get args; /// Create a copy of Dataset /// with the given fields replaced by the non-null parameter values. 
@JsonKey(includeFromJson: false, includeToJson: false) @@ -32,16 +35,16 @@ $DatasetCopyWith get copyWith => _$DatasetCopyWithImpl(this as @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Dataset&&const DeepCollectionEquality().equals(other.samples, samples)&&(identical(other.name, name) || other.name == name)&&(identical(other.location, location) || other.location == location)&&(identical(other.shuffled, shuffled) || other.shuffled == shuffled)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Dataset&&const DeepCollectionEquality().equals(other.samples, samples)&&(identical(other.name, name) || other.name == name)&&(identical(other.location, location) || other.location == location)&&(identical(other.shuffled, shuffled) || other.shuffled == shuffled)&&(identical(other.format, format) || other.format == format)&&(identical(other.source, source) || other.source == source)&&const DeepCollectionEquality().equals(other.args, args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(samples),name,location,shuffled); +int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(samples),name,location,shuffled,format,source,const DeepCollectionEquality().hash(args)); @override String toString() { - return 'Dataset(samples: $samples, name: $name, location: $location, shuffled: $shuffled)'; + return 'Dataset(samples: $samples, name: $name, location: $location, shuffled: $shuffled, format: $format, source: $source, args: $args)'; } @@ -52,7 +55,7 @@ abstract mixin class $DatasetCopyWith<$Res> { factory $DatasetCopyWith(Dataset value, $Res Function(Dataset) _then) = _$DatasetCopyWithImpl; @useResult $Res call({ - List samples, String? name, String? location, bool shuffled + List samples, String? name, String? location, bool shuffled, String format, String? 
source, Map? args }); @@ -69,13 +72,16 @@ class _$DatasetCopyWithImpl<$Res> /// Create a copy of Dataset /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? samples = null,Object? name = freezed,Object? location = freezed,Object? shuffled = null,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? samples = null,Object? name = freezed,Object? location = freezed,Object? shuffled = null,Object? format = null,Object? source = freezed,Object? args = freezed,}) { return _then(_self.copyWith( samples: null == samples ? _self.samples : samples // ignore: cast_nullable_to_non_nullable as List,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String?,location: freezed == location ? _self.location : location // ignore: cast_nullable_to_non_nullable as String?,shuffled: null == shuffled ? _self.shuffled : shuffled // ignore: cast_nullable_to_non_nullable -as bool, +as bool,format: null == format ? _self.format : format // ignore: cast_nullable_to_non_nullable +as String,source: freezed == source ? _self.source : source // ignore: cast_nullable_to_non_nullable +as String?,args: freezed == args ? _self.args : args // ignore: cast_nullable_to_non_nullable +as Map?, )); } @@ -157,10 +163,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( List samples, String? name, String? location, bool shuffled)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( List samples, String? name, String? location, bool shuffled, String format, String? source, Map? args)? 
$default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Dataset() when $default != null: -return $default(_that.samples,_that.name,_that.location,_that.shuffled);case _: +return $default(_that.samples,_that.name,_that.location,_that.shuffled,_that.format,_that.source,_that.args);case _: return orElse(); } @@ -178,10 +184,10 @@ return $default(_that.samples,_that.name,_that.location,_that.shuffled);case _: /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( List samples, String? name, String? location, bool shuffled) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( List samples, String? name, String? location, bool shuffled, String format, String? source, Map? args) $default,) {final _that = this; switch (_that) { case _Dataset(): -return $default(_that.samples,_that.name,_that.location,_that.shuffled);} +return $default(_that.samples,_that.name,_that.location,_that.shuffled,_that.format,_that.source,_that.args);} } /// A variant of `when` that fallback to returning `null` /// @@ -195,10 +201,10 @@ return $default(_that.samples,_that.name,_that.location,_that.shuffled);} /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( List samples, String? name, String? location, bool shuffled)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( List samples, String? name, String? location, bool shuffled, String format, String? source, Map? args)? 
$default,) {final _that = this; switch (_that) { case _Dataset() when $default != null: -return $default(_that.samples,_that.name,_that.location,_that.shuffled);case _: +return $default(_that.samples,_that.name,_that.location,_that.shuffled,_that.format,_that.source,_that.args);case _: return null; } @@ -210,12 +216,12 @@ return $default(_that.samples,_that.name,_that.location,_that.shuffled);case _: @JsonSerializable() class _Dataset implements Dataset { - const _Dataset({final List samples = const [], this.name, this.location, this.shuffled = false}): _samples = samples; + const _Dataset({final List samples = const [], this.name, this.location, this.shuffled = false, this.format = 'memory', this.source, final Map? args}): _samples = samples,_args = args; factory _Dataset.fromJson(Map json) => _$DatasetFromJson(json); -/// The list of sample objects. +/// The list of sample objects (only used when format is 'memory'). final List _samples; -/// The list of sample objects. +/// The list of sample objects (only used when format is 'memory'). @override@JsonKey() List get samples { if (_samples is EqualUnmodifiableListView) return _samples; // ignore: implicit_dynamic_type @@ -228,6 +234,21 @@ class _Dataset implements Dataset { @override final String? location; /// Whether the dataset was shuffled after reading. @override@JsonKey() final bool shuffled; +/// Dataset format: 'memory' (inline samples), 'json', or 'csv'. +@override@JsonKey() final String format; +/// File path or URL for json/csv datasets. +@override final String? source; +/// Extra kwargs passed to json_dataset() or csv_dataset(). + final Map? _args; +/// Extra kwargs passed to json_dataset() or csv_dataset(). +@override Map? 
get args { + final value = _args; + if (value == null) return null; + if (_args is EqualUnmodifiableMapView) return _args; + // ignore: implicit_dynamic_type + return EqualUnmodifiableMapView(value); +} + /// Create a copy of Dataset /// with the given fields replaced by the non-null parameter values. @@ -242,16 +263,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Dataset&&const DeepCollectionEquality().equals(other._samples, _samples)&&(identical(other.name, name) || other.name == name)&&(identical(other.location, location) || other.location == location)&&(identical(other.shuffled, shuffled) || other.shuffled == shuffled)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Dataset&&const DeepCollectionEquality().equals(other._samples, _samples)&&(identical(other.name, name) || other.name == name)&&(identical(other.location, location) || other.location == location)&&(identical(other.shuffled, shuffled) || other.shuffled == shuffled)&&(identical(other.format, format) || other.format == format)&&(identical(other.source, source) || other.source == source)&&const DeepCollectionEquality().equals(other._args, _args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(_samples),name,location,shuffled); +int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(_samples),name,location,shuffled,format,source,const DeepCollectionEquality().hash(_args)); @override String toString() { - return 'Dataset(samples: $samples, name: $name, location: $location, shuffled: $shuffled)'; + return 'Dataset(samples: $samples, name: $name, location: $location, shuffled: $shuffled, format: $format, source: $source, args: $args)'; } @@ -262,7 +283,7 @@ abstract mixin class _$DatasetCopyWith<$Res> implements $DatasetCopyWith<$Res> { factory 
_$DatasetCopyWith(_Dataset value, $Res Function(_Dataset) _then) = __$DatasetCopyWithImpl; @override @useResult $Res call({ - List samples, String? name, String? location, bool shuffled + List samples, String? name, String? location, bool shuffled, String format, String? source, Map? args }); @@ -279,13 +300,16 @@ class __$DatasetCopyWithImpl<$Res> /// Create a copy of Dataset /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? samples = null,Object? name = freezed,Object? location = freezed,Object? shuffled = null,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? samples = null,Object? name = freezed,Object? location = freezed,Object? shuffled = null,Object? format = null,Object? source = freezed,Object? args = freezed,}) { return _then(_Dataset( samples: null == samples ? _self._samples : samples // ignore: cast_nullable_to_non_nullable as List,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String?,location: freezed == location ? _self.location : location // ignore: cast_nullable_to_non_nullable as String?,shuffled: null == shuffled ? _self.shuffled : shuffled // ignore: cast_nullable_to_non_nullable -as bool, +as bool,format: null == format ? _self.format : format // ignore: cast_nullable_to_non_nullable +as String,source: freezed == source ? _self.source : source // ignore: cast_nullable_to_non_nullable +as String?,args: freezed == args ? 
_self._args : args // ignore: cast_nullable_to_non_nullable +as Map?, )); } diff --git a/packages/dataset_config_dart/lib/src/models/dataset.g.dart b/packages/dataset_config_dart/lib/src/models/dataset.g.dart index 0b281d8..f7ff71a 100644 --- a/packages/dataset_config_dart/lib/src/models/dataset.g.dart +++ b/packages/dataset_config_dart/lib/src/models/dataset.g.dart @@ -15,11 +15,17 @@ _Dataset _$DatasetFromJson(Map json) => _Dataset( name: json['name'] as String?, location: json['location'] as String?, shuffled: json['shuffled'] as bool? ?? false, + format: json['format'] as String? ?? 'memory', + source: json['source'] as String?, + args: json['args'] as Map?, ); Map _$DatasetToJson(_Dataset instance) => { - 'samples': instance.samples.map((e) => e.toJson()).toList(), + 'samples': instance.samples, 'name': instance.name, 'location': instance.location, 'shuffled': instance.shuffled, + 'format': instance.format, + 'source': instance.source, + 'args': instance.args, }; diff --git a/packages/dataset_config_dart/lib/src/models/eval_log.g.dart b/packages/dataset_config_dart/lib/src/models/eval_log.g.dart index f6fa452..d55efb0 100644 --- a/packages/dataset_config_dart/lib/src/models/eval_log.g.dart +++ b/packages/dataset_config_dart/lib/src/models/eval_log.g.dart @@ -39,17 +39,17 @@ _EvalLog _$EvalLogFromJson(Map json) => _EvalLog( Map _$EvalLogToJson(_EvalLog instance) => { 'version': instance.version, 'status': instance.status, - 'eval': instance.eval.toJson(), - 'plan': instance.plan?.toJson(), - 'results': instance.results?.toJson(), - 'stats': instance.stats?.toJson(), - 'error': instance.error?.toJson(), + 'eval': instance.eval, + 'plan': instance.plan, + 'results': instance.results, + 'stats': instance.stats, + 'error': instance.error, 'invalidated': instance.invalidated, - 'samples': instance.samples?.map((e) => e.toJson()).toList(), - 'reductions': instance.reductions?.map((e) => e.toJson()).toList(), + 'samples': instance.samples, + 'reductions': 
instance.reductions, 'location': instance.location, 'etag': instance.etag, - 'eval_set_info': instance.evalSetInfo?.toJson(), + 'eval_set_info': instance.evalSetInfo, }; _EvalSpec _$EvalSpecFromJson(Map json) => _EvalSpec( @@ -125,15 +125,15 @@ Map _$EvalSpecToJson(_EvalSpec instance) => { 'solver_args': instance.solverArgs, 'solver_args_passed': instance.solverArgsPassed, 'tags': instance.tags, - 'dataset': instance.dataset?.toJson(), + 'dataset': instance.dataset, 'sandbox': instance.sandbox, 'model': instance.model, - 'model_generate_config': instance.modelGenerateConfig?.toJson(), + 'model_generate_config': instance.modelGenerateConfig, 'model_base_url': instance.modelBaseUrl, 'model_args': instance.modelArgs, 'model_roles': instance.modelRoles, - 'config': instance.config.toJson(), - 'revision': instance.revision?.toJson(), + 'config': instance.config, + 'revision': instance.revision, 'packages': instance.packages, 'metadata': instance.metadata, 'scorers': instance.scorers, @@ -249,9 +249,9 @@ _EvalPlan _$EvalPlanFromJson(Map json) => _EvalPlan( Map _$EvalPlanToJson(_EvalPlan instance) => { 'name': instance.name, - 'steps': instance.steps.map((e) => e.toJson()).toList(), - 'finish': instance.finish?.toJson(), - 'config': instance.config.toJson(), + 'steps': instance.steps, + 'finish': instance.finish, + 'config': instance.config, }; _EvalPlanStep _$EvalPlanStepFromJson(Map json) => @@ -291,12 +291,10 @@ Map _$EvalResultsToJson(_EvalResults instance) => { 'total_samples': instance.totalSamples, 'completed_samples': instance.completedSamples, - 'early_stopping': instance.earlyStopping?.toJson(), - 'scores': instance.scores.map((e) => e.toJson()).toList(), + 'early_stopping': instance.earlyStopping, + 'scores': instance.scores, 'metadata': instance.metadata, - 'sample_reductions': instance.sampleReductions - ?.map((e) => e.toJson()) - .toList(), + 'sample_reductions': instance.sampleReductions, }; _EarlyStoppingSummary _$EarlyStoppingSummaryFromJson( @@ -338,7 
+336,7 @@ Map _$EvalScoreToJson(_EvalScore instance) => 'scored_samples': instance.scoredSamples, 'unscored_samples': instance.unscoredSamples, 'params': instance.params, - 'metrics': instance.metrics.map((e) => e.toJson()).toList(), + 'metrics': instance.metrics, 'metadata': instance.metadata, }; @@ -372,7 +370,7 @@ Map _$EvalSampleReductionsToJson( ) => { 'scorer': instance.scorer, 'reducer': instance.reducer, - 'samples': instance.samples.map((e) => e.toJson()).toList(), + 'samples': instance.samples, }; _EvalStats _$EvalStatsFromJson(Map json) => _EvalStats( @@ -389,7 +387,7 @@ Map _$EvalStatsToJson(_EvalStats instance) => { 'started_at': instance.startedAt, 'completed_at': instance.completedAt, - 'model_usage': instance.modelUsage.map((k, e) => MapEntry(k, e.toJson())), + 'model_usage': instance.modelUsage, }; _EvalError _$EvalErrorFromJson(Map json) => _EvalError( @@ -470,22 +468,22 @@ Map _$EvalSampleToJson(_EvalSample instance) => 'sandbox': instance.sandbox, 'files': instance.files, 'setup': instance.setup, - 'messages': instance.messages.map((e) => e.toJson()).toList(), - 'output': instance.output.toJson(), - 'scores': instance.scores?.map((k, e) => MapEntry(k, e.toJson())), + 'messages': instance.messages, + 'output': instance.output, + 'scores': instance.scores, 'store': instance.store, 'events': instance.events, - 'model_usage': instance.modelUsage.map((k, e) => MapEntry(k, e.toJson())), + 'model_usage': instance.modelUsage, 'started_at': instance.startedAt, 'completed_at': instance.completedAt, 'total_time': instance.totalTime, 'working_time': instance.workingTime, 'uuid': instance.uuid, - 'invalidation': instance.invalidation?.toJson(), - 'error': instance.error?.toJson(), - 'error_retries': instance.errorRetries?.map((e) => e.toJson()).toList(), + 'invalidation': instance.invalidation, + 'error': instance.error, + 'error_retries': instance.errorRetries, 'attachments': instance.attachments, - 'limit': instance.limit?.toJson(), + 'limit': 
instance.limit, }; _ModelOutput _$ModelOutputFromJson(Map json) => _ModelOutput( @@ -511,14 +509,14 @@ _ModelOutput _$ModelOutputFromJson(Map json) => _ModelOutput( Map _$ModelOutputToJson(_ModelOutput instance) => { 'model': instance.model, - 'choices': instance.choices.map((e) => e.toJson()).toList(), - 'usage': instance.usage?.toJson(), + 'choices': instance.choices, + 'usage': instance.usage, 'completion': instance.completion, 'stop_reason': instance.stopReason, 'time': instance.time, 'metadata': instance.metadata, 'error': instance.error, - 'message': instance.message?.toJson(), + 'message': instance.message, }; _ChatCompletionChoice _$ChatCompletionChoiceFromJson( @@ -536,9 +534,9 @@ _ChatCompletionChoice _$ChatCompletionChoiceFromJson( Map _$ChatCompletionChoiceToJson( _ChatCompletionChoice instance, ) => { - 'message': instance.message.toJson(), + 'message': instance.message, 'stop_reason': instance.stopReason, - 'logprobs': instance.logprobs?.toJson(), + 'logprobs': instance.logprobs, }; _ModelUsage _$ModelUsageFromJson(Map json) => _ModelUsage( @@ -620,7 +618,7 @@ Map _$ChatMessageAssistantToJson( 'source': instance.source, 'metadata': instance.metadata, 'role': instance.role, - 'tool_calls': instance.toolCalls?.map((e) => e.toJson()).toList(), + 'tool_calls': instance.toolCalls, 'model': instance.model, }; @@ -647,7 +645,7 @@ Map _$ChatMessageToolToJson(ChatMessageTool instance) => 'role': instance.role, 'tool_call_id': instance.toolCallId, 'function': instance.function, - 'error': instance.error?.toJson(), + 'error': instance.error, }; ContentText _$ContentTextFromJson(Map json) => ContentText( @@ -932,7 +930,7 @@ _EvalSetInfo _$EvalSetInfoFromJson(Map json) => _EvalSetInfo( Map _$EvalSetInfoToJson(_EvalSetInfo instance) => { 'eval_set_id': instance.evalSetId, - 'tasks': instance.tasks.map((e) => e.toJson()).toList(), + 'tasks': instance.tasks, }; _EvalSetTask _$EvalSetTaskFromJson(Map json) => _EvalSetTask( diff --git 
a/packages/dataset_config_dart/lib/src/models/eval_set.g.dart b/packages/dataset_config_dart/lib/src/models/eval_set.g.dart index 7b0db55..4e91dab 100644 --- a/packages/dataset_config_dart/lib/src/models/eval_set.g.dart +++ b/packages/dataset_config_dart/lib/src/models/eval_set.g.dart @@ -64,7 +64,7 @@ _EvalSet _$EvalSetFromJson(Map json) => _EvalSet( ); Map _$EvalSetToJson(_EvalSet instance) => { - 'tasks': instance.tasks.map((e) => e.toJson()).toList(), + 'tasks': instance.tasks, 'log_dir': instance.logDir, 'retry_attempts': instance.retryAttempts, 'retry_wait': instance.retryWait, diff --git a/packages/dataset_config_dart/lib/src/models/job.dart b/packages/dataset_config_dart/lib/src/models/job.dart index 800f19c..793946b 100644 --- a/packages/dataset_config_dart/lib/src/models/job.dart +++ b/packages/dataset_config_dart/lib/src/models/job.dart @@ -1,4 +1,5 @@ import 'package:freezed_annotation/freezed_annotation.dart'; +import 'tag_filter.dart'; part 'job.freezed.dart'; part 'job.g.dart'; @@ -16,27 +17,25 @@ part 'job.g.dart'; /// Example YAML: /// ```yaml /// log_dir: ./logs/my_run -/// sandbox: podman +/// sandbox: +/// environment: podman /// max_connections: 10 /// models: /// - google/gemini-2.5-flash /// variants: /// baseline: {} /// context_only: -/// context_files: [./context_files/flutter.md] +/// files: [./context_files/flutter.md] /// tasks: /// dart_qa: /// include-samples: [sample_1] /// -/// # Pass-through to eval_set() -/// eval_set_overrides: +/// # All Inspect AI eval_set() parameters +/// inspect_eval_arguments: /// retry_attempts: 20 /// log_level: debug -/// -/// # Default Task-level overrides applied to every task -/// task_defaults: -/// time_limit: 600 -/// message_limit: 50 +/// task_defaults: +/// time_limit: 600 /// ``` @freezed sealed class Job with _$Job { @@ -45,17 +44,17 @@ sealed class Job with _$Job { // Core job settings // ------------------------------------------------------------------ + /// Human-readable description of 
this job. + String? description, + /// Directory to write evaluation logs to. @JsonKey(name: 'log_dir') required String logDir, - /// Sandbox type: `'local'`, `'docker'`, or `'podman'`. - @JsonKey(name: 'sandbox_type') @Default('local') String sandboxType, - /// Maximum concurrent API connections. @JsonKey(name: 'max_connections') @Default(10) int maxConnections, - /// Models to run. `null` means use defaults from registries. - List? models, + /// Models to run (required). + required List models, /// Named variant map. Keys are variant names, values are config dicts. /// `null` means baseline only. @@ -72,167 +71,29 @@ sealed class Job with _$Job { @JsonKey(name: 'save_examples') @Default(false) bool saveExamples, // ------------------------------------------------------------------ - // Promoted eval_set() parameters (convenience top-level keys) + // Sandbox configuration // ------------------------------------------------------------------ - /// Maximum retry attempts before giving up (defaults to 10). - @JsonKey(name: 'retry_attempts') int? retryAttempts, - - /// Maximum number of retry attempts for failed samples. - @JsonKey(name: 'max_retries') int? maxRetries, - - /// Time in seconds to wait between retry attempts (exponential backoff). - @JsonKey(name: 'retry_wait') double? retryWait, - - /// Reduce `max_connections` at this rate with each retry (default 1.0). - @JsonKey(name: 'retry_connections') double? retryConnections, - - /// Cleanup failed log files after retries (defaults to true). - @JsonKey(name: 'retry_cleanup') bool? retryCleanup, - - /// Fail on sample errors. - /// - /// `0.0–1.0` = fail if proportion exceeds threshold, - /// `>1` = fail if count exceeds threshold. - @JsonKey(name: 'fail_on_error') double? failOnError, - - /// Continue running even if `fail_on_error` condition is met. - @JsonKey(name: 'continue_on_fail') bool? continueOnFail, - - /// Number of times to retry samples on error (default: no retries). 
- @JsonKey(name: 'retry_on_error') int? retryOnError, - - /// Raise task errors for debugging (defaults to false). - @JsonKey(name: 'debug_errors') bool? debugErrors, - - /// Maximum samples to run in parallel (default is `max_connections`). - @JsonKey(name: 'max_samples') int? maxSamples, - - /// Maximum tasks to run in parallel. - @JsonKey(name: 'max_tasks') int? maxTasks, - - /// Maximum subprocesses to run in parallel. - @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, - - /// Maximum sandboxes (per-provider) to run in parallel. - @JsonKey(name: 'max_sandboxes') int? maxSandboxes, - - /// Level for logging to the console (e.g. `"warning"`, `"info"`, `"debug"`). - @JsonKey(name: 'log_level') String? logLevel, - - /// Level for logging to the log file (defaults to `"info"`). - @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, - - /// Format for writing log files (`"eval"` or `"json"`). - @JsonKey(name: 'log_format') String? logFormat, - - /// Tags to associate with this evaluation run. - List? tags, - - /// Metadata to associate with this evaluation run. - Map? metadata, - - /// Trace message interactions with evaluated model to terminal. - bool? trace, - - /// Task display type (defaults to `"full"`). - String? display, - - /// Score output (defaults to true). - bool? score, - - /// Limit evaluated samples (int count or `[start, end]` range). - Object? limit, - - /// Evaluate specific sample(s) from the dataset. - @JsonKey(name: 'sample_id') Object? sampleId, - - /// Shuffle order of samples (pass a seed to make order deterministic). - @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, + /// Sandbox config with keys: environment, parameters, image_prefix. + Map? sandbox, - /// Epochs to repeat samples for and optional score reducer function(s). - Object? epochs, - - /// Tool use approval policies (string or config dict). - Object? approval, - - /// Alternative solver(s) for evaluating task(s) (string or config dict). - Object? 
solver, - - /// Sandbox cleanup after task completes (defaults to true). - @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, - - /// Base URL for communicating with the model API. - @JsonKey(name: 'model_base_url') String? modelBaseUrl, - - /// Model creation arguments. - @JsonKey(name: 'model_args') Map? modelArgs, - - /// Named roles for use in `get_model()`. - @JsonKey(name: 'model_roles') Map? modelRoles, - - /// Task creation arguments. - @JsonKey(name: 'task_args') Map? taskArgs, - - /// Limit on total messages per sample. - @JsonKey(name: 'message_limit') int? messageLimit, - - /// Limit on total tokens per sample. - @JsonKey(name: 'token_limit') int? tokenLimit, - - /// Limit on clock time (in seconds) per sample. - @JsonKey(name: 'time_limit') int? timeLimit, - - /// Limit on working time (in seconds) per sample. - @JsonKey(name: 'working_limit') int? workingLimit, - - /// Limit on total cost (in dollars) per sample. - @JsonKey(name: 'cost_limit') double? costLimit, - - /// JSON file with model prices for cost tracking. - @JsonKey(name: 'model_cost_config') Map? modelCostConfig, - - /// Log detailed samples and scores (defaults to true). - @JsonKey(name: 'log_samples') bool? logSamples, - - /// Log events in realtime (defaults to true). - @JsonKey(name: 'log_realtime') bool? logRealtime, - - /// Log base64-encoded images (defaults to false). - @JsonKey(name: 'log_images') bool? logImages, - - /// Number of samples to buffer before writing log file. - @JsonKey(name: 'log_buffer') int? logBuffer, - - /// Sync sample events for realtime viewing. - @JsonKey(name: 'log_shared') int? logShared, - - /// Directory to bundle logs and viewer into. - @JsonKey(name: 'bundle_dir') String? bundleDir, - - /// Overwrite files in `bundle_dir` (defaults to false). - @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, - - /// Allow log directory to contain unrelated logs (defaults to false). - @JsonKey(name: 'log_dir_allow_dirty') bool? 
logDirAllowDirty, + // ------------------------------------------------------------------ + // Inspect eval arguments (passed through to eval_set()) + // ------------------------------------------------------------------ - /// ID for the eval set. Generated if not specified. - @JsonKey(name: 'eval_set_id') String? evalSetId, + /// All Inspect AI eval_set() parameters, nested under one key. + @JsonKey(name: 'inspect_eval_arguments') + Map? inspectEvalArguments, // ------------------------------------------------------------------ - // Pass-through overrides + // Tag-based filtering // ------------------------------------------------------------------ - /// Additional `eval_set()` kwargs not covered by top-level fields. - /// - /// Any valid `eval_set()` parameter can be specified here and will be - /// merged into the output JSON. Top-level fields take precedence. - @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, + /// Tag filters applied to tasks. + @JsonKey(name: 'task_filters') TagFilter? taskFilters, - /// Default `Task` kwargs applied to every task in this job. - /// - /// Per-task overrides (from `task.yaml`) take precedence. - @JsonKey(name: 'task_defaults') Map? taskDefaults, + /// Tag filters applied to samples. + @JsonKey(name: 'sample_filters') TagFilter? sampleFilters, }) = _Job; factory Job.fromJson(Map json) => _$JobFromJson(json); @@ -240,8 +101,7 @@ sealed class Job with _$Job { /// Per-task configuration within a job. /// -/// Allows overriding which samples run for specific tasks and providing -/// a custom system message. +/// Allows overriding which samples and variants run for specific tasks. @freezed sealed class JobTask with _$JobTask { const factory JobTask({ @@ -254,17 +114,20 @@ sealed class JobTask with _$JobTask { /// Exclude these sample IDs. Mutually exclusive with [includeSamples]. @JsonKey(name: 'exclude_samples') List? excludeSamples, - /// Override system message for this task. - @JsonKey(name: 'system_message') String? 
systemMessage, + /// Only run these variant names for this task. + @JsonKey(name: 'include_variants') List? includeVariants, + + /// Exclude these variant names for this task. + @JsonKey(name: 'exclude_variants') List? excludeVariants, + + /// Per-task argument overrides passed to the task function. + @JsonKey(name: 'args') Map? args, }) = _JobTask; factory JobTask.fromJson(Map json) => _$JobTaskFromJson(json); /// Create a [JobTask] from parsed YAML data. - /// - /// The [taskId] is the map key from the job YAML `tasks:` section. - /// The [data] may be `null` for a simple task reference with no overrides. factory JobTask.fromYaml(String taskId, Map? data) { if (data == null) { return JobTask(id: taskId); @@ -273,7 +136,9 @@ sealed class JobTask with _$JobTask { id: taskId, includeSamples: (data['include-samples'] as List?)?.cast(), excludeSamples: (data['exclude-samples'] as List?)?.cast(), - systemMessage: data['system_message'] as String?, + includeVariants: (data['include-variants'] as List?)?.cast(), + excludeVariants: (data['exclude-variants'] as List?)?.cast(), + args: (data['args'] as Map?)?.cast(), ); } } diff --git a/packages/dataset_config_dart/lib/src/models/job.freezed.dart b/packages/dataset_config_dart/lib/src/models/job.freezed.dart index e249877..14cd2bb 100644 --- a/packages/dataset_config_dart/lib/src/models/job.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/job.freezed.dart @@ -18,80 +18,30 @@ mixin _$Job { // ------------------------------------------------------------------ // Core job settings // ------------------------------------------------------------------ -/// Directory to write evaluation logs to. -@JsonKey(name: 'log_dir') String get logDir;/// Sandbox type: `'local'`, `'docker'`, or `'podman'`. -@JsonKey(name: 'sandbox_type') String get sandboxType;/// Maximum concurrent API connections. -@JsonKey(name: 'max_connections') int get maxConnections;/// Models to run. `null` means use defaults from registries. - List? 
get models;/// Named variant map. Keys are variant names, values are config dicts. +/// Human-readable description of this job. + String? get description;/// Directory to write evaluation logs to. +@JsonKey(name: 'log_dir') String get logDir;/// Maximum concurrent API connections. +@JsonKey(name: 'max_connections') int get maxConnections;/// Models to run (required). + List get models;/// Named variant map. Keys are variant names, values are config dicts. /// `null` means baseline only. Map>? get variants;/// Glob patterns for discovering task directories (relative to dataset root). @JsonKey(name: 'task_paths') List? get taskPaths;/// Per-task configurations with inline overrides. /// `null` means run all tasks. Map? get tasks;/// If `true`, copy final workspace to `/examples/` after each sample. @JsonKey(name: 'save_examples') bool get saveExamples;// ------------------------------------------------------------------ -// Promoted eval_set() parameters (convenience top-level keys) +// Sandbox configuration // ------------------------------------------------------------------ -/// Maximum retry attempts before giving up (defaults to 10). -@JsonKey(name: 'retry_attempts') int? get retryAttempts;/// Maximum number of retry attempts for failed samples. -@JsonKey(name: 'max_retries') int? get maxRetries;/// Time in seconds to wait between retry attempts (exponential backoff). -@JsonKey(name: 'retry_wait') double? get retryWait;/// Reduce `max_connections` at this rate with each retry (default 1.0). -@JsonKey(name: 'retry_connections') double? get retryConnections;/// Cleanup failed log files after retries (defaults to true). -@JsonKey(name: 'retry_cleanup') bool? get retryCleanup;/// Fail on sample errors. -/// -/// `0.0–1.0` = fail if proportion exceeds threshold, -/// `>1` = fail if count exceeds threshold. -@JsonKey(name: 'fail_on_error') double? get failOnError;/// Continue running even if `fail_on_error` condition is met. -@JsonKey(name: 'continue_on_fail') bool? 
get continueOnFail;/// Number of times to retry samples on error (default: no retries). -@JsonKey(name: 'retry_on_error') int? get retryOnError;/// Raise task errors for debugging (defaults to false). -@JsonKey(name: 'debug_errors') bool? get debugErrors;/// Maximum samples to run in parallel (default is `max_connections`). -@JsonKey(name: 'max_samples') int? get maxSamples;/// Maximum tasks to run in parallel. -@JsonKey(name: 'max_tasks') int? get maxTasks;/// Maximum subprocesses to run in parallel. -@JsonKey(name: 'max_subprocesses') int? get maxSubprocesses;/// Maximum sandboxes (per-provider) to run in parallel. -@JsonKey(name: 'max_sandboxes') int? get maxSandboxes;/// Level for logging to the console (e.g. `"warning"`, `"info"`, `"debug"`). -@JsonKey(name: 'log_level') String? get logLevel;/// Level for logging to the log file (defaults to `"info"`). -@JsonKey(name: 'log_level_transcript') String? get logLevelTranscript;/// Format for writing log files (`"eval"` or `"json"`). -@JsonKey(name: 'log_format') String? get logFormat;/// Tags to associate with this evaluation run. - List? get tags;/// Metadata to associate with this evaluation run. - Map? get metadata;/// Trace message interactions with evaluated model to terminal. - bool? get trace;/// Task display type (defaults to `"full"`). - String? get display;/// Score output (defaults to true). - bool? get score;/// Limit evaluated samples (int count or `[start, end]` range). - Object? get limit;/// Evaluate specific sample(s) from the dataset. -@JsonKey(name: 'sample_id') Object? get sampleId;/// Shuffle order of samples (pass a seed to make order deterministic). -@JsonKey(name: 'sample_shuffle') Object? get sampleShuffle;/// Epochs to repeat samples for and optional score reducer function(s). - Object? get epochs;/// Tool use approval policies (string or config dict). - Object? get approval;/// Alternative solver(s) for evaluating task(s) (string or config dict). - Object? 
get solver;/// Sandbox cleanup after task completes (defaults to true). -@JsonKey(name: 'sandbox_cleanup') bool? get sandboxCleanup;/// Base URL for communicating with the model API. -@JsonKey(name: 'model_base_url') String? get modelBaseUrl;/// Model creation arguments. -@JsonKey(name: 'model_args') Map? get modelArgs;/// Named roles for use in `get_model()`. -@JsonKey(name: 'model_roles') Map? get modelRoles;/// Task creation arguments. -@JsonKey(name: 'task_args') Map? get taskArgs;/// Limit on total messages per sample. -@JsonKey(name: 'message_limit') int? get messageLimit;/// Limit on total tokens per sample. -@JsonKey(name: 'token_limit') int? get tokenLimit;/// Limit on clock time (in seconds) per sample. -@JsonKey(name: 'time_limit') int? get timeLimit;/// Limit on working time (in seconds) per sample. -@JsonKey(name: 'working_limit') int? get workingLimit;/// Limit on total cost (in dollars) per sample. -@JsonKey(name: 'cost_limit') double? get costLimit;/// JSON file with model prices for cost tracking. -@JsonKey(name: 'model_cost_config') Map? get modelCostConfig;/// Log detailed samples and scores (defaults to true). -@JsonKey(name: 'log_samples') bool? get logSamples;/// Log events in realtime (defaults to true). -@JsonKey(name: 'log_realtime') bool? get logRealtime;/// Log base64-encoded images (defaults to false). -@JsonKey(name: 'log_images') bool? get logImages;/// Number of samples to buffer before writing log file. -@JsonKey(name: 'log_buffer') int? get logBuffer;/// Sync sample events for realtime viewing. -@JsonKey(name: 'log_shared') int? get logShared;/// Directory to bundle logs and viewer into. -@JsonKey(name: 'bundle_dir') String? get bundleDir;/// Overwrite files in `bundle_dir` (defaults to false). -@JsonKey(name: 'bundle_overwrite') bool? get bundleOverwrite;/// Allow log directory to contain unrelated logs (defaults to false). -@JsonKey(name: 'log_dir_allow_dirty') bool? get logDirAllowDirty;/// ID for the eval set. 
Generated if not specified. -@JsonKey(name: 'eval_set_id') String? get evalSetId;// ------------------------------------------------------------------ -// Pass-through overrides +/// Sandbox config with keys: environment, parameters, image_prefix. + Map? get sandbox;// ------------------------------------------------------------------ +// Inspect eval arguments (passed through to eval_set()) // ------------------------------------------------------------------ -/// Additional `eval_set()` kwargs not covered by top-level fields. -/// -/// Any valid `eval_set()` parameter can be specified here and will be -/// merged into the output JSON. Top-level fields take precedence. -@JsonKey(name: 'eval_set_overrides') Map? get evalSetOverrides;/// Default `Task` kwargs applied to every task in this job. -/// -/// Per-task overrides (from `task.yaml`) take precedence. -@JsonKey(name: 'task_defaults') Map? get taskDefaults; +/// All Inspect AI eval_set() parameters, nested under one key. +@JsonKey(name: 'inspect_eval_arguments') Map? get inspectEvalArguments;// ------------------------------------------------------------------ +// Tag-based filtering +// ------------------------------------------------------------------ +/// Tag filters applied to tasks. +@JsonKey(name: 'task_filters') TagFilter? get taskFilters;/// Tag filters applied to samples. +@JsonKey(name: 'sample_filters') TagFilter? get sampleFilters; /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. 
@JsonKey(includeFromJson: false, includeToJson: false) @@ -104,16 +54,16 @@ $JobCopyWith get copyWith => _$JobCopyWithImpl(this as Job, _$identity @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Job&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.sandboxType, sandboxType) || other.sandboxType == sandboxType)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other.models, models)&&const DeepCollectionEquality().equals(other.variants, variants)&&const DeepCollectionEquality().equals(other.taskPaths, taskPaths)&&const DeepCollectionEquality().equals(other.tasks, tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&(identical(other.retryAttempts, retryAttempts) || other.retryAttempts == retryAttempts)&&(identical(other.maxRetries, maxRetries) || other.maxRetries == maxRetries)&&(identical(other.retryWait, retryWait) || other.retryWait == retryWait)&&(identical(other.retryConnections, retryConnections) || other.retryConnections == retryConnections)&&(identical(other.retryCleanup, retryCleanup) || other.retryCleanup == retryCleanup)&&(identical(other.failOnError, failOnError) || other.failOnError == failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.retryOnError, retryOnError) || other.retryOnError == retryOnError)&&(identical(other.debugErrors, debugErrors) || other.debugErrors == debugErrors)&&(identical(other.maxSamples, maxSamples) || other.maxSamples == maxSamples)&&(identical(other.maxTasks, maxTasks) || other.maxTasks == maxTasks)&&(identical(other.maxSubprocesses, maxSubprocesses) || other.maxSubprocesses == maxSubprocesses)&&(identical(other.maxSandboxes, maxSandboxes) || other.maxSandboxes == maxSandboxes)&&(identical(other.logLevel, logLevel) || other.logLevel == 
logLevel)&&(identical(other.logLevelTranscript, logLevelTranscript) || other.logLevelTranscript == logLevelTranscript)&&(identical(other.logFormat, logFormat) || other.logFormat == logFormat)&&const DeepCollectionEquality().equals(other.tags, tags)&&const DeepCollectionEquality().equals(other.metadata, metadata)&&(identical(other.trace, trace) || other.trace == trace)&&(identical(other.display, display) || other.display == display)&&(identical(other.score, score) || other.score == score)&&const DeepCollectionEquality().equals(other.limit, limit)&&const DeepCollectionEquality().equals(other.sampleId, sampleId)&&const DeepCollectionEquality().equals(other.sampleShuffle, sampleShuffle)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.solver, solver)&&(identical(other.sandboxCleanup, sandboxCleanup) || other.sandboxCleanup == sandboxCleanup)&&(identical(other.modelBaseUrl, modelBaseUrl) || other.modelBaseUrl == modelBaseUrl)&&const DeepCollectionEquality().equals(other.modelArgs, modelArgs)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.taskArgs, taskArgs)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.modelCostConfig, modelCostConfig)&&(identical(other.logSamples, logSamples) || other.logSamples == logSamples)&&(identical(other.logRealtime, logRealtime) || other.logRealtime == logRealtime)&&(identical(other.logImages, logImages) || other.logImages == logImages)&&(identical(other.logBuffer, 
logBuffer) || other.logBuffer == logBuffer)&&(identical(other.logShared, logShared) || other.logShared == logShared)&&(identical(other.bundleDir, bundleDir) || other.bundleDir == bundleDir)&&(identical(other.bundleOverwrite, bundleOverwrite) || other.bundleOverwrite == bundleOverwrite)&&(identical(other.logDirAllowDirty, logDirAllowDirty) || other.logDirAllowDirty == logDirAllowDirty)&&(identical(other.evalSetId, evalSetId) || other.evalSetId == evalSetId)&&const DeepCollectionEquality().equals(other.evalSetOverrides, evalSetOverrides)&&const DeepCollectionEquality().equals(other.taskDefaults, taskDefaults)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Job&&(identical(other.description, description) || other.description == description)&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other.models, models)&&const DeepCollectionEquality().equals(other.variants, variants)&&const DeepCollectionEquality().equals(other.taskPaths, taskPaths)&&const DeepCollectionEquality().equals(other.tasks, tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.inspectEvalArguments, inspectEvalArguments)&&(identical(other.taskFilters, taskFilters) || other.taskFilters == taskFilters)&&(identical(other.sampleFilters, sampleFilters) || other.sampleFilters == sampleFilters)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hashAll([runtimeType,logDir,sandboxType,maxConnections,const DeepCollectionEquality().hash(models),const DeepCollectionEquality().hash(variants),const DeepCollectionEquality().hash(taskPaths),const 
DeepCollectionEquality().hash(tasks),saveExamples,retryAttempts,maxRetries,retryWait,retryConnections,retryCleanup,failOnError,continueOnFail,retryOnError,debugErrors,maxSamples,maxTasks,maxSubprocesses,maxSandboxes,logLevel,logLevelTranscript,logFormat,const DeepCollectionEquality().hash(tags),const DeepCollectionEquality().hash(metadata),trace,display,score,const DeepCollectionEquality().hash(limit),const DeepCollectionEquality().hash(sampleId),const DeepCollectionEquality().hash(sampleShuffle),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(solver),sandboxCleanup,modelBaseUrl,const DeepCollectionEquality().hash(modelArgs),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(taskArgs),messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(modelCostConfig),logSamples,logRealtime,logImages,logBuffer,logShared,bundleDir,bundleOverwrite,logDirAllowDirty,evalSetId,const DeepCollectionEquality().hash(evalSetOverrides),const DeepCollectionEquality().hash(taskDefaults)]); +int get hashCode => Object.hash(runtimeType,description,logDir,maxConnections,const DeepCollectionEquality().hash(models),const DeepCollectionEquality().hash(variants),const DeepCollectionEquality().hash(taskPaths),const DeepCollectionEquality().hash(tasks),saveExamples,const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(inspectEvalArguments),taskFilters,sampleFilters); @override String toString() { - return 'Job(logDir: $logDir, sandboxType: $sandboxType, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, retryAttempts: $retryAttempts, maxRetries: $maxRetries, retryWait: $retryWait, retryConnections: $retryConnections, retryCleanup: $retryCleanup, failOnError: $failOnError, continueOnFail: $continueOnFail, retryOnError: $retryOnError, 
debugErrors: $debugErrors, maxSamples: $maxSamples, maxTasks: $maxTasks, maxSubprocesses: $maxSubprocesses, maxSandboxes: $maxSandboxes, logLevel: $logLevel, logLevelTranscript: $logLevelTranscript, logFormat: $logFormat, tags: $tags, metadata: $metadata, trace: $trace, display: $display, score: $score, limit: $limit, sampleId: $sampleId, sampleShuffle: $sampleShuffle, epochs: $epochs, approval: $approval, solver: $solver, sandboxCleanup: $sandboxCleanup, modelBaseUrl: $modelBaseUrl, modelArgs: $modelArgs, modelRoles: $modelRoles, taskArgs: $taskArgs, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, modelCostConfig: $modelCostConfig, logSamples: $logSamples, logRealtime: $logRealtime, logImages: $logImages, logBuffer: $logBuffer, logShared: $logShared, bundleDir: $bundleDir, bundleOverwrite: $bundleOverwrite, logDirAllowDirty: $logDirAllowDirty, evalSetId: $evalSetId, evalSetOverrides: $evalSetOverrides, taskDefaults: $taskDefaults)'; + return 'Job(description: $description, logDir: $logDir, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, sandbox: $sandbox, inspectEvalArguments: $inspectEvalArguments, taskFilters: $taskFilters, sampleFilters: $sampleFilters)'; } @@ -124,11 +74,11 @@ abstract mixin class $JobCopyWith<$Res> { factory $JobCopyWith(Job value, $Res Function(Job) _then) = _$JobCopyWithImpl; @useResult $Res call({ -@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'sandbox_type') String sandboxType,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples,@JsonKey(name: 'retry_attempts') int? retryAttempts,@JsonKey(name: 'max_retries') int? maxRetries,@JsonKey(name: 'retry_wait') double? retryWait,@JsonKey(name: 'retry_connections') double? 
retryConnections,@JsonKey(name: 'retry_cleanup') bool? retryCleanup,@JsonKey(name: 'fail_on_error') double? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'retry_on_error') int? retryOnError,@JsonKey(name: 'debug_errors') bool? debugErrors,@JsonKey(name: 'max_samples') int? maxSamples,@JsonKey(name: 'max_tasks') int? maxTasks,@JsonKey(name: 'max_subprocesses') int? maxSubprocesses,@JsonKey(name: 'max_sandboxes') int? maxSandboxes,@JsonKey(name: 'log_level') String? logLevel,@JsonKey(name: 'log_level_transcript') String? logLevelTranscript,@JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit,@JsonKey(name: 'sample_id') Object? sampleId,@JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver,@JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,@JsonKey(name: 'model_base_url') String? modelBaseUrl,@JsonKey(name: 'model_args') Map? modelArgs,@JsonKey(name: 'model_roles') Map? modelRoles,@JsonKey(name: 'task_args') Map? taskArgs,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'model_cost_config') Map? modelCostConfig,@JsonKey(name: 'log_samples') bool? logSamples,@JsonKey(name: 'log_realtime') bool? logRealtime,@JsonKey(name: 'log_images') bool? logImages,@JsonKey(name: 'log_buffer') int? logBuffer,@JsonKey(name: 'log_shared') int? logShared,@JsonKey(name: 'bundle_dir') String? bundleDir,@JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,@JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,@JsonKey(name: 'eval_set_id') String? evalSetId,@JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,@JsonKey(name: 'task_defaults') Map? taskDefaults + String? 
description,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'max_connections') int maxConnections, List models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox,@JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters }); - +$TagFilterCopyWith<$Res>? get taskFilters;$TagFilterCopyWith<$Res>? get sampleFilters; } /// @nodoc @@ -141,63 +91,48 @@ class _$JobCopyWithImpl<$Res> /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? logDir = null,Object? sandboxType = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? retryAttempts = freezed,Object? maxRetries = freezed,Object? retryWait = freezed,Object? retryConnections = freezed,Object? retryCleanup = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? retryOnError = freezed,Object? debugErrors = freezed,Object? maxSamples = freezed,Object? maxTasks = freezed,Object? maxSubprocesses = freezed,Object? maxSandboxes = freezed,Object? logLevel = freezed,Object? logLevelTranscript = freezed,Object? logFormat = freezed,Object? tags = freezed,Object? metadata = freezed,Object? trace = freezed,Object? display = freezed,Object? score = freezed,Object? limit = freezed,Object? sampleId = freezed,Object? sampleShuffle = freezed,Object? epochs = freezed,Object? approval = freezed,Object? solver = freezed,Object? sandboxCleanup = freezed,Object? modelBaseUrl = freezed,Object? modelArgs = freezed,Object? modelRoles = freezed,Object? taskArgs = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? 
costLimit = freezed,Object? modelCostConfig = freezed,Object? logSamples = freezed,Object? logRealtime = freezed,Object? logImages = freezed,Object? logBuffer = freezed,Object? logShared = freezed,Object? bundleDir = freezed,Object? bundleOverwrite = freezed,Object? logDirAllowDirty = freezed,Object? evalSetId = freezed,Object? evalSetOverrides = freezed,Object? taskDefaults = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? description = freezed,Object? logDir = null,Object? maxConnections = null,Object? models = null,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? sandbox = freezed,Object? inspectEvalArguments = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { return _then(_self.copyWith( -logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable -as String,sandboxType: null == sandboxType ? _self.sandboxType : sandboxType // ignore: cast_nullable_to_non_nullable +description: freezed == description ? _self.description : description // ignore: cast_nullable_to_non_nullable +as String?,logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable as String,maxConnections: null == maxConnections ? _self.maxConnections : maxConnections // ignore: cast_nullable_to_non_nullable -as int,models: freezed == models ? _self.models : models // ignore: cast_nullable_to_non_nullable -as List?,variants: freezed == variants ? _self.variants : variants // ignore: cast_nullable_to_non_nullable +as int,models: null == models ? _self.models : models // ignore: cast_nullable_to_non_nullable +as List,variants: freezed == variants ? _self.variants : variants // ignore: cast_nullable_to_non_nullable as Map>?,taskPaths: freezed == taskPaths ? _self.taskPaths : taskPaths // ignore: cast_nullable_to_non_nullable as List?,tasks: freezed == tasks ? 
_self.tasks : tasks // ignore: cast_nullable_to_non_nullable as Map?,saveExamples: null == saveExamples ? _self.saveExamples : saveExamples // ignore: cast_nullable_to_non_nullable -as bool,retryAttempts: freezed == retryAttempts ? _self.retryAttempts : retryAttempts // ignore: cast_nullable_to_non_nullable -as int?,maxRetries: freezed == maxRetries ? _self.maxRetries : maxRetries // ignore: cast_nullable_to_non_nullable -as int?,retryWait: freezed == retryWait ? _self.retryWait : retryWait // ignore: cast_nullable_to_non_nullable -as double?,retryConnections: freezed == retryConnections ? _self.retryConnections : retryConnections // ignore: cast_nullable_to_non_nullable -as double?,retryCleanup: freezed == retryCleanup ? _self.retryCleanup : retryCleanup // ignore: cast_nullable_to_non_nullable -as bool?,failOnError: freezed == failOnError ? _self.failOnError : failOnError // ignore: cast_nullable_to_non_nullable -as double?,continueOnFail: freezed == continueOnFail ? _self.continueOnFail : continueOnFail // ignore: cast_nullable_to_non_nullable -as bool?,retryOnError: freezed == retryOnError ? _self.retryOnError : retryOnError // ignore: cast_nullable_to_non_nullable -as int?,debugErrors: freezed == debugErrors ? _self.debugErrors : debugErrors // ignore: cast_nullable_to_non_nullable -as bool?,maxSamples: freezed == maxSamples ? _self.maxSamples : maxSamples // ignore: cast_nullable_to_non_nullable -as int?,maxTasks: freezed == maxTasks ? _self.maxTasks : maxTasks // ignore: cast_nullable_to_non_nullable -as int?,maxSubprocesses: freezed == maxSubprocesses ? _self.maxSubprocesses : maxSubprocesses // ignore: cast_nullable_to_non_nullable -as int?,maxSandboxes: freezed == maxSandboxes ? _self.maxSandboxes : maxSandboxes // ignore: cast_nullable_to_non_nullable -as int?,logLevel: freezed == logLevel ? _self.logLevel : logLevel // ignore: cast_nullable_to_non_nullable -as String?,logLevelTranscript: freezed == logLevelTranscript ? 
_self.logLevelTranscript : logLevelTranscript // ignore: cast_nullable_to_non_nullable -as String?,logFormat: freezed == logFormat ? _self.logFormat : logFormat // ignore: cast_nullable_to_non_nullable -as String?,tags: freezed == tags ? _self.tags : tags // ignore: cast_nullable_to_non_nullable -as List?,metadata: freezed == metadata ? _self.metadata : metadata // ignore: cast_nullable_to_non_nullable -as Map?,trace: freezed == trace ? _self.trace : trace // ignore: cast_nullable_to_non_nullable -as bool?,display: freezed == display ? _self.display : display // ignore: cast_nullable_to_non_nullable -as String?,score: freezed == score ? _self.score : score // ignore: cast_nullable_to_non_nullable -as bool?,limit: freezed == limit ? _self.limit : limit ,sampleId: freezed == sampleId ? _self.sampleId : sampleId ,sampleShuffle: freezed == sampleShuffle ? _self.sampleShuffle : sampleShuffle ,epochs: freezed == epochs ? _self.epochs : epochs ,approval: freezed == approval ? _self.approval : approval ,solver: freezed == solver ? _self.solver : solver ,sandboxCleanup: freezed == sandboxCleanup ? _self.sandboxCleanup : sandboxCleanup // ignore: cast_nullable_to_non_nullable -as bool?,modelBaseUrl: freezed == modelBaseUrl ? _self.modelBaseUrl : modelBaseUrl // ignore: cast_nullable_to_non_nullable -as String?,modelArgs: freezed == modelArgs ? _self.modelArgs : modelArgs // ignore: cast_nullable_to_non_nullable -as Map?,modelRoles: freezed == modelRoles ? _self.modelRoles : modelRoles // ignore: cast_nullable_to_non_nullable -as Map?,taskArgs: freezed == taskArgs ? _self.taskArgs : taskArgs // ignore: cast_nullable_to_non_nullable -as Map?,messageLimit: freezed == messageLimit ? _self.messageLimit : messageLimit // ignore: cast_nullable_to_non_nullable -as int?,tokenLimit: freezed == tokenLimit ? _self.tokenLimit : tokenLimit // ignore: cast_nullable_to_non_nullable -as int?,timeLimit: freezed == timeLimit ? 
_self.timeLimit : timeLimit // ignore: cast_nullable_to_non_nullable -as int?,workingLimit: freezed == workingLimit ? _self.workingLimit : workingLimit // ignore: cast_nullable_to_non_nullable -as int?,costLimit: freezed == costLimit ? _self.costLimit : costLimit // ignore: cast_nullable_to_non_nullable -as double?,modelCostConfig: freezed == modelCostConfig ? _self.modelCostConfig : modelCostConfig // ignore: cast_nullable_to_non_nullable -as Map?,logSamples: freezed == logSamples ? _self.logSamples : logSamples // ignore: cast_nullable_to_non_nullable -as bool?,logRealtime: freezed == logRealtime ? _self.logRealtime : logRealtime // ignore: cast_nullable_to_non_nullable -as bool?,logImages: freezed == logImages ? _self.logImages : logImages // ignore: cast_nullable_to_non_nullable -as bool?,logBuffer: freezed == logBuffer ? _self.logBuffer : logBuffer // ignore: cast_nullable_to_non_nullable -as int?,logShared: freezed == logShared ? _self.logShared : logShared // ignore: cast_nullable_to_non_nullable -as int?,bundleDir: freezed == bundleDir ? _self.bundleDir : bundleDir // ignore: cast_nullable_to_non_nullable -as String?,bundleOverwrite: freezed == bundleOverwrite ? _self.bundleOverwrite : bundleOverwrite // ignore: cast_nullable_to_non_nullable -as bool?,logDirAllowDirty: freezed == logDirAllowDirty ? _self.logDirAllowDirty : logDirAllowDirty // ignore: cast_nullable_to_non_nullable -as bool?,evalSetId: freezed == evalSetId ? _self.evalSetId : evalSetId // ignore: cast_nullable_to_non_nullable -as String?,evalSetOverrides: freezed == evalSetOverrides ? _self.evalSetOverrides : evalSetOverrides // ignore: cast_nullable_to_non_nullable -as Map?,taskDefaults: freezed == taskDefaults ? _self.taskDefaults : taskDefaults // ignore: cast_nullable_to_non_nullable -as Map?, +as bool,sandbox: freezed == sandbox ? _self.sandbox : sandbox // ignore: cast_nullable_to_non_nullable +as Map?,inspectEvalArguments: freezed == inspectEvalArguments ? 
_self.inspectEvalArguments : inspectEvalArguments // ignore: cast_nullable_to_non_nullable +as Map?,taskFilters: freezed == taskFilters ? _self.taskFilters : taskFilters // ignore: cast_nullable_to_non_nullable +as TagFilter?,sampleFilters: freezed == sampleFilters ? _self.sampleFilters : sampleFilters // ignore: cast_nullable_to_non_nullable +as TagFilter?, )); } +/// Create a copy of Job +/// with the given fields replaced by the non-null parameter values. +@override +@pragma('vm:prefer-inline') +$TagFilterCopyWith<$Res>? get taskFilters { + if (_self.taskFilters == null) { + return null; + } + + return $TagFilterCopyWith<$Res>(_self.taskFilters!, (value) { + return _then(_self.copyWith(taskFilters: value)); + }); +}/// Create a copy of Job +/// with the given fields replaced by the non-null parameter values. +@override +@pragma('vm:prefer-inline') +$TagFilterCopyWith<$Res>? get sampleFilters { + if (_self.sampleFilters == null) { + return null; + } + return $TagFilterCopyWith<$Res>(_self.sampleFilters!, (value) { + return _then(_self.copyWith(sampleFilters: value)); + }); +} } @@ -276,10 +211,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function(@JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? 
debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List models, Map>? 
variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Job() when $default != null: -return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults);case _: +return $default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);case _: return orElse(); } @@ -297,10 +232,10 @@ return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models /// } /// ``` -@optionalTypeArgs TResult when(TResult Function(@JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, 
List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? 
logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters) $default,) {final _that = this; switch (_that) { case _Job(): -return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults);} +return 
$default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);} } /// A variant of `when` that fallback to returning `null` /// @@ -314,10 +249,10 @@ return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function(@JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'sandbox_type') String sandboxType, @JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, @JsonKey(name: 'retry_attempts') int? retryAttempts, @JsonKey(name: 'max_retries') int? maxRetries, @JsonKey(name: 'retry_wait') double? retryWait, @JsonKey(name: 'retry_connections') double? retryConnections, @JsonKey(name: 'retry_cleanup') bool? retryCleanup, @JsonKey(name: 'fail_on_error') double? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'retry_on_error') int? retryOnError, @JsonKey(name: 'debug_errors') bool? debugErrors, @JsonKey(name: 'max_samples') int? maxSamples, @JsonKey(name: 'max_tasks') int? maxTasks, @JsonKey(name: 'max_subprocesses') int? maxSubprocesses, @JsonKey(name: 'max_sandboxes') int? maxSandboxes, @JsonKey(name: 'log_level') String? logLevel, @JsonKey(name: 'log_level_transcript') String? logLevelTranscript, @JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit, @JsonKey(name: 'sample_id') Object? sampleId, @JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver, @JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup, @JsonKey(name: 'model_base_url') String? modelBaseUrl, @JsonKey(name: 'model_args') Map? 
modelArgs, @JsonKey(name: 'model_roles') Map? modelRoles, @JsonKey(name: 'task_args') Map? taskArgs, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'model_cost_config') Map? modelCostConfig, @JsonKey(name: 'log_samples') bool? logSamples, @JsonKey(name: 'log_realtime') bool? logRealtime, @JsonKey(name: 'log_images') bool? logImages, @JsonKey(name: 'log_buffer') int? logBuffer, @JsonKey(name: 'log_shared') int? logShared, @JsonKey(name: 'bundle_dir') String? bundleDir, @JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty, @JsonKey(name: 'eval_set_id') String? evalSetId, @JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides, @JsonKey(name: 'task_defaults') Map? taskDefaults)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( String? description, @JsonKey(name: 'log_dir') String logDir, @JsonKey(name: 'max_connections') int maxConnections, List models, Map>? variants, @JsonKey(name: 'task_paths') List? taskPaths, Map? tasks, @JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments, @JsonKey(name: 'task_filters') TagFilter? taskFilters, @JsonKey(name: 'sample_filters') TagFilter? sampleFilters)? 
$default,) {final _that = this; switch (_that) { case _Job() when $default != null: -return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.retryAttempts,_that.maxRetries,_that.retryWait,_that.retryConnections,_that.retryCleanup,_that.failOnError,_that.continueOnFail,_that.retryOnError,_that.debugErrors,_that.maxSamples,_that.maxTasks,_that.maxSubprocesses,_that.maxSandboxes,_that.logLevel,_that.logLevelTranscript,_that.logFormat,_that.tags,_that.metadata,_that.trace,_that.display,_that.score,_that.limit,_that.sampleId,_that.sampleShuffle,_that.epochs,_that.approval,_that.solver,_that.sandboxCleanup,_that.modelBaseUrl,_that.modelArgs,_that.modelRoles,_that.taskArgs,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.modelCostConfig,_that.logSamples,_that.logRealtime,_that.logImages,_that.logBuffer,_that.logShared,_that.bundleDir,_that.bundleOverwrite,_that.logDirAllowDirty,_that.evalSetId,_that.evalSetOverrides,_that.taskDefaults);case _: +return $default(_that.description,_that.logDir,_that.maxConnections,_that.models,_that.variants,_that.taskPaths,_that.tasks,_that.saveExamples,_that.sandbox,_that.inspectEvalArguments,_that.taskFilters,_that.sampleFilters);case _: return null; } @@ -329,27 +264,25 @@ return $default(_that.logDir,_that.sandboxType,_that.maxConnections,_that.models @JsonSerializable() class _Job implements Job { - const _Job({@JsonKey(name: 'log_dir') required this.logDir, @JsonKey(name: 'sandbox_type') this.sandboxType = 'local', @JsonKey(name: 'max_connections') this.maxConnections = 10, final List? models, final Map>? variants, @JsonKey(name: 'task_paths') final List? taskPaths, final Map? 
tasks, @JsonKey(name: 'save_examples') this.saveExamples = false, @JsonKey(name: 'retry_attempts') this.retryAttempts, @JsonKey(name: 'max_retries') this.maxRetries, @JsonKey(name: 'retry_wait') this.retryWait, @JsonKey(name: 'retry_connections') this.retryConnections, @JsonKey(name: 'retry_cleanup') this.retryCleanup, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'retry_on_error') this.retryOnError, @JsonKey(name: 'debug_errors') this.debugErrors, @JsonKey(name: 'max_samples') this.maxSamples, @JsonKey(name: 'max_tasks') this.maxTasks, @JsonKey(name: 'max_subprocesses') this.maxSubprocesses, @JsonKey(name: 'max_sandboxes') this.maxSandboxes, @JsonKey(name: 'log_level') this.logLevel, @JsonKey(name: 'log_level_transcript') this.logLevelTranscript, @JsonKey(name: 'log_format') this.logFormat, final List? tags, final Map? metadata, this.trace, this.display, this.score, this.limit, @JsonKey(name: 'sample_id') this.sampleId, @JsonKey(name: 'sample_shuffle') this.sampleShuffle, this.epochs, this.approval, this.solver, @JsonKey(name: 'sandbox_cleanup') this.sandboxCleanup, @JsonKey(name: 'model_base_url') this.modelBaseUrl, @JsonKey(name: 'model_args') final Map? modelArgs, @JsonKey(name: 'model_roles') final Map? modelRoles, @JsonKey(name: 'task_args') final Map? taskArgs, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'model_cost_config') final Map? 
modelCostConfig, @JsonKey(name: 'log_samples') this.logSamples, @JsonKey(name: 'log_realtime') this.logRealtime, @JsonKey(name: 'log_images') this.logImages, @JsonKey(name: 'log_buffer') this.logBuffer, @JsonKey(name: 'log_shared') this.logShared, @JsonKey(name: 'bundle_dir') this.bundleDir, @JsonKey(name: 'bundle_overwrite') this.bundleOverwrite, @JsonKey(name: 'log_dir_allow_dirty') this.logDirAllowDirty, @JsonKey(name: 'eval_set_id') this.evalSetId, @JsonKey(name: 'eval_set_overrides') final Map? evalSetOverrides, @JsonKey(name: 'task_defaults') final Map? taskDefaults}): _models = models,_variants = variants,_taskPaths = taskPaths,_tasks = tasks,_tags = tags,_metadata = metadata,_modelArgs = modelArgs,_modelRoles = modelRoles,_taskArgs = taskArgs,_modelCostConfig = modelCostConfig,_evalSetOverrides = evalSetOverrides,_taskDefaults = taskDefaults; + const _Job({this.description, @JsonKey(name: 'log_dir') required this.logDir, @JsonKey(name: 'max_connections') this.maxConnections = 10, required final List models, final Map>? variants, @JsonKey(name: 'task_paths') final List? taskPaths, final Map? tasks, @JsonKey(name: 'save_examples') this.saveExamples = false, final Map? sandbox, @JsonKey(name: 'inspect_eval_arguments') final Map? inspectEvalArguments, @JsonKey(name: 'task_filters') this.taskFilters, @JsonKey(name: 'sample_filters') this.sampleFilters}): _models = models,_variants = variants,_taskPaths = taskPaths,_tasks = tasks,_sandbox = sandbox,_inspectEvalArguments = inspectEvalArguments; factory _Job.fromJson(Map json) => _$JobFromJson(json); // ------------------------------------------------------------------ // Core job settings // ------------------------------------------------------------------ +/// Human-readable description of this job. +@override final String? description; /// Directory to write evaluation logs to. @override@JsonKey(name: 'log_dir') final String logDir; -/// Sandbox type: `'local'`, `'docker'`, or `'podman'`. 
-@override@JsonKey(name: 'sandbox_type') final String sandboxType;
 /// Maximum concurrent API connections.
 @override@JsonKey(name: 'max_connections') final int maxConnections;
-/// Models to run. `null` means use defaults from registries.
- final List? _models;
-/// Models to run. `null` means use defaults from registries.
-@override List? get models {
-  final value = _models;
-  if (value == null) return null;
+/// Models to run (required).
+ final List _models;
+/// Models to run (required).
+@override List get models {
   if (_models is EqualUnmodifiableListView) return _models;
   // ignore: implicit_dynamic_type
-  return EqualUnmodifiableListView(value);
+  return EqualUnmodifiableListView(_models);
 }
 /// Named variant map. Keys are variant names, values are config dicts.
@@ -392,197 +325,46 @@ class _Job implements Job {
 /// If `true`, copy final workspace to `/examples/` after each sample.
 @override@JsonKey(name: 'save_examples') final bool saveExamples;
 // ------------------------------------------------------------------
-// Promoted eval_set() parameters (convenience top-level keys)
+// Sandbox configuration
 // ------------------------------------------------------------------
-/// Maximum retry attempts before giving up (defaults to 10).
-@override@JsonKey(name: 'retry_attempts') final int? retryAttempts;
-/// Maximum number of retry attempts for failed samples.
-@override@JsonKey(name: 'max_retries') final int? maxRetries;
-/// Time in seconds to wait between retry attempts (exponential backoff).
-@override@JsonKey(name: 'retry_wait') final double? retryWait;
-/// Reduce `max_connections` at this rate with each retry (default 1.0).
-@override@JsonKey(name: 'retry_connections') final double? retryConnections;
-/// Cleanup failed log files after retries (defaults to true).
-@override@JsonKey(name: 'retry_cleanup') final bool? retryCleanup;
-/// Fail on sample errors.
-///
-/// `0.0–1.0` = fail if proportion exceeds threshold,
-/// `>1` = fail if count exceeds threshold.
-@override@JsonKey(name: 'fail_on_error') final double? failOnError;
-/// Continue running even if `fail_on_error` condition is met.
-@override@JsonKey(name: 'continue_on_fail') final bool? continueOnFail;
-/// Number of times to retry samples on error (default: no retries).
-@override@JsonKey(name: 'retry_on_error') final int? retryOnError;
-/// Raise task errors for debugging (defaults to false).
-@override@JsonKey(name: 'debug_errors') final bool? debugErrors;
-/// Maximum samples to run in parallel (default is `max_connections`).
-@override@JsonKey(name: 'max_samples') final int? maxSamples;
-/// Maximum tasks to run in parallel.
-@override@JsonKey(name: 'max_tasks') final int? maxTasks;
-/// Maximum subprocesses to run in parallel.
-@override@JsonKey(name: 'max_subprocesses') final int? maxSubprocesses;
-/// Maximum sandboxes (per-provider) to run in parallel.
-@override@JsonKey(name: 'max_sandboxes') final int? maxSandboxes;
-/// Level for logging to the console (e.g. `"warning"`, `"info"`, `"debug"`).
-@override@JsonKey(name: 'log_level') final String? logLevel;
-/// Level for logging to the log file (defaults to `"info"`).
-@override@JsonKey(name: 'log_level_transcript') final String? logLevelTranscript;
-/// Format for writing log files (`"eval"` or `"json"`).
-@override@JsonKey(name: 'log_format') final String? logFormat;
-/// Tags to associate with this evaluation run.
- final List? _tags;
-/// Tags to associate with this evaluation run.
-@override List? get tags {
-  final value = _tags;
-  if (value == null) return null;
-  if (_tags is EqualUnmodifiableListView) return _tags;
-  // ignore: implicit_dynamic_type
-  return EqualUnmodifiableListView(value);
-}
-
-/// Metadata to associate with this evaluation run.
- final Map? _metadata;
-/// Metadata to associate with this evaluation run.
-@override Map? get metadata {
-  final value = _metadata;
-  if (value == null) return null;
-  if (_metadata is EqualUnmodifiableMapView) return _metadata;
-  // ignore: implicit_dynamic_type
-  return EqualUnmodifiableMapView(value);
-}
-
-/// Trace message interactions with evaluated model to terminal.
-@override final bool? trace;
-/// Task display type (defaults to `"full"`).
-@override final String? display;
-/// Score output (defaults to true).
-@override final bool? score;
-/// Limit evaluated samples (int count or `[start, end]` range).
-@override final Object? limit;
-/// Evaluate specific sample(s) from the dataset.
-@override@JsonKey(name: 'sample_id') final Object? sampleId;
-/// Shuffle order of samples (pass a seed to make order deterministic).
-@override@JsonKey(name: 'sample_shuffle') final Object? sampleShuffle;
-/// Epochs to repeat samples for and optional score reducer function(s).
-@override final Object? epochs;
-/// Tool use approval policies (string or config dict).
-@override final Object? approval;
-/// Alternative solver(s) for evaluating task(s) (string or config dict).
-@override final Object? solver;
-/// Sandbox cleanup after task completes (defaults to true).
-@override@JsonKey(name: 'sandbox_cleanup') final bool? sandboxCleanup;
-/// Base URL for communicating with the model API.
-@override@JsonKey(name: 'model_base_url') final String? modelBaseUrl;
-/// Model creation arguments.
- final Map? _modelArgs;
-/// Model creation arguments.
-@override@JsonKey(name: 'model_args') Map? get modelArgs {
-  final value = _modelArgs;
-  if (value == null) return null;
-  if (_modelArgs is EqualUnmodifiableMapView) return _modelArgs;
-  // ignore: implicit_dynamic_type
-  return EqualUnmodifiableMapView(value);
-}
-
-/// Named roles for use in `get_model()`.
- final Map? _modelRoles;
-/// Named roles for use in `get_model()`.
-@override@JsonKey(name: 'model_roles') Map? get modelRoles {
-  final value = _modelRoles;
-  if (value == null) return null;
-  if (_modelRoles is EqualUnmodifiableMapView) return _modelRoles;
-  // ignore: implicit_dynamic_type
-  return EqualUnmodifiableMapView(value);
-}
-
-/// Task creation arguments.
- final Map? _taskArgs;
-/// Task creation arguments.
-@override@JsonKey(name: 'task_args') Map? get taskArgs {
-  final value = _taskArgs;
-  if (value == null) return null;
-  if (_taskArgs is EqualUnmodifiableMapView) return _taskArgs;
-  // ignore: implicit_dynamic_type
-  return EqualUnmodifiableMapView(value);
-}
-
-/// Limit on total messages per sample.
-@override@JsonKey(name: 'message_limit') final int? messageLimit;
-/// Limit on total tokens per sample.
-@override@JsonKey(name: 'token_limit') final int? tokenLimit;
-/// Limit on clock time (in seconds) per sample.
-@override@JsonKey(name: 'time_limit') final int? timeLimit;
-/// Limit on working time (in seconds) per sample.
-@override@JsonKey(name: 'working_limit') final int? workingLimit;
-/// Limit on total cost (in dollars) per sample.
-@override@JsonKey(name: 'cost_limit') final double? costLimit;
-/// JSON file with model prices for cost tracking.
- final Map? _modelCostConfig;
-/// JSON file with model prices for cost tracking.
-@override@JsonKey(name: 'model_cost_config') Map? get modelCostConfig {
-  final value = _modelCostConfig;
+/// Sandbox config with keys: environment, parameters, image_prefix.
+ final Map? _sandbox;
+// ------------------------------------------------------------------
+// Sandbox configuration
+// ------------------------------------------------------------------
+/// Sandbox config with keys: environment, parameters, image_prefix.
+@override Map? get sandbox {
+  final value = _sandbox;
   if (value == null) return null;
-  if (_modelCostConfig is EqualUnmodifiableMapView) return _modelCostConfig;
+  if (_sandbox is EqualUnmodifiableMapView) return _sandbox;
   // ignore: implicit_dynamic_type
   return EqualUnmodifiableMapView(value);
 }
-/// Log detailed samples and scores (defaults to true).
-@override@JsonKey(name: 'log_samples') final bool? logSamples;
-/// Log events in realtime (defaults to true).
-@override@JsonKey(name: 'log_realtime') final bool? logRealtime;
-/// Log base64-encoded images (defaults to false).
-@override@JsonKey(name: 'log_images') final bool? logImages;
-/// Number of samples to buffer before writing log file.
-@override@JsonKey(name: 'log_buffer') final int? logBuffer;
-/// Sync sample events for realtime viewing.
-@override@JsonKey(name: 'log_shared') final int? logShared;
-/// Directory to bundle logs and viewer into.
-@override@JsonKey(name: 'bundle_dir') final String? bundleDir;
-/// Overwrite files in `bundle_dir` (defaults to false).
-@override@JsonKey(name: 'bundle_overwrite') final bool? bundleOverwrite;
-/// Allow log directory to contain unrelated logs (defaults to false).
-@override@JsonKey(name: 'log_dir_allow_dirty') final bool? logDirAllowDirty;
-/// ID for the eval set. Generated if not specified.
-@override@JsonKey(name: 'eval_set_id') final String? evalSetId;
 // ------------------------------------------------------------------
-// Pass-through overrides
+// Inspect eval arguments (passed through to eval_set())
 // ------------------------------------------------------------------
-/// Additional `eval_set()` kwargs not covered by top-level fields.
-///
-/// Any valid `eval_set()` parameter can be specified here and will be
-/// merged into the output JSON. Top-level fields take precedence.
- final Map? _evalSetOverrides;
+/// All Inspect AI eval_set() parameters, nested under one key.
+ final Map? _inspectEvalArguments;
 // ------------------------------------------------------------------
-// Pass-through overrides
+// Inspect eval arguments (passed through to eval_set())
 // ------------------------------------------------------------------
-/// Additional `eval_set()` kwargs not covered by top-level fields.
-///
-/// Any valid `eval_set()` parameter can be specified here and will be
-/// merged into the output JSON. Top-level fields take precedence.
-@override@JsonKey(name: 'eval_set_overrides') Map? get evalSetOverrides {
-  final value = _evalSetOverrides;
+/// All Inspect AI eval_set() parameters, nested under one key.
+@override@JsonKey(name: 'inspect_eval_arguments') Map? get inspectEvalArguments {
+  final value = _inspectEvalArguments;
   if (value == null) return null;
-  if (_evalSetOverrides is EqualUnmodifiableMapView) return _evalSetOverrides;
-  // ignore: implicit_dynamic_type
-  return EqualUnmodifiableMapView(value);
-}
-
-/// Default `Task` kwargs applied to every task in this job.
-///
-/// Per-task overrides (from `task.yaml`) take precedence.
- final Map? _taskDefaults;
-/// Default `Task` kwargs applied to every task in this job.
-///
-/// Per-task overrides (from `task.yaml`) take precedence.
-@override@JsonKey(name: 'task_defaults') Map? get taskDefaults {
-  final value = _taskDefaults;
-  if (value == null) return null;
-  if (_taskDefaults is EqualUnmodifiableMapView) return _taskDefaults;
+  if (_inspectEvalArguments is EqualUnmodifiableMapView) return _inspectEvalArguments;
   // ignore: implicit_dynamic_type
   return EqualUnmodifiableMapView(value);
 }
+// ------------------------------------------------------------------
+// Tag-based filtering
+// ------------------------------------------------------------------
+/// Tag filters applied to tasks.
+@override@JsonKey(name: 'task_filters') final TagFilter? taskFilters;
+/// Tag filters applied to samples.
+@override@JsonKey(name: 'sample_filters') final TagFilter? 
sampleFilters; /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. @@ -597,16 +379,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Job&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.sandboxType, sandboxType) || other.sandboxType == sandboxType)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other._models, _models)&&const DeepCollectionEquality().equals(other._variants, _variants)&&const DeepCollectionEquality().equals(other._taskPaths, _taskPaths)&&const DeepCollectionEquality().equals(other._tasks, _tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&(identical(other.retryAttempts, retryAttempts) || other.retryAttempts == retryAttempts)&&(identical(other.maxRetries, maxRetries) || other.maxRetries == maxRetries)&&(identical(other.retryWait, retryWait) || other.retryWait == retryWait)&&(identical(other.retryConnections, retryConnections) || other.retryConnections == retryConnections)&&(identical(other.retryCleanup, retryCleanup) || other.retryCleanup == retryCleanup)&&(identical(other.failOnError, failOnError) || other.failOnError == failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.retryOnError, retryOnError) || other.retryOnError == retryOnError)&&(identical(other.debugErrors, debugErrors) || other.debugErrors == debugErrors)&&(identical(other.maxSamples, maxSamples) || other.maxSamples == maxSamples)&&(identical(other.maxTasks, maxTasks) || other.maxTasks == maxTasks)&&(identical(other.maxSubprocesses, maxSubprocesses) || other.maxSubprocesses == maxSubprocesses)&&(identical(other.maxSandboxes, maxSandboxes) || other.maxSandboxes == maxSandboxes)&&(identical(other.logLevel, logLevel) || 
other.logLevel == logLevel)&&(identical(other.logLevelTranscript, logLevelTranscript) || other.logLevelTranscript == logLevelTranscript)&&(identical(other.logFormat, logFormat) || other.logFormat == logFormat)&&const DeepCollectionEquality().equals(other._tags, _tags)&&const DeepCollectionEquality().equals(other._metadata, _metadata)&&(identical(other.trace, trace) || other.trace == trace)&&(identical(other.display, display) || other.display == display)&&(identical(other.score, score) || other.score == score)&&const DeepCollectionEquality().equals(other.limit, limit)&&const DeepCollectionEquality().equals(other.sampleId, sampleId)&&const DeepCollectionEquality().equals(other.sampleShuffle, sampleShuffle)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.solver, solver)&&(identical(other.sandboxCleanup, sandboxCleanup) || other.sandboxCleanup == sandboxCleanup)&&(identical(other.modelBaseUrl, modelBaseUrl) || other.modelBaseUrl == modelBaseUrl)&&const DeepCollectionEquality().equals(other._modelArgs, _modelArgs)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other._taskArgs, _taskArgs)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other._modelCostConfig, _modelCostConfig)&&(identical(other.logSamples, logSamples) || other.logSamples == logSamples)&&(identical(other.logRealtime, logRealtime) || other.logRealtime == logRealtime)&&(identical(other.logImages, logImages) || other.logImages == 
logImages)&&(identical(other.logBuffer, logBuffer) || other.logBuffer == logBuffer)&&(identical(other.logShared, logShared) || other.logShared == logShared)&&(identical(other.bundleDir, bundleDir) || other.bundleDir == bundleDir)&&(identical(other.bundleOverwrite, bundleOverwrite) || other.bundleOverwrite == bundleOverwrite)&&(identical(other.logDirAllowDirty, logDirAllowDirty) || other.logDirAllowDirty == logDirAllowDirty)&&(identical(other.evalSetId, evalSetId) || other.evalSetId == evalSetId)&&const DeepCollectionEquality().equals(other._evalSetOverrides, _evalSetOverrides)&&const DeepCollectionEquality().equals(other._taskDefaults, _taskDefaults)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Job&&(identical(other.description, description) || other.description == description)&&(identical(other.logDir, logDir) || other.logDir == logDir)&&(identical(other.maxConnections, maxConnections) || other.maxConnections == maxConnections)&&const DeepCollectionEquality().equals(other._models, _models)&&const DeepCollectionEquality().equals(other._variants, _variants)&&const DeepCollectionEquality().equals(other._taskPaths, _taskPaths)&&const DeepCollectionEquality().equals(other._tasks, _tasks)&&(identical(other.saveExamples, saveExamples) || other.saveExamples == saveExamples)&&const DeepCollectionEquality().equals(other._sandbox, _sandbox)&&const DeepCollectionEquality().equals(other._inspectEvalArguments, _inspectEvalArguments)&&(identical(other.taskFilters, taskFilters) || other.taskFilters == taskFilters)&&(identical(other.sampleFilters, sampleFilters) || other.sampleFilters == sampleFilters)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hashAll([runtimeType,logDir,sandboxType,maxConnections,const DeepCollectionEquality().hash(_models),const DeepCollectionEquality().hash(_variants),const DeepCollectionEquality().hash(_taskPaths),const 
DeepCollectionEquality().hash(_tasks),saveExamples,retryAttempts,maxRetries,retryWait,retryConnections,retryCleanup,failOnError,continueOnFail,retryOnError,debugErrors,maxSamples,maxTasks,maxSubprocesses,maxSandboxes,logLevel,logLevelTranscript,logFormat,const DeepCollectionEquality().hash(_tags),const DeepCollectionEquality().hash(_metadata),trace,display,score,const DeepCollectionEquality().hash(limit),const DeepCollectionEquality().hash(sampleId),const DeepCollectionEquality().hash(sampleShuffle),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(solver),sandboxCleanup,modelBaseUrl,const DeepCollectionEquality().hash(_modelArgs),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(_taskArgs),messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(_modelCostConfig),logSamples,logRealtime,logImages,logBuffer,logShared,bundleDir,bundleOverwrite,logDirAllowDirty,evalSetId,const DeepCollectionEquality().hash(_evalSetOverrides),const DeepCollectionEquality().hash(_taskDefaults)]); +int get hashCode => Object.hash(runtimeType,description,logDir,maxConnections,const DeepCollectionEquality().hash(_models),const DeepCollectionEquality().hash(_variants),const DeepCollectionEquality().hash(_taskPaths),const DeepCollectionEquality().hash(_tasks),saveExamples,const DeepCollectionEquality().hash(_sandbox),const DeepCollectionEquality().hash(_inspectEvalArguments),taskFilters,sampleFilters); @override String toString() { - return 'Job(logDir: $logDir, sandboxType: $sandboxType, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, retryAttempts: $retryAttempts, maxRetries: $maxRetries, retryWait: $retryWait, retryConnections: $retryConnections, retryCleanup: $retryCleanup, failOnError: $failOnError, continueOnFail: $continueOnFail, retryOnError: 
$retryOnError, debugErrors: $debugErrors, maxSamples: $maxSamples, maxTasks: $maxTasks, maxSubprocesses: $maxSubprocesses, maxSandboxes: $maxSandboxes, logLevel: $logLevel, logLevelTranscript: $logLevelTranscript, logFormat: $logFormat, tags: $tags, metadata: $metadata, trace: $trace, display: $display, score: $score, limit: $limit, sampleId: $sampleId, sampleShuffle: $sampleShuffle, epochs: $epochs, approval: $approval, solver: $solver, sandboxCleanup: $sandboxCleanup, modelBaseUrl: $modelBaseUrl, modelArgs: $modelArgs, modelRoles: $modelRoles, taskArgs: $taskArgs, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, modelCostConfig: $modelCostConfig, logSamples: $logSamples, logRealtime: $logRealtime, logImages: $logImages, logBuffer: $logBuffer, logShared: $logShared, bundleDir: $bundleDir, bundleOverwrite: $bundleOverwrite, logDirAllowDirty: $logDirAllowDirty, evalSetId: $evalSetId, evalSetOverrides: $evalSetOverrides, taskDefaults: $taskDefaults)'; + return 'Job(description: $description, logDir: $logDir, maxConnections: $maxConnections, models: $models, variants: $variants, taskPaths: $taskPaths, tasks: $tasks, saveExamples: $saveExamples, sandbox: $sandbox, inspectEvalArguments: $inspectEvalArguments, taskFilters: $taskFilters, sampleFilters: $sampleFilters)'; } @@ -617,11 +399,11 @@ abstract mixin class _$JobCopyWith<$Res> implements $JobCopyWith<$Res> { factory _$JobCopyWith(_Job value, $Res Function(_Job) _then) = __$JobCopyWithImpl; @override @useResult $Res call({ -@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'sandbox_type') String sandboxType,@JsonKey(name: 'max_connections') int maxConnections, List? models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples,@JsonKey(name: 'retry_attempts') int? retryAttempts,@JsonKey(name: 'max_retries') int? maxRetries,@JsonKey(name: 'retry_wait') double? 
retryWait,@JsonKey(name: 'retry_connections') double? retryConnections,@JsonKey(name: 'retry_cleanup') bool? retryCleanup,@JsonKey(name: 'fail_on_error') double? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'retry_on_error') int? retryOnError,@JsonKey(name: 'debug_errors') bool? debugErrors,@JsonKey(name: 'max_samples') int? maxSamples,@JsonKey(name: 'max_tasks') int? maxTasks,@JsonKey(name: 'max_subprocesses') int? maxSubprocesses,@JsonKey(name: 'max_sandboxes') int? maxSandboxes,@JsonKey(name: 'log_level') String? logLevel,@JsonKey(name: 'log_level_transcript') String? logLevelTranscript,@JsonKey(name: 'log_format') String? logFormat, List? tags, Map? metadata, bool? trace, String? display, bool? score, Object? limit,@JsonKey(name: 'sample_id') Object? sampleId,@JsonKey(name: 'sample_shuffle') Object? sampleShuffle, Object? epochs, Object? approval, Object? solver,@JsonKey(name: 'sandbox_cleanup') bool? sandboxCleanup,@JsonKey(name: 'model_base_url') String? modelBaseUrl,@JsonKey(name: 'model_args') Map? modelArgs,@JsonKey(name: 'model_roles') Map? modelRoles,@JsonKey(name: 'task_args') Map? taskArgs,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'model_cost_config') Map? modelCostConfig,@JsonKey(name: 'log_samples') bool? logSamples,@JsonKey(name: 'log_realtime') bool? logRealtime,@JsonKey(name: 'log_images') bool? logImages,@JsonKey(name: 'log_buffer') int? logBuffer,@JsonKey(name: 'log_shared') int? logShared,@JsonKey(name: 'bundle_dir') String? bundleDir,@JsonKey(name: 'bundle_overwrite') bool? bundleOverwrite,@JsonKey(name: 'log_dir_allow_dirty') bool? logDirAllowDirty,@JsonKey(name: 'eval_set_id') String? evalSetId,@JsonKey(name: 'eval_set_overrides') Map? evalSetOverrides,@JsonKey(name: 'task_defaults') Map? 
taskDefaults + String? description,@JsonKey(name: 'log_dir') String logDir,@JsonKey(name: 'max_connections') int maxConnections, List models, Map>? variants,@JsonKey(name: 'task_paths') List? taskPaths, Map? tasks,@JsonKey(name: 'save_examples') bool saveExamples, Map? sandbox,@JsonKey(name: 'inspect_eval_arguments') Map? inspectEvalArguments,@JsonKey(name: 'task_filters') TagFilter? taskFilters,@JsonKey(name: 'sample_filters') TagFilter? sampleFilters }); - +@override $TagFilterCopyWith<$Res>? get taskFilters;@override $TagFilterCopyWith<$Res>? get sampleFilters; } /// @nodoc @@ -634,64 +416,49 @@ class __$JobCopyWithImpl<$Res> /// Create a copy of Job /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? logDir = null,Object? sandboxType = null,Object? maxConnections = null,Object? models = freezed,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? retryAttempts = freezed,Object? maxRetries = freezed,Object? retryWait = freezed,Object? retryConnections = freezed,Object? retryCleanup = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? retryOnError = freezed,Object? debugErrors = freezed,Object? maxSamples = freezed,Object? maxTasks = freezed,Object? maxSubprocesses = freezed,Object? maxSandboxes = freezed,Object? logLevel = freezed,Object? logLevelTranscript = freezed,Object? logFormat = freezed,Object? tags = freezed,Object? metadata = freezed,Object? trace = freezed,Object? display = freezed,Object? score = freezed,Object? limit = freezed,Object? sampleId = freezed,Object? sampleShuffle = freezed,Object? epochs = freezed,Object? approval = freezed,Object? solver = freezed,Object? sandboxCleanup = freezed,Object? modelBaseUrl = freezed,Object? modelArgs = freezed,Object? modelRoles = freezed,Object? taskArgs = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? 
timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? modelCostConfig = freezed,Object? logSamples = freezed,Object? logRealtime = freezed,Object? logImages = freezed,Object? logBuffer = freezed,Object? logShared = freezed,Object? bundleDir = freezed,Object? bundleOverwrite = freezed,Object? logDirAllowDirty = freezed,Object? evalSetId = freezed,Object? evalSetOverrides = freezed,Object? taskDefaults = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? description = freezed,Object? logDir = null,Object? maxConnections = null,Object? models = null,Object? variants = freezed,Object? taskPaths = freezed,Object? tasks = freezed,Object? saveExamples = null,Object? sandbox = freezed,Object? inspectEvalArguments = freezed,Object? taskFilters = freezed,Object? sampleFilters = freezed,}) { return _then(_Job( -logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable -as String,sandboxType: null == sandboxType ? _self.sandboxType : sandboxType // ignore: cast_nullable_to_non_nullable +description: freezed == description ? _self.description : description // ignore: cast_nullable_to_non_nullable +as String?,logDir: null == logDir ? _self.logDir : logDir // ignore: cast_nullable_to_non_nullable as String,maxConnections: null == maxConnections ? _self.maxConnections : maxConnections // ignore: cast_nullable_to_non_nullable -as int,models: freezed == models ? _self._models : models // ignore: cast_nullable_to_non_nullable -as List?,variants: freezed == variants ? _self._variants : variants // ignore: cast_nullable_to_non_nullable +as int,models: null == models ? _self._models : models // ignore: cast_nullable_to_non_nullable +as List,variants: freezed == variants ? _self._variants : variants // ignore: cast_nullable_to_non_nullable as Map>?,taskPaths: freezed == taskPaths ? _self._taskPaths : taskPaths // ignore: cast_nullable_to_non_nullable as List?,tasks: freezed == tasks ? 
_self._tasks : tasks // ignore: cast_nullable_to_non_nullable as Map?,saveExamples: null == saveExamples ? _self.saveExamples : saveExamples // ignore: cast_nullable_to_non_nullable -as bool,retryAttempts: freezed == retryAttempts ? _self.retryAttempts : retryAttempts // ignore: cast_nullable_to_non_nullable -as int?,maxRetries: freezed == maxRetries ? _self.maxRetries : maxRetries // ignore: cast_nullable_to_non_nullable -as int?,retryWait: freezed == retryWait ? _self.retryWait : retryWait // ignore: cast_nullable_to_non_nullable -as double?,retryConnections: freezed == retryConnections ? _self.retryConnections : retryConnections // ignore: cast_nullable_to_non_nullable -as double?,retryCleanup: freezed == retryCleanup ? _self.retryCleanup : retryCleanup // ignore: cast_nullable_to_non_nullable -as bool?,failOnError: freezed == failOnError ? _self.failOnError : failOnError // ignore: cast_nullable_to_non_nullable -as double?,continueOnFail: freezed == continueOnFail ? _self.continueOnFail : continueOnFail // ignore: cast_nullable_to_non_nullable -as bool?,retryOnError: freezed == retryOnError ? _self.retryOnError : retryOnError // ignore: cast_nullable_to_non_nullable -as int?,debugErrors: freezed == debugErrors ? _self.debugErrors : debugErrors // ignore: cast_nullable_to_non_nullable -as bool?,maxSamples: freezed == maxSamples ? _self.maxSamples : maxSamples // ignore: cast_nullable_to_non_nullable -as int?,maxTasks: freezed == maxTasks ? _self.maxTasks : maxTasks // ignore: cast_nullable_to_non_nullable -as int?,maxSubprocesses: freezed == maxSubprocesses ? _self.maxSubprocesses : maxSubprocesses // ignore: cast_nullable_to_non_nullable -as int?,maxSandboxes: freezed == maxSandboxes ? _self.maxSandboxes : maxSandboxes // ignore: cast_nullable_to_non_nullable -as int?,logLevel: freezed == logLevel ? _self.logLevel : logLevel // ignore: cast_nullable_to_non_nullable -as String?,logLevelTranscript: freezed == logLevelTranscript ? 
_self.logLevelTranscript : logLevelTranscript // ignore: cast_nullable_to_non_nullable -as String?,logFormat: freezed == logFormat ? _self.logFormat : logFormat // ignore: cast_nullable_to_non_nullable -as String?,tags: freezed == tags ? _self._tags : tags // ignore: cast_nullable_to_non_nullable -as List?,metadata: freezed == metadata ? _self._metadata : metadata // ignore: cast_nullable_to_non_nullable -as Map?,trace: freezed == trace ? _self.trace : trace // ignore: cast_nullable_to_non_nullable -as bool?,display: freezed == display ? _self.display : display // ignore: cast_nullable_to_non_nullable -as String?,score: freezed == score ? _self.score : score // ignore: cast_nullable_to_non_nullable -as bool?,limit: freezed == limit ? _self.limit : limit ,sampleId: freezed == sampleId ? _self.sampleId : sampleId ,sampleShuffle: freezed == sampleShuffle ? _self.sampleShuffle : sampleShuffle ,epochs: freezed == epochs ? _self.epochs : epochs ,approval: freezed == approval ? _self.approval : approval ,solver: freezed == solver ? _self.solver : solver ,sandboxCleanup: freezed == sandboxCleanup ? _self.sandboxCleanup : sandboxCleanup // ignore: cast_nullable_to_non_nullable -as bool?,modelBaseUrl: freezed == modelBaseUrl ? _self.modelBaseUrl : modelBaseUrl // ignore: cast_nullable_to_non_nullable -as String?,modelArgs: freezed == modelArgs ? _self._modelArgs : modelArgs // ignore: cast_nullable_to_non_nullable -as Map?,modelRoles: freezed == modelRoles ? _self._modelRoles : modelRoles // ignore: cast_nullable_to_non_nullable -as Map?,taskArgs: freezed == taskArgs ? _self._taskArgs : taskArgs // ignore: cast_nullable_to_non_nullable -as Map?,messageLimit: freezed == messageLimit ? _self.messageLimit : messageLimit // ignore: cast_nullable_to_non_nullable -as int?,tokenLimit: freezed == tokenLimit ? _self.tokenLimit : tokenLimit // ignore: cast_nullable_to_non_nullable -as int?,timeLimit: freezed == timeLimit ? 
_self.timeLimit : timeLimit // ignore: cast_nullable_to_non_nullable -as int?,workingLimit: freezed == workingLimit ? _self.workingLimit : workingLimit // ignore: cast_nullable_to_non_nullable -as int?,costLimit: freezed == costLimit ? _self.costLimit : costLimit // ignore: cast_nullable_to_non_nullable -as double?,modelCostConfig: freezed == modelCostConfig ? _self._modelCostConfig : modelCostConfig // ignore: cast_nullable_to_non_nullable -as Map?,logSamples: freezed == logSamples ? _self.logSamples : logSamples // ignore: cast_nullable_to_non_nullable -as bool?,logRealtime: freezed == logRealtime ? _self.logRealtime : logRealtime // ignore: cast_nullable_to_non_nullable -as bool?,logImages: freezed == logImages ? _self.logImages : logImages // ignore: cast_nullable_to_non_nullable -as bool?,logBuffer: freezed == logBuffer ? _self.logBuffer : logBuffer // ignore: cast_nullable_to_non_nullable -as int?,logShared: freezed == logShared ? _self.logShared : logShared // ignore: cast_nullable_to_non_nullable -as int?,bundleDir: freezed == bundleDir ? _self.bundleDir : bundleDir // ignore: cast_nullable_to_non_nullable -as String?,bundleOverwrite: freezed == bundleOverwrite ? _self.bundleOverwrite : bundleOverwrite // ignore: cast_nullable_to_non_nullable -as bool?,logDirAllowDirty: freezed == logDirAllowDirty ? _self.logDirAllowDirty : logDirAllowDirty // ignore: cast_nullable_to_non_nullable -as bool?,evalSetId: freezed == evalSetId ? _self.evalSetId : evalSetId // ignore: cast_nullable_to_non_nullable -as String?,evalSetOverrides: freezed == evalSetOverrides ? _self._evalSetOverrides : evalSetOverrides // ignore: cast_nullable_to_non_nullable -as Map?,taskDefaults: freezed == taskDefaults ? _self._taskDefaults : taskDefaults // ignore: cast_nullable_to_non_nullable -as Map?, +as bool,sandbox: freezed == sandbox ? _self._sandbox : sandbox // ignore: cast_nullable_to_non_nullable +as Map?,inspectEvalArguments: freezed == inspectEvalArguments ? 
_self._inspectEvalArguments : inspectEvalArguments // ignore: cast_nullable_to_non_nullable +as Map?,taskFilters: freezed == taskFilters ? _self.taskFilters : taskFilters // ignore: cast_nullable_to_non_nullable +as TagFilter?,sampleFilters: freezed == sampleFilters ? _self.sampleFilters : sampleFilters // ignore: cast_nullable_to_non_nullable +as TagFilter?, )); } +/// Create a copy of Job +/// with the given fields replaced by the non-null parameter values. +@override +@pragma('vm:prefer-inline') +$TagFilterCopyWith<$Res>? get taskFilters { + if (_self.taskFilters == null) { + return null; + } + + return $TagFilterCopyWith<$Res>(_self.taskFilters!, (value) { + return _then(_self.copyWith(taskFilters: value)); + }); +}/// Create a copy of Job +/// with the given fields replaced by the non-null parameter values. +@override +@pragma('vm:prefer-inline') +$TagFilterCopyWith<$Res>? get sampleFilters { + if (_self.sampleFilters == null) { + return null; + } + return $TagFilterCopyWith<$Res>(_self.sampleFilters!, (value) { + return _then(_self.copyWith(sampleFilters: value)); + }); +} } @@ -701,8 +468,10 @@ mixin _$JobTask { /// Task identifier matching a task directory name in `tasks/`. String get id;/// Only run these sample IDs. Mutually exclusive with [excludeSamples]. @JsonKey(name: 'include_samples') List? get includeSamples;/// Exclude these sample IDs. Mutually exclusive with [includeSamples]. -@JsonKey(name: 'exclude_samples') List? get excludeSamples;/// Override system message for this task. -@JsonKey(name: 'system_message') String? get systemMessage; +@JsonKey(name: 'exclude_samples') List? get excludeSamples;/// Only run these variant names for this task. +@JsonKey(name: 'include_variants') List? get includeVariants;/// Exclude these variant names for this task. +@JsonKey(name: 'exclude_variants') List? get excludeVariants;/// Per-task argument overrides passed to the task function. +@JsonKey(name: 'args') Map? 
get args; /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. @JsonKey(includeFromJson: false, includeToJson: false) @@ -715,16 +484,16 @@ $JobTaskCopyWith get copyWith => _$JobTaskCopyWithImpl(this as @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other.includeSamples, includeSamples)&&const DeepCollectionEquality().equals(other.excludeSamples, excludeSamples)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other.includeSamples, includeSamples)&&const DeepCollectionEquality().equals(other.excludeSamples, excludeSamples)&&const DeepCollectionEquality().equals(other.includeVariants, includeVariants)&&const DeepCollectionEquality().equals(other.excludeVariants, excludeVariants)&&const DeepCollectionEquality().equals(other.args, args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(includeSamples),const DeepCollectionEquality().hash(excludeSamples),systemMessage); +int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(includeSamples),const DeepCollectionEquality().hash(excludeSamples),const DeepCollectionEquality().hash(includeVariants),const DeepCollectionEquality().hash(excludeVariants),const DeepCollectionEquality().hash(args)); @override String toString() { - return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, systemMessage: $systemMessage)'; + return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, includeVariants: $includeVariants, 
excludeVariants: $excludeVariants, args: $args)'; } @@ -735,7 +504,7 @@ abstract mixin class $JobTaskCopyWith<$Res> { factory $JobTaskCopyWith(JobTask value, $Res Function(JobTask) _then) = _$JobTaskCopyWithImpl; @useResult $Res call({ - String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'system_message') String? systemMessage + String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'include_variants') List? includeVariants,@JsonKey(name: 'exclude_variants') List? excludeVariants,@JsonKey(name: 'args') Map? args }); @@ -752,13 +521,15 @@ class _$JobTaskCopyWithImpl<$Res> /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? systemMessage = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? includeVariants = freezed,Object? excludeVariants = freezed,Object? args = freezed,}) { return _then(_self.copyWith( id: null == id ? _self.id : id // ignore: cast_nullable_to_non_nullable as String,includeSamples: freezed == includeSamples ? _self.includeSamples : includeSamples // ignore: cast_nullable_to_non_nullable as List?,excludeSamples: freezed == excludeSamples ? _self.excludeSamples : excludeSamples // ignore: cast_nullable_to_non_nullable -as List?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable -as String?, +as List?,includeVariants: freezed == includeVariants ? _self.includeVariants : includeVariants // ignore: cast_nullable_to_non_nullable +as List?,excludeVariants: freezed == excludeVariants ? 
_self.excludeVariants : excludeVariants // ignore: cast_nullable_to_non_nullable +as List?,args: freezed == args ? _self.args : args // ignore: cast_nullable_to_non_nullable +as Map?, )); } @@ -840,10 +611,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'include_variants') List? includeVariants, @JsonKey(name: 'exclude_variants') List? excludeVariants, @JsonKey(name: 'args') Map? args)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _JobTask() when $default != null: -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage);case _: +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.includeVariants,_that.excludeVariants,_that.args);case _: return orElse(); } @@ -861,10 +632,10 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'include_variants') List? includeVariants, @JsonKey(name: 'exclude_variants') List? excludeVariants, @JsonKey(name: 'args') Map? 
args) $default,) {final _that = this; switch (_that) { case _JobTask(): -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage);} +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.includeVariants,_that.excludeVariants,_that.args);} } /// A variant of `when` that fallback to returning `null` /// @@ -878,10 +649,10 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'system_message') String? systemMessage)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( String id, @JsonKey(name: 'include_samples') List? includeSamples, @JsonKey(name: 'exclude_samples') List? excludeSamples, @JsonKey(name: 'include_variants') List? includeVariants, @JsonKey(name: 'exclude_variants') List? excludeVariants, @JsonKey(name: 'args') Map? args)? $default,) {final _that = this; switch (_that) { case _JobTask() when $default != null: -return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemMessage);case _: +return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.includeVariants,_that.excludeVariants,_that.args);case _: return null; } @@ -893,7 +664,7 @@ return $default(_that.id,_that.includeSamples,_that.excludeSamples,_that.systemM @JsonSerializable() class _JobTask implements JobTask { - const _JobTask({required this.id, @JsonKey(name: 'include_samples') final List? includeSamples, @JsonKey(name: 'exclude_samples') final List? excludeSamples, @JsonKey(name: 'system_message') this.systemMessage}): _includeSamples = includeSamples,_excludeSamples = excludeSamples; + const _JobTask({required this.id, @JsonKey(name: 'include_samples') final List? includeSamples, @JsonKey(name: 'exclude_samples') final List? 
excludeSamples, @JsonKey(name: 'include_variants') final List? includeVariants, @JsonKey(name: 'exclude_variants') final List? excludeVariants, @JsonKey(name: 'args') final Map? args}): _includeSamples = includeSamples,_excludeSamples = excludeSamples,_includeVariants = includeVariants,_excludeVariants = excludeVariants,_args = args; factory _JobTask.fromJson(Map json) => _$JobTaskFromJson(json); /// Task identifier matching a task directory name in `tasks/`. @@ -920,8 +691,39 @@ class _JobTask implements JobTask { return EqualUnmodifiableListView(value); } -/// Override system message for this task. -@override@JsonKey(name: 'system_message') final String? systemMessage; +/// Only run these variant names for this task. + final List? _includeVariants; +/// Only run these variant names for this task. +@override@JsonKey(name: 'include_variants') List? get includeVariants { + final value = _includeVariants; + if (value == null) return null; + if (_includeVariants is EqualUnmodifiableListView) return _includeVariants; + // ignore: implicit_dynamic_type + return EqualUnmodifiableListView(value); +} + +/// Exclude these variant names for this task. + final List? _excludeVariants; +/// Exclude these variant names for this task. +@override@JsonKey(name: 'exclude_variants') List? get excludeVariants { + final value = _excludeVariants; + if (value == null) return null; + if (_excludeVariants is EqualUnmodifiableListView) return _excludeVariants; + // ignore: implicit_dynamic_type + return EqualUnmodifiableListView(value); +} + +/// Per-task argument overrides passed to the task function. + final Map? _args; +/// Per-task argument overrides passed to the task function. +@override@JsonKey(name: 'args') Map? 
get args { + final value = _args; + if (value == null) return null; + if (_args is EqualUnmodifiableMapView) return _args; + // ignore: implicit_dynamic_type + return EqualUnmodifiableMapView(value); +} + /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. @@ -936,16 +738,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other._includeSamples, _includeSamples)&&const DeepCollectionEquality().equals(other._excludeSamples, _excludeSamples)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _JobTask&&(identical(other.id, id) || other.id == id)&&const DeepCollectionEquality().equals(other._includeSamples, _includeSamples)&&const DeepCollectionEquality().equals(other._excludeSamples, _excludeSamples)&&const DeepCollectionEquality().equals(other._includeVariants, _includeVariants)&&const DeepCollectionEquality().equals(other._excludeVariants, _excludeVariants)&&const DeepCollectionEquality().equals(other._args, _args)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(_includeSamples),const DeepCollectionEquality().hash(_excludeSamples),systemMessage); +int get hashCode => Object.hash(runtimeType,id,const DeepCollectionEquality().hash(_includeSamples),const DeepCollectionEquality().hash(_excludeSamples),const DeepCollectionEquality().hash(_includeVariants),const DeepCollectionEquality().hash(_excludeVariants),const DeepCollectionEquality().hash(_args)); @override String toString() { - return 'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, systemMessage: $systemMessage)'; + return 
'JobTask(id: $id, includeSamples: $includeSamples, excludeSamples: $excludeSamples, includeVariants: $includeVariants, excludeVariants: $excludeVariants, args: $args)'; } @@ -956,7 +758,7 @@ abstract mixin class _$JobTaskCopyWith<$Res> implements $JobTaskCopyWith<$Res> { factory _$JobTaskCopyWith(_JobTask value, $Res Function(_JobTask) _then) = __$JobTaskCopyWithImpl; @override @useResult $Res call({ - String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'system_message') String? systemMessage + String id,@JsonKey(name: 'include_samples') List? includeSamples,@JsonKey(name: 'exclude_samples') List? excludeSamples,@JsonKey(name: 'include_variants') List? includeVariants,@JsonKey(name: 'exclude_variants') List? excludeVariants,@JsonKey(name: 'args') Map? args }); @@ -973,13 +775,15 @@ class __$JobTaskCopyWithImpl<$Res> /// Create a copy of JobTask /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? systemMessage = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? id = null,Object? includeSamples = freezed,Object? excludeSamples = freezed,Object? includeVariants = freezed,Object? excludeVariants = freezed,Object? args = freezed,}) { return _then(_JobTask( id: null == id ? _self.id : id // ignore: cast_nullable_to_non_nullable as String,includeSamples: freezed == includeSamples ? _self._includeSamples : includeSamples // ignore: cast_nullable_to_non_nullable as List?,excludeSamples: freezed == excludeSamples ? _self._excludeSamples : excludeSamples // ignore: cast_nullable_to_non_nullable -as List?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable -as String?, +as List?,includeVariants: freezed == includeVariants ? 
_self._includeVariants : includeVariants // ignore: cast_nullable_to_non_nullable +as List?,excludeVariants: freezed == excludeVariants ? _self._excludeVariants : excludeVariants // ignore: cast_nullable_to_non_nullable +as List?,args: freezed == args ? _self._args : args // ignore: cast_nullable_to_non_nullable +as Map?, )); } diff --git a/packages/dataset_config_dart/lib/src/models/job.g.dart b/packages/dataset_config_dart/lib/src/models/job.g.dart index f62e5b3..c5aad96 100644 --- a/packages/dataset_config_dart/lib/src/models/job.g.dart +++ b/packages/dataset_config_dart/lib/src/models/job.g.dart @@ -7,10 +7,10 @@ part of 'job.dart'; // ************************************************************************** _Job _$JobFromJson(Map json) => _Job( + description: json['description'] as String?, logDir: json['log_dir'] as String, - sandboxType: json['sandbox_type'] as String? ?? 'local', maxConnections: (json['max_connections'] as num?)?.toInt() ?? 10, - models: (json['models'] as List?)?.map((e) => e as String).toList(), + models: (json['models'] as List).map((e) => e as String).toList(), variants: (json['variants'] as Map?)?.map( (k, e) => MapEntry(k, e as Map), ), @@ -21,117 +21,29 @@ _Job _$JobFromJson(Map json) => _Job( (k, e) => MapEntry(k, JobTask.fromJson(e as Map)), ), saveExamples: json['save_examples'] as bool? ?? 
false, - retryAttempts: (json['retry_attempts'] as num?)?.toInt(), - maxRetries: (json['max_retries'] as num?)?.toInt(), - retryWait: (json['retry_wait'] as num?)?.toDouble(), - retryConnections: (json['retry_connections'] as num?)?.toDouble(), - retryCleanup: json['retry_cleanup'] as bool?, - failOnError: (json['fail_on_error'] as num?)?.toDouble(), - continueOnFail: json['continue_on_fail'] as bool?, - retryOnError: (json['retry_on_error'] as num?)?.toInt(), - debugErrors: json['debug_errors'] as bool?, - maxSamples: (json['max_samples'] as num?)?.toInt(), - maxTasks: (json['max_tasks'] as num?)?.toInt(), - maxSubprocesses: (json['max_subprocesses'] as num?)?.toInt(), - maxSandboxes: (json['max_sandboxes'] as num?)?.toInt(), - logLevel: json['log_level'] as String?, - logLevelTranscript: json['log_level_transcript'] as String?, - logFormat: json['log_format'] as String?, - tags: (json['tags'] as List?)?.map((e) => e as String).toList(), - metadata: json['metadata'] as Map?, - trace: json['trace'] as bool?, - display: json['display'] as String?, - score: json['score'] as bool?, - limit: json['limit'], - sampleId: json['sample_id'], - sampleShuffle: json['sample_shuffle'], - epochs: json['epochs'], - approval: json['approval'], - solver: json['solver'], - sandboxCleanup: json['sandbox_cleanup'] as bool?, - modelBaseUrl: json['model_base_url'] as String?, - modelArgs: json['model_args'] as Map?, - modelRoles: (json['model_roles'] as Map?)?.map( - (k, e) => MapEntry(k, e as String), - ), - taskArgs: json['task_args'] as Map?, - messageLimit: (json['message_limit'] as num?)?.toInt(), - tokenLimit: (json['token_limit'] as num?)?.toInt(), - timeLimit: (json['time_limit'] as num?)?.toInt(), - workingLimit: (json['working_limit'] as num?)?.toInt(), - costLimit: (json['cost_limit'] as num?)?.toDouble(), - modelCostConfig: json['model_cost_config'] as Map?, - logSamples: json['log_samples'] as bool?, - logRealtime: json['log_realtime'] as bool?, - logImages: 
json['log_images'] as bool?, - logBuffer: (json['log_buffer'] as num?)?.toInt(), - logShared: (json['log_shared'] as num?)?.toInt(), - bundleDir: json['bundle_dir'] as String?, - bundleOverwrite: json['bundle_overwrite'] as bool?, - logDirAllowDirty: json['log_dir_allow_dirty'] as bool?, - evalSetId: json['eval_set_id'] as String?, - evalSetOverrides: json['eval_set_overrides'] as Map?, - taskDefaults: json['task_defaults'] as Map?, + sandbox: json['sandbox'] as Map?, + inspectEvalArguments: json['inspect_eval_arguments'] as Map?, + taskFilters: json['task_filters'] == null + ? null + : TagFilter.fromJson(json['task_filters'] as Map), + sampleFilters: json['sample_filters'] == null + ? null + : TagFilter.fromJson(json['sample_filters'] as Map), ); Map _$JobToJson(_Job instance) => { + 'description': instance.description, 'log_dir': instance.logDir, - 'sandbox_type': instance.sandboxType, 'max_connections': instance.maxConnections, 'models': instance.models, 'variants': instance.variants, 'task_paths': instance.taskPaths, - 'tasks': instance.tasks?.map((k, e) => MapEntry(k, e.toJson())), + 'tasks': instance.tasks, 'save_examples': instance.saveExamples, - 'retry_attempts': instance.retryAttempts, - 'max_retries': instance.maxRetries, - 'retry_wait': instance.retryWait, - 'retry_connections': instance.retryConnections, - 'retry_cleanup': instance.retryCleanup, - 'fail_on_error': instance.failOnError, - 'continue_on_fail': instance.continueOnFail, - 'retry_on_error': instance.retryOnError, - 'debug_errors': instance.debugErrors, - 'max_samples': instance.maxSamples, - 'max_tasks': instance.maxTasks, - 'max_subprocesses': instance.maxSubprocesses, - 'max_sandboxes': instance.maxSandboxes, - 'log_level': instance.logLevel, - 'log_level_transcript': instance.logLevelTranscript, - 'log_format': instance.logFormat, - 'tags': instance.tags, - 'metadata': instance.metadata, - 'trace': instance.trace, - 'display': instance.display, - 'score': instance.score, - 'limit': 
instance.limit, - 'sample_id': instance.sampleId, - 'sample_shuffle': instance.sampleShuffle, - 'epochs': instance.epochs, - 'approval': instance.approval, - 'solver': instance.solver, - 'sandbox_cleanup': instance.sandboxCleanup, - 'model_base_url': instance.modelBaseUrl, - 'model_args': instance.modelArgs, - 'model_roles': instance.modelRoles, - 'task_args': instance.taskArgs, - 'message_limit': instance.messageLimit, - 'token_limit': instance.tokenLimit, - 'time_limit': instance.timeLimit, - 'working_limit': instance.workingLimit, - 'cost_limit': instance.costLimit, - 'model_cost_config': instance.modelCostConfig, - 'log_samples': instance.logSamples, - 'log_realtime': instance.logRealtime, - 'log_images': instance.logImages, - 'log_buffer': instance.logBuffer, - 'log_shared': instance.logShared, - 'bundle_dir': instance.bundleDir, - 'bundle_overwrite': instance.bundleOverwrite, - 'log_dir_allow_dirty': instance.logDirAllowDirty, - 'eval_set_id': instance.evalSetId, - 'eval_set_overrides': instance.evalSetOverrides, - 'task_defaults': instance.taskDefaults, + 'sandbox': instance.sandbox, + 'inspect_eval_arguments': instance.inspectEvalArguments, + 'task_filters': instance.taskFilters, + 'sample_filters': instance.sampleFilters, }; _JobTask _$JobTaskFromJson(Map json) => _JobTask( @@ -142,12 +54,20 @@ _JobTask _$JobTaskFromJson(Map json) => _JobTask( excludeSamples: (json['exclude_samples'] as List?) ?.map((e) => e as String) .toList(), - systemMessage: json['system_message'] as String?, + includeVariants: (json['include_variants'] as List?) + ?.map((e) => e as String) + .toList(), + excludeVariants: (json['exclude_variants'] as List?) 
+      ?.map((e) => e as String)
+      .toList(),
+  args: json['args'] as Map<String, dynamic>?,
 );
 
 Map<String, dynamic> _$JobTaskToJson(_JobTask instance) => <String, dynamic>{
   'id': instance.id,
   'include_samples': instance.includeSamples,
   'exclude_samples': instance.excludeSamples,
-  'system_message': instance.systemMessage,
+  'include_variants': instance.includeVariants,
+  'exclude_variants': instance.excludeVariants,
+  'args': instance.args,
 };
diff --git a/packages/dataset_config_dart/lib/src/models/models.dart b/packages/dataset_config_dart/lib/src/models/models.dart
index 5b590fb..4fba25c 100644
--- a/packages/dataset_config_dart/lib/src/models/models.dart
+++ b/packages/dataset_config_dart/lib/src/models/models.dart
@@ -1,6 +1,7 @@
 // Config models (eval runner input configuration)
 export 'context_file.dart';
 export 'job.dart';
+export 'tag_filter.dart';
 export 'variant.dart';
 
 // Inspect AI models (mirrors the Python Inspect AI API types)
diff --git a/packages/dataset_config_dart/lib/src/models/tag_filter.dart b/packages/dataset_config_dart/lib/src/models/tag_filter.dart
new file mode 100644
index 0000000..3e112f4
--- /dev/null
+++ b/packages/dataset_config_dart/lib/src/models/tag_filter.dart
@@ -0,0 +1,35 @@
+import 'package:freezed_annotation/freezed_annotation.dart';
+
+part 'tag_filter.freezed.dart';
+part 'tag_filter.g.dart';
+
+/// Tag-based filter for including/excluding items by their tags.
+@freezed
+sealed class TagFilter with _$TagFilter {
+  const factory TagFilter({
+    @JsonKey(name: 'include_tags') List<String>? includeTags,
+    @JsonKey(name: 'exclude_tags') List<String>? excludeTags,
+  }) = _TagFilter;
+
+  factory TagFilter.fromJson(Map<String, dynamic> json) =>
+      _$TagFilterFromJson(json);
+}
+
+/// Check whether a set of [itemTags] matches the given [filter].
+///
+/// Returns `true` if:
+/// - All include_tags (if any) are present in [itemTags]
+/// - No exclude_tags (if any) are present in [itemTags]
+bool matchesTagFilter(List<String> itemTags, TagFilter filter) {
+  if (filter.includeTags != null &&
+      filter.includeTags!.isNotEmpty &&
+      !filter.includeTags!.every((t) => itemTags.contains(t))) {
+    return false;
+  }
+  if (filter.excludeTags != null &&
+      filter.excludeTags!.isNotEmpty &&
+      filter.excludeTags!.any((t) => itemTags.contains(t))) {
+    return false;
+  }
+  return true;
+}
diff --git a/packages/dataset_config_dart/lib/src/models/tag_filter.freezed.dart b/packages/dataset_config_dart/lib/src/models/tag_filter.freezed.dart
new file mode 100644
index 0000000..5df78eb
--- /dev/null
+++ b/packages/dataset_config_dart/lib/src/models/tag_filter.freezed.dart
@@ -0,0 +1,290 @@
+// GENERATED CODE - DO NOT MODIFY BY HAND
+// coverage:ignore-file
+// ignore_for_file: type=lint
+// ignore_for_file: unused_element, deprecated_member_use, deprecated_member_use_from_same_package, use_function_type_syntax_for_parameters, unnecessary_const, avoid_init_to_null, invalid_override_different_default_values_named, prefer_expression_function_bodies, annotate_overrides, invalid_annotation_target, unnecessary_question_mark
+
+part of 'tag_filter.dart';
+
+// **************************************************************************
+// FreezedGenerator
+// **************************************************************************
+
+// dart format off
+T _$identity<T>(T value) => value;
+
+/// @nodoc
+mixin _$TagFilter {
+
+@JsonKey(name: 'include_tags') List<String>? get includeTags;@JsonKey(name: 'exclude_tags') List<String>? get excludeTags;
+/// Create a copy of TagFilter
+/// with the given fields replaced by the non-null parameter values.
+@JsonKey(includeFromJson: false, includeToJson: false) +@pragma('vm:prefer-inline') +$TagFilterCopyWith get copyWith => _$TagFilterCopyWithImpl(this as TagFilter, _$identity); + + /// Serializes this TagFilter to a JSON map. + Map toJson(); + + +@override +bool operator ==(Object other) { + return identical(this, other) || (other.runtimeType == runtimeType&&other is TagFilter&&const DeepCollectionEquality().equals(other.includeTags, includeTags)&&const DeepCollectionEquality().equals(other.excludeTags, excludeTags)); +} + +@JsonKey(includeFromJson: false, includeToJson: false) +@override +int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(includeTags),const DeepCollectionEquality().hash(excludeTags)); + +@override +String toString() { + return 'TagFilter(includeTags: $includeTags, excludeTags: $excludeTags)'; +} + + +} + +/// @nodoc +abstract mixin class $TagFilterCopyWith<$Res> { + factory $TagFilterCopyWith(TagFilter value, $Res Function(TagFilter) _then) = _$TagFilterCopyWithImpl; +@useResult +$Res call({ +@JsonKey(name: 'include_tags') List? includeTags,@JsonKey(name: 'exclude_tags') List? excludeTags +}); + + + + +} +/// @nodoc +class _$TagFilterCopyWithImpl<$Res> + implements $TagFilterCopyWith<$Res> { + _$TagFilterCopyWithImpl(this._self, this._then); + + final TagFilter _self; + final $Res Function(TagFilter) _then; + +/// Create a copy of TagFilter +/// with the given fields replaced by the non-null parameter values. +@pragma('vm:prefer-inline') @override $Res call({Object? includeTags = freezed,Object? excludeTags = freezed,}) { + return _then(_self.copyWith( +includeTags: freezed == includeTags ? _self.includeTags : includeTags // ignore: cast_nullable_to_non_nullable +as List?,excludeTags: freezed == excludeTags ? _self.excludeTags : excludeTags // ignore: cast_nullable_to_non_nullable +as List?, + )); +} + +} + + +/// Adds pattern-matching-related methods to [TagFilter]. 
+extension TagFilterPatterns on TagFilter { +/// A variant of `map` that fallback to returning `orElse`. +/// +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case final Subclass value: +/// return ...; +/// case _: +/// return orElse(); +/// } +/// ``` + +@optionalTypeArgs TResult maybeMap(TResult Function( _TagFilter value)? $default,{required TResult orElse(),}){ +final _that = this; +switch (_that) { +case _TagFilter() when $default != null: +return $default(_that);case _: + return orElse(); + +} +} +/// A `switch`-like method, using callbacks. +/// +/// Callbacks receives the raw object, upcasted. +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case final Subclass value: +/// return ...; +/// case final Subclass2 value: +/// return ...; +/// } +/// ``` + +@optionalTypeArgs TResult map(TResult Function( _TagFilter value) $default,){ +final _that = this; +switch (_that) { +case _TagFilter(): +return $default(_that);} +} +/// A variant of `map` that fallback to returning `null`. +/// +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case final Subclass value: +/// return ...; +/// case _: +/// return null; +/// } +/// ``` + +@optionalTypeArgs TResult? mapOrNull(TResult? Function( _TagFilter value)? $default,){ +final _that = this; +switch (_that) { +case _TagFilter() when $default != null: +return $default(_that);case _: + return null; + +} +} +/// A variant of `when` that fallback to an `orElse` callback. +/// +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case Subclass(:final field): +/// return ...; +/// case _: +/// return orElse(); +/// } +/// ``` + +@optionalTypeArgs TResult maybeWhen(TResult Function(@JsonKey(name: 'include_tags') List? includeTags, @JsonKey(name: 'exclude_tags') List? excludeTags)? 
$default,{required TResult orElse(),}) {final _that = this; +switch (_that) { +case _TagFilter() when $default != null: +return $default(_that.includeTags,_that.excludeTags);case _: + return orElse(); + +} +} +/// A `switch`-like method, using callbacks. +/// +/// As opposed to `map`, this offers destructuring. +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case Subclass(:final field): +/// return ...; +/// case Subclass2(:final field2): +/// return ...; +/// } +/// ``` + +@optionalTypeArgs TResult when(TResult Function(@JsonKey(name: 'include_tags') List? includeTags, @JsonKey(name: 'exclude_tags') List? excludeTags) $default,) {final _that = this; +switch (_that) { +case _TagFilter(): +return $default(_that.includeTags,_that.excludeTags);} +} +/// A variant of `when` that fallback to returning `null` +/// +/// It is equivalent to doing: +/// ```dart +/// switch (sealedClass) { +/// case Subclass(:final field): +/// return ...; +/// case _: +/// return null; +/// } +/// ``` + +@optionalTypeArgs TResult? whenOrNull(TResult? Function(@JsonKey(name: 'include_tags') List? includeTags, @JsonKey(name: 'exclude_tags') List? excludeTags)? $default,) {final _that = this; +switch (_that) { +case _TagFilter() when $default != null: +return $default(_that.includeTags,_that.excludeTags);case _: + return null; + +} +} + +} + +/// @nodoc +@JsonSerializable() + +class _TagFilter implements TagFilter { + const _TagFilter({@JsonKey(name: 'include_tags') final List? includeTags, @JsonKey(name: 'exclude_tags') final List? excludeTags}): _includeTags = includeTags,_excludeTags = excludeTags; + factory _TagFilter.fromJson(Map json) => _$TagFilterFromJson(json); + + final List? _includeTags; +@override@JsonKey(name: 'include_tags') List? 
get includeTags { + final value = _includeTags; + if (value == null) return null; + if (_includeTags is EqualUnmodifiableListView) return _includeTags; + // ignore: implicit_dynamic_type + return EqualUnmodifiableListView(value); +} + + final List? _excludeTags; +@override@JsonKey(name: 'exclude_tags') List? get excludeTags { + final value = _excludeTags; + if (value == null) return null; + if (_excludeTags is EqualUnmodifiableListView) return _excludeTags; + // ignore: implicit_dynamic_type + return EqualUnmodifiableListView(value); +} + + +/// Create a copy of TagFilter +/// with the given fields replaced by the non-null parameter values. +@override @JsonKey(includeFromJson: false, includeToJson: false) +@pragma('vm:prefer-inline') +_$TagFilterCopyWith<_TagFilter> get copyWith => __$TagFilterCopyWithImpl<_TagFilter>(this, _$identity); + +@override +Map toJson() { + return _$TagFilterToJson(this, ); +} + +@override +bool operator ==(Object other) { + return identical(this, other) || (other.runtimeType == runtimeType&&other is _TagFilter&&const DeepCollectionEquality().equals(other._includeTags, _includeTags)&&const DeepCollectionEquality().equals(other._excludeTags, _excludeTags)); +} + +@JsonKey(includeFromJson: false, includeToJson: false) +@override +int get hashCode => Object.hash(runtimeType,const DeepCollectionEquality().hash(_includeTags),const DeepCollectionEquality().hash(_excludeTags)); + +@override +String toString() { + return 'TagFilter(includeTags: $includeTags, excludeTags: $excludeTags)'; +} + + +} + +/// @nodoc +abstract mixin class _$TagFilterCopyWith<$Res> implements $TagFilterCopyWith<$Res> { + factory _$TagFilterCopyWith(_TagFilter value, $Res Function(_TagFilter) _then) = __$TagFilterCopyWithImpl; +@override @useResult +$Res call({ +@JsonKey(name: 'include_tags') List? includeTags,@JsonKey(name: 'exclude_tags') List? 
excludeTags +}); + + + + +} +/// @nodoc +class __$TagFilterCopyWithImpl<$Res> + implements _$TagFilterCopyWith<$Res> { + __$TagFilterCopyWithImpl(this._self, this._then); + + final _TagFilter _self; + final $Res Function(_TagFilter) _then; + +/// Create a copy of TagFilter +/// with the given fields replaced by the non-null parameter values. +@override @pragma('vm:prefer-inline') $Res call({Object? includeTags = freezed,Object? excludeTags = freezed,}) { + return _then(_TagFilter( +includeTags: freezed == includeTags ? _self._includeTags : includeTags // ignore: cast_nullable_to_non_nullable +as List?,excludeTags: freezed == excludeTags ? _self._excludeTags : excludeTags // ignore: cast_nullable_to_non_nullable +as List?, + )); +} + + +} + +// dart format on diff --git a/packages/dataset_config_dart/lib/src/models/tag_filter.g.dart b/packages/dataset_config_dart/lib/src/models/tag_filter.g.dart new file mode 100644 index 0000000..db8553c --- /dev/null +++ b/packages/dataset_config_dart/lib/src/models/tag_filter.g.dart @@ -0,0 +1,22 @@ +// GENERATED CODE - DO NOT MODIFY BY HAND + +part of 'tag_filter.dart'; + +// ************************************************************************** +// JsonSerializableGenerator +// ************************************************************************** + +_TagFilter _$TagFilterFromJson(Map json) => _TagFilter( + includeTags: (json['include_tags'] as List?) + ?.map((e) => e as String) + .toList(), + excludeTags: (json['exclude_tags'] as List?) 
+ ?.map((e) => e as String) + .toList(), +); + +Map _$TagFilterToJson(_TagFilter instance) => + { + 'include_tags': instance.includeTags, + 'exclude_tags': instance.excludeTags, + }; diff --git a/packages/dataset_config_dart/lib/src/models/task.dart b/packages/dataset_config_dart/lib/src/models/task.dart index ccb568b..1ba0380 100644 --- a/packages/dataset_config_dart/lib/src/models/task.dart +++ b/packages/dataset_config_dart/lib/src/models/task.dart @@ -17,6 +17,12 @@ sealed class Task with _$Task { /// A `Dataset`, a sequence of `Sample` objects, or `null`. Dataset? dataset, + /// Files to copy into sandbox (inherited by all samples). + /// + /// Keys are destination paths, values are source paths, inline text, + /// or inline binary (base64-encoded data URLs). + Map? files, + /// Setup step (always run even when the main solver is replaced). Object? setup, @@ -95,7 +101,13 @@ sealed class Task with _$Task { /// `@task` function (e.g. `"flutter_code_gen"` or /// `"dash_evals.runner.tasks.flutter_code_gen"`). /// When absent, the runner hydrates directly from JSON (Mode 2 — future). - @JsonKey(name: 'task_func') String? taskFunc, + @JsonKey(name: 'func') String? func, + + /// System message override for this task. + @JsonKey(name: 'system_message') String? systemMessage, + + /// Pass-through dict for sandbox plugin configuration. + @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, /// Task name. 
/// @@ -113,14 +125,14 @@ sealed class Task with _$Task { } class TaskMetadata { - final String taskFunc; + final String func; final Map additional; - TaskMetadata(this.taskFunc, this.additional); + TaskMetadata(this.func, this.additional); Map toJson() { return { - 'taskFunc': taskFunc, + 'func': func, }; } } diff --git a/packages/dataset_config_dart/lib/src/models/task.freezed.dart b/packages/dataset_config_dart/lib/src/models/task.freezed.dart index 94a4a37..bbf94e6 100644 --- a/packages/dataset_config_dart/lib/src/models/task.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/task.freezed.dart @@ -18,7 +18,11 @@ mixin _$Task { /// Dataset to evaluate. /// /// A `Dataset`, a sequence of `Sample` objects, or `null`. - Dataset? get dataset;/// Setup step (always run even when the main solver is replaced). + Dataset? get dataset;/// Files to copy into sandbox (inherited by all samples). +/// +/// Keys are destination paths, values are source paths, inline text, +/// or inline binary (base64-encoded data URLs). + Map? get files;/// Setup step (always run even when the main solver is replaced). Object? get setup;/// Solver or list of solvers. Defaults to `generate()`. Object? get solver;/// Optional cleanup function for task. /// @@ -50,14 +54,15 @@ mixin _$Task { @JsonKey(name: 'early_stopping') Object? get earlyStopping;/// Task display name (e.g. for plotting). /// /// Defaults to the registered task name. -@JsonKey(name: 'display_name') String? get displayName; -/// Task function identifier for Mode 1 hydration. +@JsonKey(name: 'display_name') String? get displayName;/// Task function identifier for Mode 1 hydration. /// /// When present, the Python runner uses this to look up a pre-built /// `@task` function (e.g. `"flutter_code_gen"` or /// `"dash_evals.runner.tasks.flutter_code_gen"`). /// When absent, the runner hydrates directly from JSON (Mode 2 — future). -@JsonKey(name: 'task_func') String? get taskFunc;/// Task name. 
+@JsonKey(name: 'func') String? get func;/// System message override for this task. +@JsonKey(name: 'system_message') String? get systemMessage;/// Pass-through dict for sandbox plugin configuration. +@JsonKey(name: 'sandbox_parameters') Map? get sandboxParameters;/// Task name. /// /// Automatically determined based on the registered name if not specified. String? get name;/// Version of task (to distinguish evolutions of the task spec). @@ -75,16 +80,16 @@ $TaskCopyWith get copyWith => _$TaskCopyWithImpl(this as Task, _$ide @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const 
DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.taskFunc, taskFunc) || other.taskFunc == taskFunc)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other.metadata, metadata)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.files, files)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other.modelRoles, modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == 
displayName)&&(identical(other.func, func) || other.func == func)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other.sandboxParameters, sandboxParameters)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other.metadata, metadata)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,taskFunc,name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(metadata)]); +int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(files),const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const 
DeepCollectionEquality().hash(earlyStopping),displayName,func,systemMessage,const DeepCollectionEquality().hash(sandboxParameters),name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(metadata)]); @override String toString() { - return 'Task(dataset: $dataset, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, taskFunc: $taskFunc, name: $name, version: $version, metadata: $metadata)'; + return 'Task(dataset: $dataset, files: $files, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, func: $func, systemMessage: $systemMessage, sandboxParameters: $sandboxParameters, name: $name, version: $version, metadata: $metadata)'; } @@ -95,7 +100,7 @@ abstract mixin class $TaskCopyWith<$Res> { factory $TaskCopyWith(Task value, $Res Function(Task) _then) = _$TaskCopyWithImpl; @useResult $Res call({ - Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? 
messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? metadata + Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'func') String? func,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata }); @@ -112,10 +117,11 @@ class _$TaskCopyWithImpl<$Res> /// Create a copy of Task /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? dataset = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? 
costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? taskFunc = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? dataset = freezed,Object? files = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? func = freezed,Object? systemMessage = freezed,Object? sandboxParameters = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { return _then(_self.copyWith( dataset: freezed == dataset ? _self.dataset : dataset // ignore: cast_nullable_to_non_nullable -as Dataset?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable +as Dataset?,files: freezed == files ? _self.files : files // ignore: cast_nullable_to_non_nullable +as Map?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable as String?,config: freezed == config ? 
_self.config : config ,modelRoles: freezed == modelRoles ? _self.modelRoles : modelRoles // ignore: cast_nullable_to_non_nullable as Map?,sandbox: freezed == sandbox ? _self.sandbox : sandbox ,approval: freezed == approval ? _self.approval : approval ,epochs: freezed == epochs ? _self.epochs : epochs ,failOnError: freezed == failOnError ? _self.failOnError : failOnError ,continueOnFail: freezed == continueOnFail ? _self.continueOnFail : continueOnFail // ignore: cast_nullable_to_non_nullable as bool?,messageLimit: freezed == messageLimit ? _self.messageLimit : messageLimit // ignore: cast_nullable_to_non_nullable @@ -124,8 +130,10 @@ as int?,timeLimit: freezed == timeLimit ? _self.timeLimit : timeLimit // ignore: as int?,workingLimit: freezed == workingLimit ? _self.workingLimit : workingLimit // ignore: cast_nullable_to_non_nullable as int?,costLimit: freezed == costLimit ? _self.costLimit : costLimit // ignore: cast_nullable_to_non_nullable as double?,earlyStopping: freezed == earlyStopping ? _self.earlyStopping : earlyStopping ,displayName: freezed == displayName ? _self.displayName : displayName // ignore: cast_nullable_to_non_nullable -as String?,taskFunc: freezed == taskFunc ? _self.taskFunc : taskFunc // ignore: cast_nullable_to_non_nullable -as String?,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable +as String?,func: freezed == func ? _self.func : func // ignore: cast_nullable_to_non_nullable +as String?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable +as String?,sandboxParameters: freezed == sandboxParameters ? _self.sandboxParameters : sandboxParameters // ignore: cast_nullable_to_non_nullable +as Map?,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String?,version: null == version ? _self.version : version ,metadata: freezed == metadata ? 
_self.metadata : metadata // ignore: cast_nullable_to_non_nullable as Map?, )); @@ -221,10 +229,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? metadata)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata)? 
$default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Task() when $default != null: -return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.taskFunc,_that.name,_that.version,_that.metadata);case _: +return $default(_that.dataset,_that.files,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);case _: return orElse(); } @@ -242,10 +250,10 @@ return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.score /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? 
metadata) $default,) {final _that = this; +@optionalTypeArgs TResult when(TResult Function( Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata) $default,) {final _that = this; switch (_that) { case _Task(): -return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.taskFunc,_that.name,_that.version,_that.metadata);} +return $default(_that.dataset,_that.files,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);} } /// A variant of `when` that fallback to returning `null` /// @@ -259,10 +267,10 @@ 
return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.score /// } /// ``` -@optionalTypeArgs TResult? whenOrNull(TResult? Function( Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? metadata)? $default,) {final _that = this; +@optionalTypeArgs TResult? whenOrNull(TResult? Function( Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config, @JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs, @JsonKey(name: 'fail_on_error') Object? failOnError, @JsonKey(name: 'continue_on_fail') bool? continueOnFail, @JsonKey(name: 'message_limit') int? messageLimit, @JsonKey(name: 'token_limit') int? tokenLimit, @JsonKey(name: 'time_limit') int? timeLimit, @JsonKey(name: 'working_limit') int? workingLimit, @JsonKey(name: 'cost_limit') double? costLimit, @JsonKey(name: 'early_stopping') Object? earlyStopping, @JsonKey(name: 'display_name') String? displayName, @JsonKey(name: 'func') String? func, @JsonKey(name: 'system_message') String? systemMessage, @JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata)? 
$default,) {final _that = this; switch (_that) { case _Task() when $default != null: -return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.taskFunc,_that.name,_that.version,_that.metadata);case _: +return $default(_that.dataset,_that.files,_that.setup,_that.solver,_that.cleanup,_that.scorer,_that.metrics,_that.model,_that.config,_that.modelRoles,_that.sandbox,_that.approval,_that.epochs,_that.failOnError,_that.continueOnFail,_that.messageLimit,_that.tokenLimit,_that.timeLimit,_that.workingLimit,_that.costLimit,_that.earlyStopping,_that.displayName,_that.func,_that.systemMessage,_that.sandboxParameters,_that.name,_that.version,_that.metadata);case _: return null; } @@ -274,13 +282,30 @@ return $default(_that.dataset,_that.setup,_that.solver,_that.cleanup,_that.score @JsonSerializable() class _Task implements Task { - const _Task({this.dataset, this.setup, this.solver, this.cleanup, this.scorer, this.metrics, this.model, this.config, @JsonKey(name: 'model_roles') final Map? modelRoles, this.sandbox, this.approval, this.epochs, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'early_stopping') this.earlyStopping, @JsonKey(name: 'display_name') this.displayName, @JsonKey(name: 'task_func') this.taskFunc, this.name, this.version = 0, final Map? metadata}): _modelRoles = modelRoles,_metadata = metadata; + const _Task({this.dataset, final Map? 
files, this.setup, this.solver, this.cleanup, this.scorer, this.metrics, this.model, this.config, @JsonKey(name: 'model_roles') final Map? modelRoles, this.sandbox, this.approval, this.epochs, @JsonKey(name: 'fail_on_error') this.failOnError, @JsonKey(name: 'continue_on_fail') this.continueOnFail, @JsonKey(name: 'message_limit') this.messageLimit, @JsonKey(name: 'token_limit') this.tokenLimit, @JsonKey(name: 'time_limit') this.timeLimit, @JsonKey(name: 'working_limit') this.workingLimit, @JsonKey(name: 'cost_limit') this.costLimit, @JsonKey(name: 'early_stopping') this.earlyStopping, @JsonKey(name: 'display_name') this.displayName, @JsonKey(name: 'func') this.func, @JsonKey(name: 'system_message') this.systemMessage, @JsonKey(name: 'sandbox_parameters') final Map? sandboxParameters, this.name, this.version = 0, final Map? metadata}): _files = files,_modelRoles = modelRoles,_sandboxParameters = sandboxParameters,_metadata = metadata; factory _Task.fromJson(Map json) => _$TaskFromJson(json); /// Dataset to evaluate. /// /// A `Dataset`, a sequence of `Sample` objects, or `null`. @override final Dataset? dataset; +/// Files to copy into sandbox (inherited by all samples). +/// +/// Keys are destination paths, values are source paths, inline text, +/// or inline binary (base64-encoded data URLs). + final Map? _files; +/// Files to copy into sandbox (inherited by all samples). +/// +/// Keys are destination paths, values are source paths, inline text, +/// or inline binary (base64-encoded data URLs). +@override Map? get files { + final value = _files; + if (value == null) return null; + if (_files is EqualUnmodifiableMapView) return _files; + // ignore: implicit_dynamic_type + return EqualUnmodifiableMapView(value); +} + /// Setup step (always run even when the main solver is replaced). @override final Object? setup; /// Solver or list of solvers. Defaults to `generate()`. @@ -348,7 +373,20 @@ class _Task implements Task { /// `@task` function (e.g. 
`"flutter_code_gen"` or /// `"dash_evals.runner.tasks.flutter_code_gen"`). /// When absent, the runner hydrates directly from JSON (Mode 2 — future). -@override@JsonKey(name: 'task_func') final String? taskFunc; +@override@JsonKey(name: 'func') final String? func; +/// System message override for this task. +@override@JsonKey(name: 'system_message') final String? systemMessage; +/// Pass-through dict for sandbox plugin configuration. + final Map? _sandboxParameters; +/// Pass-through dict for sandbox plugin configuration. +@override@JsonKey(name: 'sandbox_parameters') Map? get sandboxParameters { + final value = _sandboxParameters; + if (value == null) return null; + if (_sandboxParameters is EqualUnmodifiableMapView) return _sandboxParameters; + // ignore: implicit_dynamic_type + return EqualUnmodifiableMapView(value); +} + /// Task name. /// /// Automatically determined based on the registered name if not specified. @@ -380,16 +418,16 @@ Map toJson() { @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is _Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == 
continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.taskFunc, taskFunc) || other.taskFunc == taskFunc)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other._metadata, _metadata)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is _Task&&(identical(other.dataset, dataset) || other.dataset == dataset)&&const DeepCollectionEquality().equals(other._files, _files)&&const DeepCollectionEquality().equals(other.setup, setup)&&const DeepCollectionEquality().equals(other.solver, solver)&&const DeepCollectionEquality().equals(other.cleanup, cleanup)&&const DeepCollectionEquality().equals(other.scorer, scorer)&&const DeepCollectionEquality().equals(other.metrics, metrics)&&(identical(other.model, model) || other.model == model)&&const DeepCollectionEquality().equals(other.config, config)&&const DeepCollectionEquality().equals(other._modelRoles, _modelRoles)&&const DeepCollectionEquality().equals(other.sandbox, sandbox)&&const DeepCollectionEquality().equals(other.approval, approval)&&const DeepCollectionEquality().equals(other.epochs, epochs)&&const DeepCollectionEquality().equals(other.failOnError, failOnError)&&(identical(other.continueOnFail, continueOnFail) || other.continueOnFail == continueOnFail)&&(identical(other.messageLimit, messageLimit) || other.messageLimit == messageLimit)&&(identical(other.tokenLimit, 
tokenLimit) || other.tokenLimit == tokenLimit)&&(identical(other.timeLimit, timeLimit) || other.timeLimit == timeLimit)&&(identical(other.workingLimit, workingLimit) || other.workingLimit == workingLimit)&&(identical(other.costLimit, costLimit) || other.costLimit == costLimit)&&const DeepCollectionEquality().equals(other.earlyStopping, earlyStopping)&&(identical(other.displayName, displayName) || other.displayName == displayName)&&(identical(other.func, func) || other.func == func)&&(identical(other.systemMessage, systemMessage) || other.systemMessage == systemMessage)&&const DeepCollectionEquality().equals(other._sandboxParameters, _sandboxParameters)&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.version, version)&&const DeepCollectionEquality().equals(other._metadata, _metadata)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,taskFunc,name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(_metadata)]); +int get hashCode => Object.hashAll([runtimeType,dataset,const DeepCollectionEquality().hash(_files),const DeepCollectionEquality().hash(setup),const DeepCollectionEquality().hash(solver),const DeepCollectionEquality().hash(cleanup),const 
DeepCollectionEquality().hash(scorer),const DeepCollectionEquality().hash(metrics),model,const DeepCollectionEquality().hash(config),const DeepCollectionEquality().hash(_modelRoles),const DeepCollectionEquality().hash(sandbox),const DeepCollectionEquality().hash(approval),const DeepCollectionEquality().hash(epochs),const DeepCollectionEquality().hash(failOnError),continueOnFail,messageLimit,tokenLimit,timeLimit,workingLimit,costLimit,const DeepCollectionEquality().hash(earlyStopping),displayName,func,systemMessage,const DeepCollectionEquality().hash(_sandboxParameters),name,const DeepCollectionEquality().hash(version),const DeepCollectionEquality().hash(_metadata)]); @override String toString() { - return 'Task(dataset: $dataset, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, taskFunc: $taskFunc, name: $name, version: $version, metadata: $metadata)'; + return 'Task(dataset: $dataset, files: $files, setup: $setup, solver: $solver, cleanup: $cleanup, scorer: $scorer, metrics: $metrics, model: $model, config: $config, modelRoles: $modelRoles, sandbox: $sandbox, approval: $approval, epochs: $epochs, failOnError: $failOnError, continueOnFail: $continueOnFail, messageLimit: $messageLimit, tokenLimit: $tokenLimit, timeLimit: $timeLimit, workingLimit: $workingLimit, costLimit: $costLimit, earlyStopping: $earlyStopping, displayName: $displayName, func: $func, systemMessage: $systemMessage, sandboxParameters: $sandboxParameters, name: $name, version: $version, metadata: $metadata)'; } @@ -400,7 +438,7 @@ abstract mixin class _$TaskCopyWith<$Res> implements $TaskCopyWith<$Res> { 
factory _$TaskCopyWith(_Task value, $Res Function(_Task) _then) = __$TaskCopyWithImpl; @override @useResult $Res call({ - Dataset? dataset, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'task_func') String? taskFunc, String? name, Object version, Map? metadata + Dataset? dataset, Map? files, Object? setup, Object? solver, Object? cleanup, Object? scorer, Object? metrics, String? model, Object? config,@JsonKey(name: 'model_roles') Map? modelRoles, Object? sandbox, Object? approval, Object? epochs,@JsonKey(name: 'fail_on_error') Object? failOnError,@JsonKey(name: 'continue_on_fail') bool? continueOnFail,@JsonKey(name: 'message_limit') int? messageLimit,@JsonKey(name: 'token_limit') int? tokenLimit,@JsonKey(name: 'time_limit') int? timeLimit,@JsonKey(name: 'working_limit') int? workingLimit,@JsonKey(name: 'cost_limit') double? costLimit,@JsonKey(name: 'early_stopping') Object? earlyStopping,@JsonKey(name: 'display_name') String? displayName,@JsonKey(name: 'func') String? func,@JsonKey(name: 'system_message') String? systemMessage,@JsonKey(name: 'sandbox_parameters') Map? sandboxParameters, String? name, Object version, Map? metadata }); @@ -417,10 +455,11 @@ class __$TaskCopyWithImpl<$Res> /// Create a copy of Task /// with the given fields replaced by the non-null parameter values. -@override @pragma('vm:prefer-inline') $Res call({Object? 
dataset = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? taskFunc = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { +@override @pragma('vm:prefer-inline') $Res call({Object? dataset = freezed,Object? files = freezed,Object? setup = freezed,Object? solver = freezed,Object? cleanup = freezed,Object? scorer = freezed,Object? metrics = freezed,Object? model = freezed,Object? config = freezed,Object? modelRoles = freezed,Object? sandbox = freezed,Object? approval = freezed,Object? epochs = freezed,Object? failOnError = freezed,Object? continueOnFail = freezed,Object? messageLimit = freezed,Object? tokenLimit = freezed,Object? timeLimit = freezed,Object? workingLimit = freezed,Object? costLimit = freezed,Object? earlyStopping = freezed,Object? displayName = freezed,Object? func = freezed,Object? systemMessage = freezed,Object? sandboxParameters = freezed,Object? name = freezed,Object? version = null,Object? metadata = freezed,}) { return _then(_Task( dataset: freezed == dataset ? _self.dataset : dataset // ignore: cast_nullable_to_non_nullable -as Dataset?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? 
_self.model : model // ignore: cast_nullable_to_non_nullable +as Dataset?,files: freezed == files ? _self._files : files // ignore: cast_nullable_to_non_nullable +as Map?,setup: freezed == setup ? _self.setup : setup ,solver: freezed == solver ? _self.solver : solver ,cleanup: freezed == cleanup ? _self.cleanup : cleanup ,scorer: freezed == scorer ? _self.scorer : scorer ,metrics: freezed == metrics ? _self.metrics : metrics ,model: freezed == model ? _self.model : model // ignore: cast_nullable_to_non_nullable as String?,config: freezed == config ? _self.config : config ,modelRoles: freezed == modelRoles ? _self._modelRoles : modelRoles // ignore: cast_nullable_to_non_nullable as Map?,sandbox: freezed == sandbox ? _self.sandbox : sandbox ,approval: freezed == approval ? _self.approval : approval ,epochs: freezed == epochs ? _self.epochs : epochs ,failOnError: freezed == failOnError ? _self.failOnError : failOnError ,continueOnFail: freezed == continueOnFail ? _self.continueOnFail : continueOnFail // ignore: cast_nullable_to_non_nullable as bool?,messageLimit: freezed == messageLimit ? _self.messageLimit : messageLimit // ignore: cast_nullable_to_non_nullable @@ -429,8 +468,10 @@ as int?,timeLimit: freezed == timeLimit ? _self.timeLimit : timeLimit // ignore: as int?,workingLimit: freezed == workingLimit ? _self.workingLimit : workingLimit // ignore: cast_nullable_to_non_nullable as int?,costLimit: freezed == costLimit ? _self.costLimit : costLimit // ignore: cast_nullable_to_non_nullable as double?,earlyStopping: freezed == earlyStopping ? _self.earlyStopping : earlyStopping ,displayName: freezed == displayName ? _self.displayName : displayName // ignore: cast_nullable_to_non_nullable -as String?,taskFunc: freezed == taskFunc ? _self.taskFunc : taskFunc // ignore: cast_nullable_to_non_nullable -as String?,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable +as String?,func: freezed == func ? 
_self.func : func // ignore: cast_nullable_to_non_nullable +as String?,systemMessage: freezed == systemMessage ? _self.systemMessage : systemMessage // ignore: cast_nullable_to_non_nullable +as String?,sandboxParameters: freezed == sandboxParameters ? _self._sandboxParameters : sandboxParameters // ignore: cast_nullable_to_non_nullable +as Map?,name: freezed == name ? _self.name : name // ignore: cast_nullable_to_non_nullable as String?,version: null == version ? _self.version : version ,metadata: freezed == metadata ? _self._metadata : metadata // ignore: cast_nullable_to_non_nullable as Map?, )); diff --git a/packages/dataset_config_dart/lib/src/models/task.g.dart b/packages/dataset_config_dart/lib/src/models/task.g.dart index 9906b3a..0ad2491 100644 --- a/packages/dataset_config_dart/lib/src/models/task.g.dart +++ b/packages/dataset_config_dart/lib/src/models/task.g.dart @@ -10,6 +10,9 @@ _Task _$TaskFromJson(Map json) => _Task( dataset: json['dataset'] == null ? null : Dataset.fromJson(json['dataset'] as Map), + files: (json['files'] as Map?)?.map( + (k, e) => MapEntry(k, e as String), + ), setup: json['setup'], solver: json['solver'], cleanup: json['cleanup'], @@ -32,14 +35,17 @@ _Task _$TaskFromJson(Map json) => _Task( costLimit: (json['cost_limit'] as num?)?.toDouble(), earlyStopping: json['early_stopping'], displayName: json['display_name'] as String?, - taskFunc: json['task_func'] as String?, + func: json['func'] as String?, + systemMessage: json['system_message'] as String?, + sandboxParameters: json['sandbox_parameters'] as Map?, name: json['name'] as String?, version: json['version'] as Object? ?? 
0, metadata: json['metadata'] as Map?, ); Map _$TaskToJson(_Task instance) => { - 'dataset': instance.dataset?.toJson(), + 'dataset': instance.dataset, + 'files': instance.files, 'setup': instance.setup, 'solver': instance.solver, 'cleanup': instance.cleanup, @@ -60,7 +66,9 @@ Map _$TaskToJson(_Task instance) => { 'cost_limit': instance.costLimit, 'early_stopping': instance.earlyStopping, 'display_name': instance.displayName, - 'task_func': instance.taskFunc, + 'func': instance.func, + 'system_message': instance.systemMessage, + 'sandbox_parameters': instance.sandboxParameters, 'name': instance.name, 'version': instance.version, 'metadata': instance.metadata, diff --git a/packages/dataset_config_dart/lib/src/models/variant.dart b/packages/dataset_config_dart/lib/src/models/variant.dart index 82afa37..15a3bbb 100644 --- a/packages/dataset_config_dart/lib/src/models/variant.dart +++ b/packages/dataset_config_dart/lib/src/models/variant.dart @@ -11,9 +11,10 @@ part 'variant.g.dart'; /// performance with and without specific tooling or context. 
/// /// Features are implied by field presence — no explicit feature list needed: -/// - [contextFiles] populated → context injection enabled +/// - [files] populated → context injection enabled /// - [mcpServers] populated → MCP tools enabled -/// - [skillPaths] populated → agent skills enabled +/// - [skills] populated → agent skills enabled +/// - [taskParameters] populated → extra parameters passed to the task /// - all empty → baseline variant /// /// Example YAML: @@ -21,10 +22,13 @@ part 'variant.g.dart'; /// variants: /// baseline: {} /// context_only: -/// context_files: [./context_files/flutter.md] +/// files: [./context_files/flutter.md] /// full: -/// context_files: [./context_files/flutter.md] -/// mcp_servers: [dart] +/// files: [./context_files/flutter.md] +/// mcp_servers: +/// - name: dart +/// command: dart +/// args: [mcp-server] /// skills: [./skills/flutter_docs_ui] /// ``` @freezed @@ -34,18 +38,21 @@ sealed class Variant with _$Variant { @Default('baseline') String name, /// Loaded context files (paths resolved by config resolver). - @JsonKey(name: 'context_files') @Default([]) List contextFiles, + @JsonKey(name: 'files') @Default([]) List files, - /// MCP server keys to enable (e.g., `['dart']`). - @JsonKey(name: 'mcp_servers') @Default([]) List mcpServers, + /// MCP server configurations (list of config maps or ref strings). + @JsonKey(name: 'mcp_servers') + @Default([]) + List> mcpServers, /// Resolved paths to agent skill directories. /// Each directory must contain a `SKILL.md` file. - @JsonKey(name: 'skill_paths') @Default([]) List skillPaths, + @JsonKey(name: 'skills') @Default([]) List skills, - /// Flutter SDK channel to use (e.g., `'stable'`, `'beta'`, `'main'`). - /// `null` means use the default (stable) image from the job's sandbox. - @JsonKey(name: 'flutter_channel') String? flutterChannel, + /// Optional parameters merged into the task config dict at runtime. 
+ @JsonKey(name: 'task_parameters') + @Default({}) + Map taskParameters, }) = _Variant; const Variant._(); diff --git a/packages/dataset_config_dart/lib/src/models/variant.freezed.dart b/packages/dataset_config_dart/lib/src/models/variant.freezed.dart index 9fe224c..f322f5a 100644 --- a/packages/dataset_config_dart/lib/src/models/variant.freezed.dart +++ b/packages/dataset_config_dart/lib/src/models/variant.freezed.dart @@ -17,12 +17,11 @@ mixin _$Variant { /// User-defined variant name from the job file. String get name;/// Loaded context files (paths resolved by config resolver). -@JsonKey(name: 'context_files') List get contextFiles;/// MCP server keys to enable (e.g., `['dart']`). -@JsonKey(name: 'mcp_servers') List get mcpServers;/// Resolved paths to agent skill directories. +@JsonKey(name: 'files') List get files;/// MCP server configurations (list of config maps or ref strings). +@JsonKey(name: 'mcp_servers') List> get mcpServers;/// Resolved paths to agent skill directories. /// Each directory must contain a `SKILL.md` file. -@JsonKey(name: 'skill_paths') List get skillPaths;/// Flutter SDK channel to use (e.g., `'stable'`, `'beta'`, `'main'`). -/// `null` means use the default (stable) image from the job's sandbox. -@JsonKey(name: 'flutter_channel') String? get flutterChannel; +@JsonKey(name: 'skills') List get skills;/// Optional parameters merged into the task config dict at runtime. +@JsonKey(name: 'task_parameters') Map get taskParameters; /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. 
@JsonKey(includeFromJson: false, includeToJson: false) @@ -35,16 +34,16 @@ $VariantCopyWith get copyWith => _$VariantCopyWithImpl(this as @override bool operator ==(Object other) { - return identical(this, other) || (other.runtimeType == runtimeType&&other is Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.contextFiles, contextFiles)&&const DeepCollectionEquality().equals(other.mcpServers, mcpServers)&&const DeepCollectionEquality().equals(other.skillPaths, skillPaths)&&(identical(other.flutterChannel, flutterChannel) || other.flutterChannel == flutterChannel)); + return identical(this, other) || (other.runtimeType == runtimeType&&other is Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other.files, files)&&const DeepCollectionEquality().equals(other.mcpServers, mcpServers)&&const DeepCollectionEquality().equals(other.skills, skills)&&const DeepCollectionEquality().equals(other.taskParameters, taskParameters)); } @JsonKey(includeFromJson: false, includeToJson: false) @override -int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(contextFiles),const DeepCollectionEquality().hash(mcpServers),const DeepCollectionEquality().hash(skillPaths),flutterChannel); +int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(files),const DeepCollectionEquality().hash(mcpServers),const DeepCollectionEquality().hash(skills),const DeepCollectionEquality().hash(taskParameters)); @override String toString() { - return 'Variant(name: $name, contextFiles: $contextFiles, mcpServers: $mcpServers, skillPaths: $skillPaths, flutterChannel: $flutterChannel)'; + return 'Variant(name: $name, files: $files, mcpServers: $mcpServers, skills: $skills, taskParameters: $taskParameters)'; } @@ -55,7 +54,7 @@ abstract mixin class $VariantCopyWith<$Res> { factory $VariantCopyWith(Variant value, $Res Function(Variant) _then) = 
_$VariantCopyWithImpl; @useResult $Res call({ - String name,@JsonKey(name: 'context_files') List contextFiles,@JsonKey(name: 'mcp_servers') List mcpServers,@JsonKey(name: 'skill_paths') List skillPaths,@JsonKey(name: 'flutter_channel') String? flutterChannel + String name,@JsonKey(name: 'files') List files,@JsonKey(name: 'mcp_servers') List> mcpServers,@JsonKey(name: 'skills') List skills,@JsonKey(name: 'task_parameters') Map taskParameters }); @@ -72,14 +71,14 @@ class _$VariantCopyWithImpl<$Res> /// Create a copy of Variant /// with the given fields replaced by the non-null parameter values. -@pragma('vm:prefer-inline') @override $Res call({Object? name = null,Object? contextFiles = null,Object? mcpServers = null,Object? skillPaths = null,Object? flutterChannel = freezed,}) { +@pragma('vm:prefer-inline') @override $Res call({Object? name = null,Object? files = null,Object? mcpServers = null,Object? skills = null,Object? taskParameters = null,}) { return _then(_self.copyWith( name: null == name ? _self.name : name // ignore: cast_nullable_to_non_nullable -as String,contextFiles: null == contextFiles ? _self.contextFiles : contextFiles // ignore: cast_nullable_to_non_nullable +as String,files: null == files ? _self.files : files // ignore: cast_nullable_to_non_nullable as List,mcpServers: null == mcpServers ? _self.mcpServers : mcpServers // ignore: cast_nullable_to_non_nullable -as List,skillPaths: null == skillPaths ? _self.skillPaths : skillPaths // ignore: cast_nullable_to_non_nullable -as List,flutterChannel: freezed == flutterChannel ? _self.flutterChannel : flutterChannel // ignore: cast_nullable_to_non_nullable -as String?, +as List>,skills: null == skills ? _self.skills : skills // ignore: cast_nullable_to_non_nullable +as List,taskParameters: null == taskParameters ? 
_self.taskParameters : taskParameters // ignore: cast_nullable_to_non_nullable +as Map, )); } @@ -161,10 +160,10 @@ return $default(_that);case _: /// } /// ``` -@optionalTypeArgs TResult maybeWhen(TResult Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'flutter_channel') String? flutterChannel)? $default,{required TResult orElse(),}) {final _that = this; +@optionalTypeArgs TResult maybeWhen(TResult Function( String name, @JsonKey(name: 'files') List files, @JsonKey(name: 'mcp_servers') List> mcpServers, @JsonKey(name: 'skills') List skills, @JsonKey(name: 'task_parameters') Map taskParameters)? $default,{required TResult orElse(),}) {final _that = this; switch (_that) { case _Variant() when $default != null: -return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.flutterChannel);case _: +return $default(_that.name,_that.files,_that.mcpServers,_that.skills,_that.taskParameters);case _: return orElse(); } @@ -182,10 +181,10 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths, /// } /// ``` -@optionalTypeArgs TResult when(TResult Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'flutter_channel') String? 
flutterChannel) $default,) {final _that = this;
+@optionalTypeArgs TResult when(TResult Function( String name, @JsonKey(name: 'files') List files, @JsonKey(name: 'mcp_servers') List> mcpServers, @JsonKey(name: 'skills') List skills, @JsonKey(name: 'task_parameters') Map taskParameters) $default,) {final _that = this;
 switch (_that) {
 case _Variant():
-return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.flutterChannel);}
+return $default(_that.name,_that.files,_that.mcpServers,_that.skills,_that.taskParameters);}
 }
 /// A variant of `when` that fallback to returning `null`
 ///
@@ -199,10 +198,10 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,
 /// }
 /// ```
-@optionalTypeArgs TResult? whenOrNull(TResult? Function( String name, @JsonKey(name: 'context_files') List contextFiles, @JsonKey(name: 'mcp_servers') List mcpServers, @JsonKey(name: 'skill_paths') List skillPaths, @JsonKey(name: 'flutter_channel') String? flutterChannel)? $default,) {final _that = this;
+@optionalTypeArgs TResult? whenOrNull(TResult? Function( String name, @JsonKey(name: 'files') List files, @JsonKey(name: 'mcp_servers') List> mcpServers, @JsonKey(name: 'skills') List skills, @JsonKey(name: 'task_parameters') Map taskParameters)?
$default,) {final _that = this;
 switch (_that) {
 case _Variant() when $default != null:
-return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,_that.flutterChannel);case _:
+return $default(_that.name,_that.files,_that.mcpServers,_that.skills,_that.taskParameters);case _:
 return null;
 }
@@ -214,24 +213,24 @@ return $default(_that.name,_that.contextFiles,_that.mcpServers,_that.skillPaths,
 @JsonSerializable()
 class _Variant extends Variant {
- const _Variant({this.name = 'baseline', @JsonKey(name: 'context_files') final List contextFiles = const [], @JsonKey(name: 'mcp_servers') final List mcpServers = const [], @JsonKey(name: 'skill_paths') final List skillPaths = const [], @JsonKey(name: 'flutter_channel') this.flutterChannel}): _contextFiles = contextFiles,_mcpServers = mcpServers,_skillPaths = skillPaths,super._();
+ const _Variant({this.name = 'baseline', @JsonKey(name: 'files') final List files = const [], @JsonKey(name: 'mcp_servers') final List> mcpServers = const [], @JsonKey(name: 'skills') final List skills = const [], @JsonKey(name: 'task_parameters') final Map taskParameters = const {}}): _files = files,_mcpServers = mcpServers,_skills = skills,_taskParameters = taskParameters,super._();
 factory _Variant.fromJson(Map json) => _$VariantFromJson(json);
 /// User-defined variant name from the job file.
 @override@JsonKey() final String name;
 /// Loaded context files (paths resolved by config resolver).
- final List _contextFiles;
+ final List _files;
 /// Loaded context files (paths resolved by config resolver).
-@override@JsonKey(name: 'context_files') List get contextFiles {
- if (_contextFiles is EqualUnmodifiableListView) return _contextFiles;
+@override@JsonKey(name: 'files') List get files {
+ if (_files is EqualUnmodifiableListView) return _files;
 // ignore: implicit_dynamic_type
- return EqualUnmodifiableListView(_contextFiles);
+ return EqualUnmodifiableListView(_files);
 }
-/// MCP server keys to enable (e.g., `['dart']`).
- final List _mcpServers;
-/// MCP server keys to enable (e.g., `['dart']`).
-@override@JsonKey(name: 'mcp_servers') List get mcpServers {
+/// MCP server configurations (list of config maps or ref strings).
+ final List> _mcpServers;
+/// MCP server configurations (list of config maps or ref strings).
+@override@JsonKey(name: 'mcp_servers') List> get mcpServers {
 if (_mcpServers is EqualUnmodifiableListView) return _mcpServers;
 // ignore: implicit_dynamic_type
 return EqualUnmodifiableListView(_mcpServers);
@@ -239,18 +238,24 @@ class _Variant extends Variant {
 /// Resolved paths to agent skill directories.
 /// Each directory must contain a `SKILL.md` file.
- final List _skillPaths;
+ final List _skills;
 /// Resolved paths to agent skill directories.
 /// Each directory must contain a `SKILL.md` file.
-@override@JsonKey(name: 'skill_paths') List get skillPaths {
- if (_skillPaths is EqualUnmodifiableListView) return _skillPaths;
+@override@JsonKey(name: 'skills') List get skills {
+ if (_skills is EqualUnmodifiableListView) return _skills;
 // ignore: implicit_dynamic_type
- return EqualUnmodifiableListView(_skillPaths);
+ return EqualUnmodifiableListView(_skills);
+}
+
+/// Optional parameters merged into the task config dict at runtime.
+ final Map _taskParameters;
+/// Optional parameters merged into the task config dict at runtime.
+@override@JsonKey(name: 'task_parameters') Map get taskParameters {
+ if (_taskParameters is EqualUnmodifiableMapView) return _taskParameters;
+ // ignore: implicit_dynamic_type
+ return EqualUnmodifiableMapView(_taskParameters);
 }
-/// Flutter SDK channel to use (e.g., `'stable'`, `'beta'`, `'main'`).
-/// `null` means use the default (stable) image from the job's sandbox.
-@override@JsonKey(name: 'flutter_channel') final String? flutterChannel;
 /// Create a copy of Variant
 /// with the given fields replaced by the non-null parameter values.
@@ -265,16 +270,16 @@ Map toJson() {
 @override
 bool operator ==(Object other) {
- return identical(this, other) || (other.runtimeType == runtimeType&&other is _Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other._contextFiles, _contextFiles)&&const DeepCollectionEquality().equals(other._mcpServers, _mcpServers)&&const DeepCollectionEquality().equals(other._skillPaths, _skillPaths)&&(identical(other.flutterChannel, flutterChannel) || other.flutterChannel == flutterChannel));
+ return identical(this, other) || (other.runtimeType == runtimeType&&other is _Variant&&(identical(other.name, name) || other.name == name)&&const DeepCollectionEquality().equals(other._files, _files)&&const DeepCollectionEquality().equals(other._mcpServers, _mcpServers)&&const DeepCollectionEquality().equals(other._skills, _skills)&&const DeepCollectionEquality().equals(other._taskParameters, _taskParameters));
 }
 @JsonKey(includeFromJson: false, includeToJson: false)
 @override
-int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(_contextFiles),const DeepCollectionEquality().hash(_mcpServers),const DeepCollectionEquality().hash(_skillPaths),flutterChannel);
+int get hashCode => Object.hash(runtimeType,name,const DeepCollectionEquality().hash(_files),const DeepCollectionEquality().hash(_mcpServers),const DeepCollectionEquality().hash(_skills),const DeepCollectionEquality().hash(_taskParameters));
 @override
 String toString() {
- return 'Variant(name: $name, contextFiles: $contextFiles, mcpServers: $mcpServers, skillPaths: $skillPaths, flutterChannel: $flutterChannel)';
+ return 'Variant(name: $name, files: $files, mcpServers: $mcpServers, skills: $skills, taskParameters: $taskParameters)';
 }
@@ -285,7 +290,7 @@ abstract mixin class _$VariantCopyWith<$Res> implements $VariantCopyWith<$Res> {
 factory _$VariantCopyWith(_Variant value, $Res Function(_Variant) _then) = __$VariantCopyWithImpl;
 @override @useResult
 $Res call({
- String name,@JsonKey(name: 'context_files') List contextFiles,@JsonKey(name: 'mcp_servers') List mcpServers,@JsonKey(name: 'skill_paths') List skillPaths,@JsonKey(name: 'flutter_channel') String? flutterChannel
+ String name,@JsonKey(name: 'files') List files,@JsonKey(name: 'mcp_servers') List> mcpServers,@JsonKey(name: 'skills') List skills,@JsonKey(name: 'task_parameters') Map taskParameters
 });
@@ -302,14 +307,14 @@ class __$VariantCopyWithImpl<$Res>
 /// Create a copy of Variant
 /// with the given fields replaced by the non-null parameter values.
-@override @pragma('vm:prefer-inline') $Res call({Object? name = null,Object? contextFiles = null,Object? mcpServers = null,Object? skillPaths = null,Object? flutterChannel = freezed,}) {
+@override @pragma('vm:prefer-inline') $Res call({Object? name = null,Object? files = null,Object? mcpServers = null,Object? skills = null,Object? taskParameters = null,}) {
 return _then(_Variant(
 name: null == name ? _self.name : name // ignore: cast_nullable_to_non_nullable
-as String,contextFiles: null == contextFiles ? _self._contextFiles : contextFiles // ignore: cast_nullable_to_non_nullable
+as String,files: null == files ? _self._files : files // ignore: cast_nullable_to_non_nullable
 as List,mcpServers: null == mcpServers ? _self._mcpServers : mcpServers // ignore: cast_nullable_to_non_nullable
-as List,skillPaths: null == skillPaths ? _self._skillPaths : skillPaths // ignore: cast_nullable_to_non_nullable
-as List,flutterChannel: freezed == flutterChannel ? _self.flutterChannel : flutterChannel // ignore: cast_nullable_to_non_nullable
-as String?,
+as List>,skills: null == skills ? _self._skills : skills // ignore: cast_nullable_to_non_nullable
+as List,taskParameters: null == taskParameters ?
_self._taskParameters : taskParameters // ignore: cast_nullable_to_non_nullable
+as Map,
 ));
 }
diff --git a/packages/dataset_config_dart/lib/src/models/variant.g.dart b/packages/dataset_config_dart/lib/src/models/variant.g.dart
index 3ed7ff4..35e3d0c 100644
--- a/packages/dataset_config_dart/lib/src/models/variant.g.dart
+++ b/packages/dataset_config_dart/lib/src/models/variant.g.dart
@@ -8,28 +8,26 @@ part of 'variant.dart';
 _Variant _$VariantFromJson(Map json) => _Variant(
 name: json['name'] as String? ?? 'baseline',
- contextFiles:
- (json['context_files'] as List?)
+ files:
+ (json['files'] as List?)
 ?.map((e) => ContextFile.fromJson(e as Map))
 .toList() ??
 const [],
 mcpServers:
 (json['mcp_servers'] as List?)
- ?.map((e) => e as String)
+ ?.map((e) => e as Map)
 .toList() ??
 const [],
- skillPaths:
- (json['skill_paths'] as List?)
- ?.map((e) => e as String)
- .toList() ??
+ skills:
+ (json['skills'] as List?)?.map((e) => e as String).toList() ??
 const [],
- flutterChannel: json['flutter_channel'] as String?,
+ taskParameters: json['task_parameters'] as Map? ?? const {},
 );
 Map _$VariantToJson(_Variant instance) => {
 'name': instance.name,
- 'context_files': instance.contextFiles.map((e) => e.toJson()).toList(),
+ 'files': instance.files,
 'mcp_servers': instance.mcpServers,
- 'skill_paths': instance.skillPaths,
- 'flutter_channel': instance.flutterChannel,
+ 'skills': instance.skills,
+ 'task_parameters': instance.taskParameters,
 };
diff --git a/packages/dataset_config_dart/lib/src/parsed_task.dart b/packages/dataset_config_dart/lib/src/parsed_task.dart
index 21ce5e3..40fa790 100644
--- a/packages/dataset_config_dart/lib/src/parsed_task.dart
+++ b/packages/dataset_config_dart/lib/src/parsed_task.dart
@@ -13,15 +13,23 @@ const kDefaultSystemMessage =
 /// former `TaskConfig` model-package class.
 class ParsedTask {
 final String id;
- final String taskFunc;
+ final String func;
 final List samples;
 final Variant variant;
 final String sandboxType;
 final String?
systemMessage;
- final List? allowedVariants;
 final bool saveExamples;
 final String? examplesDir;
+ /// Pass-through dict for sandbox plugin configuration.
+ final Map? sandboxParameters;
+
+ /// Task-level files to copy into sandbox.
+ final Map? taskFiles;
+
+ /// Task-level setup script.
+ final String? taskSetup;
+
 // ------------------------------------------------------------------
 // Task-level settings (from task.yaml)
 // ------------------------------------------------------------------
@@ -77,17 +85,27 @@ class ParsedTask {
 /// Additional metadata to associate with the task.
 final Map? metadata;
+ /// Dataset format: 'memory' (inline samples), 'json', or 'csv'.
+ final String datasetFormat;
+
+ /// File path or URL for json/csv datasets.
+ final String? datasetSource;
+
+ /// Extra kwargs passed to json_dataset() or csv_dataset().
+ final Map? datasetArgs;
+
 const ParsedTask({
 required this.id,
- required this.taskFunc,
+ required this.func,
 required this.samples,
 required this.variant,
 this.sandboxType = 'local',
 this.systemMessage,
- this.allowedVariants,
 this.saveExamples = false,
 this.examplesDir,
- // Task-level settings
+ this.sandboxParameters,
+ this.taskFiles,
+ this.taskSetup,
 this.model,
 this.config,
 this.modelRoles,
@@ -105,19 +123,24 @@ class ParsedTask {
 this.displayName,
 this.version,
 this.metadata,
+ this.datasetFormat = 'memory',
+ this.datasetSource,
+ this.datasetArgs,
 });
 /// Create a copy with overrides.
 ParsedTask copyWith({
 String? id,
- String? taskFunc,
+ String? func,
 List? samples,
 Variant? variant,
 String? sandboxType,
 String? systemMessage,
- List? allowedVariants,
 bool? saveExamples,
 String? examplesDir,
+ Map? sandboxParameters,
+ Map? taskFiles,
+ String? taskSetup,
 String? model,
 Map? config,
 Map? modelRoles,
@@ -138,14 +161,16 @@ class ParsedTask {
 }) {
 return ParsedTask(
 id: id ?? this.id,
- taskFunc: taskFunc ?? this.taskFunc,
+ func: func ?? this.func,
 samples: samples ?? this.samples,
 variant: variant ??
this.variant,
 sandboxType: sandboxType ?? this.sandboxType,
 systemMessage: systemMessage ?? this.systemMessage,
- allowedVariants: allowedVariants ?? this.allowedVariants,
 saveExamples: saveExamples ?? this.saveExamples,
 examplesDir: examplesDir ?? this.examplesDir,
+ sandboxParameters: sandboxParameters ?? this.sandboxParameters,
+ taskFiles: taskFiles ?? this.taskFiles,
+ taskSetup: taskSetup ?? this.taskSetup,
 model: model ?? this.model,
 config: config ?? this.config,
 modelRoles: modelRoles ?? this.modelRoles,
@@ -163,6 +188,9 @@ class ParsedTask {
 displayName: displayName ?? this.displayName,
 version: version ?? this.version,
 metadata: metadata ?? this.metadata,
+ datasetFormat: datasetFormat,
+ datasetSource: datasetSource,
+ datasetArgs: datasetArgs,
 );
 }
 }
diff --git a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
index 89d9668..74ad76d 100644
--- a/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
+++ b/packages/dataset_config_dart/lib/src/parsers/json_parser.dart
@@ -21,90 +21,118 @@ class JsonParser extends Parser {
 List parseTasksFromMaps(List> taskMaps) {
 return taskMaps.map((data) {
 final taskId = data['id'] as String;
- final taskFunc = (data['func'] as String?) ?? taskId;
+ final func = (data['func'] as String?) ?? taskId;
 final systemMessage = data['system_message'] as String?;
- final allowedVariants = (data['allowed_variants'] as List?)
- ?.cast();
- // Parse samples from inline data (no file I/O)
- final samplesRaw = data['samples'];
+ // Parse dataset section (matches YAML parser's dataset key structure)
+ final datasetRaw = data['dataset'];
 final samples = [];
- if (samplesRaw is Map) {
- final inlineDefs =
- (samplesRaw['inline'] as List?)?.cast>() ??
- const [];
- for (final def in inlineDefs) {
- if (def.isEmpty) continue;
-
- // Validate required fields
- for (final field in ['id', 'input', 'target']) {
- if (!def.containsKey(field)) {
- throw FormatException(
- "Sample '${def['id'] ?? 'unknown'}' missing required "
- "field: $field",
+ var datasetFormat = 'memory';
+ String? datasetSource;
+ Map? datasetArgs;
+
+ if (datasetRaw is Map) {
+ final datasetMap = Map.from(datasetRaw);
+
+ // Parse optional args
+ if (datasetMap['args'] is Map) {
+ datasetArgs = Map.from(datasetMap['args'] as Map);
+ }
+
+ if (datasetMap.containsKey('json')) {
+ datasetFormat = 'json';
+ datasetSource = datasetMap['json'].toString();
+ } else if (datasetMap.containsKey('csv')) {
+ datasetFormat = 'csv';
+ datasetSource = datasetMap['csv'].toString();
+ } else if (datasetMap.containsKey('samples')) {
+ // Inline samples — same as before
+ final samplesSection = datasetMap['samples'];
+ if (samplesSection is Map) {
+ final inlineDefs =
+ (samplesSection['inline'] as List?)
+ ?.cast>() ??
+ const [];
+ for (final def in inlineDefs) {
+ if (def.isEmpty) continue;
+
+ // Validate required fields
+ for (final field in ['id', 'input', 'target']) {
+ if (!def.containsKey(field)) {
+ throw FormatException(
+ "Sample '${def['id'] ?? 'unknown'}' missing required "
+ "field: $field",
+ );
+ }
+ }
+
+ // Read metadata from the metadata dict
+ final metaRaw = Map.from(
+ def['metadata'] as Map? ??
{},
 );
- }
- }
- // Normalize tags
- final rawTags = def['tags'];
- final List tags;
- if (rawTags is String) {
- tags = rawTags.split(',').map((t) => t.trim()).toList();
- } else if (rawTags is List) {
- tags = rawTags.cast();
- } else {
- tags = [];
- }
+ // Normalize tags from metadata
+ final rawTags = metaRaw['tags'];
+ final List tags;
+ if (rawTags is String) {
+ tags = rawTags.split(',').map((t) => t.trim()).toList();
+ } else if (rawTags is List) {
+ tags = rawTags.cast();
+ } else {
+ tags = [];
+ }
- // Parse sample-level fields
- final choices = (def['choices'] as List?)?.cast();
- final sampleSandbox = def['sandbox'];
- final setup = def['setup'] as String?;
- final files = def['files'] is Map
- ? Map.from(def['files'] as Map)
- : null;
-
- samples.add(
- Sample(
- id: def['id'] as String,
- input: def['input'] as String,
- target: def['target'] as String,
- metadata: {
- ...Map.from(
- def['metadata'] as Map? ?? {},
+ // Parse sample-level fields
+ final choices = (def['choices'] as List?)?.cast();
+ final sampleSandbox = def['sandbox'];
+ final setup = def['setup'] as String?;
+ final files = def['files'] is Map
+ ? Map.from(def['files'] as Map)
+ : null;
+
+ samples.add(
+ Sample(
+ id: def['id'] as String,
+ input: def['input'] as String,
+ target: def['target'] as String,
+ metadata: {
+ ...metaRaw,
+ 'difficulty': metaRaw['difficulty'] as String? ?? 'medium',
+ 'tags': tags,
+ },
+ choices: choices,
+ sandbox: sampleSandbox,
+ setup: setup,
+ files: files,
 ),
- 'difficulty': def['difficulty'] as String? ?? 'medium',
- 'tags': tags,
- },
- choices: choices,
- sandbox: sampleSandbox,
- setup: setup,
- files: files,
- ),
- );
+ );
+ }
+ }
 }
 }
- // Parse Task-level settings
- final model = data['model'] as String?;
- final config = data['config'] is Map
- ? Map.from(data['config'] as Map)
+ // Task-level Inspect AI args from inspect_task_args
+ final taskArgs = data['inspect_task_args'] is Map
+ ?
Map.from(data['inspect_task_args'] as Map)
+ : {};
+ final model = taskArgs['model'] as String?;
+ final config = taskArgs['config'] is Map
+ ? Map.from(taskArgs['config'] as Map)
 : null;
- final modelRoles = data['model_roles'] is Map
- ? Map.from(data['model_roles'] as Map)
+ final modelRoles = taskArgs['model_roles'] is Map
+ ? Map.from(taskArgs['model_roles'] as Map)
 : null;
- final sandbox = data['sandbox'];
- final approval = data['approval'];
- final epochs = data['epochs'];
- final failOnError = data['fail_on_error'];
- final continueOnFail = data['continue_on_fail'] as bool?;
- final messageLimit = data['message_limit'] as int?;
- final tokenLimit = data['token_limit'] as int?;
- final timeLimit = data['time_limit'] as int?;
- final workingLimit = data['working_limit'] as int?;
- final costLimit = (data['cost_limit'] as num?)?.toDouble();
- final earlyStopping = data['early_stopping'];
+ final sandbox = taskArgs['sandbox'];
+ final approval = taskArgs['approval'];
+ final epochs = taskArgs['epochs'];
+ final failOnError = taskArgs['fail_on_error'];
+ final continueOnFail = taskArgs['continue_on_fail'] as bool?;
+ final messageLimit = taskArgs['message_limit'] as int?;
+ final tokenLimit = taskArgs['token_limit'] as int?;
+ final timeLimit = taskArgs['time_limit'] as int?;
+ final workingLimit = taskArgs['working_limit'] as int?;
+ final costLimit = (taskArgs['cost_limit'] as num?)?.toDouble();
+ final earlyStopping = taskArgs['early_stopping'];
 final displayName = data['display_name'] as String?;
 final version = data['version'];
 final taskMetadata = data['metadata'] is Map
@@ -113,11 +141,10 @@ class JsonParser extends Parser {
 return ParsedTask(
 id: taskId,
- taskFunc: taskFunc,
+ func: func,
 variant: const Variant(),
 samples: samples,
 systemMessage: systemMessage,
- allowedVariants: allowedVariants,
 // Task-level settings
 model: model,
 config: config,
@@ -136,6 +163,9 @@ class JsonParser extends Parser {
 displayName: displayName,
 version: version,
 metadata: taskMetadata,
+ datasetFormat: datasetFormat,
+ datasetSource: datasetSource,
+ datasetArgs: datasetArgs,
 );
 }).toList();
 }
@@ -152,76 +182,33 @@ class JsonParser extends Parser {
 /// Parse a job from a pre-parsed map.
 Job parseJobFromMap(Map data) {
+ // Parse sandbox config
+ Map? sandbox;
+ final sandboxRaw = data['sandbox'];
+ if (sandboxRaw is Map) {
+ sandbox = Map.from(sandboxRaw);
+ } else if (sandboxRaw is String) {
+ sandbox = {'environment': sandboxRaw};
+ }
+
+ // Parse models (required)
+ final modelsRaw = data['models'] as List?;
+ if (modelsRaw == null || modelsRaw.isEmpty) {
+ throw FormatException(
+ "Job data is missing required 'models' field. "
+ 'Specify at least one model.',
+ );
+ }
+ final models = modelsRaw.cast();
+
 return Job(
 logDir: (data['log_dir'] as String?) ?? '',
- sandboxType: (data['sandbox_type'] as String?) ?? 'local',
 maxConnections: (data['max_connections'] as int?) ?? 10,
- models: (data['models'] as List?)?.cast(),
+ models: models,
 saveExamples: data['save_examples'] == true,
- // Promoted eval_set() fields
- retryAttempts: data['retry_attempts'] as int?,
- maxRetries: data['max_retries'] as int?,
- retryWait: (data['retry_wait'] as num?)?.toDouble(),
- retryConnections: (data['retry_connections'] as num?)?.toDouble(),
- retryCleanup: data['retry_cleanup'] as bool?,
- failOnError: (data['fail_on_error'] as num?)?.toDouble(),
- continueOnFail: data['continue_on_fail'] as bool?,
- retryOnError: data['retry_on_error'] as int?,
- debugErrors: data['debug_errors'] as bool?,
- maxSamples: data['max_samples'] as int?,
- maxTasks: data['max_tasks'] as int?,
- maxSubprocesses: data['max_subprocesses'] as int?,
- maxSandboxes: data['max_sandboxes'] as int?,
- logLevel: data['log_level'] as String?,
- logLevelTranscript: data['log_level_transcript'] as String?,
- logFormat: data['log_format'] as String?,
- tags: (data['tags'] as List?)?.cast(),
- metadata: data['metadata'] is Map
- ?
Map.from(data['metadata'] as Map)
- : null,
- trace: data['trace'] as bool?,
- display: data['display'] as String?,
- score: data['score'] as bool?,
- limit: data['limit'],
- sampleId: data['sample_id'],
- sampleShuffle: data['sample_shuffle'],
- epochs: data['epochs'],
- approval: data['approval'],
- solver: data['solver'],
- sandboxCleanup: data['sandbox_cleanup'] as bool?,
- modelBaseUrl: data['model_base_url'] as String?,
- modelArgs: data['model_args'] is Map
- ? Map.from(data['model_args'] as Map)
- : null,
- modelRoles: data['model_roles'] is Map
- ? Map.from(data['model_roles'] as Map)
- : null,
- taskArgs: data['task_args'] is Map
- ? Map.from(data['task_args'] as Map)
- : null,
- messageLimit: data['message_limit'] as int?,
- tokenLimit: data['token_limit'] as int?,
- timeLimit: data['time_limit'] as int?,
- workingLimit: data['working_limit'] as int?,
- costLimit: (data['cost_limit'] as num?)?.toDouble(),
- modelCostConfig: data['model_cost_config'] is Map
- ? Map.from(data['model_cost_config'] as Map)
- : null,
- logSamples: data['log_samples'] as bool?,
- logRealtime: data['log_realtime'] as bool?,
- logImages: data['log_images'] as bool?,
- logBuffer: data['log_buffer'] as int?,
- logShared: data['log_shared'] as int?,
- bundleDir: data['bundle_dir'] as String?,
- bundleOverwrite: data['bundle_overwrite'] as bool?,
- logDirAllowDirty: data['log_dir_allow_dirty'] as bool?,
- evalSetId: data['eval_set_id'] as String?,
- // Pass-through sections
- evalSetOverrides: data['eval_set_overrides'] is Map
- ? Map.from(data['eval_set_overrides'] as Map)
- : null,
- taskDefaults: data['task_defaults'] is Map
- ? Map.from(data['task_defaults'] as Map)
+ sandbox: sandbox,
+ inspectEvalArguments: data['inspect_eval_arguments'] is Map
+ ?
Map.from(data['inspect_eval_arguments'] as Map)
 : null,
 );
 }
diff --git a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
index 3ea236c..a8d2e33 100644
--- a/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
+++ b/packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
@@ -50,63 +50,114 @@ class YamlParser extends Parser {
 final taskDir = p.dirname(taskPath);
 final taskId = (data['id'] as String?) ?? p.basename(taskDir);
- final taskFunc = (data['func'] as String?) ?? taskId;
+ final func = (data['func'] as String?) ?? taskId;
- final taskWorkspaceRaw = data['workspace'];
- final taskTestsRaw = data['tests'];
 final systemMessage = data['system_message'] as String?;
- // Pre-resolve task-level paths to absolute
- final taskWorkspace = _preResolveToAbs(taskWorkspaceRaw, taskDir);
- final taskTests = _preResolveToAbs(taskTestsRaw, taskDir);
+ // Parse task-level files and setup
+ final taskFiles = _asStringMap(data['files']);
+ final taskSetup = data['setup'] as String?;
- // Optional whitelist of variant names
- final allowedVariants = (data['allowed_variants'] as List?)?.cast();
+ // Parse dataset section (replaces the old top-level 'samples' key)
+ final datasetRaw = data['dataset'];
+ var datasetFormat = 'memory';
+ String? datasetSource;
+ Map? datasetArgs;
+ List samples;
- // Parse samples section
- final samplesRaw = data['samples'];
- if (samplesRaw is! Map) {
+ if (datasetRaw == null) {
+ samples = [];
+ } else if (datasetRaw is!
Map) {
 throw FormatException(
- "Task '$taskId': 'samples' must be a dict with 'inline' and/or "
- "'paths' keys, got ${samplesRaw.runtimeType}",
+ "Task '$taskId': 'dataset' must be a dict with one of "
+ "'samples', 'json', or 'csv' keys, got ${datasetRaw.runtimeType}",
 );
+ } else {
+ final datasetMap = Map.from(datasetRaw);
+ final formatKeys = {'samples', 'json', 'csv'};
+ final presentKeys =
+ formatKeys.intersection(datasetMap.keys.toSet().cast());
+ if (presentKeys.length > 1) {
+ throw FormatException(
+ "Task '$taskId': 'dataset' must have exactly one of "
+ "'samples', 'json', or 'csv', found: $presentKeys",
+ );
+ }
+
+ // Parse optional args
+ final argsRaw = datasetMap['args'];
+ if (argsRaw != null) {
+ if (argsRaw is! Map) {
+ throw FormatException(
+ "Task '$taskId': 'dataset.args' must be a dict, "
+ 'got ${argsRaw.runtimeType}',
+ );
+ }
+ datasetArgs = Map.from(argsRaw);
+ }
+
+ if (datasetMap.containsKey('samples')) {
+ // Inline/path-based samples (existing MemoryDataset behavior)
+ final samplesSection = datasetMap['samples'];
+ if (samplesSection is!
Map) {
+ throw FormatException(
+ "Task '$taskId': 'dataset.samples' must be a dict with "
+ "'inline' and/or 'paths' keys, got ${samplesSection.runtimeType}",
+ );
+ }
+ samples = _loadSamplesSection(
+ Map.from(samplesSection),
+ datasetRoot,
+ taskFiles,
+ taskDir,
+ );
+ } else if (datasetMap.containsKey('json')) {
+ datasetFormat = 'json';
+ datasetSource = datasetMap['json'].toString();
+ samples = [];
+ } else if (datasetMap.containsKey('csv')) {
+ datasetFormat = 'csv';
+ datasetSource = datasetMap['csv'].toString();
+ samples = [];
+ } else {
+ samples = [];
+ }
 }
- final samplesMap = Map.from(samplesRaw);
- final samples = _loadSamplesSection(
- samplesMap,
- datasetRoot,
- taskWorkspace,
- taskTests,
- taskDir,
- );
- // Parse Task-level settings
- final model = data['model'] as String?;
- final config = _asMap(data['config']);
- final modelRoles = _asStringMap(data['model_roles']);
- final sandbox = data['sandbox'];
- final approval = data['approval'];
- final epochs = data['epochs'];
- final failOnError = data['fail_on_error'];
- final continueOnFail = data['continue_on_fail'] as bool?;
- final messageLimit = data['message_limit'] as int?;
- final tokenLimit = data['token_limit'] as int?;
- final timeLimit = data['time_limit'] as int?;
- final workingLimit = data['working_limit'] as int?;
- final costLimit = (data['cost_limit'] as num?)?.toDouble();
- final earlyStopping = data['early_stopping'];
+ // Task-level Inspect AI args are nested under inspect_task_args
+ final taskArgs = _asMap(data['inspect_task_args']) ??
{};
+ final model = taskArgs['model'] as String?;
+ final config = _asMap(taskArgs['config']);
+ final modelRoles = _asStringMap(taskArgs['model_roles']);
+ final sandbox = taskArgs['sandbox'];
+ final approval = taskArgs['approval'];
+ final epochs = taskArgs['epochs'];
+ final failOnError = taskArgs['fail_on_error'];
+ final continueOnFail = taskArgs['continue_on_fail'] as bool?;
+ final messageLimit = taskArgs['message_limit'] as int?;
+ final tokenLimit = taskArgs['token_limit'] as int?;
+ final timeLimit = taskArgs['time_limit'] as int?;
+ final workingLimit = taskArgs['working_limit'] as int?;
+ final costLimit = (taskArgs['cost_limit'] as num?)?.toDouble();
+ final earlyStopping = taskArgs['early_stopping'];
 final displayName = data['display_name'] as String?;
 final version = data['version'];
 final taskMetadata = _asMap(data['metadata']);
+ final sandboxParameters = _asMap(data['sandbox_parameters']);
 return [
 ParsedTask(
 id: taskId,
- taskFunc: taskFunc,
+ func: func,
 variant: const Variant(), // placeholder baseline
 samples: samples,
 systemMessage: systemMessage,
- allowedVariants: allowedVariants,
+ sandboxParameters: sandboxParameters,
+ taskFiles: taskFiles,
+ taskSetup: taskSetup,
+ datasetFormat: datasetFormat,
+ datasetSource: datasetSource,
+ datasetArgs: datasetArgs,
 // Task-level settings
 model: model,
 config: config,
@@ -137,8 +188,7 @@ class YamlParser extends Parser {
 List _loadSamplesSection(
 Map samplesMap,
 String datasetRoot,
- Object? taskWorkspace,
- Object? taskTests,
+ Map?
taskFiles,
 String taskDir,
 ) {
 final pathPatterns =
@@ -169,8 +219,7 @@ class YamlParser extends Parser {
 _loadSamplesFromFiles(
 matchedFiles,
 datasetRoot,
- taskWorkspace,
- taskTests,
+ taskFiles,
 ),
 );
 }
@@ -179,7 +228,7 @@ class YamlParser extends Parser {
 for (final def in inlineDefs) {
 if (def.isEmpty) continue;
 samples.add(
- _resolveSample(def, taskDir, datasetRoot, taskWorkspace, taskTests),
+ _resolveSample(def, taskDir, datasetRoot, taskFiles),
 );
 }
@@ -190,8 +239,7 @@ class YamlParser extends Parser {
 List _loadSamplesFromFiles(
 List sampleFiles,
 String datasetRoot,
- Object? taskWorkspace,
- Object? taskTests,
+ Map? taskFiles,
 ) {
 final samples = [];
@@ -216,8 +264,7 @@ class YamlParser extends Parser {
 data,
 sampleDir,
 datasetRoot,
- taskWorkspace,
- taskTests,
+ taskFiles,
 ),
 );
 }
@@ -238,8 +285,7 @@ class YamlParser extends Parser {
 Map doc,
 String baseDir,
 String datasetRoot,
- Object? taskWorkspace,
- Object? taskTests,
+ Map? taskFiles,
 ) {
 // --- Validate required fields ---
 for (final field in ['id', 'input', 'target']) {
@@ -250,35 +296,11 @@ class YamlParser extends Parser {
 }
 }
- final sampleWorkspace = doc['workspace'];
- final sampleTests = doc['tests'];
-
- // Sample-level overrides task-level
- final effectiveWorkspace = sampleWorkspace ?? taskWorkspace;
-
- String? workspace;
- String? workspaceGit;
- String? workspaceGitRef;
-
- if (effectiveWorkspace != null) {
- if (effectiveWorkspace is Map && effectiveWorkspace.containsKey('git')) {
- workspaceGit = effectiveWorkspace['git'] as String?;
- workspaceGitRef = effectiveWorkspace['ref'] as String?;
- } else {
- final resolveDir = sampleWorkspace != null ? baseDir : datasetRoot;
- workspace = _resolveResourcePath(effectiveWorkspace, resolveDir);
- }
- }
-
- String?
tests;
- if (sampleTests != null) {
- tests = _resolveResourcePath(sampleTests, baseDir);
- } else if (taskTests != null) {
- tests = _resolveResourcePath(taskTests, datasetRoot);
- }
+ // Read metadata fields from the metadata dict
+ final metaRaw = Map.from(doc['metadata'] as Map? ?? {});
- // --- Normalize tags ---
- final rawTags = doc['tags'];
+ // --- Normalize tags from metadata ---
+ final rawTags = metaRaw['tags'];
 final List tags;
 if (rawTags is String) {
 tags = rawTags.split(',').map((t) => t.trim()).toList();
@@ -290,20 +312,22 @@ class YamlParser extends Parser {
 // Build metadata with domain-specific fields
 final metadata = {
- ...Map.from(doc['metadata'] as Map? ?? {}),
- 'difficulty': doc['difficulty'] as String? ?? 'medium',
+ ...metaRaw,
+ 'difficulty': metaRaw['difficulty'] as String? ?? 'medium',
 'tags': tags,
- 'workspace': ?workspace,
- 'tests': ?tests,
- 'workspace_git': ?workspaceGit,
- 'workspace_git_ref': ?workspaceGitRef,
 };
 // Parse sample-level fields
 final choices = (doc['choices'] as List?)?.cast();
 final sampleSandbox = doc['sandbox'];
 final setup = doc['setup'] as String?;
- final files = _asStringMap(doc['files']);
+ final sampleFiles = _asStringMap(doc['files']);
+
+ // Stack files: task-level files + sample-level files (sample wins on conflict)
+ Map? mergedFiles;
+ if (taskFiles != null || sampleFiles != null) {
+ mergedFiles = {...?taskFiles, ...?sampleFiles};
+ }
 return Sample(
 id: doc['id'] as String,
@@ -312,7 +336,7 @@ class YamlParser extends Parser {
 metadata: metadata,
 choices: choices,
 sandbox: sampleSandbox,
- files: files,
+ files: mergedFiles,
 setup: setup,
 );
 }
@@ -330,7 +354,6 @@ class YamlParser extends Parser {
 final data = readYamlFileAsMap(jobPath);
 final logsDir = (data['logs_dir'] as String?) ?? _kDefaultLogsDir;
- final sandboxType = (data['sandbox_type'] as String?) ?? 'local';
 final maxConnections = (data['max_connections'] as int?) ??
10; // Resolve log directory with timestamp @@ -370,75 +393,66 @@ class YamlParser extends Parser { } } + // Parse tag filters + final taskFiltersRaw = data['task_filters']; + final sampleFiltersRaw = data['sample_filters']; + final TagFilter? taskFilters = taskFiltersRaw is Map + ? TagFilter.fromJson(Map.from(taskFiltersRaw)) + : null; + final TagFilter? sampleFilters = sampleFiltersRaw is Map + ? TagFilter.fromJson(Map.from(sampleFiltersRaw)) + : null; + + // Parse models (required) + final modelsRaw = data['models'] as List?; + if (modelsRaw == null || modelsRaw.isEmpty) { + throw FormatException( + "Job file '$jobPath' is missing required 'models' field. " + "Specify at least one model, e.g.:\n" + ' models:\n - google/gemini-2.5-flash', + ); + } + final models = modelsRaw.cast(); + return Job( logDir: logDir, - sandboxType: sandboxType, maxConnections: maxConnections, - models: (data['models'] as List?)?.cast(), + description: data['description'] as String?, + models: models, variants: variants, taskPaths: taskPaths, tasks: tasks, + taskFilters: taskFilters, + sampleFilters: sampleFilters, saveExamples: data['save_examples'] == true, - // Promoted eval_set() fields - retryAttempts: data['retry_attempts'] as int?, - maxRetries: data['max_retries'] as int?, - retryWait: (data['retry_wait'] as num?)?.toDouble(), - retryConnections: (data['retry_connections'] as num?)?.toDouble(), - retryCleanup: data['retry_cleanup'] as bool?, - failOnError: (data['fail_on_error'] as num?)?.toDouble(), - continueOnFail: data['continue_on_fail'] as bool?, - retryOnError: data['retry_on_error'] as int?, - debugErrors: data['debug_errors'] as bool?, - maxSamples: data['max_samples'] as int?, - maxTasks: data['max_tasks'] as int?, - maxSubprocesses: data['max_subprocesses'] as int?, - maxSandboxes: data['max_sandboxes'] as int?, - logLevel: data['log_level'] as String?, - logLevelTranscript: data['log_level_transcript'] as String?, - logFormat: data['log_format'] as String?, - tags: 
(data['tags'] as List?)?.cast(), - metadata: _asMap(data['metadata']), - trace: data['trace'] as bool?, - display: data['display'] as String?, - score: data['score'] as bool?, - limit: data['limit'], - sampleId: data['sample_id'], - sampleShuffle: data['sample_shuffle'], - epochs: data['epochs'], - approval: data['approval'], - solver: data['solver'], - sandboxCleanup: data['sandbox_cleanup'] as bool?, - modelBaseUrl: data['model_base_url'] as String?, - modelArgs: _asObjectMap(data['model_args']), - modelRoles: _asStringMap(data['model_roles']), - taskArgs: _asObjectMap(data['task_args']), - messageLimit: data['message_limit'] as int?, - tokenLimit: data['token_limit'] as int?, - timeLimit: data['time_limit'] as int?, - workingLimit: data['working_limit'] as int?, - costLimit: (data['cost_limit'] as num?)?.toDouble(), - modelCostConfig: _asObjectMap(data['model_cost_config']), - logSamples: data['log_samples'] as bool?, - logRealtime: data['log_realtime'] as bool?, - logImages: data['log_images'] as bool?, - logBuffer: data['log_buffer'] as int?, - logShared: data['log_shared'] as int?, - bundleDir: data['bundle_dir'] as String?, - bundleOverwrite: data['bundle_overwrite'] as bool?, - logDirAllowDirty: data['log_dir_allow_dirty'] as bool?, - evalSetId: data['eval_set_id'] as String?, - // Pass-through sections - evalSetOverrides: _asMap(data['eval_set_overrides']), - taskDefaults: _asMap(data['task_defaults']), + // Sandbox configuration + sandbox: _parseSandbox(data['sandbox']), + // All inspect eval arguments + inspectEvalArguments: _asMap(data['inspect_eval_arguments']), ); } + /// Parse sandbox config from YAML value. + /// + /// Supports both string shorthand ('podman') and map form. + static Map? _parseSandbox(Object? value) { + if (value is Map) { + return Map.from(value); + } else if (value is String) { + return {'environment': value}; + } + return null; + } + /// Create a [Job] with default settings (when no job file is provided). 
+ /// + /// Note: The caller must specify models, as there are no defaults. + /// This method creates a job with an empty models list; the resolver + /// will raise an error if models is empty at resolution time. Job createDefaultJob(String baseDir) { return Job( logDir: _resolveLogDir(_kDefaultLogsDir, baseDir), - sandboxType: 'local', - maxConnections: 10, + models: [], ); } @@ -458,64 +472,8 @@ class YamlParser extends Parser { return null; } - /// Safely cast a YAML value to `Map?`. - static Map? _asObjectMap(Object? value) { - if (value is Map) return Map.from(value); - return null; - } - - // ------------------------------------------------------------------ - // Path resolution helpers - // ------------------------------------------------------------------ - /// Pre-resolve a task-level resource to an absolute path. - Object? _preResolveToAbs(Object? resource, String taskDir) { - if (resource == null) return null; - - if (resource is String) { - if (resource.startsWith('./') || - resource.startsWith('../') || - resource.startsWith('/')) { - return {'path': p.normalize(p.join(taskDir, resource))}; - } - return resource; - } - if (resource is Map) { - if (resource.containsKey('path')) { - final pathVal = resource['path'] as String; - return { - ...resource, - 'path': p.normalize(p.join(taskDir, pathVal)), - }; - } - return resource; - } - - return resource; - } - - /// Resolve a workspace/tests resource reference to an absolute path string. - String? _resolveResourcePath(Object? 
resource, String baseDir) { - if (resource == null) return null; - - if (resource is String) { - if (resource.startsWith('./') || - resource.startsWith('../') || - resource.startsWith('/')) { - return p.normalize(p.join(baseDir, resource)); - } - return null; - } - - if (resource is Map) { - if (resource.containsKey('path')) { - return p.normalize(p.join(baseDir, resource['path'] as String)); - } - } - - return null; - } // ------------------------------------------------------------------ // Log dir helpers diff --git a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart index d308d68..ec9b36c 100644 --- a/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart +++ b/packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart @@ -7,22 +7,12 @@ import 'package:path/path.dart' as p; import '../parsed_task.dart'; -/// Default models used when a job doesn't specify its own. -const List kDefaultModels = [ - 'anthropic/claude-haiku-4-5', - 'anthropic/claude-sonnet-4-5', - 'anthropic/claude-opus-4-6', - 'google/gemini-2.5-flash', - 'google/gemini-3-pro-preview', - 'google/gemini-3-flash-preview', - 'openai/gpt-5-mini', - 'openai/gpt-5-nano', - 'openai/gpt-5', - 'openai/gpt-5-pro', -]; - -/// Available sandbox configurations. -const Map> kSandboxRegistry = { + + +/// Default sandbox configurations for Flutter evaluations. +/// +/// Consumers can pass these to [EvalSetResolver] or provide their own. +const Map> kDefaultSandboxRegistry = { 'podman': {'name': 'podman', 'path': './sandboxes/podman/compose.yaml'}, 'podman-beta': { 'name': 'podman', @@ -34,12 +24,7 @@ const Map> kSandboxRegistry = { }, }; -/// Maps Flutter SDK channel names to sandbox registry keys. 
-const Map kSdkChannels = { - 'stable': 'podman', - 'beta': 'podman-beta', - 'main': 'podman-main', -}; + /// Resolves parsed task configs and job into fully-resolved /// [EvalSet] objects ready for JSON serialization. @@ -47,19 +32,36 @@ const Map kSdkChannels = { /// This is the resolution engine. It: /// 1. Resolves models, sandboxes, and variants /// 2. Expands task × variant combinations into [Task] entries -/// 3. Groups by flutter_channel (one [EvalSet] per group) -/// 4. Propagates job-level and task-level settings to the output +/// 3. Propagates job-level and task-level settings to the output class EvalSetResolver { - /// Resolve task configs and job into [EvalSet] objects. + /// Creates a resolver with optional sandbox configuration. /// - /// Groups by flutter_channel so each gets its own sandbox. + /// If [sandboxRegistry] is not provided, it defaults to an empty map + /// (no sandbox resolution). Pass [kDefaultSandboxRegistry] for the + /// Flutter-specific sandbox setup. + const EvalSetResolver({ + this.sandboxRegistry = const {}, + }); + + /// Named sandbox configurations (e.g. `'podman'` → compose file path). + final Map> sandboxRegistry; + + /// Resolve task configs and job into [EvalSet] objects. List resolve( List datasetTasks, Job job, String datasetRoot, ) { - final models = _resolveModels(job); - final sandboxTypeStr = job.sandboxType; + if (job.models.isEmpty) { + throw ArgumentError( + 'job.models is required and must contain at least one model. ' + 'Specify models in your job YAML, e.g.:\n' + ' models:\n - google/gemini-2.5-flash', + ); + } + final models = job.models; + final sandboxCfg = job.sandbox ?? {}; + final sandboxTypeStr = (sandboxCfg['environment'] as String?) ?? 
'local'; final expandedTasks = _expandTaskConfigs( datasetTasks, job, @@ -67,26 +69,16 @@ class EvalSetResolver { datasetRoot, ); - // Group by flutter channel - final groups = >{}; - for (final tc in expandedTasks) { - final key = tc.variant.flutterChannel; - (groups[key] ??= []).add(tc); - } + final sandbox = _resolveSandbox(datasetRoot, job); return [ - for (final entry in groups.entries) - _buildEvalSet( - taskConfigs: entry.value, - logDir: job.logDir, - models: models, - sandbox: _resolveSandbox( - datasetRoot, - job, - flutterChannel: entry.key, - ), - job: job, - ), + _buildEvalSet( + taskConfigs: expandedTasks, + logDir: job.logDir, + models: models, + sandbox: sandbox, + job: job, + ), ]; } @@ -106,11 +98,12 @@ class EvalSetResolver { required Job job, }) { final inspectTasks = []; - final isContainer = - job.sandboxType.isNotEmpty && job.sandboxType != 'local'; + final sandboxCfg = job.sandbox ?? {}; + final sandboxTypeStr = (sandboxCfg['environment'] as String?) ?? 'local'; - // Parse task_defaults from the job - final taskDefaults = job.taskDefaults ?? {}; + // Parse task_defaults from inspect_eval_arguments + final evalArgs = job.inspectEvalArguments ?? {}; + final taskDefaults = (evalArgs['task_defaults'] as Map?) ?? {}; for (final tc in taskConfigs) { // Enrich each sample with task-level metadata @@ -126,26 +119,15 @@ class EvalSetResolver { } } - // Build files + setup for sandbox provisioning - Map? files = sample.files; - String? setup = sample.setup; - final workspace = sample.metadata?['workspace'] as String?; - final workspaceGit = sample.metadata?['workspace_git'] as String?; - final workspaceGitRef = - sample.metadata?['workspace_git_ref'] as String?; - - if (workspace != null && isContainer) { - files = {...?files, '/workspace': workspace}; - setup = setup ?? 
'cd /workspace && flutter pub get'; - enriched['workspace'] = '/workspace'; - } - if (workspaceGit != null) { - enriched['workspace_git'] = workspaceGit; - if (workspaceGitRef != null) { - enriched['workspace_git_ref'] = workspaceGitRef; - } + // Stack files: task-level + sample-level (sample wins on conflict) + Map? files; + if (tc.taskFiles != null || sample.files != null) { + files = {...?tc.taskFiles, ...?sample.files}; } + // Setup: sample overrides task + final setup = sample.setup ?? tc.taskSetup; + inspectSamples.add( Sample( id: sample.id, @@ -163,35 +145,42 @@ class EvalSetResolver { final dataset = Dataset( samples: inspectSamples, name: '${tc.id}:${tc.variant.name}', + format: tc.datasetFormat, + source: tc.datasetSource, + args: tc.datasetArgs, ); // Build task metadata (variant config, system message, etc.) final metadata = { 'variant': tc.variant.name, - if (tc.variant.contextFiles.isNotEmpty) - 'variant_config': { - 'context_files': tc.variant.contextFiles - .map( - (cf) => { - 'title': cf.metadata.title, - 'version': cf.metadata.version, - 'content': cf.content, - }, - ) - .toList(), - 'mcp_servers': tc.variant.mcpServers, - 'skill_paths': tc.variant.skillPaths, - }, - if (tc.variant.contextFiles.isEmpty && - (tc.variant.mcpServers.isNotEmpty || - tc.variant.skillPaths.isNotEmpty)) + if (tc.variant.files.isNotEmpty || + tc.variant.mcpServers.isNotEmpty || + tc.variant.skills.isNotEmpty || + tc.variant.taskParameters.isNotEmpty) 'variant_config': { - 'mcp_servers': tc.variant.mcpServers, - 'skill_paths': tc.variant.skillPaths, + if (tc.variant.files.isNotEmpty) + 'files': tc.variant.files + .map( + (cf) => { + 'title': cf.metadata.title, + 'version': cf.metadata.version, + 'content': cf.content, + }, + ) + .toList(), + if (tc.variant.mcpServers.isNotEmpty) + 'mcp_servers': tc.variant.mcpServers, + if (tc.variant.skills.isNotEmpty) + 'skills': tc.variant.skills, + if (tc.variant.taskParameters.isNotEmpty) + 'task_parameters': 
tc.variant.taskParameters, }, if (tc.systemMessage != null) 'system_message': tc.systemMessage, if (tc.saveExamples) 'save_examples': true, if (tc.examplesDir != null) 'examples_dir': tc.examplesDir, + // Propagate image_prefix from sandbox for container image resolution + if (sandboxCfg['image_prefix'] != null) + 'image_prefix': sandboxCfg['image_prefix'], // Merge any task-level metadata from YAML ...?tc.metadata, }; @@ -201,7 +190,7 @@ class EvalSetResolver { if (tc.sandbox != null) { // Task-level sandbox override taskSandbox = tc.sandbox; - } else if (tc.sandboxType.isNotEmpty && tc.sandboxType != 'local') { + } else if (sandboxTypeStr != 'local') { taskSandbox = _serializeSandbox(sandbox); } @@ -210,7 +199,7 @@ class EvalSetResolver { final resolvedTimeLimit = tc.timeLimit ?? taskDefaults['time_limit'] as int? ?? - (job.sandboxType != 'local' ? 300 : null); + (sandboxTypeStr != 'local' ? 300 : null); final resolvedMessageLimit = tc.messageLimit ?? taskDefaults['message_limit'] as int?; final resolvedTokenLimit = @@ -239,10 +228,12 @@ class EvalSetResolver { inspectTasks.add( Task( name: '${tc.id}:${tc.variant.name}', - taskFunc: tc.taskFunc, + func: tc.func, dataset: dataset, sandbox: taskSandbox, metadata: metadata, + systemMessage: tc.systemMessage, + sandboxParameters: tc.sandboxParameters, model: resolvedModel, config: resolvedConfig, modelRoles: resolvedModelRoles, @@ -262,9 +253,17 @@ class EvalSetResolver { ); } - // Build the EvalSet with all job-level parameters. - // Start with any eval_set_overrides, then apply explicit fields. - final overrides = job.evalSetOverrides ?? {}; + // Build the EvalSet with all job-level parameters from inspect_eval_arguments. + final evalSetOverrides = (evalArgs['eval_set_overrides'] as Map?) ?? {}; + + // Helper to get a value from evalArgs then overrides + T? getArg(String key, [T? 
defaultVal]) { + final v = evalArgs[key] as T?; + if (v != null) return v; + final o = evalSetOverrides[key] as T?; + if (o != null) return o; + return defaultVal; + } return EvalSet( tasks: inspectTasks, @@ -272,102 +271,77 @@ class EvalSetResolver { model: models, sandbox: _serializeSandbox(sandbox), // Retry settings - retryAttempts: - job.retryAttempts ?? overrides['retry_attempts'] as int? ?? 10, - retryWait: - job.retryWait ?? (overrides['retry_wait'] as num?)?.toDouble() ?? 60, + retryAttempts: getArg('retry_attempts', 10), + retryWait: (getArg('retry_wait', 60))?.toDouble() ?? 60, retryConnections: - job.retryConnections ?? - (overrides['retry_connections'] as num?)?.toDouble() ?? - 0.5, - retryCleanup: job.retryCleanup ?? overrides['retry_cleanup'] as bool?, + (getArg('retry_connections', 0.5))?.toDouble() ?? 0.5, + retryCleanup: getArg('retry_cleanup'), retryOnError: - job.retryOnError ?? - job.maxRetries ?? - overrides['retry_on_error'] as int?, + getArg('retry_on_error') ?? getArg('max_retries'), // Error handling - failOnError: - job.failOnError ?? - (overrides['fail_on_error'] as num?)?.toDouble() ?? - 0.05, - continueOnFail: - job.continueOnFail ?? overrides['continue_on_fail'] as bool?, - debugErrors: job.debugErrors ?? overrides['debug_errors'] as bool?, + failOnError: (getArg('fail_on_error', 0.05))?.toDouble() ?? 0.05, + continueOnFail: getArg('continue_on_fail'), + debugErrors: getArg('debug_errors'), // Concurrency - maxSamples: job.maxSamples ?? overrides['max_samples'] as int?, - maxTasks: job.maxTasks ?? overrides['max_tasks'] as int?, - maxSubprocesses: - job.maxSubprocesses ?? overrides['max_subprocesses'] as int?, - maxSandboxes: job.maxSandboxes ?? overrides['max_sandboxes'] as int?, + maxSamples: getArg('max_samples'), + maxTasks: getArg('max_tasks'), + maxSubprocesses: getArg('max_subprocesses'), + maxSandboxes: getArg('max_sandboxes'), // Logging - logLevel: job.logLevel ?? overrides['log_level'] as String? ?? 
'info', - logLevelTranscript: - job.logLevelTranscript ?? - overrides['log_level_transcript'] as String?, - logFormat: job.logFormat ?? overrides['log_format'] as String? ?? 'json', - logSamples: job.logSamples ?? overrides['log_samples'] as bool?, - logRealtime: job.logRealtime ?? overrides['log_realtime'] as bool?, - logImages: job.logImages ?? overrides['log_images'] as bool?, - logBuffer: job.logBuffer ?? overrides['log_buffer'] as int?, - logShared: job.logShared ?? overrides['log_shared'] as int?, - logDirAllowDirty: - job.logDirAllowDirty ?? overrides['log_dir_allow_dirty'] as bool?, + logLevel: getArg('log_level', 'info'), + logLevelTranscript: getArg('log_level_transcript'), + logFormat: getArg('log_format', 'json'), + logSamples: getArg('log_samples'), + logRealtime: getArg('log_realtime'), + logImages: getArg('log_images'), + logBuffer: getArg('log_buffer'), + logShared: getArg('log_shared'), + logDirAllowDirty: getArg('log_dir_allow_dirty'), // Model config - modelBaseUrl: job.modelBaseUrl ?? overrides['model_base_url'] as String?, + modelBaseUrl: getArg('model_base_url'), modelArgs: - job.modelArgs ?? - (overrides['model_args'] as Map?) ?? + (evalArgs['model_args'] as Map?) ?? + (evalSetOverrides['model_args'] as Map?) ?? const {}, modelRoles: - job.modelRoles ?? overrides['model_roles'] as Map?, + (evalArgs['model_roles'] as Map?) ?? + evalSetOverrides['model_roles'] as Map?, taskArgs: - job.taskArgs ?? - (overrides['task_args'] as Map?) ?? + (evalArgs['task_args'] as Map?) ?? + (evalSetOverrides['task_args'] as Map?) ?? const {}, modelCostConfig: - job.modelCostConfig ?? - overrides['model_cost_config'] as Map?, + (evalArgs['model_cost_config'] as Map?) ?? + evalSetOverrides['model_cost_config'] as Map?, // Sandbox - sandboxCleanup: - job.sandboxCleanup ?? overrides['sandbox_cleanup'] as bool?, + sandboxCleanup: getArg('sandbox_cleanup'), // Sample control - limit: job.limit ?? overrides['limit'], - sampleId: job.sampleId ?? 
overrides['sample_id'], - sampleShuffle: job.sampleShuffle ?? overrides['sample_shuffle'], - epochs: job.epochs ?? overrides['epochs'], + limit: evalArgs['limit'] ?? evalSetOverrides['limit'], + sampleId: evalArgs['sample_id'] ?? evalSetOverrides['sample_id'], + sampleShuffle: evalArgs['sample_shuffle'] ?? evalSetOverrides['sample_shuffle'], + epochs: evalArgs['epochs'] ?? evalSetOverrides['epochs'], // Misc - tags: job.tags ?? (overrides['tags'] as List?)?.cast(), - metadata: job.metadata ?? overrides['metadata'] as Map?, - trace: job.trace ?? overrides['trace'] as bool?, - display: job.display ?? overrides['display'] as String?, - approval: job.approval ?? overrides['approval'], - solver: job.solver ?? overrides['solver'], - score: job.score ?? overrides['score'] as bool? ?? true, + tags: (evalArgs['tags'] as List?)?.cast() ?? (evalSetOverrides['tags'] as List?)?.cast(), + metadata: (evalArgs['metadata'] as Map?) ?? evalSetOverrides['metadata'] as Map?, + trace: getArg('trace'), + display: getArg('display'), + approval: evalArgs['approval'] ?? evalSetOverrides['approval'], + solver: evalArgs['solver'] ?? evalSetOverrides['solver'], + score: getArg('score', true) ?? true, // Limits - messageLimit: job.messageLimit ?? overrides['message_limit'] as int?, - tokenLimit: job.tokenLimit ?? overrides['token_limit'] as int?, - timeLimit: job.timeLimit ?? overrides['time_limit'] as int?, - workingLimit: job.workingLimit ?? overrides['working_limit'] as int?, - costLimit: job.costLimit ?? (overrides['cost_limit'] as num?)?.toDouble(), + messageLimit: getArg('message_limit'), + tokenLimit: getArg('token_limit'), + timeLimit: getArg('time_limit'), + workingLimit: getArg('working_limit'), + costLimit: (getArg('cost_limit'))?.toDouble(), // Bundling - bundleDir: job.bundleDir ?? overrides['bundle_dir'] as String?, - bundleOverwrite: - job.bundleOverwrite ?? - overrides['bundle_overwrite'] as bool? ?? - false, - evalSetId: job.evalSetId ?? 
overrides['eval_set_id'] as String?, + bundleDir: getArg('bundle_dir'), + bundleOverwrite: getArg('bundle_overwrite', false) ?? false, + evalSetId: getArg('eval_set_id'), ); } - // ------------------------------------------------------------------ - // Model resolution - // ------------------------------------------------------------------ - /// Resolve which models to run. Job overrides default. - List _resolveModels(Job job) { - if (job.models != null && job.models!.isNotEmpty) return job.models!; - return List.of(kDefaultModels); - } // ------------------------------------------------------------------ // Sandbox resolution @@ -378,28 +352,15 @@ class EvalSetResolver { /// Returns either `"local"` or a `Map` with `type` and `path` keys. Object _resolveSandbox( String datasetRoot, - Job job, { - String? flutterChannel, - }) { - final sandboxType = job.sandboxType; + Job job, + ) { + final sandboxCfg = job.sandbox ?? {}; + final sandboxType = (sandboxCfg['environment'] as String?) ?? 'local'; if (sandboxType.isEmpty || sandboxType == 'local') return 'local'; - // Channel override → look up channel-specific sandbox - if (flutterChannel != null && kSdkChannels.containsKey(flutterChannel)) { - final registryKey = kSdkChannels[flutterChannel]!; - if (kSandboxRegistry.containsKey(registryKey)) { - final def = kSandboxRegistry[registryKey]!; - var sandboxPath = def['path']!; - if (!p.isAbsolute(sandboxPath)) { - sandboxPath = p.normalize(p.join(datasetRoot, sandboxPath)); - } - return {'type': def['name']!, 'path': sandboxPath}; - } - } - // Named sandbox from registry - if (kSandboxRegistry.containsKey(sandboxType)) { - final def = kSandboxRegistry[sandboxType]!; + if (sandboxRegistry.containsKey(sandboxType)) { + final def = sandboxRegistry[sandboxType]!; var sandboxPath = def['path']!; if (!p.isAbsolute(sandboxPath)) { sandboxPath = p.normalize(p.join(datasetRoot, sandboxPath)); @@ -427,16 +388,13 @@ class EvalSetResolver { for (final taskConfig in datasetTasks) { 
final taskId = taskConfig.id; - // Filter by job.tasks + // Filter by job.tasks (ID-based) if (job.tasks != null && !job.tasks!.containsKey(taskId)) continue; - // Determine effective variants (intersection) - final effectiveVariants = >{}; - for (final entry in jobVariants.entries) { - if (taskConfig.allowedVariants == null || - taskConfig.allowedVariants!.contains(entry.key)) { - effectiveVariants[entry.key] = entry.value; - } + // Filter by job.taskFilters (tag-based) + if (job.taskFilters != null) { + final taskTags = (taskConfig.metadata?['tags'] as List?)?.cast() ?? []; + if (!matchesTagFilter(taskTags, job.taskFilters!)) continue; } // Get job-level task overrides @@ -444,6 +402,25 @@ class EvalSetResolver { ? job.tasks![taskId] : null; + // Determine effective variants using job-level include/exclude + final effectiveVariants = >{}; + for (final entry in jobVariants.entries) { + final vName = entry.key; + + // Job-task level include_variants filter + if (jobTask?.includeVariants != null && + !jobTask!.includeVariants!.contains(vName)) { + continue; + } + // Job-task level exclude_variants filter + if (jobTask?.excludeVariants != null && + jobTask!.excludeVariants!.contains(vName)) { + continue; + } + + effectiveVariants[vName] = entry.value; + } + // Apply sample filtering var samples = taskConfig.samples; if (jobTask != null) { @@ -459,10 +436,21 @@ class EvalSetResolver { } } - // Apply system_message override + // Apply sample tag filtering (job-level) + if (job.sampleFilters != null) { + samples = samples.where((s) { + final sampleTags = (s.metadata?['tags'] as List?)?.cast() ?? []; + return matchesTagFilter(sampleTags, job.sampleFilters!); + }).toList(); + } + + // Apply system_message from task (no longer overridden by job task) var systemMessage = taskConfig.systemMessage; - if (jobTask?.systemMessage != null) { - systemMessage = jobTask!.systemMessage; + + // Merge job-task args into metadata + Map? 
mergedMetadata = taskConfig.metadata; + if (jobTask?.args != null && jobTask!.args!.isNotEmpty) { + mergedMetadata = {...?mergedMetadata, 'args': jobTask.args}; } // Create one ParsedTask per effective variant @@ -481,9 +469,9 @@ class EvalSetResolver { variant: variant, sandboxType: sandboxType, systemMessage: systemMessage, - allowedVariants: null, saveExamples: job.saveExamples, examplesDir: examplesDir, + metadata: mergedMetadata, ), ); } @@ -505,9 +493,9 @@ class EvalSetResolver { if (vDef.isEmpty) return Variant(name: name); // Load context files (with glob support) - final contextFiles = []; + final files = []; final cfPaths = - (vDef['context_files'] as List?)?.cast() ?? const []; + (vDef['files'] as List?)?.cast() ?? const []; for (final cfPath in cfPaths) { if (_isGlob(cfPath)) { final matched = _expandGlobFiles(datasetRoot, cfPath); @@ -517,19 +505,18 @@ class EvalSetResolver { ); } for (final f in matched) { - contextFiles.add(ContextFile.load(f)); + files.add(ContextFile.load(f)); } } else { final fullPath = p.normalize(p.join(datasetRoot, cfPath)); - contextFiles.add(ContextFile.load(fullPath)); + files.add(ContextFile.load(fullPath)); } } // Resolve skill paths (with glob support) - final skillPaths = []; + final skills = []; final rawSkills = - ((vDef['skills'] as List?) ?? (vDef['skill_paths'] as List?) ?? []) - .cast(); + (vDef['skills'] as List?)?.cast() ?? 
const []; for (final skillPathStr in rawSkills) { if (_isGlob(skillPathStr)) { final matched = _expandGlobDirs(datasetRoot, skillPathStr); @@ -541,7 +528,7 @@ class EvalSetResolver { 'No skill directories matched pattern: $skillPathStr', ); } - skillPaths.addAll(validDirs); + skills.addAll(validDirs); } else { final skillDir = p.normalize(p.join(datasetRoot, skillPathStr)); if (!Directory(skillDir).existsSync()) { @@ -553,16 +540,31 @@ class EvalSetResolver { 'Each skill directory must contain a SKILL.md file.', ); } - skillPaths.add(skillDir); + skills.add(skillDir); + } + } + + // Parse MCP servers as config objects + final mcpServers = >[]; + final rawMcpServers = vDef['mcp_servers'] as List? ?? []; + for (final srv in rawMcpServers) { + if (srv is Map) { + mcpServers.add(Map.from(srv)); + } else if (srv is String) { + // String shorthand: treat as a ref (Python import path) + mcpServers.add({'ref': srv}); } } + // Parse task_parameters + final taskParameters = (vDef['task_parameters'] as Map?)?.cast() ?? {}; + return Variant( name: name, - contextFiles: contextFiles, - mcpServers: (vDef['mcp_servers'] as List?)?.cast() ?? 
[], - skillPaths: skillPaths, - flutterChannel: vDef['flutter_channel'] as String?, + files: files, + mcpServers: mcpServers, + skills: skills, + taskParameters: taskParameters, ); } diff --git a/packages/dataset_config_dart/pubspec.yaml b/packages/dataset_config_dart/pubspec.yaml index cc76a7a..61a386b 100644 --- a/packages/dataset_config_dart/pubspec.yaml +++ b/packages/dataset_config_dart/pubspec.yaml @@ -15,5 +15,8 @@ dependencies: yaml: ^3.1.0 dev_dependencies: + build_runner: ^2.12.2 + freezed: ^3.2.5 + json_serializable: ^6.13.0 lints: ^6.0.0 test: any diff --git a/packages/dataset_config_dart/test/eval_set_resolver_test.dart b/packages/dataset_config_dart/test/eval_set_resolver_test.dart index d982b58..bd4eb82 100644 --- a/packages/dataset_config_dart/test/eval_set_resolver_test.dart +++ b/packages/dataset_config_dart/test/eval_set_resolver_test.dart @@ -7,18 +7,18 @@ void main() { /// Helper to create a minimal [ParsedTask] for testing. ParsedTask makeTask({ String id = 'test_task', - String taskFunc = 'question_answer', + String func = 'question_answer', List? samples, Variant? variant, - List? allowedVariants, String? systemMessage, String? model, int? timeLimit, int? messageLimit, + Map? metadata, }) { return ParsedTask( id: id, - taskFunc: taskFunc, + func: func, samples: samples ?? [ @@ -30,32 +30,32 @@ void main() { ), ], variant: variant ?? const Variant(), - allowedVariants: allowedVariants, systemMessage: systemMessage, model: model, timeLimit: timeLimit, messageLimit: messageLimit, + metadata: metadata, ); } /// Helper to create a minimal [Job] for testing. Job makeJob({ String logDir = '/tmp/logs', - String sandboxType = 'local', - List? models, + Map? sandbox, + List models = const ['test-model'], Map>? variants, Map? tasks, bool saveExamples = false, - Map? taskDefaults, + Map? 
inspectEvalArguments, }) { return Job( logDir: logDir, - sandboxType: sandboxType, + sandbox: sandbox, models: models, variants: variants, tasks: tasks, saveExamples: saveExamples, - taskDefaults: taskDefaults, + inspectEvalArguments: inspectEvalArguments, ); } @@ -148,14 +148,15 @@ void main() { expect(results.first.model, ['model_a', 'model_b']); }); - test('uses default models when job has none', () { - final results = resolver.resolve( - [makeTask()], - makeJob(models: null), - '/tmp/dataset', + test('throws when job has empty models', () { + expect( + () => resolver.resolve( + [makeTask()], + makeJob(models: []), + '/tmp/dataset', + ), + throwsArgumentError, ); - - expect(results.first.model, kDefaultModels); }); test('job with include_samples filters to only matching samples', () { @@ -231,21 +232,27 @@ void main() { test('local sandbox resolves to null in output', () { final results = resolver.resolve( [makeTask()], - makeJob(models: ['m'], sandboxType: 'local'), + makeJob(models: ['m'], sandbox: {'environment': 'local'}), '/tmp/dataset', ); expect(results.first.sandbox, isNull); }); - test('respects allowedVariants on tasks', () { + test('respects includeVariants on job tasks', () { final results = resolver.resolve( [ - makeTask(allowedVariants: ['baseline']), + makeTask(), ], makeJob( models: ['m'], variants: {'baseline': {}, 'full': {}}, + tasks: { + 'test_task': const JobTask( + id: 'test_task', + includeVariants: ['baseline'], + ), + }, ), '/tmp/dataset', ); @@ -278,14 +285,14 @@ void main() { expect(taskNames.first, contains('included')); }); - test('taskFunc is propagated to output Task', () { + test('func is propagated to output Task', () { final results = resolver.resolve( - [makeTask(taskFunc: 'flutter_code_gen')], + [makeTask(func: 'flutter_code_gen')], makeJob(models: ['m']), '/tmp/dataset', ); - expect(results.first.tasks.first.taskFunc, 'flutter_code_gen'); + expect(results.first.tasks.first.func, 'flutter_code_gen'); }); test('system_message 
appears in task metadata', () { @@ -317,7 +324,7 @@ void main() { [makeTask()], makeJob( models: ['m'], - taskDefaults: {'time_limit': 999, 'message_limit': 77}, + inspectEvalArguments: {'task_defaults': {'time_limit': 999, 'message_limit': 77}}, ), '/tmp/dataset', ); @@ -332,7 +339,7 @@ void main() { [makeTask(timeLimit: 100)], makeJob( models: ['m'], - taskDefaults: {'time_limit': 999}, + inspectEvalArguments: {'task_defaults': {'time_limit': 999}}, ), '/tmp/dataset', ); @@ -343,11 +350,13 @@ void main() { test('job-level eval_set fields propagate', () { final results = resolver.resolve( [makeTask()], - const Job( + Job( logDir: '/tmp/logs', models: ['m'], - retryAttempts: 42, - logLevel: 'debug', + inspectEvalArguments: { + 'retry_attempts': 42, + 'log_level': 'debug', + }, ), '/tmp/dataset', ); @@ -366,5 +375,73 @@ void main() { final dataset = results.first.tasks.first.dataset!; expect(dataset.name, 'my_eval:baseline'); }); + + test('excludeVariants restricts effective variants', () { + final results = resolver.resolve( + [ + makeTask(), + ], + makeJob( + models: ['m'], + variants: {'baseline': {}, 'full': {}, 'mcp_only': {}}, + tasks: { + 'test_task': const JobTask( + id: 'test_task', + excludeVariants: ['full', 'mcp_only'], + ), + }, + ), + '/tmp/dataset', + ); + + final taskNames = results + .expand((e) => e.tasks) + .map((t) => t.name) + .toList(); + expect(taskNames, ['test_task:baseline']); + expect(taskNames, isNot(contains('test_task:full'))); + expect(taskNames, isNot(contains('test_task:mcp_only'))); + }); + + test('image_prefix from sandbox appears in task metadata', () { + final results = resolver.resolve( + [makeTask()], + makeJob( + models: ['m'], + sandbox: { + 'environment': 'podman', + 'image_prefix': 'us-central1-docker.pkg.dev/my-project/repo/', + }, + ), + '/tmp/dataset', + ); + + final metadata = results.first.tasks.first.metadata!; + expect( + metadata['image_prefix'], + 'us-central1-docker.pkg.dev/my-project/repo/', + ); + }); + + 
test('JobTask.args appears in task metadata', () { + final results = resolver.resolve( + [makeTask(id: 'my_task')], + makeJob( + models: ['m'], + tasks: { + 'my_task': const JobTask( + id: 'my_task', + args: {'base_url': 'http://localhost', 'timeout': 30}, + ), + }, + ), + '/tmp/dataset', + ); + + final metadata = results.first.tasks.first.metadata!; + expect(metadata['args'], isA()); + expect(metadata['args']['base_url'], 'http://localhost'); + expect(metadata['args']['timeout'], 30); + }); }); } diff --git a/packages/dataset_config_dart/test/eval_set_writer_test.dart b/packages/dataset_config_dart/test/eval_set_writer_test.dart index 2ef58e5..ef377e6 100644 --- a/packages/dataset_config_dart/test/eval_set_writer_test.dart +++ b/packages/dataset_config_dart/test/eval_set_writer_test.dart @@ -25,7 +25,7 @@ void main() { taskCount, (i) => Task( name: 'task_$i:baseline', - taskFunc: 'func_$i', + func: 'func_$i', dataset: Dataset( samples: [ Sample(id: 's$i', input: 'input $i', target: 'target $i'), diff --git a/packages/dataset_config_dart/test/json_parser_test.dart b/packages/dataset_config_dart/test/json_parser_test.dart index f09520c..9583e65 100644 --- a/packages/dataset_config_dart/test/json_parser_test.dart +++ b/packages/dataset_config_dart/test/json_parser_test.dart @@ -14,17 +14,17 @@ void main() { { 'id': 'my_task', 'func': 'question_answer', - 'samples': { + 'dataset': {'samples': { 'inline': [ {'id': 's1', 'input': 'What is Dart?', 'target': 'A language'}, ], - }, + }}, }, ]); expect(tasks, hasLength(1)); expect(tasks.first.id, 'my_task'); - expect(tasks.first.taskFunc, 'question_answer'); + expect(tasks.first.func, 'question_answer'); expect(tasks.first.samples, hasLength(1)); expect(tasks.first.samples.first.id, 's1'); expect(tasks.first.samples.first.input, 'What is Dart?'); @@ -35,11 +35,11 @@ void main() { final tasks = parser.parseTasksFromMaps([ { 'id': 'dart_qa', - 'samples': {'inline': >[]}, + 'dataset': {'samples': {'inline': >[]}}, }, ]); - 
expect(tasks.first.taskFunc, 'dart_qa'); + expect(tasks.first.func, 'dart_qa'); }); test('throws FormatException when sample missing required field', () { @@ -47,31 +47,33 @@ void main() { () => parser.parseTasksFromMaps([ { 'id': 'bad_task', - 'samples': { + 'dataset': {'samples': { 'inline': [ {'id': 's1', 'input': 'hello'}, // missing 'target' ], - }, + }}, }, ]), throwsA(isA<FormatException>()), ); }); - test('normalises tags from comma-separated string', () { + test('normalises tags from comma-separated string in metadata', () { final tasks = parser.parseTasksFromMaps([ { 'id': 'tagged_task', - 'samples': { + 'dataset': {'samples': { 'inline': [ { 'id': 's1', 'input': 'q', 'target': 'a', - 'tags': 'flutter, dart, widgets', + 'metadata': { + 'tags': 'flutter, dart, widgets', + }, }, ], - }, + }}, }, ]); @@ -79,20 +81,22 @@ void main() { expect(metadata['tags'], equals(['flutter', 'dart', 'widgets'])); }); - test('normalises tags from list', () { + test('normalises tags from list in metadata', () { final tasks = parser.parseTasksFromMaps([ { 'id': 'tagged_task', - 'samples': { + 'dataset': {'samples': { 'inline': [ { 'id': 's1', 'input': 'q', 'target': 'a', - 'tags': ['tag1', 'tag2'], + 'metadata': { + 'tags': ['tag1', 'tag2'], + }, }, ], - }, + }}, }, ]); @@ -104,11 +108,11 @@ void main() { final tasks = parser.parseTasksFromMaps([ { 'id': 'no_tags', - 'samples': { + 'dataset': {'samples': { 'inline': [ {'id': 's1', 'input': 'q', 'target': 'a'}, ], - }, + }}, }, ]); @@ -120,11 +124,11 @@ void main() { final tasks = parser.parseTasksFromMaps([ { 'id': 'task', - 'samples': { + 'dataset': {'samples': { 'inline': [ {'id': 's1', 'input': 'q', 'target': 'a'}, ], - }, + }}, }, ]); @@ -136,7 +140,7 @@ void main() { final tasks = parser.parseTasksFromMaps([ { 'id': 'task', - 'samples': { + 'dataset': {'samples': { 'inline': [ { 'id': 's1', @@ -147,7 +151,7 @@ void main() { 'files': {'main.dart': 'void main() {}'}, }, ], - }, + }}, }, ]); @@ -157,31 +161,31 @@ void main() { 
expect(sample.files, {'main.dart': 'void main() {}'}); }); - test('parses all task-level settings', () { + test('parses all task-level settings from inspect_task_args', () { final tasks = parser.parseTasksFromMaps([ { 'id': 'full_task', 'func': 'my_func', 'system_message': 'Be helpful', - 'allowed_variants': ['baseline', 'full'], - 'model': 'gemini-pro', - 'config': {'temperature': 0.5}, - 'model_roles': {'grader': 'gpt-4o'}, - 'message_limit': 50, - 'token_limit': 4096, - 'time_limit': 600, - 'working_limit': 300, - 'cost_limit': 1.5, + 'inspect_task_args': { + 'model': 'gemini-pro', + 'config': {'temperature': 0.5}, + 'model_roles': {'grader': 'gpt-4o'}, + 'message_limit': 50, + 'token_limit': 4096, + 'time_limit': 600, + 'working_limit': 300, + 'cost_limit': 1.5, + }, 'display_name': 'Full Task', 'version': 2, 'metadata': {'author': 'test'}, - 'samples': {'inline': >[]}, + 'dataset': {'samples': {'inline': >[]}}, }, ]); final task = tasks.first; expect(task.systemMessage, 'Be helpful'); - expect(task.allowedVariants, ['baseline', 'full']); expect(task.model, 'gemini-pro'); expect(task.config, {'temperature': 0.5}); expect(task.modelRoles, {'grader': 'gpt-4o'}); @@ -199,9 +203,9 @@ void main() { final tasks = parser.parseTasksFromMaps([ { 'id': 'task', - 'samples': { + 'dataset': {'samples': { 'inline': [{}], - }, + }}, }, ]); @@ -210,66 +214,80 @@ void main() { }); group('parseJobFromMap()', () { - test('parses minimal job with defaults', () { - final job = parser.parseJobFromMap({}); - - expect(job.logDir, ''); - expect(job.sandboxType, 'local'); - expect(job.maxConnections, 10); - expect(job.models, isNull); - expect(job.saveExamples, false); + test('throws when models is missing', () { + expect( + () => parser.parseJobFromMap({}), + throwsA(isA()), + ); }); test('parses all core fields', () { final job = parser.parseJobFromMap({ 'log_dir': './logs/run1', - 'sandbox_type': 'podman', + 'sandbox': {'environment': 'podman'}, 'max_connections': 5, 'models': 
['gemini-pro', 'gpt-4o'], 'save_examples': true, }); expect(job.logDir, './logs/run1'); - expect(job.sandboxType, 'podman'); + expect(job.sandbox, {'environment': 'podman'}); expect(job.maxConnections, 5); expect(job.models, ['gemini-pro', 'gpt-4o']); expect(job.saveExamples, true); }); - test('parses promoted eval_set fields', () { + test('parses sandbox string shorthand', () { final job = parser.parseJobFromMap({ - 'retry_attempts': 20, - 'max_retries': 3, - 'retry_wait': 5.0, - 'fail_on_error': 0.5, - 'continue_on_fail': true, - 'max_samples': 100, - 'max_tasks': 4, - 'log_level': 'debug', - 'tags': ['ci', 'nightly'], - 'metadata': {'run_by': 'bot'}, + 'sandbox': 'podman', + 'models': ['test-model'], }); - expect(job.retryAttempts, 20); - expect(job.maxRetries, 3); - expect(job.retryWait, 5.0); - expect(job.failOnError, 0.5); - expect(job.continueOnFail, true); - expect(job.maxSamples, 100); - expect(job.maxTasks, 4); - expect(job.logLevel, 'debug'); - expect(job.tags, ['ci', 'nightly']); - expect(job.metadata, {'run_by': 'bot'}); + expect(job.sandbox, {'environment': 'podman'}); }); - test('parses pass-through overrides', () { + test('parses inspect_eval_arguments', () { final job = parser.parseJobFromMap({ - 'eval_set_overrides': {'custom_key': 'custom_value'}, - 'task_defaults': {'time_limit': 600}, + 'models': ['test-model'], + 'inspect_eval_arguments': { + 'retry_attempts': 20, + 'max_retries': 3, + 'retry_wait': 5.0, + 'fail_on_error': 0.5, + 'continue_on_fail': true, + 'max_samples': 100, + 'max_tasks': 4, + 'log_level': 'debug', + 'tags': ['ci', 'nightly'], + 'metadata': {'run_by': 'bot'}, + }, + }); + + final evalArgs = job.inspectEvalArguments!; + expect(evalArgs['retry_attempts'], 20); + expect(evalArgs['max_retries'], 3); + expect(evalArgs['retry_wait'], 5.0); + expect(evalArgs['fail_on_error'], 0.5); + expect(evalArgs['continue_on_fail'], true); + expect(evalArgs['max_samples'], 100); + expect(evalArgs['max_tasks'], 4); + 
expect(evalArgs['log_level'], 'debug'); + expect(evalArgs['tags'], ['ci', 'nightly']); + expect(evalArgs['metadata'], {'run_by': 'bot'}); + }); + + test('parses nested overrides in inspect_eval_arguments', () { + final job = parser.parseJobFromMap({ + 'models': ['test-model'], + 'inspect_eval_arguments': { + 'eval_set_overrides': {'custom_key': 'custom_value'}, + 'task_defaults': {'time_limit': 600}, + }, }); - expect(job.evalSetOverrides, {'custom_key': 'custom_value'}); - expect(job.taskDefaults, {'time_limit': 600}); + final evalArgs = job.inspectEvalArguments!; + expect(evalArgs['eval_set_overrides'], {'custom_key': 'custom_value'}); + expect(evalArgs['task_defaults'], {'time_limit': 600}); }); }); diff --git a/packages/dataset_config_dart/test/parsed_task_test.dart b/packages/dataset_config_dart/test/parsed_task_test.dart index 4921e30..cd3c75c 100644 --- a/packages/dataset_config_dart/test/parsed_task_test.dart +++ b/packages/dataset_config_dart/test/parsed_task_test.dart @@ -6,7 +6,7 @@ void main() { test('has correct defaults', () { const task = ParsedTask( id: 'test', - taskFunc: 'question_answer', + func: 'question_answer', samples: [], variant: Variant(), ); @@ -14,7 +14,7 @@ void main() { expect(task.sandboxType, 'local'); expect(task.saveExamples, false); expect(task.systemMessage, isNull); - expect(task.allowedVariants, isNull); + expect(task.examplesDir, isNull); expect(task.examplesDir, isNull); expect(task.model, isNull); expect(task.config, isNull); @@ -27,12 +27,11 @@ void main() { test('stores all constructor fields', () { const task = ParsedTask( id: 'my_task', - taskFunc: 'flutter_code_gen', + func: 'flutter_code_gen', samples: [Sample(id: 's1', input: 'q', target: 'a')], variant: Variant(name: 'full'), sandboxType: 'podman', systemMessage: 'Be helpful', - allowedVariants: ['baseline', 'full'], saveExamples: true, examplesDir: '/tmp/examples', model: 'gemini-pro', @@ -49,12 +48,11 @@ void main() { ); expect(task.id, 'my_task'); - 
expect(task.taskFunc, 'flutter_code_gen'); + expect(task.func, 'flutter_code_gen'); expect(task.samples, hasLength(1)); expect(task.variant.name, 'full'); expect(task.sandboxType, 'podman'); expect(task.systemMessage, 'Be helpful'); - expect(task.allowedVariants, ['baseline', 'full']); expect(task.saveExamples, true); expect(task.examplesDir, '/tmp/examples'); expect(task.model, 'gemini-pro'); @@ -75,7 +73,7 @@ void main() { test('overrides specified fields', () { const original = ParsedTask( id: 'original', - taskFunc: 'func_a', + func: 'func_a', samples: [], variant: Variant(name: 'baseline'), timeLimit: 100, @@ -93,7 +91,7 @@ void main() { test('preserves fields not overridden', () { const original = ParsedTask( id: 'task', - taskFunc: 'func', + func: 'func', samples: [], variant: Variant(name: 'full'), sandboxType: 'podman', @@ -103,7 +101,7 @@ void main() { final copy = original.copyWith(id: 'new_id'); - expect(copy.taskFunc, 'func'); + expect(copy.func, 'func'); expect(copy.variant.name, 'full'); expect(copy.sandboxType, 'podman'); expect(copy.systemMessage, 'Be helpful'); @@ -113,7 +111,7 @@ void main() { test('returns a new instance (not the same object)', () { const original = ParsedTask( id: 'a', - taskFunc: 'f', + func: 'f', samples: [], variant: Variant(), ); @@ -128,7 +126,7 @@ void main() { test('can override samples list', () { const original = ParsedTask( id: 'task', - taskFunc: 'func', + func: 'func', samples: [Sample(id: 's1', input: 'q', target: 'a')], variant: Variant(), ); diff --git a/packages/dataset_config_python/src/dataset_config_python/__init__.py b/packages/dataset_config_python/src/dataset_config_python/__init__.py index 135b4cb..e6dd675 100644 --- a/packages/dataset_config_python/src/dataset_config_python/__init__.py +++ b/packages/dataset_config_python/src/dataset_config_python/__init__.py @@ -6,7 +6,23 @@ No Dart SDK or Inspect AI dependency required. 
""" -from dataset_config_python.resolver import resolve +from dataset_config_python.parser import ParsedTask, find_job_file, parse_job, parse_tasks +from dataset_config_python.resolver import ( + DEFAULT_SANDBOX_REGISTRY, + SandboxConfig, + resolve, + resolve_from_parsed, +) from dataset_config_python.writer import write_eval_sets -__all__ = ["resolve", "write_eval_sets"] +__all__ = [ + "DEFAULT_SANDBOX_REGISTRY", + "ParsedTask", + "SandboxConfig", + "find_job_file", + "parse_job", + "parse_tasks", + "resolve", + "resolve_from_parsed", + "write_eval_sets", +] diff --git a/packages/dataset_config_python/src/dataset_config_python/models/__init__.py b/packages/dataset_config_python/src/dataset_config_python/models/__init__.py index a90aaad..3afc978 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/__init__.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/__init__.py @@ -4,7 +4,9 @@ from dataset_config_python.models.dataset import Dataset from dataset_config_python.models.eval_set import EvalSet from dataset_config_python.models.job import Job, JobTask +from dataset_config_python.models.mcp_server_config import McpServerConfig from dataset_config_python.models.sample import Sample +from dataset_config_python.models.tag_filter import TagFilter, matches_tag_filter from dataset_config_python.models.task import Task from dataset_config_python.models.variant import Variant @@ -15,7 +17,10 @@ "EvalSet", "Job", "JobTask", + "McpServerConfig", "Sample", + "TagFilter", "Task", "Variant", + "matches_tag_filter", ] diff --git a/packages/dataset_config_python/src/dataset_config_python/models/dataset.py b/packages/dataset_config_python/src/dataset_config_python/models/dataset.py index b04ceb5..fe363ee 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/dataset.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/dataset.py @@ -2,16 +2,24 @@ from __future__ import annotations +from typing 
import Any + from pydantic import BaseModel from dataset_config_python.models.sample import Sample class Dataset(BaseModel): - """A named collection of samples.""" + """A named collection of samples, or a reference to a file-backed dataset. + + Supports three dataset formats: + - ``format="memory"`` (default): inline samples via ``samples`` list. + - ``format="json"``: loads via Inspect AI's ``json_dataset(source, **args)``. + - ``format="csv"``: loads via Inspect AI's ``csv_dataset(source, **args)``. + """ samples: list[Sample] = [] - """The sample records in this dataset.""" + """The sample records (only used when format is 'memory').""" name: str = "" """Display name for the dataset.""" @@ -21,3 +29,12 @@ class Dataset(BaseModel): shuffled: bool = False """Whether the dataset was shuffled after reading.""" + + format: str = "memory" + """Dataset format: 'memory' (inline samples), 'json', or 'csv'.""" + + source: str | None = None + """File path or URL for json/csv datasets.""" + + args: dict[str, Any] | None = None + """Extra kwargs passed to json_dataset() or csv_dataset().""" diff --git a/packages/dataset_config_python/src/dataset_config_python/models/job.py b/packages/dataset_config_python/src/dataset_config_python/models/job.py index c82ccc1..2049b91 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/job.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/job.py @@ -4,7 +4,9 @@ from typing import Any -from pydantic import BaseModel, Field +from pydantic import BaseModel + +from dataset_config_python.models.tag_filter import TagFilter class JobTask(BaseModel): @@ -19,8 +21,14 @@ class JobTask(BaseModel): exclude_samples: list[str] | None = None """Exclude these sample IDs.""" - system_message: str | None = None - """Override system message for this task.""" + args: dict[str, Any] | None = None + """Per-task argument overrides passed to the task function.""" + + include_variants: list[str] | None = None + """Only 
run these variant names for this task.""" + + exclude_variants: list[str] | None = None + """Exclude these variant names for this task.""" @staticmethod def from_yaml(task_id: str, data: dict[str, Any] | None) -> JobTask: @@ -31,7 +39,9 @@ def from_yaml(task_id: str, data: dict[str, Any] | None) -> JobTask: id=task_id, include_samples=data.get("include-samples"), exclude_samples=data.get("exclude-samples"), - system_message=data.get("system_message"), + args=data.get("args"), + include_variants=data.get("include-variants"), + exclude_variants=data.get("exclude-variants"), ) @@ -39,64 +49,23 @@ class Job(BaseModel): """A job configuration defining what to run and how to run it.""" # Core settings + description: str | None = None log_dir: str - sandbox_type: str = "local" max_connections: int = 10 - models: list[str] | None = None + models: list[str] variants: dict[str, dict[str, Any]] | None = None task_paths: list[str] | None = None tasks: dict[str, JobTask] | None = None save_examples: bool = False - # Promoted eval_set() parameters - retry_attempts: int | None = None - max_retries: int | None = None - retry_wait: float | None = None - retry_connections: float | None = None - retry_cleanup: bool | None = None - fail_on_error: float | None = None - continue_on_fail: bool | None = None - retry_on_error: int | None = None - debug_errors: bool | None = None - max_samples: int | None = None - max_tasks: int | None = None - max_subprocesses: int | None = None - max_sandboxes: int | None = None - log_level: str | None = None - log_level_transcript: str | None = None - log_format: str | None = None - tags: list[str] | None = None - metadata: dict[str, Any] | None = None - trace: bool | None = None - display: str | None = None - score: bool | None = None - limit: Any | None = None - sample_id: Any | None = None - sample_shuffle: Any | None = None - epochs: Any | None = None - approval: Any | None = None - solver: Any | None = None - sandbox_cleanup: bool | None = None - 
model_base_url: str | None = None - model_args: dict[str, Any] | None = None - model_roles: dict[str, str] | None = None - task_args: dict[str, Any] | None = None - message_limit: int | None = None - token_limit: int | None = None - time_limit: int | None = None - working_limit: int | None = None - cost_limit: float | None = None - model_cost_config: dict[str, Any] | None = None - log_samples: bool | None = None - log_realtime: bool | None = None - log_images: bool | None = None - log_buffer: int | None = None - log_shared: int | None = None - bundle_dir: str | None = None - bundle_overwrite: bool | None = None - log_dir_allow_dirty: bool | None = None - eval_set_id: str | None = None - - # Pass-through overrides - eval_set_overrides: dict[str, Any] | None = None - task_defaults: dict[str, Any] | None = None + # Sandbox configuration + sandbox: dict[str, Any] | None = None + """Sandbox config with keys: environment, parameters, image_prefix.""" + + # Inspect eval arguments (passed through to eval_set()) + inspect_eval_arguments: dict[str, Any] | None = None + """All Inspect AI eval_set() parameters, nested under one key.""" + + # Tag-based filtering + task_filters: TagFilter | None = None + sample_filters: TagFilter | None = None diff --git a/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py b/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py new file mode 100644 index 0000000..598eb44 --- /dev/null +++ b/packages/dataset_config_python/src/dataset_config_python/models/mcp_server_config.py @@ -0,0 +1,95 @@ +"""MCP server configuration model — declarative or Python import ref.""" + +from __future__ import annotations + +from typing import Any + +from pydantic import BaseModel, Field, model_validator + + +class McpServerConfig(BaseModel): + """MCP server configuration. + + Supports three modes: + 1. **Declarative stdio/sandbox** — specify command, args, env, etc. + 2. 
**Declarative HTTP** — specify url, and optionally headers/auth. + 3. **Python ref** — point to a pre-built MCPServer object via + ``ref: "my_package.module:variable_name"``. + + When ``ref`` is set, all other fields are ignored. + """ + + # Declarative fields (stdio / sandbox) + name: str | None = None + """Human-readable server name (e.g. ``"dart"``).""" + + command: str | None = None + """Executable to run (e.g. ``"dart"``). Required for stdio/sandbox transport.""" + + args: list[str] = Field(default_factory=list) + """Command-line arguments (e.g. ``["mcp-server"]``).""" + + env: dict[str, str] | None = None + """Extra environment variables for the server process.""" + + cwd: str | None = None + """Working directory for the server process.""" + + # Declarative fields (HTTP) + url: str | None = None + """URL endpoint for HTTP transport (e.g. ``"https://mcp.example.com/api"``).""" + + headers: dict[str, str] | None = None + """HTTP headers to send with requests (e.g. for authentication).""" + + authorization: str | None = None + """OAuth Bearer token for HTTP authentication. + + Maps to Inspect AI's ``authorization`` parameter on ``mcp_server_http``. + """ + + # Common + transport: str | None = None + """Transport type: ``"stdio"``, ``"sandbox"``, ``"http"``, or ``None`` (auto). + + Auto-selection logic: + - If ``url`` is set → ``"http"`` + - If ``command`` is set and sandbox is non-local → ``"sandbox"`` + - If ``command`` is set and sandbox is local → ``"stdio"`` + """ + + # Python import escape hatch + ref: str | None = None + """Python import path to a pre-built MCPServer object. + + Format: ``"module.path:variable_name"`` or ``"module.path:factory()"``. + When set, all declarative fields above are ignored. 
+ """ + + @model_validator(mode="after") + def _validate_mode(self) -> McpServerConfig: + if self.ref is None and self.command is None and self.url is None: + raise ValueError( + "McpServerConfig requires one of: 'ref' (Python import), " + "'command' (stdio/sandbox), or 'url' (HTTP). " + "None was provided." + ) + if self.command is not None and self.url is not None: + raise ValueError( + "McpServerConfig cannot have both 'command' (stdio/sandbox) " + "and 'url' (HTTP). Use one or the other." + ) + return self + + @staticmethod + def from_yaml(raw: Any) -> McpServerConfig: + """Parse from YAML — accepts a dict or a string shorthand. + + String shorthand is treated as a ref: + ``"my_package.mcp:server"`` → ``McpServerConfig(ref=...)`` + """ + if isinstance(raw, str): + return McpServerConfig(ref=raw) + if isinstance(raw, dict): + return McpServerConfig(**raw) + raise ValueError(f"Invalid MCP server config: {raw!r}") diff --git a/packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py b/packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py new file mode 100644 index 0000000..5d298e2 --- /dev/null +++ b/packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py @@ -0,0 +1,30 @@ +"""Tag-based filter for including/excluding items by their tags.""" + +from __future__ import annotations + +from pydantic import BaseModel + + +class TagFilter(BaseModel): + """Tag-based filter for including/excluding items.""" + + include_tags: list[str] | None = None + exclude_tags: list[str] | None = None + + +def matches_tag_filter(item_tags: list[str], tag_filter: TagFilter) -> bool: + """Check whether a set of item_tags matches the given filter. 
+ + Returns True if: + - All include_tags (if any) are present in item_tags + - No exclude_tags (if any) are present in item_tags + """ + if tag_filter.include_tags and not all( + t in item_tags for t in tag_filter.include_tags + ): + return False + if tag_filter.exclude_tags and any( + t in item_tags for t in tag_filter.exclude_tags + ): + return False + return True diff --git a/packages/dataset_config_python/src/dataset_config_python/models/task.py b/packages/dataset_config_python/src/dataset_config_python/models/task.py index cafbbe3..bfa0c4d 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/task.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/task.py @@ -19,12 +19,21 @@ class Task(BaseModel): name: str = "" """Task name (e.g. ``"dart_qa:baseline"``).""" - task_func: str | None = None + func: str | None = None """Task function identifier for hydration (e.g. ``"question_answer"``).""" + system_message: str | None = None + """System message override for this task.""" + + sandbox_parameters: dict[str, Any] | None = None + """Pass-through dict for sandbox plugin configuration.""" + dataset: Dataset | None = None """Inline dataset with samples.""" + files: dict[str, str] | None = None + """Files to copy into sandbox (inherited by all samples).""" + setup: Any | None = None """Setup step (always run even when the main solver is replaced).""" diff --git a/packages/dataset_config_python/src/dataset_config_python/models/variant.py b/packages/dataset_config_python/src/dataset_config_python/models/variant.py index 690e675..81eb40c 100644 --- a/packages/dataset_config_python/src/dataset_config_python/models/variant.py +++ b/packages/dataset_config_python/src/dataset_config_python/models/variant.py @@ -2,9 +2,12 @@ from __future__ import annotations +from typing import Any + from pydantic import BaseModel, Field from dataset_config_python.models.context_file import ContextFile +from 
dataset_config_python.models.mcp_server_config import McpServerConfig class Variant(BaseModel): @@ -14,23 +17,23 @@ class Variant(BaseModel): performance with and without specific tooling or context. Features are implied by field presence: - - context_files populated → context injection enabled + - files populated → context injection enabled - mcp_servers populated → MCP tools enabled - - skill_paths populated → agent skills enabled + - skills populated → agent skills enabled - all empty → baseline variant """ name: str = "baseline" """User-defined variant name.""" - context_files: list[ContextFile] = Field(default_factory=list) + files: list[ContextFile] = Field(default_factory=list) """Loaded context files (paths resolved by config resolver).""" - mcp_servers: list[str] = Field(default_factory=list) - """MCP server keys to enable (e.g. ``['dart']``).""" + mcp_servers: list[McpServerConfig] = Field(default_factory=list) + """MCP server configurations (declarative or Python import refs).""" - skill_paths: list[str] = Field(default_factory=list) + skills: list[str] = Field(default_factory=list) """Resolved paths to agent skill directories.""" - flutter_channel: str | None = None - """Flutter SDK channel to use (e.g. 
'stable', 'beta', 'main').""" + task_parameters: dict[str, Any] = Field(default_factory=dict) + """Optional parameters merged into the task config dict at runtime.""" diff --git a/packages/dataset_config_python/src/dataset_config_python/parser.py b/packages/dataset_config_python/src/dataset_config_python/parser.py index 0e9fc12..218b840 100644 --- a/packages/dataset_config_python/src/dataset_config_python/parser.py +++ b/packages/dataset_config_python/src/dataset_config_python/parser.py @@ -29,12 +29,11 @@ def __init__( self, *, id: str, - task_func: str, + func: str, samples: list[Sample], variant: Variant | None = None, sandbox_type: str = "local", system_message: str | None = None, - allowed_variants: list[str] | None = None, save_examples: bool = False, examples_dir: str | None = None, # Task-level settings @@ -55,14 +54,21 @@ def __init__( display_name: str | None = None, version: Any | None = None, metadata: dict[str, Any] | None = None, + sandbox_parameters: dict[str, Any] | None = None, + # Task-level files and setup + task_files: dict[str, str] | None = None, + task_setup: str | None = None, + # Dataset format metadata + dataset_format: str = "memory", + dataset_source: str | None = None, + dataset_args: dict[str, Any] | None = None, ): self.id = id - self.task_func = task_func + self.func = func self.samples = samples self.variant = variant or Variant() self.sandbox_type = sandbox_type self.system_message = system_message - self.allowed_variants = allowed_variants self.save_examples = save_examples self.examples_dir = examples_dir self.model = model @@ -82,6 +88,12 @@ def __init__( self.display_name = display_name self.version = version self.metadata = metadata + self.sandbox_parameters = sandbox_parameters + self.task_files = task_files + self.task_setup = task_setup + self.dataset_format = dataset_format + self.dataset_source = dataset_source + self.dataset_args = dataset_args _UNSET: Any = object() @@ -89,14 +101,16 @@ def copy_with( self, *, id: str | 
None = _UNSET, - task_func: str | None = _UNSET, + func: str | None = _UNSET, samples: list[Sample] | None = _UNSET, variant: Variant | None = _UNSET, sandbox_type: str | None = _UNSET, system_message: str | None = _UNSET, - allowed_variants: list[str] | None = _UNSET, save_examples: bool | None = _UNSET, examples_dir: str | None = _UNSET, + sandbox_parameters: dict[str, Any] | None = _UNSET, + task_files: dict[str, str] | None = _UNSET, + task_setup: str | None = _UNSET, model: str | None = _UNSET, config: dict[str, Any] | None = _UNSET, model_roles: dict[str, str] | None = _UNSET, @@ -119,14 +133,16 @@ def copy_with( _U = ParsedTask._UNSET return ParsedTask( id=self.id if id is _U else id, # type: ignore[arg-type] - task_func=self.task_func if task_func is _U else task_func, # type: ignore[arg-type] + func=self.func if func is _U else func, # type: ignore[arg-type] samples=self.samples if samples is _U else samples, # type: ignore[arg-type] variant=self.variant if variant is _U else variant, sandbox_type=self.sandbox_type if sandbox_type is _U else sandbox_type, # type: ignore[arg-type] system_message=self.system_message if system_message is _U else system_message, - allowed_variants=self.allowed_variants if allowed_variants is _U else allowed_variants, save_examples=self.save_examples if save_examples is _U else save_examples, # type: ignore[arg-type] examples_dir=self.examples_dir if examples_dir is _U else examples_dir, + sandbox_parameters=self.sandbox_parameters if sandbox_parameters is _U else sandbox_parameters, + task_files=self.task_files if task_files is _U else task_files, + task_setup=self.task_setup if task_setup is _U else task_setup, model=self.model if model is _U else model, config=self.config if config is _U else config, model_roles=self.model_roles if model_roles is _U else model_roles, @@ -144,6 +160,9 @@ def copy_with( display_name=self.display_name if display_name is _U else display_name, version=self.version if version is _U else version, 
metadata=self.metadata if metadata is _U else metadata, + dataset_format=self.dataset_format, + dataset_source=self.dataset_source, + dataset_args=self.dataset_args, ) @@ -180,32 +199,8 @@ def _resolve_log_dir(logs_dir: str, base_dir: str) -> str: return os.path.normpath(os.path.join(base_dir, logs_dir, timestamp)) -def _pre_resolve_to_abs(resource: Any, task_dir: str) -> Any: - """Pre-resolve a task-level resource to an absolute path.""" - if resource is None: - return None - if isinstance(resource, str): - if resource.startswith("./") or resource.startswith("../") or resource.startswith("/"): - return {"path": os.path.normpath(os.path.join(task_dir, resource))} - return resource - if isinstance(resource, dict): - if "path" in resource: - return {**resource, "path": os.path.normpath(os.path.join(task_dir, resource["path"]))} - return resource - return resource - - -def _resolve_resource_path(resource: Any, base_dir: str) -> str | None: - """Resolve a workspace/tests resource reference to an absolute path.""" - if resource is None: - return None - if isinstance(resource, str): - if resource.startswith("./") or resource.startswith("../") or resource.startswith("/"): - return os.path.normpath(os.path.join(base_dir, resource)) - return None - if isinstance(resource, dict) and "path" in resource: - return os.path.normpath(os.path.join(base_dir, resource["path"])) - return None + + # --------------------------------------------------------------------------- @@ -237,51 +232,101 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: task_dir = os.path.dirname(task_path) task_id = data.get("id") or os.path.basename(task_dir) - task_func = data.get("func") or task_id + func_name = data.get("func") or task_id - task_workspace_raw = data.get("workspace") - task_tests_raw = data.get("tests") system_message = data.get("system_message") - task_workspace = _pre_resolve_to_abs(task_workspace_raw, task_dir) - task_tests = _pre_resolve_to_abs(task_tests_raw, 
task_dir) + # Parse task-level files and setup + task_files = data.get("files") + if isinstance(task_files, dict): + task_files = {str(k): str(v) for k, v in task_files.items()} + else: + task_files = None + task_setup = data.get("setup") + if isinstance(task_setup, str): + pass # already a string + else: + task_setup = None - allowed_variants = data.get("allowed_variants") + # Parse dataset section (replaces the old top-level 'samples' key) + dataset_raw = data.get("dataset") + samples: list[Sample] = [] + dataset_format = "memory" + dataset_source: str | None = None + dataset_args: dict[str, Any] | None = None - # Parse samples section - samples_raw = data.get("samples") - if not isinstance(samples_raw, dict): - raise ValueError( - f"Task '{task_id}': 'samples' must be a dict with 'inline' and/or " - f"'paths' keys, got {type(samples_raw).__name__}" - ) - samples = _load_samples_section(samples_raw, dataset_root, task_workspace, task_tests, task_dir) + if dataset_raw is not None: + if not isinstance(dataset_raw, dict): + raise ValueError( + f"Task '{task_id}': 'dataset' must be a dict with one of " + f"'samples', 'json', or 'csv' keys, got {type(dataset_raw).__name__}" + ) + + # Check for mutually exclusive format keys + format_keys = {'samples', 'json', 'csv'} + present_keys = format_keys & set(dataset_raw.keys()) + if len(present_keys) > 1: + raise ValueError( + f"Task '{task_id}': 'dataset' must have exactly one of " + f"'samples', 'json', or 'csv', found: {present_keys}" + ) + + dataset_args = dataset_raw.get("args") + if dataset_args is not None and not isinstance(dataset_args, dict): + raise ValueError( + f"Task '{task_id}': 'dataset.args' must be a dict, " + f"got {type(dataset_args).__name__}" + ) + + if "samples" in dataset_raw: + # Inline/path-based samples (existing MemoryDataset behavior) + samples_section = dataset_raw["samples"] + if not isinstance(samples_section, dict): + raise ValueError( + f"Task '{task_id}': 'dataset.samples' must be a dict 
with " + f"'inline' and/or 'paths' keys, got {type(samples_section).__name__}" + ) + samples = _load_samples_section(samples_section, dataset_root, task_files, task_dir) + elif "json" in dataset_raw: + dataset_format = "json" + dataset_source = str(dataset_raw["json"]) + elif "csv" in dataset_raw: + dataset_format = "csv" + dataset_source = str(dataset_raw["csv"]) + + # Task-level Inspect AI args are nested under inspect_task_args + task_args = data.get("inspect_task_args") or {} return [ ParsedTask( id=task_id, - task_func=task_func, + func=func_name, variant=Variant(), samples=samples, system_message=system_message, - allowed_variants=allowed_variants, - model=data.get("model"), - config=data.get("config") if isinstance(data.get("config"), dict) else None, - model_roles=data.get("model_roles") if isinstance(data.get("model_roles"), dict) else None, - sandbox=data.get("sandbox"), - approval=data.get("approval"), - epochs=data.get("epochs"), - fail_on_error=data.get("fail_on_error"), - continue_on_fail=data.get("continue_on_fail"), - message_limit=data.get("message_limit"), - token_limit=data.get("token_limit"), - time_limit=data.get("time_limit"), - working_limit=data.get("working_limit"), - cost_limit=float(data["cost_limit"]) if data.get("cost_limit") is not None else None, - early_stopping=data.get("early_stopping"), + model=task_args.get("model"), + config=task_args.get("config") if isinstance(task_args.get("config"), dict) else None, + model_roles=task_args.get("model_roles") if isinstance(task_args.get("model_roles"), dict) else None, + sandbox=task_args.get("sandbox"), + approval=task_args.get("approval"), + epochs=task_args.get("epochs"), + fail_on_error=task_args.get("fail_on_error"), + continue_on_fail=task_args.get("continue_on_fail"), + message_limit=task_args.get("message_limit"), + token_limit=task_args.get("token_limit"), + time_limit=task_args.get("time_limit"), + working_limit=task_args.get("working_limit"), + 
cost_limit=float(task_args["cost_limit"]) if task_args.get("cost_limit") is not None else None, + early_stopping=task_args.get("early_stopping"), display_name=data.get("display_name"), version=data.get("version"), metadata=data.get("metadata") if isinstance(data.get("metadata"), dict) else None, + sandbox_parameters=data.get("sandbox_parameters") if isinstance(data.get("sandbox_parameters"), dict) else None, + task_files=task_files, + task_setup=task_setup, + dataset_format=dataset_format, + dataset_source=dataset_source, + dataset_args=dataset_args, ) ] @@ -294,8 +339,7 @@ def _load_task_file(task_path: str, dataset_root: str) -> list[ParsedTask]: def _load_samples_section( samples_map: dict[str, Any], dataset_root: str, - task_workspace: Any, - task_tests: Any, + task_files: dict[str, str] | None, task_dir: str, ) -> list[Sample]: """Load samples from 'paths' and 'inline' subsections.""" @@ -314,12 +358,12 @@ def _load_samples_section( if not matched: raise FileNotFoundError(f"No sample files matched pattern: {pattern}") - samples.extend(_load_samples_from_files(matched, dataset_root, task_workspace, task_tests)) + samples.extend(_load_samples_from_files(matched, dataset_root, task_files)) for defn in inline_defs: if not defn: continue - samples.append(_resolve_sample(defn, task_dir, dataset_root, task_workspace, task_tests)) + samples.append(_resolve_sample(defn, task_dir, dataset_root, task_files)) return samples @@ -327,8 +371,7 @@ def _load_samples_section( def _load_samples_from_files( sample_files: list[str], dataset_root: str, - task_workspace: Any, - task_tests: Any, + task_files: dict[str, str] | None, ) -> list[Sample]: """Load samples from external YAML files.""" samples: list[Sample] = [] @@ -350,7 +393,7 @@ def _load_samples_from_files( data = yaml.safe_load(doc) if isinstance(data, dict): samples.append( - _resolve_sample(data, sample_dir, dataset_root, task_workspace, task_tests) + _resolve_sample(data, sample_dir, dataset_root, task_files) ) 
return samples @@ -360,8 +403,7 @@ def _resolve_sample( doc: dict[str, Any], base_dir: str, dataset_root: str, - task_workspace: Any, - task_tests: Any, + task_files: dict[str, str] | None, ) -> Sample: """Resolve a single sample dict into a Sample.""" for field in ("id", "input", "target"): @@ -370,31 +412,11 @@ def _resolve_sample( f"Sample '{doc.get('id', 'unknown')}' missing required field: {field}" ) - sample_workspace = doc.get("workspace") - sample_tests = doc.get("tests") - - effective_workspace = sample_workspace if sample_workspace is not None else task_workspace - - workspace = None - workspace_git = None - workspace_git_ref = None - - if effective_workspace is not None: - if isinstance(effective_workspace, dict) and "git" in effective_workspace: - workspace_git = effective_workspace.get("git") - workspace_git_ref = effective_workspace.get("ref") - else: - resolve_dir = base_dir if sample_workspace is not None else dataset_root - workspace = _resolve_resource_path(effective_workspace, resolve_dir) - - tests = None - if sample_tests is not None: - tests = _resolve_resource_path(sample_tests, base_dir) - elif task_tests is not None: - tests = _resolve_resource_path(task_tests, dataset_root) + # Read metadata fields from the metadata dict + meta_raw: dict[str, Any] = doc.get("metadata") or {} - # Normalize tags - raw_tags = doc.get("tags") + # Normalize tags from metadata + raw_tags = meta_raw.get("tags") if isinstance(raw_tags, str): tags = [t.strip() for t in raw_tags.split(",")] elif isinstance(raw_tags, list): @@ -403,17 +425,21 @@ def _resolve_sample( tags = [] # Build metadata - meta: dict[str, Any] = {**(doc.get("metadata") or {})} - meta["difficulty"] = doc.get("difficulty", "medium") + meta: dict[str, Any] = {**meta_raw} + meta["difficulty"] = meta_raw.get("difficulty", "medium") meta["tags"] = tags - if workspace is not None: - meta["workspace"] = workspace - if tests is not None: - meta["tests"] = tests - if workspace_git is not None: - 
meta["workspace_git"] = workspace_git - if workspace_git_ref is not None: - meta["workspace_git_ref"] = workspace_git_ref + + # Parse sample-level files + sample_files = doc.get("files") + if isinstance(sample_files, dict): + sample_files = {str(k): str(v) for k, v in sample_files.items()} + else: + sample_files = None + + # Stack files: task-level + sample-level (sample wins on conflict) + merged_files: dict[str, str] | None = None + if task_files is not None or sample_files is not None: + merged_files = {**(task_files or {}), **(sample_files or {})} return Sample( id=doc["id"], @@ -422,7 +448,7 @@ def _resolve_sample( metadata=meta, choices=doc.get("choices"), sandbox=doc.get("sandbox"), - files=doc.get("files"), + files=merged_files, setup=doc.get("setup"), ) @@ -442,7 +468,14 @@ def parse_job(job_path: str, dataset_root: str) -> Job: logs_dir = data.get("logs_dir") or _DEFAULT_LOGS_DIR log_dir = _resolve_log_dir(logs_dir, dataset_root) - sandbox_type = data.get("sandbox_type") or "local" + # Parse sandbox config + sandbox_raw = data.get("sandbox") + sandbox = None + if isinstance(sandbox_raw, dict): + sandbox = sandbox_raw + elif isinstance(sandbox_raw, str): + sandbox = {"environment": sandbox_raw} + max_connections = data.get("max_connections") or 10 # Parse task filters @@ -457,7 +490,7 @@ def parse_job(job_path: str, dataset_root: str) -> Job: for tid, tdata in inline_tasks.items(): tasks[tid] = JobTask.from_yaml(tid, tdata) - # Parse variants + # Parse variants — supports inline dict or list of file paths variants = None variants_raw = data.get("variants") if isinstance(variants_raw, dict): @@ -467,81 +500,51 @@ def parse_job(job_path: str, dataset_root: str) -> Job: variants[str(key)] = dict(value) else: variants[str(key)] = {} + elif isinstance(variants_raw, list): + # List of relative paths to variant definition files + job_dir = os.path.dirname(job_path) + variants = {} + for rel_path in variants_raw: + variant_file = 
os.path.normpath(os.path.join(job_dir, str(rel_path))) + if not os.path.isfile(variant_file): + raise FileNotFoundError( + f"Variant file not found: {variant_file} " + f"(referenced from {job_path})" + ) + file_data = _read_yaml_file(variant_file) + for vname, vdef in file_data.items(): + variants[str(vname)] = dict(vdef) if isinstance(vdef, dict) else {} + + # Parse inspect_eval_arguments + inspect_eval_arguments = data.get("inspect_eval_arguments") + if isinstance(inspect_eval_arguments, dict): + inspect_eval_arguments = dict(inspect_eval_arguments) + else: + inspect_eval_arguments = None + + # Parse models (required) + models_raw = data.get("models") + if not models_raw or not isinstance(models_raw, list) or len(models_raw) == 0: + raise ValueError( + f"Job file '{job_path}' is missing required 'models' field. " + "Specify at least one model, e.g.:\n" + " models:\n - google/gemini-2.5-flash" + ) + models: list[str] = [str(m) for m in models_raw] return Job( log_dir=log_dir, - sandbox_type=sandbox_type, max_connections=max_connections, - models=data.get("models"), + models=models, variants=variants, task_paths=task_paths, tasks=tasks, save_examples=data.get("save_examples") is True, - retry_attempts=data.get("retry_attempts"), - max_retries=data.get("max_retries"), - retry_wait=float(data["retry_wait"]) if data.get("retry_wait") is not None else None, - retry_connections=( - float(data["retry_connections"]) if data.get("retry_connections") is not None else None - ), - retry_cleanup=data.get("retry_cleanup"), - fail_on_error=( - float(data["fail_on_error"]) if data.get("fail_on_error") is not None else None - ), - continue_on_fail=data.get("continue_on_fail"), - retry_on_error=data.get("retry_on_error"), - debug_errors=data.get("debug_errors"), - max_samples=data.get("max_samples"), - max_tasks=data.get("max_tasks"), - max_subprocesses=data.get("max_subprocesses"), - max_sandboxes=data.get("max_sandboxes"), - log_level=data.get("log_level"), - 
log_level_transcript=data.get("log_level_transcript"), - log_format=data.get("log_format"), - tags=data.get("tags"), - metadata=data.get("metadata") if isinstance(data.get("metadata"), dict) else None, - trace=data.get("trace"), - display=data.get("display"), - score=data.get("score"), - limit=data.get("limit"), - sample_id=data.get("sample_id"), - sample_shuffle=data.get("sample_shuffle"), - epochs=data.get("epochs"), - approval=data.get("approval"), - solver=data.get("solver"), - sandbox_cleanup=data.get("sandbox_cleanup"), - model_base_url=data.get("model_base_url"), - model_args=data.get("model_args") if isinstance(data.get("model_args"), dict) else None, - model_roles=( - data.get("model_roles") if isinstance(data.get("model_roles"), dict) else None - ), - task_args=data.get("task_args") if isinstance(data.get("task_args"), dict) else None, - message_limit=data.get("message_limit"), - token_limit=data.get("token_limit"), - time_limit=data.get("time_limit"), - working_limit=data.get("working_limit"), - cost_limit=float(data["cost_limit"]) if data.get("cost_limit") is not None else None, - model_cost_config=( - data.get("model_cost_config") - if isinstance(data.get("model_cost_config"), dict) - else None - ), - log_samples=data.get("log_samples"), - log_realtime=data.get("log_realtime"), - log_images=data.get("log_images"), - log_buffer=data.get("log_buffer"), - log_shared=data.get("log_shared"), - bundle_dir=data.get("bundle_dir"), - bundle_overwrite=data.get("bundle_overwrite"), - log_dir_allow_dirty=data.get("log_dir_allow_dirty"), - eval_set_id=data.get("eval_set_id"), - eval_set_overrides=( - data.get("eval_set_overrides") - if isinstance(data.get("eval_set_overrides"), dict) - else None - ), - task_defaults=( - data.get("task_defaults") if isinstance(data.get("task_defaults"), dict) else None - ), + description=data.get("description"), + sandbox=sandbox, + inspect_eval_arguments=inspect_eval_arguments, + task_filters=data.get("task_filters"), + 
sample_filters=data.get("sample_filters"), ) diff --git a/packages/dataset_config_python/src/dataset_config_python/resolver.py b/packages/dataset_config_python/src/dataset_config_python/resolver.py index 0801b3c..21469c3 100644 --- a/packages/dataset_config_python/src/dataset_config_python/resolver.py +++ b/packages/dataset_config_python/src/dataset_config_python/resolver.py @@ -4,43 +4,35 @@ import glob as globmod import os +from dataclasses import dataclass, field from typing import Any from dataset_config_python.models.context_file import ContextFile from dataset_config_python.models.dataset import Dataset from dataset_config_python.models.eval_set import EvalSet +from dataset_config_python.models.job import Job +from dataset_config_python.models.mcp_server_config import McpServerConfig from dataset_config_python.models.sample import Sample +from dataset_config_python.models.tag_filter import matches_tag_filter from dataset_config_python.models.task import Task from dataset_config_python.models.variant import Variant from dataset_config_python.parser import ParsedTask, find_job_file, parse_job, parse_tasks -# Default models when a job doesn't specify its own. -DEFAULT_MODELS: list[str] = [ - "anthropic/claude-haiku-4-5", - "anthropic/claude-sonnet-4-5", - "anthropic/claude-opus-4-6", - "google/gemini-2.5-flash", - "google/gemini-3-pro-preview", - "google/gemini-3-flash-preview", - "openai/gpt-5-mini", - "openai/gpt-5-nano", - "openai/gpt-5", - "openai/gpt-5-pro", -] - -# Available sandbox configurations. -SANDBOX_REGISTRY: dict[str, dict[str, str]] = { + +# Default sandbox configurations for Flutter evaluations. +# Consumers can pass these to resolve() or provide their own. 
+DEFAULT_SANDBOX_REGISTRY: dict[str, dict[str, str]] = { "podman": {"name": "podman", "path": "./sandboxes/podman/compose.yaml"}, "podman-beta": {"name": "podman", "path": "./sandboxes/podman/compose-beta.yaml"}, "podman-main": {"name": "podman", "path": "./sandboxes/podman/compose-main.yaml"}, } -# Maps Flutter SDK channel names to sandbox registry keys. -SDK_CHANNELS: dict[str, str] = { - "stable": "podman", - "beta": "podman-beta", - "main": "podman-main", -} + +@dataclass +class SandboxConfig: + """Sandbox registry for named sandbox definitions.""" + + registry: dict[str, dict[str, str]] = field(default_factory=dict) def _is_glob(pattern: str) -> bool: @@ -50,14 +42,19 @@ def _is_glob(pattern: str) -> bool: def resolve( dataset_path: str, job_names: list[str], + *, + sandbox_config: SandboxConfig | None = None, ) -> list[EvalSet]: """Resolve dataset + job(s) into EvalSet objects. - This is the main public API of the package. + This is a convenience wrapper around :func:`resolve_from_parsed` that + handles parsing automatically. Use ``resolve_from_parsed`` when you + need to inspect or mutate the parsed config before resolution. Args: dataset_path: Root directory containing ``tasks/`` and ``jobs/``. job_names: Job names (looked up in ``jobs/``) or paths. + sandbox_config: Registry of named sandbox definitions. Returns: A list of EvalSet objects ready for JSON serialization. 
@@ -68,37 +65,79 @@ def resolve( for job_name in job_names: job_path = find_job_file(dataset_path, job_name) job = parse_job(job_path, dataset_path) - results.extend(_resolve_job(task_configs, job, dataset_path)) + results.extend( + resolve_from_parsed( + task_configs=task_configs, + job=job, + dataset_path=dataset_path, + sandbox_config=sandbox_config, + ) + ) return results +def resolve_from_parsed( + task_configs: list[ParsedTask], + job: Job, + dataset_path: str, + *, + sandbox_config: SandboxConfig | None = None, +) -> list[EvalSet]: + """Resolve pre-parsed task configs and a job into EvalSet objects. + + Use this instead of :func:`resolve` when you need to inspect or modify + the parsed configuration before resolution. A typical workflow:: + + tasks = parse_tasks(dataset_path) + job = parse_job(find_job_file(dataset_path, "my_job"), dataset_path) + + # Patch values before resolution + job.log_dir = f"{job.log_dir}/{execution_id}" + + eval_sets = resolve_from_parsed(tasks, job, dataset_path) + + Args: + task_configs: Parsed task configs (from :func:`parse_tasks`). + job: A parsed Job object (from :func:`parse_job`). + dataset_path: Root directory of the dataset (used for path resolution). + sandbox_config: Registry of named sandbox definitions. + + Returns: + A list of EvalSet objects ready for JSON serialization. + """ + sandbox_cfg = sandbox_config or SandboxConfig() + registry = sandbox_cfg.registry + return _resolve_job(task_configs, job, dataset_path, registry) + + def _resolve_job( dataset_tasks: list[ParsedTask], job: Any, dataset_root: str, + sandbox_registry: dict[str, dict[str, str]], ) -> list[EvalSet]: """Resolve task configs and job into EvalSet objects.""" - models = job.models if job.models else list(DEFAULT_MODELS) - sandbox_type_str = job.sandbox_type + if not job.models: + raise ValueError( + "job.models is required and must contain at least one model. 
" + "Specify models in your job YAML, e.g.:\n" + " models:\n - google/gemini-2.5-flash" + ) + models = job.models + sandbox_cfg = job.sandbox or {} + sandbox_type_str = sandbox_cfg.get("environment", "local") expanded_tasks = _expand_task_configs(dataset_tasks, job, sandbox_type_str, dataset_root) - # Group by flutter channel - groups: dict[str | None, list[ParsedTask]] = {} - for tc in expanded_tasks: - key = tc.variant.flutter_channel - groups.setdefault(key, []).append(tc) - return [ _build_eval_set( - task_configs=group, + task_configs=expanded_tasks, log_dir=job.log_dir, models=models, - sandbox=_resolve_sandbox(dataset_root, job, flutter_channel=channel), + sandbox=_resolve_sandbox(dataset_root, job, sandbox_registry), job=job, ) - for channel, group in groups.items() ] @@ -117,8 +156,10 @@ def _build_eval_set( ) -> EvalSet: """Build an EvalSet from resolved ParsedTasks.""" inspect_tasks: list[Task] = [] - is_container = job.sandbox_type and job.sandbox_type != "local" - task_defaults = job.task_defaults or {} + sandbox_cfg = job.sandbox or {} + sandbox_type_str = sandbox_cfg.get("environment", "local") + eval_args = job.inspect_eval_arguments or {} + task_defaults = eval_args.get("task_defaults") or {} for tc in task_configs: # Enrich each sample with task-level metadata @@ -132,21 +173,13 @@ def _build_eval_set( enriched["examples_dir"] = tc.examples_dir enriched["task_variant"] = f"{tc.id}:{tc.variant.name}" - # Build files + setup for sandbox provisioning - files = dict(sample.files) if sample.files else None - setup = sample.setup - workspace = (sample.metadata or {}).get("workspace") - workspace_git = (sample.metadata or {}).get("workspace_git") - workspace_git_ref = (sample.metadata or {}).get("workspace_git_ref") - - if workspace is not None and is_container: - files = {**(files or {}), "/workspace": workspace} - setup = setup or "cd /workspace && flutter pub get" - enriched["workspace"] = "/workspace" - if workspace_git is not None: - 
enriched["workspace_git"] = workspace_git - if workspace_git_ref is not None: - enriched["workspace_git_ref"] = workspace_git_ref + # Stack files: task-level + sample-level (sample wins on conflict) + files: dict[str, str] | None = None + if tc.task_files is not None or sample.files is not None: + files = {**(tc.task_files or {}), **(sample.files or {})} + + # Setup: sample overrides task + setup = sample.setup or tc.task_setup inspect_samples.append( Sample( @@ -164,34 +197,42 @@ def _build_eval_set( dataset = Dataset( samples=inspect_samples, name=f"{tc.id}:{tc.variant.name}", + format=tc.dataset_format, + source=tc.dataset_source, + args=tc.dataset_args, ) # Task metadata (variant config, system message, etc.) task_metadata: dict[str, Any] = {"variant": tc.variant.name} - if tc.variant.context_files: - task_metadata["variant_config"] = { - "context_files": [ - { - "title": cf.metadata.title, - "version": cf.metadata.version, - "content": cf.content, - } - for cf in tc.variant.context_files - ], - "mcp_servers": tc.variant.mcp_servers, - "skill_paths": tc.variant.skill_paths, - } - elif tc.variant.mcp_servers or tc.variant.skill_paths: - task_metadata["variant_config"] = { - "mcp_servers": tc.variant.mcp_servers, - "skill_paths": tc.variant.skill_paths, - } + variant_config: dict[str, Any] = {} + if tc.variant.files: + variant_config["files"] = [ + { + "title": cf.metadata.title, + "version": cf.metadata.version, + "content": cf.content, + } + for cf in tc.variant.files + ] + if tc.variant.mcp_servers: + variant_config["mcp_servers"] = [ + s.model_dump(exclude_none=True) for s in tc.variant.mcp_servers + ] + if tc.variant.skills: + variant_config["skills"] = tc.variant.skills + if tc.variant.task_parameters: + variant_config["task_parameters"] = tc.variant.task_parameters + if variant_config: + task_metadata["variant_config"] = variant_config if tc.system_message is not None: task_metadata["system_message"] = tc.system_message if tc.save_examples: 
task_metadata["save_examples"] = True if tc.examples_dir is not None: task_metadata["examples_dir"] = tc.examples_dir + # Propagate image_prefix from job for container image resolution + if (job.sandbox or {}).get("image_prefix"): + task_metadata["image_prefix"] = job.sandbox["image_prefix"] if tc.metadata: task_metadata.update(tc.metadata) @@ -199,7 +240,7 @@ def _build_eval_set( task_sandbox = None if tc.sandbox is not None: task_sandbox = tc.sandbox - elif tc.sandbox_type and tc.sandbox_type != "local": + elif sandbox_type_str != "local": task_sandbox = _serialize_sandbox(sandbox) # Resolve task-level settings with precedence: @@ -207,38 +248,58 @@ def _build_eval_set( resolved_time_limit = ( tc.time_limit or task_defaults.get("time_limit") - or (300 if job.sandbox_type != "local" else None) + or (300 if sandbox_type_str != "local" else None) ) inspect_tasks.append( Task( name=f"{tc.id}:{tc.variant.name}", - task_func=tc.task_func, + func=tc.func, dataset=dataset, sandbox=task_sandbox, metadata=task_metadata, + system_message=tc.system_message, + sandbox_parameters=tc.sandbox_parameters, model=tc.model or task_defaults.get("model"), config=tc.config or task_defaults.get("config"), model_roles=tc.model_roles or task_defaults.get("model_roles"), approval=tc.approval or task_defaults.get("approval"), epochs=tc.epochs or task_defaults.get("epochs"), fail_on_error=tc.fail_on_error or task_defaults.get("fail_on_error"), - continue_on_fail=tc.continue_on_fail if tc.continue_on_fail is not None else task_defaults.get("continue_on_fail"), + continue_on_fail=tc.continue_on_fail + if tc.continue_on_fail is not None + else task_defaults.get("continue_on_fail"), message_limit=tc.message_limit or task_defaults.get("message_limit"), token_limit=tc.token_limit or task_defaults.get("token_limit"), time_limit=resolved_time_limit, working_limit=tc.working_limit or task_defaults.get("working_limit"), - cost_limit=tc.cost_limit if tc.cost_limit is not None else ( - 
float(task_defaults["cost_limit"]) if task_defaults.get("cost_limit") is not None else None + cost_limit=tc.cost_limit + if tc.cost_limit is not None + else ( + float(task_defaults["cost_limit"]) + if task_defaults.get("cost_limit") is not None + else None ), early_stopping=tc.early_stopping or task_defaults.get("early_stopping"), display_name=tc.display_name or task_defaults.get("display_name"), - version=tc.version if tc.version is not None else (task_defaults.get("version") or 0), + version=tc.version + if tc.version is not None + else (task_defaults.get("version") or 0), ) ) - # Build EvalSet with all job-level parameters - overrides = job.eval_set_overrides or {} + # Build EvalSet with all job-level parameters from inspect_eval_arguments + eval_set_overrides = eval_args.get("eval_set_overrides") or {} + + # Helper to get a value from eval_args then overrides + def _get(key: str, default: Any = None) -> Any: + v = eval_args.get(key) + if v is not None: + return v + v = eval_set_overrides.get(key) + if v is not None: + return v + return default return EvalSet( tasks=inspect_tasks, @@ -246,75 +307,64 @@ def _build_eval_set( model=models, sandbox=_serialize_sandbox(sandbox), # Retry - retry_attempts=job.retry_attempts or overrides.get("retry_attempts") or 10, - retry_wait=job.retry_wait or overrides.get("retry_wait") or 60.0, - retry_connections=job.retry_connections or overrides.get("retry_connections") or 0.5, - retry_cleanup=job.retry_cleanup if job.retry_cleanup is not None else overrides.get("retry_cleanup"), - retry_on_error=job.retry_on_error or job.max_retries or overrides.get("retry_on_error"), + retry_attempts=_get("retry_attempts", 10), + retry_wait=float(_get("retry_wait", 60.0)), + retry_connections=float(_get("retry_connections", 0.5)), + retry_cleanup=_get("retry_cleanup"), + retry_on_error=_get("retry_on_error") or _get("max_retries"), # Error handling - fail_on_error=job.fail_on_error if job.fail_on_error is not None else 
(overrides.get("fail_on_error") or 0.05), - continue_on_fail=job.continue_on_fail if job.continue_on_fail is not None else overrides.get("continue_on_fail"), - debug_errors=job.debug_errors if job.debug_errors is not None else overrides.get("debug_errors"), + fail_on_error=float(_get("fail_on_error", 0.05)), + continue_on_fail=_get("continue_on_fail"), + debug_errors=_get("debug_errors"), # Concurrency - max_samples=job.max_samples or overrides.get("max_samples"), - max_tasks=job.max_tasks or overrides.get("max_tasks"), - max_subprocesses=job.max_subprocesses or overrides.get("max_subprocesses"), - max_sandboxes=job.max_sandboxes or overrides.get("max_sandboxes"), + max_samples=_get("max_samples"), + max_tasks=_get("max_tasks"), + max_subprocesses=_get("max_subprocesses"), + max_sandboxes=_get("max_sandboxes"), # Logging - log_level=job.log_level or overrides.get("log_level") or "info", - log_level_transcript=job.log_level_transcript or overrides.get("log_level_transcript"), - log_format=job.log_format or overrides.get("log_format") or "json", - log_samples=job.log_samples if job.log_samples is not None else overrides.get("log_samples"), - log_realtime=job.log_realtime if job.log_realtime is not None else overrides.get("log_realtime"), - log_images=job.log_images if job.log_images is not None else overrides.get("log_images"), - log_buffer=job.log_buffer or overrides.get("log_buffer"), - log_shared=job.log_shared or overrides.get("log_shared"), - log_dir_allow_dirty=job.log_dir_allow_dirty if job.log_dir_allow_dirty is not None else overrides.get("log_dir_allow_dirty"), + log_level=_get("log_level", "info"), + log_level_transcript=_get("log_level_transcript"), + log_format=_get("log_format", "json"), + log_samples=_get("log_samples"), + log_realtime=_get("log_realtime"), + log_images=_get("log_images"), + log_buffer=_get("log_buffer"), + log_shared=_get("log_shared"), + log_dir_allow_dirty=_get("log_dir_allow_dirty"), # Model config - 
model_base_url=job.model_base_url or overrides.get("model_base_url"), - model_args=job.model_args or overrides.get("model_args") or {}, - model_roles=job.model_roles or overrides.get("model_roles"), - task_args=job.task_args or overrides.get("task_args") or {}, - model_cost_config=job.model_cost_config or overrides.get("model_cost_config"), + model_base_url=_get("model_base_url"), + model_args=_get("model_args", {}), + model_roles=_get("model_roles"), + task_args=_get("task_args", {}), + model_cost_config=_get("model_cost_config"), # Sandbox - sandbox_cleanup=job.sandbox_cleanup if job.sandbox_cleanup is not None else overrides.get("sandbox_cleanup"), + sandbox_cleanup=_get("sandbox_cleanup"), # Sample control - limit=job.limit or overrides.get("limit"), - sample_id=job.sample_id or overrides.get("sample_id"), - sample_shuffle=job.sample_shuffle or overrides.get("sample_shuffle"), - epochs=job.epochs or overrides.get("epochs"), + limit=_get("limit"), + sample_id=_get("sample_id"), + sample_shuffle=_get("sample_shuffle"), + epochs=_get("epochs"), # Misc - tags=job.tags or overrides.get("tags"), - metadata=job.metadata or overrides.get("metadata"), - trace=job.trace if job.trace is not None else overrides.get("trace"), - display=job.display or overrides.get("display"), - approval=job.approval or overrides.get("approval"), - solver=job.solver or overrides.get("solver"), - score=job.score if job.score is not None else (overrides.get("score") if overrides.get("score") is not None else True), + tags=_get("tags"), + metadata=_get("metadata"), + trace=_get("trace"), + display=_get("display"), + approval=_get("approval"), + solver=_get("solver"), + score=_get("score", True), # Limits - message_limit=job.message_limit or overrides.get("message_limit"), - token_limit=job.token_limit or overrides.get("token_limit"), - time_limit=job.time_limit or overrides.get("time_limit"), - working_limit=job.working_limit or overrides.get("working_limit"), - cost_limit=job.cost_limit if 
job.cost_limit is not None else ( - float(overrides["cost_limit"]) if overrides.get("cost_limit") is not None else None - ), + message_limit=_get("message_limit"), + token_limit=_get("token_limit"), + time_limit=_get("time_limit"), + working_limit=_get("working_limit"), + cost_limit=float(_get("cost_limit")) if _get("cost_limit") is not None else None, # Bundling - bundle_dir=job.bundle_dir or overrides.get("bundle_dir"), - bundle_overwrite=job.bundle_overwrite if job.bundle_overwrite is not None else (overrides.get("bundle_overwrite") or False), - eval_set_id=job.eval_set_id or overrides.get("eval_set_id"), + bundle_dir=_get("bundle_dir"), + bundle_overwrite=_get("bundle_overwrite", False), + eval_set_id=_get("eval_set_id"), ) -# --------------------------------------------------------------------------- -# Model resolution -# --------------------------------------------------------------------------- - - -def _resolve_models(job: Any) -> list[str]: - if job.models: - return job.models - return list(DEFAULT_MODELS) # --------------------------------------------------------------------------- @@ -325,27 +375,17 @@ def _resolve_models(job: Any) -> list[str]: def _resolve_sandbox( dataset_root: str, job: Any, - *, - flutter_channel: str | None = None, + sandbox_registry: dict[str, dict[str, str]], ) -> Any: """Resolve sandbox spec for a given config.""" - sandbox_type = job.sandbox_type + sandbox_cfg = job.sandbox or {} + sandbox_type = sandbox_cfg.get("environment", "local") if not sandbox_type or sandbox_type == "local": return "local" - # Channel override - if flutter_channel and flutter_channel in SDK_CHANNELS: - registry_key = SDK_CHANNELS[flutter_channel] - if registry_key in SANDBOX_REGISTRY: - defn = SANDBOX_REGISTRY[registry_key] - sandbox_path = defn["path"] - if not os.path.isabs(sandbox_path): - sandbox_path = os.path.normpath(os.path.join(dataset_root, sandbox_path)) - return {"type": defn["name"], "path": sandbox_path} - # Named sandbox from registry - 
if sandbox_type in SANDBOX_REGISTRY: - defn = SANDBOX_REGISTRY[sandbox_type] + if sandbox_type in sandbox_registry: + defn = sandbox_registry[sandbox_type] sandbox_path = defn["path"] if not os.path.isabs(sandbox_path): sandbox_path = os.path.normpath(os.path.join(dataset_root, sandbox_path)) @@ -372,18 +412,29 @@ def _expand_task_configs( for tc in dataset_tasks: task_id = tc.id - # Filter by job.tasks + # Filter by job.tasks (ID-based) if job.tasks is not None and task_id not in job.tasks: continue - # Determine effective variants (intersection) - effective_variants: dict[str, dict[str, Any]] = {} - for vname, vdef in job_variants.items(): - if tc.allowed_variants is None or vname in tc.allowed_variants: - effective_variants[vname] = vdef + # Filter by job.task_filters (tag-based) + if job.task_filters is not None: + task_tags = (tc.metadata or {}).get("tags", []) + if not matches_tag_filter(task_tags, job.task_filters): + continue - # Get job-level task overrides + # Start with all job-level variants + effective_variants: dict[str, dict[str, Any]] = dict(job_variants) + + # Apply per-task include/exclude variants from job.tasks. 
job_task = job.tasks.get(task_id) if job.tasks else None + if job_task and job_task.include_variants: + effective_variants = { + k: v for k, v in effective_variants.items() if k in job_task.include_variants + } + if job_task and job_task.exclude_variants: + effective_variants = { + k: v for k, v in effective_variants.items() if k not in job_task.exclude_variants + } # Apply sample filtering samples = tc.samples @@ -393,10 +444,21 @@ def _expand_task_configs( if job_task.exclude_samples: samples = [s for s in samples if s.id not in job_task.exclude_samples] - # Apply system_message override + # Apply sample tag filtering (job-level) + if job.sample_filters is not None: + samples = [ + s + for s in samples + if matches_tag_filter((s.metadata or {}).get("tags", []), job.sample_filters) + ] + + # Apply system_message from task system_message = tc.system_message - if job_task and job_task.system_message is not None: - system_message = job_task.system_message + + # Merge job-task args into metadata + merged_metadata = dict(tc.metadata) if tc.metadata else None + if job_task and job_task.args: + merged_metadata = {**(merged_metadata or {}), "args": job_task.args} # Create one ParsedTask per effective variant for vname, vdef in effective_variants.items(): @@ -412,9 +474,9 @@ def _expand_task_configs( variant=variant, sandbox_type=sandbox_type, system_message=system_message, - allowed_variants=None, save_examples=job.save_examples, examples_dir=examples_dir, + metadata=merged_metadata, ) ) @@ -435,16 +497,17 @@ def _resolve_variant( if not vdef: return Variant(name=name) - # Load context files (with glob support) + # Load context files (with glob support) — YAML key is "files" context_files: list[ContextFile] = [] - cf_paths: list[str] = vdef.get("context_files") or [] + cf_paths: list[str] = vdef.get("files") or [] for cf_path in cf_paths: if _is_glob(cf_path): full_pattern = os.path.join(dataset_root, cf_path) matched = sorted( f for f in globmod.glob(full_pattern, 
recursive=True) - if os.path.isfile(f) and (f.endswith(".yaml") or f.endswith(".yml") or f.endswith(".md")) + if os.path.isfile(f) + and (f.endswith(".yaml") or f.endswith(".yml") or f.endswith(".md")) ) if not matched: raise FileNotFoundError(f"No context files matched pattern: {cf_path}") @@ -454,16 +517,14 @@ def _resolve_variant( full_path = os.path.normpath(os.path.join(dataset_root, cf_path)) context_files.append(ContextFile.load(full_path)) - # Resolve skill paths (with glob support) + # Resolve skill paths (with glob support) — YAML key is "skills" skill_paths: list[str] = [] - raw_skills: list[str] = vdef.get("skills") or vdef.get("skill_paths") or [] + raw_skills: list[str] = vdef.get("skills") or [] for skill_path_str in raw_skills: if _is_glob(skill_path_str): full_pattern = os.path.join(dataset_root, skill_path_str) matched_dirs = sorted( - d - for d in globmod.glob(full_pattern, recursive=True) - if os.path.isdir(d) + d for d in globmod.glob(full_pattern, recursive=True) if os.path.isdir(d) ) valid_dirs = [d for d in matched_dirs if os.path.isfile(os.path.join(d, "SKILL.md"))] if not valid_dirs: @@ -480,12 +541,21 @@ def _resolve_variant( ) skill_paths.append(skill_dir) + # Resolve MCP servers + mcp_servers: list[McpServerConfig] = [] + raw_mcp: list[Any] = vdef.get("mcp_servers") or [] + for raw in raw_mcp: + mcp_servers.append(McpServerConfig.from_yaml(raw)) + + # Task parameters + task_parameters: dict[str, Any] = vdef.get("task_parameters") or {} + return Variant( name=name, - context_files=context_files, - mcp_servers=vdef.get("mcp_servers") or [], - skill_paths=skill_paths, - flutter_channel=vdef.get("flutter_channel"), + files=context_files, + mcp_servers=mcp_servers, + skills=skill_paths, + task_parameters=task_parameters, ) diff --git a/packages/dataset_config_python/tests/test_config.py b/packages/dataset_config_python/tests/test_config.py index b79f9a9..865b7bd 100644 --- a/packages/dataset_config_python/tests/test_config.py +++ 
b/packages/dataset_config_python/tests/test_config.py @@ -32,21 +32,25 @@ def dataset_dir(tmp_path): task_dir.mkdir(parents=True) task_yaml = task_dir / "task.yaml" task_yaml.write_text( - """ + """\ id: dart_qa func: question_answer system_message: "You are an expert." -samples: - inline: - - id: sample_1 - input: "What is Dart?" - target: "A programming language." - difficulty: easy - - id: sample_2 - input: "What is Flutter?" - target: "A UI framework." - difficulty: medium - tags: ui, framework +dataset: + samples: + inline: + - id: sample_1 + input: "What is Dart?" + target: "A programming language." + difficulty: easy + - id: sample_2 + input: "What is Flutter?" + target: "A UI framework." + metadata: + difficulty: medium + tags: + - ui + - framework """ ) @@ -55,18 +59,16 @@ def dataset_dir(tmp_path): code_gen_dir.mkdir(parents=True) code_gen_yaml = code_gen_dir / "task.yaml" code_gen_yaml.write_text( - """ -id: code_gen + """id: code_gen func: flutter_code_gen -time_limit: 600 -allowed_variants: - - baseline - - context_only -samples: - inline: - - id: sample_1 - input: "Create a counter app." - target: "A working counter app." +inspect_task_args: + time_limit: 600 +dataset: + samples: + inline: + - id: sample_1 + input: "Create a counter app." + target: "A working counter app." 
""" ) @@ -75,16 +77,14 @@ def dataset_dir(tmp_path): jobs_dir.mkdir() job_yaml = jobs_dir / "local_dev.yaml" job_yaml.write_text( - """ -logs_dir: ./logs -sandbox_type: local + """logs_dir: ./logs max_connections: 5 models: - google/gemini-2.5-flash variants: baseline: {} context_only: - context_files: [] + files: [] """ ) @@ -118,9 +118,10 @@ def dataset_dir_with_sample_files(tmp_path): """ id: qa func: question_answer -samples: - paths: - - samples/basics.yaml +dataset: + samples: + paths: + - samples/basics.yaml """ ) @@ -129,6 +130,8 @@ def dataset_dir_with_sample_files(tmp_path): (jobs_dir / "default.yaml").write_text( """ logs_dir: ./logs +models: + - test/model """ ) @@ -162,10 +165,10 @@ def test_dataset_creation(self): def test_variant_defaults(self): v = Variant() assert v.name == "baseline" - assert v.context_files == [] + assert v.files == [] assert v.mcp_servers == [] - assert v.skill_paths == [] - assert v.flutter_channel is None + assert v.skills == [] + assert v.task_parameters == {} def test_job_task_from_yaml_none(self): jt = JobTask.from_yaml("my_task", None) @@ -178,7 +181,7 @@ def test_job_task_from_yaml_with_data(self): def test_eval_set_serialization(self): es = EvalSet( - tasks=[Task(name="test:baseline", task_func="qa")], + tasks=[Task(name="test:baseline", func="qa")], log_dir="/tmp/logs", model=["google/gemini-2.5-flash"], ) @@ -216,11 +219,6 @@ def test_parse_tasks_metadata(self, dataset_dir): assert s2.metadata["tags"] == ["ui", "framework"] assert s2.metadata["difficulty"] == "medium" - def test_parse_tasks_allowed_variants(self, dataset_dir): - tasks = parse_tasks(str(dataset_dir)) - code_gen = next(t for t in tasks if t.id == "code_gen") - assert code_gen.allowed_variants == ["baseline", "context_only"] - def test_parse_tasks_time_limit(self, dataset_dir): tasks = parse_tasks(str(dataset_dir)) code_gen = next(t for t in tasks if t.id == "code_gen") @@ -229,7 +227,6 @@ def test_parse_tasks_time_limit(self, dataset_dir): def 
test_parse_job(self, dataset_dir): job_path = os.path.join(str(dataset_dir), "jobs", "local_dev.yaml") job = parse_job(job_path, str(dataset_dir)) - assert job.sandbox_type == "local" assert job.max_connections == 5 assert job.models == ["google/gemini-2.5-flash"] @@ -255,6 +252,80 @@ def test_parse_tasks_empty_dir(self, tmp_path): tasks = parse_tasks(str(tmp_path)) assert tasks == [] + def test_parse_task_json_dataset(self, tmp_path): + """Test parsing a task with a json dataset format.""" + task_dir = tmp_path / "tasks" / "json_ds" + task_dir.mkdir(parents=True) + (task_dir / "task.yaml").write_text( + """\ +id: json_ds +func: question_answer +dataset: + json: gs://bucket/data.jsonl + args: + auto_id: true + shuffle: true +""" + ) + tasks = parse_tasks(str(tmp_path)) + assert len(tasks) == 1 + assert tasks[0].dataset_format == "json" + assert tasks[0].dataset_source == "gs://bucket/data.jsonl" + assert tasks[0].dataset_args == {"auto_id": True, "shuffle": True} + assert tasks[0].samples == [] + + def test_parse_task_csv_dataset(self, tmp_path): + """Test parsing a task with a csv dataset format.""" + task_dir = tmp_path / "tasks" / "csv_ds" + task_dir.mkdir(parents=True) + (task_dir / "task.yaml").write_text( + """\ +id: csv_ds +func: question_answer +dataset: + csv: ./data.csv + args: + delimiter: "\\t" +""" + ) + tasks = parse_tasks(str(tmp_path)) + assert len(tasks) == 1 + assert tasks[0].dataset_format == "csv" + assert tasks[0].dataset_source == "./data.csv" + + def test_parse_task_mutually_exclusive_dataset_keys(self, tmp_path): + """Test that specifying both json and csv in dataset raises error.""" + task_dir = tmp_path / "tasks" / "bad_ds" + task_dir.mkdir(parents=True) + (task_dir / "task.yaml").write_text( + """\ +id: bad_ds +func: question_answer +dataset: + json: ./data.jsonl + csv: ./data.csv +""" + ) + with pytest.raises(ValueError, match="exactly one"): + parse_tasks(str(tmp_path)) + + def test_parse_job_missing_models(self, tmp_path): + """Test 
that a job without models raises a validation error.""" + jobs_dir = tmp_path / "jobs" + jobs_dir.mkdir() + (jobs_dir / "bad.yaml").write_text( + """\ +logs_dir: ./logs +""" + ) + job_path = str(jobs_dir / "bad.yaml") + with pytest.raises(ValueError, match="models"): + parse_job(job_path, str(tmp_path)) + + +# Runner integration tests for json/csv datasets are in: +# packages/dash_evals/tests/test_json_runner.py + # --------------------------------------------------------------------------- # Resolver tests @@ -312,11 +383,11 @@ def test_write_single(self, dataset_dir, tmp_path): def test_write_multiple(self, tmp_path): es1 = EvalSet( - tasks=[Task(name="t1:baseline", task_func="qa")], + tasks=[Task(name="t1:baseline", func="qa")], log_dir="/tmp/logs1", ) es2 = EvalSet( - tasks=[Task(name="t2:baseline", task_func="qa")], + tasks=[Task(name="t2:baseline", func="qa")], log_dir="/tmp/logs2", ) output_dir = str(tmp_path / "output") diff --git a/packages/devals_cli/lib/src/commands/create_job_command.dart b/packages/devals_cli/lib/src/commands/create_job_command.dart index ba4dc03..f9acf7b 100644 --- a/packages/devals_cli/lib/src/commands/create_job_command.dart +++ b/packages/devals_cli/lib/src/commands/create_job_command.dart @@ -1,5 +1,5 @@ import 'package:args/command_runner.dart'; -import 'package:dataset_config_dart/dataset_config_dart.dart'; + import 'package:devals/src/dataset/dataset_reader.dart'; import 'package:devals/src/dataset/eval_writer.dart'; import 'package:devals/src/dataset/file_templates/job_template.dart'; @@ -19,7 +19,14 @@ class CreateJobCommand extends Command { terminal.writeln(); // Get available options from the generated registries and filesystem - final models = List.of(kDefaultModels); + // Suggested models for model selection prompt + final models = [ + 'google/gemini-2.5-flash', + 'google/gemini-3-flash-preview', + 'google/gemini-3-pro-preview', + 'anthropic/claude-sonnet-4-5', + 'openai/gpt-5-mini', + ]; final variants = 
datasetReader.getVariants(); final tasks = datasetReader.getTasks(); @@ -65,9 +72,14 @@ class CreateJobCommand extends Command { 'Select models', help: 'Tasks will run against each of these', options: models.map((m) => Option(label: m, value: m)).toList(), - key: 'models', + validator: (List? selection) { + if (selection == null || selection.isEmpty) { + return 'You must select at least one model.'; + } + return null; + }, defaultValue: models - .where((String name) => name.contains('gemini')) + .where((name) => name.contains('gemini')) .toList(), ), Multiselect( diff --git a/packages/devals_cli/lib/src/commands/create_pipeline_command.dart b/packages/devals_cli/lib/src/commands/create_pipeline_command.dart index 22b1e61..17942da 100644 --- a/packages/devals_cli/lib/src/commands/create_pipeline_command.dart +++ b/packages/devals_cli/lib/src/commands/create_pipeline_command.dart @@ -1,7 +1,7 @@ import 'dart:io'; import 'package:args/command_runner.dart'; -import 'package:dataset_config_dart/dataset_config_dart.dart'; + import 'package:devals/src/cli_exception.dart'; import 'package:devals/src/dataset/eval_writer.dart'; import 'package:devals/src/dataset/file_templates/job_template.dart'; @@ -34,7 +34,13 @@ class CreatePipelineCommand extends Command { } final availableVariants = datasetReader.getVariants(); - final models = List.of(kDefaultModels); + final models = [ + 'google/gemini-2.5-flash', + 'google/gemini-3-flash-preview', + 'google/gemini-3-pro-preview', + 'anthropic/claude-sonnet-4-5', + 'openai/gpt-5-mini', + ]; if (models.isEmpty) { throw CliException( 'No models configured.', @@ -212,7 +218,7 @@ class CreatePipelineCommand extends Command { 'Models', help: 'Choose which models to evaluate. 
You need API keys for each provider.', - options: models.map((m) => Option(label: m, value: m)).toList(), + options: models.map((m) => Option(label: m, value: m)).toList(), defaultValue: [defaultModel], key: 'models', ), diff --git a/packages/devals_cli/lib/src/dataset/dry_run.dart b/packages/devals_cli/lib/src/dataset/dry_run.dart index 891f700..1a61dcc 100644 --- a/packages/devals_cli/lib/src/dataset/dry_run.dart +++ b/packages/devals_cli/lib/src/dataset/dry_run.dart @@ -32,11 +32,11 @@ bool _validateConfig(EvalSet config) { final taskSummaries = {}; for (final task in config.tasks) { - final name = task.name ?? task.taskFunc ?? '(unknown)'; + final name = task.name ?? task.func ?? '(unknown)'; - if (task.taskFunc == null) { + if (task.func == null) { warnings.add( - 'Task "$name" has no task_func — Mode 2 hydration required', + 'Task "$name" has no func — Mode 2 hydration required', ); } diff --git a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_job_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_job_template.dart index b3007c5..e4fbb72 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_job_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_job_template.dart @@ -63,8 +63,12 @@ $modelsList # Example: # variants: # baseline: {} # no extra features -# context_only: { context_files: [../../context/flutter.md] } -# mcp_only: { mcp_servers: [dart] } +# context_only: { files: [../../context/flutter.md] } +# mcp_only: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] } +# +# Variants can also be loaded from separate files: +# variants: +# - ./variants/common.yaml # ============================================================================= # TASKS @@ -80,11 +84,10 @@ $modelsList # tasks: # inline: # task_id: -# # (use allowed_variants in task.yaml to whitelist variants) # include-samples: [sample1] # Only run specific 
samples (mutually exclusive with exclude) # exclude-samples: [sample2] # Skip specific samples (mutually exclusive with include) -# system_message: | # Override system prompt for this task -# Custom instructions... +# include-variants: [baseline] # Only run these variants for this task +# exclude-variants: [with_mcp] # Skip these variants for this task # # Simple format (run all samples with job-level settings): # tasks: diff --git a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart index 589a123..1621d35 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/init_templates/init_sample_template.dart @@ -1,34 +1,36 @@ /// Template for the starter task created by `devals init`. /// /// Creates a task.yaml at tasks/get_started/task.yaml that points at -/// the parent project as its workspace. +/// the parent project via files. String initTaskTemplate() { return ''' # ============================================================================= # Starter Task # ============================================================================= -# This task points at your project root as its workspace and runs a simple +# This task copies your project root into the sandbox and runs a simple # codebase analysis evaluation. func: analyze_codebase -# Workspace: points to the project root containing pubspec.yaml -workspace: - path: ../../ +# Files: copies the project root into /workspace in the sandbox +files: + /workspace: ../../ +setup: "cd /workspace && flutter pub get" -samples: - inline: - - id: get_started - difficulty: easy - tags: [] - # Input: The prompt given to the model - input: | - Explore this codebase and suggest one improvement - to the code quality, readability, or architecture. 
- # Target: Expected output or grading criteria - target: | - The suggestion should be specific, actionable, and reference - actual code in the project. It should explain why the change - improves the codebase. +dataset: + samples: + inline: + - id: get_started + difficulty: easy + tags: [] + # Input: The prompt given to the model + input: | + Explore this codebase and suggest one improvement + to the code quality, readability, or architecture. + # Target: Expected output or grading criteria + target: | + The suggestion should be specific, actionable, and reference + actual code in the project. It should explain why the change + improves the codebase. '''; } diff --git a/packages/devals_cli/lib/src/dataset/file_templates/job_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/job_template.dart index b76a758..32b3ed6 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/job_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/job_template.dart @@ -80,13 +80,14 @@ $modelsList # Each variant defines what tools/context the agent has access to. # # Format: variant_name: { config } -# baseline: {} # no extra features -# context_only: { context_files: [./path/to.md] } # injects context files -# mcp_only: { mcp_servers: [dart] } # enables MCP servers -# full: { context_files: [...], mcp_servers: [...] } +# baseline: {} # no extra features +# context_only: { files: [./path/to.md] } # injects context files +# mcp_only: { mcp_servers: [{name: dart, command: dart, args: [...]}] } # enables MCP servers +# full: { files: [...], mcp_servers: [...] } # -# Tasks can optionally restrict which variants they support -# via `allowed_variants:` in their task.yaml. 
+# Variants can also be loaded from separate files: +# variants: +# - ./variants/common.yaml variants: ${variantsMap.toString().trimRight()} @@ -105,8 +106,8 @@ ${variantsMap.toString().trimRight()} # task_id: # include-samples: [sample1] # Only run specific samples (mutually exclusive with exclude) # exclude-samples: [sample2] # Skip specific samples (mutually exclusive with include) -# system_message: | # Override system prompt for this task -# Custom instructions... +# include-variants: [baseline] # Only run these variants for this task +# exclude-variants: [with_mcp] # Exclude these variants for this task # # Simple format (run all samples with job-level settings): # tasks: diff --git a/packages/devals_cli/lib/src/dataset/file_templates/sample_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/sample_template.dart index 0a2b40e..582f9be 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/sample_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/sample_template.dart @@ -11,7 +11,7 @@ String sampleTemplate({ TemplatePackage? templatePackage, String? workspaceValue, }) { - final workspaceSection = _buildSampleWorkspaceSection( + final filesSection = _buildSampleFilesSection( workspaceType, templatePackage: templatePackage, workspaceValue: workspaceValue, @@ -20,7 +20,7 @@ String sampleTemplate({ return ''' - id: $id difficulty: $difficulty - tags: []$workspaceSection + tags: []$filesSection input: | # Write prompt here target: | @@ -28,35 +28,25 @@ String sampleTemplate({ '''; } -/// Builds workspace/tests lines for an inline sample block. +/// Builds files lines for an inline sample block. /// -/// Only needed if the sample overrides the task-level workspace. -String _buildSampleWorkspaceSection( +/// Only needed if the sample overrides the task-level files. +String _buildSampleFilesSection( WorkspaceType? workspaceType, { TemplatePackage? templatePackage, String? 
workspaceValue, }) { return switch (workspaceType) { - WorkspaceType.git => - ''' - - workspace: - git: ${workspaceValue ?? ''}''', WorkspaceType.path => ''' - workspace: - path: ${workspaceValue ?? ''}''', - WorkspaceType.template => - ''' - - workspace: - template: ${templatePackage?.yamlValue ?? ''}''', + files: + /workspace: ${workspaceValue ?? ''}''', WorkspaceType.create => ''' - workspace: - path: ./project''', + files: + /workspace: ./project''', _ => '', }; } diff --git a/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart b/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart index 4aa092d..16e6a75 100644 --- a/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart +++ b/packages/devals_cli/lib/src/dataset/file_templates/task_template.dart @@ -12,15 +12,14 @@ String taskTemplate({ List variants = const [], String? systemMessage, }) { - final workspaceSection = _buildTaskWorkspaceSection( + final filesSection = _buildTaskFilesSection( workspaceType, templatePackage: templatePackage, workspaceValue: workspaceValue, ); - final variantsLine = variants.isNotEmpty - ? 'allowed_variants: [${variants.join(', ')}]\n' - : ''; + final variantsLine = ''; + final systemMessageBlock = systemMessage != null && systemMessage.isNotEmpty ? 'system_message: |\n ${systemMessage.replaceAll('\n', '\n ')}\n' @@ -30,55 +29,44 @@ String taskTemplate({ # Task configuration # See docs/configuration_reference.md for full schema reference. func: $taskFunc -$variantsLine$systemMessageBlock$workspaceSection -samples: - inline: - - id: sample_1 - difficulty: medium - input: | - # Write prompt here - target: | - # Write target here +$variantsLine$systemMessageBlock$filesSection +dataset: + samples: + inline: + - id: sample_1 + difficulty: medium + input: | + # Write prompt here + target: | + # Write target here '''; } -/// Builds the workspace section for a task-level definition. 
-String _buildTaskWorkspaceSection( +/// Builds the files/setup section for a task-level definition. +String _buildTaskFilesSection( WorkspaceType? workspaceType, { TemplatePackage? templatePackage, String? workspaceValue, }) { return switch (workspaceType) { - WorkspaceType.git => - ''' -workspace: - git: ${workspaceValue ?? ''} - # ref: # Optional -''', WorkspaceType.path => ''' -workspace: - path: ${workspaceValue ?? './project'} -''', - WorkspaceType.template => - ''' -workspace: - template: ${templatePackage?.yamlValue ?? ''} +files: + /workspace: ${workspaceValue ?? './project'} +setup: "cd /workspace && flutter pub get" ''', WorkspaceType.create => ''' -workspace: - path: ./project +files: + /workspace: ./project +setup: "cd /workspace && flutter pub get" ''', _ => ''' -# Workspace configuration (uncomment one): -# workspace: -# template: flutter_app # OR dart_package OR jaspr_app -# workspace: -# path: ./project -# workspace: -# git: +# Files to copy into the sandbox (uncomment as needed): +# files: +# /workspace: ./project +# setup: "cd /workspace && flutter pub get" ''', }; } diff --git a/packages/devals_cli/lib/src/dataset/variant_defaults.dart b/packages/devals_cli/lib/src/dataset/variant_defaults.dart index 0f475c9..41a42fe 100644 --- a/packages/devals_cli/lib/src/dataset/variant_defaults.dart +++ b/packages/devals_cli/lib/src/dataset/variant_defaults.dart @@ -11,17 +11,17 @@ enum DefaultVariants { flutterRules( 'flutter_rules', 'Run with Flutter rules context files.', - 'flutter_rules: { context_files: [./context_files/flutter.md] }', + 'flutter_rules: { files: [./context_files/flutter.md] }', ), withSkills( 'with_skills', 'Run with skills files.', - 'with_skills: { skill_paths: [./skills/*] }', + 'with_skills: { skills: [./skills/*] }', ), withMCP( 'with_mcp', 'Run with Dart MCP server available.', - 'with_mcp: { mcp_servers: [dart] }', + 'with_mcp: { mcp_servers: [{name: dart, command: dart, args: [mcp-server]}] }', ) ; diff --git 
a/packages/devals_cli/test/dataset/sample_template_test.dart b/packages/devals_cli/test/dataset/sample_template_test.dart index 51257b6..6ba3140 100644 --- a/packages/devals_cli/test/dataset/sample_template_test.dart +++ b/packages/devals_cli/test/dataset/sample_template_test.dart @@ -21,41 +21,30 @@ void main() { expect(result, contains('tags: []')); }); - test('with git workspace includes git section', () { - final result = sampleTemplate( - id: 'test', - difficulty: 'easy', - workspaceType: WorkspaceType.git, - workspaceValue: 'https://github.com/example/repo.git', - ); - expect(result, contains('git:')); - expect(result, contains('https://github.com/example/repo.git')); - }); - - test('with path workspace includes path section', () { + test('with path workspace includes files section', () { final result = sampleTemplate( id: 'test', difficulty: 'easy', workspaceType: WorkspaceType.path, workspaceValue: './project', ); - expect(result, contains('path:')); - expect(result, contains('./project')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./project')); }); - test('with template workspace includes template section', () { + test('with create workspace includes files section', () { final result = sampleTemplate( id: 'test', difficulty: 'easy', - workspaceType: WorkspaceType.template, - templatePackage: TemplatePackage.flutterApp, + workspaceType: WorkspaceType.create, ); - expect(result, contains('flutter_app')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./project')); }); - test('without workspace type has no workspace section', () { + test('without workspace type has no files section', () { final result = sampleTemplate(id: 'test', difficulty: 'easy'); - expect(result, isNot(contains('workspace:'))); + expect(result, isNot(contains('files:'))); }); test('generates indented block for appending to task file', () { @@ -64,22 +53,32 @@ void main() { expect(result, contains(' - id: test')); }); - 
test('git type with null value uses placeholder', () { + test('path type with null value uses placeholder', () { + final result = sampleTemplate( + id: 'test', + difficulty: 'easy', + workspaceType: WorkspaceType.path, + ); + expect(result, contains('')); + }); + + test('git type falls through to no files section', () { final result = sampleTemplate( id: 'test', difficulty: 'easy', workspaceType: WorkspaceType.git, ); - expect(result, contains('')); + expect(result, isNot(contains('files:'))); }); - test('path type with null value uses placeholder', () { + test('template type falls through to no files section', () { final result = sampleTemplate( id: 'test', difficulty: 'easy', - workspaceType: WorkspaceType.path, + workspaceType: WorkspaceType.template, + templatePackage: TemplatePackage.flutterApp, ); - expect(result, contains('')); + expect(result, isNot(contains('files:'))); }); }); } diff --git a/packages/devals_cli/test/dataset/task_template_test.dart b/packages/devals_cli/test/dataset/task_template_test.dart index 463ae62..2ac92a9 100644 --- a/packages/devals_cli/test/dataset/task_template_test.dart +++ b/packages/devals_cli/test/dataset/task_template_test.dart @@ -17,12 +17,13 @@ void main() { expect(result, contains('target: |')); }); - test('includes variants when provided', () { + test('does not include variants (variants are job-level)', () { final result = taskTemplate( taskFunc: 'flutter_code_gen', variants: ['baseline', 'mcp_only'], ); - expect(result, contains('variants: [baseline, mcp_only]')); + // Variants are now configured at the job level, not task level + expect(result, isNot(contains('variants:'))); }); test('omits variants line when list is empty', () { @@ -52,83 +53,58 @@ void main() { expect(result, isNot(contains('system_message:'))); }); - group('workspace section', () { - test('generates git workspace', () { - final result = taskTemplate( - taskFunc: 'flutter_bug_fix', - workspaceType: WorkspaceType.git, - workspaceValue: 
'https://github.com/example/repo', - ); - expect(result, contains('workspace:')); - expect(result, contains('git: https://github.com/example/repo')); - }); - - test('generates path workspace', () { + group('files section', () { + test('generates path files with workspace value', () { final result = taskTemplate( taskFunc: 'flutter_bug_fix', workspaceType: WorkspaceType.path, workspaceValue: './my_project', ); - expect(result, contains('workspace:')); - expect(result, contains('path: ./my_project')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./my_project')); + expect(result, contains('setup:')); }); - test('generates template workspace', () { + test('generates path files with default when value is null', () { final result = taskTemplate( - taskFunc: 'flutter_code_gen', - workspaceType: WorkspaceType.template, - templatePackage: TemplatePackage.flutterApp, + taskFunc: 'flutter_bug_fix', + workspaceType: WorkspaceType.path, ); - expect(result, contains('workspace:')); - expect(result, contains('template: flutter_app')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./project')); }); - test('generates create workspace as path', () { + test('generates create workspace as files', () { final result = taskTemplate( taskFunc: 'flutter_bug_fix', workspaceType: WorkspaceType.create, ); - expect(result, contains('workspace:')); - expect(result, contains('path: ./project')); + expect(result, contains('files:')); + expect(result, contains('/workspace: ./project')); + expect(result, contains('setup:')); }); - test('generates commented workspace section when type is null', () { + test('generates commented files section when type is null', () { final result = taskTemplate(taskFunc: 'question_answer'); - expect(result, contains('# Workspace configuration')); - expect(result, contains('# template: flutter_app')); + expect(result, contains('# files:')); + expect(result, contains('# /workspace: ./project')); }); - 
test('generates git with default URL when workspaceValue is null', () { + test('git type falls through to commented section', () { final result = taskTemplate( taskFunc: 'flutter_bug_fix', workspaceType: WorkspaceType.git, ); - expect(result, contains('git: ')); + expect(result, contains('# files:')); }); - test('generates path with default when workspaceValue is null', () { + test('template type falls through to commented section', () { final result = taskTemplate( - taskFunc: 'flutter_bug_fix', - workspaceType: WorkspaceType.path, + taskFunc: 'flutter_code_gen', + workspaceType: WorkspaceType.template, ); - expect(result, contains('path: ./project')); + expect(result, contains('# files:')); }); - - test( - 'generates template with placeholder when templatePackage is null', - () { - final result = taskTemplate( - taskFunc: 'flutter_code_gen', - workspaceType: WorkspaceType.template, - ); - expect( - result, - contains( - 'template: ', - ), - ); - }, - ); }); }); } diff --git a/tool/config_parity/bin/config_partiy.dart b/tool/config_parity/bin/config_parity.dart similarity index 100% rename from tool/config_parity/bin/config_partiy.dart rename to tool/config_parity/bin/config_parity.dart diff --git a/tool/config_parity/fixtures/multi_variant/jobs/dev.yaml b/tool/config_parity/fixtures/multi_variant/jobs/dev.yaml index e0e884b..5ec75d4 100644 --- a/tool/config_parity/fixtures/multi_variant/jobs/dev.yaml +++ b/tool/config_parity/fixtures/multi_variant/jobs/dev.yaml @@ -5,7 +5,7 @@ models: variants: baseline: {} context_only: - context_files: [] + files: [] full_mcp: mcp_servers: - my_server diff --git a/tool/config_parity/fixtures/multi_variant/tasks/code_gen/task.yaml b/tool/config_parity/fixtures/multi_variant/tasks/code_gen/task.yaml index d004f01..fb1872a 100644 --- a/tool/config_parity/fixtures/multi_variant/tasks/code_gen/task.yaml +++ b/tool/config_parity/fixtures/multi_variant/tasks/code_gen/task.yaml @@ -1,9 +1,6 @@ id: code_gen func: flutter_code_gen 
time_limit: 600 -allowed_variants: - - baseline - - context_only samples: inline: - id: sample_1 diff --git a/tool/config_parity/pubspec.lock b/tool/config_parity/pubspec.lock deleted file mode 100644 index dd2733b..0000000 --- a/tool/config_parity/pubspec.lock +++ /dev/null @@ -1,108 +0,0 @@ -# Generated by pub -# See https://dart.dev/tools/pub/glossary#lockfile -packages: - async: - dependency: transitive - description: - name: async - sha256: "758e6d74e971c3e5aceb4110bfd6698efc7f501675bcfe0c775459a8140750eb" - url: "https://pub.dev" - source: hosted - version: "2.13.0" - collection: - dependency: transitive - description: - name: collection - sha256: "2f5709ae4d3d59dd8f7cd309b4e023046b57d8a6c82130785d2b0e5868084e76" - url: "https://pub.dev" - source: hosted - version: "1.19.1" - dataset_config_dart: - dependency: "direct main" - description: - path: "../../packages/dataset_config_dart" - relative: true - source: path - version: "0.0.1" - file: - dependency: transitive - description: - name: file - sha256: a3b4f84adafef897088c160faf7dfffb7696046cb13ae90b508c2cbc95d3b8d4 - url: "https://pub.dev" - source: hosted - version: "7.0.1" - freezed_annotation: - dependency: transitive - description: - name: freezed_annotation - sha256: "7294967ff0a6d98638e7acb774aac3af2550777accd8149c90af5b014e6d44d8" - url: "https://pub.dev" - source: hosted - version: "3.1.0" - glob: - dependency: transitive - description: - name: glob - sha256: c3f1ee72c96f8f78935e18aa8cecced9ab132419e8625dc187e1c2408efc20de - url: "https://pub.dev" - source: hosted - version: "2.1.3" - json_annotation: - dependency: transitive - description: - name: json_annotation - sha256: cb09e7dac6210041fad964ed7fbee004f14258b4eca4040f72d1234062ace4c8 - url: "https://pub.dev" - source: hosted - version: "4.11.0" - meta: - dependency: transitive - description: - name: meta - sha256: "9f29b9bcc8ee287b1a31e0d01be0eae99a930dbffdaecf04b3f3d82a969f296f" - url: "https://pub.dev" - source: hosted - version: "1.18.1" - 
path: - dependency: "direct main" - description: - name: path - sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5" - url: "https://pub.dev" - source: hosted - version: "1.9.1" - source_span: - dependency: transitive - description: - name: source_span - sha256: "56a02f1f4cd1a2d96303c0144c93bd6d909eea6bee6bf5a0e0b685edbd4c47ab" - url: "https://pub.dev" - source: hosted - version: "1.10.2" - string_scanner: - dependency: transitive - description: - name: string_scanner - sha256: "921cd31725b72fe181906c6a94d987c78e3b98c2e205b397ea399d4054872b43" - url: "https://pub.dev" - source: hosted - version: "1.4.1" - term_glyph: - dependency: transitive - description: - name: term_glyph - sha256: "7f554798625ea768a7518313e58f83891c7f5024f88e46e7182a4558850a4b8e" - url: "https://pub.dev" - source: hosted - version: "1.2.2" - yaml: - dependency: transitive - description: - name: yaml - sha256: b9da305ac7c39faa3f030eccd175340f968459dae4af175130b3fc47e40d76ce - url: "https://pub.dev" - source: hosted - version: "3.1.3" -sdks: - dart: ">=3.10.0 <4.0.0"