
Yardstick config updates #14

Merged
ericwindmill merged 21 commits into main from
yardstick-config-updates
Mar 20, 2026

Conversation

@ericwindmill
Collaborator

Updating config to serve both our local use cases and Yardstick.

…on fields, refactoring `configuration_reference.md` to link to it and updating `index.md` and `custom.css`.
…resolvers and support colon syntax for task function resolution.
…, and expanded sandbox configuration options in documentation and API.
…rguments to resolved task metadata, updating the config parity tool.
@ericwindmill ericwindmill marked this pull request as ready for review March 14, 2026 00:56
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the configuration system to improve its adaptability and extensibility, particularly for Yardstick and diverse local use cases. It introduces new configuration options, standardizes naming conventions, and decouples environment-specific settings, allowing for more flexible and maintainable evaluation setups. A major enhancement is the introduction of tag-based filtering, which provides powerful new ways to manage task and sample selection.

Highlights

  • New Configuration Fields: Introduced several new fields across Job and Task models, including 'description' and 'imagePrefix' for Job, 'args' for JobTask, and 'systemMessage' and 'sandboxParameters' for Task, enhancing configuration flexibility.
  • Renamed Task Function Field: The 'taskFunc' field in the Task model was renamed to 'func' for consistency with YAML key names, requiring updates in both Dart and Python packages.
  • Configurable Sandbox Registry: The sandbox registry and SDK channel mappings are now data-driven and configurable, allowing non-Flutter projects to define their own sandbox setups and removing hardcoded values.
  • Enhanced Workspace Resolution: Workspace resolution now directly uses Inspect AI's native 'Sample.files' and 'Sample.setup' fields, making the setup command configurable and no longer hardcoded to Flutter-specific commands.
  • Tag-Based Filtering: Implemented a new 'TagFilter' model to enable tag-based filtering for tasks, samples, and variants, providing more granular control over evaluation execution.
  • Module:Task Syntax Support: Added support for the 'module.path:function_name' syntax for Python task function references, improving clarity and organization for task definitions.
  • Comprehensive Documentation Updates: Added a new 'CHANGELOG.md' and 'IMPLEMENTATION_PLAN.md', along with extensive updates to existing documentation, including a new 'yaml_config.md' reference, to reflect all configuration changes.
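
To illustrate the 'module.path:function_name' syntax mentioned in the highlights, here is a minimal Python sketch of how such a reference could be resolved. It is not the code from `json_runner.py`: `resolve_task_ref` is a hypothetical helper name, and the error handling is an assumption made for the example.

```python
import importlib
from typing import Callable


def resolve_task_ref(ref: str) -> Callable:
    """Resolve a 'module.path:function_name' string to a callable.

    Hypothetical helper for illustration; the real runner may behave differently.
    """
    module_path, sep, func_name = ref.partition(":")
    if not sep or not module_path or not func_name:
        raise ValueError(
            f"Expected 'module.path:function_name', got {ref!r}"
        )

    # Import the module named on the left of the colon.
    module = importlib.import_module(module_path)

    # Look up the attribute named on the right of the colon.
    try:
        func = getattr(module, func_name)
    except AttributeError as exc:
        raise ValueError(
            f"Module {module_path!r} has no attribute {func_name!r}"
        ) from exc

    if not callable(func):
        raise TypeError(f"{ref!r} does not refer to a callable")
    return func
```

With this sketch, `resolve_task_ref("json:dumps")` returns the standard library's `json.dumps`, while a string without a colon is rejected up front rather than failing later at import time.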

Changelog
  • CHANGELOG.md
    • Added a new changelog file detailing new features, breaking changes, and documentation updates across multiple releases, including new Job and Task fields, tag-based filtering, configurable sandbox, and the module:task syntax.
    • Documented the renaming of Task.taskFunc to Task.func and changes to workspace resolution.
  • IMPLEMENTATION_PLAN.md
    • Added a new implementation plan outlining the steps for config improvements, including model changes, parser/resolver changes, tag-based filtering, and a file index of modifications.
  • docs/_static/custom.css
    • Modified CSS to adjust sidebar and article container widths for wide screens.
  • docs/guides/config.md
    • Updated configuration guide with sections on tag-based filtering, task function references, sandbox configuration, and workspace setup.
  • docs/reference/configuration_reference.md
    • Updated the configuration reference to link to the new yaml_config.md and removed detailed field tables, deferring to the new dedicated reference.
  • docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md
    • Updated Dart API documentation for EvalSetResolver, Job, JobTask, ParsedTask, TagFilter, Task, TaskMetadata, and matchesTagFilter to reflect new fields, renames, and configurable sandbox.
  • docs/reference/index.md
    • Added yaml_config to the reference index.
  • docs/reference/yaml_config.md
    • Added a new comprehensive YAML configuration reference file detailing Job, Task, and Sample fields with Dart and Python cross-references.
  • packages/dash_evals/src/dash_evals/runner/json_runner.py
    • Updated task function resolution to support module:task syntax and changed references from task_func to func.
  • packages/dataset_config_dart/lib/src/models/context_file.g.dart
    • Modified JSON serialization for ContextFile to directly use metadata instead of toJson().
  • packages/dataset_config_dart/lib/src/models/dataset.g.dart
    • Modified JSON serialization for Dataset to directly use samples instead of map((e) => e.toJson()).toList().
  • packages/dataset_config_dart/lib/src/models/eval_log.g.dart
    • Modified JSON serialization for EvalLog and related classes to directly use nested objects instead of toJson() or map((e) => e.toJson()).toList().
  • packages/dataset_config_dart/lib/src/models/eval_set.g.dart
    • Modified JSON serialization for EvalSet to directly use tasks instead of map((e) => e.toJson()).toList().
  • packages/dataset_config_dart/lib/src/models/job.dart
    • Added description, imagePrefix, taskFilters, sampleFilters to Job and args to JobTask.
  • packages/dataset_config_dart/lib/src/models/job.freezed.dart
    • Updated Job and JobTask freezed classes to include new fields (description, imagePrefix, taskFilters, sampleFilters, args) and their copyWith, ==, hashCode, and toString methods.
  • packages/dataset_config_dart/lib/src/models/job.g.dart
    • Updated JSON serialization for Job and JobTask to include new fields (description, imagePrefix, taskFilters, sampleFilters, args) and to directly use nested objects.
  • packages/dataset_config_dart/lib/src/models/models.dart
    • Exported the new tag_filter.dart model.
  • packages/dataset_config_dart/lib/src/models/tag_filter.dart
    • Added a new TagFilter model and matchesTagFilter utility function.
  • packages/dataset_config_dart/lib/src/models/tag_filter.freezed.dart
    • Added freezed class for TagFilter.
  • packages/dataset_config_dart/lib/src/models/tag_filter.g.dart
    • Added JSON serialization for TagFilter.
  • packages/dataset_config_dart/lib/src/models/task.dart
    • Renamed taskFunc to func and added systemMessage and sandboxParameters to Task and TaskMetadata.
  • packages/dataset_config_dart/lib/src/models/task.freezed.dart
    • Updated Task freezed class to reflect taskFunc rename to func and addition of systemMessage and sandboxParameters.
  • packages/dataset_config_dart/lib/src/models/task.g.dart
    • Updated JSON serialization for Task to reflect taskFunc rename to func and addition of systemMessage and sandboxParameters, and to directly use dataset.
  • packages/dataset_config_dart/lib/src/models/variant.dart
    • Renamed flutterChannel to branch in Variant.
  • packages/dataset_config_dart/lib/src/models/variant.freezed.dart
    • Updated Variant freezed class to reflect flutterChannel rename to branch.
  • packages/dataset_config_dart/lib/src/models/variant.g.dart
    • Updated JSON serialization for Variant to reflect flutterChannel rename to branch and to directly use context_files.
  • packages/dataset_config_dart/lib/src/parsed_task.dart
    • Renamed taskFunc to func and added variantFilters and sandboxParameters to ParsedTask.
  • packages/dataset_config_dart/lib/src/parsers/json_parser.dart
    • Updated JsonParser to use func instead of taskFunc.
  • packages/dataset_config_dart/lib/src/parsers/yaml_parser.dart
    • Updated YamlParser to use func instead of taskFunc, and to parse new fields like variant_filters, sandbox_parameters, description, image_prefix, task_filters, and sample_filters.
  • packages/dataset_config_dart/lib/src/resolvers/eval_set_resolver.dart
    • Refactored EvalSetResolver to be configurable with sandboxRegistry and branchChannels instead of hardcoded constants.
    • Renamed kSandboxRegistry to kDefaultSandboxRegistry and kSdkChannels to kDefaultBranchChannels.
    • Updated sandbox resolution logic and implemented tag-based filtering for tasks, samples, and variants.
    • Propagated image_prefix and JobTask.args to task metadata.
  • packages/dataset_config_dart/pubspec.yaml
    • Added build_runner, freezed, and json_serializable to dev dependencies.
  • packages/dataset_config_dart/test/eval_set_resolver_test.dart
    • Updated tests to reflect taskFunc rename to func.
    • Added tests for variant_filters, image_prefix, and JobTask.args.
  • packages/dataset_config_dart/test/eval_set_writer_test.dart
    • Updated test to use func instead of taskFunc.
  • packages/dataset_config_dart/test/json_parser_test.dart
    • Updated tests to use func instead of taskFunc.
  • packages/dataset_config_dart/test/parsed_task_test.dart
    • Updated tests to use func instead of taskFunc.
  • packages/dataset_config_python/src/dataset_config_python/__init__.py
    • Exported new SandboxConfig, DEFAULT_BRANCH_CHANNELS, and DEFAULT_SANDBOX_REGISTRY.
  • packages/dataset_config_python/src/dataset_config_python/models/__init__.py
    • Exported new TagFilter and matches_tag_filter.
  • packages/dataset_config_python/src/dataset_config_python/models/job.py
    • Imported TagFilter.
    • Added description, image_prefix, task_filters, sample_filters to Job, and args to JobTask.
  • packages/dataset_config_python/src/dataset_config_python/models/tag_filter.py
    • Added new TagFilter model and matches_tag_filter utility function.
  • packages/dataset_config_python/src/dataset_config_python/models/task.py
    • Renamed task_func to func and added system_message and sandbox_parameters to Task.
  • packages/dataset_config_python/src/dataset_config_python/models/variant.py
    • Renamed flutter_channel to branch in Variant.
  • packages/dataset_config_python/src/dataset_config_python/parser.py
    • Imported TagFilter.
    • Renamed task_func to func in ParsedTask.
    • Added sandbox_parameters and variant_filters to ParsedTask.
    • Updated _load_task_file and parse_job to handle these new fields.
  • packages/dataset_config_python/src/dataset_config_python/resolver.py
    • Refactored resolver to use a SandboxConfig dataclass.
    • Renamed SANDBOX_REGISTRY to DEFAULT_SANDBOX_REGISTRY and SDK_CHANNELS to DEFAULT_BRANCH_CHANNELS.
    • Updated _resolve_job and _resolve_sandbox to accept configurable sandbox settings.
    • Implemented tag-based filtering for tasks, samples, and variants.
    • Propagated image_prefix and JobTask.args to task metadata.
  • packages/dataset_config_python/tests/test_config.py
    • Updated tests to reflect flutter_channel rename to branch and task_func rename to func.
  • packages/devals_cli/lib/src/dataset/dry_run.dart
    • Updated _validateConfig to use func instead of taskFunc.
  • tool/config_parity/bin/config_parity.dart
    • Renamed from config_partiy.dart.
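
The changelog lists a TagFilter model and a tag-matching helper without showing their shape. The following Python sketch shows one plausible design. Everything in it is an assumption: the class name `SketchTagFilter`, the field names `include_tags` and `exclude_tags`, and the rule that all include tags must be present while any exclude tag rejects the item. The real `TagFilter` in `tag_filter.py` may use different fields and semantics.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class SketchTagFilter:
    """Hypothetical stand-in for the PR's TagFilter model."""

    # Assumed field names; not taken from the actual model.
    include_tags: List[str] = field(default_factory=list)
    exclude_tags: List[str] = field(default_factory=list)


def sketch_matches(tags: Iterable[str], tag_filter: SketchTagFilter) -> bool:
    """Return True if `tags` passes the filter under the assumed rules."""
    tag_set = set(tags)

    # Assumed rule 1: any excluded tag rejects the item.
    if tag_set & set(tag_filter.exclude_tags):
        return False

    # Assumed rule 2: every included tag must be present.
    return set(tag_filter.include_tags) <= tag_set


def filter_items(items: List[dict], tag_filter: SketchTagFilter) -> List[dict]:
    """Keep only items (tasks, samples, or variants) whose 'tags' match."""
    return [item for item in items if sketch_matches(item.get("tags", []), tag_filter)]
```

Under these assumptions, an empty filter matches everything, so adding a filter field to a job is backwards compatible with configs that do not set it.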
Activity
  • The pull request introduces a new CHANGELOG.md file, detailing several new features and breaking changes across different releases, indicating significant ongoing development.
  • An IMPLEMENTATION_PLAN.md is added, suggesting a structured approach to these configuration improvements.
  • Documentation files are updated and a new comprehensive YAML configuration reference is added, implying a focus on clarity and usability for developers.
  • The changes span both Dart and Python packages, indicating a cross-language impact and coordination.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant number of updates to the configuration system to support Yardstick and local use cases. Key changes include adding new fields to Job and Task models (like description, imagePrefix, args), introducing tag-based filtering for tasks and samples, and making the sandbox registry configurable. The Task.taskFunc field has been consistently renamed to Task.func across the codebase. Extensive documentation has been added or updated to reflect these changes, including a new detailed YAML configuration reference.

My review focuses on ensuring consistency across the implementation and documentation. I've found a couple of minor inconsistencies in the documentation regarding field names. I also noticed one part of the implementation in the Dart resolver that seems to contradict the implementation plan regarding hardcoded setup commands. Overall, this is a well-structured and comprehensive update.
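
For readers unfamiliar with the "configurable sandbox registry" change summarized above, here is a minimal Python sketch of the general idea: defaults live in data, and a caller can pass its own registry instead of relying on hardcoded values. All names and values are hypothetical (`SandboxConfigSketch`, `resolve_sandbox_sketch`, the placeholder registry entries); the actual `SandboxConfig`, `DEFAULT_SANDBOX_REGISTRY`, and `DEFAULT_BRANCH_CHANNELS` in `resolver.py` may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Placeholder defaults for the sketch only; not the project's real values.
SKETCH_DEFAULT_REGISTRY: Dict[str, Dict[str, str]] = {
    "local": {"type": "local"},
}
SKETCH_DEFAULT_CHANNELS: Dict[str, str] = {
    "stable": "stable",
}


@dataclass(frozen=True)
class SandboxConfigSketch:
    """Hypothetical container for data-driven sandbox settings."""

    registry: Dict[str, Dict[str, str]] = field(
        default_factory=lambda: dict(SKETCH_DEFAULT_REGISTRY)
    )
    branch_channels: Dict[str, str] = field(
        default_factory=lambda: dict(SKETCH_DEFAULT_CHANNELS)
    )


def resolve_sandbox_sketch(
    name: str, config: Optional[SandboxConfigSketch] = None
) -> Dict[str, str]:
    """Look up a sandbox by name, falling back to the sketch defaults."""
    config = config or SandboxConfigSketch()
    if name not in config.registry:
        known = ", ".join(sorted(config.registry)) or "<none>"
        raise ValueError(f"Unknown sandbox {name!r}. Known sandboxes: {known}")
    # Return a copy so callers cannot mutate the registry entry.
    return dict(config.registry[name])
```

A non-Flutter project could then supply its own entries, for example `SandboxConfigSketch(registry={"node": {"type": "docker"}})`, without touching the defaults.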

…rguments into single map fields, removing individual top-level parameters and `JobTask.systemMessage`.
…lists, rename context files to files, skill paths to skills, and add task parameters.
…les` and `setup` for more flexible resource management.
…led list of sub-fields with a concise summary and external reference.
…ns, breaking changes, and documentation improvements.
@ericwindmill
Copy link
Collaborator Author

/gemini review

…ename `context_files` to `files`, and remove `allowed_variants`.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly refactors the configuration system for evaluation runs, impacting both Dart and Python packages. Key changes include introducing new fields for Job (description, image prefix, tag-based task/sample filtering) and JobTask (per-task arguments). Task models now directly support systemMessage, sandboxParameters, and task-level files and setup declarations.

Breaking changes involve renaming Task.taskFunc to Task.func, making the sandbox registry configurable, replacing workspace and tests with files and setup, and consolidating various configuration arguments into Job.sandbox and Job.inspectEvalArguments. Variant fields like context_files and skill_paths have been renamed to files and skills respectively, and variant restriction logic has shifted to include-variants/exclude-variants on job tasks. Documentation has been updated with a new YAML configuration reference and guides.

Review comments suggest improving the documentation for mcp_servers by adding descriptions and explanations of 'MCP server identifiers' for better user understanding. Additionally, an improvement opportunity was identified in the create_mcp_servers function to include the server's name in the ValueError message when a command is missing, which would aid in debugging problematic configurations.
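
The suggestion about `create_mcp_servers` can be illustrated with a short Python sketch. This is not the function from the PR: the input shape (a mapping from server name to a dict with `command` and optional `args`) and the name `create_mcp_servers_sketch` are assumptions chosen only to show an error message that names the offending server.

```python
from typing import Any, Dict, List, Mapping


def create_mcp_servers_sketch(
    configs: Mapping[str, Mapping[str, Any]],
) -> List[Dict[str, Any]]:
    """Validate hypothetical MCP server configs and return normalized entries."""
    servers: List[Dict[str, Any]] = []
    for name, cfg in configs.items():
        command = cfg.get("command")
        if not command:
            # Naming the server makes a bad config entry easy to locate.
            raise ValueError(
                f"MCP server {name!r} is missing the required 'command' field"
            )
        servers.append(
            {"name": name, "command": command, "args": list(cfg.get("args", []))}
        )
    return servers
```

With several servers in one config, a message such as "MCP server 'broken' is missing the required 'command' field" points directly at the entry to fix, whereas a generic "command is missing" would not.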

…tart and tutorial content with new getting started, evaluation, and configuration guides.
…N, and CSV formats with new `dataset` key and `json_runner`.
@ericwindmill
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant updates to the Yardstick configuration, including new features, breaking changes, and documentation updates. The changes span multiple files, primarily focusing on the CHANGELOG.md, IMPLEMENTATION_PLAN.md, and various Dart and Python files within the dataset_config packages. The updates aim to enhance the flexibility and usability of the configuration system for both local use cases and Yardstick integration. The review comments address issues related to documentation requirements and style guide adherence.

- Updated `docs/reference/configuration_reference.md` with new examples and directory structure.
- Updated `docs/guides/config.md`.

## 11 March, 2025
Contributor

medium

The date 11 March, 2025 is mentioned in the changelog. It is extremely important to avoid commenting on dates, times, or versions mentioned in non-code areas such as copyright sections, as per the core instructions.

@ericwindmill ericwindmill merged commit 4648309 into main Mar 20, 2026
8 checks passed
@ericwindmill ericwindmill deleted the yardstick-config-updates branch March 20, 2026 21:44