> [!CAUTION]
> This repo is highly unstable and the APIs will change.
Evaluation framework for testing AI agents' ability to write Dart and Flutter code. Built on Inspect AI.
This repo includes:
- eval runner — Python package for running LLM evaluations with configurable tasks, variants, and models
- config packages — Dart and Python packages that resolve dataset YAML into EvalSet JSON for the runner
- devals CLI — Dart CLI for creating and managing dataset samples, tasks, and jobs
- Evaluation Explorer — Dart/Flutter app for browsing and analyzing results
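As a rough illustration of the flow above, a dataset YAML declares the work to run, and the config packages expand it into EvalSet JSON for the eval runner. The field names below are hypothetical and only sketch the idea; see the Configuration Reference for the actual schema.

```yaml
# Hypothetical dataset YAML -- field names are illustrative only;
# the real schema is documented in the Configuration Reference.
task: write_flutter_widget
variants:
  - default
  - strict_lints
samples:
  - id: counter_app
    prompt: Write a counter widget in Flutter.
  - id: todo_list
    prompt: Write a todo-list widget in Flutter.
```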
> [!TIP]
> Full documentation at dash-evals-docs.web.app/
| Package | Description | Docs |
|---|---|---|
| dash_evals | Python evaluation runner using Inspect AI | dash_evals docs |
| dataset_config_dart | Dart library for resolving dataset YAML into EvalSet JSON (includes shared data models) | dataset_config_dart docs |
| dataset_config_python | Python configuration models | — |
| devals_cli | Dart CLI for managing evaluation tasks and jobs | CLI docs |
| eval_explorer | Dart/Flutter reporting app | eval_explorer docs |
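To make the "resolve dataset YAML into EvalSet JSON" step concrete, here is a minimal, self-contained Python sketch under an assumed schema. The config dict, field names, and `resolve_eval_set` helper are all hypothetical illustrations, not this repo's API; the real models live in dataset_config_dart and dataset_config_python.

```python
import json

# Hypothetical dataset config, as it might look after parsing a dataset
# YAML file. These field names are illustrative only.
dataset_config = {
    "task": "write_flutter_widget",
    "variants": ["default", "strict_lints"],
    "samples": [
        {"id": "counter_app", "prompt": "Write a counter widget in Flutter."},
        {"id": "todo_list", "prompt": "Write a todo-list widget in Flutter."},
    ],
}

def resolve_eval_set(config: dict) -> str:
    """Expand a dataset config into a flat EvalSet JSON document,
    with one entry per (variant, sample) pair."""
    entries = [
        {"task": config["task"], "variant": variant, **sample}
        for variant in config["variants"]
        for sample in config["samples"]
    ]
    return json.dumps({"evals": entries}, indent=2)

print(resolve_eval_set(dataset_config))
```

The cross product of variants and samples (here 2 × 2 = 4 entries) is one plausible way a resolver can hand the runner a flat list of independent evaluations.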
| Doc | Description |
|---|---|
| Quick Start | Get started authoring your own evals |
| Contributing Guide | Development setup and guidelines |
| CLI Reference | Full devals CLI command reference |
| Configuration Reference | YAML configuration file reference |
| Repository Structure | Project layout |
| Glossary | Terminology guide |
See CONTRIBUTING.md for details, or go directly to the Contributing Guide.
See LICENSE for details.