Welcome to the Dart/Flutter LLM evaluation project! This repository contains tools for running and analyzing AI model evaluations on Dart and Flutter tasks.
## Prerequisites

- Python 3.13+
- Podman or Docker (for sandbox execution)
- API keys for the models you want to test

## Setup

1. **Create and activate a virtual environment**

   ```bash
   cd packages/dash_evals
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

2. **Install dependencies**

   ```bash
   pip install -e .          # Core dependencies
   pip install -e ".[dev]"   # Development dependencies (pytest, ruff, etc.)
   ```

3. **Configure API keys**

   You only need to configure the keys for the models you plan to test.

   ```bash
   export GEMINI_API_KEY=your_key_here
   export ANTHROPIC_API_KEY=your_key_here
   export OPENAI_API_KEY=your_key_here
   ```

4. **Verify the installation**

   ```bash
   run-evals --help
   ```
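As an optional sanity check (not part of the official tooling), a short Python sketch can report which of the API keys mentioned above are actually visible to your shell session:

```python
import os

# The three keys this guide mentions; extend as needed.
EXPECTED_KEYS = ["GEMINI_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def configured_keys(env: dict) -> list:
    """Return the expected key names that are set to a non-empty value."""
    return [k for k in EXPECTED_KEYS if env.get(k)]


if __name__ == "__main__":
    found = configured_keys(dict(os.environ))
    print(f"Configured keys: {found or 'none'}")
```

An empty result just means the variables are not exported in the current shell, not that anything is broken.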
The most common contribution is adding new evaluation samples. Each sample tests a specific capability or scenario.
1. **Decide which task your sample belongs to**

   Review the available tasks in `dataset/tasks/`, or run `devals create task` to see the available task functions:

   | Task | Purpose |
   |---|---|
   | `question_answer` | Q&A evaluation for Dart/Flutter knowledge |
   | `bug_fix` | Agentic debugging of code in a sandbox |
   | `flutter_bug_fix` | Flutter-specific bug fix (wraps `bug_fix`) |
   | `code_gen` | Generate code from specifications |
   | `flutter_code_gen` | Flutter-specific code gen (wraps `code_gen`) |
   | `mcp_tool` | Test MCP tool usage |
   | `analyze_codebase` | Evaluate codebase analysis |
   | `skill_test` | Test skill file usage in sandboxes |

2. **Create your sample file**

   Use `devals create sample` for interactive sample creation, or add a sample inline in the task's `task.yaml` file under `dataset/tasks/<task_name>/task.yaml`:

   ```yaml
   id: dart_your_sample_id
   input: |
     Your prompt to the model goes here.
   target: |
     Criteria for grading the response. This is used by the scorer
     to determine if the model's output is acceptable.
   metadata:
     added: 2025-02-04
     tags: [dart, async]  # Optional categorization
   ```

   For agentic tasks (bug fix, code gen), you'll also need a workspace:

   ```yaml
   id: flutter_fix_some_bug
   input: |
     The app crashes when the user taps the submit button.
     Debug and fix the issue.
   target: |
     The fix should handle the null check in the onPressed callback.
   workspace:
     template: flutter_app  # Use a reusable template
     # OR
     path: ./project        # Custom project relative to the sample directory
   ```

3. **Add your sample to the task's `task.yaml`**

   Add your sample inline in the appropriate task's `samples` list:

   ```yaml
   # dataset/tasks/dart_question_answer/task.yaml
   func: question_answer
   samples:
     - id: dart_your_sample_id
       input: |
         Your prompt to the model goes here.
       target: |
         Criteria for grading the response.
   ```
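Before adding a sample, it can help to check that it carries the fields described above. This is a hypothetical pre-commit helper, not part of dash_evals; the required-field list is taken from the sample format shown above:

```python
# Fields every sample shown in this guide carries.
REQUIRED_FIELDS = ("id", "input", "target")


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one sample dict (empty if OK)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not sample.get(f)]
    sample_id = sample.get("id")
    if sample_id is not None and not isinstance(sample_id, str):
        problems.append("id must be a string")
    return problems
```

For example, `validate_sample({"id": "dart_x", "input": "...", "target": "..."})` returns an empty list, while omitting `target` flags it as missing.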
Before committing, test your sample by creating a job file. Use `devals create job` interactively, or manually create one in `dataset/jobs/`:

```yaml
# jobs/test_my_sample.yaml
name: test_my_sample
# Run only the task containing your sample
tasks:
  dart_question_answer:
    allowed_variants: [baseline]  # Start with the baseline variant
    include-samples:
      - dart_your_sample_id  # Only run your specific sample
# Use a fast model for testing
models: [google/gemini-2.5-flash]
```

Then run your job:

```bash
devals run test_my_sample
```
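Conceptually, `include-samples` acts as an allow-list of sample ids. A minimal sketch of that filtering behavior (a hypothetical helper, not the runner's actual code):

```python
def filter_samples(samples: list, include_ids) -> list:
    """Keep only samples whose id is in include_ids.

    A missing or empty allow-list keeps every sample, matching the
    behavior of omitting `include-samples` from the job file.
    """
    if not include_ids:
        return list(samples)
    allowed = set(include_ids)
    return [s for s in samples if s.get("id") in allowed]
```

With `include_ids=["dart_your_sample_id"]`, only that sample would survive the filter and be sent to the model.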
1. **Dry run first.** This validates the configuration without making API calls:

   ```bash
   devals run test_my_sample --dry-run
   ```

2. **Run the evaluation:**

   ```bash
   devals run test_my_sample
   ```

3. **Check the output** in the `logs/` directory. Verify that:
   - The model received your prompt correctly
   - The scorer evaluated the response appropriately
   - No errors occurred during execution
**Do commit:**

- Your updated task file(s) in `dataset/tasks/`
- Any new workspace templates or context files

**Do NOT commit:**

- Test job files in `dataset/jobs/` (if they were only for local testing)
- Log files in `logs/`
- API keys or `.env` files
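One way to make the "do not commit" list harder to violate is a local `.gitignore` entry. The paths below are assumptions based on the list above (in particular, the `test_*.yaml` naming pattern for local job files is only a suggestion, not a project convention):

```gitignore
# Keep local experiment artifacts out of commits
logs/
.env
dataset/jobs/test_*.yaml
```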
Before submitting your PR, clean up any test job files you created:

```bash
git status  # Check for untracked/modified job files
```

If you're adding new task types, scorers, or solvers, this section is for you.
The dash_evals runner uses Inspect AI concepts:
| Component | Purpose | Location |
|---|---|---|
| Task | Defines what to evaluate — combines dataset, solver chain, and scorers | runner/tasks/ |
| Solver | Processes inputs (e.g., injects context, runs agent loops) | runner/solvers/ |
| Scorer | Evaluates outputs (e.g., model grading, dart analyze, flutter test) | runner/scorers/ |
A typical task structure:
```python
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset

# Solvers live in runner/solvers/ and scorers in runner/scorers/;
# import add_system_message, context_injector, model_graded_scorer,
# and dart_analyze_scorer from there.


@task
def your_new_task(dataset: MemoryDataset, task_def: dict) -> Task:
    return Task(
        name=task_def.get("name", "your_new_task"),
        dataset=dataset,
        solver=[
            add_system_message("Your system prompt"),
            context_injector(task_def),
            # ... more solvers
        ],
        scorer=[
            model_graded_scorer(),
            dart_analyze_scorer(),
        ],
    )
```
1. **Create your task file** at `src/dash_evals/runner/tasks/your_task.py`.

2. **Export it** from `src/dash_evals/runner/tasks/__init__.py`:

   ```python
   from .your_task import your_new_task

   __all__ = [
       # ... existing tasks ...
       "your_new_task",
   ]
   ```

   Task functions are discovered dynamically via `importlib`. If the function name matches a module in `runner/tasks/`, it is found automatically when referenced from a `task.yaml` file; no registry is needed.

3. **Create a task directory** in `dataset/tasks/`:

   ```
   dataset/tasks/your_task_id/
   └── task.yaml
   ```

   ```yaml
   # dataset/tasks/your_task_id/task.yaml
   func: your_new_task  # Must match the function name
   samples:
     - id: sample_001
       input: |
         Your prompt here.
       target: |
         Expected output or grading criteria.
   ```
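The `importlib`-based discovery described above can be sketched roughly as follows. This is a simplified illustration of the convention (module name matches the task function name), not the actual dash_evals implementation; the demo package layout is made up for the example:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def find_task_function(package: str, func_name: str):
    """Import `<package>.<func_name>` and return the function of that name.

    Because the module name must match the task function name, no
    explicit registry is needed: the name in task.yaml is enough.
    """
    module = importlib.import_module(f"{package}.{func_name}")
    return getattr(module, func_name)


if __name__ == "__main__":
    # Build a throwaway package on disk to demonstrate the lookup.
    root = Path(tempfile.mkdtemp())
    pkg = root / "demo_tasks"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "my_task.py").write_text("def my_task():\n    return 'found'\n")
    sys.path.insert(0, str(root))
    print(find_task_function("demo_tasks", "my_task")())  # prints: found
```

A lookup like this fails with `ModuleNotFoundError` when the function name has no matching module, which is the error to expect if a `task.yaml` references a misspelled `func`.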
1. **Run the test suite:**

   ```bash
   cd packages/dash_evals
   pytest
   ```

2. **Run linting:**

   ```bash
   ruff check src/dash_evals
   ruff format src/dash_evals
   ```

3. **Test your task end-to-end:**

   ```bash
   devals run test_my_sample --dry-run  # Validate the config
   devals run test_my_sample            # Run the actual evaluation
   ```
A Dart/Flutter application for exploring evaluation results, built with Serverpod.
> **Note**
> The eval_explorer is under active development. Contribution guidelines are coming soon!
The package is located in `packages/eval_explorer/` and consists of:

| Package | Description |
|---|---|
| `eval_explorer_client` | Dart client package |
| `eval_explorer_flutter` | Flutter web/mobile app |
| `eval_explorer_server` | Serverpod backend |
| `eval_explorer_shared` | Shared models |