Welcome to the Dart/Flutter LLM evaluation project! This repository contains tools for running and analyzing AI model evaluations on Dart and Flutter tasks.
## Prerequisites

- Python 3.13+
- Podman or Docker (for sandbox execution)
- API keys for the models you want to test

## Setup

1. **Create and activate a virtual environment**

   ```bash
   cd packages/dash_evals
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

2. **Install dependencies**

   ```bash
   pip install -e .          # Core dependencies
   pip install -e ".[dev]"   # Development dependencies (pytest, ruff, etc.)
   ```

3. **Configure API keys**

   You only need to configure the keys for the models you plan to test.

   ```bash
   export GEMINI_API_KEY=your_key_here
   export ANTHROPIC_API_KEY=your_key_here
   export OPENAI_API_KEY=your_key_here
   ```

4. **Verify the installation**

   ```bash
   run-evals --help
   ```
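As an optional sanity check (not part of the official tooling), a short Python sketch can report which of the API keys mentioned above are actually visible to your shell session:

```python
import os

# The three keys this guide mentions; extend as needed.
EXPECTED_KEYS = ["GEMINI_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def configured_keys(env: dict) -> list:
    """Return the expected key names that are set to a non-empty value."""
    return [k for k in EXPECTED_KEYS if env.get(k)]


if __name__ == "__main__":
    found = configured_keys(dict(os.environ))
    print(f"Configured keys: {found or 'none'}")
```

An empty result just means the variables are not exported in the current shell, not that anything is broken.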
The most common contribution is adding new evaluation samples. Each sample tests a specific capability or scenario.
1. **Decide which task your sample belongs to**

   Review the available tasks in `dataset/tasks/`, or run `devals create task` to see the available task functions:

   | Task | Purpose |
   |---|---|
   | `question_answer` | Q&A evaluation for Dart/Flutter knowledge |
   | `bug_fix` | Agentic debugging of code in a sandbox |
   | `flutter_bug_fix` | Flutter-specific bug fix (wraps `bug_fix`) |
   | `code_gen` | Generate code from specifications |
   | `flutter_code_gen` | Flutter-specific code gen (wraps `code_gen`) |
   | `mcp_tool` | Test MCP tool usage |
   | `analyze_codebase` | Evaluate codebase analysis |
   | `skill_test` | Test skill file usage in sandboxes |

2. **Create your sample file**

   Use `devals create sample` for interactive sample creation, or add a sample inline in the task's `task.yaml` file under `dataset/tasks/<task_name>/task.yaml`:

   ```yaml
   id: dart_your_sample_id
   input: |
     Your prompt to the model goes here.
   target: |
     Criteria for grading the response. This is used by the scorer
     to determine if the model's output is acceptable.
   metadata:
     added: 2025-02-04
     tags: [dart, async]  # Optional categorization
   ```

   For agentic tasks (bug fix, code gen), you'll also need a workspace:

   ```yaml
   id: flutter_fix_some_bug
   input: |
     The app crashes when the user taps the submit button.
     Debug and fix the issue.
   target: |
     The fix should handle the null check in the onPressed callback.
   workspace:
     template: flutter_app  # Use a reusable template
     # OR
     path: ./project        # Custom project relative to the sample directory
   ```

3. **Add your sample to the task's `task.yaml`**

   Add your sample inline in the appropriate task's `samples` list:

   ```yaml
   # dataset/tasks/dart_question_answer/task.yaml
   func: question_answer
   samples:
     - id: dart_your_sample_id
       input: |
         Your prompt to the model goes here.
       target: |
         Criteria for grading the response.
   ```
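Before adding a sample, it can help to check that it carries the fields described above. This is a hypothetical pre-commit helper, not part of dash_evals; the required-field list is taken from the sample format shown above:

```python
# Fields every sample shown in this guide carries.
REQUIRED_FIELDS = ("id", "input", "target")


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one sample dict (empty if OK)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not sample.get(f)]
    sample_id = sample.get("id")
    if sample_id is not None and not isinstance(sample_id, str):
        problems.append("id must be a string")
    return problems
```

For example, `validate_sample({"id": "dart_x", "input": "...", "target": "..."})` returns an empty list, while omitting `target` flags it as missing.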
Before committing, test your sample by creating a job file. Use `devals create job` interactively, or manually create one in `dataset/jobs/`:

```yaml
# jobs/test_my_sample.yaml
name: test_my_sample
# Run only the task containing your sample
tasks:
  dart_question_answer:
    allowed_variants: [baseline]  # Start with the baseline variant
    include-samples:
      - dart_your_sample_id  # Only run your specific sample
# Use a fast model for testing
models: [google/gemini-2.5-flash]
```

Then run your job:

```bash
devals run test_my_sample
```
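Conceptually, `include-samples` acts as an allow-list of sample ids. A minimal sketch of that filtering behavior (a hypothetical helper, not the runner's actual code):

```python
def filter_samples(samples: list, include_ids) -> list:
    """Keep only samples whose id is in include_ids.

    A missing or empty allow-list keeps every sample, matching the
    behavior of omitting `include-samples` from the job file.
    """
    if not include_ids:
        return list(samples)
    allowed = set(include_ids)
    return [s for s in samples if s.get("id") in allowed]
```

With `include_ids=["dart_your_sample_id"]`, only that sample would survive the filter and be sent to the model.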
1. **Dry run first.** This validates the configuration without making API calls:

   ```bash
   devals run test_my_sample --dry-run
   ```

2. **Run the evaluation:**

   ```bash
   devals run test_my_sample
   ```

3. **Check the output** in the `logs/` directory. Verify that:
   - The model received your prompt correctly
   - The scorer evaluated the response appropriately
   - No errors occurred during execution
**Do commit:**

- Your updated task file(s) in `dataset/tasks/`
- Any new workspace templates or context files

**Do NOT commit:**

- Test job files in `dataset/jobs/` (if they were only for local testing)
- Log files in `logs/`
- API keys or `.env` files
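One way to make the "do not commit" list harder to violate is a local `.gitignore` entry. The paths below are assumptions based on the list above (in particular, the `test_*.yaml` naming pattern for local job files is only a suggestion, not a project convention):

```gitignore
# Keep local experiment artifacts out of commits
logs/
.env
dataset/jobs/test_*.yaml
```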
Before submitting your PR, clean up any test job files you created:

```bash
git status  # Check for untracked/modified job files
```

If you're adding new task types, scorers, or solvers, this section is for you.
The dash_evals runner uses Inspect AI concepts:
| Component | Purpose | Location |
|---|---|---|
| Task | Defines what to evaluate — combines dataset, solver chain, and scorers | runner/tasks/ |
| Solver | Processes inputs (e.g., injects context, runs agent loops) | runner/solvers/ |
| Scorer | Evaluates outputs (e.g., model grading, dart analyze, flutter test) | runner/scorers/ |
A typical task structure:
```python
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset

# Solvers live in runner/solvers/ and scorers in runner/scorers/;
# import add_system_message, context_injector, model_graded_scorer,
# and dart_analyze_scorer from there.


@task
def your_new_task(dataset: MemoryDataset, task_def: dict) -> Task:
    return Task(
        name=task_def.get("name", "your_new_task"),
        dataset=dataset,
        solver=[
            add_system_message("Your system prompt"),
            context_injector(task_def),
            # ... more solvers
        ],
        scorer=[
            model_graded_scorer(),
            dart_analyze_scorer(),
        ],
    )
```
1. **Create your task file** at `src/dash_evals/runner/tasks/your_task.py`.

2. **Export it** from `src/dash_evals/runner/tasks/__init__.py`:

   ```python
   from .your_task import your_new_task

   __all__ = [
       # ... existing tasks ...
       "your_new_task",
   ]
   ```

   Task functions are discovered dynamically via `importlib`. If the function name matches a module in `runner/tasks/`, it is found automatically when referenced from a `task.yaml` file; no registry is needed.

3. **Create a task directory** in `dataset/tasks/`:

   ```
   dataset/tasks/your_task_id/
   └── task.yaml
   ```

   ```yaml
   # dataset/tasks/your_task_id/task.yaml
   func: your_new_task  # Must match the function name
   samples:
     - id: sample_001
       input: |
         Your prompt here.
       target: |
         Expected output or grading criteria.
   ```
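The `importlib`-based discovery described above can be sketched roughly as follows. This is a simplified illustration of the convention (module name matches the task function name), not the actual dash_evals implementation; the demo package layout is made up for the example:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def find_task_function(package: str, func_name: str):
    """Import `<package>.<func_name>` and return the function of that name.

    Because the module name must match the task function name, no
    explicit registry is needed: the name in task.yaml is enough.
    """
    module = importlib.import_module(f"{package}.{func_name}")
    return getattr(module, func_name)


if __name__ == "__main__":
    # Build a throwaway package on disk to demonstrate the lookup.
    root = Path(tempfile.mkdtemp())
    pkg = root / "demo_tasks"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "my_task.py").write_text("def my_task():\n    return 'found'\n")
    sys.path.insert(0, str(root))
    print(find_task_function("demo_tasks", "my_task")())  # prints: found
```

A lookup like this fails with `ModuleNotFoundError` when the function name has no matching module, which is the error to expect if a `task.yaml` references a misspelled `func`.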
1. **Run the test suite:**

   ```bash
   cd packages/dash_evals
   pytest
   ```

2. **Run linting:**

   ```bash
   ruff check src/dash_evals
   ruff format src/dash_evals
   ```

3. **Test your task end-to-end:**

   ```bash
   devals run test_my_sample --dry-run  # Validate the config
   devals run test_my_sample            # Run the actual evaluation
   ```
A Dart/Flutter application for exploring evaluation results, built with Serverpod.
> **Note**
> The eval_explorer is under active development. Contribution guidelines are coming soon!
The package is located in `packages/eval_explorer/` and consists of:

| Package | Description |
|---|---|
| `eval_explorer_client` | Dart client package |
| `eval_explorer_flutter` | Flutter web/mobile app |
| `eval_explorer_server` | Serverpod backend |
| `eval_explorer_shared` | Shared models |