dash_evals

Python package for running LLM evaluations on Dart and Flutter tasks using Inspect AI. The package lives in packages/dash_evals/.

For setup instructions, see the Quick Start or Contributing Guide.


Available Tasks

Task               Description
question_answer    Q&A evaluation for Dart/Flutter knowledge
bug_fix            Agentic debugging of code in a sandbox
flutter_bug_fix    Flutter-specific bug fix (wraps bug_fix)
code_gen           Generate code from specifications
flutter_code_gen   Flutter-specific code gen (wraps code_gen)
mcp_tool           Evaluate MCP tool usage (pub.dev search, project creation, etc.)
analyze_codebase   Evaluate codebase analysis and comprehension
skill_test         Evaluate use of skill files in a sandbox

Architecture

src/dash_evals/
├── main.py              # CLI entry point (dual-mode)
├── runner/
│   ├── json_runner.py   # Mode 1: run from EvalSet JSON manifest
│   ├── args_runner.py   # Mode 2: run from direct CLI arguments
│   ├── tasks/           # @task functions (question_answer, bug_fix, code_gen, etc.)
│   ├── scorers/         # Scoring logic (model_graded, dart_analyze, flutter_test, etc.)
│   ├── solvers/         # Solver chains (context injection, system messages)
│   └── sandboxes/       # Sandbox environments (podman)
├── models/              # Data models
└── utils/               # Logging and helpers

Data Flow

  1. Configure: The Dart dataset_config_dart package parses dataset YAML and resolves it into an EvalSet JSON manifest
  2. Load: The Python runner reads the JSON manifest via json_runner.py, resolving task functions dynamically with importlib
  3. Hydrate: Config dicts are converted to Inspect AI objects (datasets, MCP servers, skills) using shared helpers from dataset_config_python.hydrate
  4. Execute: Each task function receives its dataset and task definition, producing an inspect_ai.Task
  5. Score: Scorers evaluate model outputs against targets
  6. Log: Results are written to the configured log_dir
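
The dynamic lookup in step 2 can be illustrated with a small sketch. It assumes a hypothetical "module:function" reference format and a hypothetical helper name, resolve_callable; the actual manifest fields and the logic in json_runner.py may differ.

```python
import importlib
from typing import Any, Callable


def resolve_callable(ref: str) -> Callable[..., Any]:
    """Resolve a 'package.module:function' reference to a callable.

    The 'module:function' format is an assumption made for this sketch,
    not necessarily what json_runner.py uses.
    """
    module_name, sep, attr = ref.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {ref!r}")

    # import_module raises ModuleNotFoundError if the module does not exist.
    module = importlib.import_module(module_name)

    try:
        func = getattr(module, attr)
    except AttributeError as exc:
        raise LookupError(f"{module_name!r} has no attribute {attr!r}") from exc

    if not callable(func):
        raise TypeError(f"{ref!r} is not callable")
    return func


# Demonstrated with a standard-library function so the sketch runs anywhere.
dumps = resolve_callable("json:dumps")
print(dumps({"task": "bug_fix"}))
```

Resolving by name keeps the runner decoupled from the task modules: a new task only has to be importable, and a misspelled name fails at load time with a clear error instead of partway through a run.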

Alternatively, the runner can be invoked directly with --task and --model arguments (via args_runner.py), bypassing the Dart config pipeline.
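
The two invocation modes described above could be dispatched with a mutually exclusive argument group. This is only a sketch of the idea, not the contents of main.py: the flag names come from the usage examples in this README, while build_parser, select_mode, the manifest destination name, and the rule that --task requires --model are assumptions. The --sandbox option is left out because its argument shape is not specified here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run-evals")

    # Exactly one of --json / --task must be given (assumed behaviour).
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--json", dest="manifest", help="path to an EvalSet JSON manifest")
    mode.add_argument("--task", help="name of a single task to run")

    parser.add_argument("--model", help="model identifier, e.g. provider/name")
    parser.add_argument("--dataset", help="path to a dataset file")
    parser.add_argument("--log-dir", help="directory for result logs")
    parser.add_argument("--max-connections", type=int, help="connection limit")
    return parser


def select_mode(argv: list[str]) -> str:
    """Return 'json' or 'args' depending on which mode the arguments select."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.manifest is not None:
        return "json"  # would hand off to json_runner.py
    if args.model is None:
        # Assumption: direct mode needs a model; parser.error exits with status 2.
        parser.error("--model is required with --task")
    return "args"  # would hand off to args_runner.py


print(select_mode(["--json", "./eval_set.json"]))
print(select_mode(["--task", "bug_fix", "--model", "openai/gpt-4o"]))
```

Using a required mutually exclusive group lets argparse reject ambiguous invocations (both flags, or neither) before any runner code executes.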


Usage

# Mode 1: Run from JSON manifest (produced by Dart CLI)
run-evals --json ./eval_set.json

# Mode 2: Run a single task directly
run-evals --task flutter_code_gen --model google/gemini-2.5-flash --dataset samples.jsonl

# Additional options (both modes)
run-evals --task bug_fix --model openai/gpt-4o \
  --log-dir ./logs \
  --sandbox podman compose.yaml \
  --max-connections 10

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=dash_evals

# Run a specific test file
pytest tests/test_parsers.py -v

Linting

# Lint, then format, with ruff
ruff check src/dash_evals
ruff format src/dash_evals