
feat(ci): add GitHub Actions workflow to run evals #893

Merged
christso merged 21 commits into main from feat/892-ci-evals
Apr 1, 2026

Conversation


@christso christso commented Apr 1, 2026

Summary

  • Adds .github/workflows/evals.yml — a workflow_dispatch workflow to run AgentV evals in CI
  • Uses bun apps/cli/dist/cli.js (from source) instead of globally installed agentv
  • Configures GitHub Copilot CLI as the agent and GitHub Models for LLM inference
  • Publishes JUnit results, uploads artifacts, and enforces score threshold
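
For orientation, a minimal sketch of what such a workflow could look like. Step names, action versions, and CLI flags here are illustrative assumptions, not copied from this PR:

```yaml
# Hypothetical sketch of .github/workflows/evals.yml.
name: Evals
on:
  workflow_dispatch:
    inputs:
      patterns:
        description: Eval file patterns (comma-separated)
        required: false
jobs:
  evals:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      models: read        # GitHub Models inference access
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bun run build
      # Run from the workspace build rather than the published npm package
      - run: bun apps/cli/dist/cli.js run --junit results.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-results
          path: results.xml
```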

Test plan

  • Trigger workflow manually from Actions tab with default inputs
  • Verify JUnit results appear in the test reporter
  • Verify artifacts are uploaded
  • Verify threshold enforcement (exit 1 on failure)

Closes #892

🤖 Generated with Claude Code

Adds a workflow_dispatch workflow that runs AgentV evals in CI using
GitHub Copilot CLI and GitHub Models. Runs from source (bun apps/cli/dist/cli.js)
instead of installing agentv from npm.

Closes #892

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cloudflare-workers-and-pages bot commented Apr 1, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 3c29df3
Status: ✅  Deploy successful!
Preview URL: https://6a92c979.agentv.pages.dev
Branch Preview URL: https://feat-892-ci-evals.agentv.pages.dev


christso and others added 20 commits April 1, 2026 03:52
Replace .env-only credentials with a proper .agentv/targets.yaml
that sets GitHub Models as the default target via OpenAI provider.
Remove Copilot CLI dependency — evals use the LLM target directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update .agentv/targets.yaml:
- Change default target to GitHub Models (openai provider, models.github.ai)
- Add copilot-cli and copilot-sdk targets using GH_MODELS_MODEL
- Keep existing pi, codex, gemini, openai, openrouter targets

Update evals workflow:
- Restore Copilot CLI install step
- Write .env with GH_MODELS_TOKEN/GH_MODELS_MODEL (targets.yaml references these)
- Remove inline targets.yaml generation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge examples/features/.agentv/ and examples/showcase/.agentv/ into
the root .agentv/ directory. Adds all missing targets (azure, azure-llm,
claude, claude-sdk, pi with tools, codex with cwd/log_dir).

Per-eval .agentv folders are preserved for eval-specific overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Since bun install links the local workspace package, bunx agentv
resolves to the source without needing a global npm install.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot CLI accepts --model to set the AI model. Use a separate
COPILOT_MODEL env var (default: gpt-5-mini) for copilot-cli and
copilot-sdk targets instead of reusing GH_MODELS_MODEL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded model names in pi and pi-cli targets with
${{ OPENROUTER_MODEL }} env var. Default: openai/gpt-5.1-codex.
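
A targets.yaml fragment illustrating the interpolation described above. The surrounding field names are assumptions about AgentV's config schema; only the `${{ OPENROUTER_MODEL }}` syntax comes from the commit:

```yaml
# Hypothetical fragment of .agentv/targets.yaml
targets:
  pi:
    provider: openrouter
    model: ${{ OPENROUTER_MODEL }}   # falls back to openai/gpt-5.1-codex when unset
```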

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GitHub Models expects bare model names (gpt-5-mini), not the
openai/gpt-5-mini format used by OpenRouter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The GH_MODELS_TOKEN secret appears invalid. Use GITHUB_TOKEN
directly to diagnose — it has GitHub Models access by default.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GITHUB_TOKEN alone doesn't have GitHub Models access. Restore the
original fallback chain — users need to set GH_MODELS_TOKEN secret
with a PAT that has GitHub Models permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Vercel AI SDK v3 defaults to the OpenAI Responses API (/responses),
which isn't supported by third-party OpenAI-compatible endpoints like
GitHub Models. Use openai.chat() instead of openai() when a custom
base_url is configured to force /chat/completions.

Also fix base_url to include /v1 suffix for GitHub Models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The GITHUB_TOKEN needs explicit models:read permission to access
the GitHub Models inference API. Without it, all requests return 404.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
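
The permission fix described above amounts to a small addition at the job or workflow level (a standard GitHub Actions `permissions` block; the `contents: read` line is an assumed companion, not from the commit):

```yaml
permissions:
  contents: read
  models: read   # required for the GitHub Models inference API; without it requests 404
```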
COPILOT_PAT (fine-grained PAT with Copilot permission) also has
GitHub Models access. Use it as the primary token, falling back
to GH_MODELS_TOKEN then GITHUB_TOKEN.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
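
The fallback chain above can be expressed with nested `${var:-default}` parameter expansion when writing the `.env` file. This is a sketch with placeholder token values; the actual workflow step may differ:

```shell
# Pick the first non-empty token: COPILOT_PAT > GH_MODELS_TOKEN > GITHUB_TOKEN.
# Placeholder values for illustration only.
COPILOT_PAT=""
GH_MODELS_TOKEN=""
GITHUB_TOKEN="ghs_placeholder"

TOKEN="${COPILOT_PAT:-${GH_MODELS_TOKEN:-$GITHUB_TOKEN}}"
echo "GH_MODELS_TOKEN=$TOKEN" >> .env
```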
bunx agentv downloads the published npm version, ignoring the
locally built source. Use the dist path directly to run from
the workspace build which includes the .chat() fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add examples/features/**/*.EVAL.yaml alongside evals/**/eval.yaml so
the multi-provider-skill-trigger eval runs in CI automatically.

Pattern priority: workflow_dispatch input > vars.EVAL_PATTERNS repo
variable > hardcoded default. Patterns are passed unquoted so the
shell splits them into separate positional args for the CLI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
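
The priority chain described above maps naturally onto the `||` short-circuit in a GitHub Actions expression, since empty inputs and unset variables are falsy. A sketch (the input name `patterns` is an assumption; `vars.EVAL_PATTERNS` is from the commit):

```yaml
- name: Run evals
  env:
    PATTERNS: ${{ inputs.patterns || vars.EVAL_PATTERNS || 'evals/**/eval.yaml' }}
  # $PATTERNS left unquoted so the shell word-splits it into separate args
  run: bun apps/cli/dist/cli.js run $PATTERNS
```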
Restructure targets.yaml:
- Add explicit "grader" target (GH Models) for LLM-as-judge scoring
- Keep "default" as alias so existing example evals still work
- All agent targets now reference grader_target: grader
- Organize targets into grader / agent / LLM sections

Update CI workflow:
- Default target changed to copilot-cli (agent with skill support)
- Add configurable --target input (override via vars.EVAL_TARGET)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per-eval .agentv/targets.yaml files (e.g. agent-skills-evals uses echo
provider) don't define copilot-cli. Use --targets to force the root
targets.yaml so all evals use the same CI target configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mplate)

csv-analyzer.EVAL.yaml expects a csv-analyzer skill but the workspace
template only includes acme-deploy. Narrow the glob to specifically
target multi-provider-skill-trigger.EVAL.yaml which has a proper
workspace template with the required skill.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rated

Comma-separated is more standard for list values. Patterns are split
into separate positional args via bash array expansion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
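
The comma-to-args split described above can be done with a bash array, without triggering glob expansion on the patterns themselves. A sketch with example values:

```shell
# Split a comma-separated PATTERNS value into separate positional args.
PATTERNS="evals/**/eval.yaml,examples/features/**/*.EVAL.yaml"
IFS=',' read -r -a pattern_args <<< "$PATTERNS"

# Each element becomes its own argument, e.g.:
#   bun apps/cli/dist/cli.js run "${pattern_args[@]}"
echo "${#pattern_args[@]}"
```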
Invert the logic: use .chat() (Chat Completions) by default since it
is universally supported by all OpenAI-compatible endpoints. Only use
the Responses API for actual api.openai.com, which is the only
provider that supports /responses.

Verified:
- GH Models: /responses → 404, /chat/completions → 200
- Local evals with grader target: 3/3 at 1.000
- All 351 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
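
The inverted dispatch can be reduced to a base-URL check. This sketch mocks the decision only (the real code would call the AI SDK's `openai.chat()` or `openai()` accordingly); the function name is hypothetical:

```typescript
type Api = "responses" | "chat";

// Use Chat Completions everywhere except the first-party OpenAI endpoint,
// which is the only provider here that supports /responses.
function pickApi(baseUrl?: string): Api {
  const isOfficialOpenAI =
    !baseUrl || new URL(baseUrl).hostname === "api.openai.com";
  return isOfficialOpenAI ? "responses" : "chat";
}

console.log(pickApi("https://models.github.ai/inference/v1")); // chat
console.log(pickApi("https://api.openai.com/v1")); // responses
```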
@christso christso merged commit 5a24352 into main Apr 1, 2026
4 checks passed
@christso christso deleted the feat/892-ci-evals branch April 1, 2026 08:56