feat(ci): add GitHub Actions workflow to run evals #893
Merged
Adds a `workflow_dispatch` workflow that runs AgentV evals in CI using GitHub Copilot CLI and GitHub Models. Runs from source (`bun apps/cli/dist/cli.js`) instead of installing `agentv` from npm. Closes #892. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
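A minimal sketch of what such a workflow could look like (the step layout, action versions, and the eval invocation are illustrative assumptions, not the actual file):

```yaml
# .github/workflows/evals.yml — illustrative sketch only
name: Evals
on:
  workflow_dispatch:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bun run build          # produces apps/cli/dist/cli.js
      - name: Run evals from source
        run: bun apps/cli/dist/cli.js evals/**/eval.yaml
```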
Deploying agentv with Cloudflare Pages

| Latest commit: | 3c29df3 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://6a92c979.agentv.pages.dev |
| Branch Preview URL: | https://feat-892-ci-evals.agentv.pages.dev |
Replace .env-only credentials with a proper .agentv/targets.yaml that sets GitHub Models as the default target via OpenAI provider. Remove Copilot CLI dependency — evals use the LLM target directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
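A sketch of what that default target might look like (key names are assumptions based on the commit message, not AgentV's documented schema, and the exact base URL path is uncertain — a later commit in this PR adjusts it):

```yaml
# .agentv/targets.yaml — illustrative only
targets:
  default:
    provider: openai
    base_url: https://models.github.ai/inference   # exact path is an assumption
    model: ${GH_MODELS_MODEL}
    api_key: ${GH_MODELS_TOKEN}
```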
Update .agentv/targets.yaml:
- Change default target to GitHub Models (openai provider, models.github.ai)
- Add copilot-cli and copilot-sdk targets using GH_MODELS_MODEL
- Keep existing pi, codex, gemini, openai, openrouter targets

Update evals workflow:
- Restore Copilot CLI install step
- Write .env with GH_MODELS_TOKEN/GH_MODELS_MODEL (targets.yaml references these)
- Remove inline targets.yaml generation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge examples/features/.agentv/ and examples/showcase/.agentv/ into the root .agentv/ directory. Adds all missing targets (azure, azure-llm, claude, claude-sdk, pi with tools, codex with cwd/log_dir). Per-eval .agentv folders are preserved for eval-specific overrides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Since bun install links the local workspace package, bunx agentv resolves to the source without needing a global npm install. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot CLI accepts --model to set the AI model. Use a separate COPILOT_MODEL env var (default: gpt-5-mini) for copilot-cli and copilot-sdk targets instead of reusing GH_MODELS_MODEL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
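In GitHub Actions terms, the split could look like this (the variable names follow the commit message; the `vars` fallback is an illustrative assumption):

```yaml
env:
  GH_MODELS_MODEL: ${{ vars.GH_MODELS_MODEL || 'gpt-5-mini' }}   # evals via GitHub Models
  COPILOT_MODEL: ${{ vars.COPILOT_MODEL || 'gpt-5-mini' }}       # passed to copilot --model
```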
Replace hardcoded model names in pi and pi-cli targets with ${{ OPENROUTER_MODEL }} env var. Default: openai/gpt-5.1-codex. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GitHub Models expects bare model names (gpt-5-mini), not the openai/gpt-5-mini format used by OpenRouter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The GH_MODELS_TOKEN secret appears invalid. Use GITHUB_TOKEN directly to diagnose — it has GitHub Models access by default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GITHUB_TOKEN alone doesn't have GitHub Models access. Restore the original fallback chain — users need to set GH_MODELS_TOKEN secret with a PAT that has GitHub Models permissions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Vercel AI SDK v3 defaults to the OpenAI Responses API (/responses), which isn't supported by third-party OpenAI-compatible endpoints like GitHub Models. Use openai.chat() instead of openai() when a custom base_url is configured to force /chat/completions. Also fix base_url to include /v1 suffix for GitHub Models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The GITHUB_TOKEN needs explicit models:read permission to access the GitHub Models inference API. Without it, all requests return 404. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
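In standard GitHub Actions syntax, that means declaring the permission at the workflow or job level:

```yaml
permissions:
  contents: read
  models: read   # grants GITHUB_TOKEN access to the GitHub Models inference API
```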
COPILOT_PAT (fine-grained PAT with Copilot permission) also has GitHub Models access. Use it as the primary token, falling back to GH_MODELS_TOKEN then GITHUB_TOKEN. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
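The fallback chain maps directly onto GitHub Actions expression syntax, where `||` returns the first truthy operand (secret names taken from the commit message):

```yaml
env:
  GH_MODELS_TOKEN: ${{ secrets.COPILOT_PAT || secrets.GH_MODELS_TOKEN || secrets.GITHUB_TOKEN }}
```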
bunx agentv downloads the published npm version, ignoring the locally built source. Use the dist path directly to run from the workspace build which includes the .chat() fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add examples/features/**/*.EVAL.yaml alongside evals/**/eval.yaml so the multi-provider-skill-trigger eval runs in CI automatically. Pattern priority: workflow_dispatch input > vars.EVAL_PATTERNS repo variable > hardcoded default. Patterns are passed unquoted so the shell splits them into separate positional args for the CLI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure targets.yaml:
- Add explicit "grader" target (GH Models) for LLM-as-judge scoring
- Keep "default" as alias so existing example evals still work
- All agent targets now reference grader_target: grader
- Organize targets into grader / agent / LLM sections

Update CI workflow:
- Default target changed to copilot-cli (agent with skill support)
- Add configurable --target input (override via vars.EVAL_TARGET)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
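The resulting shape might be roughly as follows (the `grader_target` field name comes from the commit message; the rest of the schema is an assumption):

```yaml
# .agentv/targets.yaml — illustrative structure only
targets:
  grader:              # LLM-as-judge scoring via GitHub Models
    provider: openai
    model: ${GH_MODELS_MODEL}
  default:             # alias kept so existing example evals still work
    provider: openai
    model: ${GH_MODELS_MODEL}
  copilot-cli:         # agent target with skill support
    provider: copilot-cli
    grader_target: grader
```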
Per-eval .agentv/targets.yaml files (e.g. agent-skills-evals uses echo provider) don't define copilot-cli. Use --targets to force the root targets.yaml so all evals use the same CI target configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mplate) csv-analyzer.EVAL.yaml expects a csv-analyzer skill but the workspace template only includes acme-deploy. Narrow the glob to specifically target multi-provider-skill-trigger.EVAL.yaml which has a proper workspace template with the required skill. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rated Comma-separated is more standard for list values. Patterns are split into separate positional args via bash array expansion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
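The splitting step can be sketched in bash (variable names are illustrative):

```shell
#!/usr/bin/env bash
# Split a comma-separated pattern list into an array, then expand the
# array so each pattern becomes its own positional argument.
patterns_csv="evals/**/eval.yaml,examples/features/**/*.EVAL.yaml"
IFS=',' read -ra patterns <<< "$patterns_csv"
# Quoted array expansion yields one word per pattern, no re-splitting:
printf '%s\n' "${patterns[@]}"
```

Quoting the expansion (`"${patterns[@]}"`) keeps patterns containing spaces intact while still producing separate arguments.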
Invert the logic: use .chat() (Chat Completions) by default since it is universally supported by all OpenAI-compatible endpoints. Only use the Responses API for actual api.openai.com, the only provider that supports /responses.

Verified:
- GH Models: /responses → 404, /chat/completions → 200
- Local evals with grader target: 3/3 at 1.000
- All 351 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
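The endpoint-selection rule can be sketched as a small predicate (the function name and structure are illustrative, not AgentV's actual code):

```typescript
// Only api.openai.com supports the /responses API; every other
// OpenAI-compatible endpoint, including GitHub Models, gets
// /chat/completions.
function useResponsesApi(baseUrl?: string): boolean {
  if (!baseUrl) return true; // no custom base_url => official OpenAI
  try {
    return new URL(baseUrl).hostname === "api.openai.com";
  } catch {
    return false; // unparsable base_url: fall back to Chat Completions
  }
}
```

With the Vercel AI SDK's OpenAI provider, a `false` result corresponds to constructing the model via `openai.chat(model)` instead of `openai(model)`.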
Summary

- `.github/workflows/evals.yml` — a `workflow_dispatch` workflow to run AgentV evals in CI
- Runs `bun apps/cli/dist/cli.js` (from source) instead of a globally installed `agentv`

Test plan
Closes #892
🤖 Generated with Claude Code