feat(ci): add GitHub Actions workflow to run evals #893
Merged
Adds a `workflow_dispatch` workflow that runs AgentV evals in CI using GitHub Copilot CLI and GitHub Models. Runs from source (`bun apps/cli/dist/cli.js`) instead of installing `agentv` from npm. Closes #892. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
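A minimal sketch of what such a workflow could look like (the step layout, action versions, and the eval invocation are illustrative assumptions, not the actual file):

```yaml
# .github/workflows/evals.yml — illustrative sketch only
name: Evals
on:
  workflow_dispatch:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bun run build          # produces apps/cli/dist/cli.js
      - name: Run evals from source
        run: bun apps/cli/dist/cli.js evals/**/eval.yaml
```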
Deploying agentv with Cloudflare Pages

| Latest commit: | 3c29df3 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://6a92c979.agentv.pages.dev |
| Branch Preview URL: | https://feat-892-ci-evals.agentv.pages.dev |
Replace .env-only credentials with a proper .agentv/targets.yaml that sets GitHub Models as the default target via OpenAI provider. Remove Copilot CLI dependency — evals use the LLM target directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
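A sketch of what that default target might look like (key names are assumptions based on the commit message, not AgentV's documented schema, and the exact base URL path is uncertain — a later commit in this PR adjusts it):

```yaml
# .agentv/targets.yaml — illustrative only
targets:
  default:
    provider: openai
    base_url: https://models.github.ai/inference   # exact path is an assumption
    model: ${GH_MODELS_MODEL}
    api_key: ${GH_MODELS_TOKEN}
```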
Update .agentv/targets.yaml:
- Change default target to GitHub Models (openai provider, models.github.ai)
- Add copilot-cli and copilot-sdk targets using GH_MODELS_MODEL
- Keep existing pi, codex, gemini, openai, openrouter targets

Update evals workflow:
- Restore Copilot CLI install step
- Write .env with GH_MODELS_TOKEN/GH_MODELS_MODEL (targets.yaml references these)
- Remove inline targets.yaml generation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge examples/features/.agentv/ and examples/showcase/.agentv/ into the root .agentv/ directory. Adds all missing targets (azure, azure-llm, claude, claude-sdk, pi with tools, codex with cwd/log_dir). Per-eval .agentv folders are preserved for eval-specific overrides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Since bun install links the local workspace package, bunx agentv resolves to the source without needing a global npm install. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot CLI accepts --model to set the AI model. Use a separate COPILOT_MODEL env var (default: gpt-5-mini) for copilot-cli and copilot-sdk targets instead of reusing GH_MODELS_MODEL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
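In GitHub Actions terms, the split could look like this (the variable names follow the commit message; the `vars` fallback is an illustrative assumption):

```yaml
env:
  GH_MODELS_MODEL: ${{ vars.GH_MODELS_MODEL || 'gpt-5-mini' }}   # evals via GitHub Models
  COPILOT_MODEL: ${{ vars.COPILOT_MODEL || 'gpt-5-mini' }}       # passed to copilot --model
```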
Replace hardcoded model names in pi and pi-cli targets with ${{ OPENROUTER_MODEL }} env var. Default: openai/gpt-5.1-codex. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GitHub Models expects bare model names (gpt-5-mini), not the openai/gpt-5-mini format used by OpenRouter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The GH_MODELS_TOKEN secret appears invalid. Use GITHUB_TOKEN directly to diagnose — it has GitHub Models access by default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GITHUB_TOKEN alone doesn't have GitHub Models access. Restore the original fallback chain — users need to set GH_MODELS_TOKEN secret with a PAT that has GitHub Models permissions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Vercel AI SDK v3 defaults to the OpenAI Responses API (/responses), which isn't supported by third-party OpenAI-compatible endpoints like GitHub Models. Use openai.chat() instead of openai() when a custom base_url is configured to force /chat/completions. Also fix base_url to include /v1 suffix for GitHub Models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The GITHUB_TOKEN needs explicit models:read permission to access the GitHub Models inference API. Without it, all requests return 404. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
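In standard GitHub Actions syntax, that means declaring the permission at the workflow or job level:

```yaml
permissions:
  contents: read
  models: read   # grants GITHUB_TOKEN access to the GitHub Models inference API
```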
COPILOT_PAT (fine-grained PAT with Copilot permission) also has GitHub Models access. Use it as the primary token, falling back to GH_MODELS_TOKEN then GITHUB_TOKEN. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
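The fallback chain maps directly onto GitHub Actions expression syntax, where `||` returns the first truthy operand (secret names taken from the commit message):

```yaml
env:
  GH_MODELS_TOKEN: ${{ secrets.COPILOT_PAT || secrets.GH_MODELS_TOKEN || secrets.GITHUB_TOKEN }}
```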
bunx agentv downloads the published npm version, ignoring the locally built source. Use the dist path directly to run from the workspace build which includes the .chat() fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add examples/features/**/*.EVAL.yaml alongside evals/**/eval.yaml so the multi-provider-skill-trigger eval runs in CI automatically. Pattern priority: workflow_dispatch input > vars.EVAL_PATTERNS repo variable > hardcoded default. Patterns are passed unquoted so the shell splits them into separate positional args for the CLI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure targets.yaml:
- Add explicit "grader" target (GH Models) for LLM-as-judge scoring
- Keep "default" as alias so existing example evals still work
- All agent targets now reference grader_target: grader
- Organize targets into grader / agent / LLM sections

Update CI workflow:
- Default target changed to copilot-cli (agent with skill support)
- Add configurable --target input (override via vars.EVAL_TARGET)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
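The resulting shape might be roughly as follows (the `grader_target` field name comes from the commit message; the rest of the schema is an assumption):

```yaml
# .agentv/targets.yaml — illustrative structure only
targets:
  grader:              # LLM-as-judge scoring via GitHub Models
    provider: openai
    model: ${GH_MODELS_MODEL}
  default:             # alias kept so existing example evals still work
    provider: openai
    model: ${GH_MODELS_MODEL}
  copilot-cli:         # agent target with skill support
    provider: copilot-cli
    grader_target: grader
```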
Per-eval .agentv/targets.yaml files (e.g. agent-skills-evals uses echo provider) don't define copilot-cli. Use --targets to force the root targets.yaml so all evals use the same CI target configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mplate) csv-analyzer.EVAL.yaml expects a csv-analyzer skill but the workspace template only includes acme-deploy. Narrow the glob to specifically target multi-provider-skill-trigger.EVAL.yaml which has a proper workspace template with the required skill. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rated Comma-separated is more standard for list values. Patterns are split into separate positional args via bash array expansion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
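The splitting step can be sketched in bash (variable names are illustrative):

```shell
#!/usr/bin/env bash
# Split a comma-separated pattern list into an array, then expand the
# array so each pattern becomes its own positional argument.
patterns_csv="evals/**/eval.yaml,examples/features/**/*.EVAL.yaml"
IFS=',' read -ra patterns <<< "$patterns_csv"
# Quoted array expansion yields one word per pattern, no re-splitting:
printf '%s\n' "${patterns[@]}"
```

Quoting the expansion (`"${patterns[@]}"`) keeps patterns containing spaces intact while still producing separate arguments.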
Invert the logic: use .chat() (Chat Completions) by default since it is universally supported by all OpenAI-compatible endpoints. Only use the Responses API for actual api.openai.com, the only provider that supports /responses.

Verified:
- GH Models: /responses → 404, /chat/completions → 200
- Local evals with grader target: 3/3 at 1.000
- All 351 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
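The endpoint-selection rule can be sketched as a small predicate (the function name and structure are illustrative, not AgentV's actual code):

```typescript
// Only api.openai.com supports the /responses API; every other
// OpenAI-compatible endpoint, including GitHub Models, gets
// /chat/completions.
function useResponsesApi(baseUrl?: string): boolean {
  if (!baseUrl) return true; // no custom base_url => official OpenAI
  try {
    return new URL(baseUrl).hostname === "api.openai.com";
  } catch {
    return false; // unparsable base_url: fall back to Chat Completions
  }
}
```

With the Vercel AI SDK's OpenAI provider, a `false` result corresponds to constructing the model via `openai.chat(model)` instead of `openai(model)`.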
Summary

- `.github/workflows/evals.yml` — a `workflow_dispatch` workflow to run AgentV evals in CI
- Runs `bun apps/cli/dist/cli.js` (from source) instead of a globally installed `agentv`

Test plan
Closes #892
🤖 Generated with Claude Code