test: add eval framework for installer agent testing #36
Merged
Introduces a structured evaluation system to validate the WorkOS installer agent against framework fixtures. Phase 1 includes:
- Core types and interfaces for grading results
- File and build graders with pattern matching
- Next.js-specific grader checking AuthKit integration
- Fixture manager for temp dir setup/cleanup
- Eval runner orchestrating fixture → agent → grade flow
- CLI entry point with --framework and --verbose flags
- Minimal Next.js 14 App Router fixture

The agent executor is stubbed to validate framework structure first. Run with: pnpm eval

Add CLI with filtering (--framework, --state, --json), matrix reporter, and graders for all 5 frameworks. Create fixtures for fresh, existing, and existing-auth0 states across Next.js, React SPA, React Router, TanStack Start, and Vanilla JS.
- Add history.ts for results persistence with compare functionality
- Extend CLI with --debug, --keep-on-fail, --retry, --no-retry flags
- Add history and compare subcommands (pnpm eval:history, eval:compare)
- Implement retry loop in runner for handling LLM non-determinism
- Add verbose failure output with expected/actual values
- Create README documentation for eval framework usage

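The retry loop mentioned above could look roughly like the following. This is a minimal sketch, not the runner from this PR: `runWithRetry`, `AttemptResult`, and `RetryOutcome` are hypothetical names, and the choice to treat a thrown error as a failed attempt is an assumption.

```typescript
// Hypothetical result shapes; the real types in the PR may differ.
interface AttemptResult {
  passed: boolean;
  detail?: string;
}

interface RetryOutcome {
  passed: boolean;
  attempts: AttemptResult[];
}

// Runs `attempt` until it passes or `maxAttempts` is reached.
// Every attempt is kept so failed tries can be inspected afterwards.
async function runWithRetry(
  attempt: (n: number) => Promise<AttemptResult>,
  maxAttempts: number,
): Promise<RetryOutcome> {
  const attempts: AttemptResult[] = [];
  for (let n = 1; n <= maxAttempts; n++) {
    let result: AttemptResult;
    try {
      result = await attempt(n);
    } catch (err) {
      // Assumption: a thrown error counts as a failed attempt, not a fatal error.
      result = { passed: false, detail: String(err) };
    }
    attempts.push(result);
    if (result.passed) {
      return { passed: true, attempts };
    }
  }
  return { passed: false, attempts };
}
```

Keeping the full list of attempts, rather than only the last one, is what makes it possible to tell a flaky pass apart from a clean pass in later reports.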
Replace stub implementation with real agent execution:
- Add env-loader for credentials from .env.local
- Configure SDK with direct auth mode (bypasses gateway)
- Capture tool calls and output from message stream
- Add ToolCall interface to types

- Use glob + content matching for callback route (path is configurable)
- Remove process.env.WORKOS_ check (SDK abstracts env access)
- Add checkFileWithPattern helper for flexible file discovery

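To illustrate the "glob + content matching" idea, here is a small sketch under stated assumptions. It is not the `checkFileWithPattern` helper from this PR (its signature is not shown here): `globToRegExp` and `checkFilesWithPattern` are hypothetical, the glob support is limited to `*` and `**`, and files are passed in memory instead of being read from disk.

```typescript
interface SourceFile {
  path: string; // forward-slash path relative to the project root
  text: string; // file contents
}

interface FileCheck {
  passed: boolean;
  matchedPath?: string;
}

// Converts a simple glob into a RegExp. Only `**/`, `**`, and `*` are supported.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.slice(i, i + 3) === "**/") {
      source += "(?:.*/)?"; // zero or more directories
      i += 3;
    } else if (glob.slice(i, i + 2) === "**") {
      source += ".*"; // anything, including slashes
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // any characters within one path segment
      i += 1;
    } else {
      // Escape characters that are special in regular expressions.
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Passes if any file matches the glob AND contains the content pattern.
// Avoid the `g` flag on `content`, since it makes RegExp.test stateful.
function checkFilesWithPattern(
  files: SourceFile[],
  glob: string,
  content: RegExp,
): FileCheck {
  const pathPattern = globToRegExp(glob);
  for (const file of files) {
    if (pathPattern.test(file.path) && content.test(file.text)) {
      return { passed: true, matchedPath: file.path };
    }
  }
  return { passed: false };
}
```

Matching on content instead of a fixed path means the check still passes when the callback route lives at a different, user-configured location, as long as some matching file contains the expected call.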
Example run:
> tsx tests/evals/index.ts --framework=nextjs --verbose --keepOnFail --state=fresh
Running: nextjs/fresh
Installing dependencies...
[dotenv@17.2.3] injecting env (2) from .env.local -- tip: ⚙️ load multiple .env files with { path: ['.env.local', '.env'] }
Initializing agent for nextjs...
Tool: Skill
Agent: I'll start by fetching the SDK documentation from the README, which is the source of truth....
Tool: WebFetch
Agent: Now let me validate the project structure and check the existing environment variables....
Tool: Glob
Tool: Read
Tool: Read
Tool: Glob
Agent: Let me also check the existing app structure:...
Tool: Glob
Tool: Read
Tool: Read
Tool: TodoWrite
Agent: Now let me proceed with the integration. First, I'll add the missing environment variable and instal...
Tool: Edit
Tool: Bash
Agent: Let me check if there's a pnpm or yarn lockfile to determine the correct package manager:...
Tool: Bash
Tool: Bash
Tool: Bash
Tool: TodoWrite
Agent: Now let me verify the SDK is installed and create the middleware. Since this is Next.js 14, I'll cre...
Tool: Bash
Agent: Now let me create the middleware, callback route, update layout, and update the page:...
Tool: Write
Tool: Bash
Tool: Write
Tool: TodoWrite
Agent: Now let me update the layout with AuthKitProvider and the page with auth UI:...
Tool: Write
Tool: Write
Tool: TodoWrite
Agent: Now let me run the verification checklist and build:...
Tool: Bash
Tool: Bash
Tool: Bash
Agent: All verification checks pass. Now let's run the build:...
Tool: Bash
Agent: Let me check the correct import paths from the SDK:...
Tool: Bash
Tool: Bash
Tool: Bash
Agent: I see - the SDK doesn't have `getUser` and `AuthKitProvider` in the main exports. Let me check the c...
Tool: Bash
Tool: Bash
Agent: I see - `AuthKitProvider` is in `@workos-inc/authkit-nextjs/components`, and I need to use `withAuth...
Tool: Bash
Tool: Bash
Tool: Bash
Agent: Now I understand. The SDK uses `withAuth` instead of `getUser`. Let me update the files with the cor...
Tool: Write
Tool: Write
Agent: Now let's run the build again:...
Tool: Bash
Agent: The build succeeded. Let me update the todo list to show completion:...
Tool: TodoWrite
Agent: I've successfully integrated WorkOS AuthKit into your Next.js application. Here's a summary of what ...
✓ PASSED
Results saved to: /Users/nicknisi/Developer/cli/prompt-improvements/tests/eval-results/2026-02-01T05-39-28-884Z.json
┌─────────────────┬─────────┬──────────┬───────────────┐
│ Framework       │ Fresh   │ Existing │ Existing+Auth │
├─────────────────┼─────────┼──────────┼───────────────┤
│ nextjs          │ ✓       │ -        │ -             │
│ react           │ -       │ -        │ -             │
│ react-router    │ -       │ -        │ -             │
│ tanstack-start  │ -       │ -        │ -             │
│ vanilla-js      │ -       │ -        │ -             │
└─────────────────┴─────────┴──────────┴───────────────┘
Results: 1/1 passed (100.0%)
pnpm eval --framework=nextjs --verbose --keepOnFail --state=fresh 66.79s user 23.09s system 32% cpu 4:39.04 total
- Grader: support src/ directory (v1.132+) in addition to app/
- Grader: check for authkitMiddleware instead of createServerFn
- Grader: fix package name to @workos/authkit-tanstack-react-start
- Grader: remove AuthKitProvider requirement (optional for server-only)
- Grader: support both flat and nested route patterns for callback
- Skill: add directory detection guidance (src/ vs app/)
- Skill: fix handleAuth() → handleCallbackRoute()
- Skill: add SDK exports reference section

- Remove callback component check (SDK handles OAuth internally)
- Use glob pattern to find useAuth anywhere in src/**/*.tsx
- Support both Vite (main.tsx) and CRA (index.tsx) entry points
- Add comprehensive header documenting SDK patterns

- Fix package name: @workos-inc/authkit-react-router (was @workos-inc/authkit)
- Use glob patterns instead of hardcoded file paths
- Check for authLoader in callback routes (flexible location)
- Check for authkitLoader in route files for auth state
- Remove unnecessary ProtectedRoute.tsx/auth.ts checks (SDK has ensureSignedIn)
- Support both app/ and src/ directory structures

- Remove callback.html/callback.js checks (SDK handles OAuth internally)
- Remove auth.js with getAuthorizationUrl (old pattern)
- Check for createClient from @workos-inc/authkit-js or CDN WorkOS.createClient
- Check for auth methods (signIn, signOut, getUser, getAccessToken)
- Support both bundled (ESM import) and CDN (script tag) patterns

Phase 1: Parallel execution infrastructure
- Add ParallelRunner with p-limit concurrency control
- Auto-detect concurrency based on CPU/memory
- Graceful shutdown with fixture cleanup on SIGINT/SIGTERM

Phase 2: Event-driven live dashboard
- Add EvalEventEmitter for scenario lifecycle events
- Create Ink/React dashboard with real-time status updates
- TTY detection: dashboard for interactive, logging for CI/pipes
- Add --no-dashboard flag to disable live UI

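The concurrency control in Phase 1 can be pictured with a dependency-free sketch. The PR uses p-limit; the hand-rolled limiter below stands in for it only to keep the example self-contained, and `runWithConcurrency` is a hypothetical name, not the ParallelRunner API.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as `items`.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) {
    lanes.push(lane());
  }
  await Promise.all(lanes);
  return results;
}
```

Because JavaScript is single-threaded, claiming an index with `next++` needs no locking; the limit simply equals the number of lanes started.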
Add debugging export layer that writes detailed JSON logs during eval runs:
- LogWriter subscribes to eval events and writes incrementally
- Includes all retry attempts, tool calls, and agent output (truncated at 10KB)
- New CLI commands: `pnpm eval:logs` to list, `pnpm eval:show` to view
- Survives interrupts by writing after each scenario completes

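As a rough illustration of the output truncation, the sketch below caps a string and notes how much was dropped. It is an assumption-laden stand-in: `truncateOutput` is a hypothetical name, the marker text is invented, and the 10KB limit is approximated as 10,240 UTF-16 code units instead of bytes.

```typescript
// Assumption: 10KB approximated as 10,240 UTF-16 code units, not bytes.
const MAX_OUTPUT_CHARS = 10 * 1024;

// Returns `output` unchanged if it fits, otherwise the first `max`
// characters followed by a marker stating how much was removed.
function truncateOutput(output: string, max: number = MAX_OUTPUT_CHARS): string {
  if (output.length <= max) {
    return output;
  }
  const dropped = output.length - max;
  return `${output.slice(0, max)}\n[truncated ${dropped} characters]`;
}
```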
Prefix verbose console output with [framework/state] labels so parallel execution logs are easier to follow.
Switch from pinned semver ranges to `latest` tag for tanstack packages. This ensures evals test against what users actually install, surfacing upstream breakage as signal rather than hiding it behind old versions. Note: tanstack-start is currently broken upstream (incompatible internal deps). Tests will auto-heal when they publish a fix.
…ecture
TanStack Start moved from vinxi to pure Vite. Old fixtures used the deprecated vinxi-based structure which no longer works with latest.

Changes:
- Replace vinxi scripts with vite dev/build
- Replace @tanstack/start with @tanstack/react-start
- Move app/ to src/ directory structure
- Add vite.config.ts with tanstackStart plugin
- Update to React 19
- Add .gitignore for fixture artifacts

All three fixtures now build successfully with latest TanStack versions.

Remove redundant `fresh` fixture variants since they provided no meaningful test coverage difference from `existing` fixtures. Renamed: existing → example, existing-auth0 → example-auth0 This reduces the test matrix from 5×3=15 to 5×2=10 scenarios while maintaining coverage for both greenfield installs and Auth0 migrations.
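The 5×2 matrix described above can be enumerated with a short sketch. The framework and state names come from this PR; `buildScenarios` and the `ScenarioFilter` shape are hypothetical, loosely mirroring the --framework and --state CLI filters.

```typescript
const FRAMEWORKS = ["nextjs", "react", "react-router", "tanstack-start", "vanilla-js"];
const STATES = ["example", "example-auth0"];

interface Scenario {
  framework: string;
  state: string;
}

// Hypothetical filter shape, loosely mirroring --framework and --state.
interface ScenarioFilter {
  framework?: string;
  state?: string;
}

// Builds every framework/state pair, skipping pairs excluded by the filter.
function buildScenarios(
  frameworks: string[],
  states: string[],
  filter: ScenarioFilter = {},
): Scenario[] {
  const scenarios: Scenario[] = [];
  for (const framework of frameworks) {
    if (filter.framework && filter.framework !== framework) continue;
    for (const state of states) {
      if (filter.state && filter.state !== state) continue;
      scenarios.push({ framework, state });
    }
  }
  return scenarios;
}
```

With no filter this yields the 10 scenarios mentioned above; passing `{ framework: "nextjs" }` narrows it to that framework's two states.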
The grader pattern only checked routes/**/*.tsx but in React Router v7 Framework mode, authkitLoader belongs in app/root.tsx (the root layout). Updated pattern to match both root.tsx and routes/**/*.
Summary
Why
Need automated testing to validate installer agent behavior across different project configurations before releases.
Notes
- pnpm eval --framework=nextjs --state=fresh
- ANTHROPIC_API_KEY in .env.local