@nicknisi nicknisi commented Feb 1, 2026

Summary

  • Add comprehensive eval framework with fixtures for all 5 frameworks × 3 states (15 scenarios)
  • Wire AgentExecutor to use Claude Agent SDK directly with direct auth mode
  • Add retry logic, debug tooling, and history tracking for eval runs

Why

We need automated tests that validate installer agent behavior across different project configurations before each release.

Notes

  • Run evals with `pnpm eval --framework=nextjs --state=fresh`
  • Requires `ANTHROPIC_API_KEY` in `.env.local`
  • Fixtures cover fresh projects, existing apps, and apps with competing auth (Auth0)

Introduces a structured evaluation system to validate the WorkOS installer
agent against framework fixtures. Phase 1 includes:

- Core types and interfaces for grading results
- File and build graders with pattern matching
- Next.js-specific grader checking AuthKit integration
- Fixture manager for temp dir setup/cleanup
- Eval runner orchestrating fixture → agent → grade flow
- CLI entry point with --framework and --verbose flags
- Minimal Next.js 14 App Router fixture

The agent executor is stubbed to validate framework structure first.
Run with: pnpm eval
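
The fixture → agent → grade flow described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `Scenario`, `GradeCheck`, `EvalSteps`, and `runScenario` are hypothetical names, and the agent step is passed in as a function so a stub can stand in for it.

```typescript
// Hypothetical shapes for illustration; the real types in the PR may differ.
interface Scenario {
  framework: string;
  state: string;
}

interface GradeCheck {
  name: string;
  passed: boolean;
  expected?: string;
  actual?: string;
}

interface GradeResult {
  passed: boolean;
  checks: GradeCheck[];
}

// Each step is injected, so the agent can be stubbed while the structure is validated.
interface EvalSteps {
  setupFixture(scenario: Scenario): Promise<string>; // resolves to a temp working dir
  runAgent(workDir: string, scenario: Scenario): Promise<void>;
  grade(workDir: string, scenario: Scenario): Promise<GradeCheck[]>;
  cleanup(workDir: string): Promise<void>;
}

async function runScenario(scenario: Scenario, steps: EvalSteps): Promise<GradeResult> {
  const workDir = await steps.setupFixture(scenario);
  try {
    await steps.runAgent(workDir, scenario);
    const checks = await steps.grade(workDir, scenario);
    // A scenario with no checks is treated as a failure rather than a vacuous pass.
    const passed = checks.length > 0 && checks.every((check) => check.passed);
    return { passed, checks };
  } finally {
    // Always remove the temp dir, even when the agent or a grader throws.
    await steps.cleanup(workDir);
  }
}
```

Putting cleanup in `finally` keeps temp directories from accumulating when a run throws partway through.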

Add CLI with filtering (--framework, --state, --json), matrix reporter,
and graders for all 5 frameworks. Create fixtures for fresh, existing,
and existing-auth0 states across Next.js, React SPA, React Router,
TanStack Start, and Vanilla JS.
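
As a rough sketch of how the `--framework`, `--state`, and `--json` filters might narrow the scenario matrix, the example below uses Node's built-in `parseArgs`. The `selectScenarios` helper is hypothetical, and the PR's CLI may parse arguments differently; the framework and state identifiers are the ones that appear elsewhere in this PR.

```typescript
import { parseArgs } from "node:util";

const FRAMEWORKS = ["nextjs", "react", "react-router", "tanstack-start", "vanilla-js"];
const STATES = ["fresh", "existing", "existing-auth0"];

interface Scenario {
  framework: string;
  state: string;
}

// Hypothetical helper: builds the full matrix, then applies the optional filters.
function selectScenarios(argv: string[]): { scenarios: Scenario[]; json: boolean } {
  const { values } = parseArgs({
    args: argv,
    options: {
      framework: { type: "string" },
      state: { type: "string" },
      json: { type: "boolean" },
    },
  });

  const scenarios = FRAMEWORKS.flatMap((framework) =>
    STATES.map((state) => ({ framework, state })),
  )
    .filter((s) => !values.framework || s.framework === values.framework)
    .filter((s) => !values.state || s.state === values.state);

  return { scenarios, json: Boolean(values.json) };
}
```

With no flags this yields all 15 scenarios; `--framework=nextjs --state=fresh` narrows it to one.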

- Add history.ts for results persistence with compare functionality
- Extend CLI with --debug, --keep-on-fail, --retry, --no-retry flags
- Add history and compare subcommands (pnpm eval:history, eval:compare)
- Implement retry loop in runner for handling LLM non-determinism
- Add verbose failure output with expected/actual values
- Create README documentation for eval framework usage
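
The retry loop for LLM non-determinism could look something like the sketch below. `runWithRetry` and `AttemptResult` are hypothetical names, and the real runner may record more per attempt; the idea shown is simply "rerun until one attempt passes or the budget runs out, keeping every attempt for later inspection".

```typescript
interface AttemptResult {
  passed: boolean;
}

// Hypothetical helper: always runs at least once, then retries while attempts fail.
async function runWithRetry<T extends AttemptResult>(
  attempt: (attemptNumber: number) => Promise<T>,
  maxAttempts: number,
): Promise<{ result: T; attempts: T[] }> {
  let result = await attempt(1);
  const attempts: T[] = [result];

  for (let n = 2; n <= maxAttempts && !result.passed; n++) {
    result = await attempt(n);
    attempts.push(result);
  }

  // `result` is the first passing attempt, or the last failing one.
  return { result, attempts };
}
```

Under this sketch, a `--no-retry` style flag would map to `maxAttempts = 1`.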

Replace stub implementation with real agent execution:
- Add env-loader for credentials from .env.local
- Configure SDK with direct auth mode (bypasses gateway)
- Capture tool calls and output from message stream
- Add ToolCall interface to types
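
To illustrate capturing tool calls and output from a message stream, here is a minimal sketch. The `StreamMessage` shape is a deliberately simplified assumption and is not the Claude Agent SDK's actual message type; `ToolCall` is named in this PR, but its fields here are guesses, and `collectRun` is hypothetical.

```typescript
// Field names are assumptions; the PR's ToolCall interface may differ.
interface ToolCall {
  tool: string;
  input: unknown;
}

// Simplified stand-in for streamed agent messages (NOT the SDK's real types).
type StreamMessage =
  | { kind: "text"; text: string }
  | { kind: "tool_use"; name: string; input: unknown };

async function collectRun(
  stream: AsyncIterable<StreamMessage>,
): Promise<{ output: string; toolCalls: ToolCall[] }> {
  const toolCalls: ToolCall[] = [];
  const textParts: string[] = [];

  for await (const message of stream) {
    if (message.kind === "tool_use") {
      toolCalls.push({ tool: message.name, input: message.input });
    } else {
      textParts.push(message.text);
    }
  }

  return { output: textParts.join("\n"), toolCalls };
}
```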

- Use glob + content matching for callback route (path is configurable)
- Remove process.env.WORKOS_ check (SDK abstracts env access)
- Add checkFileWithPattern helper for flexible file discovery
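
A helper along the lines of `checkFileWithPattern` might work as sketched below: walk the project, and pass if any file whose relative path matches one pattern also has content matching another. The signature and return shape are assumptions, and this sketch uses a regular expression over relative paths instead of a glob library so that it stays dependency-free.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join, relative, sep } from "node:path";

interface FileCheck {
  passed: boolean;
  matchedFile?: string;
}

// Recursively list files, skipping dependency and build output directories.
function listFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    if (entry.name === "node_modules" || entry.name === ".next") return [];
    const fullPath = join(dir, entry.name);
    return entry.isDirectory() ? listFiles(fullPath) : [fullPath];
  });
}

// Assumed signature. Avoid the `g` flag on either pattern: it makes `test` stateful.
function checkFileWithPattern(
  rootDir: string,
  pathPattern: RegExp,
  contentPattern: RegExp,
): FileCheck {
  for (const file of listFiles(rootDir)) {
    // Normalize to forward slashes so patterns behave the same on Windows.
    const relPath = relative(rootDir, file).split(sep).join("/");
    if (pathPattern.test(relPath) && contentPattern.test(readFileSync(file, "utf8"))) {
      return { passed: true, matchedFile: relPath };
    }
  }
  return { passed: false };
}
```

Because the path is matched by pattern rather than hardcoded, a callback route placed at a configurable location can still be found.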

nicknisi commented Feb 1, 2026

Example run:

| Framework | State |
| --- | --- |
| Next.js | Fresh |

> tsx tests/evals/index.ts --framework=nextjs --verbose --keepOnFail --state=fresh


Running: nextjs/fresh
  Installing dependencies...
[dotenv@17.2.3] injecting env (2) from .env.local -- tip: ⚙️  load multiple .env files with { path: ['.env.local', '.env'] }
  Initializing agent for nextjs...
  Tool: Skill
  Agent: I'll start by fetching the SDK documentation from the README, which is the source of truth....
  Tool: WebFetch
  Agent: Now let me validate the project structure and check the existing environment variables....
  Tool: Glob
  Tool: Read
  Tool: Read
  Tool: Glob
  Agent: Let me also check the existing app structure:...
  Tool: Glob
  Tool: Read
  Tool: Read
  Tool: TodoWrite
  Agent: Now let me proceed with the integration. First, I'll add the missing environment variable and instal...
  Tool: Edit
  Tool: Bash
  Agent: Let me check if there's a pnpm or yarn lockfile to determine the correct package manager:...
  Tool: Bash
  Tool: Bash
  Tool: Bash
  Tool: TodoWrite
  Agent: Now let me verify the SDK is installed and create the middleware. Since this is Next.js 14, I'll cre...
  Tool: Bash
  Agent: Now let me create the middleware, callback route, update layout, and update the page:...
  Tool: Write
  Tool: Bash
  Tool: Write
  Tool: TodoWrite
  Agent: Now let me update the layout with AuthKitProvider and the page with auth UI:...
  Tool: Write
  Tool: Write
  Tool: TodoWrite
  Agent: Now let me run the verification checklist and build:...
  Tool: Bash
  Tool: Bash
  Tool: Bash
  Agent: All verification checks pass. Now let's run the build:...
  Tool: Bash
  Agent: Let me check the correct import paths from the SDK:...
  Tool: Bash
  Tool: Bash
  Tool: Bash
  Agent: I see - the SDK doesn't have `getUser` and `AuthKitProvider` in the main exports. Let me check the c...
  Tool: Bash
  Tool: Bash
  Agent: I see - `AuthKitProvider` is in `@workos-inc/authkit-nextjs/components`, and I need to use `withAuth...
  Tool: Bash
  Tool: Bash
  Tool: Bash
  Agent: Now I understand. The SDK uses `withAuth` instead of `getUser`. Let me update the files with the cor...
  Tool: Write
  Tool: Write
  Agent: Now let's run the build again:...
  Tool: Bash
  Agent: The build succeeded. Let me update the todo list to show completion:...
  Tool: TodoWrite
  Agent: I've successfully integrated WorkOS AuthKit into your Next.js application. Here's a summary of what ...
✓ PASSED

Results saved to: /Users/nicknisi/Developer/cli/prompt-improvements/tests/eval-results/2026-02-01T05-39-28-884Z.json

┌─────────────────┬─────────┬──────────┬───────────────┐
│ Framework       │  Fresh  │ Existing │ Existing+Auth │
├─────────────────┼─────────┼──────────┼───────────────┤
│ nextjs          │    ✓    │    -     │       -       │
│ react           │    -    │    -     │       -       │
│ react-router    │    -    │    -     │       -       │
│ tanstack-start  │    -    │    -     │       -       │
│ vanilla-js      │    -    │    -     │       -       │
└─────────────────┴─────────┴──────────┴───────────────┘

Results: 1/1 passed (100.0%)
pnpm eval --framework=nextjs --verbose --keepOnFail --state=fresh  66.79s user 23.09s system 32% cpu 4:39.04 total

- Grader: support src/ directory (v1.132+) in addition to app/
- Grader: check for authkitMiddleware instead of createServerFn
- Grader: fix package name to @workos/authkit-tanstack-react-start
- Grader: remove AuthKitProvider requirement (optional for server-only)
- Grader: support both flat and nested route patterns for callback
- Skill: add directory detection guidance (src/ vs app/)
- Skill: fix handleAuth() → handleCallbackRoute()
- Skill: add SDK exports reference section

- Remove callback component check (SDK handles OAuth internally)
- Use glob pattern to find useAuth anywhere in src/**/*.tsx
- Support both Vite (main.tsx) and CRA (index.tsx) entry points
- Add comprehensive header documenting SDK patterns

- Fix package name: @workos-inc/authkit-react-router (was @workos-inc/authkit)
- Use glob patterns instead of hardcoded file paths
- Check for authLoader in callback routes (flexible location)
- Check for authkitLoader in route files for auth state
- Remove unnecessary ProtectedRoute.tsx/auth.ts checks (SDK has ensureSignedIn)
- Support both app/ and src/ directory structures

- Remove callback.html/callback.js checks (SDK handles OAuth internally)
- Remove auth.js with getAuthorizationUrl (old pattern)
- Check for createClient from @workos-inc/authkit-js or CDN WorkOS.createClient
- Check for auth methods (signIn, signOut, getUser, getAccessToken)
- Support both bundled (ESM import) and CDN (script tag) patterns
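
The bundled-versus-CDN check could be approximated with two content patterns, as in this sketch. `detectClientSetup` and `findAuthMethods` are hypothetical names, and the regular expressions are illustrative guesses at what such a grader might match, not the PR's actual patterns.

```typescript
// ESM: a named import of createClient from the authkit-js package.
const ESM_IMPORT =
  /import\s*\{[^}]*\bcreateClient\b[^}]*\}\s*from\s*['"]@workos-inc\/authkit-js['"]/;
// CDN: a call through the global WorkOS object exposed by a script tag.
const CDN_CALL = /\bWorkOS\.createClient\s*\(/;

function detectClientSetup(source: string): "esm" | "cdn" | null {
  if (ESM_IMPORT.test(source)) return "esm";
  if (CDN_CALL.test(source)) return "cdn";
  return null;
}

const AUTH_METHODS = ["signIn", "signOut", "getUser", "getAccessToken"] as const;

// Returns the auth methods that appear to be called somewhere in the source.
function findAuthMethods(source: string): string[] {
  return AUTH_METHODS.filter((method) => new RegExp(`\\b${method}\\s*\\(`).test(source));
}
```

Whether a grader should require all four methods or only some of them is a policy choice this sketch leaves to the caller.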

Phase 1: Parallel execution infrastructure
- Add ParallelRunner with p-limit concurrency control
- Auto-detect concurrency based on CPU/memory
- Graceful shutdown with fixture cleanup on SIGINT/SIGTERM
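
One way to picture the concurrency auto-detection and limiting is the sketch below. The heuristics (leave one core free, budget 1 GiB of free memory per run, cap at 8) are illustrative assumptions rather than the PR's values, and `runLimited` is a small hand-rolled stand-in for what `p-limit` provides, used here only to keep the example dependency-free.

```typescript
import { cpus, freemem } from "node:os";

// Assumed heuristic: the smaller of a CPU-based and a memory-based budget, capped.
function detectConcurrency(
  cpuCount: number = cpus().length,
  freeMemBytes: number = freemem(),
  memPerRunBytes: number = 1024 ** 3,
  cap: number = 8,
): number {
  const byCpu = Math.max(1, cpuCount - 1);
  const byMemory = Math.max(1, Math.floor(freeMemBytes / memPerRunBytes));
  return Math.min(byCpu, byMemory, cap);
}

// Runs tasks with at most `concurrency` in flight; results keep input order.
async function runLimited<T>(tasks: Array<() => Promise<T>>, concurrency: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // safe: no await between the read and the increment
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```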

Phase 2: Event-driven live dashboard
- Add EvalEventEmitter for scenario lifecycle events
- Create Ink/React dashboard with real-time status updates
- TTY detection: dashboard for interactive, logging for CI/pipes
- Add --no-dashboard flag to disable live UI
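
The event-driven reporting and TTY fallback might be structured roughly like this. `EvalEventEmitter` is named in this PR, but its body here, the event names, and `chooseReporter` are assumptions made for illustration.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical lifecycle events; the PR's event names and payloads may differ.
type ScenarioEvent =
  | { type: "scenario:start"; id: string }
  | { type: "scenario:complete"; id: string; passed: boolean };

class EvalEventEmitter extends EventEmitter {
  emitEvent(event: ScenarioEvent): void {
    this.emit(event.type, event);
  }
}

// Dashboard only for interactive terminals, and only when not disabled by flag.
function chooseReporter(isTTY: boolean, noDashboard: boolean): "dashboard" | "log" {
  return isTTY && !noDashboard ? "dashboard" : "log";
}
```

A caller would typically pass `Boolean(process.stdout.isTTY)` as the first argument, so CI and piped output fall back to plain logging.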

Add debugging export layer that writes detailed JSON logs during eval runs:
- LogWriter subscribes to eval events and writes incrementally
- Includes all retry attempts, tool calls, and agent output (truncated at 10KB)
- New CLI commands: `pnpm eval:logs` to list, `pnpm eval:show` to view
- Survives interrupts by writing after each scenario completes
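
For the 10KB truncation of agent output, a byte-based cut could look like the sketch below. `truncateOutput` and the `[truncated]` marker are hypothetical; the PR may truncate by characters or mark truncation differently.

```typescript
const MAX_OUTPUT_BYTES = 10 * 1024;

// Truncates by UTF-8 byte length and appends a marker when anything was cut.
function truncateOutput(text: string, maxBytes: number = MAX_OUTPUT_BYTES): string {
  const bytes = Buffer.from(text, "utf8");
  if (bytes.length <= maxBytes) return text;
  // A cut in the middle of a multi-byte character decodes to U+FFFD at the boundary.
  return bytes.subarray(0, maxBytes).toString("utf8") + "\n[truncated]";
}
```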

Prefix verbose console output with [framework/state] labels so parallel
execution logs are easier to follow.
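
A labeled logger of this kind can be very small; the sketch below is one possible shape, with `makeLogger` being a hypothetical name.

```typescript
// Prefixes every line, including each line of a multi-line message.
function makeLogger(
  framework: string,
  state: string,
  write: (line: string) => void = console.log,
): (message: string) => void {
  const prefix = `[${framework}/${state}]`;
  return (message) => {
    for (const line of message.split("\n")) {
      write(`${prefix} ${line}`);
    }
  };
}
```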

Switch from pinned semver ranges to `latest` tag for tanstack packages.
This ensures evals test against what users actually install, surfacing
upstream breakage as signal rather than hiding it behind old versions.

Note: tanstack-start is currently broken upstream (incompatible internal
deps). Tests will auto-heal once a fixed release is published upstream.
…ecture

TanStack Start moved from vinxi to pure Vite. Old fixtures used the
deprecated vinxi-based structure which no longer works with latest.

Changes:
- Replace vinxi scripts with vite dev/build
- Replace @tanstack/start with @tanstack/react-start
- Move app/ to src/ directory structure
- Add vite.config.ts with tanstackStart plugin
- Update to React 19
- Add .gitignore for fixture artifacts

All three fixtures now build successfully with latest TanStack versions.

Remove redundant `fresh` fixture variants since they provided no
meaningful test coverage difference from `existing` fixtures.

Renamed: existing → example, existing-auth0 → example-auth0

This reduces the test matrix from 5×3=15 to 5×2=10 scenarios while
maintaining coverage for both greenfield installs and Auth0 migrations.

The grader pattern only checked routes/**/*.tsx but in React Router v7
Framework mode, authkitLoader belongs in app/root.tsx (the root layout).
Updated pattern to match both root.tsx and routes/**/*.
@nicknisi nicknisi changed the title feat: add eval framework for installer agent testing chore: add eval framework for installer agent testing Feb 2, 2026
@nicknisi nicknisi changed the title chore: add eval framework for installer agent testing test: add eval framework for installer agent testing Feb 2, 2026
@nicknisi nicknisi merged commit 23f8175 into main Feb 2, 2026
5 checks passed

2 participants