Skip to content

feat: team platform — eval infra wiring, costs, cache, trends (v0.3.10)#78

Open
garrytan wants to merge 22 commits intomainfrom
garrytan/team-supabase-store
Open

feat: team platform — eval infra wiring, costs, cache, trends (v0.3.10)#78
garrytan wants to merge 22 commits intomainfrom
garrytan/team-supabase-store

Conversation

@garrytan
Copy link
Owner

Summary

  • Per-model cost tracking — eval results now include costs[] with exact API-reported per-model token usage (including cache tokens). computeCosts() prefers exact cost over MODEL_PRICING estimates (~4x more accurate with prompt caching).
  • LLM judge caching — SHA-based cache for judge evals via eval-cache.ts. Unchanged SKILL.md = $0 API cost on repeat runs. EVAL_CACHE=0 to bypass.
  • Dynamic model selectionEVAL_JUDGE_TIER for judge model, EVAL_TIER for E2E model pinning via --model to claude -p.
  • eval:trend CLI — per-test pass rate over N runs with flaky/improving/degrading classification and sparkline table.
  • Team sync infrastructure — Supabase schema, sync client (push/pull), auth, offline queue, config resolution. All non-fatal, zero impact when not configured.
  • Universal eval formatStandardEvalResult with validation, normalization, and bidirectional legacy conversion.
  • Unified eval CLIgstack eval list|compare|summary|trend|push|cost|cache|watch

Test plan

  • 255 free tests pass (174 + 81 across 16 test files)
  • bun run eval:trend shows real data from existing eval runs
  • bun run eval:cost <file> shows per-model breakdown
  • EVALS=1 bun run test:evals — verify costs[] populated and cache works
  • Second eval run shows (cached) and $0 cost for judge tests
  • EVAL_JUDGE_TIER=haiku EVALS=1 bun run test:evals — haiku for judge

🤖 Generated with Claude Code

garrytan and others added 22 commits March 15, 2026 01:32
Design doc for Supabase-backed team data store and universal eval
infrastructure. Covers architecture, credential storage, eval formats,
YAML test case spec, Supabase schema, phased rollout, and security model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k-config

- Replace project-specific references with generic language
- Add missing fields to eval result format: prompt_sha, by_category,
  timestamp, response_preview
- Enrich failure format with details array, scores dict, expectation_type
- Add EVAL_JUDGE_CACHE, EVAL_VERBOSE, multiprocess worker support,
  dedup on push, run scopes, model aliases, judge profiles
- Restructure credential storage to 4 layers with gstack-config (v0.3.9)
  for user preferences (sync_enabled, sync_transcripts)
- Update integration points, observability, and reuse map

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY up atomicWriteSync, readJSON, getGitInfo, getVersion, getRemoteSlug,
and sanitizeForFilename from eval-store.ts, session-runner.ts, and
eval-watch.ts into a shared module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/sync-config.ts: reads .gstack-sync.json + ~/.gstack/auth.json
- lib/auth.ts: device auth flow (browser OAuth, local HTTP callback)
- lib/sync.ts: Supabase push/pull via raw fetch(), offline queue, cache
- lib/cli-sync.ts: CLI handler for gstack-sync commands
- bin/gstack-sync: bash wrapper (setup, status, push-*, pull, drain)
- .gstack-sync.json.example: template for team setup

Zero new dependencies — uses raw fetch() against PostgREST API.
All sync is non-fatal with 5s timeout and offline queue fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 001_teams.sql: teams + team_members + RLS
- 002_eval_runs.sql: eval results with universal format, indexes, upsert key
- 003_data_tables.sql: retro, QA, ship, greptile, transcripts + RLS

All tables use RLS: team members read/insert, admins delete.
Transcript table has tighter policy (admin-only read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- eval-store.ts: import shared getGitInfo/getVersion, add pushEvalRun()
  hook in finalize() (non-blocking, non-fatal)
- session-runner.ts: import shared atomicWriteSync/sanitizeForFilename
- eval-store.test.ts: fix pre-existing bug in double-finalize test
  (was counting _partial file)
- 30 new tests for lib/util, lib/sync-config, lib/sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY up eval I/O duplicated across scripts/eval-list.ts,
eval-compare.ts, and eval-summary.ts. Adds EVAL_DIR constant,
formatTimestamp(), listEvalFiles(), loadEvalResults() with
--limit support. 13 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/eval-format.ts: StandardEvalResult interfaces, validateEvalResult(),
  normalizeFromLegacy/normalizeToLegacy round-trip converters
- lib/eval-tier.ts: EvalTier type, resolveTier/resolveJudgeTier from env,
  tierToModel mapping, TIER_ALIASES (haiku→fast, sonnet→standard, opus→full)
- lib/eval-cost.ts: MODEL_PRICING (last verified 2025-05-01), computeCosts(),
  formatCostDashboard(), aggregateCosts(), fallback for unknown models
- 42 tests across 3 test files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache at ~/.gstack/eval-cache/{suite}/{sha}.json. Compute cache keys
from source file contents + test input via Bun.CryptoHasher SHA256.
Supports read/write/stats/clear/verify operations. EVAL_CACHE=0
skips reads for force-rerun. 16 tests including corrupt JSON handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch
  subcommands. Ports logic from 4 separate scripts into unified entry.
  Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list.
- bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern
- package.json: eval:* scripts now point to lib/cli-eval.ts
- supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS
- docs/eval-result-format.md: public format spec for any language
- test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess)
  including 3 push failure modes (file-not-found, invalid schema,
  sync unavailable)

215 tests passing across 13 files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract per-model token usage from resultLine.modelUsage (including
cache tokens and exact API cost), flow CostEntry[] through EvalCollector,
aggregate in finalize(). Extend CostEntry with cache_read_input_tokens,
cache_creation_input_tokens, cost_usd. computeCosts() prefers exact
cost_usd over MODEL_PRICING when available (~4x more accurate with
prompt caching).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
callJudge/judge now return {result, meta} with SHA-based caching
(~$0.18/run savings when SKILL.md unchanged) and dynamic model
selection via EVAL_JUDGE_TIER env var. E2E tests pass --model from
EVAL_TIER to claude -p. outcomeJudge retains simple return type.
All 8 LLM eval test sites updated with real costs and costs[].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
computeTrends() classifies tests as stable-pass/stable-fail/flaky/
improving/degrading based on pass rate, flip count, and recent streak.
gstack eval trend shows sparkline table with --limit, --tier, --test
filters. Guard CLI main block with import.meta.main to prevent
execution on import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explain what team sync gives you, that it's optional, and how to
set it up. Points to TEAM_COORDINATION_STORE.md for full guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove _comment hacks from JSON example file. Add short team sync
section to README explaining what it is, that it's optional, and
how to set it up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract pushWithSync() helper to eliminate boilerplate across 6 push
functions. Add pushHeartbeat() for connectivity testing. Add push-greptile
to CLI. New commands: gstack-sync test (validates full push/pull flow
via sync_heartbeats table), gstack-sync show (terminal team data
dashboard with summary/evals/ships/retros views). Guard main block
with import.meta.main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add non-fatal sync steps to all 4 skill templates:
- /ship Step 8.5: write ship log JSON + push after PR creation
- /retro Step 13: push snapshot after JSON save
- /qa Phase 6.7: write qa-sync.json + push after health score
- greptile-triage: push each triage entry after history file writes

All calls use || true for zero disruption. Silent when sync not
configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 005_sync_heartbeats.sql migration for connectivity testing
- eval:trend --team flag pulls team eval data (graceful fallback)
- docs/TEAM_SYNC_SETUP.md step-by-step setup guide
- Design doc status updated to Phase 2 complete
- 10 new tests for sync show formatting functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant