feat: team platform — eval infra wiring, costs, cache, trends (v0.3.10)#78
Open
feat: team platform — eval infra wiring, costs, cache, trends (v0.3.10)#78
Conversation
Design doc for Supabase-backed team data store and universal eval infrastructure. Covers architecture, credential storage, eval formats, YAML test case spec, Supabase schema, phased rollout, and security model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k-config - Replace project-specific references with generic language - Add missing fields to eval result format: prompt_sha, by_category, timestamp, response_preview - Enrich failure format with details array, scores dict, expectation_type - Add EVAL_JUDGE_CACHE, EVAL_VERBOSE, multiprocess worker support, dedup on push, run scopes, model aliases, judge profiles - Restructure credential storage to 4 layers with gstack-config (v0.3.9) for user preferences (sync_enabled, sync_transcripts) - Update integration points, observability, and reuse map Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY up atomicWriteSync, readJSON, getGitInfo, getVersion, getRemoteSlug, and sanitizeForFilename from eval-store.ts, session-runner.ts, and eval-watch.ts into a shared module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/sync-config.ts: reads .gstack-sync.json + ~/.gstack/auth.json - lib/auth.ts: device auth flow (browser OAuth, local HTTP callback) - lib/sync.ts: Supabase push/pull via raw fetch(), offline queue, cache - lib/cli-sync.ts: CLI handler for gstack-sync commands - bin/gstack-sync: bash wrapper (setup, status, push-*, pull, drain) - .gstack-sync.json.example: template for team setup Zero new dependencies — uses raw fetch() against PostgREST API. All sync is non-fatal with 5s timeout and offline queue fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 001_teams.sql: teams + team_members + RLS - 002_eval_runs.sql: eval results with universal format, indexes, upsert key - 003_data_tables.sql: retro, QA, ship, greptile, transcripts + RLS All tables use RLS: team members read/insert, admins delete. Transcript table has tighter policy (admin-only read). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- eval-store.ts: import shared getGitInfo/getVersion, add pushEvalRun() hook in finalize() (non-blocking, non-fatal) - session-runner.ts: import shared atomicWriteSync/sanitizeForFilename - eval-store.test.ts: fix pre-existing bug in double-finalize test (was counting _partial file) - 30 new tests for lib/util, lib/sync-config, lib/sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY up eval I/O duplicated across scripts/eval-list.ts, eval-compare.ts, and eval-summary.ts. Adds EVAL_DIR constant, formatTimestamp(), listEvalFiles(), loadEvalResults() with --limit support. 13 new tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/eval-format.ts: StandardEvalResult interfaces, validateEvalResult(), normalizeFromLegacy/normalizeToLegacy round-trip converters - lib/eval-tier.ts: EvalTier type, resolveTier/resolveJudgeTier from env, tierToModel mapping, TIER_ALIASES (haiku→fast, sonnet→standard, opus→full) - lib/eval-cost.ts: MODEL_PRICING (last verified 2025-05-01), computeCosts(), formatCostDashboard(), aggregateCosts(), fallback for unknown models - 42 tests across 3 test files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache at ~/.gstack/eval-cache/{suite}/{sha}.json. Compute cache keys
from source file contents + test input via Bun.CryptoHasher SHA256.
Supports read/write/stats/clear/verify operations. EVAL_CACHE=0
skips reads for force-rerun. 16 tests including corrupt JSON handling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch subcommands. Ports logic from 4 separate scripts into unified entry. Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list. - bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern - package.json: eval:* scripts now point to lib/cli-eval.ts - supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS - docs/eval-result-format.md: public format spec for any language - test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess) including 3 push failure modes (file-not-found, invalid schema, sync unavailable) 215 tests passing across 13 files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract per-model token usage from resultLine.modelUsage (including cache tokens and exact API cost), flow CostEntry[] through EvalCollector, aggregate in finalize(). Extend CostEntry with cache_read_input_tokens, cache_creation_input_tokens, cost_usd. computeCosts() prefers exact cost_usd over MODEL_PRICING when available (~4x more accurate with prompt caching). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
callJudge/judge now return {result, meta} with SHA-based caching
(~$0.18/run savings when SKILL.md unchanged) and dynamic model
selection via EVAL_JUDGE_TIER env var. E2E tests pass --model from
EVAL_TIER to claude -p. outcomeJudge retains simple return type.
All 8 LLM eval test sites updated with real costs and costs[].
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
computeTrends() classifies tests as stable-pass/stable-fail/flaky/ improving/degrading based on pass rate, flip count, and recent streak. gstack eval trend shows sparkline table with --limit, --tier, --test filters. Guard CLI main block with import.meta.main to prevent execution on import. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explain what team sync gives you, that it's optional, and how to set it up. Points to TEAM_COORDINATION_STORE.md for full guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove _comment hacks from JSON example file. Add short team sync section to README explaining what it is, that it's optional, and how to set it up. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract pushWithSync() helper to eliminate boilerplate across 6 push functions. Add pushHeartbeat() for connectivity testing. Add push-greptile to CLI. New commands: gstack-sync test (validates full push/pull flow via sync_heartbeats table), gstack-sync show (terminal team data dashboard with summary/evals/ships/retros views). Guard main block with import.meta.main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add non-fatal sync steps to all 4 skill templates: - /ship Step 8.5: write ship log JSON + push after PR creation - /retro Step 13: push snapshot after JSON save - /qa Phase 6.7: write qa-sync.json + push after health score - greptile-triage: push each triage entry after history file writes All calls use || true for zero disruption. Silent when sync not configured. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 005_sync_heartbeats.sql migration for connectivity testing - eval:trend --team flag pulls team eval data (graceful fallback) - docs/TEAM_SYNC_SETUP.md step-by-step setup guide - Design doc status updated to Phase 2 complete - 10 new tests for sync show formatting functions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
costs[]with exact API-reported per-model token usage (including cache tokens).computeCosts()prefers exact cost over MODEL_PRICING estimates (~4x more accurate with prompt caching).eval-cache.ts. Unchanged SKILL.md = $0 API cost on repeat runs.EVAL_CACHE=0to bypass.EVAL_JUDGE_TIERfor judge model,EVAL_TIERfor E2E model pinning via--modeltoclaude -p.eval:trendCLI — per-test pass rate over N runs with flaky/improving/degrading classification and sparkline table.StandardEvalResultwith validation, normalization, and bidirectional legacy conversion.gstack eval list|compare|summary|trend|push|cost|cache|watchTest plan
bun run eval:trendshows real data from existing eval runsbun run eval:cost <file>shows per-model breakdownEVALS=1 bun run test:evals— verify costs[] populated and cache works(cached)and $0 cost for judge testsEVAL_JUDGE_TIER=haiku EVALS=1 bun run test:evals— haiku for judge🤖 Generated with Claude Code