garrytan · Mar 15, 2026
diff --git a/.gitignore b/.gitignore
@@ -11,3 +11,4 @@ bun.lock
 .env.local
 .env.*
 !.env.example
+.gstack-sync.json
diff --git a/.gstack-sync.json.example b/.gstack-sync.json.example
@@ -0,0 +1,5 @@
+{
+  "supabase_url": "https://YOUR_PROJECT.supabase.co",
+  "supabase_anon_key": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.YOUR_ANON_KEY_HERE",
+  "team_slug": "your-team-name"
+}
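The example file above carries three string fields. A minimal sketch of how such a config might be validated before use, assuming hypothetical names (`SyncConfig`, `parseSyncConfig`) that do not come from this PR. Returning `null` for missing or malformed input mirrors the changelog's statement that an unconfigured project keeps working locally:

```typescript
// Hypothetical shape mirroring the three keys in .gstack-sync.json.example.
interface SyncConfig {
  supabase_url: string;
  supabase_anon_key: string;
  team_slug: string;
}

// Returns a validated config, or null when the input is malformed or incomplete.
// Illustrative only: the real loader in this PR may validate differently.
function parseSyncConfig(raw: string): SyncConfig | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // malformed JSON is treated the same as "not configured"
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) return null;

  const record = data as Record<string, unknown>;
  // Accept only non-empty strings for each required key.
  const get = (key: string): string | null => {
    const value = record[key];
    return typeof value === "string" && value.length > 0 ? value : null;
  };

  const supabase_url = get("supabase_url");
  const supabase_anon_key = get("supabase_anon_key");
  const team_slug = get("team_slug");
  if (supabase_url === null || supabase_anon_key === null || team_slug === null) {
    return null;
  }
  return { supabase_url, supabase_anon_key, team_slug };
}
```

A caller would read the file contents, pass them to `parseSyncConfig`, and skip every sync step when the result is `null`.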
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,28 @@
 # Changelog
 
+## 0.3.10 — 2026-03-15
+
+### Added
+- **Team sync via Supabase (optional)** — shared data store for eval results, retro snapshots, QA reports, ship logs, and Greptile triage across team members. All sync operations are non-fatal and non-blocking — skills never wait on network. Offline queue with automatic retry (up to 5 attempts). Zero impact when not configured: without `.gstack-sync.json`, everything works locally as before. See `docs/designs/TEAM_COORDINATION_STORE.md` for architecture and setup.
+- **Supabase migration SQL** — 5 migration files in `supabase/migrations/` for teams, eval_runs, data tables (retros, QA, ships, Greptile), eval costs, and sync heartbeats. Row-level security policies ensure team members can only access their own team's data.
+- **Sync config + auth** — `.gstack-sync.json` for project-level config (Supabase URL, anon key, team slug). `~/.gstack/auth.json` for user-level tokens (keyed by Supabase URL for multi-team support). `GSTACK_SUPABASE_ACCESS_TOKEN` env var for CI/automation. Token refresh built in.
+- **`gstack-sync` CLI** — `setup`, `status`, `test`, `show`, `push-{eval,retro,qa,ship,greptile}`, `push-transcript`, `pull`, `drain`, and `logout` subcommands for managing team sync.
+- **Universal eval format** — `StandardEvalResult` schema with validation, normalization, and bidirectional legacy conversion. Any language can produce JSON matching this format and push via `gstack eval push`.
+- **Unified eval CLI** — `gstack eval list|compare|summary|trend|push|cost|cache|watch` consolidating all eval tools into one entry point.
+- **Per-model cost tracking** — eval results now include `costs[]` with exact per-model token usage (input, output, cache read, cache creation) and API-reported cost. Extracted from `resultLine.modelUsage` in the `claude -p` NDJSON stream. `computeCosts()` prefers exact `cost_usd` over MODEL_PRICING estimates (~4x more accurate with prompt caching).
+- **LLM judge caching** — SHA-based caching for LLM-as-judge eval calls via `eval-cache.ts`. Cache keyed by `model:prompt`, so unchanged SKILL.md content skips API calls entirely. ~$0.18/run savings. Set `EVAL_CACHE=0` to force re-run.
+- **Dynamic model selection** — `EVAL_JUDGE_TIER` env var controls which Claude model runs judge evals (haiku/sonnet/opus, default: sonnet). `EVAL_TIER` pins the E2E test model via `--model` flag to `claude -p`.
+- **`bun run eval:trend`** — per-test pass rate tracking over last N runs. Classifies tests as stable-pass, stable-fail, flaky, improving, or degrading. Sparkline table with `--limit`, `--tier`, `--test` filters.
+- **Shared utilities** — `lib/util.ts` extracted with `atomicWriteJSON`, `readJSON`, `getGitInfo`, `getRemoteSlug`, `listEvalFiles`, `loadEvalResults`, `formatTimestamp`, and path constants.
+- 52+ new tests across eval cache, cost, format, tier, trend, sync config, sync client, and LLM judge integration.
+
+### Changed
+- `callJudge()` and `judge()` now return `{ result, meta }` with `JudgeMeta` (model, tokens, cached flag). `outcomeJudge()` retains simple return type for E2E callers.
+- `EvalCollector.finalize()` aggregates per-test `costs[]` into result-level cost breakdown and attempts team sync (non-blocking).
+- `cli-eval.ts` main block guarded with `import.meta.main` to prevent execution on import.
+- `eval:summary` now hints to run `eval:trend` when flaky tests are detected.
+- All 8 LLM eval test sites updated from hard-coded `cost_usd: 0.02` to real API-reported costs.
+
 ## 0.3.9 — 2026-03-15
 
 ### Added
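The `eval:trend` entry above names five classes (stable-pass, stable-fail, flaky, improving, degrading) but this diff does not show how they are assigned. A minimal sketch of one possible classifier follows. The function names, the older-half versus newer-half comparison, and the 0.5 threshold are all assumptions for illustration, not the logic in `lib/cli-eval.ts`:

```typescript
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// Fraction of passing runs in a non-empty slice.
function passRate(runs: boolean[]): number {
  return runs.filter(Boolean).length / runs.length;
}

// Classify one test from its results over the last N runs, oldest first.
// Assumed heuristic: compare the pass rate of the newer half with the older half.
function classifyTrend(runs: boolean[]): Trend {
  if (runs.length === 0) throw new Error("need at least one run");
  if (runs.every((passed) => passed)) return "stable-pass";
  if (runs.every((passed) => !passed)) return "stable-fail";

  // Mixed results imply at least two runs, so both halves are non-empty.
  const mid = Math.floor(runs.length / 2);
  const delta = passRate(runs.slice(mid)) - passRate(runs.slice(0, mid));
  if (delta >= 0.5) return "improving";
  if (delta <= -0.5) return "degrading";
  return "flaky";
}
```

Under this sketch, a test that failed twice and then passed twice is "improving", while one that alternates pass and fail is "flaky".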

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -15,6 +15,7 @@ bun run dev:skill    # watch mode: auto-regen + validate on change
 bun run eval:list    # list all eval runs from ~/.gstack-dev/evals/
 bun run eval:compare # compare two eval runs (auto-picks most recent)
 bun run eval:summary # aggregate stats across all eval runs
+bun run eval:trend   # per-test pass rate trends (flaky detection)
 ```
 
 `test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -134,6 +134,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
 bun run eval:list            # list all eval runs
 bun run eval:compare         # compare two runs (auto-picks most recent)
 bun run eval:summary         # aggregate stats across all runs
+bun run eval:trend           # per-test pass rate over last N runs (flaky detection)
+bun run eval:cache stats     # check LLM judge cache hit rate
 ```
 
 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
@@ -152,7 +154,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. T
 # Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
 ```
 
-- Uses `claude-sonnet-4-6` for scoring stability
+- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
+- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
 - Tests live in `test/skill-llm-eval.test.ts`
 - Calls the Anthropic API directly (not `claude -p`), so it works from anywhere including inside Claude Code
 

diff --git a/README.md b/README.md
@@ -629,6 +629,12 @@ bun run eval:watch            # live dashboard during E2E runs
 
 E2E tests stream real-time progress, write machine-readable diagnostics, and persist partial results that survive kills. See CONTRIBUTING.md for the full eval infrastructure.
 
+### Team sync (optional)
+
+For teams, gstack can sync eval results, retro snapshots, QA reports, ship logs, and Greptile triage data to a shared Supabase store. Without this, everything works locally as before — sync is purely additive.
+
+To set up: copy `.gstack-sync.json.example` to `.gstack-sync.json`, create a Supabase project, run the migrations in `supabase/migrations/`, and fill in your credentials. See `docs/designs/TEAM_COORDINATION_STORE.md` for the full guide.
+
 ## License
 
 MIT
diff --git a/TODOS.md b/TODOS.md
@@ -231,7 +231,7 @@
 
 **Why:** Spot quality trends — is the app getting better or worse?
 
-**Context:** QA already writes structured reports. This adds cross-run comparison.
+**Context:** `eval:trend` now tracks test-level pass rates (eval infrastructure). QA-run-level trending (health scores over time across QA report files) is a separate feature that could reuse the `computeTrends` pattern from `lib/cli-eval.ts`.
 
 **Effort:** S
 **Priority:** P2
@@ -277,6 +277,44 @@
 **Priority:** P3
 **Depends on:** Browse sessions
 
+## Team Sync
+
+### Streaming parser for large session files
+
+**What:** Replace readFileSync with readline/createReadStream for session files >10MB.
+
+**Why:** Files >10MB are currently skipped, so long sessions (1000+ turns, 35MB) lose enrichment data (tools_used, full turn count).
+
+**Context:** Current 10MB cap is defensive. Session files at `~/.claude/projects/{hash}/{sid}.jsonl` can be 35MB for marathon sessions. Streaming parser removes the cap while keeping memory usage constant.
+
+**Effort:** S
+**Priority:** P3
+**Depends on:** Transcript sync (Phase 3)
+
+### Session effectiveness scoring
+
+**What:** Compute a 1-5 effectiveness score per session based on turns to achieve goal, tool diversity, whether code was shipped, and session duration.
+
+**Why:** Enables `show sessions --best` and team-level AI effectiveness metrics. Raw data (tools_used, turns, duration, summary) already in Supabase after transcript sync.
+
+**Context:** Year 2 roadmap item. Scoring heuristics need iteration. Could start with: fewer turns = more efficient, more tool diversity = better problem decomposition, shipped code (detected via git) = successful outcome.
+
+**Effort:** M
+**Priority:** P2
+**Depends on:** Transcript sync (Phase 3)
+
+### Weekly AI usage digest
+
+**What:** Supabase edge function that runs weekly, aggregates session_transcripts + eval_runs, sends team summary to Slack/email.
+
+**Why:** Passive team visibility without running commands. "Your team ran 47 sessions this week. Top tools: Edit(156), Bash(89). Sarah shipped 3 PRs via /ship."
+
+**Context:** Design doc Phase 4 item. Requires Supabase edge functions + Slack/email integration. Transcript data from Phase 3 is the primary input alongside eval_runs.
+
+**Effort:** L
+**Priority:** P2
+**Depends on:** Transcript sync (Phase 3), Supabase edge functions
+
 ## Infrastructure
 
 ### /setup-gstack-upload skill (S3 bucket)
@@ -335,6 +373,8 @@
 
 **Why:** Reduce E2E test cost and flakiness.
 
+**Status:** Model pinning shipped (session-runner.ts passes `--model` from `EVAL_TIER` env). Retry:2 still TODO.
+
 **Effort:** XS
 **Priority:** P2
 

diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-0.3.9
+0.3.10
diff --git a/bin/gstack-eval b/bin/gstack-eval
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# gstack eval — unified eval CLI
+# Delegates to lib/cli-eval.ts via bun
+
+GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
+exec bun run "$GSTACK_DIR/lib/cli-eval.ts" "$@"
diff --git a/bin/gstack-sync b/bin/gstack-sync
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+# gstack-sync — team data sync CLI.
+#
+# Usage:
+#   gstack-sync setup                    — interactive auth flow
+#   gstack-sync status                   — show sync status
+#   gstack-sync test                     — validate full sync flow
+#   gstack-sync show [evals|ships|retros|sessions] — view team data
+#   gstack-sync push-{eval,retro,qa,ship,greptile} <file> — push data
+#   gstack-sync push-transcript          — sync Claude session transcripts
+#   gstack-sync pull                     — pull team data to local cache
+#   gstack-sync drain                    — drain the offline queue
+#   gstack-sync logout                   — clear auth tokens
+#
+# Env overrides (for testing):
+#   GSTACK_DIR          — override auto-detected gstack root
+#   GSTACK_STATE_DIR    — override ~/.gstack state directory
+set -euo pipefail
+
+GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
+
+case "${1:-}" in
+  setup)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" setup
+    ;;
+  status)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" status
+    ;;
+  push-eval)
+    FILE="${2:?Usage: gstack-sync push-eval <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-eval "$FILE"
+    ;;
+  push-retro)
+    FILE="${2:?Usage: gstack-sync push-retro <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-retro "$FILE"
+    ;;
+  push-qa)
+    FILE="${2:?Usage: gstack-sync push-qa <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-qa "$FILE"
+    ;;
+  push-ship)
+    FILE="${2:?Usage: gstack-sync push-ship <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-ship "$FILE"
+    ;;
+  push-greptile)
+    FILE="${2:?Usage: gstack-sync push-greptile <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-greptile "$FILE"
+    ;;
+  push-transcript)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-transcript
+    ;;
+  test)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" test
+    ;;
+  show)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" show "${@:2}"
+    ;;
+  pull)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" pull
+    ;;
+  drain)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" drain
+    ;;
+  logout)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" logout
+    ;;
+  *)
+    echo "Usage: gstack-sync <command> [args]"
+    echo ""
+    echo "Commands:"
+    echo "  setup                 Interactive auth flow (opens browser)"
+    echo "  status                Show sync status (queue, cache, connection)"
+    echo "  test                  Validate full sync flow (push + pull)"
+    echo "  show [evals|ships|retros|sessions]  View team data in terminal"
+    echo "  push-eval <file>      Push eval result JSON to team store"
+    echo "  push-retro <file>     Push retro snapshot JSON"
+    echo "  push-qa <file>        Push QA report JSON"
+    echo "  push-ship <file>      Push ship log JSON"
+    echo "  push-greptile <file>  Push Greptile triage entry JSON"
+    echo "  push-transcript       Sync Claude session transcripts"
+    echo "  pull                  Pull team data to local cache"
+    echo "  drain                 Drain the offline sync queue"
+    echo "  logout                Clear auth tokens"
+    exit 1
+    ;;
+esac
diff --git a/docs/TEAM_SYNC_SETUP.md b/docs/TEAM_SYNC_SETUP.md
@@ -0,0 +1,132 @@
+# Team Sync Setup Guide
+
+Team sync lets your team share eval results, retro snapshots, QA reports, ship logs, and Greptile triage data via a shared Supabase store. All sync is optional and non-fatal — without it, everything works locally as before.
+
+## Prerequisites
+
+- A [Supabase](https://supabase.com) project (free tier works)
+- gstack v0.3.10+
+
+## Step 1: Create a Supabase project
+
+1. Go to [supabase.com](https://supabase.com) and create a new project
+2. Note your **Project URL** (e.g., `https://xxxx.supabase.co`)
+3. Note your **anon/public key** from Settings > API
+
+## Step 2: Run migrations
+
+In the Supabase SQL Editor, run these files **in order**:
+
+```
+supabase/migrations/001_teams.sql
+supabase/migrations/002_eval_runs.sql
+supabase/migrations/003_data_tables.sql
+supabase/migrations/004_eval_costs.sql
+supabase/migrations/005_sync_heartbeats.sql
+```
+
+Copy-paste each file's contents into the SQL editor and run.
+
+## Step 3: Create your team
+
+In the SQL editor, create a team now; you add yourself as owner after authenticating in Step 5:
+
+```sql
+-- Create team
+INSERT INTO teams (name, slug) VALUES ('Your Team', 'your-team-slug');
+
+-- After authenticating (Step 5), add yourself as owner:
+-- INSERT INTO team_members (team_id, user_id, role)
+-- VALUES ('<team-id>', '<your-user-id>', 'owner');
+```
+
+Note the team slug — you'll need it in the next step.
+
+## Step 4: Configure your project
+
+Copy the example config to your project root:
+
+```bash
+cp .gstack-sync.json.example .gstack-sync.json
+```
+
+Edit `.gstack-sync.json` with your Supabase details:
+
+```json
+{
+  "supabase_url": "https://YOUR_PROJECT.supabase.co",
+  "supabase_anon_key": "eyJ...",
+  "team_slug": "your-team-slug"
+}
+```
+
+**Important:** The anon key is safe to commit because row-level security (RLS) protects the data, so commit `.gstack-sync.json` if your whole team uses the same Supabase project. Otherwise, add it to `.gitignore`.
+
+## Step 5: Authenticate
+
+```bash
+gstack-sync setup
+```
+
+This opens your browser for Supabase OAuth. After authenticating, tokens are saved to `~/.gstack/auth.json` (mode 0600).
+
+**For CI/automation:** Set the `GSTACK_SUPABASE_ACCESS_TOKEN` env var instead of running setup.
+
+## Step 6: Verify
+
+```bash
+gstack-sync test
+```
+
+Expected output:
+```
+gstack sync test
+────────────────────────────────────
+  1. Config:        ok (team: your-team-slug)
+  2. Auth:          ok (you@email.com)
+  3. Push:          ok (123ms)
+  4. Pull:          ok (1 heartbeats, 95ms)
+────────────────────────────────────
+  Sync test passed ✓
+```
+
+## Step 7: See your data
+
+```bash
+gstack-sync show              # team summary dashboard
+gstack-sync show evals        # recent eval runs
+gstack-sync show ships        # recent ship logs
+gstack-sync show retros       # recent retro snapshots
+gstack-sync status            # sync health check
+bun run eval:trend --team     # team-wide test trends
+```
+
+## How it works
+
+When sync is configured, skills automatically push data after completing their primary task:
+
+- `/ship` pushes a ship log after PR creation (Step 8.5)
+- `/retro` pushes the snapshot after saving to `.context/retros/` (Step 13)
+- `/qa` pushes a report after computing the health score (Phase 6)
+- `/review` pushes Greptile triage entries after history file writes
+- Eval runs are pushed automatically by `EvalCollector.finalize()`
+
+All pushes are non-fatal. If sync fails, entries are queued in `~/.gstack/sync-queue.json` and retried on the next push or via `gstack-sync drain`.
+
+## Troubleshooting
+
+| Problem | Fix |
+|---|---|
+| "No .gstack-sync.json found" | Copy `.gstack-sync.json.example` and fill in your values |
+| "Not authenticated" | Run `gstack-sync setup` |
+| Push fails with 404 | Run the migration SQL files in order |
+| "Connection failed" | Check your Supabase URL and that the project is running |
+| Queue growing | Run `gstack-sync drain` to flush |
+
+## Adding team members
+
+Each team member needs to:
+
+1. Have `.gstack-sync.json` in their project (commit it or share it)
+2. Run `gstack-sync setup` to authenticate
+3. Be added to `team_members` in Supabase (by an admin)
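The guide's "How it works" section says failed pushes are queued in `~/.gstack/sync-queue.json` and retried, and the changelog caps retries at 5 attempts. A minimal sketch of that retry bookkeeping, assuming a hypothetical entry shape and hypothetical function names (`QueueEntry`, `afterFailure`, `drainQueue`); the real queue schema and drain logic are not shown in this diff:

```typescript
// Hypothetical queue entry; the real sync-queue.json schema is not shown in this PR.
interface QueueEntry {
  table: string;
  payload: unknown;
  attempts: number; // failed push attempts so far
}

const MAX_ATTEMPTS = 5; // "up to 5 attempts" per the changelog

// After a failed push: bump the counter, or return null to drop the entry
// once it has used all of its attempts.
function afterFailure(entry: QueueEntry): QueueEntry | null {
  const attempts = entry.attempts + 1;
  return attempts >= MAX_ATTEMPTS ? null : { ...entry, attempts };
}

// Try every queued entry once and return the entries that should stay queued.
// `push` stands in for the network call; a thrown error counts as a failure,
// so draining never propagates a sync error to the caller.
async function drainQueue(
  queue: QueueEntry[],
  push: (entry: QueueEntry) => Promise<boolean>,
): Promise<QueueEntry[]> {
  const remaining: QueueEntry[] = [];
  for (const entry of queue) {
    let ok = false;
    try {
      ok = await push(entry);
    } catch {
      ok = false;
    }
    if (ok) continue;
    const retry = afterFailure(entry);
    if (retry !== null) remaining.push(retry);
  }
  return remaining;
}
```

Swallowing errors inside the drain loop is what keeps sync non-fatal in this sketch: the caller only ever receives the list of entries still waiting.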