26 commits
f87bc21
docs: add team coordination store design doc
garrytan Mar 15, 2026
8931165
Merge remote-tracking branch 'origin/main' into garrytan/team-supabas…
garrytan Mar 15, 2026
5c1ea08
docs: scrub proprietary refs, close eval format gaps, integrate gstac…
garrytan Mar 15, 2026
caed287
feat: extract shared utilities into lib/util.ts
garrytan Mar 15, 2026
3713c3b
feat: add team sync infrastructure (config, auth, push/pull, CLI)
garrytan Mar 15, 2026
f7ae465
feat: add Supabase migration SQL for team data store
garrytan Mar 15, 2026
82e2041
feat: hook eval-store sync, use shared utils, add 30 lib tests
garrytan Mar 15, 2026
7f7035f
feat: add listEvalFiles, loadEvalResults, formatTimestamp to lib/util.ts
garrytan Mar 15, 2026
9bc6c94
feat: add eval format validation, tier selection, cost tracking
garrytan Mar 15, 2026
1f5b788
feat: add SHA-based eval caching with EVAL_CACHE=0 bypass
garrytan Mar 15, 2026
4ad73f7
feat: unified gstack eval CLI with list, compare, push, cache, cost
garrytan Mar 15, 2026
02925cf
feat: wire costs[] from modelUsage into eval results
garrytan Mar 15, 2026
59752fc
feat: wire eval-cache + eval-tier into LLM judge, pin E2E model
garrytan Mar 15, 2026
daea165
feat: add eval:trend CLI for per-test pass rate tracking
garrytan Mar 15, 2026
33c9552
chore: update gitignore
garrytan Mar 15, 2026
e280333
chore: bump v0.3.10, update CHANGELOG and docs
garrytan Mar 15, 2026
eb7ef21
docs: add setup comments to .gstack-sync.json.example
garrytan Mar 15, 2026
1432046
docs: CHANGELOG covers full branch scope including team sync
garrytan Mar 15, 2026
704fe34
docs: clean up sync example, add team sync section to README
garrytan Mar 15, 2026
dc3fcc8
feat: DRY push functions, add push-greptile + sync test/show commands
garrytan Mar 16, 2026
06f2da2
feat: wire team sync push into ship, retro, qa, and greptile skills
garrytan Mar 16, 2026
87cb769
feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests
garrytan Mar 16, 2026
0e29d7d
feat: add enriched transcript sync — Haiku summaries, session file en…
garrytan Mar 16, 2026
a104471
feat: add push-transcript CLI, show sessions, interactive setup, 36 t…
garrytan Mar 16, 2026
3a57a3f
feat: add /setup-team-sync skill, auto-push transcript hooks in skills
garrytan Mar 16, 2026
6e14689
docs: add team sync TODOs — streaming parser, effectiveness scoring, …
garrytan Mar 16, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ bun.lock
.env.local
.env.*
!.env.example
.gstack-sync.json
5 changes: 5 additions & 0 deletions .gstack-sync.json.example
@@ -0,0 +1,5 @@
{
"supabase_url": "https://YOUR_PROJECT.supabase.co",
"supabase_anon_key": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.YOUR_ANON_KEY_HERE",
"team_slug": "your-team-name"
}
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,28 @@
# Changelog

## 0.3.10 — 2026-03-15

### Added
- **Team sync via Supabase (optional)** — shared data store for eval results, retro snapshots, QA reports, ship logs, and Greptile triage across team members. All sync operations are non-fatal and non-blocking — skills never wait on network. Offline queue with automatic retry (up to 5 attempts). Zero impact when not configured: without `.gstack-sync.json`, everything works locally as before. See `docs/designs/TEAM_COORDINATION_STORE.md` for architecture and setup.
- **Supabase migration SQL** — 5 migration files in `supabase/migrations/` for teams, eval_runs, data tables (retros, QA, ships, Greptile), eval costs, and sync heartbeats. Row-level security policies ensure team members can only access their own team's data.
- **Sync config + auth** — `.gstack-sync.json` for project-level config (Supabase URL, anon key, team slug). `~/.gstack/auth.json` for user-level tokens (keyed by Supabase URL for multi-team support). `GSTACK_SUPABASE_ACCESS_TOKEN` env var for CI/automation. Token refresh built in.
- **`gstack-sync` CLI** — `setup`, `status`, `test`, `show`, `push-*`, `pull`, `drain`, and `logout` subcommands for managing team sync.
- **Universal eval format** — `StandardEvalResult` schema with validation, normalization, and bidirectional legacy conversion. Any language can produce JSON matching this format and push via `gstack eval push`.
- **Unified eval CLI** — `gstack eval list|compare|summary|trend|push|cost|cache|watch` consolidating all eval tools into one entry point.
- **Per-model cost tracking** — eval results now include `costs[]` with exact per-model token usage (input, output, cache read, cache creation) and API-reported cost. Extracted from `resultLine.modelUsage` in the `claude -p` NDJSON stream. `computeCosts()` prefers exact `cost_usd` over MODEL_PRICING estimates (~4x more accurate with prompt caching).
- **LLM judge caching** — SHA-based caching for LLM-as-judge eval calls via `eval-cache.ts`. Cache keyed by `model:prompt`, so unchanged SKILL.md content skips API calls entirely. ~$0.18/run savings. Set `EVAL_CACHE=0` to force re-run.
- **Dynamic model selection** — `EVAL_JUDGE_TIER` env var controls which Claude model runs judge evals (haiku/sonnet/opus, default: sonnet). `EVAL_TIER` pins the E2E test model via `--model` flag to `claude -p`.
- **`bun run eval:trend`** — per-test pass rate tracking over last N runs. Classifies tests as stable-pass, stable-fail, flaky, improving, or degrading. Sparkline table with `--limit`, `--tier`, `--test` filters.
- **Shared utilities** — `lib/util.ts` extracted with `atomicWriteJSON`, `readJSON`, `getGitInfo`, `getRemoteSlug`, `listEvalFiles`, `loadEvalResults`, `formatTimestamp`, and path constants.
- 52+ new tests across eval cache, cost, format, tier, trend, sync config, sync client, and LLM judge integration.
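
The cost rule described for `computeCosts()` (prefer the exact API-reported cost, fall back to a pricing-table estimate) can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the `ModelUsage` shape, the `PRICING_PER_MTOK` table, the model name, and the rates are all assumptions made up for the example.

```typescript
// Hypothetical per-model usage record; field names are assumptions, not the PR's schema.
interface ModelUsage {
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd?: number; // exact API-reported cost, when available
}

// Placeholder rates in USD per million tokens. These numbers are invented for the example.
const PRICING_PER_MTOK: Record<string, { input: number; output: number }> = {
  "example-model": { input: 1.0, output: 5.0 },
};

// Prefer the exact reported cost; otherwise estimate from the table; unknown models cost 0.
function costFor(usage: ModelUsage): { model: string; costUsd: number; exact: boolean } {
  if (typeof usage.costUsd === "number") {
    return { model: usage.model, costUsd: usage.costUsd, exact: true };
  }
  const rate = PRICING_PER_MTOK[usage.model];
  const estimate = rate
    ? (usage.inputTokens * rate.input + usage.outputTokens * rate.output) / 1000000
    : 0;
  return { model: usage.model, costUsd: estimate, exact: false };
}

// Sum the per-model costs into a single result-level figure.
function totalCost(usages: ModelUsage[]): number {
  return usages.reduce((sum, u) => sum + costFor(u).costUsd, 0);
}
```

Keeping the `exact` flag on each entry would let a report show which figures are estimates.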

### Changed
- `callJudge()` and `judge()` now return `{ result, meta }` with `JudgeMeta` (model, tokens, cached flag). `outcomeJudge()` retains simple return type for E2E callers.
- `EvalCollector.finalize()` aggregates per-test `costs[]` into result-level cost breakdown and attempts team sync (non-blocking).
- `cli-eval.ts` main block guarded with `import.meta.main` to prevent execution on import.
- `eval:summary` now hints to run `eval:trend` when flaky tests are detected.
- All 8 LLM eval test sites updated from hard-coded `cost_usd: 0.02` to real API-reported costs.

## 0.3.9 — 2026-03-15

### Added
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -15,6 +15,7 @@ bun run dev:skill # watch mode: auto-regen + validate on change
bun run eval:list # list all eval runs from ~/.gstack-dev/evals/
bun run eval:compare # compare two eval runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all eval runs
bun run eval:trend # per-test pass rate trends (flaky detection)
```

`test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time
5 changes: 4 additions & 1 deletion CONTRIBUTING.md
@@ -134,6 +134,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
bun run eval:list # list all eval runs
bun run eval:compare # compare two runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all runs
bun run eval:trend # per-test pass rate over last N runs (flaky detection)
bun run eval:cache stats # check LLM judge cache hit rate
```

Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
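
The trend labels that `eval:trend` reports (stable-pass, stable-fail, flaky, improving, degrading) could be derived from a per-test pass/fail history along these lines. The rules here are assumptions chosen for illustration (a single state change counts as improving or degrading, anything noisier is flaky). The actual `computeTrends` logic may differ.

```typescript
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// `history` is ordered oldest-first: true = pass, false = fail.
function classifyTrend(history: boolean[]): Trend {
  if (history.length === 0) throw new Error("history must not be empty");
  const passes = history.filter((passed) => passed).length;
  if (passes === history.length) return "stable-pass";
  if (passes === 0) return "stable-fail";

  // Count pass<->fail flips between consecutive runs.
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) flips++;
  }
  // Exactly one flip: the test changed state once and stayed there.
  if (flips === 1) {
    return history[history.length - 1] ? "improving" : "degrading";
  }
  return "flaky";
}

// Fraction of runs that passed, in the range 0..1.
function passRate(history: boolean[]): number {
  return history.filter((passed) => passed).length / history.length;
}
```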
@@ -152,7 +154,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. T
# Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
```

- Uses `claude-sonnet-4-6` for scoring stability
- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
- Tests live in `test/skill-llm-eval.test.ts`
- Calls the Anthropic API directly (not `claude -p`), so it works from anywhere including inside Claude Code
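
As a rough illustration of SHA-keyed judge caching with an `EVAL_CACHE=0` style bypass, the sketch below hashes `model:prompt` and skips the compute step on a hit. It is not the contents of `eval-cache.ts`: the `JudgeCache` class, the in-memory `Map`, and the synchronous `compute` callback are simplifying assumptions, since a real judge call would be asynchronous and persisted to disk.

```typescript
import { createHash } from "node:crypto";

// Cache key: SHA-256 hex digest over "model:prompt".
function cacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}:${prompt}`).digest("hex");
}

// In-memory stand-in for a persistent cache (hypothetical class, for illustration only).
class JudgeCache<T> {
  private store = new Map<string, T>();
  private enabled: boolean;

  // Pass `false` to mimic the EVAL_CACHE=0 bypass; callers would derive this from the environment.
  constructor(enabled: boolean = true) {
    this.enabled = enabled;
  }

  getOrCompute(model: string, prompt: string, compute: () => T): { value: T; cached: boolean } {
    const key = cacheKey(model, prompt);
    if (this.enabled && this.store.has(key)) {
      return { value: this.store.get(key) as T, cached: true };
    }
    const value = compute(); // cache miss or bypass: run the judge
    if (this.enabled) this.store.set(key, value);
    return { value, cached: false };
  }
}
```

Because the prompt is part of the key, any edit to the judged content produces a new digest and therefore a fresh call.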

6 changes: 6 additions & 0 deletions README.md
@@ -629,6 +629,12 @@ bun run eval:watch # live dashboard during E2E runs

E2E tests stream real-time progress, write machine-readable diagnostics, and persist partial results that survive kills. See CONTRIBUTING.md for the full eval infrastructure.

### Team sync (optional)

For teams, gstack can sync eval results, retro snapshots, QA reports, and ship logs to a shared Supabase store. Without this, everything works locally as before — sync is purely additive.

To set up: copy `.gstack-sync.json.example` to `.gstack-sync.json`, create a Supabase project, run the migrations in `supabase/migrations/`, and fill in your credentials. See `docs/designs/TEAM_COORDINATION_STORE.md` for the full guide.

## License

MIT
42 changes: 41 additions & 1 deletion TODOS.md
@@ -231,7 +231,7 @@

**Why:** Spot quality trends — is the app getting better or worse?

**Context:** QA already writes structured reports. This adds cross-run comparison.
**Context:** `eval:trend` now tracks test-level pass rates (eval infrastructure). QA-run-level trending (health scores over time across QA report files) is a separate feature that could reuse `computeTrends` pattern from `lib/cli-eval.ts`.

**Effort:** S
**Priority:** P2
@@ -277,6 +277,44 @@
**Priority:** P3
**Depends on:** Browse sessions

## Team Sync

### Streaming parser for large session files

**What:** Replace readFileSync with readline/createReadStream for session files >10MB.

**Why:** Files >10MB are currently skipped, so long sessions (1000+ turns, 35MB) lose enrichment data (tools_used, full turn count).

**Context:** Current 10MB cap is defensive. Session files at `~/.claude/projects/{hash}/{sid}.jsonl` can be 35MB for marathon sessions. Streaming parser removes the cap while keeping memory usage constant.

**Effort:** S
**Priority:** P3
**Depends on:** Transcript sync (Phase 3)
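
One way to approach the streaming idea is a line-at-a-time accumulator that keeps only running totals, so memory grows with the number of distinct tools rather than with file size. The sketch below is an assumption-heavy illustration: `createSessionAccumulator` and the `tool` field on each JSONL entry are hypothetical, and the real session schema may look different.

```typescript
interface SessionStats {
  turns: number;
  toolsUsed: Record<string, number>;
  badLines: number;
}

// Hypothetical accumulator: feed it one JSONL line at a time.
function createSessionAccumulator() {
  const stats: SessionStats = { turns: 0, toolsUsed: {}, badLines: 0 };
  return {
    push(line: string): void {
      if (line.trim() === "") return; // ignore blank lines
      let parsed: unknown;
      try {
        parsed = JSON.parse(line);
      } catch {
        stats.badLines++; // malformed line: count it and move on
        return;
      }
      if (typeof parsed !== "object" || parsed === null) {
        stats.badLines++;
        return;
      }
      stats.turns++;
      // `tool` is an assumed field name used only for this example.
      const tool = (parsed as { tool?: unknown }).tool;
      if (typeof tool === "string") {
        stats.toolsUsed[tool] = (stats.toolsUsed[tool] ?? 0) + 1;
      }
    },
    result(): SessionStats {
      return stats;
    },
  };
}
```

In Node, `push` could be called from the `line` event of `readline.createInterface({ input: fs.createReadStream(path) })`, which avoids loading the whole file the way `readFileSync` does.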

### Session effectiveness scoring

**What:** Compute a 1-5 effectiveness score per session based on turns to achieve goal, tool diversity, whether code was shipped, and session duration.

**Why:** Enables `show sessions --best` and team-level AI effectiveness metrics. Raw data (tools_used, turns, duration, summary) already in Supabase after transcript sync.

**Context:** Year 2 roadmap item. Scoring heuristics need iteration. Could start with: fewer turns = more efficient, more tool diversity = better problem decomposition, shipped code (detected via git) = successful outcome.

**Effort:** M
**Priority:** P2
**Depends on:** Transcript sync (Phase 3)

### Weekly AI usage digest

**What:** Supabase edge function that runs weekly, aggregates session_transcripts + eval_runs, sends team summary to Slack/email.

**Why:** Passive team visibility without running commands. "Your team ran 47 sessions this week. Top tools: Edit(156), Bash(89). Sarah shipped 3 PRs via /ship."

**Context:** Design doc Phase 4 item. Requires Supabase edge functions + Slack/email integration. Transcript data from Phase 3 is the primary input alongside eval_runs.

**Effort:** L
**Priority:** P2
**Depends on:** Transcript sync (Phase 3), Supabase edge functions
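
The aggregation step behind such a digest might look like the following sketch, which totals tool usage across session rows and formats a one-line summary in the style quoted above. The `SessionRow` shape and `formatDigest` name are assumptions for illustration. Delivery to Slack or email and the Supabase edge function wiring are out of scope here.

```typescript
// Hypothetical row shape; the real session_transcripts columns may differ.
interface SessionRow {
  user: string;
  toolsUsed: Record<string, number>;
}

// Build a one-line team summary listing the `limit` most-used tools.
function formatDigest(rows: SessionRow[], limit: number = 2): string {
  const totals = new Map<string, number>();
  for (const row of rows) {
    for (const [tool, count] of Object.entries(row.toolsUsed)) {
      totals.set(tool, (totals.get(tool) ?? 0) + count);
    }
  }
  // Sort by count descending, then by name so ties are deterministic.
  const top = Array.from(totals.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([tool, count]) => `${tool}(${count})`)
    .join(", ");
  const plural = rows.length === 1 ? "" : "s";
  return `Your team ran ${rows.length} session${plural} this week. Top tools: ${top || "none"}.`;
}
```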

## Infrastructure

### /setup-gstack-upload skill (S3 bucket)
@@ -335,6 +373,8 @@

**Why:** Reduce E2E test cost and flakiness.

**Status:** Model pinning shipped (session-runner.ts passes `--model` from `EVAL_TIER` env). Retry:2 still TODO.

**Effort:** XS
**Priority:** P2

2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.3.9
0.3.10
8 changes: 8 additions & 0 deletions bin/gstack-eval
@@ -0,0 +1,8 @@
#!/usr/bin/env bash
set -euo pipefail

# gstack eval — unified eval CLI
# Delegates to lib/cli-eval.ts via bun

GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
exec bun run "$GSTACK_DIR/lib/cli-eval.ts" "$@"
86 changes: 86 additions & 0 deletions bin/gstack-sync
@@ -0,0 +1,86 @@
#!/usr/bin/env bash
# gstack-sync — team data sync CLI.
#
# Usage:
# gstack-sync setup — interactive auth flow
# gstack-sync status — show sync status
# gstack-sync test — validate full sync flow
# gstack-sync show [evals|ships|retros|sessions] — view team data
# gstack-sync push-{eval,retro,qa,ship,greptile} <file> — push data
# gstack-sync push-transcript — sync Claude session transcripts
# gstack-sync pull — pull team data to local cache
# gstack-sync drain — drain the offline queue
# gstack-sync logout — clear auth tokens
#
# Env overrides (for testing):
# GSTACK_DIR — override auto-detected gstack root
# GSTACK_STATE_DIR — override ~/.gstack state directory
set -euo pipefail

GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"

case "${1:-}" in
setup)
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" setup
;;
status)
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" status
;;
push-eval)
FILE="${2:?Usage: gstack-sync push-eval <file.json>}"
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-eval "$FILE"
;;
push-retro)
FILE="${2:?Usage: gstack-sync push-retro <file.json>}"
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-retro "$FILE"
;;
push-qa)
FILE="${2:?Usage: gstack-sync push-qa <file.json>}"
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-qa "$FILE"
;;
push-ship)
FILE="${2:?Usage: gstack-sync push-ship <file.json>}"
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-ship "$FILE"
;;
push-greptile)
FILE="${2:?Usage: gstack-sync push-greptile <file.json>}"
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-greptile "$FILE"
;;
push-transcript)
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-transcript
;;
test)
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" test
;;
show)
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" show "${@:2}"
;;
pull)
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" pull
;;
drain)
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" drain
;;
logout)
exec bun run "$GSTACK_DIR/lib/cli-sync.ts" logout
;;
*)
echo "Usage: gstack-sync <command> [args]"
echo ""
echo "Commands:"
echo " setup Interactive auth flow (opens browser)"
echo " status Show sync status (queue, cache, connection)"
echo " test Validate full sync flow (push + pull)"
echo " show [evals|ships|retros|sessions] View team data in terminal"
echo " push-eval <file> Push eval result JSON to team store"
echo " push-retro <file> Push retro snapshot JSON"
echo " push-qa <file> Push QA report JSON"
echo " push-ship <file> Push ship log JSON"
echo " push-greptile <file> Push Greptile triage entry JSON"
echo " push-transcript Sync Claude session transcripts"
echo " pull Pull team data to local cache"
echo " drain Drain the offline sync queue"
echo " logout Clear auth tokens"
exit 1
;;
esac
132 changes: 132 additions & 0 deletions docs/TEAM_SYNC_SETUP.md
@@ -0,0 +1,132 @@
# Team Sync Setup Guide

Team sync lets your team share eval results, retro snapshots, QA reports, ship logs, and Greptile triage data via a shared Supabase store. All sync is optional and non-fatal — without it, everything works locally as before.

## Prerequisites

- A [Supabase](https://supabase.com) project (free tier works)
- gstack v0.3.10+

## Step 1: Create a Supabase project

1. Go to [supabase.com](https://supabase.com) and create a new project
2. Note your **Project URL** (e.g., `https://xxxx.supabase.co`)
3. Note your **anon/public key** from Settings > API

## Step 2: Run migrations

In the Supabase SQL Editor, run these files **in order**:

```
supabase/migrations/001_teams.sql
supabase/migrations/002_eval_runs.sql
supabase/migrations/003_data_tables.sql
supabase/migrations/004_eval_costs.sql
supabase/migrations/005_sync_heartbeats.sql
```

Copy-paste each file's contents into the SQL editor and run.

## Step 3: Create your team

In the SQL editor, create a team now. You add yourself as owner after authenticating in Step 5:

```sql
-- Create team
INSERT INTO teams (name, slug) VALUES ('Your Team', 'your-team-slug');

-- After authenticating (Step 5), add yourself as owner:
-- INSERT INTO team_members (team_id, user_id, role)
-- VALUES ('<team-id>', '<your-user-id>', 'owner');
```

Note the team slug — you'll need it in the next step.

## Step 4: Configure your project

Copy the example config to your project root:

```bash
cp .gstack-sync.json.example .gstack-sync.json
```

Edit `.gstack-sync.json` with your Supabase details:

```json
{
"supabase_url": "https://YOUR_PROJECT.supabase.co",
"supabase_anon_key": "eyJ...",
"team_slug": "your-team-slug"
}
```
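
For readers wiring this config into their own tooling, a loader could validate the three fields before attempting any sync. The sketch below is illustrative only: `parseSyncConfig` is a hypothetical name, and both returning `null` for an unusable config and requiring an `https://` URL are assumptions, not documented behavior.

```typescript
interface SyncConfig {
  supabase_url: string;
  supabase_anon_key: string;
  team_slug: string;
}

// Returns a validated config, or null when the input is unusable (sync would then stay off).
function parseSyncConfig(raw: string): SyncConfig | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as Record<string, unknown>;

  const url = obj["supabase_url"];
  const key = obj["supabase_anon_key"];
  const slug = obj["team_slug"];
  if (typeof url !== "string" || typeof key !== "string" || typeof slug !== "string") return null;
  if (url.trim() === "" || key.trim() === "" || slug.trim() === "") return null;
  // Assumption: only https URLs are accepted.
  if (!url.startsWith("https://")) return null;

  return { supabase_url: url, supabase_anon_key: key, team_slug: slug };
}
```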

**Important:** Add `.gstack-sync.json` to `.gitignore` if it contains sensitive keys. If your whole team uses the same Supabase project, you can commit it instead: the anon key is safe to commit because RLS protects the data.

## Step 5: Authenticate

```bash
gstack-sync setup
```

This opens your browser for Supabase OAuth. After authenticating, tokens are saved to `~/.gstack/auth.json` (mode 0600).

**For CI/automation:** Set the `GSTACK_SUPABASE_ACCESS_TOKEN` env var instead of running setup.
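
The two token sources (the env var for CI, and `~/.gstack/auth.json` for interactive use) suggest a lookup along the lines below. Treat it as a sketch under stated assumptions: the `AuthFile` shape with an `access_token` field and the env-var-first precedence are guesses, and `resolveAccessToken` is a hypothetical helper.

```typescript
// Assumed shape of auth.json: one entry per Supabase URL (supports multiple teams).
type AuthFile = Record<string, { access_token: string }>;

// Resolve a token for the given project, or null when the user is not authenticated.
function resolveAccessToken(
  env: Record<string, string | undefined>,
  auth: AuthFile,
  supabaseUrl: string,
): string | null {
  // Assumption: the CI/automation env var takes precedence over saved tokens.
  const fromEnv = env["GSTACK_SUPABASE_ACCESS_TOKEN"];
  if (fromEnv) return fromEnv;

  const entry = auth[supabaseUrl];
  return entry ? entry.access_token : null;
}
```

Passing `env` in as a parameter (for example `process.env` in Node) keeps the function easy to test.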

## Step 6: Verify

```bash
gstack-sync test
```

Expected output:
```
gstack sync test
────────────────────────────────────
1. Config: ok (team: your-team-slug)
2. Auth: ok (you@email.com)
3. Push: ok (123ms)
4. Pull: ok (1 heartbeats, 95ms)
────────────────────────────────────
Sync test passed ✓
```

## Step 7: See your data

```bash
gstack-sync show # team summary dashboard
gstack-sync show evals # recent eval runs
gstack-sync show ships # recent ship logs
gstack-sync show retros # recent retro snapshots
gstack-sync status # sync health check
bun run eval:trend --team # team-wide test trends
```

## How it works

When sync is configured, skills automatically push data after completing their primary task:

- `/ship` pushes a ship log after PR creation (Step 8.5)
- `/retro` pushes the snapshot after saving to `.context/retros/` (Step 13)
- `/qa` pushes a report after computing the health score (Phase 6)
- `/review` pushes Greptile triage entries after history file writes
- Eval runs are pushed automatically by `EvalCollector.finalize()`

All pushes are non-fatal. If sync fails, entries are queued in `~/.gstack/sync-queue.json` and retried on the next push or via `gstack-sync drain`.
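
The queue-and-retry behavior can be pictured with a small drain routine like the one below. It is a simplified sketch, not the PR's implementation: `drainQueue` and `QueueEntry` are hypothetical, `send` is synchronous for brevity where a real push would be async, and dropping an entry once it reaches 5 attempts is an assumption based on the changelog's "up to 5 attempts".

```typescript
interface QueueEntry<T> {
  payload: T;
  attempts: number; // failed attempts so far
}

const MAX_ATTEMPTS = 5;

// Try each queued entry once. Failures are re-queued until they hit MAX_ATTEMPTS.
function drainQueue<T>(
  queue: QueueEntry<T>[],
  send: (payload: T) => boolean,
): { remaining: QueueEntry<T>[]; sent: number; dropped: number } {
  const remaining: QueueEntry<T>[] = [];
  let sent = 0;
  let dropped = 0;
  for (const entry of queue) {
    let ok = false;
    try {
      ok = send(entry.payload);
    } catch {
      ok = false; // non-fatal: a thrown error is treated like a failed push
    }
    if (ok) {
      sent++;
      continue;
    }
    const attempts = entry.attempts + 1;
    if (attempts >= MAX_ATTEMPTS) {
      dropped++; // assumed policy: give up after the fifth failure
    } else {
      remaining.push({ payload: entry.payload, attempts });
    }
  }
  return { remaining, sent, dropped };
}
```

Because errors are swallowed inside the loop, a failing network call never propagates to the skill that triggered the push.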

## Troubleshooting

| Problem | Fix |
|---|---|
| "No .gstack-sync.json found" | Copy `.gstack-sync.json.example` and fill in your values |
| "Not authenticated" | Run `gstack-sync setup` |
| Push fails with 404 | Run the migration SQL files in order |
| "Connection failed" | Check your Supabase URL and that the project is running |
| Queue growing | Run `gstack-sync drain` to flush |

## Adding team members

Each team member needs to:

1. Have `.gstack-sync.json` in their project (commit it or share it)
2. Run `gstack-sync setup` to authenticate
3. Be added to `team_members` in Supabase (by an admin)