Closed
32 commits
f87bc21
docs: add team coordination store design doc
garrytan Mar 15, 2026
8931165
Merge remote-tracking branch 'origin/main' into garrytan/team-supabas…
garrytan Mar 15, 2026
5c1ea08
docs: scrub proprietary refs, close eval format gaps, integrate gstac…
garrytan Mar 15, 2026
caed287
feat: extract shared utilities into lib/util.ts
garrytan Mar 15, 2026
3713c3b
feat: add team sync infrastructure (config, auth, push/pull, CLI)
garrytan Mar 15, 2026
f7ae465
feat: add Supabase migration SQL for team data store
garrytan Mar 15, 2026
82e2041
feat: hook eval-store sync, use shared utils, add 30 lib tests
garrytan Mar 15, 2026
7f7035f
feat: add listEvalFiles, loadEvalResults, formatTimestamp to lib/util.ts
garrytan Mar 15, 2026
9bc6c94
feat: add eval format validation, tier selection, cost tracking
garrytan Mar 15, 2026
1f5b788
feat: add SHA-based eval caching with EVAL_CACHE=0 bypass
garrytan Mar 15, 2026
4ad73f7
feat: unified gstack eval CLI with list, compare, push, cache, cost
garrytan Mar 15, 2026
02925cf
feat: wire costs[] from modelUsage into eval results
garrytan Mar 15, 2026
59752fc
feat: wire eval-cache + eval-tier into LLM judge, pin E2E model
garrytan Mar 15, 2026
daea165
feat: add eval:trend CLI for per-test pass rate tracking
garrytan Mar 15, 2026
33c9552
chore: update gitignore
garrytan Mar 15, 2026
e280333
chore: bump v0.3.10, update CHANGELOG and docs
garrytan Mar 15, 2026
eb7ef21
docs: add setup comments to .gstack-sync.json.example
garrytan Mar 15, 2026
1432046
docs: CHANGELOG covers full branch scope including team sync
garrytan Mar 15, 2026
704fe34
docs: clean up sync example, add team sync section to README
garrytan Mar 15, 2026
e97108a
feat: contributor mode, session awareness, universal RECOMMENDATION f…
garrytan Mar 15, 2026
c11cb70
Merge remote-tracking branch 'origin/garrytan/team-supabase-store' in…
garrytan Mar 15, 2026
dc3fcc8
feat: DRY push functions, add push-greptile + sync test/show commands
garrytan Mar 16, 2026
06f2da2
feat: wire team sync push into ship, retro, qa, and greptile skills
garrytan Mar 16, 2026
87cb769
feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests
garrytan Mar 16, 2026
5e641bd
feat: add Enum & Value Completeness to /review critical checklist
garrytan Mar 16, 2026
b07e842
Merge remote-tracking branch 'origin/garrytan/team-supabase-store' in…
garrytan Mar 16, 2026
2d42e15
chore: bump version and changelog (v0.3.11)
garrytan Mar 16, 2026
0e29d7d
feat: add enriched transcript sync — Haiku summaries, session file en…
garrytan Mar 16, 2026
a104471
feat: add push-transcript CLI, show sessions, interactive setup, 36 t…
garrytan Mar 16, 2026
3a57a3f
feat: add /setup-team-sync skill, auto-push transcript hooks in skills
garrytan Mar 16, 2026
6e14689
docs: add team sync TODOs — streaming parser, effectiveness scoring, …
garrytan Mar 16, 2026
cce407b
Merge remote-tracking branch 'origin/garrytan/team-supabase-store' in…
garrytan Mar 16, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ bun.lock
.env.local
.env.*
!.env.example
.gstack-sync.json
5 changes: 5 additions & 0 deletions .gstack-sync.json.example
@@ -0,0 +1,5 @@
{
"supabase_url": "https://YOUR_PROJECT.supabase.co",
"supabase_anon_key": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.YOUR_ANON_KEY_HERE",
"team_slug": "your-team-name"
}
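
A minimal sketch of how a tool might read this file, assuming the three keys shown in the example are all required. `parseSyncConfig` is a hypothetical helper, not gstack's actual loader, and treating unedited placeholder values as "not configured" is an assumption made here for illustration.

```typescript
// Field names are taken from .gstack-sync.json.example above.
interface SyncConfig {
  supabase_url: string;
  supabase_anon_key: string;
  team_slug: string;
}

// Hypothetical helper: returns a validated config, or null when the file
// content is unusable (so callers can fall back to local-only behavior).
function parseSyncConfig(raw: string): SyncConfig | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // malformed JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const { supabase_url, supabase_anon_key, team_slug } = data as Record<string, unknown>;
  if (typeof supabase_url !== "string" || supabase_url.length === 0) return null;
  if (typeof supabase_anon_key !== "string" || supabase_anon_key.length === 0) return null;
  if (typeof team_slug !== "string" || team_slug.length === 0) return null;
  // Assumption: placeholders copied straight from the example mean "not set up yet".
  if (supabase_url.includes("YOUR_PROJECT") || supabase_anon_key.includes("YOUR_ANON_KEY_HERE")) {
    return null;
  }
  return { supabase_url, supabase_anon_key, team_slug };
}
```

Returning `null` instead of throwing keeps a missing or half-edited config from breaking anything that runs without sync.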
37 changes: 37 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,42 @@
# Changelog

## 0.3.11 — 2026-03-15

### Added
- **Contributor mode** — set `gstack_contributor: true` in `~/.gstack/config.yaml` and Claude Code automatically files field reports to `~/.gstack/contributor-logs/` when gstack itself misbehaves. Reports include what you were doing, what went wrong, annoyance level (1-5), repro steps, and raw output. Opens the report for review. Max 3 per session, deduped by slug.
- **Concurrent session tracking** — gstack detects how many sessions are active in a 2-hour window. When 3+ sessions are running simultaneously, all skills enter "ELI16 mode": every AskUserQuestion re-grounds the user on project, branch, current task, and the specific question — because context-switching is real.
- **Universal RECOMMENDATION format** — every AskUserQuestion across all skills now follows: context → question → `RECOMMENDATION: Choose X because ___` → options. Consistent everywhere. Plan-review skills reference this baseline and add their own rules on top.
- **Enum & Value Completeness** review category — new CRITICAL check in `/review` that traces new enum values, status strings, and type constants through every consumer outside the diff. Catches the class of bugs where a value is added but not handled in all case/switch chains, allowlists, or frontend-backend contracts.

### Changed
- Renamed `{{UPDATE_CHECK}}` placeholder to `{{PREAMBLE}}` across all 10 skill templates. The preamble now includes update check, session tracking, contributor mode, and AskUserQuestion format in a single startup block.
- DRY'd plan-ceo-review and plan-eng-review AskUserQuestion formatting rules to reference the preamble baseline instead of duplicating instructions.
- Rewrote CONTRIBUTING.md with contributor workflow, cross-project testing guide, and Conductor workspace docs.
- Added vendored symlink awareness section to CLAUDE.md.

## 0.3.10 — 2026-03-15

### Added
- **Team sync via Supabase (optional)** — shared data store for eval results, retro snapshots, QA reports, ship logs, and Greptile triage across team members. All sync operations are non-fatal and non-blocking — skills never wait on network. Offline queue with automatic retry (up to 5 attempts). Zero impact when not configured: without `.gstack-sync.json`, everything works locally as before. See `docs/designs/TEAM_COORDINATION_STORE.md` for architecture and setup.
- **Supabase migration SQL** — 4 migration files in `supabase/migrations/` for teams, eval_runs, data tables (retros, QA, ships, Greptile), and eval costs. Row-level security policies ensure team members can only access their own team's data.
- **Sync config + auth**: `.gstack-sync.json` for project-level config (Supabase URL, anon key, team slug). `~/.gstack/auth.json` for user-level tokens (keyed by Supabase URL for multi-team support). `GSTACK_SUPABASE_ACCESS_TOKEN` env var for CI/automation. Token refresh built in.
- **`gstack sync` CLI**: `status`, `push`, `pull`, `drain`, `login`, `logout` subcommands for managing team sync.
- **Universal eval format**: `StandardEvalResult` schema with validation, normalization, and bidirectional legacy conversion. Any language can produce JSON matching this format and push via `gstack eval push`.
- **Unified eval CLI**: `gstack eval list|compare|summary|trend|push|cost|cache|watch` consolidating all eval tools into one entry point.
- **Per-model cost tracking** — eval results now include `costs[]` with exact per-model token usage (input, output, cache read, cache creation) and API-reported cost. Extracted from `resultLine.modelUsage` in the `claude -p` NDJSON stream. `computeCosts()` prefers exact `cost_usd` over MODEL_PRICING estimates (~4x more accurate with prompt caching).
- **LLM judge caching** — SHA-based caching for LLM-as-judge eval calls via `eval-cache.ts`. Cache keyed by `model:prompt`, so unchanged SKILL.md content skips API calls entirely. ~$0.18/run savings. Set `EVAL_CACHE=0` to force re-run.
- **Dynamic model selection**: `EVAL_JUDGE_TIER` env var controls which Claude model runs judge evals (haiku/sonnet/opus, default: sonnet). `EVAL_TIER` pins the E2E test model via `--model` flag to `claude -p`.
- **`bun run eval:trend`** — per-test pass rate tracking over last N runs. Classifies tests as stable-pass, stable-fail, flaky, improving, or degrading. Sparkline table with `--limit`, `--tier`, `--test` filters.
- **Shared utilities**: `lib/util.ts` extracted with `atomicWriteJSON`, `readJSON`, `getGitInfo`, `getRemoteSlug`, `listEvalFiles`, `loadEvalResults`, `formatTimestamp`, and path constants.
- 52+ new tests across eval cache, cost, format, tier, trend, sync config, sync client, and LLM judge integration.
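
The LLM judge caching entry above can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of `eval-cache.ts`: the changelog only says the cache is SHA-based and keyed by `model:prompt`, so the choice of SHA-256, the in-memory `Map`, and the names `judgeCacheKey`, `cacheEnabled`, and `getOrCompute` are all hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumption: SHA-256 over "model:prompt". The changelog says "SHA-based"
// without naming the exact hash.
function judgeCacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}:${prompt}`).digest("hex");
}

// EVAL_CACHE=0 forces a re-run; any other value (or unset) allows cache reads.
function cacheEnabled(env: Record<string, string | undefined>): boolean {
  return env.EVAL_CACHE !== "0";
}

// Hypothetical wrapper: an in-memory Map stands in for whatever store the
// real cache uses. `compute` represents the judge API call.
function getOrCompute<T>(
  store: Map<string, T>,
  key: string,
  env: Record<string, string | undefined>,
  compute: () => T,
): { value: T; cached: boolean } {
  if (cacheEnabled(env)) {
    const hit = store.get(key);
    if (hit !== undefined) return { value: hit, cached: true };
  }
  const value = compute();
  store.set(key, value); // refresh the entry even when the read was bypassed
  return { value, cached: false };
}
```

Because the prompt text is part of the key, any edit to the judged content produces a new key and therefore a fresh API call, while unchanged content is served from the store.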

### Changed
- `callJudge()` and `judge()` now return `{ result, meta }` with `JudgeMeta` (model, tokens, cached flag). `outcomeJudge()` retains simple return type for E2E callers.
- `EvalCollector.finalize()` aggregates per-test `costs[]` into result-level cost breakdown and attempts team sync (non-blocking).
- `cli-eval.ts` main block guarded with `import.meta.main` to prevent execution on import.
- `eval:summary` now hints to run `eval:trend` when flaky tests are detected.
- All 8 LLM eval test sites updated from hard-coded `cost_usd: 0.02` to real API-reported costs.
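
To illustrate the cost aggregation described above, here is a rough sketch assuming a simplified `costs[]` entry. Only `cost_usd` and the idea of preferring it over a pricing-table estimate come from the changelog; the `CostEntry` field names, the `EXAMPLE_PRICING` table with its made-up model and rates, and the function names are hypothetical, and cache read/creation tokens are left out for brevity.

```typescript
// Hypothetical, simplified shape of one costs[] entry.
interface CostEntry {
  model: string;
  input_tokens: number;
  output_tokens: number;
  cost_usd?: number; // exact API-reported cost, when available
}

// Illustrative rates (USD per million tokens) for a made-up model; not real pricing.
const EXAMPLE_PRICING: Record<string, { input: number; output: number }> = {
  "example-model": { input: 3, output: 15 },
};

// Prefer the exact reported cost; fall back to a table estimate otherwise.
function entryCost(e: CostEntry): number {
  if (e.cost_usd !== undefined) return e.cost_usd;
  const rate = EXAMPLE_PRICING[e.model];
  if (rate === undefined) return 0; // unknown model: nothing to estimate from
  return (e.input_tokens * rate.input + e.output_tokens * rate.output) / 1_000_000;
}

// Roll per-test costs[] arrays up into one total per model.
function aggregateCosts(perTest: CostEntry[][]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const costs of perTest) {
    for (const e of costs) {
      totals.set(e.model, (totals.get(e.model) ?? 0) + entryCost(e));
    }
  }
  return totals;
}
```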

## 0.3.9 — 2026-03-15

### Added
19 changes: 19 additions & 0 deletions CLAUDE.md
@@ -15,6 +15,7 @@ bun run dev:skill # watch mode: auto-regen + validate on change
bun run eval:list # list all eval runs from ~/.gstack-dev/evals/
bun run eval:compare # compare two eval runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all eval runs
bun run eval:trend # per-test pass rate trends (flaky detection)
```

`test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time
@@ -71,6 +72,24 @@ When you need to interact with a browser (QA, dogfooding, cookie setup), use the
`mcp__claude-in-chrome__*` tools — they are slow, unreliable, and not what this
project uses.

## Vendored symlink awareness

When developing gstack, `.claude/skills/gstack` may be a symlink back to this
working directory (gitignored). This means skill changes are **live immediately**:
great for rapid iteration, risky during big refactors where half-written skills
could break other Claude Code sessions using gstack concurrently.

**Check once per session:** Run `ls -la .claude/skills/gstack` to see if it's a
symlink or a real copy. If it's a symlink to your working directory, be aware that:
- Template changes + `bun run gen:skill-docs` immediately affect all gstack invocations
- Breaking changes to SKILL.md.tmpl files can break concurrent gstack sessions
- During large refactors, remove the symlink (`rm .claude/skills/gstack`) so the
global install at `~/.claude/skills/gstack/` is used instead

**For plan reviews:** When reviewing plans that modify skill templates or the
gen-skill-docs pipeline, consider whether the changes should be tested in isolation
before going live (especially if the user is actively using gstack in other windows).

## Deploying to the active skill

The active skill lives at `~/.claude/skills/gstack/`. After making changes:
117 changes: 64 additions & 53 deletions CONTRIBUTING.md
@@ -20,9 +20,44 @@ Now edit any `SKILL.md`, invoke it in Claude Code (e.g. `/review`), and see your
bin/dev-teardown # deactivate — back to your global install
```

## How dev mode works
## Contributor mode

`bin/dev-setup` creates a `.claude/skills/` directory inside the repo (gitignored) and fills it with symlinks pointing back to your working tree. Claude Code sees the local `skills/` first, so your edits win over the global install.
Contributor mode is for people who want to fix gstack when it annoys them. Enable it
and Claude Code will automatically log issues to `~/.gstack/contributor-logs/` as you
work — what you were doing, what went wrong, repro steps, raw output.

```bash
~/.claude/skills/gstack/bin/gstack-config set gstack_contributor true
```

The logs are for **you**. When something bugs you enough to fix, the report is
already written. Fork gstack, symlink your fork into the project where you hit
the issue, fix it, and open a PR.

### The contributor workflow

1. **Hit friction while using gstack** — contributor mode logs it automatically
2. **Check your logs:** `ls ~/.gstack/contributor-logs/`
3. **Fork and clone gstack** (if you haven't already)
4. **Symlink your fork into the project where you hit the bug:**
```bash
# In your core project (the one where gstack annoyed you)
ln -sfn /path/to/your/gstack-fork .claude/skills/gstack
cd .claude/skills/gstack && bun install && bun run build
```
5. **Fix the issue** — your changes are live immediately in this project
6. **Test by actually using gstack** — do the thing that annoyed you, verify it's fixed
7. **Open a PR from your fork**

This is the best way to contribute: fix gstack while doing your real work, in the
project where you actually felt the pain.

## Working on gstack inside the gstack repo

When you're editing gstack skills and want to test them by actually using gstack
in the same repo, `bin/dev-setup` wires this up. It creates `.claude/skills/`
symlinks (gitignored) pointing back to your working tree, so Claude Code uses
your local edits instead of the global install.
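
Whether local edits are actually in effect depends on what `.claude/skills/gstack` currently is. The sketch below shows one way to check that programmatically, as an alternative to eyeballing `ls -la`. `classifySkillInstall` and its return labels are hypothetical names introduced for illustration; only the Node `fs` and `path` calls are real.

```typescript
import { lstatSync, readlinkSync } from "node:fs";
import { dirname, resolve } from "node:path";

type SkillInstall = "missing" | "real-copy" | "symlink-to-working-tree" | "symlink-elsewhere";

// Hypothetical helper: report what sits at skillPath relative to workingTree.
function classifySkillInstall(skillPath: string, workingTree: string): SkillInstall {
  try {
    // lstat inspects the link itself instead of following it.
    if (!lstatSync(skillPath).isSymbolicLink()) return "real-copy";
  } catch {
    return "missing"; // nothing at that path, so the global install applies
  }
  // Relative link targets are resolved against the link's parent directory.
  const target = resolve(dirname(skillPath), readlinkSync(skillPath));
  return target === resolve(workingTree) ? "symlink-to-working-tree" : "symlink-elsewhere";
}
```

For example, `classifySkillInstall(".claude/skills/gstack", process.cwd())` returning `"symlink-to-working-tree"` would mean edits in the current checkout are live for every gstack invocation in this project.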

```
gstack/ <- your working tree
@@ -134,6 +169,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
bun run eval:list # list all eval runs
bun run eval:compare # compare two runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all runs
bun run eval:trend # per-test pass rate over last N runs (flaky detection)
bun run eval:cache stats # check LLM judge cache hit rate
```

Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.
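
As an illustration of the flaky detection that `eval:trend` performs, here is a minimal sketch. The five category names come from the changelog, but the classification rule (comparing the pass rate of the older half of the runs with the newer half, with a 0.5 threshold) and the sparkline glyphs are assumptions made for this example, not the tool's actual logic.

```typescript
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// Fraction of runs that passed; callers guarantee a non-empty array.
function passRate(runs: boolean[]): number {
  return runs.filter((passed) => passed).length / runs.length;
}

// runs is ordered oldest to newest, one boolean per eval run of a single test.
function classifyTrend(runs: boolean[]): Trend {
  if (runs.length === 0) throw new Error("classifyTrend needs at least one run");
  if (runs.every((passed) => passed)) return "stable-pass";
  if (runs.every((passed) => !passed)) return "stable-fail";
  // Mixed results imply at least two runs, so both halves are non-empty.
  const mid = Math.floor(runs.length / 2);
  const delta = passRate(runs.slice(mid)) - passRate(runs.slice(0, mid));
  if (delta >= 0.5) return "improving"; // assumed threshold
  if (delta <= -0.5) return "degrading"; // assumed threshold
  return "flaky";
}

// One glyph per run: tall bar for a pass, low bar for a fail.
function sparkline(runs: boolean[]): string {
  return runs.map((passed) => (passed ? "█" : "▁")).join("");
}
```

With this rule, a test that failed twice and then passed twice reads as improving, while one that alternates between pass and fail reads as flaky.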
@@ -152,7 +189,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. T
# Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
```

- Uses `claude-sonnet-4-6` for scoring stability
- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
- Tests live in `test/skill-llm-eval.test.ts`
- Calls the Anthropic API directly (not `claude -p`), so it works from anywhere including inside Claude Code

@@ -205,69 +243,42 @@ When Conductor creates a new workspace, `bin/dev-setup` runs automatically. It d
- **`.env` propagates across worktrees.** Set it once in the main repo, all Conductor workspaces get it.
- **`.claude/skills/` is gitignored.** The symlinks never get committed.

## Testing a branch in another repo

When you're developing gstack in one workspace and want to test your branch in a
different project (e.g. testing browse changes against your real app), there are
two cases depending on how gstack is installed in that project.
## Testing your changes in a real project

### Global install only (no `.claude/skills/gstack/` in the project)

Point your global install at the branch:
**This is the recommended way to develop gstack.** Symlink your gstack checkout
into the project where you actually use it, so your changes are live while you
do real work:

```bash
cd ~/.claude/skills/gstack
git fetch origin
git checkout origin/<branch> # e.g. origin/v0.3.2
bun install # in case deps changed
bun run build # rebuild the binary
# In your core project
ln -sfn /path/to/your/gstack-checkout .claude/skills/gstack
cd .claude/skills/gstack && bun install && bun run build
```

Now open Claude Code in the other project — it picks up skills from
`~/.claude/skills/` automatically. To go back to main when you're done:
Now every gstack skill invocation in this project uses your working tree. Edit a
template, run `bun run gen:skill-docs`, and the next `/review` or `/qa` call picks
it up immediately.

**To go back to the stable global install**, just remove the symlink:

```bash
cd ~/.claude/skills/gstack
git checkout main && git pull
bun run build
rm .claude/skills/gstack
```

### Vendored project copy (`.claude/skills/gstack/` checked into the project)

Some projects vendor gstack by copying it into the repo (no `.git` inside the
copy). Project-local skills take priority over global, so you need to update
the vendored copy too. This is a three-step process:
Claude Code falls back to `~/.claude/skills/gstack/` automatically.

1. **Update your global install to the branch** (so you have the source):
```bash
cd ~/.claude/skills/gstack
git fetch origin
git checkout origin/<branch> # e.g. origin/v0.3.2
bun install && bun run build
```

2. **Replace the vendored copy** in the other project:
```bash
cd /path/to/other-project
### Alternative: point your global install at a branch

# Remove old skill symlinks and vendored copy
for s in browse plan-ceo-review plan-eng-review review ship retro qa setup-browser-cookies; do
rm -f .claude/skills/$s
done
rm -rf .claude/skills/gstack
If you don't want per-project symlinks, you can switch the global install:

# Copy from global install (strips .git so it stays vendored)
cp -Rf ~/.claude/skills/gstack .claude/skills/gstack
rm -rf .claude/skills/gstack/.git

# Rebuild binary and re-create skill symlinks
cd .claude/skills/gstack && ./setup
```

3. **Test your changes** — open Claude Code in that project and use the skills.
```bash
cd ~/.claude/skills/gstack
git fetch origin
git checkout origin/<branch>
bun install && bun run build
```

To revert to main later, repeat steps 1-2 with `git checkout main && git pull`
instead of `git checkout origin/<branch>`.
This affects all projects. To revert: `git checkout main && git pull && bun run build`.

## Shipping your changes

6 changes: 6 additions & 0 deletions README.md
@@ -629,6 +629,12 @@ bun run eval:watch # live dashboard during E2E runs

E2E tests stream real-time progress, write machine-readable diagnostics, and persist partial results that survive kills. See CONTRIBUTING.md for the full eval infrastructure.

### Team sync (optional)

For teams, gstack can sync eval results, retro snapshots, QA reports, and ship logs to a shared Supabase store. Without this, everything works locally as before — sync is purely additive.

To set up: copy `.gstack-sync.json.example` to `.gstack-sync.json`, create a Supabase project, run the migrations in `supabase/migrations/`, and fill in your credentials. See `docs/designs/TEAM_COORDINATION_STORE.md` for the full guide.

## License

MIT
50 changes: 49 additions & 1 deletion SKILL.md
@@ -16,15 +16,63 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->

## Update Check (run first)
## Preamble (run first)

```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
```

If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.

## AskUserQuestion Format

**ALWAYS follow this structure for every AskUserQuestion call:**
1. Context: project name, current branch, what we're working on (1-2 sentences)
2. The specific question or decision point
3. `RECOMMENDATION: Choose [X] because [one-line reason]`
4. Lettered options: `A) ... B) ... C) ...`

If `_SESSIONS` is 3 or more: the user is juggling multiple gstack sessions and context-switching heavily. **ELI16 mode** — they may not remember what this conversation is about. Every AskUserQuestion MUST re-ground them: state the project, the branch, the current plan/task, then the specific problem, THEN the recommendation and options. Be extra clear and self-contained — assume they haven't looked at this window in 20 minutes.

Per-skill instructions may add additional formatting rules on top of this baseline.
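
The `_SESSIONS` threshold above rests on the preamble's `find ~/.gstack/sessions -mmin -120` count. The same idea can be sketched in TypeScript as a pure function over file modification times. The names `countActiveSessions` and `isEli16Mode` are hypothetical, and the comparison only approximates `find`'s minute-rounding semantics.

```typescript
// Two-hour window, matching the -mmin 120 cutoff in the preamble.
const SESSION_WINDOW_MS = 120 * 60 * 1000;

// mtimesMs: modification times (epoch ms) of the files in ~/.gstack/sessions,
// one file per session, touched each time a skill starts.
function countActiveSessions(mtimesMs: number[], nowMs: number): number {
  return mtimesMs.filter((mtime) => nowMs - mtime < SESSION_WINDOW_MS).length;
}

// Three or more recently touched session files triggers the extra re-grounding.
function isEli16Mode(activeSessions: number): boolean {
  return activeSessions >= 3;
}
```

Keeping the function pure (times in, count out) makes the window logic easy to test without touching the filesystem; a caller would gather the mtimes with `fs.statSync` or similar.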

## Contributor Mode

If `_CONTRIB` is `true`: you are in **contributor mode**. When you hit friction with **gstack itself** (not the user's app), file a field report. Think: "hey, I was trying to do X with gstack and it didn't work / was confusing / was annoying. Here's what happened."

**gstack issues:** browse command fails/wrong output, snapshot missing elements, skill instructions unclear or misleading, binary crash/hang, unhelpful error message, any rough edge or annoyance — even minor stuff.
**NOT gstack issues:** user's app bugs, network errors to user's URL, auth failures on user's site.

**To file:** write `~/.gstack/contributor-logs/{slug}.md` with this structure:

```
# {Title}
Hey gstack team — ran into this while using /{skill-name}:
**What I was trying to do:** {what the user/agent was attempting}
**What happened instead:** {what actually happened}
**How annoying (1-5):** {1=meh, 3=friction, 5=blocker}
## Steps to reproduce
1. {step}
## Raw output
(wrap any error messages or unexpected output in a markdown code block)
**Date:** {YYYY-MM-DD} | **Version:** {gstack version} | **Skill:** /{skill}
```

Then run: `mkdir -p ~/.gstack/contributor-logs && open ~/.gstack/contributor-logs/{slug}.md`

Slug: lowercase, hyphens, max 60 chars (e.g. `browse-snapshot-ref-gap`). Skip if file already exists. Max 3 reports per session. File inline and continue — don't stop the workflow. Tell user: "Filed gstack field report: {title}"
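
The slug and rate-limit rules above could be expressed in code along these lines. This is only a sketch: in gstack these rules are instructions followed by the agent, and `toSlug` and `shouldFileReport` are hypothetical helpers introduced to make the constraints concrete.

```typescript
const MAX_SLUG_LENGTH = 60;
const MAX_REPORTS_PER_SESSION = 3;

// Lowercase, hyphen-separated, at most 60 characters.
function toSlug(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything else into single hyphens
    .replace(/^-+|-+$/g, "") // trim leading and trailing hyphens
    .slice(0, MAX_SLUG_LENGTH)
    .replace(/-+$/, ""); // truncation may leave a trailing hyphen
}

// Skip when a report with this slug already exists or the session cap is reached.
function shouldFileReport(
  slug: string,
  existingSlugs: Set<string>,
  filedThisSession: number,
): boolean {
  return filedThisSession < MAX_REPORTS_PER_SESSION && !existingSlugs.has(slug);
}
```

Deduplicating on the slug means two reports with the same normalized title count as one issue, which is what "skip if file already exists" implies.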

# gstack browse: QA Testing & Dogfooding

Persistent headless Chromium. First call auto-starts (~3s), then ~100-200ms per command.
2 changes: 1 addition & 1 deletion SKILL.md.tmpl
@@ -14,7 +14,7 @@ allowed-tools:

---

{{UPDATE_CHECK}}
{{PREAMBLE}}

# gstack browse: QA Testing & Dogfooding
