jaypatrick · jaypatrick · Mar 26, 2026 · Mar 25, 2026 · Mar 25, 2026 · Mar 26, 2026
diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
@@ -140,6 +140,7 @@
 - [Troubleshooting](troubleshooting/README.md)
   - [KB-001: API Not Available](troubleshooting/KB-001-api-not-available.md)
   - [KB-002: Hyperdrive Database Down](troubleshooting/KB-002-hyperdrive-database-down.md)
+  - [KB-003: Database Down After Deploy — Live Session](troubleshooting/KB-003-neon-hyperdrive-live-session-2026-03-25.md)
   - [Neon Troubleshooting](troubleshooting/neon-troubleshooting.md)
 - [Workflows](workflows/README.md)
   - [Workflows Reference](workflows/WORKFLOWS.md)

diff --git a/docs/troubleshooting/KB-002-hyperdrive-database-down.md b/docs/troubleshooting/KB-002-hyperdrive-database-down.md
@@ -7,6 +7,96 @@
 
 ---
 
+## Session Log — 2026-03-25
+
+This section captures the live troubleshooting session that led to the discovery of this KB article and the subsequent hardening work in v0.76.0+.
+
+### Symptoms Observed
+
+- UI showed **"Degraded performance — v0.75.0"** and **"Data may be stale"** banners on every page load
+- `/api/health` returned `database.status: "down"` with `latency_ms: 0`
+- Cloudflare Hyperdrive admin page showed **zero traffic** despite Neon dashboard showing migration activity
+- Health response showed `hyperdrive_host: "11f7f957eaae03a9fe9365c78e6eb4ed.hyperdrive.local"` — which is the correct Hyperdrive local proxy address (not a bug)
+
+### Diagnosis Steps Taken
+
+```bash
+# Step 1: Inspect the full health response
+curl -s https://adblock-frontend.jayson-knight.workers.dev/api/health | jq .services.database
+# Result:
+# {
+#   "status": "down",
+#   "latency_ms": 0,
+#   "hyperdrive_host": "11f7f957eaae03a9fe9365c78e6eb4ed.hyperdrive.local"
+# }
+
+# Step 2: Check the Hyperdrive binding configuration
+wrangler hyperdrive get 800f7e2edc86488ab24e8621982e9ad7
+# Result showed "scheme": "postgres" — Hyperdrive uses postgres://, not postgresql://
+
+# Step 3: Check deployed version
+curl -s https://adblock-frontend.jayson-knight.workers.dev/api/health | jq .version
+# "0.76.0" — schema fix was deployed
+
+# Step 4: Tail live worker logs
+wrangler tail --format=pretty
+# Revealed: ZodError thrown before any network call when parsing connectionString
+```
+
+### Key Diagnostic Insight: `.hyperdrive.local` Is Correct
+
+The `hyperdrive_host: "11f7f957eaae03a9fe9365c78e6eb4ed.hyperdrive.local"` host is the **Cloudflare-managed local proxy address** inside a deployed Worker. This is expected and correct — it means the `HYPERDRIVE` binding is wired up properly. The failure was happening _before_ any query reached this proxy.
+
+### Root Cause Found
+
+`PrismaClientConfigSchema` in `worker/lib/prisma-config.ts` was only accepting `postgresql://` as a valid scheme. Cloudflare Hyperdrive returns `postgres://` from `env.HYPERDRIVE.connectionString`. This caused a `ZodError` to be thrown synchronously before any TCP connection was attempted — hence `latency_ms: 0`.
+
+The fix (accepting both schemes via `.regex(/^postgre(?:s|sql):\/\//)`) was already applied in the first PR.
+
+### Remaining Issue at v0.76.0
+
+Even after the schema fix was deployed at v0.76.0, the database was still `down`. The `latency_ms: 0` pattern continued, meaning the error was still at instantiation time, not at query time. At this point there were two hypotheses:
+
+1. **The schema fix wasn't actually deployed** — `wrangler deployments list` should confirm the version
+2. **Query-level failure** — the `PrismaClient` was being created but the query itself was failing silently (catch block swallowed the error)
+
+The original `databaseProbe` catch block did not capture `error_code` or `error_message`, making it impossible to distinguish these cases from the health response alone.
+
+### Resolution Path
+
+The hardening work in this PR added:
+
+1. **`error_code` and `error_message`** in the `databaseProbe` catch block — redacted of any connection string fragments
+2. **5-second timeout** via `Promise.race` — a hung Hyperdrive connection no longer blocks the health response indefinitely
+3. **`$disconnect()` in a `finally` block** — prevents connection pool leaks after each probe
+4. **New `GET /api/health/db-smoke` endpoint** — runs `current_database()`, `pg_version`, `now()`, and `table_count` as a richer smoke test. This is the canonical way to verify DB connectivity after every production deploy.
+
+### Using the New Smoke Test Endpoint
+
+After deploying, run:
+
+```bash
+curl -s https://adblock-frontend.jayson-knight.workers.dev/api/health/db-smoke | jq .
+```
+
+Expected healthy output:
+
+```json
+{
+  "ok": true,
+  "db_name": "adblock-compiler",
+  "pg_version": "PostgreSQL 16.x ...",
+  "server_time": "2026-03-25T21:59:15.917Z",
+  "table_count": 17,
+  "latency_ms": 42,
+  "hyperdrive_host": "ep-winter-term-a8rxh2a9-pooler.eastus2.azure.neon.tech"
+}
+```
+
+If it returns `ok: false`, the `error` field will now contain a redacted error message that pinpoints the failure layer.
+
+---
+
 ## Symptom
 
 The live site at `https://adblock-frontend.jayson-knight.workers.dev/` displays two error banners:
@@ -195,8 +285,9 @@ Confirm that `db_name` matches the expected Neon database name. If it returns a
 ## Related KB Articles
 
 - [KB-001](./KB-001-api-not-available.md) — "Getting API is not available" on the main page
-- *(planned)* KB-003 — Cloudflare Queue consumer not processing messages
-- *(planned)* KB-004 — Angular SPA serves stale build after worker deploy
+- [KB-003](./KB-003-neon-hyperdrive-live-session-2026-03-25.md) — Database Down After Deploy — Live Debugging Session (2026-03-25)
+- *(planned)* KB-004 — Cloudflare Queue consumer not processing messages
+- *(planned)* KB-005 — Angular SPA serves stale build after worker deploy
 
 ---
 

diff --git a/docs/troubleshooting/KB-003-neon-hyperdrive-live-session-2026-03-25.md b/docs/troubleshooting/KB-003-neon-hyperdrive-live-session-2026-03-25.md
@@ -0,0 +1,238 @@
+# KB-003: Database Down After Deploy — Live Debugging Session (2026-03-25)
+
+> **Status:** ✅ Active
+> **Affected versions:** v0.75.0, v0.76.0
+> **Resolved in:** Hardening PR (error surfacing + `/api/health/db-smoke` + `$disconnect` + 5s timeout)
+> **Date:** 2026-03-25
+
+---
+
+## Symptom Description
+
+The live site at `https://adblock-frontend.jayson-knight.workers.dev/` showed:
+
+- **"Degraded performance — v0.75.0"** banner on every page (including anonymous/public pages)
+- **"Data may be stale"** secondary banner
+- `/api/health` returned `database.status: "down"` with `latency_ms: 0`
+- Cloudflare Hyperdrive admin dashboard showed **zero traffic** for the binding
+- Neon dashboard showed recent migration activity (migrations had run successfully)
+- Every auth-related request (`/api/auth/get-session`, `/api/auth/sign-out`) was being killed by Cloudflare with "Worker hung" errors visible in `wrangler tail`
+
+The problem affected **all visitors** — not just authenticated users — because the auth middleware always attempts to verify session state on every request.
+
+---
+
+## All Diagnostic Commands Run and Their Outputs
+
+### 1. Health endpoint check
+
+```bash
+curl -s https://adblock-frontend.jayson-knight.workers.dev/api/health | jq .services.database
+```
+
+Output:
+
+```json
+{
+  "status": "down",
+  "latency_ms": 0,
+  "hyperdrive_host": "11f7f957eaae03a9fe9365c78e6eb4ed.hyperdrive.local"
+}
+```
+
+**Key observations:**
+- `latency_ms: 0` — failure is at instantiation, not network level
+- `.hyperdrive.local` — this is **correct**; it is the Cloudflare internal proxy address
+
+### 2. Hyperdrive binding configuration
+
+```bash
+wrangler hyperdrive get 800f7e2edc86488ab24e8621982e9ad7
+```
+
+Output:
+
+```json
+{
+  "id": "800f7e2edc86488ab24e8621982e9ad7",
+  "name": "hyperdrive-adblockdb-neon-prod",
+  "origin": {
+    "host": "ep-winter-term-a8rxh2a9-pooler.eastus2.azure.neon.tech",
+    "port": 5432,
+    "database": "adblock-compiler",
+    "scheme": "postgres",
+    "user": "neondb_owner"
+  }
+}
+```
+
+**Key observation:** `"scheme": "postgres"` means `env.HYPERDRIVE.connectionString` starts with `postgres://`, not `postgresql://`.
+
+### 3. Deployed version verification
+
+```bash
+curl -s https://adblock-frontend.jayson-knight.workers.dev/api/health | jq .version
+# "0.76.0"
+
+wrangler deployments list
+# Confirmed v0.76.0 was the active deployment
+```
+
+### 4. Worker tail (most important)
+
+```bash
+wrangler tail --format=pretty
+```
+
+Output (paraphrased):
+
+```
+✘ [ERROR] Error: The Workers runtime canceled this request because it detected
+  that your Worker's code had hung and would never generate a response.
+```
+
+This appeared on every `/api/auth/*` request. The root cause: `createAuth(env)` internally calls `createPrismaClient()`, which (in the broken state) either threw synchronously or created a `pg.Pool` that hung waiting for a connection — exhausting the Worker's CPU time budget until Cloudflare killed the request.
+
+---
+
+## Decision Tree
+
+Use this diagram to categorize a `database: down` scenario:
+
+```mermaid
+flowchart TD
+    A["database.status == 'down'"]
+    A --> B{"latency_ms == 0?"}
+
+    B -->|YES| C["Failure at instantiation/validation\n(before network I/O)"]
+    C --> D["Check: does PrismaClientConfigSchema\naccept the Hyperdrive scheme?"]
+    D --> E["Run: wrangler hyperdrive get &lt;id&gt;\n— look at 'scheme' field"]
+    D --> F["Fix: update regex to accept\nboth postgres:// and postgresql://"]
+    C --> G["Check: is env.HYPERDRIVE binding present?"]
+    G --> H["Run: grep HYPERDRIVE wrangler.toml"]
+    G --> I["Fix: add [[hyperdrive]] binding\nwith correct id"]
+
+    B -->|"NO (latency_ms > 0)"| J["Connection was attempted but failed"]
+    J --> K["latency_ms ~ 1–100 ms\n→ ECONNREFUSED — wrong host/port"]
+    J --> L["latency_ms ~ 1000–5000 ms\n→ TCP timeout — host unreachable"]
+    J --> M["latency_ms == 5000 exactly\n→ Probe timeout (5s guard fired)"]
+    M --> N["Run: GET /api/health/db-smoke\nfor detailed diagnostics"]
+    M --> O["Check: Neon project status,\nHyperdrive binding ID, network egress"]
+```
+
+---
+
+## Interpreting `wrangler deployments list` Output
+
+```bash
+wrangler deployments list
+```
+
+Look for:
+- The most recent deployment's version tag (should match the `version` field in `/api/health`)
+- Whether the deployment completed successfully (no error state)
+- The binding list — confirm `HYPERDRIVE` appears with the correct binding ID
+
+If the version in the health response does not match the most recent deployment, either the deployment failed silently or there is a routing issue (frontend Worker proxying to an old backend Worker version).
+
+---
+
+## What `"scheme": "postgres"` vs `"postgresql"` Means
+
+Cloudflare Hyperdrive always returns `postgres://` (the PostgreSQL short alias) from `env.HYPERDRIVE.connectionString`. This is not configurable — it is hardcoded by the Hyperdrive proxy.
+
+| Hyperdrive `scheme` config value | `env.HYPERDRIVE.connectionString` prefix |
+|---|---|
+| `"postgres"` | `postgres://...` |
+| `"postgresql"` | `postgresql://...` |
+
+Both are valid PostgreSQL connection string schemes. The standard form is `postgresql://`; `postgres://` is an alias recognized by all PostgreSQL clients including Prisma and `pg`.
+
+If your Zod schema only accepts one form, it will reject the other at validation time — with zero network activity and `latency_ms: 0`.
+
+---
+
+## Step-by-Step Resolution Checklist
+
+Use this checklist to resolve `database: down` after a deploy.
+
+### Phase 1 — Confirm the failure layer
+
+- [ ] Run `curl -s <your-worker>/api/health | jq .services.database`
+- [ ] Note `latency_ms` — is it `0` (instantiation failure) or non-zero (network failure)?
+- [ ] Note `error_code` and `error_message` (added in v0.76.0+) — what do they say?
+- [ ] Run `GET /api/health/db-smoke` — does it return `ok: true` or `ok: false` with an `error`?
+
+### Phase 2 — If `latency_ms == 0` (instantiation failure)
+
+- [ ] Run `wrangler hyperdrive get <id>` — note the `scheme` field
+- [ ] Check `worker/lib/prisma-config.ts` — does the `connectionString` regex accept the scheme?
+- [ ] Check `wrangler.toml` — does `[[hyperdrive]]` have the correct `id`?
+- [ ] Run `wrangler tail` — look for `ZodError` or similar validation exceptions
+- [ ] Apply schema fix if needed, run `wrangler deploy`, re-check health
+
+### Phase 3 — If `latency_ms > 0` (network failure)
+
+- [ ] Run `GET /api/health/db-smoke` — full diagnostic output
+- [ ] Check Neon project status at `console.neon.tech`
+- [ ] Verify the Hyperdrive binding origin host matches the Neon pooler URL
+- [ ] Check if Neon branch/endpoint is active (not sleeping)
+- [ ] Run `wrangler tail` during a request to capture the raw error
+
+### Phase 4 — Verify the fix
+
+- [ ] Run `GET /api/health/db-smoke` — expect `ok: true`
+- [ ] Run `GET /api/health` — expect `services.database.status: "healthy"`
+- [ ] Check UI — "Degraded performance" banner should be gone
+- [ ] Check Cloudflare Hyperdrive dashboard — traffic should appear within a few minutes
+
+---
+
+## Prevention
+
+**Always run `GET /api/health/db-smoke` after every production deploy.**
+
+Add this to your post-deploy checklist or CI/CD pipeline:
+
+```bash
+# Post-deploy smoke test — fails if DB is not reachable
+SMOKE=$(curl -s https://<your-worker>.workers.dev/api/health/db-smoke)
+OK=$(echo "$SMOKE" | jq -r '.ok')
+if [ "$OK" != "true" ]; then
+  echo "DB smoke test FAILED: $(echo $SMOKE | jq -r '.error')"
+  exit 1
+fi
+echo "DB smoke test passed — table_count=$(echo $SMOKE | jq -r '.table_count')"
+```
+
+This catches Zod validation regressions, wrong binding IDs, and Neon endpoint issues before they affect users.
+
+---
+
+## Worker Code Changed in This Session
+
+| File | Change |
+|---|---|
+| `worker/lib/prisma-config.ts` | Accept both `postgres://` and `postgresql://` schemes |
+| `worker/handlers/health.ts` | `databaseProbe`: error surfacing, 5s timeout, `$disconnect()` in finally |
+| `worker/handlers/health.ts` | New `handleDbSmoke` handler |
+| `worker/hono-app.ts` | Wire `GET /health/db-smoke` route + add to public monitoring paths |
+| `frontend/src/app/performance/performance.component.ts` | Surface `error_message` and `db_name` in health card |
+| `frontend/src/app/admin/dashboard/dashboard.component.ts` | Surface `error_message` and `db_name` in admin health indicators |
+
+---
+
+## ZTA Security Notes
+
+- `GET /api/health/db-smoke` is **unauthenticated** (same tier as `/api/health`) — intentional for diagnostics
+- No credentials, connection strings, or PII are returned by either endpoint
+- `error_message` is sanitized: `postgres://...` patterns are replaced with `[redacted]` before being surfaced
+- `hyperdrive_host` is the only connection detail returned — it is a hostname, not a credential
+
+---
+
+## Related Articles
+
+- [KB-002](./KB-002-hyperdrive-database-down.md) — Root cause analysis and schema fix
+- [Neon Troubleshooting](./neon-troubleshooting.md) — General Neon/Hyperdrive/Prisma troubleshooting
+- [KB-001](./KB-001-api-not-available.md) — "API not available" on the main page
diff --git a/docs/troubleshooting/README.md b/docs/troubleshooting/README.md
@@ -12,8 +12,9 @@ Each article follows a consistent structure: symptom, diagnostic commands, root
 |---|---|---|
 | [KB-001](./KB-001-api-not-available.md) | "Getting API is not available" on the main page | ✅ Active |
 | [KB-002](./KB-002-hyperdrive-database-down.md) | Hyperdrive binding connected but `database` service reports `down` | ✅ Active |
-| KB-003 | Cloudflare Queue consumer not processing messages | 🗓 Planned |
-| KB-004 | Angular SPA serves stale build after worker deploy | 🗓 Planned |
+| [KB-003](./KB-003-neon-hyperdrive-live-session-2026-03-25.md) | Database Down After Deploy — Live Debugging Session (2026-03-25) | ✅ Active |
+| KB-004 | Cloudflare Queue consumer not processing messages | 🗓 Planned |
+| KB-005 | Angular SPA serves stale build after worker deploy | 🗓 Planned |
 
 ---