docs: KB-002 — Hyperdrive postgres:// scheme rejected by Zod, database: down with latency_ms: 0 (#1407)

Merged: jaypatrick merged 3 commits into main from copilot/document-hyperdrive-database-incident on Mar 25, 2026

Copilot AI commented Mar 25, 2026

Documents the production incident where database.status: "down" + latency_ms: 0 was caused by PrismaClientConfigSchema rejecting Hyperdrive's postgres:// scheme (Hyperdrive never emits postgresql://), meaning Zod threw before any network call was made and Hyperdrive showed zero activity.

Description

Adds KB-002 to the troubleshooting series covering the Hyperdrive postgres:// vs postgresql:// schema-validation failure. The latency_ms: 0 pattern is the key diagnostic signal that distinguishes a validation-layer failure from a real network failure.

Changes

  • docs/troubleshooting/KB-002-hyperdrive-database-down.md — New article covering:
    • Symptom: database: down, latency_ms: 0, Hyperdrive dashboard shows zero queries
    • Diagnostic decision tree keyed on latency_ms: 0 → scheme mismatch → Zod rejection
    • Resolution: update PrismaClientConfigSchema to accept both schemes via /^postgre(?:s|sql):\/\//
    • Enhanced health probe: SELECT current_database() surfaces db_name + hyperdrive_host to catch wrong-database condition
    • Prevention: db_name in health response as a continuous assertion
    • Related files: worker/lib/prisma-config.ts, worker/handlers/health.ts
  • docs/troubleshooting/README.md — KB-002 linked and marked ✅ Active (replaces the old Clerk JWT KB-002 placeholder)
  • docs/SUMMARY.md — KB-002 added to Troubleshooting section
  • docs/troubleshooting/KB-001-api-not-available.md — Related Articles updated to link KB-002
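
The decision tree and the `db_name` assertion listed above can be sketched as a small triage helper. This is a hypothetical illustration, not code from the repository: `DatabaseHealth`, `Triage`, and `triageDatabase` are assumed names, and the expected database name is passed in by the caller. A `latency_ms` of `0` is treated as a strong hint of a validation-layer failure rather than proof, so the result should be confirmed against the worker logs.

```typescript
// Hypothetical shape of the `database` entry in the /api/health response.
interface DatabaseHealth {
    status: "healthy" | "degraded" | "down";
    latency_ms?: number;
    db_name?: string;
}

type Triage =
    | "ok"
    | "likely-validation-failure"
    | "likely-network-failure"
    | "wrong-database";

// Classify a health entry. Only the combination `down` + `latency_ms: 0`
// points at the validation layer; any other `down` is treated as a
// probable network or database failure.
function triageDatabase(db: DatabaseHealth, expectedDbName: string): Triage {
    if (db.status === "down") {
        return db.latency_ms === 0
            ? "likely-validation-failure"
            : "likely-network-failure";
    }
    // When the probe reports a database name, it must match the expected one.
    if (db.db_name !== undefined && db.db_name !== expectedDbName) {
        return "wrong-database";
    }
    return "ok";
}

// Usage with the incident's health payload.
console.log(triageDatabase({ status: "down", latency_ms: 0 }, "adblock-compiler"));
```

Keeping the expected name as a parameter avoids hard-coding an environment-specific value in the helper.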

Testing

  • Unit tests added/updated
  • Manual testing performed — article structure and links verified against KB-001 conventions
  • CI passes

Zero Trust Architecture Checklist

This PR does not touch worker/ or frontend/. ZTA checklist is not required.

Original prompt

Task: Document the Hyperdrive / postgres:// database down incident in the troubleshooting docs

This conversation diagnosed and fixed a production issue where GET /api/health returned database: down even though Hyperdrive was correctly configured and Neon migrations were running fine.


Root cause (for context)

worker/lib/prisma-config.ts validated the Hyperdrive connection string with .startsWith('postgresql://'). However, the Cloudflare Hyperdrive binding's .connectionString property always returns the postgres:// short alias (the "scheme" field in Hyperdrive config is "postgres"). This caused PrismaClientConfigSchema.parse(...) to throw a ZodError instantly (before any network call), which the health probe caught and turned into { status: 'down', latency_ms: 0 }. Hyperdrive therefore showed zero activity because the connection was never opened.
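
The ordering described above (validation first, network second) can be illustrated with a dependency-free sketch. Everything here is an assumption made for illustration: `validateConnectionString` stands in for `PrismaClientConfigSchema.parse(...)`, `probeDatabase` and the `connect` callback are hypothetical names, latency is assumed to be measured with `Date.now()` deltas, and the probe is synchronous for brevity where a real probe would await the query.

```typescript
// Hypothetical stand-in for the old schema rule: only the long-form
// scheme was accepted.
function validateConnectionString(cs: string): string {
    if (!cs.startsWith("postgresql://")) {
        throw new Error('connectionString must start with "postgresql://"');
    }
    return cs;
}

interface ProbeResult {
    status: "healthy" | "down";
    latency_ms: number;
    connectAttempted: boolean;
}

// Illustrative probe: validation runs before the (simulated) connection,
// so a rejected scheme fails without any network activity.
function probeDatabase(
    connectionString: string,
    connect: (cs: string) => void,
): ProbeResult {
    const start = Date.now();
    let connectAttempted = false;
    try {
        const cs = validateConnectionString(connectionString);
        connectAttempted = true;
        connect(cs);
        return { status: "healthy", latency_ms: Date.now() - start, connectAttempted };
    } catch (_err) {
        // Both validation errors and connection errors end up here.
        return { status: "down", latency_ms: Date.now() - start, connectAttempted };
    }
}

// A Hyperdrive-style short scheme never reaches `connect`.
const result = probeDatabase("postgres://user:pass@host:5432/db", () => {});
console.log(result.status, result.connectAttempted);
```

Because the error is thrown before `connect` runs, the elapsed time is at or near zero and the upstream service sees no traffic, which matches the zero activity reported on the Hyperdrive dashboard.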

The fix (applied in a separate PR) was:

  1. Update PrismaClientConfigSchema to accept both postgres:// and postgresql://.
  2. Enhance handleHealth to run SELECT current_database() instead of SELECT 1, and surface db_name + hyperdrive_host in the JSON response so the wrong-database condition is also caught.
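
As a rough sketch of the first step, the check below accepts both scheme spellings. It is written without Zod so that it runs on its own; `ACCEPTED_SCHEMES`, `hasAcceptedScheme`, and `SCHEME_REGEX` are hypothetical names, and the predicate is the kind of function that could be handed to a schema refinement. The regex is the one quoted in the PR description, included to show that the two forms agree.

```typescript
// The two spellings a PostgreSQL connection string may start with.
const ACCEPTED_SCHEMES = ["postgres://", "postgresql://"];

// Prefix-based check: true when the string starts with either scheme.
function hasAcceptedScheme(connectionString: string): boolean {
    return ACCEPTED_SCHEMES.some((scheme) => connectionString.startsWith(scheme));
}

// Regex form of the same rule, as quoted in the PR description.
const SCHEME_REGEX = /^postgre(?:s|sql):\/\//;

// Usage: both spellings pass, anything else is rejected.
console.log(hasAcceptedScheme("postgres://user:pass@host:5432/db"));
console.log(hasAcceptedScheme("postgresql://user:pass@host:5432/db"));
console.log(hasAcceptedScheme("mysql://user:pass@host:3306/db"));
```

The prefix form is easier to read and to extend, while the regex form is more compact; either should reject near-misses such as `postgre://`.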

Changes required

1. Create docs/troubleshooting/KB-002-hyperdrive-database-down.md

Follow the exact same structure as KB-001-api-not-available.md. The article must cover:

  • Symptom — UI shows "Degraded performance" and "Data may be stale"; GET /api/health returns database.status: "down" with latency_ms: 0; Hyperdrive dashboard shows zero activity
  • Diagnostic commands: `curl -s https://<your-worker>.workers.dev/api/health | jq .` and `npx wrangler hyperdrive get <id>`
  • Root cause decision tree — Zod schema rejects postgres:// (Hyperdrive's actual scheme); the latency_ms: 0 is the key tell that the failure is at validation, not at the network
  • Resolution — Update PrismaClientConfigSchema to accept both schemes; also explains the enhanced health check (SELECT current_database()) introduced as part of the fix
  • Prevention — note the new db_name field in the health response lets you confirm the correct database is connected
  • Related files: worker/lib/prisma-config.ts, worker/handlers/health.ts

Here is the full article text to use verbatim:

# KB-002: Hyperdrive Binding Connected but `database` Service Reports `down`

> **Status:** ✅ Active  
> **Affected version:** v0.75.0  
> **Resolved in:** PR fixing `PrismaClientConfigSchema` to accept `postgres://` + enhanced `/api/health` probe  
> **Date:** 2026-03-25

---

## Symptom

The live site at `https://adblock-frontend.jayson-knight.workers.dev/` displays two error banners:

- **"Degraded performance — v0.75.0"**
- **"Data may be stale"**

Hitting the health endpoint returns:

```json
{
  "status": "down",
  "version": "0.75.0",
  "timestamp": "2026-03-25T21:59:15.917Z",
  "services": {
    "gateway":  { "status": "healthy" },
    "database": { "status": "down", "latency_ms": 0 },
    "compiler": { "status": "healthy" },
    "auth":     { "status": "healthy", "provider": "better-auth" },
    "cache":    { "status": "healthy", "latency_ms": 132 }
  }
}
```

**Key tell:** `latency_ms: 0` on the `database` service.  
A real network failure or timeout always returns a non-zero latency. An instant `0 ms` failure means the probe threw *before* any connection attempt — i.e., at the validation layer.

The Cloudflare Hyperdrive dashboard shows **zero queries/connections** despite the Neon dashboard showing migration activity.

---

## Diagnostic Commands

```bash
# 1. Inspect the full health response
curl -s https://<your-worker>.workers.dev/api/health | jq .

# 2. Confirm the Hyperdrive config is correct
npx wrangler hyperdrive get <hyperdrive-id>
# Expected output — note "scheme": "postgres" (not "postgresql")
# {
#   "origin": {
#     "host": "ep-winter-term-a8rxh2a9-pooler.eastus2.azure.neon.tech",
#     "port": 5432,
#     "database": "adblock-compiler",
#     "scheme": "postgres",          ← THIS IS THE CLUE
#     "user": "neondb_owner"
#   }
# }

# 3. Verify the database name in wrangler.toml [[hyperdrive]] section
grep -A5 '\[\[hyperdrive\]\]' wrangler.toml
```

---

## Root Cause

### Failure point ❷ — Zod validation in `PrismaClientConfigSchema`

The connection string validation schema in `worker/lib/prisma-config.ts` was:

```typescript
export const PrismaClientConfigSchema = z.object({
    connectionString: z.string().url().startsWith('postgresql://'),
});
```

The Cloudflare Hyperdrive binding always returns a connection string using the **`postgres://`** scheme alias (not `postgresql://`). This is visible in the Hyperdrive config where `"scheme": "postgres"`.

When `handleHealth` called:

```typescript
const prisma = _internals.createPrismaClient(env.HYPERDRIVE!.connectionString);
```

`createPrismaClient` internally called `PrismaClientConfigSchema.parse(...)`, which threw a...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*

Copilot AI changed the title from "[WIP] Document Hyperdrive database down incident in troubleshooting docs" to "docs: KB-002 — Hyperdrive postgres:// scheme rejected by Zod, database: down with latency_ms: 0" on Mar 25, 2026
Copilot AI requested a review from jaypatrick March 25, 2026 22:17
@jaypatrick jaypatrick marked this pull request as ready for review March 25, 2026 22:28
Copilot AI review requested due to automatic review settings March 25, 2026 22:28
@jaypatrick jaypatrick merged commit eb79a21 into main Mar 25, 2026
23 checks passed
@jaypatrick jaypatrick deleted the copilot/document-hyperdrive-database-incident branch March 25, 2026 22:28

Copilot AI left a comment


Pull request overview

Adds a new troubleshooting KB (KB-002) documenting the Hyperdrive postgres:// vs postgresql:// Zod validation incident and wires it into the docs navigation.

Changes:

  • Add new article KB-002 describing the database: down + latency_ms: 0 symptom and resolution.
  • Update troubleshooting index/links to reference KB-002 as active.
  • Add KB-002 to the docs summary navigation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| docs/troubleshooting/README.md | Marks KB-002 as active and links to the new article. |
| docs/troubleshooting/KB-002-hyperdrive-database-down.md | New KB article documenting diagnosis and remediation steps for the Hyperdrive scheme-validation failure. |
| docs/troubleshooting/KB-001-api-not-available.md | Updates “Related KB Articles” to link to KB-002. |
| docs/SUMMARY.md | Adds KB-002 entry under Troubleshooting. |

Comment on lines +34 to +35
**Key tell:** `latency_ms: 0` on the `database` service.
A real network failure or timeout always returns a non-zero latency. An instant `0 ms` failure means the probe threw *before* any connection attempt — i.e., at the validation layer.

Copilot AI Mar 25, 2026


The article states that a real network failure/timeout “always returns a non-zero latency” and that latency_ms: 0 definitively means validation-layer failure. In the current health implementation, latency_ms is derived from Date.now() deltas (ms resolution), so values can be 0 for very fast failures beyond just Zod parsing. Consider softening this to “near-zero” and treating it as a strong hint rather than a guarantee, and optionally mention checking logs for the ZodError as the definitive signal.

Suggested change
**Key tell:** `latency_ms: 0` on the `database` service.
A real network failure or timeout always returns a non-zero latency. An instant `0 ms` failure means the probe threw *before* any connection attempt — i.e., at the validation layer.
**Key tell:** `latency_ms: 0` (or near-zero) on the `database` service.
Real network failures or timeouts typically show a non-zero latency. A `0 ms` (or near-zero) failure is a strong hint that the probe threw *before* any connection attempt — for example, at the validation/parse layer — but you should confirm this by checking logs for a `ZodError` as described below.

Comment on lines +101 to +102
// After the fix — accepts both schemes
connectionString: z.string().url().regex(/^postgre(?:s|sql):\/\//),

Copilot AI Mar 25, 2026


The “After the fix” snippet uses .regex(/^postgre(?:s|sql):\/\//) but the current PrismaClientConfigSchema implementation in worker/lib/prisma-config.ts uses .refine((s) => s.startsWith('postgresql://') || s.startsWith('postgres://'), ...). To avoid drift/confusion, update the snippet (or clearly label it as pseudo-code) to match the repo’s actual implementation and error message.

Suggested change
// After the fix — accepts both schemes
connectionString: z.string().url().regex(/^postgre(?:s|sql):\/\//),
// After the fix — accepts both schemes (postgres:// and postgresql://)
connectionString: z.string().url().refine((s) => s.startsWith('postgresql://') || s.startsWith('postgres://')),

connectionString: z.string().url().startsWith('postgresql://'),

// After
connectionString: z.string().url().regex(/^postgre(?:s|sql):\/\//),

Copilot AI Mar 25, 2026


This section repeats the “Before/After” code snippet for PrismaClientConfigSchema, but the “After” example uses .regex(/^postgre(?:s|sql):\/\//) which doesn’t match the current implementation (it uses .refine(...) with startsWith('postgres://') || startsWith('postgresql://')). Please align the snippet here as well so the Resolution steps are accurate for operators following the KB.

Suggested change
connectionString: z.string().url().regex(/^postgre(?:s|sql):\/\//),
connectionString: z.string().url().refine(
(value) => value.startsWith('postgres://') || value.startsWith('postgresql://'),
{ message: 'connectionString must start with "postgres://" or "postgresql://"' },
),

"database": {
"status": "healthy",
"latency_ms": 42,
"db_name": "neondb",

Copilot AI Mar 25, 2026


The example enhanced health response shows db_name: "neondb", but the worker’s health probe currently treats db_name !== 'adblock-compiler' as a degraded/wrong-database condition. Using neondb in the “healthy” example can mislead responders. Suggest updating the example to the expected production DB name for this repo (or making the example explicitly generic).

Suggested change
"db_name": "neondb",
"db_name": "adblock-compiler",

Comment on lines +1 to +6
# KB-002: Hyperdrive Binding Connected but `database` Service Reports `down`

> **Status:** ✅ Active
> **Affected version:** v0.75.0
> **Resolved in:** PR fixing `PrismaClientConfigSchema` to accept `postgres://` + enhanced `/api/health` probe
> **Date:** 2026-03-25

Copilot AI Mar 25, 2026


PR description says KB-002 should “follow the exact same structure as KB-001”. KB-001 starts with a consistent metadata block (Series/Component/Service URL/Date Created/Status), but KB-002 uses a different header format and omits several of those fields. If consistent KB structure is a goal, consider matching KB-001’s header metadata format here (or update the PR description if the divergence is intentional).
