Merged
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -139,6 +139,7 @@
- [Postman Testing](testing/POSTMAN_TESTING.md)
- [Troubleshooting](troubleshooting/README.md)
- [KB-001: API Not Available](troubleshooting/KB-001-api-not-available.md)
- [KB-002: Hyperdrive Database Down](troubleshooting/KB-002-hyperdrive-database-down.md)
- [Neon Troubleshooting](troubleshooting/neon-troubleshooting.md)
- [Workflows](workflows/README.md)
- [Workflows Reference](workflows/WORKFLOWS.md)
2 changes: 1 addition & 1 deletion docs/troubleshooting/KB-001-api-not-available.md
@@ -199,7 +199,7 @@ All four config endpoints are **intentionally pre-auth** — they expose no secr

## Related KB Articles

- *(planned)* KB-002 — Clerk JWT auth degraded / local JWT fallback
- [KB-002](./KB-002-hyperdrive-database-down.md) — Hyperdrive binding connected but `database` service reports `down`
- *(planned)* KB-003 — Cloudflare Queue consumer not processing messages
- *(planned)* KB-004 — Angular SPA serves stale build after worker deploy

205 changes: 205 additions & 0 deletions docs/troubleshooting/KB-002-hyperdrive-database-down.md
@@ -0,0 +1,205 @@
# KB-002: Hyperdrive Binding Connected but `database` Service Reports `down`

> **Status:** ✅ Active
> **Affected version:** v0.75.0
> **Resolved in:** PR fixing `PrismaClientConfigSchema` to accept `postgres://` + enhanced `/api/health` probe
> **Date:** 2026-03-25
Comment on lines +1 to +6

Copilot AI Mar 25, 2026

PR description says KB-002 should “follow the exact same structure as KB-001”. KB-001 starts with a consistent metadata block (Series/Component/Service URL/Date Created/Status), but KB-002 uses a different header format and omits several of those fields. If consistent KB structure is a goal, consider matching KB-001’s header metadata format here (or update the PR description if the divergence is intentional).

---

## Symptom

The live site at `https://adblock-frontend.jayson-knight.workers.dev/` displays two error banners:

- **"Degraded performance — v0.75.0"**
- **"Data may be stale"**

Hitting the health endpoint returns:

```json
{
  "status": "down",
  "version": "0.75.0",
  "timestamp": "2026-03-25T21:59:15.917Z",
  "services": {
    "gateway": { "status": "healthy" },
    "database": { "status": "down", "latency_ms": 0 },
    "compiler": { "status": "healthy" },
    "auth": { "status": "healthy", "provider": "better-auth" },
    "cache": { "status": "healthy", "latency_ms": 132 }
  }
}
```

**Key tell:** `latency_ms: 0` on the `database` service.
A real network failure or timeout always returns a non-zero latency. An instant `0 ms` failure means the probe threw *before* any connection attempt — i.e., at the validation layer.
Comment on lines +34 to +35
Copilot AI Mar 25, 2026

The article states that a real network failure/timeout “always returns a non-zero latency” and that latency_ms: 0 definitively means validation-layer failure. In the current health implementation, latency_ms is derived from Date.now() deltas (ms resolution), so values can be 0 for very fast failures beyond just Zod parsing. Consider softening this to “near-zero” and treating it as a strong hint rather than a guarantee, and optionally mention checking logs for the ZodError as the definitive signal.

Suggested change
**Key tell:** `latency_ms: 0` on the `database` service.
A real network failure or timeout always returns a non-zero latency. An instant `0 ms` failure means the probe threw *before* any connection attempt — i.e., at the validation layer.
**Key tell:** `latency_ms: 0` (or near-zero) on the `database` service.
Real network failures or timeouts typically show a non-zero latency. A `0 ms` (or near-zero) failure is a strong hint that the probe threw *before* any connection attempt — for example, at the validation/parse layer — but you should confirm this by checking logs for a `ZodError` as described below.

The Cloudflare Hyperdrive dashboard shows **zero queries/connections** despite the Neon dashboard showing migration activity.

---

## Diagnostic Commands

```bash
# 1. Inspect the full health response
curl -s https://<your-worker>.workers.dev/api/health | jq .

# 2. Check the Hyperdrive binding configuration
npx wrangler hyperdrive get <hyperdrive-id>

# 3. Tail the live worker log to catch Zod validation errors
wrangler tail
```

Look for lines like `ZodError: Invalid url` or `Expected string, received undefined` in the tail output. A `ZodError` thrown during `PrismaClientConfigSchema.parse()` is definitive proof that the failure is at the config-validation layer, not the network.
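As a rough illustration of the log check described above, the sketch below filters captured log lines for the two example messages. The helper name `findValidationErrors` and the pattern list are hypothetical, and real Zod messages may be worded differently, so treat the patterns as a starting point rather than an exhaustive match.

```typescript
// Hypothetical helper: flags log lines that look like Zod validation failures.
// The patterns are taken from the example messages quoted above; they are assumptions.
const ZOD_HINTS: readonly RegExp[] = [
  /ZodError/,
  /Invalid url/i,
  /Expected string, received undefined/,
];

function findValidationErrors(logLines: readonly string[]): string[] {
  // Keep any line that matches at least one of the hint patterns.
  return logLines.filter((line) => ZOD_HINTS.some((pattern) => pattern.test(line)));
}

// Made-up sample lines standing in for captured tail output.
const sampleLines: string[] = [
  'GET /api/health 200',
  'ZodError: Invalid url',
  'Expected string, received undefined',
];
console.log(findValidationErrors(sampleLines));
```

An empty result does not rule out a validation failure; it only means none of the assumed patterns appeared in the lines that were captured.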

---

## Root Cause Decision Tree

### ❶ Is `latency_ms` exactly `0`?

**If YES** — the database probe threw *before* opening any connection. This points to a config-validation failure, not a network failure. Proceed to ❷.

**If NO** (latency is non-zero) — the connection was attempted but timed out or was refused. Check Neon project status, Hyperdrive binding ID, and network egress. This article does not cover that case.
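The branch above can be sketched as a small classifier over the `database` entry of the health response. The `ServiceStatus` shape is inferred only from the sample payload in the Symptom section, and the function and label names are hypothetical. The result is a hint, not proof: a `ZodError` in the logs remains the confirming signal.

```typescript
// Assumed shape, based only on the health response shown in the Symptom section.
interface ServiceStatus {
  status: string;
  latency_ms?: number;
}

type DatabaseDiagnosis =
  | 'healthy'
  | 'likely-validation-failure'
  | 'likely-network-failure';

function diagnoseDatabase(db: ServiceStatus): DatabaseDiagnosis {
  if (db.status === 'healthy') return 'healthy';
  // 0 ms suggests the probe threw before opening a connection.
  // A missing latency_ms falls through to the network branch.
  return db.latency_ms === 0 ? 'likely-validation-failure' : 'likely-network-failure';
}

console.log(diagnoseDatabase({ status: 'down', latency_ms: 0 }));
// prints "likely-validation-failure"
```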

---

### ❷ What scheme does the Hyperdrive connection string use?

```bash
npx wrangler hyperdrive get <hyperdrive-id>
```

The output will show a `"scheme"` field. Cloudflare Hyperdrive **always** returns `postgres://` (not `postgresql://`) from the `env.HYPERDRIVE.connectionString` binding property.

```json
{
  "id": "...",
  "name": "adblock-hyperdrive",
  "origin": {
    "scheme": "postgres",
    "host": "...",
    "port": 5432,
    "database": "neondb"
  }
}
```

**If the scheme is `postgres`** — and `PrismaClientConfigSchema` only accepts `postgresql://`, the schema parse throws a `ZodError` instantly. This is the root cause. Proceed to Resolution.

---

### ❸ Does `PrismaClientConfigSchema` accept `postgres://`?

Open `worker/lib/prisma-config.ts` and check the `connectionString` validator:

```typescript
// Before the fix — rejects Hyperdrive's actual scheme
connectionString: z.string().url().startsWith('postgresql://'),

// After the fix — accepts both schemes
connectionString: z.string().url().regex(/^postgre(?:s|sql):\/\//),
Comment on lines +101 to +102
Copilot AI Mar 25, 2026

The “After the fix” snippet uses .regex(/^postgre(?:s|sql):\/\//) but the current PrismaClientConfigSchema implementation in worker/lib/prisma-config.ts uses .refine((s) => s.startsWith('postgresql://') || s.startsWith('postgres://'), ...). To avoid drift/confusion, update the snippet (or clearly label it as pseudo-code) to match the repo’s actual implementation and error message.

Suggested change
// After the fix — accepts both schemes
connectionString: z.string().url().regex(/^postgre(?:s|sql):\/\//),
// After the fix — accepts both schemes (postgres:// and postgresql://)
connectionString: z.string().url().refine((s) => s.startsWith('postgresql://') || s.startsWith('postgres://')),

```

If the schema only allows `postgresql://`, every request that tries to build a `PrismaClient` from the Hyperdrive binding will fail at parse time with zero network activity.
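To make the scheme check concrete without pulling in Zod, here is a dependency-free sketch of the same idea: accept a connection string only if it starts with one of the two schemes. It is not the repository's schema. `hasAcceptedScheme` and the sample URLs (with made-up hosts and credentials) are assumptions for illustration.

```typescript
// Dependency-free mirror of the scheme check; not the repository's actual Zod schema.
const ACCEPTED_PREFIXES = ['postgres://', 'postgresql://'] as const;

function hasAcceptedScheme(connectionString: string): boolean {
  return ACCEPTED_PREFIXES.some((prefix) => connectionString.startsWith(prefix));
}

// Placeholder URLs with made-up credentials and hosts.
const cases: Array<[string, boolean]> = [
  ['postgres://user:pw@db.example.com:5432/app', true],
  ['postgresql://user:pw@db.example.com:5432/app', true],
  ['mysql://user:pw@db.example.com:3306/app', false],
];

for (const [url, expected] of cases) {
  // Fail loudly if the predicate disagrees with the expectation.
  if (hasAcceptedScheme(url) !== expected) {
    throw new Error(`unexpected result for ${url}`);
  }
}
```

A table of accepted and rejected inputs like this also works as a regression check for the short `postgres://` form.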

---

## Resolution

### Step 1 — Update `PrismaClientConfigSchema` to accept both URL schemes

In `worker/lib/prisma-config.ts`, change the `connectionString` validation to accept both `postgres://` and `postgresql://`:

```typescript
// worker/lib/prisma-config.ts

// Before
connectionString: z.string().url().startsWith('postgresql://'),

// After
connectionString: z.string().url().regex(/^postgre(?:s|sql):\/\//),
Copilot AI Mar 25, 2026

This section repeats the “Before/After” code snippet for PrismaClientConfigSchema, but the “After” example uses .regex(/^postgre(?:s|sql):\/\//) which doesn’t match the current implementation (it uses .refine(...) with startsWith('postgres://') || startsWith('postgresql://')). Please align the snippet here as well so the Resolution steps are accurate for operators following the KB.

Suggested change
connectionString: z.string().url().regex(/^postgre(?:s|sql):\/\//),
connectionString: z.string().url().refine(
(value) => value.startsWith('postgres://') || value.startsWith('postgresql://'),
{ message: 'connectionString must start with "postgres://" or "postgresql://"' },
),

```

This accepts `postgres://...` (Hyperdrive short alias) and `postgresql://...` (standard long form) while rejecting anything else.

### Step 2 — Deploy

```bash
wrangler deploy
```

After deploying, hit the health endpoint again:

```bash
curl -s https://<your-worker>.workers.dev/api/health | jq .
```

You should see `database.status: "healthy"` with a non-zero `latency_ms`.

### Step 3 — Verify with the enhanced health probe

The fix also introduced an enhanced health check in `worker/handlers/health.ts` that runs `SELECT current_database()` instead of `SELECT 1`. This surfaces `db_name` and `hyperdrive_host` in the health response:

```json
{
  "services": {
    "database": {
      "status": "healthy",
      "latency_ms": 42,
      "db_name": "neondb",
Copilot AI Mar 25, 2026

The example enhanced health response shows db_name: "neondb", but the worker’s health probe currently treats db_name !== 'adblock-compiler' as a degraded/wrong-database condition. Using neondb in the “healthy” example can mislead responders. Suggest updating the example to the expected production DB name for this repo (or making the example explicitly generic).

Suggested change
"db_name": "neondb",
"db_name": "adblock-compiler",

      "hyperdrive_host": "...-pooler.us-east-2.aws.neon.tech"
    }
  }
}
```

Confirm that `db_name` matches the expected Neon database name. If it returns a different database name, the Hyperdrive binding is pointed at the wrong Neon project or branch.
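The comparison described above could be automated along these lines. The `DatabaseHealth` shape is inferred from the sample response, `checkDatabaseName` is a hypothetical helper, and the expected name is passed in by the caller because this sketch makes no claim about the correct value for any given environment.

```typescript
// Assumed shape, taken from the enhanced health response shown above.
interface DatabaseHealth {
  status: string;
  latency_ms: number;
  db_name?: string;
  hyperdrive_host?: string;
}

// Returns a human-readable problem, or null when the name matches.
// `expectedName` is supplied by the caller; this sketch does not know the real value.
function checkDatabaseName(db: DatabaseHealth, expectedName: string): string | null {
  if (db.db_name === undefined) {
    return 'db_name is missing; the enhanced probe may not be deployed';
  }
  if (db.db_name !== expectedName) {
    return `connected to "${db.db_name}" but expected "${expectedName}"`;
  }
  return null;
}

const sample: DatabaseHealth = { status: 'healthy', latency_ms: 42, db_name: 'neondb' };
console.log(checkDatabaseName(sample, 'neondb')); // prints null
```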

---

## Prevention

- The new `db_name` field in the health response acts as a continuous assertion that the correct database is connected. Monitor this value in your observability dashboards.
- When configuring a new Hyperdrive binding, always run `wrangler hyperdrive get <id>` to confirm the `scheme` field. If it is `"postgres"`, ensure all Zod schemas that validate the connection string accept `postgres://`.
- Add an integration test that calls `PrismaClientConfigSchema.parse()` with a `postgres://` URL to catch future regressions.

---

## Worker Code Reference

| File | Relevance |
|---|---|
| `worker/lib/prisma-config.ts` | `PrismaClientConfigSchema` — validates the Hyperdrive connection string before `PrismaClient` is created |
| `worker/handlers/health.ts` | `handleHealth` — runs the database probe; surfaces `db_name` and `hyperdrive_host` from the enhanced `SELECT current_database()` query |

---

## ZTA Security Note

`env.HYPERDRIVE.connectionString` is a runtime binding secret — it is never logged, committed, or exposed in the health response. Only the host portion (`hyperdrive_host`) is surfaced for diagnostic purposes. The `db_name` field is also safe to expose: it is a non-secret label, not a credential.
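This article does not show how the worker derives `hyperdrive_host`. One way to expose only the host, sketched here under that caveat, is to parse the connection string with the standard `URL` class and return just `hostname`, so the user, password, port, and database path never leave the function. `diagnosticHost` and the sample string are hypothetical.

```typescript
// Hypothetical helper: keep only the hostname, dropping user, password, port, and path.
function diagnosticHost(connectionString: string): string {
  return new URL(connectionString).hostname;
}

// Placeholder connection string with made-up credentials.
const host = diagnosticHost('postgres://app_user:s3cret@db.example.com:5432/app');
console.log(host);
```

Returning a single derived field, rather than redacting parts of the full string, keeps the credential from being leaked by an incomplete redaction pattern.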

---

## Resolution Summary

| Symptom | Root Cause | Fix |
|---|---|---|
| `database: down`, `latency_ms: 0` | `PrismaClientConfigSchema` rejected `postgres://` | Accept both `postgres://` and `postgresql://` in the `connectionString` validator |
| Hyperdrive dashboard shows zero activity | `PrismaClient` never created — Zod threw before any network call | Same fix as above |
| Health shows wrong `db_name` | Hyperdrive binding points to wrong Neon project/branch | Update Hyperdrive binding origin in Cloudflare dashboard |

---

## Related KB Articles

- [KB-001](./KB-001-api-not-available.md) — "Getting API is not available" on the main page
- *(planned)* KB-003 — Cloudflare Queue consumer not processing messages
- *(planned)* KB-004 — Angular SPA serves stale build after worker deploy

---

## Feedback & Contribution

If you discovered a new failure mode while using this article, please open an issue tagged `troubleshooting` and `documentation` in `jaypatrick/adblock-compiler` with the details so it can be captured in a follow-up KB entry.
2 changes: 1 addition & 1 deletion docs/troubleshooting/README.md
@@ -11,7 +11,7 @@ Each article follows a consistent structure: symptom, diagnostic commands, root
| Article | Title | Status |
|---|---|---|
| [KB-001](./KB-001-api-not-available.md) | "Getting API is not available" on the main page | ✅ Active |
| KB-002 | Clerk JWT auth degraded / local JWT fallback | 🗓 Planned |
| [KB-002](./KB-002-hyperdrive-database-down.md) | Hyperdrive binding connected but `database` service reports `down` | ✅ Active |
| KB-003 | Cloudflare Queue consumer not processing messages | 🗓 Planned |
| KB-004 | Angular SPA serves stale build after worker deploy | 🗓 Planned |
