fix(md-exports): Fix 100% cache miss rate on Vercel by using content-only cache keys #16313

Merged
BYK merged 4 commits into master from byk/fix-md-cache-hash-normalization on Feb 10, 2026

@BYK BYK commented Feb 10, 2026

Problem

The md-exports script converts pre-rendered HTML pages to markdown files for LLM consumption. It uses a file-based cache keyed on the MD5 hash of the HTML content to avoid re-processing unchanged pages. While cache hit rates were near-perfect locally (99.99%), Vercel builds consistently showed 0% cache hits — every single page was re-processed on every deploy, adding ~70 extra seconds to each build.
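
A minimal sketch of the caching idea described above, not the script's actual code. It assumes an in-memory `Map` as a stand-in for the file-based cache, and `cacheKey`, `convertWithCache`, and `toMarkdown` are hypothetical names introduced for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for the on-disk cache: key -> generated markdown.
const cache = new Map<string, string>();
const stats = { hits: 0, misses: 0 };

// Cache key: MD5 hex digest of the HTML content.
function cacheKey(html: string): string {
  return createHash("md5").update(html).digest("hex");
}

// Convert a page, reusing the cached result when the key matches.
function convertWithCache(html: string, convert: (html: string) => string): string {
  const key = cacheKey(html);
  const cached = cache.get(key);
  if (cached !== undefined) {
    stats.hits++;
    return cached;
  }
  stats.misses++;
  const markdown = convert(html);
  cache.set(key, markdown);
  return markdown;
}

// Placeholder converter for illustration only: strips tags.
const toMarkdown = (html: string): string => html.replace(/<[^>]+>/g, "").trim();

convertWithCache("<p>Hello</p>", toMarkdown); // miss: first time this content is seen
convertWithCache("<p>Hello</p>", toMarkdown); // hit: identical bytes, identical key
console.log(stats); // { hits: 1, misses: 1 }
```

The cache only pays off when unchanged pages produce byte-identical key material across builds, which is the property that failed on Vercel.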

Root Cause

The cache key was computed from the full stripped HTML, which still contained Emotion CSS hashes (css-o2ofml, etc.) in <style data-emotion> tags and class attributes throughout the page. These hashes change between Vercel builds even for the same commit, invalidating every cache entry.

Other build-specific artifacts in the layout shell (sidebar HTML from merged PRs, font variable classes, CSS module hashes) also contributed to instability, but Emotion CSS was the primary culprit since it wasn't covered by the existing normalization.
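
To illustrate why a single unstable class name is enough to invalidate a full-HTML key, here is a small sketch. The markup is simplified and hypothetical; `css-o2ofml` comes from this PR, while `css-1x2y3z` is a made-up value standing in for a second build.

```typescript
import { createHash } from "node:crypto";

const md5 = (s: string): string => createHash("md5").update(s).digest("hex");

// Two renders of the same page; only the Emotion class hash differs.
const render = (emotionClass: string): string =>
  `<html><head><style data-emotion="css">.${emotionClass}{color:red}</style></head>` +
  `<body><div id="main"><p class="${emotionClass}">Same content</p></div></body></html>`;

const buildA = render("css-o2ofml");
const buildB = render("css-1x2y3z"); // hypothetical hash from another build

// Different bytes give a different key, so the cached entry is never found.
console.log(md5(buildA) === md5(buildB)); // false
```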

Solution

Instead of trying to strip/normalize all unstable patterns from the full HTML, compute the cache key from only the content the pipeline actually uses:

  1. <title> — becomes the H1 heading
  2. <link rel="canonical"> — used for link rewriting
  3. <div id="main"> — becomes the markdown body

Everything else (header, sidebar, footer, scripts, styles, fonts) is excluded from the cache key entirely since it's irrelevant for markdown output. Within div#main, Emotion classes and CSS module hashes are still normalized since code block components use those inside the content area.

This approach is fundamentally more robust than pattern-matching unstable elements — any new source of non-determinism in the layout shell is automatically ignored.
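
The content-only key described above could be sketched as follows. This is an assumption-laden illustration: the helper names (`extractTitle`, `extractCanonical`, `extractMain`, `contentCacheKey`), the regex-based extraction, and the example URL are all hypothetical, and the real script may use an HTML parser instead. The normalization applied inside div#main is omitted here.

```typescript
import { createHash } from "node:crypto";

function extractTitle(html: string): string {
  const m = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return m ? m[1].trim() : "";
}

function extractCanonical(html: string): string {
  const link = /<link\b[^>]*\brel="canonical"[^>]*>/i.exec(html);
  if (!link) return "";
  const href = /\bhref="([^"]*)"/i.exec(link[0]);
  return href ? href[1] : "";
}

// Return <div id="main">...</div>, tracking nested <div> depth to find its close tag.
function extractMain(html: string): string {
  const open = html.search(/<div\b[^>]*\bid="main"[^>]*>/);
  if (open === -1) return "";
  const tagRe = /<(\/?)div\b[^>]*>/g;
  tagRe.lastIndex = open;
  let depth = 0;
  let m: RegExpExecArray | null;
  while ((m = tagRe.exec(html)) !== null) {
    depth += m[1] === "/" ? -1 : 1;
    if (depth === 0) return html.slice(open, m.index + m[0].length);
  }
  return html.slice(open); // unbalanced markup: fall back to the rest of the document
}

// Hash only what the markdown pipeline consumes; the layout shell is ignored.
function contentCacheKey(html: string): string {
  const parts = [extractTitle(html), extractCanonical(html), extractMain(html)];
  return createHash("md5").update(parts.join("\n")).digest("hex");
}

// Hypothetical page: `shell` varies the layout, `body` varies the content.
const page = (shell: string, body: string): string =>
  `<html><head><title>Install</title>` +
  `<link rel="canonical" href="https://docs.example.com/install/">` +
  `<style data-emotion="css">.${shell}{}</style></head>` +
  `<body><nav class="${shell}">sidebar</nav>` +
  `<div id="main"><div><p>${body}</p></div></div>` +
  `<footer>footer</footer></body></html>`;

const a = contentCacheKey(page("css-o2ofml", "Run the installer."));
const b = contentCacheKey(page("css-zzzzzz", "Run the installer."));
const c = contentCacheKey(page("css-o2ofml", "Run the new installer."));
console.log(a === b); // true: only the shell changed
console.log(a === c); // false: the content changed
```

Because the key is built by inclusion rather than exclusion, nothing outside the three extracted pieces can affect it.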

Results

Vercel build with warm cache:

Worker[3]: Cache stats: 2362 hits, 0 misses (0.0% miss rate)
Worker[2]: Cache stats: 2362 hits, 0 misses (0.0% miss rate)
Worker[1]: Cache stats: 2362 hits, 0 misses (0.0% miss rate)
Worker[0]: Cache stats: 2361 hits, 1 misses (0.0% miss rate)

9447/9448 cache hits (99.99%) — the single miss was a legitimate content change (updated SDK registry data). The md-exports step dropped from ~80s to ~10s.
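
As a check on the arithmetic, the per-worker lines above can be aggregated with a short sketch (the parsing regex is an assumption based on the log format shown):

```typescript
const log = [
  "Worker[3]: Cache stats: 2362 hits, 0 misses (0.0% miss rate)",
  "Worker[2]: Cache stats: 2362 hits, 0 misses (0.0% miss rate)",
  "Worker[1]: Cache stats: 2362 hits, 0 misses (0.0% miss rate)",
  "Worker[0]: Cache stats: 2361 hits, 1 misses (0.0% miss rate)",
];

let hits = 0;
let misses = 0;
for (const line of log) {
  const m = /(\d+) hits, (\d+) misses/.exec(line);
  if (!m) continue;
  hits += Number(m[1]);
  misses += Number(m[2]);
}

const total = hits + misses;
const hitRate = (100 * hits) / total;
console.log(`${hits}/${total} cache hits (${hitRate.toFixed(2)}%)`);
// prints "9447/9448 cache hits (99.99%)"
```

Worker[0] reports a 0.0% miss rate despite one miss because 1 of 2362 is about 0.04%, which rounds to 0.0 at one decimal place.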

…e keys

The previous fix stripped script/link/style tags but missed build-specific
hashes embedded in the HTML body itself:
- next/font variable classes on <body> (e.g., __variable_c58dd6)
- CSS module class name hashes (e.g., style_sidebar__iEJoR, 60+ occurrences)
- /_next/static/media/ content hashes (e.g., sentry-logo-dark.fc8e1eeb.svg)

These change on every Next.js rebuild even when content is unchanged,
causing 100% cache miss rates.

Verified locally: back-to-back builds now achieve 99.99% cache hit rate
(9447/9448 files), with the single miss being a legitimate content change.
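
A rough sketch of what normalizing the three hash patterns listed above might look like. The regular expressions are guesses derived only from the examples in this message (a 6-character hex suffix for font variables, a 5-character suffix for CSS modules, an 8-character hex segment for media files); `normalizeBuildHashes` is a hypothetical name, and the second build's hash values are invented.

```typescript
function normalizeBuildHashes(html: string): string {
  return html
    // next/font variable classes, e.g. __variable_c58dd6
    .replace(/__variable_[0-9a-f]{6}/g, "__variable_X")
    // CSS module class names of the form file_local__hash, e.g. style_sidebar__iEJoR
    .replace(/\b([A-Za-z0-9]+_[A-Za-z0-9-]+__)[A-Za-z0-9_-]{5}(?![\w-])/g, "$1X")
    // /_next/static/media/ content hashes, e.g. sentry-logo-dark.fc8e1eeb.svg
    .replace(/(\/_next\/static\/media\/[^"'\s]*?)\.[0-9a-f]{8}(\.\w+)/g, "$1$2");
}

const build1 =
  '<body class="__variable_c58dd6"><aside class="style_sidebar__iEJoR">' +
  '<img src="/_next/static/media/sentry-logo-dark.fc8e1eeb.svg"></aside></body>';
// Hypothetical second build of the same content with different hashes.
const build2 =
  '<body class="__variable_a1b2c3"><aside class="style_sidebar__Qw3rT">' +
  '<img src="/_next/static/media/sentry-logo-dark.0123abcd.svg"></aside></body>';

console.log(normalizeBuildHashes(build1) === normalizeBuildHashes(build2)); // true
```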

Co-Authored-By: Claude <noreply@anthropic.com>

vercel bot commented Feb 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| develop-docs | Ready | Preview, Comment | Feb 10, 2026 9:22pm |
| sentry-docs | Ready | Preview, Comment | Feb 10, 2026 9:22pm |

@BYK BYK requested a review from sergical February 10, 2026 12:14
…add cache miss diagnostics

- Split stripUnstableElements() into two functions: stripUnstableElements() for
  safe tag-level removal (used as pipeline input) and normalizeForCacheKey() for
  hash normalization (used only for cache key computation). This fixes the CSS
  module regex corrupting actual page content like Sentry__Debug -> Sentry__X.

- Add temporary diagnostic logging on cache misses that logs per-section hashes
  (head, main, layout) for well-known files, enabling cross-build comparison on
  Vercel to identify why cache hit rate is 0%.

- Bump CACHE_VERSION 5 -> 6 for the changed cache key computation.
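
The split described in the first bullet can be illustrated with a sketch. The function names `stripUnstableElements` and `normalizeForCacheKey` come from this commit message, but their bodies here are assumptions, including the deliberately broad regex used to reproduce the `Sentry__Debug -> Sentry__X` corruption.

```typescript
// Tag-level removal only: safe to feed into the markdown pipeline.
// (Assumed to drop <script> and <style> blocks; the real list may differ.)
function stripUnstableElements(html: string): string {
  return html
    .replace(/<script\b[\s\S]*?<\/script>/gi, "")
    .replace(/<style\b[\s\S]*?<\/style>/gi, "");
}

// Lossy hash normalization: rewrites any "name__suffix" token, prose included.
// Its output should only ever be hashed, never converted to markdown.
function normalizeForCacheKey(html: string): string {
  return html.replace(/(\w+?__)\w+/g, "$1X");
}

const raw =
  '<script>var a=1</script><p class="style_note__iEJoR">Use Sentry__Debug here.</p>';

const pipelineInput = stripUnstableElements(raw); // content intact
const keyMaterial = normalizeForCacheKey(pipelineInput); // used for the cache key only

console.log(pipelineInput); // <p class="style_note__iEJoR">Use Sentry__Debug here.</p>
console.log(keyMaterial); // <p class="style_note__X">Use Sentry__X here.</p>
```

Keeping the lossy step out of the pipeline input means an over-eager pattern can at worst cause a spurious cache hit or miss, not corrupted output.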

Co-Authored-By: Claude <noreply@anthropic.com>
…f full HTML normalization

Root cause identified: Emotion CSS hashes (css-o2ofml, etc.) in <style data-emotion>
tags and class attributes change between Vercel builds even for the same commit.
These were not being stripped or normalized, causing 100% cache miss rate.

Instead of trying to normalize all unstable patterns in the full HTML, this change
extracts only the three elements the pipeline actually uses (title, canonical URL,
div#main content) and hashes just those. This makes the cache key immune to:
- Layout/sidebar/header changes (from merged PRs)
- Emotion CSS hash changes
- Font variable class changes
- CSS module hash changes in layout elements
- Any other build-specific variation in the HTML shell

Within div#main, we still normalize Emotion classes and CSS module hashes since
code block components use those inside the content area.

Bumps CACHE_VERSION 6 -> 7.
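
One common way a version constant invalidates old entries is to mix it into the key. The sketch below assumes that scheme; how the script actually applies CACHE_VERSION is not shown in this PR, and `versionedKey` is a hypothetical helper.

```typescript
import { createHash } from "node:crypto";

// Assumption: the version is hashed together with the content, so entries
// written under an older key algorithm can never match a newer lookup.
function versionedKey(version: number, content: string): string {
  return createHash("md5").update(`v${version}\n${content}`).digest("hex");
}

const content = '<div id="main">same content</div>';
console.log(versionedKey(6, content) === versionedKey(7, content)); // false
```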

Co-Authored-By: Claude <noreply@anthropic.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Cache miss root cause identified and fixed — Emotion CSS hashes were the culprit.
The content-only extraction approach achieves 99.99% cache hit rate on Vercel
(9447/9448 hits). Diagnostic logging is no longer needed.

Co-Authored-By: Claude <noreply@anthropic.com>
@BYK BYK changed the title from "fix(md-exports): Normalize CSS module and font hashes for stable cache keys" to "fix(md-exports): Fix 100% cache miss rate on Vercel by using content-only cache keys" on Feb 10, 2026
@BYK BYK enabled auto-merge (squash) February 10, 2026 21:17

@sergical sergical left a comment

you're the best!

@BYK BYK merged commit a51f343 into master Feb 10, 2026
14 checks passed
@BYK BYK deleted the byk/fix-md-cache-hash-normalization branch February 10, 2026 21:22