fix(md-exports): Fix 100% cache miss rate on Vercel by using content-only cache keys #16313
Merged
…e keys

The previous fix stripped script/link/style tags but missed build-specific hashes embedded in the HTML body itself:

- next/font variable classes on <body> (e.g., __variable_c58dd6)
- CSS module class name hashes (e.g., style_sidebar__iEJoR, 60+ occurrences)
- /_next/static/media/ content hashes (e.g., sentry-logo-dark.fc8e1eeb.svg)

These change on every Next.js rebuild even when content is unchanged, causing 100% cache miss rates.

Verified locally: back-to-back builds now achieve 99.99% cache hit rate (9447/9448 files), with the single miss being a legitimate content change.

Co-Authored-By: Claude <noreply@anthropic.com>
…add cache miss diagnostics

- Split stripUnstableElements() into two functions: stripUnstableElements() for safe tag-level removal (used as pipeline input) and normalizeForCacheKey() for hash normalization (used only for cache key computation). This fixes the CSS module regex corrupting actual page content like Sentry__Debug -> Sentry__X.
- Add temporary diagnostic logging on cache misses that logs per-section hashes (head, main, layout) for well-known files, enabling cross-build comparison on Vercel to identify why cache hit rate is 0%.
- Bump CACHE_VERSION 5 -> 6 for the changed cache key computation.

Co-Authored-By: Claude <noreply@anthropic.com>
…f full HTML normalization

Root cause identified: Emotion CSS hashes (css-o2ofml, etc.) in <style data-emotion> tags and class attributes change between Vercel builds even for the same commit. These were not being stripped or normalized, causing 100% cache miss rate.

Instead of trying to normalize all unstable patterns in the full HTML, this change extracts only the three elements the pipeline actually uses (title, canonical URL, div#main content) and hashes just those. This makes the cache key immune to:

- Layout/sidebar/header changes (from merged PRs)
- Emotion CSS hash changes
- Font variable class changes
- CSS module hash changes in layout elements
- Any other build-specific variation in the HTML shell

Within div#main, we still normalize Emotion classes and CSS module hashes since code block components use those inside the content area.

Bumps CACHE_VERSION 6 -> 7.

Co-Authored-By: Claude <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Cache miss root cause identified and fixed: Emotion CSS hashes were the culprit. The content-only extraction approach achieves 99.99% cache hit rate on Vercel (9447/9448 hits). Diagnostic logging is no longer needed.

Co-Authored-By: Claude <noreply@anthropic.com>
Problem
The md-exports script converts pre-rendered HTML pages to markdown files for LLM consumption. It uses a file-based cache keyed on the MD5 hash of the HTML content to avoid re-processing unchanged pages. While cache hit rates were near-perfect locally (99.99%), Vercel builds consistently showed 0% cache hits — every single page was re-processed on every deploy, adding ~70 extra seconds to each build.
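As an illustration of the caching scheme described above, here is a minimal sketch of a file-based cache keyed on the MD5 hash of the HTML. The names `convertWithCache`, `toMarkdown`, and the cache directory layout are hypothetical and are not taken from the md-exports script; the tag-stripping converter is only a stand-in for the real HTML-to-markdown pipeline.

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: hex MD5 digest of a string, used as the cache file name.
function md5(input: string): string {
  return createHash("md5").update(input).digest("hex");
}

// Hypothetical cache wrapper. `convert` stands in for the expensive conversion step.
function convertWithCache(
  html: string,
  cacheDir: string,
  convert: (html: string) => string
): { markdown: string; hit: boolean } {
  mkdirSync(cacheDir, { recursive: true });
  const cacheFile = join(cacheDir, `${md5(html)}.md`);
  if (existsSync(cacheFile)) {
    // Same bytes as a previous run: reuse the stored markdown.
    return { markdown: readFileSync(cacheFile, "utf8"), hit: true };
  }
  const markdown = convert(html);
  writeFileSync(cacheFile, markdown, "utf8");
  return { markdown, hit: false };
}

// Usage: the second call with identical HTML is served from the cache.
const cacheDir = mkdtempSync(join(tmpdir(), "md-exports-cache-"));
let conversions = 0;
const toMarkdown = (html: string): string => {
  conversions += 1;
  return html.replace(/<[^>]+>/g, "").trim(); // placeholder conversion, not real markdown
};
const html = '<div id="main"><p>Hello</p></div>';
const first = convertWithCache(html, cacheDir, toMarkdown);
const second = convertWithCache(html, cacheDir, toMarkdown);
```

A cache like this only pays off when the hashed input is byte-for-byte stable across builds, which is the property that broke on Vercel.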
Root Cause
The cache key was computed from the full stripped HTML, which still contained Emotion CSS hashes (css-o2ofml, etc.) in <style data-emotion> tags and class attributes throughout the page. These hashes change between Vercel builds even for the same commit, invalidating every cache entry.

Other build-specific artifacts in the layout shell (sidebar HTML from merged PRs, font variable classes, CSS module hashes) also contributed to instability, but Emotion CSS was the primary culprit since it wasn't covered by the existing normalization.
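The failure mode can be reproduced in a few lines. The sketch below hashes two renderings of the same content that differ only in an Emotion class name. The page template and the second hash (`css-1x2y3z`) are made up for illustration; only `css-o2ofml` appears in this PR.

```typescript
import { createHash } from "node:crypto";

const md5Hex = (s: string): string => createHash("md5").update(s).digest("hex");

// Hypothetical page template: only the Emotion class name varies between builds.
const renderPage = (emotionClass: string): string =>
  `<html><head><style data-emotion="css">.${emotionClass}{color:red}</style></head>` +
  `<body><div id="main"><p class="${emotionClass}">Same content</p></div></body></html>`;

const buildA = renderPage("css-o2ofml");
const buildB = renderPage("css-1x2y3z"); // made-up hash standing in for a second build

// The visible content is identical, but the full-HTML keys differ,
// so a cache keyed on the whole document misses on every page.
const keysMatch = md5Hex(buildA) === md5Hex(buildB);
```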
Solution
Instead of trying to strip or normalize every unstable pattern in the full HTML, compute the cache key from only the content the pipeline actually uses:

- <title>: becomes the H1 heading
- <link rel="canonical">: used for link rewriting
- <div id="main">: becomes the markdown body

Everything else (header, sidebar, footer, scripts, styles, fonts) is excluded from the cache key entirely since it's irrelevant for markdown output. Within div#main, Emotion classes and CSS module hashes are still normalized since code block components use those inside the content area.

This approach is fundamentally more robust than pattern-matching unstable elements: any new source of non-determinism in the layout shell is automatically ignored.
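A rough sketch of a content-only cache key follows. It is not the script's actual code: the function names, the regex-based extraction (a real implementation may well use an HTML parser), the normalization patterns, and the way the three parts are combined before hashing are all assumptions made for illustration. It also assumes double-quoted attributes.

```typescript
import { createHash } from "node:crypto";

// Hypothetical: inner HTML of <div id="main">, found by counting nested <div> tags.
function extractMain(html: string): string {
  const open = /<div\b[^>]*\bid="main"[^>]*>/.exec(html);
  if (!open) return "";
  const start = open.index + open[0].length;
  const tag = /<(\/?)div\b[^>]*>/g;
  tag.lastIndex = start;
  let depth = 1;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(html)) !== null) {
    depth += m[1] === "/" ? -1 : 1;
    if (depth === 0) return html.slice(start, m.index);
  }
  return html.slice(start); // unbalanced markup: fall back to the remainder
}

// Hypothetical patterns. Applied ONLY to the key input, never to pipeline input,
// because they are lossy and would rewrite real text such as Sentry__Debug.
function normalizeForKey(s: string): string {
  return s
    .replace(/\bcss-[a-z0-9]+\b/g, "css-X") // Emotion classes
    .replace(/\b(\w+__)\w{5}\b/g, "$1X"); // CSS module suffixes
}

function contentCacheKey(html: string): string {
  const title = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html)?.[1]?.trim() ?? "";
  const link = /<link\b[^>]*\brel="canonical"[^>]*>/i.exec(html)?.[0] ?? "";
  const canonical = /\bhref="([^"]*)"/.exec(link)?.[1] ?? "";
  const main = normalizeForKey(extractMain(html));
  // JSON.stringify keeps the three parts unambiguously separated.
  return createHash("md5").update(JSON.stringify([title, canonical, main])).digest("hex");
}

// Made-up page: the shell and the Emotion class vary per build, the content does not.
const page = (bodyClass: string, emotionClass: string, text: string): string =>
  `<html><head><title>Install</title>` +
  `<link rel="canonical" href="https://example.com/install/"/>` +
  `<style data-emotion="css">.${emotionClass}{color:red}</style></head>` +
  `<body class="${bodyClass}"><nav><div class="style_sidebar__iEJoR">Nav</div></nav>` +
  `<div id="main"><div class="${emotionClass}"><p>${text}</p></div></div>` +
  `</body></html>`;

const keyA = contentCacheKey(page("__variable_c58dd6", "css-o2ofml", "Run the installer."));
const keyB = contentCacheKey(page("__variable_aaaaaa", "css-abc123", "Run the installer."));
const keyC = contentCacheKey(page("__variable_c58dd6", "css-o2ofml", "Run the new installer."));
```

In this sketch `keyA` and `keyB` are equal even though the shell and Emotion hashes differ, while `keyC` differs because the content inside div#main changed. Keeping the lossy normalization out of the pipeline input mirrors the earlier split into stripUnstableElements() and normalizeForCacheKey().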
Results
Vercel build with warm cache:
9447/9448 cache hits (99.99%) — the single miss was a legitimate content change (updated SDK registry data). The md-exports step dropped from ~80s to ~10s.