Skip to content

diag: cache watermark reset on CommandLog archive uses wrong sentinel value — new rows may be missed between reset and next full scan #312

@nanoDBA

Description

@nanoDBA

Problem

When the cache watermark gap is detected (CommandLog was archived/truncated and the cached watermark ID exceeds the current CommandLog max ID), the history table watermark is reset to 0 but the cache table watermark is reset to @cl_max_id_cache (the current CommandLog max ID):

History table reset (lines ~720–728):

IF @cl_max_id > 0 AND @history_max_id > @cl_max_id
BEGIN
    RAISERROR(N'  WARNING: History watermark (%i) exceeds CommandLog max ID (%i). Resetting watermark...', 10, 1, ...);
    SET @history_max_id = 0;    /* #240: reset to 0, not @cl_max_id — NOT EXISTS dedup guard prevents duplicates */
END;

Cache table reset (lines ~800–808):

IF @cl_max_id_cache > 0 AND @cache_watermark > @cl_max_id_cache
BEGIN
    RAISERROR(N'  WARNING: Cache watermark (%i) exceeds CommandLog max ID (%i). Resetting...', 10, 1, ...);
    SET @cache_watermark = @cl_max_id_cache;   -- ← different sentinel value!
END;

The history reset uses 0 (full re-scan on next run), while the cache reset uses the current max CommandLog ID (skip everything currently in CommandLog). The comment on the history reset explains this: "#240: reset to 0, not @cl_max_id — NOT EXISTS dedup guard prevents duplicates."

But there is no such dedup guard on the cache table INSERT — the cache table's PRIMARY KEY on CommandLogID handles deduplication. So the cache reset to @cl_max_id_cache means: on the next run after an archive event, all newly-added rows after the archive are cached correctly, but the rows that existed in CommandLog at archive time are never re-cached.

This creates a race: if CommandLog was archived and immediately re-populated (common in rolling-window retention patterns where old rows are deleted and new rows are inserted in the same batch), the window between SET @cache_watermark = @cl_max_id_cache and the next run's cache INSERT will silently skip any rows with IDs ≤ the old max. These are exactly the rows from the active maintenance run that was in progress during the archive event.

Evidence

Cache watermark reset (lines ~800–808):

IF @cl_max_id_cache > 0 AND @cache_watermark > @cl_max_id_cache
BEGIN
    RAISERROR(N'  WARNING: Cache watermark (%i) exceeds CommandLog max ID (%i). Resetting — CommandLog may have been archived.', 10, 1,
        @cache_watermark, @cl_max_id_cache) WITH NOWAIT;
    SET @cache_watermark = @cl_max_id_cache;  -- ← should be 0 for consistency with history reset
END;

Compare with history table reset (line ~724):

SET @history_max_id = 0;    /* #240: reset to 0, not @cl_max_id — NOT EXISTS dedup guard prevents duplicates */

Proposed Fix

Reset both watermarks to 0 for consistency, and rely on the cache table's PRIMARY KEY to handle deduplication on re-insert:

-- Cache watermark reset: use 0 (not @cl_max_id_cache) so full re-scan occurs
-- PK_StatUpdateDiagCache on CommandLogID prevents duplicate inserts
SET @cache_watermark = 0;

Add a comment explaining the reasoning (mirroring the #240 comment on the history reset).

Impact

Medium. In environments using CommandLog retention policies that delete old rows and immediately insert new ones (rolling 30-day windows are common), the cache table will silently skip the newly-inserted rows from the active maintenance window. This produces a gap in the #stat_updates data that is invisible to the user — no error, no warning, just missing rows from recent runs. The diagnostic will appear to analyze less data than expected, potentially missing current-window failures entirely.

Metadata

Metadata

Assignees

No one assigned

    Labels

    p4-deferredValid but deferred to future releasepersona-reviewIssue generated via SQL SME persona code reviewtriage-reviewedReviewed and triaged 2026-03-11

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions