
✨ CloudWatch metrics for ConfigCache (hit rate, cold/warm latency) #322

@sodre

Description


Problem or Use Case

The ConfigCache already tracks hits and misses internally via CacheStats, but these metrics are only available programmatically through get_cache_stats(). There is no integration with CloudWatch, so operators cannot:

  • Monitor cache hit rates in production dashboards
  • Measure cold-cache vs warm-cache latency (p50/p95/p99)
  • Set alarms on cache degradation (e.g., hit rate drops below threshold)
  • Validate the effectiveness of the BatchGetItem optimization (#298) in production

Split from #298 to keep the BatchGetItem optimization focused on DynamoDB access patterns while this issue addresses observability.

Proposed Solution

Custom CloudWatch Metrics

Publish metrics from ConfigCache to CloudWatch using the existing boto3/aioboto3 clients:

| Metric Name | Unit | Description |
|---|---|---|
| ConfigCache/HitRate | Percent | hits / (hits + misses) over the reporting period |
| ConfigCache/Hits | Count | Cache hits since last publish |
| ConfigCache/Misses | Count | Cache misses since last publish |
| ConfigCache/ColdCacheLatency | Milliseconds | Latency on cache miss (includes DynamoDB round trip) |
| ConfigCache/WarmCacheLatency | Milliseconds | Latency on cache hit (in-memory lookup) |
| ConfigCache/Size | Count | Number of cached entries |

Dimensions: StackName (required), Resource (optional, for per-resource breakdown)
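As a sketch of how the table above could map onto CloudWatch's `put_metric_data` payload shape, the helper below builds `MetricData` entries with the `StackName` and optional `Resource` dimensions. The `build_metric_data` name and exact structure are illustrative, not the final API:

```python
from typing import Optional


def build_metric_data(hits: int, misses: int, stack_name: str,
                      resource: Optional[str] = None) -> list[dict]:
    """Turn raw hit/miss counters into CloudWatch MetricData entries."""
    dimensions = [{"Name": "StackName", "Value": stack_name}]
    if resource is not None:
        # Optional per-resource breakdown
        dimensions.append({"Name": "Resource", "Value": resource})
    total = hits + misses
    hit_rate = (hits / total * 100.0) if total else 0.0
    return [
        {"MetricName": "Hits", "Unit": "Count", "Value": hits,
         "Dimensions": dimensions},
        {"MetricName": "Misses", "Unit": "Count", "Value": misses,
         "Dimensions": dimensions},
        {"MetricName": "HitRate", "Unit": "Percent", "Value": hit_rate,
         "Dimensions": dimensions},
    ]
```

The resulting list would be passed to `cloudwatch.put_metric_data(Namespace="ZaeLimiter/ConfigCache", MetricData=...)`.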

Integration Points

  1. CacheStats already tracks hits and misses; extend it with latency tracking (start/stop timers around fetch_fn calls)
  2. Periodic publishing: batch metrics and publish at a configurable interval (e.g., every 60s) to avoid per-request CloudWatch API calls
  3. Opt-in: disabled by default, enabled via enable_metrics=True on RateLimiter or ConfigCache

Latency Tracking

```python
import time

# On cache miss: measure fetch_fn latency (includes the DynamoDB round trip)
start = time.monotonic()
value = await fetch_fn()
elapsed_ms = (time.monotonic() - start) * 1000
self._cold_latencies.append(elapsed_ms)

# On cache hit: measure the in-memory lookup latency
start = time.monotonic()
value = entry.value
elapsed_ms = (time.monotonic() - start) * 1000
self._warm_latencies.append(elapsed_ms)
```
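The recorded latency lists can then be reduced to p50/p95/p99 client-side with the standard library, e.g. (a sketch; the `latency_percentiles` helper is illustrative, and CloudWatch can alternatively compute percentiles server-side if raw values are published):

```python
import statistics


def latency_percentiles(samples: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 from recorded latency samples (ms)."""
    if not samples:
        return {}
    # quantiles with n=100 yields the 1st..99th percentile cut points
    q = statistics.quantiles(samples, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}
```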

Acceptance Criteria

  • CacheStats extended with cold_latency_ms and warm_latency_ms lists for percentile calculation
  • New ConfigCacheMetrics class (or equivalent) publishes custom metrics to CloudWatch namespace ZaeLimiter/ConfigCache
  • Metrics include HitRate, Hits, Misses, ColdCacheLatency, WarmCacheLatency, and Size
  • All metrics tagged with StackName dimension; latency metrics support optional Resource dimension
  • Metrics publishing is opt-in (disabled by default), enabled via a RateLimiter constructor parameter
  • Metrics are batched and published periodically (not per-request) to minimize CloudWatch API costs
  • Unit tests in tests/unit/ verify metric values match CacheStats counters
  • Unit tests verify latency is recorded on cache hit and cache miss code paths
  • Sync variant generated via generate_sync.py if new async source files are added
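To illustrate the opt-in criterion, a hypothetical ConfigCacheMetrics could no-op unless explicitly enabled; here `published` stands in for real `put_metric_data` calls so the behavior is testable offline:

```python
class ConfigCacheMetrics:
    """Illustrative opt-in wrapper: publishing is a no-op unless enabled."""

    NAMESPACE = "ZaeLimiter/ConfigCache"

    def __init__(self, enabled: bool = False, stack_name: str = ""):
        self.enabled = enabled
        self.stack_name = stack_name
        self.published: list[dict] = []  # stand-in for put_metric_data calls

    def publish(self, name: str, value: float, unit: str) -> None:
        if not self.enabled:
            return  # disabled by default: zero CloudWatch traffic
        self.published.append({
            "Namespace": self.NAMESPACE,
            "MetricName": name,
            "Value": value,
            "Unit": unit,
            "Dimensions": [{"Name": "StackName", "Value": self.stack_name}],
        })
```

RateLimiter would construct this with `enabled=enable_metrics`, so existing deployments see no behavior change.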

Alternatives Considered

  1. EMF (Embedded Metrics Format) via Lambda Powertools: Only works inside Lambda. ConfigCache runs in the application process, not the aggregator Lambda.

  2. Expose Prometheus endpoint: Adds a dependency and requires a metrics scraper. CloudWatch is already available in the AWS environment.

  3. Log-based metrics (CloudWatch Logs Insights): Requires structured logging and post-hoc queries. Custom metrics provide real-time dashboards and alarms.

Dependencies
