
Redundant MetricsCapture in trace_call produces orphan metrics with incomplete resource labels #16173

@waiho-gumloop


Environment details

  • OS type and version: macOS / Linux
  • Python version: 3.13
  • google-cloud-spanner version: 3.63.0 (current main)

Description

Every Spanner operation that goes through trace_call() produces orphan OpenTelemetry metric data points with incomplete resource labels (missing project_id and instance_id). Because of cumulative aggregation, these orphan data points persist for the process lifetime and are re-exported to Cloud Monitoring every 60 seconds; Cloud Monitoring rejects them with:

INVALID_ARGUMENT: One or more TimeSeries could not be written:
timeSeries[...]: the set of resource labels is incomplete, missing (instance_id)

Root cause

trace_call() in _opentelemetry_tracing.py wraps every operation with a bare MetricsCapture() (no resource_info). Meanwhile, every caller of trace_call already provides its own MetricsCapture(self._resource_info) with correct labels.

When Python evaluates with trace_call(...) as span, MetricsCapture(self._resource_info):, two separate MetricsTracer instances are created:

  1. tracer_A (from trace_call's internal MetricsCapture()): has instance_config, location, client_hash, client_uid, client_name from the factory, but never receives project_id or instance_id
  2. tracer_B (from the caller's MetricsCapture(resource_info)): has correct labels, overwrites tracer_A in the context var

On exit, tracer_B records correct metrics first, then tracer_A records metrics with incomplete labels. The SpannerMetricsTracerFactory never carries project_id/instance_id in its _client_attributes (those are only set per-tracer, via resource_info or the MetricsInterceptor), so tracer_A starts without them. And because the MetricsInterceptor only updates the current context-var tracer (tracer_B), tracer_A is never backfilled.
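The overwrite-then-record ordering can be sketched with stand-ins (FakeMetricsCapture and the context var below are illustrative, not the library's actual implementation):

```python
import contextvars

# Hypothetical stand-in for the library's per-operation tracer slot.
_current_tracer = contextvars.ContextVar("current_tracer", default=None)
recorded = []  # labels recorded on exit, in order

class FakeMetricsCapture:
    """Creates a tracer on __enter__ and records ITS OWN tracer on __exit__."""
    def __init__(self, resource_info=None):
        self.labels = dict(resource_info or {})  # bare capture => incomplete labels
    def __enter__(self):
        _current_tracer.set(self.labels)  # each capture overwrites the context var
        return self
    def __exit__(self, *exc):
        recorded.append(self.labels)
        return False

# `with trace_call(...), MetricsCapture(resource_info):` nests like this:
with FakeMetricsCapture():  # tracer_A: created inside trace_call, no labels
    with FakeMetricsCapture({"project_id": "p", "instance_id": "i"}):  # tracer_B
        pass

print(recorded)
# tracer_B's complete labels are recorded first, then tracer_A's empty set.
```

The MetricsInterceptor can only repair whatever tracer the context var currently holds, which is tracer_B; tracer_A's empty label set is recorded as-is.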

With OpenTelemetry's cumulative aggregation, once these orphan aggregation buckets are created, they persist for the process lifetime and are re-exported every 60 seconds.
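Why the orphan never goes away can be seen from a minimal model of cumulative temporality (the names below are illustrative, not the OpenTelemetry SDK's internals):

```python
# Cumulative aggregation keeps one bucket per label set for the process
# lifetime, and every export interval re-emits ALL buckets ever seen.
buckets = {}  # label-set -> cumulative count

def record(labels):
    key = tuple(sorted(labels.items()))
    buckets[key] = buckets.get(key, 0) + 1

def export():
    # Cumulative temporality: every known label set is exported each cycle.
    return [dict(key) for key in buckets]

record({"project_id": "p", "instance_id": "i"})
record({})  # orphan data point with incomplete labels

first = export()
second = export()  # no new operations, but the orphan is exported again
print(second)
```

Once the label-less bucket exists, every 60-second export cycle includes it, and Cloud Monitoring rejects the batch again.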


Impact

  • Affects every Spanner operation (~27 code paths) on every invocation
  • Creates persistent orphan metric aggregation buckets
  • Produces repeated INVALID_ARGUMENT error logs every 60 seconds
  • Wastes CPU/network on exporting invalid TimeSeries
  • Application functionality is unaffected; valid metrics from the caller's MetricsCapture still work

Steps to reproduce

  1. Create a spanner.Client() with metrics enabled (default)
  2. Perform any Spanner operation (e.g., session.create(), snapshot.execute_sql())
  3. Observe INVALID_ARGUMENT errors logged from the metrics exporter every 60 seconds

Suggested fix

Remove the bare MetricsCapture() from trace_call — it is redundant since every caller already provides its own. See PR googleapis/python-spanner#1522.

Metadata


Labels

api: spanner (Issues related to the Spanner API.)
