feat(preprod): Add Datadog metrics for snapshot upload and diff lifecycle #111024

Open
NicoHinderling wants to merge 5 commits into master from nico/feat/snapshot-analytics-events

@NicoHinderling
Contributor

Adds metrics.distribution and metrics.incr instrumentation to the snapshot upload endpoint and compare_snapshots task so the Preprod Health dashboard can track snapshot usage and build quality signals.

Metrics added

On upload (ProjectPreprodSnapshotEndpoint.post):

  • preprod.snapshots.upload.image_count — number of images per upload, tagged has_vcs to distinguish CI builds from standalone uploads
  • preprod.snapshots.upload.duplicate_image_file_names — count of image_file_name collisions within a single manifest (proxy for same screen uploaded under multiple hashes)
  • preprod.snapshots.upload.bundles_per_commit — how many snapshot bundles have been uploaded for the same commit (only emitted when has_vcs=True)
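
As a rough illustration of how the first two upload values could be derived from a manifest, here is a minimal sketch. The function name `upload_metric_values`, the sample data, and the assumption that the image map is keyed by content hash are hypothetical; only the counting expressions mirror the snippet quoted in the review thread. The actual metric emission is left out.

```python
from collections import Counter


def upload_metric_values(images):
    """Return the per-upload values for one manifest's image map.

    `images` maps an image key (assumed here to be a content hash) to its
    metadata dict. Entries without an image_file_name are ignored when
    counting collisions.
    """
    file_name_counts = Counter(
        v["image_file_name"] for v in images.values() if v.get("image_file_name")
    )
    # A name seen c times contributes c - 1 collisions.
    duplicate_count = sum(c - 1 for c in file_name_counts.values() if c > 1)
    return {
        "image_count": len(images),
        "duplicate_image_file_names": duplicate_count,
    }


# Hypothetical manifest: "home.png" appears under two different hashes.
example_images = {
    "hash_a": {"image_file_name": "home.png"},
    "hash_b": {"image_file_name": "home.png"},
    "hash_c": {"image_file_name": "settings.png"},
    "hash_d": {},
}
print(upload_metric_values(example_images))
# prints {'image_count': 4, 'duplicate_image_file_names': 1}
```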

On diff completion (compare_snapshots task):

  • preprod.snapshots.diff.duration_s — time from comparison record creation to diff task completion
  • preprod.snapshots.e2e_duration_s — time from artifact upload to diff completion, mirrors the existing preprod.size_analysis.results_e2e pattern
  • preprod.snapshots.image.avg_size_bytes — average byte size of images actually fetched for pixel diff (excludes added/removed)
  • preprod.snapshots.diff.zero_changes — incremented when a diff completes with no changed, added, or removed images
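
The two simplest diff-side values can be sketched as plain arithmetic. The helper names and arguments below are hypothetical and are not taken from the PR; the sketch only assumes that the task knows the byte sizes of the images it fetched and the two timestamps involved.

```python
from datetime import datetime, timezone
from typing import List, Optional


def average_size_bytes(fetched_sizes: List[int]) -> Optional[float]:
    """Mean size of the images fetched for pixel diffing.

    Added or removed images are never fetched, so they are simply absent
    from `fetched_sizes`. Returns None when nothing was fetched.
    """
    if not fetched_sizes:
        return None
    return sum(fetched_sizes) / len(fetched_sizes)


def e2e_duration_s(artifact_uploaded_at: datetime, diff_completed_at: datetime) -> float:
    """Seconds elapsed between artifact upload and diff completion."""
    return (diff_completed_at - artifact_uploaded_at).total_seconds()


uploaded = datetime(2026, 3, 18, 20, 0, 0, tzinfo=timezone.utc)
completed = datetime(2026, 3, 18, 20, 1, 30, tzinfo=timezone.utc)
print(average_size_bytes([100, 300]))       # prints 200.0
print(e2e_duration_s(uploaded, completed))  # prints 90.0
```

Returning None for an empty list lets the caller skip emission instead of reporting a misleading average of zero.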

All metrics use sample_rate=1.0. No Amplitude analytics events are included — this is Datadog/tracemetrics only, targeting the existing Preprod Health dashboard.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Mar 18, 2026
@NicoHinderling NicoHinderling marked this pull request as ready for review March 18, 2026 20:39
@NicoHinderling NicoHinderling requested a review from a team as a code owner March 18, 2026 20:39
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

NicoHinderling and others added 4 commits March 18, 2026 15:36
…ycle

Instruments the snapshot upload endpoint and compare_snapshots task with
distribution metrics to enable observability into upload patterns, image
volumes, diff durations, and build quality signals on the Preprod Health
dashboard.

Co-Authored-By: Claude <noreply@anthropic.com>

diff_duration_s was measured from comparison.date_added, which is set
on first attempt creation via get_or_create. On retries the same record
is reused, so the metric would include idle time between attempts.
Now measured from task_start_time captured at the top of the function.

zero_changes was not accounting for renamed_pairs, causing rename-only
diffs to incorrectly increment the zero_changes counter.

Co-Authored-By: Claude <noreply@anthropic.com>

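
To illustrate the timing fix in the commit above, here is a small sketch comparing the two reference points. The function `run_diff`, its parameters, and the injected `clock` are hypothetical stand-ins rather than the task's real code, and both timestamps are assumed to be seconds on the same clock.

```python
import time


def run_diff(do_work, record_created_at, clock=time.monotonic):
    """Run one attempt and report its duration from both reference points.

    `record_created_at` stands in for the comparison record's creation time,
    which is reused across retries. `task_start_time` is captured per attempt.
    """
    task_start_time = clock()  # captured at the top of the function
    do_work()
    finished_at = clock()
    return {
        "from_task_start": finished_at - task_start_time,
        "from_record_creation": finished_at - record_created_at,
    }


# Simulated retry: the record was created at t=40, but this attempt
# only started at t=100 and finished at t=102.5.
ticks = iter([100.0, 102.5])
result = run_diff(lambda: None, record_created_at=40.0, clock=lambda: next(ticks))
print(result)
# prints {'from_task_start': 2.5, 'from_record_creation': 62.5}
```

The 60 seconds of idle time between attempts only shows up in the value measured from record creation.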
The PreprodArtifact count query for bundles_per_commit sat unguarded
in the critical path before downstream task dispatch. A DB timeout or
error would prevent create_preprod_snapshot_status_check_task from
ever being dispatched, orphaning the artifact. Wrap in try/except so
a metrics failure cannot block the upload completion flow.

Co-Authored-By: Claude <noreply@anthropic.com>

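
The guard described in the commit above can be sketched as follows. The callables `count_bundles`, `emit_metric`, and `dispatch_status_check` are hypothetical placeholders for the count query, the metrics call, and the downstream task dispatch; the sketch only shows the control flow, not the endpoint's actual code.

```python
import logging

logger = logging.getLogger(__name__)


def complete_upload(count_bundles, emit_metric, dispatch_status_check):
    """Emit the bundles_per_commit metric without letting it block dispatch."""
    try:
        # The count query may time out or fail; that must not be fatal.
        emit_metric("preprod.snapshots.upload.bundles_per_commit", count_bundles())
    except Exception:
        logger.exception("Failed to emit bundles_per_commit metric")
    # Always reached, so the artifact is never orphaned by a metrics failure.
    dispatch_status_check()


def failing_count():
    raise TimeoutError("simulated DB timeout")


dispatched = []
complete_upload(failing_count, lambda key, value: None, lambda: dispatched.append(True))
print(dispatched)  # prints [True]
```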
When all eligible image pairs error (e.g., objectstore outage, oversized
images), changed_count stays 0 and added/removed/renamed are empty, causing
zero_changes to fire incorrectly. Add error_count check so the metric only
increments when the diff genuinely found no differences.

Co-Authored-By: Claude <noreply@anthropic.com>
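
Taken together, the commits above imply a condition for zero_changes that can be sketched as a single predicate. The function name and signature are hypothetical; the field names follow the commit messages.

```python
def is_zero_change_diff(changed_count, added, removed, renamed_pairs, error_count):
    """True only when the diff completed cleanly and found no differences.

    Renames count as changes, and any errored image pair means the result
    is unknown rather than unchanged.
    """
    return (
        changed_count == 0
        and not added
        and not removed
        and not renamed_pairs
        and error_count == 0
    )


print(is_zero_change_diff(0, [], [], [], 0))                    # prints True
print(is_zero_change_diff(0, [], [], [("a.png", "b.png")], 0))  # prints False
print(is_zero_change_diff(0, [], [], [], 5))                    # prints False
```

The last call models the outage case: every pair errored, so nothing was counted as changed, yet the diff should not be reported as having zero changes.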
v["image_file_name"] for v in images.values() if v.get("image_file_name")
)
duplicate_count = sum(c - 1 for c in file_name_counts.values() if c > 1)
metrics.distribution(
Member

How often would we actually get duplicate images? This seems like a strange thing to record metrics on.

Contributor Author

I think Max added that in my requests doc.

Contributor Author

I'll add a comment as a reminder to consider removing this.

Contributor Author

Actually, I'm just going to kill this for now.

Labels

Scope: Backend Automatically applied to PRs that change backend components

2 participants