OCM-19740 - local observability #571

Open
Alcamech wants to merge 1 commit into openshift:master from Alcamech:OCM-19740

Alcamech commented Jan 26, 2026

What type of PR is this?

Documentation

What this PR does / why we need it?

Adds an "Adding New Metrics" guide to docs/metrics.md with step-by-step instructions for defining, registering, and verifying Prometheus metrics locally.
Adds a Metrics Tracing guide to docs/metrics-tracing.md that provides a comprehensive mapping of all Prometheus metrics.

Which Jira/Github issue(s) this PR fixes?

OCM-19740

Contributor

openshift-ci bot commented Jan 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Alcamech
Once this PR has been reviewed and has the lgtm label, please assign clcollins for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Jan 26, 2026

@Alcamech: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

- Invalid PDBs could block node drains
- Manual interventions detected

**Alert**: `UpgradeClusterCheckFailedSRE` (paging)
Contributor

Do we have this alert?
I remember we didn't implement this alert.

Author

This is referenced in the pagingAlerts slice in pkg/metrics/metrics.go:74-81 but I do not see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml

Do you want me to remove this reference from the doc and pagingAlerts slice?


**Paging Alerts Tracked** (from `pkg/metrics/metrics.go:74-81`):
- `UpgradeConfigValidationFailedSRE`
- `UpgradeClusterCheckFailedSRE`
Contributor

I don't remember us having this alert.

Author

This is referenced in the pagingAlerts slice in pkg/metrics/metrics.go:74-81 but I do not see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml

Do you want me to remove this reference from the doc and pagingAlerts slice?

Contributor

This alert was removed back in 2021. Maybe you can comment it out.

- `UpgradeControlPlaneUpgradeTimeoutSRE`
- `UpgradeNodeUpgradeTimeoutSRE`
- `UpgradeNodeDrainFailedSRE`

Contributor

The UpgradeStateNotificationFailureSRE alert is missing.

Author

Alcamech, Mar 4, 2026

This is commented out in the pagingAlerts slice in pkg/metrics/metrics.go:74-81

//"UpgradeNotificationFailedSRE", TODO: OSD-26790 - Create an Alert in mcc repo

but I do see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml

Do you also want me to uncomment this in the pagingAlerts slice?

Contributor

The name is UpgradeStateNotificationFailureSRE now.
