[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Alcamech. The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files: Approvers can indicate their approval by writing |
|
@Alcamech: all tests passed! Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| - Invalid PDBs could block node drains | ||
| - Manual interventions detected | ||
|
|
||
| **Alert**: `UpgradeClusterCheckFailedSRE` (paging) |
Do we have this alert?
I remember we didn't implement this alert.
This is referenced in the pagingAlerts slice in pkg/metrics/metrics.go:74-81 but I do not see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml
Do you want me to remove this reference from the doc and pagingAlerts slice?
|
|
||
| **Paging Alerts Tracked** (from `pkg/metrics/metrics.go:74-81`): | ||
| - `UpgradeConfigValidationFailedSRE` | ||
| - `UpgradeClusterCheckFailedSRE` |
I don't remember us having this alert.
This is referenced in the pagingAlerts slice in pkg/metrics/metrics.go:74-81 but I do not see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml
Do you want me to remove this reference from the doc and pagingAlerts slice?
This alert was removed in 2021. Maybe you can comment it out.
| - `UpgradeControlPlaneUpgradeTimeoutSRE` | ||
| - `UpgradeNodeUpgradeTimeoutSRE` | ||
| - `UpgradeNodeDrainFailedSRE` | ||
|
|
The UpgradeStateNotificationFailureSRE alert is missing.
This is commented out in the pagingAlerts slice in pkg/metrics/metrics.go:74-81
//"UpgradeNotificationFailedSRE", TODO: OSD-26790 - Create an Alert in mcc repo
but I do see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml
Do you also want me to uncomment this in the pagingAlerts slice?
The name is UpgradeStateNotificationFailureSRE now
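For reference, here is a minimal sketch of how the pagingAlerts slice might look if both review suggestions were applied: UpgradeClusterCheckFailedSRE commented out and the notification alert listed under its current name. The real declaration in pkg/metrics/metrics.go is not shown in this conversation, so the `[]string` type and the `isPagingAlert` helper are assumptions made for illustration only.

```go
package main

import "fmt"

// pagingAlerts mirrors the slice discussed in the review threads.
// The entries are taken from the doc under review; the actual
// declaration in pkg/metrics/metrics.go may differ (assumption).
var pagingAlerts = []string{
	"UpgradeConfigValidationFailedSRE",
	// "UpgradeClusterCheckFailedSRE", // commented out per review: alert no longer exists
	"UpgradeControlPlaneUpgradeTimeoutSRE",
	"UpgradeNodeUpgradeTimeoutSRE",
	"UpgradeNodeDrainFailedSRE",
	"UpgradeStateNotificationFailureSRE", // per review: current name of the notification alert
}

// isPagingAlert is a hypothetical helper (not from the repository) that
// reports whether an alert name is present in pagingAlerts.
func isPagingAlert(name string) bool {
	for _, alert := range pagingAlerts {
		if alert == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isPagingAlert("UpgradeClusterCheckFailedSRE"))       // false
	fmt.Println(isPagingAlert("UpgradeStateNotificationFailureSRE")) // true
}
```

Whether the slice and the doc should list exactly the alerts defined in the managed-cluster-config PrometheusRule is for the reviewers to confirm.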
What type of PR is this?
Documentation
What this PR does / why we need it?
Adds an "Adding New Metrics" guide to docs/metrics.md with step-by-step instructions for defining, registering, and verifying Prometheus metrics locally.
Adds a Metrics Tracing guide to docs/metrics-tracing.md that provides a comprehensive mapping of all Prometheus metrics
Which Jira/Github issue(s) this PR fixes?
OCM-19740