OTA-1546: Add a metric and an alert for conditional risk conditions #1318
base: main

Changes from all commits: a4e2f8a, 0d1ee9c, 2b5364a, f87500a, b4ff403, 9952662, 5d598cd, e975641, f458075
New PrometheusRule manifest (@@ -0,0 +1,26 @@):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    k8s-app: cluster-version-operator
  name: cluster-version-operator-accept-risks
  namespace: openshift-cluster-version
  annotations:
    kubernetes.io/description: Alerting rules for the feature gate ClusterUpdateAcceptRisks.
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/feature-gate: "ClusterUpdateAcceptRisks"
spec:
  groups:
  - name: cluster-version-tech-preview
    rules:
    - alert: OpenShiftUpdateRiskMightApply
      annotations:
        summary: The cluster might have been exposed to the conditional update risk for 10 minutes.
        description: The conditional update risk {{ "{{ $labels.risk }}" }} might apply to the cluster because of {{ "{{ $labels.reason }}" }}, and the cluster update to a version exposed to the risk is not recommended. For more information refer to 'oc adm upgrade'.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/OpenShiftUpdateRiskMightApply.md
      expr: |
        max by (namespace, risk, reason) (last_over_time(cluster_version_risk_conditions{job="cluster-version-operator", condition="Applies"}[5m]) != 0)
      for: 10m
      labels:
        severity: warning
```
```diff
@@ -14,6 +14,7 @@ import (
 	"github.com/prometheus/client_golang/prometheus/promhttp"
 	corev1 "k8s.io/api/core/v1"
 	apierrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/labels"
 	"k8s.io/apimachinery/pkg/util/sets"
 	"k8s.io/apiserver/pkg/server/dynamiccertificates"
@@ -52,6 +53,7 @@ type operatorMetrics struct {
 	capability                                            *prometheus.GaugeVec
 	clusterOperatorUp                                     *prometheus.GaugeVec
 	clusterOperatorConditions                             *prometheus.GaugeVec
+	clusterVersionRiskConditions                          *prometheus.GaugeVec
 	clusterOperatorConditionTransitions                   *prometheus.GaugeVec
 	clusterInstaller                                      *prometheus.GaugeVec
 	clusterVersionOperatorUpdateRetrievalTimestampSeconds *prometheus.GaugeVec
@@ -102,6 +104,10 @@ penultimate completed version for 'completed'.
 			Name: "cluster_operator_conditions",
 			Help: "Report the conditions for active cluster operators. 0 is False and 1 is True.",
 		}, []string{"name", "condition", "reason"}),
+		clusterVersionRiskConditions: prometheus.NewGaugeVec(prometheus.GaugeOpts{
+			Name: "cluster_version_risk_conditions",
+			Help: "Report the risk conditions for the cluster version. -1 is Unknown, 0 is False and 1 is True.",
+		}, []string{"condition", "risk", "reason"}),
```
Reviewer: Do we know all `reason` values in advance?
Author: For now, there are 3 of them: see cluster-version-operator/pkg/cvo/availableupdates.go, lines 534 to 536 at 9cec473.
Reviewer: Hmm, so right now we have only the following combinations? In that case, the `reason` label seems redundant with the metric's value. Another option would be to expose all possible reasons and report 1 for the actual reason (similar to a pattern used by kube-state-metrics). Intermittent series are harder to reason about (https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics).
Author: Yes, but we might add more in the future. Then we would have to use the `reason` label in our alert expression. I think it would work too. Comparing to the current implementation: the suggested way would have more samples (#samples = #risks × #reasons) than the above (#samples = #risks). Do I understand it correctly? But the current metric is not missing (as described in avoid-missing-metrics), it just has fewer samples. I want to keep the implementation close to the existing ones, if that's OK with you. It is already challenging enough for me to follow, and switching the method for this one metric would make the code even harder to read in the future. Let me know if I misunderstood the point.
Member: I find that approach harder to consume. And https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics isn't really selling me on that approach. I agree that it's good to have a signal for a given (riskName, conditionName) tuple. I don't see a problem with that signal flipping between multiple time-series as the current reason changes.

Reviewer: I'm advocating for a solution which is already used and documented by kube-state-metrics, hence familiar to our users. Another advantage of declaring all series upfront is that it's easier for users to understand the metric (you don't have to guess which label values can exist). Taking the current alert expression: if the condition status is […], which would avoid alert flapping upon transient scrape failures or blips in risk evaluation. Re propagating the reason as a metric label to express more failure scenarios: my 2 cents would be that the PromQL evaluator should have its own instrumentation (e.g. #1292). Just to wrap up: I don't claim to enforce anything here, I'm only providing free advice and it's fine if you dismiss my suggestions :) Having said that, upstream instrumentation practices didn't grow out of nowhere, and non-predictable series are for sure harder to work with in the end (not only for alerting but also for users).
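For comparison, the kube-state-metrics-style pattern being advocated here can be sketched in plain Go. This is only an illustration: a map stands in for a Prometheus GaugeVec, and the function name `seriesForRisk` and the fixed status set are assumptions for the sketch, not code from this PR.

```go
package main

import "fmt"

// seriesForRisk sketches the "declare all series upfront" pattern:
// every possible status value gets its own series, and only the
// current status is set to 1, so no series is ever missing.
func seriesForRisk(current string) map[string]float64 {
	series := map[string]float64{}
	for _, status := range []string{"True", "False", "Unknown"} {
		if status == current {
			series[status] = 1
		} else {
			series[status] = 0
		}
	}
	return series
}

func main() {
	// For a risk whose Applies condition is currently Unknown, all
	// three series exist and only "Unknown" carries the value 1.
	fmt.Println(seriesForRisk("Unknown")) // map[False:0 True:0 Unknown:1]
}
```

With this shape, an alert expression can match on a concrete label value (e.g. `status="Unknown"`) instead of a numeric encoding, at the cost of #risks × #statuses series.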
Member: We can list labels in the metric description if reason-discovery is important. I'm currently not clear on why anyone would need reason-discovery, vs. just trying to human-ingest a current-reason string that's presented to them. But you also lose the information about which reason in alert labels (which get shipped up through Telemetry by default), and flipping between reasons seems like an unlikely corner case. And if we are concerned with flipping between reasons under the alert […] #1292 would tell us there were problems with PromQL evaluation, but will not currently tell us which PromQL is failing to evaluate. […] I'm not asserting the upstream approach is wrong, just that the benefit is not yet clear to me ;) I think having a hint at the reason will likely help responding humans, both in-cluster admins and Red-Hat-side update-service maintainers, figure out what needs to be fixed. For in-cluster admins, we don't need the reason in the […]
Author: Thanks to both of you for the feedback.
```diff
 		clusterOperatorConditionTransitions: prometheus.NewGaugeVec(prometheus.GaugeOpts{
 			Name: "cluster_operator_condition_transitions",
 			Help: "Reports the number of times that a condition on a cluster operator changes status",
```
```diff
@@ -487,6 +493,7 @@ func (m *operatorMetrics) Describe(ch chan<- *prometheus.Desc) {
 	ch <- m.capability.WithLabelValues("").Desc()
 	ch <- m.clusterOperatorUp.WithLabelValues("", "", "").Desc()
 	ch <- m.clusterOperatorConditions.WithLabelValues("", "", "").Desc()
+	ch <- m.clusterVersionRiskConditions.WithLabelValues("", "", "").Desc()
 	ch <- m.clusterOperatorConditionTransitions.WithLabelValues("", "").Desc()
 	ch <- m.clusterInstaller.WithLabelValues("", "", "").Desc()
 	ch <- m.clusterVersionOperatorUpdateRetrievalTimestampSeconds.WithLabelValues("").Desc()
```
```diff
@@ -508,6 +515,26 @@ func (m *operatorMetrics) collectConditionalUpdates(ch chan<- prometheus.Metric,
 	}
 }
 
+func (m *operatorMetrics) collectConditionalUpdateRisks(ch chan<- prometheus.Metric, risks []configv1.ConditionalUpdateRisk) {
+	for _, risk := range risks {
+		for _, condition := range risk.Conditions {
+			if condition.Type != internal.ConditionalUpdateRiskConditionTypeApplies {
+				continue
+			}
+
+			g := m.clusterVersionRiskConditions.WithLabelValues(condition.Type, risk.Name, condition.Reason)
+			switch condition.Status {
+			case metav1.ConditionTrue:
+				g.Set(1)
+			case metav1.ConditionUnknown:
+				g.Set(-1)
+			}
+			// We do not need to do g.Set(0) as it is done when g is initialized
+			ch <- g
+		}
+	}
+}
```
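The status-to-value mapping inside `collectConditionalUpdateRisks` can be illustrated as a standalone sketch. Here `riskConditionValue` is a hypothetical name introduced for illustration, and plain strings stand in for `metav1.ConditionStatus` values; the real code relies on the gauge's zero initialization for the False case rather than an explicit `Set(0)`.

```go
package main

import "fmt"

// riskConditionValue mirrors the switch above: 1 for True, -1 for
// Unknown, and 0 otherwise (False falls through to the gauge's
// initial zero value in the real code).
func riskConditionValue(status string) float64 {
	switch status {
	case "True":
		return 1
	case "Unknown":
		return -1
	default:
		return 0
	}
}

func main() {
	for _, s := range []string{"True", "False", "Unknown"} {
		fmt.Printf("%s => %v\n", s, riskConditionValue(s))
	}
	// Output:
	// True => 1
	// False => 0
	// Unknown => -1
}
```

This encoding is what the alert expression `... != 0` keys on: both True (1) and Unknown (-1) are treated as "the risk might apply", while False (0) is filtered out.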
```diff
 
 // Collect collects metrics from the operator into the channel ch
 func (m *operatorMetrics) Collect(ch chan<- prometheus.Metric) {
 	current := m.optr.currentVersion()
```
```diff
@@ -653,6 +680,9 @@ func (m *operatorMetrics) Collect(ch chan<- prometheus.Metric) {
 		}
 
 		m.collectConditionalUpdates(ch, cv.Status.ConditionalUpdates)
+		if m.optr.shouldReconcileAcceptRisks() {
+			m.collectConditionalUpdateRisks(ch, cv.Status.ConditionalUpdateRisks)
+		}
 	}
 
 	g := m.version.WithLabelValues("current", current.Version, current.Image, completed.Version)
```
Member: Not worth holding on, but personally it bugs me that `cluster_operator_conditions` uses a plural name when each time-series will only be talking about a single condition at a time. I'd have preferred `cluster_operator_condition` there and `cluster_version_risk_condition` here. Too late to be worth changing `cluster_operator_conditions` at this point. Maybe that means we go plural again just for CVO-metric-weirdness consistency. Or maybe we can pivot to singular before going GA.

Member: Upstream has the same preference for singulars, such as `kube_node_status_condition` and `kube_deployment_status_condition` from kube-state-metrics' CHANGELOG.md.

Author: Works for me. If I get time after completing the features, I will make a pull to rename both.