Add base alert management API #657
Open
machadovilaca
wants to merge
21
commits into
openshift:alerts-management-api
from
machadovilaca:add-alert-management-api-base
5aad6ea
Add base alert management API
machadovilaca fb8a751
Change IsPlatformAlertRule implementation (#1)
machadovilaca f622f25
Set source label to platform on OpenShift alerting rules (#3)
machadovilaca 7f42262
Add persistent relabeled alerts rules (#5)
machadovilaca be8cb4c
Add post API (#2)
avlitman 14d2066
Add patch API (#4)
avlitman 8db3f3d
Update GET /alerts API filters to flat labels filtering format (#10)
sradco 166d031
Fix alertRelabelConfic logic (#9)
sradco f3f53f7
Add owner label (#6)
avlitman ac423c6
Drop relabeled alert rules persistent configmap (#13)
machadovilaca 97375a3
Re-add missing metav1 import (#15)
avlitman f3f2c6b
Add GitHub action unit tests (#16)
machadovilaca 2c567fe
Add to PATCH API drop and restore platform alerts (#12)
avlitman fb8a39d
Add const labels file (#17)
avlitman 1871c9f
Add support for AlertingRule CRs (#14)
machadovilaca e2c0b31
Update alert rule id format (#19)
sradco e4bdd22
Add component mapping to alerts (#8)
sradco 1cd16c7
Add create platform alert rule via AlertingRule CRD (#20)
machadovilaca e9b6e01
Update GET rules api and adds health details (#11)
sradco 157b3fc
Update APIs based on managed_by labels (#18)
avlitman 2bdfc6d
Add missing ARC support functions (#21)
sradco
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| name: Unit Tests | ||
|
|
||
| on: | ||
| pull_request: | ||
| branches: | ||
| - add-alert-management-api-base | ||
|
|
||
| jobs: | ||
| test: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - name: Checkout code | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Set up Go | ||
| uses: actions/setup-go@v5 | ||
| with: | ||
| go-version-file: go.mod | ||
|
|
||
| - name: Run tests | ||
| run: go test -count=1 $(go list ./... | grep -v /test/e2e) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| ## Alert Management Notes | ||
|
|
||
| This document covers alert management behavior and prerequisites for the monitoring plugin. | ||
|
|
||
| ### User workload monitoring prerequisites | ||
|
|
||
| To include **user workload** alerts and rules in `/api/v1/alerting/alerts` and `/api/v1/alerting/rules`, the user workload monitoring stack must be enabled. Follow the OpenShift documentation for enabling and configuring UWM: | ||
|
|
||
| https://docs.redhat.com/en/documentation/monitoring_stack_for_red_hat_openshift/4.20/html/configuring_user_workload_monitoring/configuring-alerts-and-notifications-uwm | ||
|
|
||
| #### How the plugin reads user workload alerts/rules | ||
|
|
||
| The plugin prefers **Thanos tenancy** for user workload alerts/rules (RBAC-scoped, requires a namespace parameter). When the client does not provide a `namespace` filter, the plugin discovers candidate namespaces and queries Thanos tenancy per-namespace, using the end-user bearer token. | ||
|
|
||
| Routes in `openshift-user-workload-monitoring` are treated as **fallbacks** (and are also used for some health checks and pending state retrieval). | ||
|
|
||
| If you want to create the user workload Prometheus route (optional), you can expose the service: | ||
|
|
||
| ```shell | ||
| oc -n openshift-user-workload-monitoring expose svc/prometheus-user-workload-web --name=prometheus-user-workload-web --port=web | ||
| ``` | ||
|
|
||
| If the route is missing/unreachable but tenancy is healthy, the plugin should still return user workload data and suppress route warnings. | ||
|
|
||
| #### Alert states | ||
|
|
||
| - `/api/v1/alerting/alerts?state=pending`: pending alerts come from Prometheus. | ||
| - `/api/v1/alerting/alerts?state=firing`: firing alerts come from Alertmanager when available. | ||
| - `/api/v1/alerting/alerts?state=silenced`: silenced alerts come from Alertmanager (requires an Alertmanager endpoint). | ||
|
|
||
| ### Alertmanager routing choices | ||
|
|
||
| OpenShift supports routing user workload alerts to: | ||
|
|
||
| - The **platform Alertmanager** (default instance) | ||
| - A **separate Alertmanager** for user workloads | ||
| - **External Alertmanager** instances | ||
|
|
||
| This is a cluster configuration choice and does not change the plugin API shape. The plugin reads alerts from Alertmanager (for firing/silenced) and Prometheus (for pending), then merges platform and user workload results when available. | ||
|
|
||
| The plugin intentionally reads from only the in-cluster Alertmanager endpoints. Supporting multiple external Alertmanagers would introduce ambiguous alert state and silencing outcomes because each instance can apply different routing, inhibition, and silence configurations. |
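The state-to-source mapping described under "Alert states" can be sketched in Go as below. This is only an illustration, not code from this PR: the names `alertSource`, `sourceForState`, and `errNoAlertmanager` are hypothetical, and the fallback to Prometheus for `firing` when Alertmanager is unavailable is an assumption read into the phrase "when available".

```go
package main

import (
	"errors"
	"fmt"
)

// alertSource names the backend an alert state is read from (hypothetical type).
type alertSource string

const (
	sourcePrometheus   alertSource = "prometheus"
	sourceAlertmanager alertSource = "alertmanager"
)

// errNoAlertmanager is returned when a state can only be served by Alertmanager.
var errNoAlertmanager = errors.New("silenced alerts require an Alertmanager endpoint")

// sourceForState picks the backend for a ?state= filter value.
func sourceForState(state string, alertmanagerAvailable bool) (alertSource, error) {
	switch state {
	case "pending":
		// Pending alerts come from Prometheus.
		return sourcePrometheus, nil
	case "firing":
		if alertmanagerAvailable {
			return sourceAlertmanager, nil
		}
		// Assumption: fall back to Prometheus when Alertmanager is unavailable.
		return sourcePrometheus, nil
	case "silenced":
		// Silences only exist in Alertmanager, so there is no fallback.
		if !alertmanagerAvailable {
			return "", errNoAlertmanager
		}
		return sourceAlertmanager, nil
	default:
		return "", fmt.Errorf("unknown alert state %q", state)
	}
}

func main() {
	for _, state := range []string{"pending", "firing", "silenced"} {
		src, err := sourceForState(state, true)
		fmt.Println(state, src, err)
	}
}
```

Keeping the decision in one small function makes the "silenced requires Alertmanager" rule easy to unit test independently of any HTTP client.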
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,213 @@ | ||
| # Alert Rule Classification - Design and Usage | ||
|
|
||
| ## Overview | ||
| The backend classifies Prometheus alerting rules into a “component” and an “impact layer”. It: | ||
| - Computes an `openshift_io_alert_rule_id` per alerting rule. | ||
| - Determines component/layer based on matcher logic and rule labels. | ||
| - Allows users to override classification via one ConfigMap per rule namespace (`alert-classification-overrides-<rule-namespace>`), stored in the plugin namespace. | ||
| - Enriches the Alerts API response with `openshift_io_alert_rule_id`, `openshift_io_alert_component`, and `openshift_io_alert_layer`. | ||
|
|
||
| This document explains how it works, how to override, and how to test it. | ||
|
|
||
|
|
||
| ## Terminology | ||
| - openshift_io_alert_rule_id: Identifier for an alerting rule. Computed from a canonicalized view of the rule definition and encoded as `rid_` + base64url(nopad(sha256(payload))). Independent of `PrometheusRule` name. | ||
| - component: Logical owner of the alert (e.g., `kube-apiserver`, `etcd`, a namespace, etc.). | ||
| - layer: Impact scope. Allowed values: | ||
| - `cluster` | ||
| - `namespace` | ||
|
|
||
| Notes: | ||
| - **Stability**: | ||
| - The id is **always derived from the rule spec**. If the rule definition changes (expr/for/non-system labels/name), the id may change. | ||
| - For **platform rules**, this API currently only supports label updates via `AlertRelabelConfig` (not editing expr/for), so the id is effectively stable unless the upstream operator changes the rule definition. | ||
| - For **user-defined rules**, the API stamps the computed id into the `PrometheusRule` rule labels. If you update the rule definition, the API returns the **new** id and migrates any existing classification override to the new id. | ||
| - Layer values are validated as `cluster|namespace` when set. To remove an override, clear the field (via API `null` or by removing the ConfigMap entry); empty/invalid values are ignored at read time. | ||
|
|
||
| ## Rule ID computation (openshift_io_alert_rule_id) | ||
| Location: `pkg/alert_rule/alert_rule.go` | ||
|
|
||
| The backend computes a specHash-like value from: | ||
| - `kind`/`name`: `alert` plus the value of the `alert:` field, or `record` plus the value of the `record:` field | ||
| - `expr`: trimmed with consecutive whitespace collapsed | ||
| - `for`: trimmed (duration string as written in the rule) | ||
| - `labels`: only non-system labels | ||
| - excludes labels with `openshift_io_` prefix and the `alertname` label | ||
| - drops empty values | ||
| - keeps only valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`) | ||
| - sorted by key and joined as `key=value` lines | ||
|
|
||
| Annotations are intentionally ignored to reduce id churn on documentation-only changes. | ||
|
|
||
| ## Classification Logic (How component/layer are determined) | ||
| Location: `pkg/alertcomponent/matcher.go` | ||
|
|
||
| 1) The code adapts `cluster-health-analyzer` matchers: | ||
| - CVO-related alerts (update/upgrade) → component/layer based on known patterns | ||
| - Compute / node-related alerts | ||
| - Core control plane components (renamed to layer `cluster`) | ||
| - Workload/namespace-level alerts (renamed to layer `namespace`) | ||
|
|
||
| 2) Fallback: | ||
| - If the computed component is empty or “Others”, we set: | ||
| - `component = other` | ||
| - `layer` derived from source: | ||
| - `openshift_io_alert_source=platform` → `cluster` | ||
| - `openshift_io_prometheus_rule_namespace=openshift-monitoring` → `cluster` | ||
| - `prometheus` label starting with `openshift-monitoring/` → `cluster` | ||
| - otherwise → `namespace` | ||
|
|
||
| 3) Result: | ||
| - Each alerting rule is assigned a `(component, layer)` tuple following the above logic. | ||
|
|
||
| ## Developer Overrides via Rule Labels (Recommended) | ||
| If you want explicit component/layer values and do not want to rely on the matcher, set | ||
| these labels on each rule in your `PrometheusRule`: | ||
| - `openshift_io_alert_rule_component` | ||
| - `openshift_io_alert_rule_layer` | ||
|
|
||
| Both are validated the same way as API overrides: | ||
| - `component`: 1-253 chars, alphanumeric + `._-`, must start/end alphanumeric | ||
| - `layer`: `cluster` or `namespace` | ||
|
|
||
| When these labels are present and valid, they override matcher-derived values. | ||
|
|
||
| ## User Overrides (ConfigMap) | ||
| Location: `pkg/management/update_classification.go`, `pkg/management/get_alerts.go` | ||
|
|
||
| - The backend stores overrides in the plugin namespace, sharded by target rule namespace: | ||
| - Name: `alert-classification-overrides-<rule-namespace>` | ||
| - Namespace: the monitoring plugin's namespace | ||
| - Required label: | ||
| - `monitoring.openshift.io/type=alert-classification-overrides` | ||
| - Recommended label: | ||
| - `app.kubernetes.io/managed-by=openshift-console` | ||
|
|
||
| - Data layout: | ||
| - Key: base64url(nopad(UTF-8 bytes of `<openshift_io_alert_rule_id>`)) | ||
| - This keeps ConfigMap keys opaque and avoids relying on any particular id character set. | ||
| - Value: JSON object with a `classification` field that holds component/layer. | ||
| - Optional metadata fields such as `alertName`, `prometheusRuleName`, and | ||
| `prometheusRuleNamespace` may be included for readability; they are ignored by | ||
| the backend. | ||
| - Dynamic overrides: | ||
| - `openshift_io_alert_rule_component_from`: derive component from an alert label key. | ||
| - `openshift_io_alert_rule_layer_from`: derive layer from an alert label key. | ||
|
|
||
| Example: | ||
| ```json | ||
| { | ||
| "alertName": "ClusterOperatorDown", | ||
| "prometheusRuleName": "cluster-version", | ||
| "prometheusRuleNamespace": "openshift-cluster-version", | ||
| "classification": { | ||
| "openshift_io_alert_rule_component_from": "name", | ||
| "openshift_io_alert_rule_layer": "cluster" | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| Notes: | ||
| - Overrides are only read when the required `monitoring.openshift.io/type` label is present. | ||
| - Invalid component/layer values are ignored for that entry. | ||
| - `*_from` values must be valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`). | ||
| - If a `*_from` label is present but the alert does not carry that label or the derived | ||
| value is invalid, the backend falls back to static values (if present) or defaults. | ||
| - If both component and layer are empty, the entry is removed. | ||
|
|
||
|
|
||
| ## Alerts API Enrichment | ||
| Location: `pkg/management/get_alerts.go`, `pkg/k8s/prometheus_alerts.go` | ||
|
|
||
| - Endpoint: `GET /api/v1/alerting/alerts` (Prometheus-compatible schema) | ||
| - The backend fetches active alerts and enriches each alert with: | ||
| - `openshift_io_alert_rule_id` | ||
| - `openshift_io_alert_component` | ||
| - `openshift_io_alert_layer` | ||
| - `prometheusRuleName`: name of the PrometheusRule resource the alert originates from | ||
| - `prometheusRuleNamespace`: namespace of that PrometheusRule resource | ||
| - `alertingRuleName`: name of the AlertingRule CR that generated the PrometheusRule (empty when the PrometheusRule is not owned by an AlertingRule CR) | ||
| - Prometheus compatibility: | ||
| - Base response matches Prometheus `/api/v1/alerts`. | ||
| - Additional fields are additive and safe for clients like Perses. | ||
|
|
||
| ## Prometheus/Thanos Sources | ||
| Location: `pkg/k8s/prometheus_alerts.go` | ||
|
|
||
| - Order of candidates: | ||
| 1) Thanos Route `thanos-querier` at `/api` + `/v1/alerts` (oauth-proxied) | ||
| 2) In-cluster Thanos service `https://thanos-querier.openshift-monitoring.svc:9091/api/v1/alerts` | ||
| 3) In-cluster Prometheus `https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts` | ||
| 4) In-cluster Prometheus (plain HTTP) `http://prometheus-k8s.openshift-monitoring.svc:9090/api/v1/alerts` (fallback) | ||
| 5) Prometheus Route `prometheus-k8s` at `/api/v1/alerts` | ||
|
|
||
| - TLS and Auth: | ||
| - Bearer token: service account token from in-cluster config. | ||
| - CA trust: system pool + `SSL_CERT_FILE` + `/var/run/configmaps/service-ca/service-ca.crt`. | ||
|
|
||
| RBAC: | ||
| - Read routes in `openshift-monitoring`. | ||
| - Access `prometheuses/api` as needed for oauth-proxied endpoints. | ||
|
|
||
| ## Updating Rules Classification | ||
| APIs: | ||
| - Single update: | ||
| - Method: `PATCH /api/v1/alerting/rules/{ruleId}` | ||
| - Request body: | ||
| ```json | ||
| { | ||
| "classification": { | ||
| "openshift_io_alert_rule_component": "team-x", | ||
| "openshift_io_alert_rule_layer": "namespace", | ||
| "openshift_io_alert_rule_component_from": "name", | ||
| "openshift_io_alert_rule_layer_from": "layer" | ||
| } | ||
| } | ||
| ``` | ||
| - `openshift_io_alert_rule_layer`: `cluster` or `namespace` | ||
| - To remove a classification override, set the field to `null` (e.g. `"openshift_io_alert_rule_layer": null`). | ||
| - Response: | ||
| - 200 OK with a status payload (same format as other rule PATCH responses), where `status_code` is 204 on success. | ||
| - Standard error body on failure (400 validation, 404 not found, etc.) | ||
| - Bulk update: | ||
| - Method: `PATCH /api/v1/alerting/rules` | ||
| - Request body: | ||
| ```json | ||
| { | ||
| "ruleIds": ["<id-a>", "<id-b>"], | ||
| "classification": { | ||
| "openshift_io_alert_rule_component": "etcd", | ||
| "openshift_io_alert_rule_layer": "cluster" | ||
| } | ||
| } | ||
| ``` | ||
| - Response: | ||
| - 200 OK with per-rule results (same format as other bulk rule PATCH responses). Clients should handle partial failures. | ||
|
|
||
| Direct K8s (supported for power users/GitOps): | ||
| - PATCH/PUT the ConfigMap `alert-classification-overrides-<rule-namespace>` in the monitoring plugin namespace (respect `resourceVersion`). | ||
| - Each entry is keyed by the unpadded base64url encoding of `<openshift_io_alert_rule_id>`, with a JSON payload that contains a `classification` object (`openshift_io_alert_rule_component`, `openshift_io_alert_rule_layer`). | ||
| - UI should check update permissions with SelfSubjectAccessReview before showing an editor. | ||
|
|
||
| Notes: | ||
| - These endpoints are intended for updating **classification only** (component/layer overrides), | ||
| with permissions enforced based on the rule’s ownership (platform, user workload, operator-managed, | ||
| GitOps-managed). | ||
| - To update other rule fields (expr/labels/annotations/etc.), use `PATCH /api/v1/alerting/rules/{ruleId}`. | ||
| Clients that need to update both should issue two requests. The combined operation is not atomic. | ||
| - In the ConfigMap override entries, classification is nested under `classification` | ||
| and validated as component/layer to keep it separate from generic label updates. | ||
|
|
||
| ## Security Notes | ||
| - Persist only minimal classification metadata in the `alert-classification-overrides-<rule-namespace>` ConfigMaps. | ||
|
|
||
| ## Testing and Ops | ||
| Unit tests: | ||
| - `pkg/management/get_alerts_test.go` | ||
| - Overrides from labeled ConfigMap, fallback behavior, label validation. | ||
|
|
||
| ## Future Work | ||
| - Optional CRD to formalize the schema (adds overhead; ConfigMap is sufficient today). | ||
| - Optional composite update API if we need to update rule fields and classification atomically. | ||
| - De-duplication/merge logic when aggregating alerts across sources. | ||
|
|
You can't have GitHub Actions on github.com/openshift.