Commits
21 commits
5aad6ea
Add base alert management API
machadovilaca Nov 25, 2025
fb8a751
Change IsPlatformAlertRule implementation (#1)
machadovilaca Dec 9, 2025
f622f25
Set source label to platform on OpenShift alerting rules (#3)
machadovilaca Dec 10, 2025
7f42262
Add persistent relabeled alerts rules (#5)
machadovilaca Dec 17, 2025
be8cb4c
Add post API (#2)
avlitman Dec 23, 2025
14d2066
Add patch API (#4)
avlitman Dec 23, 2025
8db3f3d
Update GET /alerts API filters to flat labels filtering format (#10)
sradco Jan 14, 2026
166d031
Fix alertRelabelConfic logic (#9)
sradco Jan 19, 2026
f3f53f7
Add owner label (#6)
avlitman Jan 26, 2026
ac423c6
Drop relabeled alert rules persistent configmap (#13)
machadovilaca Feb 4, 2026
97375a3
Re-add missing metav1 import (#15)
avlitman Feb 4, 2026
f3f2c6b
Add GitHub action unit tests (#16)
machadovilaca Feb 4, 2026
2c567fe
Add to PATCH API drop and restore platform alerts (#12)
avlitman Feb 16, 2026
fb8a39d
Add const labels file (#17)
avlitman Feb 16, 2026
1871c9f
Add support for AlertingRule CRs (#14)
machadovilaca Feb 17, 2026
e2c0b31
Update alert rule id format (#19)
sradco Feb 18, 2026
e4bdd22
Add component mapping to alerts (#8)
sradco Feb 18, 2026
1cd16c7
Add create platform alert rule via AlertingRule CRD (#20)
machadovilaca Feb 19, 2026
e9b6e01
Update GET rules api and adds health details (#11)
sradco Feb 19, 2026
157b3fc
Update APIs based on managed_by labels (#18)
avlitman Feb 24, 2026
2bdfc6d
Add missing ARC support functions (#21)
sradco Feb 24, 2026
21 changes: 21 additions & 0 deletions .github/workflows/unit-tests.yaml
A reviewer (contributor) commented on this file:

> you can't have GitHub actions on github.com/openshift
@@ -0,0 +1,21 @@
name: Unit Tests

on:
  pull_request:
    branches:
      - add-alert-management-api-base

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod

      - name: Run tests
        run: go test -count=1 $(go list ./... | grep -v /test/e2e)
8 changes: 6 additions & 2 deletions Makefile
@@ -41,7 +41,7 @@ lint-frontend:
 lint-backend:
 	go mod tidy
 	go fmt ./cmd/
-	go fmt ./pkg/
+	go fmt ./pkg/... ./internal/...
 
 .PHONY: install-backend
 install-backend:
@@ -57,7 +57,11 @@ start-backend:
 
 .PHONY: test-backend
 test-backend:
-	go test ./pkg/... -v
+	go test ./pkg/... ./internal/... -v
+
+.PHONY: test-e2e
+test-e2e:
+	PLUGIN_URL=http://localhost:9001 go test -v -timeout=150m -count=1 ./test/e2e
 
 .PHONY: build-image
 build-image:
5 changes: 3 additions & 2 deletions cmd/plugin-backend.go
@@ -8,15 +8,16 @@ import (
 	"strconv"
 	"strings"
 
-	server "github.com/openshift/monitoring-plugin/pkg"
 	"github.com/sirupsen/logrus"
+
+	server "github.com/openshift/monitoring-plugin/pkg"
 )
 
 var (
 	portArg     = flag.Int("port", 0, "server port to listen on (default: 9443)\nports 9444 and 9445 reserved for other use")
 	certArg     = flag.String("cert", "", "cert file path to enable TLS (disabled by default)")
 	keyArg      = flag.String("key", "", "private key file path to enable TLS (disabled by default)")
-	featuresArg = flag.String("features", "", "enabled features, comma separated.\noptions: ['acm-alerting', 'incidents', 'dev-config', 'perses-dashboards']")
+	featuresArg = flag.String("features", "", "enabled features, comma separated.\noptions: ['acm-alerting', 'incidents', 'dev-config', 'perses-dashboards', 'alert-management-api']")
 	staticPathArg   = flag.String("static-path", "", "static files path to serve frontend (default: './web/dist')")
 	configPathArg   = flag.String("config-path", "", "config files path (default: './config')")
 	pluginConfigArg = flag.String("plugin-config-path", "", "plugin yaml configuration")
41 changes: 41 additions & 0 deletions docs/alert-management.md
@@ -0,0 +1,41 @@
## Alert Management Notes

This document covers alert management behavior and prerequisites for the monitoring plugin.

### User workload monitoring prerequisites

To include **user workload** alerts and rules in `/api/v1/alerting/alerts` and `/api/v1/alerting/rules`, the user workload monitoring stack must be enabled. Follow the OpenShift documentation for enabling and configuring UWM:

https://docs.redhat.com/en/documentation/monitoring_stack_for_red_hat_openshift/4.20/html/configuring_user_workload_monitoring/configuring-alerts-and-notifications-uwm

#### How the plugin reads user workload alerts/rules

The plugin prefers **Thanos tenancy** for user workload alerts/rules (RBAC-scoped, requires a namespace parameter). When the client does not provide a `namespace` filter, the plugin discovers candidate namespaces and queries Thanos tenancy per-namespace, using the end-user bearer token.

Routes in `openshift-user-workload-monitoring` are treated as **fallbacks** (and are also used for some health checks and pending state retrieval).

If you want to create the user workload Prometheus route (optional), you can expose the service:

```shell
oc -n openshift-user-workload-monitoring expose svc/prometheus-user-workload-web --name=prometheus-user-workload-web --port=web
```

If the route is missing/unreachable but tenancy is healthy, the plugin should still return user workload data and suppress route warnings.

#### Alert states

- `/api/v1/alerting/alerts?state=pending`: pending alerts come from Prometheus.
- `/api/v1/alerting/alerts?state=firing`: firing alerts come from Alertmanager when available.
- `/api/v1/alerting/alerts?state=silenced`: silenced alerts come from Alertmanager (requires an Alertmanager endpoint).

### Alertmanager routing choices

OpenShift supports routing user workload alerts to:

- The **platform Alertmanager** (default instance)
- A **separate Alertmanager** for user workloads
- **External Alertmanager** instances

This is a cluster configuration choice and does not change the plugin API shape. The plugin reads alerts from Alertmanager (for firing/silenced) and Prometheus (for pending), then merges platform and user workload results when available.
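The merge of platform and user workload results mentioned above can be sketched as follows. This is a simplification, not the plugin's actual implementation: the `Alert` type and the label-fingerprint deduplication key are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Alert is a minimal stand-in for a Prometheus-style alert.
type Alert struct {
	Labels map[string]string
	State  string
}

// fingerprint builds a stable key from sorted label pairs.
func fingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + labels[k] + ";")
	}
	return b.String()
}

// mergeAlerts combines platform and user workload alerts, keeping the
// platform copy when the same alert appears in both sources.
func mergeAlerts(platform, userWorkload []Alert) []Alert {
	seen := make(map[string]bool, len(platform))
	out := append([]Alert(nil), platform...)
	for _, a := range platform {
		seen[fingerprint(a.Labels)] = true
	}
	for _, a := range userWorkload {
		if !seen[fingerprint(a.Labels)] {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	p := []Alert{{Labels: map[string]string{"alertname": "Watchdog"}, State: "firing"}}
	u := []Alert{
		{Labels: map[string]string{"alertname": "Watchdog"}, State: "firing"},
		{Labels: map[string]string{"alertname": "AppDown", "namespace": "demo"}, State: "firing"},
	}
	fmt.Println(len(mergeAlerts(p, u))) // 2
}
```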

The plugin intentionally reads from only the in-cluster Alertmanager endpoints. Supporting multiple external Alertmanagers would introduce ambiguous alert state and silencing outcomes because each instance can apply different routing, inhibition, and silence configurations.
213 changes: 213 additions & 0 deletions docs/alert-rule-classification.md
@@ -0,0 +1,213 @@
# Alert Rule Classification - Design and Usage

## Overview
The backend classifies Prometheus alerting rules into a “component” and an “impact layer”. It:
- Computes an `openshift_io_alert_rule_id` per alerting rule.
- Determines component/layer based on matcher logic and rule labels.
- Allows users to override classification via a single, fixed-name ConfigMap per namespace.
- Enriches the Alerts API response with `openshift_io_alert_rule_id`, `openshift_io_alert_component`, and `openshift_io_alert_layer`.

This document explains how it works, how to override, and how to test it.


## Terminology
- `openshift_io_alert_rule_id`: Identifier for an alerting rule. Computed from a canonicalized view of the rule definition and encoded as `rid_` + base64url(nopad(sha256(payload))). Independent of the `PrometheusRule` name.
- `component`: Logical owner of the alert (e.g., `kube-apiserver`, `etcd`, or a namespace).
- `layer`: Impact scope. Allowed values:
  - `cluster`
  - `namespace`

Notes:
- **Stability**:
- The id is **always derived from the rule spec**. If the rule definition changes (expr/for/business labels/name), the id may change.
- For **platform rules**, this API currently only supports label updates via `AlertRelabelConfig` (not editing expr/for), so the id is effectively stable unless the upstream operator changes the rule definition.
- For **user-defined rules**, the API stamps the computed id into the `PrometheusRule` rule labels. If you update the rule definition, the API returns the **new** id and migrates any existing classification override to the new id.
- Layer values are validated as `cluster|namespace` when set. To remove an override, clear the field (via API `null` or by removing the ConfigMap entry); empty/invalid values are ignored at read time.

## Rule ID computation (openshift_io_alert_rule_id)
Location: `pkg/alert_rule/alert_rule.go`

The backend computes a specHash-like value from:
- `kind`/`name`: the rule kind (`alert` or `record`) together with its `alert:` or `record:` name
- `expr`: trimmed with consecutive whitespace collapsed
- `for`: trimmed (duration string as written in the rule)
- `labels`: only non-system labels
- excludes labels with `openshift_io_` prefix and the `alertname` label
- drops empty values
- keeps only valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`)
- sorted by key and joined as `key=value` lines

Annotations are intentionally ignored to reduce id churn on documentation-only changes.

## Classification Logic (How component/layer are determined)
Location: `pkg/alertcomponent/matcher.go`

1) The code adapts `cluster-health-analyzer` matchers:
- CVO-related alerts (update/upgrade) → component/layer based on known patterns
- Compute / node-related alerts
- Core control plane components (renamed to layer `cluster`)
- Workload/namespace-level alerts (renamed to layer `namespace`)

2) Fallback:
- If the computed component is empty or “Others”, we set:
- `component = other`
- `layer` derived from source:
- `openshift_io_alert_source=platform` → `cluster`
- `openshift_io_prometheus_rule_namespace=openshift-monitoring` → `cluster`
- `prometheus` label starting with `openshift-monitoring/` → `cluster`
- otherwise → `namespace`

3) Result:
- Each alerting rule is assigned a `(component, layer)` tuple following the above logic.
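The fallback in step 2 can be sketched as follows. This is a simplification: the real matcher in `pkg/alertcomponent/matcher.go` runs the cluster-health-analyzer patterns first, and the function name here is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackClassification assigns (component, layer) when the matcher yields
// no component (or "Others"), deriving the layer from the alert's source labels.
func fallbackClassification(labels map[string]string) (component, layer string) {
	component = "other"
	switch {
	case labels["openshift_io_alert_source"] == "platform",
		labels["openshift_io_prometheus_rule_namespace"] == "openshift-monitoring",
		strings.HasPrefix(labels["prometheus"], "openshift-monitoring/"):
		layer = "cluster"
	default:
		layer = "namespace"
	}
	return component, layer
}

func main() {
	c, l := fallbackClassification(map[string]string{"openshift_io_alert_source": "platform"})
	fmt.Println(c, l) // other cluster
}
```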

## Developer Overrides via Rule Labels (Recommended)
If you want explicit component/layer values and do not want to rely on the matcher, set
these labels on each rule in your `PrometheusRule`:
- `openshift_io_alert_rule_component`
- `openshift_io_alert_rule_layer`

Both are validated the same way as API overrides:
- `component`: 1-253 chars, alphanumeric + `._-`, must start/end alphanumeric
- `layer`: `cluster` or `namespace`

When these labels are present and valid, they override matcher-derived values.
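The validation rules above can be expressed roughly as follows; the regex is an assumption consistent with the stated constraints, not the backend's exact implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// componentRe enforces: alphanumeric plus '.', '_', '-', starting and ending alphanumeric.
var componentRe = regexp.MustCompile(`^[a-zA-Z0-9]([a-zA-Z0-9._-]*[a-zA-Z0-9])?$`)

// validComponent checks the 1-253 character length bound and the character rules.
func validComponent(c string) bool {
	return len(c) >= 1 && len(c) <= 253 && componentRe.MatchString(c)
}

// validLayer accepts only the two allowed layer values.
func validLayer(l string) bool {
	return l == "cluster" || l == "namespace"
}

func main() {
	fmt.Println(validComponent("kube-apiserver"), validLayer("cluster")) // true true
}
```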

## User Overrides (ConfigMap)
Location: `pkg/management/update_classification.go`, `pkg/management/get_alerts.go`

- The backend stores overrides in the plugin namespace, sharded by target rule namespace:
- Name: `alert-classification-overrides-<rule-namespace>`
- Namespace: the monitoring plugin's namespace
- Required label:
- `monitoring.openshift.io/type=alert-classification-overrides`
- Recommended label:
- `app.kubernetes.io/managed-by=openshift-console`

- Data layout:
- Key: base64url(nopad(UTF-8 bytes of `<openshift_io_alert_rule_id>`))
- This keeps ConfigMap keys opaque and avoids relying on any particular id character set.
- Value: JSON object with a `classification` field that holds component/layer.
- Optional metadata fields such as `alertName`, `prometheusRuleName`, and
`prometheusRuleNamespace` may be included for readability; they are ignored by
the backend.
- Dynamic overrides:
- `openshift_io_alert_rule_component_from`: derive component from an alert label key.
- `openshift_io_alert_rule_layer_from`: derive layer from an alert label key.

Example:
```json
{
"alertName": "ClusterOperatorDown",
"prometheusRuleName": "cluster-version",
"prometheusRuleNamespace": "openshift-cluster-version",
"classification": {
"openshift_io_alert_rule_component_from": "name",
"openshift_io_alert_rule_layer": "cluster"
}
}
```

Notes:
- Overrides are only read when the required `monitoring.openshift.io/type` label is present.
- Invalid component/layer values are ignored for that entry.
- `*_from` values must be valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`).
- If a `*_from` label is present but the alert does not carry that label or the derived
value is invalid, the backend falls back to static values (if present) or defaults.
- If both component and layer are empty, the entry is removed.


## Alerts API Enrichment
Location: `pkg/management/get_alerts.go`, `pkg/k8s/prometheus_alerts.go`

- Endpoint: `GET /api/v1/alerting/alerts` (prom-compatible schema)
- The backend fetches active alerts and enriches each alert with:
- `openshift_io_alert_rule_id`
- `openshift_io_alert_component`
- `openshift_io_alert_layer`
- `prometheusRuleName`: name of the PrometheusRule resource the alert originates from
- `prometheusRuleNamespace`: namespace of that PrometheusRule resource
- `alertingRuleName`: name of the AlertingRule CR that generated the PrometheusRule (empty when the PrometheusRule is not owned by an AlertingRule CR)
- Prometheus compatibility:
- Base response matches Prometheus `/api/v1/alerts`.
- Additional fields are additive and safe for clients like Perses.

## Prometheus/Thanos Sources
Location: `pkg/k8s/prometheus_alerts.go`

- Order of candidates:
1) Thanos Route `thanos-querier` at `/api` + `/v1/alerts` (oauth-proxied)
2) In-cluster Thanos service `https://thanos-querier.openshift-monitoring.svc:9091/api/v1/alerts`
3) In-cluster Prometheus `https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts`
4) In-cluster Prometheus (plain HTTP) `http://prometheus-k8s.openshift-monitoring.svc:9090/api/v1/alerts` (fallback)
5) Prometheus Route `prometheus-k8s` at `/api/v1/alerts`

- TLS and Auth:
- Bearer token: service account token from in-cluster config.
- CA trust: system pool + `SSL_CERT_FILE` + `/var/run/configmaps/service-ca/service-ca.crt`.

RBAC:
- Read routes in `openshift-monitoring`.
- Access `prometheuses/api` as needed for oauth-proxied endpoints.

## Updating Rules Classification
APIs:
- Single update:
- Method: `PATCH /api/v1/alerting/rules/{ruleId}`
- Request body:
```json
{
"classification": {
"openshift_io_alert_rule_component": "team-x",
"openshift_io_alert_rule_layer": "namespace",
"openshift_io_alert_rule_component_from": "name",
"openshift_io_alert_rule_layer_from": "layer"
}
}
```
- `openshift_io_alert_rule_layer`: `cluster` or `namespace`
- To remove a classification override, set the field to `null` (e.g. `"openshift_io_alert_rule_layer": null`).
- Response:
- 200 OK with a status payload (same format as other rule PATCH responses), where `status_code` is 204 on success.
- Standard error body on failure (400 validation, 404 not found, etc.)
- Bulk update:
- Method: `PATCH /api/v1/alerting/rules`
- Request body:
```json
{
"ruleIds": ["<id-a>", "<id-b>"],
"classification": {
"openshift_io_alert_rule_component": "etcd",
"openshift_io_alert_rule_layer": "cluster"
}
}
```
- Response:
- 200 OK with per-rule results (same format as other bulk rule PATCH responses). Clients should handle partial failures.

Direct K8s (supported for power users/GitOps):
- PATCH/PUT the ConfigMap `alert-classification-overrides-<rule-namespace>` in the monitoring plugin namespace (respect `resourceVersion`).
- Each entry is keyed by base64url(`<openshift_io_alert_rule_id>`) with a JSON payload that contains a `classification` object (`openshift_io_alert_rule_component`, `openshift_io_alert_rule_layer`).
- UI should check update permissions with SelfSubjectAccessReview before showing an editor.

Notes:
- These endpoints are intended for updating **classification only** (component/layer overrides),
with permissions enforced based on the rule’s ownership (platform, user workload, operator-managed,
GitOps-managed).
- To update other rule fields (expr/labels/annotations/etc.), use `PATCH /api/v1/alerting/rules/{ruleId}`.
Clients that need to update both should issue two requests. The combined operation is not atomic.
- In the ConfigMap override entries, classification is nested under `classification`
and validated as component/layer to keep it separate from generic label updates.
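A client-side sketch of the single-rule PATCH described above. The base URL and auth handling are assumptions left to the caller; only the path and body shape follow the documented API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Classification carries the override fields accepted by the PATCH body.
type Classification struct {
	Component string `json:"openshift_io_alert_rule_component,omitempty"`
	Layer     string `json:"openshift_io_alert_rule_layer,omitempty"`
}

type patchBody struct {
	Classification Classification `json:"classification"`
}

// buildClassificationPatch prepares PATCH /api/v1/alerting/rules/{ruleId}
// with a JSON classification payload. Auth headers are the caller's job.
func buildClassificationPatch(baseURL, ruleID string, c Classification) (*http.Request, error) {
	payload, err := json.Marshal(patchBody{Classification: c})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPatch,
		baseURL+"/api/v1/alerting/rules/"+ruleID, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildClassificationPatch("https://console.example", "rid_abc",
		Classification{Component: "etcd", Layer: "cluster"})
	if err == nil {
		fmt.Println(req.Method, req.URL.Path) // PATCH /api/v1/alerting/rules/rid_abc
	}
}
```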

## Security Notes
- Persist only minimal classification metadata in the fixed-name ConfigMap.

## Testing and Ops
Unit tests:
- `pkg/management/get_alerts_test.go`
- Overrides from labeled ConfigMap, fallback behavior, label validation.

## Future Work
- Optional CRD to formalize the schema (adds overhead; ConfigMap is sufficient today).
- Optional composite update API if we need to update rule fields and classification atomically.
- De-duplication/merge logic when aggregating alerts across sources.
