Commits
21 commits
5aad6ea
Add base alert management API
machadovilaca Nov 25, 2025
fb8a751
Change IsPlatformAlertRule implementation (#1)
machadovilaca Dec 9, 2025
f622f25
Set source label to platform on OpenShift alerting rules (#3)
machadovilaca Dec 10, 2025
7f42262
Add persistent relabeled alerts rules (#5)
machadovilaca Dec 17, 2025
be8cb4c
Add post API (#2)
avlitman Dec 23, 2025
14d2066
Add patch API (#4)
avlitman Dec 23, 2025
8db3f3d
Update GET /alerts API filters to flat labels filtering format (#10)
sradco Jan 14, 2026
166d031
Fix alertRelabelConfic logic (#9)
sradco Jan 19, 2026
f3f53f7
Add owner label (#6)
avlitman Jan 26, 2026
ac423c6
Drop relabeled alert rules persistent configmap (#13)
machadovilaca Feb 4, 2026
97375a3
Re-add missing metav1 import (#15)
avlitman Feb 4, 2026
f3f2c6b
Add GitHub action unit tests (#16)
machadovilaca Feb 4, 2026
2c567fe
Add to PATCH API drop and restore platform alerts (#12)
avlitman Feb 16, 2026
fb8a39d
Add const labels file (#17)
avlitman Feb 16, 2026
1871c9f
Add support for AlertingRule CRs (#14)
machadovilaca Feb 17, 2026
e2c0b31
Update alert rule id format (#19)
sradco Feb 18, 2026
e4bdd22
Add component mapping to alerts (#8)
sradco Feb 18, 2026
1cd16c7
Add create platform alert rule via AlertingRule CRD (#20)
machadovilaca Feb 19, 2026
e9b6e01
Update GET rules api and adds health details (#11)
sradco Feb 19, 2026
157b3fc
Update APIs based on managed_by labels (#18)
avlitman Feb 24, 2026
2bdfc6d
Add missing ARC support functions (#21)
sradco Feb 24, 2026
21 changes: 21 additions & 0 deletions .github/workflows/unit-tests.yaml
A reviewer (contributor) commented on this file:

> you can't have GitHub actions on github.com/openshift
@@ -0,0 +1,21 @@
name: Unit Tests

on:
  pull_request:
    branches:
      - add-alert-management-api-base

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod

      - name: Run tests
        run: go test -count=1 $(go list ./... | grep -v /test/e2e)
8 changes: 6 additions & 2 deletions Makefile
@@ -41,7 +41,7 @@ lint-frontend:
 lint-backend:
 	go mod tidy
 	go fmt ./cmd/
-	go fmt ./pkg/
+	go fmt ./pkg/... ./internal/...
 
 .PHONY: install-backend
 install-backend:
@@ -57,7 +57,11 @@ start-backend:
 
 .PHONY: test-backend
 test-backend:
-	go test ./pkg/... -v
+	go test ./pkg/... ./internal/... -v
+
+.PHONY: test-e2e
+test-e2e:
+	PLUGIN_URL=http://localhost:9001 go test -v -timeout=150m -count=1 ./test/e2e
 
 .PHONY: build-image
 build-image:
5 changes: 3 additions & 2 deletions cmd/plugin-backend.go
@@ -8,15 +8,16 @@ import (
 	"strconv"
 	"strings"
 
-	server "github.com/openshift/monitoring-plugin/pkg"
 	"github.com/sirupsen/logrus"
+
+	server "github.com/openshift/monitoring-plugin/pkg"
 )
 
 var (
 	portArg     = flag.Int("port", 0, "server port to listen on (default: 9443)\nports 9444 and 9445 reserved for other use")
 	certArg     = flag.String("cert", "", "cert file path to enable TLS (disabled by default)")
 	keyArg      = flag.String("key", "", "private key file path to enable TLS (disabled by default)")
-	featuresArg = flag.String("features", "", "enabled features, comma separated.\noptions: ['acm-alerting', 'incidents', 'dev-config', 'perses-dashboards']")
+	featuresArg = flag.String("features", "", "enabled features, comma separated.\noptions: ['acm-alerting', 'incidents', 'dev-config', 'perses-dashboards', 'alert-management-api']")
 	staticPathArg   = flag.String("static-path", "", "static files path to serve frontend (default: './web/dist')")
 	configPathArg   = flag.String("config-path", "", "config files path (default: './config')")
 	pluginConfigArg = flag.String("plugin-config-path", "", "plugin yaml configuration")
41 changes: 41 additions & 0 deletions docs/alert-management.md
@@ -0,0 +1,41 @@
## Alert Management Notes

This document covers alert management behavior and prerequisites for the monitoring plugin.

### User workload monitoring prerequisites

To include **user workload** alerts and rules in `/api/v1/alerting/alerts` and `/api/v1/alerting/rules`, the user workload monitoring stack must be enabled. Follow the OpenShift documentation for enabling and configuring UWM:

https://docs.redhat.com/en/documentation/monitoring_stack_for_red_hat_openshift/4.20/html/configuring_user_workload_monitoring/configuring-alerts-and-notifications-uwm

#### How the plugin reads user workload alerts/rules

The plugin prefers **Thanos tenancy** for user workload alerts/rules (RBAC-scoped, requires a namespace parameter). When the client does not provide a `namespace` filter, the plugin discovers candidate namespaces and queries Thanos tenancy per-namespace, using the end-user bearer token.

Routes in `openshift-user-workload-monitoring` are treated as **fallbacks** (and are also used for some health checks and pending state retrieval).

If you want to create the user workload Prometheus route (optional), you can expose the service:

```shell
oc -n openshift-user-workload-monitoring expose svc/prometheus-user-workload-web --name=prometheus-user-workload-web --port=web
```

If the route is missing/unreachable but tenancy is healthy, the plugin should still return user workload data and suppress route warnings.

#### Alert states

- `/api/v1/alerting/alerts?state=pending`: pending alerts come from Prometheus.
- `/api/v1/alerting/alerts?state=firing`: firing alerts come from Alertmanager when available.
- `/api/v1/alerting/alerts?state=silenced`: silenced alerts come from Alertmanager (requires an Alertmanager endpoint).

### Alertmanager routing choices

OpenShift supports routing user workload alerts to:

- The **platform Alertmanager** (default instance)
- A **separate Alertmanager** for user workloads
- **External Alertmanager** instances

This is a cluster configuration choice and does not change the plugin API shape. The plugin reads alerts from Alertmanager (for firing/silenced) and Prometheus (for pending), then merges platform and user workload results when available.
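The merge of platform and user workload results mentioned above can be sketched as follows. This is a simplification, not the plugin's actual implementation: the `Alert` type and the label-fingerprint deduplication key are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Alert is a minimal stand-in for a Prometheus-style alert.
type Alert struct {
	Labels map[string]string
	State  string
}

// fingerprint builds a stable key from sorted label pairs.
func fingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + labels[k] + ";")
	}
	return b.String()
}

// mergeAlerts combines platform and user workload alerts, keeping the
// platform copy when the same alert appears in both sources.
func mergeAlerts(platform, userWorkload []Alert) []Alert {
	seen := make(map[string]bool, len(platform))
	out := append([]Alert(nil), platform...)
	for _, a := range platform {
		seen[fingerprint(a.Labels)] = true
	}
	for _, a := range userWorkload {
		if !seen[fingerprint(a.Labels)] {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	p := []Alert{{Labels: map[string]string{"alertname": "Watchdog"}, State: "firing"}}
	u := []Alert{
		{Labels: map[string]string{"alertname": "Watchdog"}, State: "firing"},
		{Labels: map[string]string{"alertname": "AppDown", "namespace": "demo"}, State: "firing"},
	}
	fmt.Println(len(mergeAlerts(p, u))) // 2
}
```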

The plugin intentionally reads from only the in-cluster Alertmanager endpoints. Supporting multiple external Alertmanagers would introduce ambiguous alert state and silencing outcomes because each instance can apply different routing, inhibition, and silence configurations.
213 changes: 213 additions & 0 deletions docs/alert-rule-classification.md
@@ -0,0 +1,213 @@
# Alert Rule Classification - Design and Usage

## Overview
The backend classifies Prometheus alerting rules into a “component” and an “impact layer”. It:
- Computes an `openshift_io_alert_rule_id` per alerting rule.
- Determines component/layer based on matcher logic and rule labels.
- Allows users to override classification via a single, fixed-name ConfigMap per namespace.
- Enriches the Alerts API response with `openshift_io_alert_rule_id`, `openshift_io_alert_component`, and `openshift_io_alert_layer`.

This document explains how it works, how to override, and how to test it.


## Terminology
- `openshift_io_alert_rule_id`: Identifier for an alerting rule. Computed from a canonicalized view of the rule definition and encoded as `rid_` + base64url(nopad(sha256(payload))). Independent of the `PrometheusRule` name.
- `component`: Logical owner of the alert (e.g., `kube-apiserver`, `etcd`, or a namespace).
- `layer`: Impact scope. Allowed values:
  - `cluster`
  - `namespace`

Notes:
- **Stability**:
- The id is **always derived from the rule spec**. If the rule definition changes (expr/for/business labels/name), the id may change.
- For **platform rules**, this API currently only supports label updates via `AlertRelabelConfig` (not editing expr/for), so the id is effectively stable unless the upstream operator changes the rule definition.
- For **user-defined rules**, the API stamps the computed id into the `PrometheusRule` rule labels. If you update the rule definition, the API returns the **new** id and migrates any existing classification override to the new id.
- Layer values are validated as `cluster|namespace` when set. To remove an override, clear the field (via API `null` or by removing the ConfigMap entry); empty/invalid values are ignored at read time.

## Rule ID computation (openshift_io_alert_rule_id)
Location: `pkg/alert_rule/alert_rule.go`

The backend computes a specHash-like value from:
- `kind`/`name`: the rule kind (`alert` or `record`) together with its `alert:` or `record:` name
- `expr`: trimmed with consecutive whitespace collapsed
- `for`: trimmed (duration string as written in the rule)
- `labels`: only non-system labels
- excludes labels with `openshift_io_` prefix and the `alertname` label
- drops empty values
- keeps only valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`)
- sorted by key and joined as `key=value` lines

Annotations are intentionally ignored to reduce id churn on documentation-only changes.

## Classification Logic (How component/layer are determined)
Location: `pkg/alertcomponent/matcher.go`

1) The code adapts `cluster-health-analyzer` matchers:
- CVO-related alerts (update/upgrade) → component/layer based on known patterns
- Compute / node-related alerts
- Core control plane components (renamed to layer `cluster`)
- Workload/namespace-level alerts (renamed to layer `namespace`)

2) Fallback:
- If the computed component is empty or “Others”, we set:
- `component = other`
- `layer` derived from source:
- `openshift_io_alert_source=platform` → `cluster`
- `openshift_io_prometheus_rule_namespace=openshift-monitoring` → `cluster`
- `prometheus` label starting with `openshift-monitoring/` → `cluster`
- otherwise → `namespace`

3) Result:
- Each alerting rule is assigned a `(component, layer)` tuple following the above logic.
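The fallback in step 2 can be sketched as follows. This is a simplification: the real matcher in `pkg/alertcomponent/matcher.go` runs the cluster-health-analyzer patterns first, and the function name here is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackClassification assigns (component, layer) when the matcher yields
// no component (or "Others"), deriving the layer from the alert's source labels.
func fallbackClassification(labels map[string]string) (component, layer string) {
	component = "other"
	switch {
	case labels["openshift_io_alert_source"] == "platform",
		labels["openshift_io_prometheus_rule_namespace"] == "openshift-monitoring",
		strings.HasPrefix(labels["prometheus"], "openshift-monitoring/"):
		layer = "cluster"
	default:
		layer = "namespace"
	}
	return component, layer
}

func main() {
	c, l := fallbackClassification(map[string]string{"openshift_io_alert_source": "platform"})
	fmt.Println(c, l) // other cluster
}
```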

## Developer Overrides via Rule Labels (Recommended)
If you want explicit component/layer values and do not want to rely on the matcher, set
these labels on each rule in your `PrometheusRule`:
- `openshift_io_alert_rule_component`
- `openshift_io_alert_rule_layer`

Both are validated the same way as API overrides:
- `component`: 1-253 chars, alphanumeric + `._-`, must start/end alphanumeric
- `layer`: `cluster` or `namespace`

When these labels are present and valid, they override matcher-derived values.
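The validation rules above can be expressed roughly as follows; the regex is an assumption consistent with the stated constraints, not the backend's exact implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// componentRe enforces: alphanumeric plus '.', '_', '-', starting and ending alphanumeric.
var componentRe = regexp.MustCompile(`^[a-zA-Z0-9]([a-zA-Z0-9._-]*[a-zA-Z0-9])?$`)

// validComponent checks the 1-253 character length bound and the character rules.
func validComponent(c string) bool {
	return len(c) >= 1 && len(c) <= 253 && componentRe.MatchString(c)
}

// validLayer accepts only the two allowed layer values.
func validLayer(l string) bool {
	return l == "cluster" || l == "namespace"
}

func main() {
	fmt.Println(validComponent("kube-apiserver"), validLayer("cluster")) // true true
}
```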

## User Overrides (ConfigMap)
Location: `pkg/management/update_classification.go`, `pkg/management/get_alerts.go`

- The backend stores overrides in the plugin namespace, sharded by target rule namespace:
- Name: `alert-classification-overrides-<rule-namespace>`
- Namespace: the monitoring plugin's namespace
- Required label:
- `monitoring.openshift.io/type=alert-classification-overrides`
- Recommended label:
- `app.kubernetes.io/managed-by=openshift-console`

- Data layout:
- Key: base64url(nopad(UTF-8 bytes of `<openshift_io_alert_rule_id>`))
- This keeps ConfigMap keys opaque and avoids relying on any particular id character set.
- Value: JSON object with a `classification` field that holds component/layer.
- Optional metadata fields such as `alertName`, `prometheusRuleName`, and
`prometheusRuleNamespace` may be included for readability; they are ignored by
the backend.
- Dynamic overrides:
- `openshift_io_alert_rule_component_from`: derive component from an alert label key.
- `openshift_io_alert_rule_layer_from`: derive layer from an alert label key.

Example:
```json
{
"alertName": "ClusterOperatorDown",
"prometheusRuleName": "cluster-version",
"prometheusRuleNamespace": "openshift-cluster-version",
"classification": {
"openshift_io_alert_rule_component_from": "name",
"openshift_io_alert_rule_layer": "cluster"
}
}
```

Notes:
- Overrides are only read when the required `monitoring.openshift.io/type` label is present.
- Invalid component/layer values are ignored for that entry.
- `*_from` values must be valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`).
- If a `*_from` label is present but the alert does not carry that label or the derived
value is invalid, the backend falls back to static values (if present) or defaults.
- If both component and layer are empty, the entry is removed.


## Alerts API Enrichment
Location: `pkg/management/get_alerts.go`, `pkg/k8s/prometheus_alerts.go`

- Endpoint: `GET /api/v1/alerting/alerts` (prom-compatible schema)
- The backend fetches active alerts and enriches each alert with:
- `openshift_io_alert_rule_id`
- `openshift_io_alert_component`
- `openshift_io_alert_layer`
- `prometheusRuleName`: name of the PrometheusRule resource the alert originates from
- `prometheusRuleNamespace`: namespace of that PrometheusRule resource
- `alertingRuleName`: name of the AlertingRule CR that generated the PrometheusRule (empty when the PrometheusRule is not owned by an AlertingRule CR)
- Prometheus compatibility:
- Base response matches Prometheus `/api/v1/alerts`.
- Additional fields are additive and safe for clients like Perses.

## Prometheus/Thanos Sources
Location: `pkg/k8s/prometheus_alerts.go`

- Order of candidates:
1) Thanos Route `thanos-querier` at `/api` + `/v1/alerts` (oauth-proxied)
2) In-cluster Thanos service `https://thanos-querier.openshift-monitoring.svc:9091/api/v1/alerts`
3) In-cluster Prometheus `https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts`
4) In-cluster Prometheus (plain HTTP) `http://prometheus-k8s.openshift-monitoring.svc:9090/api/v1/alerts` (fallback)
5) Prometheus Route `prometheus-k8s` at `/api/v1/alerts`

- TLS and Auth:
- Bearer token: service account token from in-cluster config.
- CA trust: system pool + `SSL_CERT_FILE` + `/var/run/configmaps/service-ca/service-ca.crt`.

RBAC:
- Read routes in `openshift-monitoring`.
- Access `prometheuses/api` as needed for oauth-proxied endpoints.

## Updating Rules Classification
APIs:
- Single update:
- Method: `PATCH /api/v1/alerting/rules/{ruleId}`
- Request body:
```json
{
"classification": {
"openshift_io_alert_rule_component": "team-x",
"openshift_io_alert_rule_layer": "namespace",
"openshift_io_alert_rule_component_from": "name",
"openshift_io_alert_rule_layer_from": "layer"
}
}
```
- `openshift_io_alert_rule_layer`: `cluster` or `namespace`
- To remove a classification override, set the field to `null` (e.g. `"openshift_io_alert_rule_layer": null`).
- Response:
- 200 OK with a status payload (same format as other rule PATCH responses), where `status_code` is 204 on success.
- Standard error body on failure (400 validation, 404 not found, etc.)
- Bulk update:
- Method: `PATCH /api/v1/alerting/rules`
- Request body:
```json
{
"ruleIds": ["<id-a>", "<id-b>"],
"classification": {
"openshift_io_alert_rule_component": "etcd",
"openshift_io_alert_rule_layer": "cluster"
}
}
```
- Response:
- 200 OK with per-rule results (same format as other bulk rule PATCH responses). Clients should handle partial failures.

Direct K8s (supported for power users/GitOps):
- PATCH/PUT the ConfigMap `alert-classification-overrides-<rule-namespace>` in the monitoring plugin namespace (respect `resourceVersion`).
- Each entry is keyed by base64url(`<openshift_io_alert_rule_id>`) with a JSON payload that contains a `classification` object (`openshift_io_alert_rule_component`, `openshift_io_alert_rule_layer`).
- UI should check update permissions with SelfSubjectAccessReview before showing an editor.

Notes:
- These endpoints are intended for updating **classification only** (component/layer overrides),
with permissions enforced based on the rule’s ownership (platform, user workload, operator-managed,
GitOps-managed).
- To update other rule fields (expr/labels/annotations/etc.), use `PATCH /api/v1/alerting/rules/{ruleId}`.
Clients that need to update both should issue two requests. The combined operation is not atomic.
- In the ConfigMap override entries, classification is nested under `classification`
and validated as component/layer to keep it separate from generic label updates.
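A client-side sketch of the single-rule PATCH described above. The base URL and auth handling are assumptions left to the caller; only the path and body shape follow the documented API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Classification carries the override fields accepted by the PATCH body.
type Classification struct {
	Component string `json:"openshift_io_alert_rule_component,omitempty"`
	Layer     string `json:"openshift_io_alert_rule_layer,omitempty"`
}

type patchBody struct {
	Classification Classification `json:"classification"`
}

// buildClassificationPatch prepares PATCH /api/v1/alerting/rules/{ruleId}
// with a JSON classification payload. Auth headers are the caller's job.
func buildClassificationPatch(baseURL, ruleID string, c Classification) (*http.Request, error) {
	payload, err := json.Marshal(patchBody{Classification: c})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPatch,
		baseURL+"/api/v1/alerting/rules/"+ruleID, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildClassificationPatch("https://console.example", "rid_abc",
		Classification{Component: "etcd", Layer: "cluster"})
	if err == nil {
		fmt.Println(req.Method, req.URL.Path) // PATCH /api/v1/alerting/rules/rid_abc
	}
}
```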

## Security Notes
- Persist only minimal classification metadata in the fixed-name ConfigMap.

## Testing and Ops
Unit tests:
- `pkg/management/get_alerts_test.go`
- Overrides from labeled ConfigMap, fallback behavior, label validation.

## Future Work
- Optional CRD to formalize the schema (adds overhead; ConfigMap is sufficient today).
- Optional composite update API if we need to update rule fields and classification atomically.
- De-duplication/merge logic when aggregating alerts across sources.
