feat(greenhouse): adds dashboards for alerts and plugins by olandr · Pull Request #1737 · cloudoperators/greenhouse

olandr · 2026-01-16T15:59:37Z

This adds two dashboards: Alerts and Plugin. The main design drivers for both dashboards have been to (i) use the greenhouse_* metrics, (ii) be a 1-1 mapping of the Greenhouse Alerts, and (iii) give more context to the alerts than simply a number.

To keep the dashboards usable despite the number of params (e.g. clusterNames, nodes, jobs), UX elements have been added that allow filtering by some params.

Signed-off-by: Simon Olander <simon.olander@sap.com>

Summary by CodeRabbit

New Features
- Added three comprehensive monitoring dashboards: "Alerts" dashboard for tracking Kubernetes version updates, resource ownership labels, and cluster readiness; "Organization" dashboard for organization and team status monitoring; "Plugins" dashboard for plugin workload status, reconciliation metrics, and health tracking.

olandr · 2026-01-16T16:05:58Z

Plugins

olandr · 2026-01-16T16:07:20Z

Alerts

IvoGoman · 2026-01-19T11:30:14Z

Hey Simon, thanks for driving this.

For a few panel groups in the Alerts dashboard I see some redundancy with existing dashboards. The Operator Alerts on the Alerts Plugin were modelled after the controller runtime metrics dashboard. The same goes for the Proxy alerts, which are covered by the Proxy Overview.

When writing the original issue I was thinking more in the direction of having an overview by Greenhouse resource / group of resources.

As an example, take the Team Alerts. This could be its own Organization dashboard, showing the overall status of the organisation (e.g. greenhouse_organization_ready & greenhouse_scim_access_ready). Then, for Team & Team RBAC, panels that show the absolute number of TeamMembers and the status of the TeamRoleBindings. All with filters for Organization, Team & Cluster to filter the panels.

The plugin dashboard is going in this direction. For the panels Plugin Reconciliation is Constantly Failing & Plugins Not Ready for over 15 minutes, how about modelling these as Time Series Charts? The dashboard could then give the operator, who is looking at the overall status or investigating an alert, the change over time instead of showing the same information as the alert. The alert should already have a link to Prometheus with the exact query of the alert.

olandr · 2026-01-21T12:49:00Z

> Hey Simon, thanks for driving this.
>
> For a few panel groups in the Alerts dashboard I see some redundancy with existing dashboards. The Operator Alerts on the Alerts Plugin were modelled after the controller runtime metrics dashboard. The same goes for the Proxy alerts, which are covered by the Proxy Overview.

Agreed, the other dashboards look a lot nicer. I deleted these duplicates.

> When writing the original issue I was thinking more in the direction of having an overview by Greenhouse resource / group of resources.
>
> As an example, take the Team Alerts. This could be its own Organization dashboard, showing the overall status of the organisation (e.g. greenhouse_organization_ready & greenhouse_scim_access_ready). Then, for Team & Team RBAC, panels that show the absolute number of TeamMembers and the status of the TeamRoleBindings. All with filters for Organization, Team & Cluster to filter the panels.

I added a new dashboard called "Organization". To me it is a nice start, but it looks a bit empty tbh.

> The plugin dashboard is going in this direction. For the panels Plugin Reconciliation is Constantly Failing & Plugins Not Ready for over 15 minutes, how about modelling these as Time Series Charts? The dashboard could then give the operator, who is looking at the overall status or investigating an alert, the change over time instead of showing the same information as the alert. The alert should already have a link to Prometheus with the exact query of the alert.

That sounds like a good idea to me as well. I started with the alert angle to have something to start with, but I agree that it is redundant, as you can render the exact alert in Prometheus/Thanos.

The only drawback of a time series chart, however, is that it looks a bit cluttered. I agree that it gives an operator far more information than a simple "boolean" table, but it is not easy to render it in a nice-looking way.

--------- Signed-off-by: Simon Olander <simon.olander@sap.com>

trouaux · 2026-01-21T16:03:09Z

those gno plugins are not part of obs-eu-nl-1

olandr · 2026-02-02T09:37:47Z

> those gno plugins are not part of obs-eu-nl-1

they are according to the metrics 🤔

olandr · 2026-02-02T09:39:56Z

Handing over this PR to someone else for the time being.

IvoGoman

Thanks for the big rework @olandr 🎉
A couple of things which would be cool:

Organization

- The Organization variable does not behave as intended. Can you use greenhouse_organization_ready as the Series selector? (This only being greenhouse will be addressed soon.)
- Number of Team Members would be nice as a Timeseries Chart with {{team}} in the legend.

For the other two, let's get them in, use them, and understand where we can tweak them.

coderabbitai · 2026-02-27T16:03:08Z

📝 Walkthrough

Walkthrough

Three new Perses dashboard configuration files are added to monitor Kubernetes cluster infrastructure. The alerts dashboard tracks cluster maintenance and resource status; the organization dashboard monitors organization and team metrics; the plugins dashboard provides comprehensive plugin workload status and reconciliation monitoring, all leveraging Prometheus time-series data.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Perses Dashboard Configurations `charts/greenhouse/perses-dashboards/alerts.json`, `charts/greenhouse/perses-dashboards/organization.json`, `charts/greenhouse/perses-dashboards/plugins.json` | Three new dashboard configuration files added for comprehensive cluster monitoring. Alerts dashboard includes panels for Kubernetes version updates, resource ownership labels, kubeconfig token expiry, and cluster readiness. Organization dashboard tracks status, team membership changes, and SCIM access. Plugins dashboard provides workload status, reconciliation metrics, and test failure monitoring. All dashboards use Prometheus queries with dynamic variables for filtering. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 twitches whiskers with delight
Three dashboards bloom in dashboard light,
With Prometheus queries dancing bright,
Alerts and plugins, teams in sight—
Metrics aligned, all perfectly right! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The PR description explains the purpose and design drivers clearly but is missing several template sections including PR type, related tickets, tests, documentation updates, and checklists. | Complete the PR description by selecting applicable PR type checkboxes, adding related issue numbers if applicable, specifying test coverage, and confirming documentation and checklist items. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(greenhouse): adds dashboards for alerts and plugins' accurately describes the main changes: adding dashboard configurations (alerts.json, plugins.json, and organization.json) to the greenhouse charts directory. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (6)

charts/greenhouse/perses-dashboards/plugins.json (2)
540-544: Hardcoded environment-specific cluster defaults.

The clusterName variable has hardcoded QA cluster names as defaults:
"defaultValue": [
  "cc-b0-qa-de-1",
  "dme-strg-qa-de-1",
  "g-qa-de-200"
]
Consider using "$__all" as the default to make this dashboard portable across environments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/plugins.json` around lines 540 - 544, The
dashboard exposes hardcoded QA cluster names in the clusterName variable's
defaultValue array in plugins.json; replace that hardcoded list with the Grafana
all selector token (use "$__all" as the sole defaultValue) so the clusterName
variable defaults to all clusters and the dashboard is portable across
environments. Target the clusterName variable's "defaultValue" entry in
plugins.json and update it to a single-element array containing "$__all".
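A minimal Python sketch of this suggestion, in case someone wants to apply it to several variables at once. The `spec.variables[*].spec.defaultValue` layout and the helper name `reset_defaults_to_all` are assumptions for illustration and should be checked against the actual dashboard files; the variable must also be configured to accept the all value for this default to take effect.

```python
import copy

# Token suggested in the review comment as the portable default.
ALL_TOKEN = "$__all"


def reset_defaults_to_all(dashboard, variable_names):
    """Return a copy of the dashboard where the named list variables default to all values.

    Hypothetical helper: assumes variables live under spec.variables and that each
    ListVariable keeps its name and defaultValue under its own "spec" key.
    """
    result = copy.deepcopy(dashboard)
    for variable in result.get("spec", {}).get("variables", []):
        spec = variable.get("spec", {})
        if variable.get("kind") == "ListVariable" and spec.get("name") in variable_names:
            spec["defaultValue"] = [ALL_TOKEN]
    return result


# Trimmed-down stand-in for plugins.json, using the defaults quoted in the review.
example = {
    "spec": {
        "variables": [
            {
                "kind": "ListVariable",
                "spec": {
                    "name": "clusterName",
                    "defaultValue": ["cc-b0-qa-de-1", "dme-strg-qa-de-1", "g-qa-de-200"],
                },
            }
        ]
    }
}

patched = reset_defaults_to_all(example, {"clusterName"})
print(patched["spec"]["variables"][0]["spec"]["defaultValue"])
```

The function works on a deep copy, so the original dictionary can still be diffed against the patched one before writing anything back to disk.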
217-266: Duplicate panel title "Installed plugins".

This panel (lines 217-266) has the same title as the panel at lines 48-117. While they serve different purposes (StatChart vs Table), identical titles can confuse users navigating the dashboard. Consider renaming one, e.g., "Installed Plugins by Cluster" or "Plugin Readiness Overview".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/plugins.json` around lines 217 - 266, The
panel object with id "70671561f36d4ce8a6085f2230515e63" (kind "Panel", plugin
"StatChart") currently uses the display.name "Installed plugins", duplicating
another panel title; update its display.name to a distinct, descriptive label
(e.g., "Installed Plugins by Cluster" or "Plugin Readiness Overview") to avoid
confusion while leaving the rest of the panel spec (queries, thresholds,
seriesNameFormat) unchanged.
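Duplicate titles like this can be caught with a small check. The following Python sketch assumes that panels are stored as a mapping of panel id to panel object under `spec.panels`, with the title at `spec.display.name` as referenced in the comment above. Apart from the id quoted in the review, the panel ids and the helper name `duplicate_panel_titles` are hypothetical.

```python
from collections import defaultdict


def duplicate_panel_titles(dashboard):
    """Map each panel title (lower-cased) that occurs more than once to its panel ids."""
    titles = defaultdict(list)
    for panel_id, panel in dashboard.get("spec", {}).get("panels", {}).items():
        name = panel.get("spec", {}).get("display", {}).get("name", "")
        # Comparing case-insensitively is a choice of this sketch, not a dashboard rule.
        titles[name.strip().lower()].append(panel_id)
    return {title: ids for title, ids in titles.items() if len(ids) > 1}


# Reduced stand-in for plugins.json; only the fields needed for the check are present.
dashboard = {
    "spec": {
        "panels": {
            "installed-plugins-table": {
                "kind": "Panel",
                "spec": {"display": {"name": "Installed plugins"}, "plugin": {"kind": "Table"}},
            },
            "70671561f36d4ce8a6085f2230515e63": {
                "kind": "Panel",
                "spec": {"display": {"name": "Installed plugins"}, "plugin": {"kind": "StatChart"}},
            },
            "plugin-test-runs": {
                "kind": "Panel",
                "spec": {"display": {"name": "Plugin test runs"}, "plugin": {"kind": "TimeSeriesChart"}},
            },
        }
    }
}

duplicates = duplicate_panel_titles(dashboard)
print(duplicates)
```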
charts/greenhouse/perses-dashboards/organization.json (2)
348-588: Consider removing unused variables.

The following variables are defined but not referenced in any panel query within this dashboard:

clusterAlert, operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert, teamAlert

ownedBy, pluginDefinition, clusterName, step

If these are placeholders for future use or shared across dashboards via external tooling, this is fine. Otherwise, they add unnecessary complexity to the configuration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/organization.json` around lines 348 -
588, Remove or document the unused ListVariable definitions (clusterAlert,
operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert,
teamAlert and the other unused names ownedBy, pluginDefinition, clusterName,
step) from the dashboard JSON if they are not referenced by any panel queries;
alternatively, if they must remain as placeholders, add a comment or metadata
flag indicating they are intentionally unused. Locate the variables by their
"name" fields (e.g., "clusterAlert", "operatorAlert", "pluginAlert",
"proxyAlert", "organizationAlert", "resourceAlert", "teamAlert") and either
delete those whole variable objects from the "variables" array or mark them with
a clear "hidden/unused" annotation, and while editing also trim leading/trailing
whitespace/newline characters from the "value" and "label" strings (e.g.,
entries containing leading spaces or "\n") to avoid accidental mismatches.
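To verify the finding before deleting anything, a rough Python sketch like the one below could list variables that no panel references. It assumes variables are referenced as `$name` or `${name}` inside panel definitions and that panels and variables sit under `spec.panels` and `spec.variables`. The query and its `namespace` label are made up for the example, and the check deliberately ignores references from one variable to another.

```python
import json
import re


def unused_variables(dashboard):
    """Return variable names that are not referenced anywhere in the panel definitions."""
    spec = dashboard.get("spec", {})
    names = [v.get("spec", {}).get("name") for v in spec.get("variables", [])]
    # Serialise all panels once and search the text for $name or ${name}.
    haystack = json.dumps(spec.get("panels", {}))
    unused = []
    for name in names:
        if not name:
            continue
        pattern = r"\$\{?" + re.escape(name) + r"\b"
        if not re.search(pattern, haystack):
            unused.append(name)
    return unused


# Hypothetical, reduced dashboard: one panel query and three variables.
dashboard = {
    "spec": {
        "variables": [
            {"kind": "ListVariable", "spec": {"name": "organization"}},
            {"kind": "ListVariable", "spec": {"name": "step"}},
            {"kind": "ListVariable", "spec": {"name": "teamAlert"}},
        ],
        "panels": {
            "org-ready": {
                "kind": "Panel",
                "spec": {
                    "queries": [
                        {"query": 'greenhouse_organization_ready{namespace=~"$organization"}'}
                    ]
                },
            }
        },
    }
}

print(unused_variables(dashboard))
```

The word boundary in the pattern keeps `$cluster` from being counted as a use of `$clusterName` and vice versa.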
716-781: Hardcoded environment-specific defaults.

The visible variables have hardcoded defaults that appear environment-specific:

Line 724: "defaultValue": "sci-demo" for organization

Line 770: "defaultValue": "greenhouse-playground" for cluster

Consider using "$__all" or a more generic default, or document that these are intended for a specific environment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/organization.json` around lines 716 -
781, The organization and cluster ListVariable entries currently hardcode
environment-specific defaults ("sci-demo" in the variable named "organization"
and "greenhouse-playground" in the variable named "cluster"); change their
"defaultValue" to a generic value (for example "$__all" or an empty string) or
remove the hardcoded Defaults so the variables are portable across environments
(update the ListVariable with "name": "organization" and the ListVariable with
"name": "cluster" to use the new generic default and optionally add a brief
comment in their display.description if a specific default is required in
certain deployments).
charts/greenhouse/perses-dashboards/alerts.json (2)
391-631: Same whitespace issues and unused variables as organization.json.

This file has identical issues with the static alert variable values containing leading spaces and trailing newlines (e.g., lines 414-419, 449-462, 492-501, 531-537, 566-567, 593-594, 624-625).

Additionally, all alert variables (clusterAlert, operatorAlert, etc.) plus ownedBy, pluginDefinition, clusterName, and step are defined but not referenced in any panel query. Consider removing unused variables or documenting their intended purpose.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/alerts.json` around lines 391 - 631, The
JSON contains many StaticListVariable entries with leading spaces and trailing
newlines in their "value" and "defaultValue" strings (fix values for
clusterAlert, operatorAlert, pluginAlert, proxyAlert, organizationAlert,
resourceAlert, teamAlert) — trim all whitespace/newlines from each listed value
and its defaultValue entries in the variables' plugin.spec.values and
spec.defaultValue arrays; also remove or wire up unused variables (clusterAlert,
operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert,
teamAlert, ownedBy, pluginDefinition, clusterName, step) by either deleting them
from the "variables" array or referencing them in panel queries, and ensure the
variable names match exactly where they are used so queries pick the cleaned
values.
309-323: Potential issue with summing binary ready metric.

The query sum by (clusterName, namespace) (greenhouse_cluster_ready) sums a binary (0/1) ready status metric. If multiple pods/instances report this metric for the same cluster/namespace, the sum could exceed 1, causing the cell settings (which check for exact values "1" or "0") to show no color.

Consider using max or min instead of sum depending on desired semantics (any ready vs all ready).
Proposed alternatives
-"query": "sum by (clusterName, namespace) (greenhouse_cluster_ready)",
+"query": "max by (clusterName, namespace) (greenhouse_cluster_ready)",
Or if you want "all instances ready":
-"query": "sum by (clusterName, namespace) (greenhouse_cluster_ready)",
+"query": "min by (clusterName, namespace) (greenhouse_cluster_ready)",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/alerts.json` around lines 309 - 323, The
Prometheus query currently uses sum by (clusterName, namespace)
(greenhouse_cluster_ready) which can exceed 1 for multiple reporters; change the
aggregation to the correct semantics—e.g., use max by (clusterName, namespace)
(greenhouse_cluster_ready) if you want "any instance ready" or min by
(clusterName, namespace) (greenhouse_cluster_ready) if you want "all instances
ready"—by replacing the query string in the PrometheusTimeSeriesQuery spec (the
"query" field) accordingly so the cell color checks for exact "1"/"0" work as
intended.
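To make the difference between the three aggregations concrete, here is a small Python sketch that mimics `sum by`, `max by` and `min by` on made-up samples. It assumes two instances report `greenhouse_cluster_ready` per cluster; the cluster names, namespace and `instance` label are hypothetical, and whether multiple reporters actually exist in a given setup still needs to be checked against the real metrics.

```python
from collections import defaultdict


def aggregate_by(samples, labels, func):
    """Group samples by the given label names and reduce each group with func."""
    groups = defaultdict(list)
    for sample_labels, value in samples:
        key = tuple(sample_labels[label] for label in labels)
        groups[key].append(value)
    return {key: func(values) for key, values in groups.items()}


# Made-up samples: (labels, value) pairs for a binary 0/1 readiness metric.
samples = [
    ({"clusterName": "cluster-a", "namespace": "org-1", "instance": "pod-0"}, 1),
    ({"clusterName": "cluster-a", "namespace": "org-1", "instance": "pod-1"}, 1),
    ({"clusterName": "cluster-b", "namespace": "org-1", "instance": "pod-0"}, 1),
    ({"clusterName": "cluster-b", "namespace": "org-1", "instance": "pod-1"}, 0),
]
by = ("clusterName", "namespace")

summed = aggregate_by(samples, by, sum)     # can exceed 1, so an exact "1" check misses it
any_ready = aggregate_by(samples, by, max)  # 1 if at least one reporter says ready
all_ready = aggregate_by(samples, by, min)  # 1 only if every reporter says ready

print(summed)
print(any_ready)
print(all_ready)
```

With these samples, `sum` yields 2 for cluster-a, which matches neither "1" nor "0", while `max` and `min` always stay within the two values the cell settings expect.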

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/greenhouse/perses-dashboards/alerts.json`:
- Around line 25-45: The regexes in the "cellSettings" conditions are incorrect
because the trailing '+' allows repeated groups and single-digit minors fall
through; update the two condition.spec.expr patterns to use explicit,
non-repeating ranges and anchors: replace "v1\\.((1|2)[0-9])+\\..*" with
"^v1\\.(1[0-9]|2[0-9])\\..*" and replace "v1\\.([3-9][0-9])+\\..*" with
"^v1\\.([3-9][0-9])\\..*"; if you need to cover single-digit minors (e.g.,
v1.9.x) add a separate condition.expr like "^v1\\.[0-9]\\..*" and assign the
intended color so no versions fall through.
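The behaviour described for these patterns can be reproduced with a short Python sketch. It uses Python's `re` module as a stand-in, on the assumption that the dashboard's regex engine treats these particular constructs the same way; the version strings are invented test inputs.

```python
import re

# Patterns from the comment above, with the JSON string escaping removed.
OLD_LOW = r"v1\.((1|2)[0-9])+\..*"
NEW_LOW = r"^v1\.(1[0-9]|2[0-9])\..*"
NEW_HIGH = r"^v1\.([3-9][0-9])\..*"
SINGLE_DIGIT = r"^v1\.[0-9]\..*"


def matches(pattern, version):
    """True if the pattern is found anywhere in the version string."""
    return re.search(pattern, version) is not None


# The trailing '+' lets the two-digit group repeat, so "2929" is accepted as a minor.
print(matches(OLD_LOW, "v1.2929.0"))
print(matches(NEW_LOW, "v1.2929.0"))

# Single-digit minors are not covered by the old pattern and need their own condition.
print(matches(OLD_LOW, "v1.9.0"))
print(matches(SINGLE_DIGIT, "v1.9.0"))

# Regular two-digit minors are matched by exactly one of the new patterns.
print(matches(NEW_LOW, "v1.29.4"), matches(NEW_HIGH, "v1.29.4"))
print(matches(NEW_LOW, "v1.31.2"), matches(NEW_HIGH, "v1.31.2"))
```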

In `@charts/greenhouse/perses-dashboards/organization.json`:
- Around line 356-582: The static list variable entries contain leading spaces
and trailing newlines causing mismatches; edit the ListVariable blocks (names:
"clusterAlert", "operatorAlert", "pluginAlert", "proxyAlert",
"organizationAlert", "resourceAlert", "teamAlert") and trim whitespace/newlines
from every string in both spec.plugin.spec.values[*].value and spec.defaultValue
arrays so values are exact alert names (e.g.,
"GreenhouseClusterKubernetesVersionOutOfMaintenance",
"GreenhouseWebhookErrorsHigh", "GreenhouseResourceOwnedByLabelMissing", etc.);
ensure no entries include leading spaces or "\n" characters.
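The trimming itself is mechanical. A possible Python sketch, assuming the variable layout named in the comment (`spec.plugin.spec.values[*].value` and `spec.defaultValue`), is shown below. The helper `strip_strings` is hypothetical and strips every string inside the object it is given, so it is applied to a single variable here rather than to the whole dashboard.

```python
def strip_strings(node):
    """Recursively strip leading and trailing whitespace from every string value."""
    if isinstance(node, str):
        return node.strip()
    if isinstance(node, list):
        return [strip_strings(item) for item in node]
    if isinstance(node, dict):
        return {key: strip_strings(value) for key, value in node.items()}
    return node


# Stand-in for one affected variable, with the leading space and trailing newline
# described in the review.
variable = {
    "kind": "ListVariable",
    "spec": {
        "name": "clusterAlert",
        "defaultValue": [" GreenhouseClusterKubernetesVersionOutOfMaintenance\n"],
        "plugin": {
            "kind": "StaticListVariable",
            "spec": {
                "values": [
                    {
                        "label": " GreenhouseClusterKubernetesVersionOutOfMaintenance\n",
                        "value": " GreenhouseClusterKubernetesVersionOutOfMaintenance\n",
                    }
                ]
            },
        },
    },
}

cleaned = strip_strings(variable)
print(cleaned["spec"]["defaultValue"])
```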

In `@charts/greenhouse/perses-dashboards/plugins.json`:
- Around line 283-297: The legend format references {{owned_by}} but the
Prometheus query in the TimeSeriesQuery (sum by(plugin, cluster,
namespace)(rate(greenhouse_plugin_chart_test_runs_total{...}[15m]))) does not
include owned_by in its grouped labels; fix by adding owned_by to the
aggregation labels (change sum by(plugin, cluster, namespace) to sum by(plugin,
cluster, namespace, owned_by)) so the greenhouse_plugin_chart_test_runs_total
query emits an owned_by label for seriesNameFormat to render correctly, or
alternatively remove {{owned_by}} from the seriesNameFormat if you prefer not to
include that label.
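A mismatch between legend placeholders and aggregation labels can be detected with a simple heuristic, sketched in Python below. It only looks at the first `by (...)` clause in the query and is not a PromQL parser. The legend string is a hypothetical example, and the `{...}` in the query is the elided selector from the comment above, kept as a placeholder.

```python
import re


def legend_labels(series_name_format):
    """Label names used as {{label}} placeholders in a legend format string."""
    return set(re.findall(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}", series_name_format))


def grouped_labels(query):
    """Labels listed in the first by(...) clause of the query (heuristic)."""
    match = re.search(r"\bby\s*\(([^)]*)\)", query)
    if match is None:
        return set()
    return {label.strip() for label in match.group(1).split(",") if label.strip()}


def missing_legend_labels(query, series_name_format):
    """Legend placeholders that the aggregation does not preserve."""
    return legend_labels(series_name_format) - grouped_labels(query)


query = "sum by(plugin, cluster, namespace)(rate(greenhouse_plugin_chart_test_runs_total{...}[15m]))"
fixed_query = "sum by(plugin, cluster, namespace, owned_by)(rate(greenhouse_plugin_chart_test_runs_total{...}[15m]))"
legend = "{{plugin}} / {{cluster}} / {{owned_by}}"  # hypothetical seriesNameFormat

print(missing_legend_labels(query, legend))
print(missing_legend_labels(fixed_query, legend))
```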


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 817c862 and 5dc9680.

📒 Files selected for processing (3)

charts/greenhouse/perses-dashboards/alerts.json
charts/greenhouse/perses-dashboards/organization.json
charts/greenhouse/perses-dashboards/plugins.json

github-actions Bot added the size/XXL label Jan 16, 2026

olandr marked this pull request as ready for review January 16, 2026 15:59

olandr requested a review from a team as a code owner January 16, 2026 15:59

github-actions Bot added feature helm-charts and removed size/XXL labels Jan 16, 2026

olandr mentioned this pull request Jan 16, 2026

feat(greenhouse-organization): adds Alerts & Plugins Dashboards for GH sapcc/helm-charts#10405

Closed

style(greenhouse): changes dashboards to address review comments

327b952

--------- Signed-off-by: Simon Olander <simon.olander@sap.com>

github-actions Bot added the size/XXL label Jan 21, 2026

uwe-mayer linked an issue Feb 3, 2026 that may be closed by this pull request

[FEAT] - Greenhouse resource dashboard #1302

Closed

1 task

uwe-mayer mentioned this pull request Feb 3, 2026

[FEAT] - Greenhouse resource dashboard #1302

Closed

1 task

ibakshay self-assigned this Feb 9, 2026

Merge branch 'feat/dashboards-issue-1302'

f3899d6

ibakshay requested a review from a team as a code owner February 10, 2026 12:33

github-actions Bot added size/L, documentation, core-apis, dependencies, size/XXL and removed size/XXL labels Feb 10, 2026

IvoGoman reviewed Feb 13, 2026

Merge branch 'main' into feat/dashboards-issue-1302

7d1f5fa

github-actions Bot removed the size/L label Feb 27, 2026

Update organization dashboard metadata and queries

5dc9680

ibakshay requested a review from IvoGoman February 27, 2026 16:05

coderabbitai Bot reviewed Feb 27, 2026

IvoGoman approved these changes Mar 3, 2026

Merge branch 'main' into feat/dashboards-issue-1302

0eeac1f

ibakshay merged commit 67bbde6 into main Mar 3, 2026
11 checks passed

ibakshay deleted the feat/dashboards-issue-1302 branch March 3, 2026 08:55

4 participants
