feat(greenhouse): adds dashboards for alerts and plugins #1737

Merged
ibakshay merged 6 commits into main from feat/dashboards-issue-1302
Mar 3, 2026

@olandr
Member

@olandr olandr commented Jan 16, 2026

This adds two dashboards: Alerts and Plugin. The main design drivers for both dashboards have been to (i) use the greenhouse_* metrics, (ii) map 1:1 to the Greenhouse alerts, and (iii) give the alerts more context than a bare number.

To keep the dashboards usable despite the number of parameters (e.g. clusterNames, nodes, jobs), they include UX elements that allow filtering on some of those parameters.


Signed-off-by: Simon Olander <simon.olander@sap.com>

Summary by CodeRabbit

  • New Features
    • Added three comprehensive monitoring dashboards: "Alerts" dashboard for tracking Kubernetes version updates, resource ownership labels, and cluster readiness; "Organization" dashboard for organization and team status monitoring; "Plugins" dashboard for plugin workload status, reconciliation metrics, and health tracking.

@olandr olandr marked this pull request as ready for review January 16, 2026 15:59
@olandr olandr requested a review from a team as a code owner January 16, 2026 15:59
@olandr
Member Author

olandr commented Jan 16, 2026

Plugins

[3 screenshots]

@olandr
Member Author

olandr commented Jan 16, 2026

Alerts

[4 screenshots]

@IvoGoman
Contributor

Hey Simon, thanks for driving this.

For a few panel groups in the alerts dashboard I see some redundancy to existing dashboards. The Operator Alerts on the Alerts Plugin were modelled after the controller runtime metrics dashboard. Same for the Proxy alerts which are covered by the Proxy Overview.

When writing the original issue I was thinking more in the direction of having an overview by Greenhouse resources/ group of resources.

Take the Team Alerts as an example. This could be its own Organization dashboard, showing the overall status of the organisation (e.g. greenhouse_organization_ready & greenhouse_scim_access_ready), followed by Team & Team RBAC panels that show the absolute number of TeamMembers and the status of the TeamRoleBindings. All with filters for Organization, Team & Cluster to filter the panels.

The plugin dashboard is going in this direction. For the panels Plugin Reconciliation is Constantly Failing & Plugins Not Ready for over 15 minutes, how about modelling these as Time Series Charts? The dashboard could then give the operator, who is looking at overall status or investigating an alert, the change over time instead of showing the same information as the alert. The alert should already have a link to Prometheus with the exact query of the alert.

@olandr
Member Author

olandr commented Jan 21, 2026

> Hey Simon, thanks for driving this.
>
> For a few panel groups in the alerts dashboard I see some redundancy to existing dashboards. The Operator Alerts on the Alerts Plugin were modelled after the controller runtime metrics dashboard. Same for the Proxy alerts which are covered by the Proxy Overview.

Agreed, the other dashboards look a lot nicer. I deleted these duplicates.

> When writing the original issue I was thinking more in the direction of having an overview by Greenhouse resources/ group of resources.
>
> As an example for the Team Alerts. This could be its own Organization dashboard. Showing the overall status of the organisation (e.g. greenhouse_organization_ready & greenhouse_scim_access_ready). Then for Team & Team RBAC panels that show the absolute number of TeamMembers, the status of the TeamRoleBindings. All with filters for Organization, Team & Cluster to filter the panels.

I added a new dashboard called "Organization". To me it is a nice start, but it looks a bit empty, to be honest.
[screenshot]

> The plugin dashboard is going into this direction. For the panels Plugin Reconciliation is Constantly Failing & Plugins Not Ready for over 15 minutes, how about modelling these as Time Series Charts? The dashboard could then give the operator, who is looking at overall status or investigating an alert, the change over time instead of showing the same information from the alert. The alert should already have a link to the prometheus with the exact query of the alert.

That sounds like a good idea to me as well. I started from the alert angle to have something to begin with, but I agree that it is redundant, since you can render the exact alert in Prometheus/Thanos.

The only drawback of a TimeSeriesChart is that it looks a bit cluttered. I agree that it gives an operator far more information than a simple "boolean" table, but it is not easy to render it in a nice-looking way.
[screenshot]

---------

Signed-off-by: Simon Olander <simon.olander@sap.com>
@trouaux
Contributor

trouaux commented Jan 21, 2026

Screenshot 2026-01-21 at 5 01 29 PM

those gno plugins are not part of obs-eu-nl-1

@olandr
Member Author

olandr commented Feb 2, 2026

> Screenshot 2026-01-21 at 5 01 29 PM
>
> those gno plugins are not part of obs-eu-nl-1

they are according to the metrics 🤔

@olandr
Member Author

olandr commented Feb 2, 2026

Handing over this PR to someone else for the time-being.

@uwe-mayer uwe-mayer linked an issue Feb 3, 2026 that may be closed by this pull request
1 task
@ibakshay ibakshay self-assigned this Feb 9, 2026
@ibakshay ibakshay requested a review from a team as a code owner February 10, 2026 12:33
@github-actions github-actions Bot added size/L, documentation, core-apis, dependencies, size/XXL and removed size/XXL labels Feb 10, 2026
Contributor

@IvoGoman IvoGoman left a comment

Thanks for the big rework @olandr 🎉
A couple of things which would be cool:

Organization

  • The Organization variable does not behave as intended. Can you use greenhouse_organization_ready as the series selector? (That this currently only returns greenhouse will be addressed soon.)
  • Number of Team Members would be nice as a TimeSeries chart with {{team}} in the legend

For the other two, let's get them in, use them, and understand where we can tweak them.

@github-actions github-actions Bot removed the size/L label Feb 27, 2026
@coderabbitai

coderabbitai Bot commented Feb 27, 2026

📝 Walkthrough


Three new Perses dashboard configuration files are added to monitor Kubernetes cluster infrastructure. The alerts dashboard tracks cluster maintenance and resource status; the organization dashboard monitors organization and team metrics; the plugins dashboard provides comprehensive plugin workload status and reconciliation monitoring, all leveraging Prometheus time-series data.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Perses Dashboard Configurations: charts/greenhouse/perses-dashboards/alerts.json, charts/greenhouse/perses-dashboards/organization.json, charts/greenhouse/perses-dashboards/plugins.json | Three new dashboard configuration files added for comprehensive cluster monitoring. Alerts dashboard includes panels for Kubernetes version updates, resource ownership labels, kubeconfig token expiry, and cluster readiness. Organization dashboard tracks status, team membership changes, and SCIM access. Plugins dashboard provides workload status, reconciliation metrics, and test failure monitoring. All dashboards use Prometheus queries with dynamic variables for filtering. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 twitches whiskers with delight
Three dashboards bloom in dashboard light,
With Prometheus queries dancing bright,
Alerts and plugins, teams in sight—
Metrics aligned, all perfectly right! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The PR description explains the purpose and design drivers clearly but is missing several template sections including PR type, related tickets, tests, documentation updates, and checklists. | Complete the PR description by selecting applicable PR type checkboxes, adding related issue numbers if applicable, specifying test coverage, and confirming documentation and checklist items. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(greenhouse): adds dashboards for alerts and plugins' accurately describes the main changes: adding dashboard configurations (alerts.json, plugins.json, and organization.json) to the greenhouse charts directory. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@ibakshay ibakshay requested a review from IvoGoman February 27, 2026 16:05

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
charts/greenhouse/perses-dashboards/plugins.json (2)

540-544: Hardcoded environment-specific cluster defaults.

The clusterName variable has hardcoded QA cluster names as defaults:

"defaultValue": [
  "cc-b0-qa-de-1",
  "dme-strg-qa-de-1",
  "g-qa-de-200"
]

Consider using "$__all" as the default to make this dashboard portable across environments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/plugins.json` around lines 540 - 544, The
dashboard exposes hardcoded QA cluster names in the clusterName variable's
defaultValue array in plugins.json; replace that hardcoded list with the
all-values token (use "$__all" as the sole defaultValue) so the clusterName
variable defaults to all clusters and the dashboard is portable across
environments. Target the clusterName variable's "defaultValue" entry in
plugins.json and update it to a single-element array containing "$__all".
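
A check like this could be automated. The following is a minimal sketch, assuming the dashboard JSON keeps its variables under `spec.variables` and that each list variable carries `spec.name` and `spec.defaultValue`; the trimmed-down `dashboard` document and the `plugin` variable are hypothetical stand-ins, not the real file contents.

```python
import json

ALL_VALUE = "$__all"  # token named in the review comment above

# Hypothetical, trimmed-down dashboard; only the fields used below are shown.
dashboard = json.loads("""
{
  "spec": {
    "variables": [
      {"kind": "ListVariable",
       "spec": {"name": "clusterName",
                "defaultValue": ["cc-b0-qa-de-1", "dme-strg-qa-de-1", "g-qa-de-200"]}},
      {"kind": "ListVariable",
       "spec": {"name": "plugin", "defaultValue": ["$__all"]}}
    ]
  }
}
""")


def hardcoded_defaults(dashboard):
    """Return {name: default} for list variables whose default is not the all token."""
    found = {}
    for var in dashboard["spec"]["variables"]:
        if var.get("kind") != "ListVariable":
            continue
        default = var["spec"].get("defaultValue")
        if default is None:
            continue
        # A default may be a single string or a list of strings.
        values = default if isinstance(default, list) else [default]
        if values != [ALL_VALUE]:
            found[var["spec"]["name"]] = default
    return found


print(hardcoded_defaults(dashboard))
```

Run against each dashboard file, this would list every variable whose default ties the dashboard to one environment.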

217-266: Duplicate panel title "Installed plugins".

This panel (lines 217-266) has the same title as the panel at lines 48-117. While they serve different purposes (StatChart vs Table), identical titles can confuse users navigating the dashboard. Consider renaming one, e.g., "Installed Plugins by Cluster" or "Plugin Readiness Overview".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/plugins.json` around lines 217 - 266, The
panel object with id "70671561f36d4ce8a6085f2230515e63" (kind "Panel", plugin
"StatChart") currently uses the display.name "Installed plugins", duplicating
another panel title; update its display.name to a distinct, descriptive label
(e.g., "Installed Plugins by Cluster" or "Plugin Readiness Overview") to avoid
confusion while leaving the rest of the panel spec (queries, thresholds,
seriesNameFormat) unchanged.
charts/greenhouse/perses-dashboards/organization.json (2)

348-588: Consider removing unused variables.

The following variables are defined but not referenced in any panel query within this dashboard:

  • clusterAlert, operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert, teamAlert
  • ownedBy, pluginDefinition, clusterName, step

If these are placeholders for future use or shared across dashboards via external tooling, this is fine. Otherwise, they add unnecessary complexity to the configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/organization.json` around lines 348 -
588, Remove or document the unused ListVariable definitions (clusterAlert,
operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert,
teamAlert and the other unused names ownedBy, pluginDefinition, clusterName,
step) from the dashboard JSON if they are not referenced by any panel queries;
alternatively, if they must remain as placeholders, add a comment or metadata
flag indicating they are intentionally unused. Locate the variables by their
"name" fields (e.g., "clusterAlert", "operatorAlert", "pluginAlert",
"proxyAlert", "organizationAlert", "resourceAlert", "teamAlert") and either
delete those whole variable objects from the "variables" array or mark them with
a clear "hidden/unused" annotation, and while editing also trim leading/trailing
whitespace/newline characters from the "value" and "label" strings (e.g.,
entries containing leading spaces or "\n") to avoid accidental mismatches.
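
One way to verify the "unused" claim before deleting anything is to search the panel queries for each variable name. This is a rough sketch, assuming variables are referenced as `$name` or `${name}`; the variable list, the label names, and the queries are hypothetical examples, and a real check would also need to look at panel titles and other variables' matchers.

```python
import re

# Hypothetical variable names and panel queries, standing in for the real JSON.
variable_names = ["organization", "cluster", "clusterAlert", "step"]
panel_queries = [
    'greenhouse_organization_ready{namespace=~"$organization"}',
    'greenhouse_scim_access_ready{namespace=~"${organization}", cluster=~"$cluster"}',
]


def unused_variables(names, queries):
    """Return the names that never appear as $name or ${name} in any query."""
    text = "\n".join(queries)
    unused = []
    for name in names:
        escaped = re.escape(name)
        # The \b keeps "$cluster" from matching inside "$clusterAlert".
        pattern = r"\$(?:\{" + escaped + r"\}|" + escaped + r"\b)"
        if not re.search(pattern, text):
            unused.append(name)
    return unused


print(unused_variables(variable_names, panel_queries))
```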

716-781: Hardcoded environment-specific defaults.

The visible variables have hardcoded defaults that appear environment-specific:

  • Line 724: "defaultValue": "sci-demo" for organization
  • Line 770: "defaultValue": "greenhouse-playground" for cluster

Consider using "$__all" or a more generic default, or document that these are intended for a specific environment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/organization.json` around lines 716 -
781, The organization and cluster ListVariable entries currently hardcode
environment-specific defaults ("sci-demo" in the variable named "organization"
and "greenhouse-playground" in the variable named "cluster"); change their
"defaultValue" to a generic value (for example "$__all" or an empty string) or
remove the hardcoded Defaults so the variables are portable across environments
(update the ListVariable with "name": "organization" and the ListVariable with
"name": "cluster" to use the new generic default and optionally add a brief
comment in their display.description if a specific default is required in
certain deployments).
charts/greenhouse/perses-dashboards/alerts.json (2)

391-631: Same whitespace issues and unused variables as organization.json.

This file has identical issues with the static alert variable values containing leading spaces and trailing newlines (e.g., lines 414-419, 449-462, 492-501, 531-537, 566-567, 593-594, 624-625).

Additionally, all alert variables (clusterAlert, operatorAlert, etc.) plus ownedBy, pluginDefinition, clusterName, and step are defined but not referenced in any panel query. Consider removing unused variables or documenting their intended purpose.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/alerts.json` around lines 391 - 631, The
JSON contains many StaticListVariable entries with leading spaces and trailing
newlines in their "value" and "defaultValue" strings (fix values for
clusterAlert, operatorAlert, pluginAlert, proxyAlert, organizationAlert,
resourceAlert, teamAlert) — trim all whitespace/newlines from each listed value
and its defaultValue entries in the variables' plugin.spec.values and
spec.defaultValue arrays; also remove or wire up unused variables (clusterAlert,
operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert,
teamAlert, ownedBy, pluginDefinition, clusterName, step) by either deleting them
from the "variables" array or referencing them in panel queries, and ensure the
variable names match exactly where they are used so queries pick the cleaned
values.

309-323: Potential issue with summing binary ready metric.

The query sum by (clusterName, namespace) (greenhouse_cluster_ready) sums a binary (0/1) ready status metric. If multiple pods/instances report this metric for the same cluster/namespace, the sum could exceed 1, causing the cell settings (which check for exact values "1" or "0") to show no color.

Consider using max or min instead of sum depending on desired semantics (any ready vs all ready).

Proposed alternatives
-"query": "sum by (clusterName, namespace) (greenhouse_cluster_ready)",
+"query": "max by (clusterName, namespace) (greenhouse_cluster_ready)",

Or if you want "all instances ready":

-"query": "sum by (clusterName, namespace) (greenhouse_cluster_ready)",
+"query": "min by (clusterName, namespace) (greenhouse_cluster_ready)",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/greenhouse/perses-dashboards/alerts.json` around lines 309 - 323, The
Prometheus query currently uses sum by (clusterName, namespace)
(greenhouse_cluster_ready) which can exceed 1 for multiple reporters; change the
aggregation to the correct semantics—e.g., use max by (clusterName, namespace)
(greenhouse_cluster_ready) if you want "any instance ready" or min by
(clusterName, namespace) (greenhouse_cluster_ready) if you want "all instances
ready"—by replacing the query string in the PrometheusTimeSeriesQuery spec (the
"query" field) accordingly so the cell color checks for exact "1"/"0" work as
intended.
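
To illustrate the point about aggregation, here is a small sketch that mimics `sum by`, `max by`, and `min by` over in-memory samples. The sample data is hypothetical (two reporters for the same cluster); it is not taken from the real metrics.

```python
# Hypothetical samples: two reporters expose greenhouse_cluster_ready for cluster-a.
samples = [
    ({"clusterName": "cluster-a", "namespace": "org-1"}, 1),
    ({"clusterName": "cluster-a", "namespace": "org-1"}, 1),
    ({"clusterName": "cluster-b", "namespace": "org-1"}, 0),
]


def aggregate(samples, by, func):
    """Group samples by the given labels and reduce each group with func."""
    groups = {}
    for labels, value in samples:
        key = tuple(labels[name] for name in by)
        groups.setdefault(key, []).append(value)
    return {key: func(values) for key, values in groups.items()}


by = ("clusterName", "namespace")
print("sum:", aggregate(samples, by, sum))  # cluster-a adds up to 2
print("max:", aggregate(samples, by, max))  # stays within 0/1
print("min:", aggregate(samples, by, min))  # stays within 0/1
```

With duplicate reporters, only `max` and `min` keep the result in the 0/1 range that the cell settings compare against.
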
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/greenhouse/perses-dashboards/alerts.json`:
- Around line 25-45: The regexes in the "cellSettings" conditions are incorrect
because the trailing '+' allows repeated groups and single-digit minors fall
through; update the two condition.spec.expr patterns to use explicit,
non-repeating ranges and anchors: replace "v1\\.((1|2)[0-9])+\\..*" with
"^v1\\.(1[0-9]|2[0-9])\\..*" and replace "v1\\.([3-9][0-9])+\\..*" with
"^v1\\.([3-9][0-9])\\..*"; if you need to cover single-digit minors (e.g.,
v1.9.x) add a separate condition.expr like "^v1\\.[0-9]\\..*" and assign the
intended color so no versions fall through.
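
The difference between the old and proposed patterns can be tried out quickly. The sketch below uses Python's `re` module as a stand-in, on the assumption that the dashboard's regex engine treats these particular patterns the same way; the version strings are made-up test inputs.

```python
import re

# Patterns copied from the comment above, with the JSON escaping removed.
PATTERNS = {
    "old_1x_2x": r"v1\.((1|2)[0-9])+\..*",
    "new_1x_2x": r"^v1\.(1[0-9]|2[0-9])\..*",
    "new_3x_9x": r"^v1\.([3-9][0-9])\..*",
    "single_digit": r"^v1\.[0-9]\..*",
}


def matching(version):
    """Return the names of all patterns that match from the start of the string."""
    return [name for name, pattern in PATTERNS.items() if re.match(pattern, version)]


# Made-up inputs: a single-digit minor, a normal minor, a nonsense minor, a 3x minor.
for version in ["v1.9.3", "v1.27.4", "v1.2728.0", "v1.31.0"]:
    print(version, matching(version))
```

The nonsense minor `v1.2728.0` is accepted only by the old pattern, and `v1.9.3` is caught only by the extra single-digit condition.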

In `@charts/greenhouse/perses-dashboards/organization.json`:
- Around line 356-582: The static list variable entries contain leading spaces
and trailing newlines causing mismatches; edit the ListVariable blocks (names:
"clusterAlert", "operatorAlert", "pluginAlert", "proxyAlert",
"organizationAlert", "resourceAlert", "teamAlert") and trim whitespace/newlines
from every string in both spec.plugin.spec.values[*].value and spec.defaultValue
arrays so values are exact alert names (e.g.,
"GreenhouseClusterKubernetesVersionOutOfMaintenance",
"GreenhouseWebhookErrorsHigh", "GreenhouseResourceOwnedByLabelMissing", etc.);
ensure no entries include leading spaces or "\n" characters.
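
The trimming itself is mechanical. A minimal sketch, assuming each static value is an object with `value` and optional `label` keys under `spec.plugin.spec.values` (the paths named above); the `variable` dictionary is a hypothetical one-entry example:

```python
import copy

# Hypothetical variable with the kind of padding described above.
variable = {
    "kind": "ListVariable",
    "spec": {
        "name": "operatorAlert",
        "defaultValue": [" GreenhouseWebhookErrorsHigh\n"],
        "plugin": {
            "kind": "StaticListVariable",
            "spec": {
                "values": [
                    {"value": " GreenhouseWebhookErrorsHigh\n",
                     "label": " GreenhouseWebhookErrorsHigh\n"}
                ]
            },
        },
    },
}


def trim_static_values(variable):
    """Return a copy with whitespace stripped from values, labels, and defaults."""
    cleaned = copy.deepcopy(variable)
    spec = cleaned["spec"]
    for entry in spec["plugin"]["spec"]["values"]:
        entry["value"] = entry["value"].strip()
        if "label" in entry:
            entry["label"] = entry["label"].strip()
    spec["defaultValue"] = [value.strip() for value in spec.get("defaultValue", [])]
    return cleaned


print(trim_static_values(variable)["spec"]["defaultValue"])
```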

In `@charts/greenhouse/perses-dashboards/plugins.json`:
- Around line 283-297: The legend format references {{owned_by}} but the
Prometheus query in the TimeSeriesQuery (sum by(plugin, cluster,
namespace)(rate(greenhouse_plugin_chart_test_runs_total{...}[15m]))) does not
include owned_by in its grouped labels; fix by adding owned_by to the
aggregation labels (change sum by(plugin, cluster, namespace) to sum by(plugin,
cluster, namespace, owned_by)) so the greenhouse_plugin_chart_test_runs_total
query emits an owned_by label for seriesNameFormat to render correctly, or
alternatively remove {{owned_by}} from the seriesNameFormat if you prefer not to
include that label.
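
Why the legend comes out incomplete can be shown with a toy model of `sum by (...)`. This is a sketch only: the samples and the legend template are hypothetical, and rendering a missing label as an empty string is an assumption about how the legend formatter behaves.

```python
import re

# Hypothetical rate samples for one plugin, owned by one team.
samples = [
    ({"plugin": "p1", "cluster": "c1", "namespace": "n1", "owned_by": "team-a"}, 0.5),
    ({"plugin": "p1", "cluster": "c1", "namespace": "n1", "owned_by": "team-a"}, 0.25),
]


def sum_by(samples, by):
    """Mimic PromQL `sum by (...)`: keep only the grouping labels and add the values."""
    totals = {}
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] = totals.get(key, 0.0) + value
    return [(dict(key), total) for key, total in totals.items()]


def render_legend(template, labels):
    """Fill {{label}} placeholders; a missing label becomes an empty string (assumed)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: labels.get(m.group(1), ""), template)


template = "{{plugin}} / {{owned_by}}"  # hypothetical legend format
for by in (("plugin", "cluster", "namespace"),
           ("plugin", "cluster", "namespace", "owned_by")):
    for labels, total in sum_by(samples, by):
        print(by, "->", repr(render_legend(template, labels)), total)
```

Only the second grouping carries `owned_by` through the aggregation, so only there does the legend show the owner.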

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 817c862 and 5dc9680.

📒 Files selected for processing (3)
  • charts/greenhouse/perses-dashboards/alerts.json
  • charts/greenhouse/perses-dashboards/organization.json
  • charts/greenhouse/perses-dashboards/plugins.json

Comment thread charts/greenhouse/perses-dashboards/alerts.json
Comment thread charts/greenhouse/perses-dashboards/organization.json
Comment thread charts/greenhouse/perses-dashboards/plugins.json
@ibakshay ibakshay merged commit 67bbde6 into main Mar 3, 2026
11 checks passed
@ibakshay ibakshay deleted the feat/dashboards-issue-1302 branch March 3, 2026 08:55
Development

Successfully merging this pull request may close these issues.

[FEAT] - Greenhouse resource dashboard

4 participants