feat(greenhouse): adds dashboards for alerts and plugins#1737
Conversation
This adds two dashboards: Alerts and Plugin. The main design drivers for both of these dashboards has been to (i) use the greenhouse_* metrics, (ii) be a 1-1 mapping of the Greenhouse Alerts, and (iii) give context to the alerts more than simply a number. The way it has been designed, to not make it completely unusable due to the number of params (e.g. clusterNames, nodes, jobs) is to add UX elements that allows for filtering some params. --------- Signed-off-by: Simon Olander <simon.olander@sap.com>
|
Hey Simon, thanks for driving this. For a few panel groups in the alerts dashboard I see some redundancy to existing dashboards. The Operator Alerts on the Alerts Plugin were modelled after the controller runtime metrics dashboard. Same for the Proxy alerts which are covered by the Proxy Overview. When writing the original issue I was thinking more in the direction of having an overview by Greenhouse resources/ group of resources. As an example for the Team Alerts. This could be it's own Organization dashboard. Showing the overall status of the organisation (e.g. The plugin dashboard is going into this direction. For the panels |
--------- Signed-off-by: Simon Olander <simon.olander@sap.com>
|
Handing over this PR to someone else for the time-being. |
IvoGoman
left a comment
There was a problem hiding this comment.
Thanks for the big rework @olandr 🎉
A couple of things which would be cool:
Organization
- the Variable Organization does not behave as intended. Can you use
greenhouse_organization_readyas the Series selector. (This only being greenhouse will be addressed soon) - Number of Team Members would be nice as an Timeseries Chart with
{{team}}in the legend
For the other two let's get them in use them and understand where we can tweak them.
📝 WalkthroughWalkthroughThree new Perses dashboard configuration files are added to monitor Kubernetes cluster infrastructure. The alerts dashboard tracks cluster maintenance and resource status; the organization dashboard monitors organization and team metrics; the plugins dashboard provides comprehensive plugin workload status and reconciliation monitoring, all leveraging Prometheus time-series data. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (6)
charts/greenhouse/perses-dashboards/plugins.json (2)
540-544: Hardcoded environment-specific cluster defaults.The
clusterNamevariable has hardcoded QA cluster names as defaults:"defaultValue": [ "cc-b0-qa-de-1", "dme-strg-qa-de-1", "g-qa-de-200" ]Consider using
"$__all"as the default to make this dashboard portable across environments.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/greenhouse/perses-dashboards/plugins.json` around lines 540 - 544, The dashboard exposes hardcoded QA cluster names in the clusterName variable's defaultValue array in plugins.json; replace that hardcoded list with the Grafana all selector token (use "$__all" as the sole defaultValue) so the clusterName variable defaults to all clusters and the dashboard is portable across environments. Target the clusterName variable's "defaultValue" entry in plugins.json and update it to a single-element array containing "$__all".
217-266: Duplicate panel title "Installed plugins".This panel (lines 217-266) has the same title as the panel at lines 48-117. While they serve different purposes (StatChart vs Table), identical titles can confuse users navigating the dashboard. Consider renaming one, e.g., "Installed Plugins by Cluster" or "Plugin Readiness Overview".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/greenhouse/perses-dashboards/plugins.json` around lines 217 - 266, The panel object with id "70671561f36d4ce8a6085f2230515e63" (kind "Panel", plugin "StatChart") currently uses the display.name "Installed plugins", duplicating another panel title; update its display.name to a distinct, descriptive label (e.g., "Installed Plugins by Cluster" or "Plugin Readiness Overview") to avoid confusion while leaving the rest of the panel spec (queries, thresholds, seriesNameFormat) unchanged.charts/greenhouse/perses-dashboards/organization.json (2)
348-588: Consider removing unused variables.The following variables are defined but not referenced in any panel query within this dashboard:
clusterAlert,operatorAlert,pluginAlert,proxyAlert,organizationAlert,resourceAlert,teamAlertownedBy,pluginDefinition,clusterName,stepIf these are placeholders for future use or shared across dashboards via external tooling, this is fine. Otherwise, they add unnecessary complexity to the configuration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/greenhouse/perses-dashboards/organization.json` around lines 348 - 588, Remove or document the unused ListVariable definitions (clusterAlert, operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert, teamAlert and the other unused names ownedBy, pluginDefinition, clusterName, step) from the dashboard JSON if they are not referenced by any panel queries; alternatively, if they must remain as placeholders, add a comment or metadata flag indicating they are intentionally unused. Locate the variables by their "name" fields (e.g., "clusterAlert", "operatorAlert", "pluginAlert", "proxyAlert", "organizationAlert", "resourceAlert", "teamAlert") and either delete those whole variable objects from the "variables" array or mark them with a clear "hidden/unused" annotation, and while editing also trim leading/trailing whitespace/newline characters from the "value" and "label" strings (e.g., entries containing leading spaces or "\n") to avoid accidental mismatches.
716-781: Hardcoded environment-specific defaults.The visible variables have hardcoded defaults that appear environment-specific:
- Line 724:
"defaultValue": "sci-demo"for organization- Line 770:
"defaultValue": "greenhouse-playground"for clusterConsider using
"$__all"or a more generic default, or document that these are intended for a specific environment.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/greenhouse/perses-dashboards/organization.json` around lines 716 - 781, The organization and cluster ListVariable entries currently hardcode environment-specific defaults ("sci-demo" in the variable named "organization" and "greenhouse-playground" in the variable named "cluster"); change their "defaultValue" to a generic value (for example "$__all" or an empty string) or remove the hardcoded Defaults so the variables are portable across environments (update the ListVariable with "name": "organization" and the ListVariable with "name": "cluster" to use the new generic default and optionally add a brief comment in their display.description if a specific default is required in certain deployments).charts/greenhouse/perses-dashboards/alerts.json (2)
391-631: Same whitespace issues and unused variables as organization.json.This file has identical issues with the static alert variable values containing leading spaces and trailing newlines (e.g., lines 414-419, 449-462, 492-501, 531-537, 566-567, 593-594, 624-625).
Additionally, all alert variables (
clusterAlert,operatorAlert, etc.) plusownedBy,pluginDefinition,clusterName, andstepare defined but not referenced in any panel query. Consider removing unused variables or documenting their intended purpose.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/greenhouse/perses-dashboards/alerts.json` around lines 391 - 631, The JSON contains many StaticListVariable entries with leading spaces and trailing newlines in their "value" and "defaultValue" strings (fix values for clusterAlert, operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert, teamAlert) — trim all whitespace/newlines from each listed value and its defaultValue entries in the variables' plugin.spec.values and spec.defaultValue arrays; also remove or wire up unused variables (clusterAlert, operatorAlert, pluginAlert, proxyAlert, organizationAlert, resourceAlert, teamAlert, ownedBy, pluginDefinition, clusterName, step) by either deleting them from the "variables" array or referencing them in panel queries, and ensure the variable names match exactly where they are used so queries pick the cleaned values.
309-323: Potential issue with summing binary ready metric.The query
sum by (clusterName, namespace) (greenhouse_cluster_ready)sums a binary (0/1) ready status metric. If multiple pods/instances report this metric for the same cluster/namespace, the sum could exceed 1, causing the cell settings (which check for exact values "1" or "0") to show no color.Consider using
maxormininstead ofsumdepending on desired semantics (any ready vs all ready).Proposed alternatives
-"query": "sum by (clusterName, namespace) (greenhouse_cluster_ready)", +"query": "max by (clusterName, namespace) (greenhouse_cluster_ready)",Or if you want "all instances ready":
-"query": "sum by (clusterName, namespace) (greenhouse_cluster_ready)", +"query": "min by (clusterName, namespace) (greenhouse_cluster_ready)",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/greenhouse/perses-dashboards/alerts.json` around lines 309 - 323, The Prometheus query currently uses sum by (clusterName, namespace) (greenhouse_cluster_ready) which can exceed 1 for multiple reporters; change the aggregation to the correct semantics—e.g., use max by (clusterName, namespace) (greenhouse_cluster_ready) if you want "any instance ready" or min by (clusterName, namespace) (greenhouse_cluster_ready) if you want "all instances ready"—by replacing the query string in the PrometheusTimeSeriesQuery spec (the "query" field) accordingly so the cell color checks for exact "1"/"0" work as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@charts/greenhouse/perses-dashboards/alerts.json`:
- Around line 25-45: The regexes in the "cellSettings" conditions are incorrect
because the trailing '+' allows repeated groups and single-digit minors fall
through; update the two condition.spec.expr patterns to use explicit,
non-repeating ranges and anchors: replace "v1\\.((1|2)[0-9])+\\..*" with
"^v1\\.(1[0-9]|2[0-9])\\..*" and replace "v1\\.([3-9][0-9])+\\..*" with
"^v1\\.([3-9][0-9])\\..*"; if you need to cover single-digit minors (e.g.,
v1.9.x) add a separate condition.expr like "^v1\\.[0-9]\\..*" and assign the
intended color so no versions fall through.
In `@charts/greenhouse/perses-dashboards/organization.json`:
- Around line 356-582: The static list variable entries contain leading spaces
and trailing newlines causing mismatches; edit the ListVariable blocks (names:
"clusterAlert", "operatorAlert", "pluginAlert", "proxyAlert",
"organizationAlert", "resourceAlert", "teamAlert") and trim whitespace/newlines
from every string in both spec.plugin.spec.values[*].value and spec.defaultValue
arrays so values are exact alert names (e.g.,
"GreenhouseClusterKubernetesVersionOutOfMaintenance",
"GreenhouseWebhookErrorsHigh", "GreenhouseResourceOwnedByLabelMissing", etc.);
ensure no entries include leading spaces or "\n" characters.
In `@charts/greenhouse/perses-dashboards/plugins.json`:
- Around line 283-297: The legend format references {{owned_by}} but the
Prometheus query in the TimeSeriesQuery (sum by(plugin, cluster,
namespace)(rate(greenhouse_plugin_chart_test_runs_total{...}[15m]))) does not
include owned_by in its grouped labels; fix by adding owned_by to the
aggregation labels (change sum by(plugin, cluster, namespace) to sum by(plugin,
cluster, namespace, owned_by)) so the greenhouse_plugin_chart_test_runs_total
query emits an owned_by label for seriesNameFormat to render correctly, or
alternatively remove {{owned_by}} from the seriesNameFormat if you prefer not to
include that label.
---
Nitpick comments:
In `@charts/greenhouse/perses-dashboards/alerts.json`:
- Around line 391-631: The JSON contains many StaticListVariable entries with
leading spaces and trailing newlines in their "value" and "defaultValue" strings
(fix values for clusterAlert, operatorAlert, pluginAlert, proxyAlert,
organizationAlert, resourceAlert, teamAlert) — trim all whitespace/newlines from
each listed value and its defaultValue entries in the variables'
plugin.spec.values and spec.defaultValue arrays; also remove or wire up unused
variables (clusterAlert, operatorAlert, pluginAlert, proxyAlert,
organizationAlert, resourceAlert, teamAlert, ownedBy, pluginDefinition,
clusterName, step) by either deleting them from the "variables" array or
referencing them in panel queries, and ensure the variable names match exactly
where they are used so queries pick the cleaned values.
- Around line 309-323: The Prometheus query currently uses sum by (clusterName,
namespace) (greenhouse_cluster_ready) which can exceed 1 for multiple reporters;
change the aggregation to the correct semantics—e.g., use max by (clusterName,
namespace) (greenhouse_cluster_ready) if you want "any instance ready" or min by
(clusterName, namespace) (greenhouse_cluster_ready) if you want "all instances
ready"—by replacing the query string in the PrometheusTimeSeriesQuery spec (the
"query" field) accordingly so the cell color checks for exact "1"/"0" work as
intended.
In `@charts/greenhouse/perses-dashboards/organization.json`:
- Around line 348-588: Remove or document the unused ListVariable definitions
(clusterAlert, operatorAlert, pluginAlert, proxyAlert, organizationAlert,
resourceAlert, teamAlert and the other unused names ownedBy, pluginDefinition,
clusterName, step) from the dashboard JSON if they are not referenced by any
panel queries; alternatively, if they must remain as placeholders, add a comment
or metadata flag indicating they are intentionally unused. Locate the variables
by their "name" fields (e.g., "clusterAlert", "operatorAlert", "pluginAlert",
"proxyAlert", "organizationAlert", "resourceAlert", "teamAlert") and either
delete those whole variable objects from the "variables" array or mark them with
a clear "hidden/unused" annotation, and while editing also trim leading/trailing
whitespace/newline characters from the "value" and "label" strings (e.g.,
entries containing leading spaces or "\n") to avoid accidental mismatches.
- Around line 716-781: The organization and cluster ListVariable entries
currently hardcode environment-specific defaults ("sci-demo" in the variable
named "organization" and "greenhouse-playground" in the variable named
"cluster"); change their "defaultValue" to a generic value (for example "$__all"
or an empty string) or remove the hardcoded Defaults so the variables are
portable across environments (update the ListVariable with "name":
"organization" and the ListVariable with "name": "cluster" to use the new
generic default and optionally add a brief comment in their display.description
if a specific default is required in certain deployments).
In `@charts/greenhouse/perses-dashboards/plugins.json`:
- Around line 540-544: The dashboard exposes hardcoded QA cluster names in the
clusterName variable's defaultValue array in plugins.json; replace that
hardcoded list with the Grafana all selector token (use "$__all" as the sole
defaultValue) so the clusterName variable defaults to all clusters and the
dashboard is portable across environments. Target the clusterName variable's
"defaultValue" entry in plugins.json and update it to a single-element array
containing "$__all".
- Around line 217-266: The panel object with id
"70671561f36d4ce8a6085f2230515e63" (kind "Panel", plugin "StatChart") currently
uses the display.name "Installed plugins", duplicating another panel title;
update its display.name to a distinct, descriptive label (e.g., "Installed
Plugins by Cluster" or "Plugin Readiness Overview") to avoid confusion while
leaving the rest of the panel spec (queries, thresholds, seriesNameFormat)
unchanged.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
charts/greenhouse/perses-dashboards/alerts.jsoncharts/greenhouse/perses-dashboards/organization.jsoncharts/greenhouse/perses-dashboards/plugins.json











This adds two dashboards: Alerts and Plugin. The main design drivers for both of these dashboards has been to (i) use the greenhouse_* metrics, (ii) be a 1-1 mapping of the Greenhouse Alerts, and (iii) give context to the alerts more than simply a number.
The way it has been designed, to not make it completely unusable due to the number of params (e.g. clusterNames, nodes, jobs) is to add UX elements that allows for filtering some params.
Signed-off-by: Simon Olander simon.olander@sap.com
Summary by CodeRabbit