Skip to content

CNV-80608: PR15 - k8s: add orphan AlertRelabelConfig garbage collection#15

Closed
sradco wants to merge 1 commit intoalert-mgmt-14-single-rulefrom
alert-mgmt-15-orphan-arc-gc
Closed

CNV-80608: PR15 - k8s: add orphan AlertRelabelConfig garbage collection#15
sradco wants to merge 1 commit intoalert-mgmt-14-single-rulefrom
alert-mgmt-15-orphan-arc-gc

Conversation

@sradco
Copy link
Copy Markdown
Owner

@sradco sradco commented Mar 3, 2026

When an operator or manual action removes rules from a PrometheusRule (or deletes the CR entirely),
the ARCs created by the plugin for
classification/drop/stamp become orphaned.

This adds a GC pass to sync() that detects and deletes orphaned ARCs by comparing live rule IDs against each ARC's alertRuleId annotation. GitOps-managed ARCs
are never auto-deleted; a warning is logged instead.

@sradco sradco force-pushed the alert-mgmt-14-single-rule branch from b8e9698 to bde1f65 Compare March 5, 2026 10:44
@sradco sradco force-pushed the alert-mgmt-15-orphan-arc-gc branch 2 times, most recently from 3cf4844 to e768021 Compare March 8, 2026 16:12
@sradco sradco force-pushed the alert-mgmt-14-single-rule branch from bde1f65 to 1eb5d54 Compare March 8, 2026 16:12
Comment thread pkg/k8s/relabeled_rules.go Outdated
return
}

arcs, err := rrm.alertRelabelConfigs.List(ctx, "")
Copy link
Copy Markdown

@machadovilaca machadovilaca Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should try to filter by some label in the query instead fetching all

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
The GC workload is inherently "scan all, check each".
There's no way to pre-filter for orphans without already knowing which ones are orphans.
It's an informer cache read, not an API query.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but then we are filtering the results by liveRuleIDs
are we not?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment thread pkg/k8s/relabeled_rules.go Outdated
}

if deleted > 0 {
log.Infof("orphan ARC GC: cleaned up %d orphaned ARCs", deleted)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not much use as we are already logging the deletion with the same level

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Removed. Thanks.

Comment thread pkg/k8s/relabeled_rules.go Outdated

log.Infof("Synced %d relabeled rules in memory", len(alerts))

rrm.gcOrphanedARCs(ctx, allRuleIDs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not belong here

we want to run when PrometheusRule is updated or deleted
that should be handled in pkg/k8s/prometheus_rule.go
where you would add event handlers for those actions

logic for the garbage collection and respective unit test should also live in a dedicated file with matching names
pkg/k8s/relabeled_rules_gc_test.go is probably not convetional

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not belong here

we want to run when PrometheusRule is updated or deleted that should be handled in pkg/k8s/prometheus_rule.go where you would add event handlers for those actions

I disagree.
There are two separate PrometheusRule informers in this codebase, and they serve different purposes:
prometheus_rule.go is a CRUD manager. It wraps the Get/List/Update/Delete/AddRule operations.
It has no workqueue, no reconciliation loop, no event handlers.
relabeled_rules.go is a controller, It has a workqueue, event handlers, and a reconciliation loop.

Running GC on secret changes is unnecessary but harmless. It's an local cache scan that's a no-op when nothing is orphaned. Running on initial-sync and the 15-minute resync is actually important.
It catches orphans left behind from before the plugin started (for example: plugin crash, missed events). A pure event-driven approach would miss those.

logic for the garbage collection and respective unit test should also live in a dedicated file with matching names pkg/k8s/relabeled_rules_gc_test.go is probably not convetional

+1. Moved gcOrphanedARCs to alert_relabel_config_gc.go and renamed the test file to alert_relabel_config_gc_test.go

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cache receives the event handlers for all the prometheus rule resources on startup
so thats not an issue

and although the garbage collector performs operations on ARC resources
those operations are necessary only due to operations on related prometheus rule resources
so the handler should be somewhere related

it actually can still be in this file, because we track the relationship between those resources
but should still track prometheus rule events, given by rrm.queue.Add("prometheus-rule-sync")

@sradco sradco changed the title k8s: add orphan AlertRelabelConfig garbage collection CNV-80608: PR15 - k8s: add orphan AlertRelabelConfig garbage collection Mar 9, 2026
@sradco sradco force-pushed the alert-mgmt-14-single-rule branch from 1eb5d54 to fdee571 Compare March 9, 2026 14:00
@sradco sradco force-pushed the alert-mgmt-15-orphan-arc-gc branch 3 times, most recently from f62057b to bad1af3 Compare March 10, 2026 10:45
@sradco
Copy link
Copy Markdown
Owner Author

sradco commented Mar 10, 2026

@machadovilaca thank you for the review. Please see my comments

@sradco sradco force-pushed the alert-mgmt-14-single-rule branch from fdee571 to e8a3aad Compare March 10, 2026 13:46
@sradco sradco force-pushed the alert-mgmt-15-orphan-arc-gc branch from bad1af3 to 6ff63c9 Compare March 10, 2026 13:46
@sradco sradco force-pushed the alert-mgmt-14-single-rule branch from e8a3aad to a17e0df Compare March 10, 2026 14:04
@sradco sradco force-pushed the alert-mgmt-15-orphan-arc-gc branch from 6ff63c9 to 6b2d009 Compare March 10, 2026 14:04
@sradco sradco force-pushed the alert-mgmt-14-single-rule branch from a17e0df to 5af9134 Compare March 11, 2026 09:06
@sradco sradco force-pushed the alert-mgmt-15-orphan-arc-gc branch from 6b2d009 to 1c95614 Compare March 11, 2026 09:06
@sradco sradco force-pushed the alert-mgmt-14-single-rule branch from 5af9134 to c37ca02 Compare March 11, 2026 09:27
@sradco sradco force-pushed the alert-mgmt-15-orphan-arc-gc branch from 1c95614 to 6ace03c Compare March 11, 2026 09:27
When an operator or manual action removes rules from
a PrometheusRule (or deletes the CR entirely),
the ARCs created by the plugin for
classification/drop/stamp become orphaned.

This adds a GC pass to sync() that detects and deletes
orphaned ARCs by comparing live rule IDs against each
ARC's alertRuleId annotation. GitOps-managed ARCs
are never auto-deleted; a warning is logged instead.

Signed-off-by: Shirly Radco <sradco@redhat.com>
Co-authored-by: AI Assistant <noreply@cursor.com>
@sradco sradco force-pushed the alert-mgmt-15-orphan-arc-gc branch from 6ace03c to 53c6cf9 Compare March 11, 2026 11:50
@sradco sradco closed this Mar 12, 2026
@sradco sradco deleted the alert-mgmt-15-orphan-arc-gc branch March 12, 2026 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants