fix: register event sources when dependents are marked for deletion#3250
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new integration-test workflow scenario under operator-framework to reproduce (and guard against) NoEventSourceForClassException when a downstream BulkDependentResource with an activationCondition is deleted before its event source is registered (issue #3249).
Changes:
- Introduces a minimal 2-node workflow (ConfigMap dependent with always-failing reconcile precondition → Secret bulk dependent with always-true activation condition).
- Adds a new integration test intended to surface the
NodeDeleteExecutor/ event source registration ordering bug. - Adds supporting test-only custom resource, reconciler, dependent resources, and condition implementations.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| operator-framework/src/test/java/io/javaoperatorsdk/operator/workflow/bulkactivationcondition/SecretBulkDependentResource.java | Test bulk dependent resource that accesses a Secret event source in a way that exposes missing registration. |
| operator-framework/src/test/java/io/javaoperatorsdk/operator/workflow/bulkactivationcondition/ConfigMapDependentResource.java | Parent dependent resource used to trigger downstream delete cascade via failing precondition. |
| operator-framework/src/test/java/io/javaoperatorsdk/operator/workflow/bulkactivationcondition/BulkActivationConditionReconciler.java | Reconciler declaring the workflow under test and capturing reconciliation errors. |
| operator-framework/src/test/java/io/javaoperatorsdk/operator/workflow/bulkactivationcondition/BulkActivationConditionIT.java | New integration test reproducer asserting NoEventSourceForClassException should not occur. |
| operator-framework/src/test/java/io/javaoperatorsdk/operator/workflow/bulkactivationcondition/BulkActivationConditionCustomResource.java | Test custom resource type used as the primary for the workflow. |
| operator-framework/src/test/java/io/javaoperatorsdk/operator/workflow/bulkactivationcondition/AlwaysTrueActivation.java | Activation condition implementation that always returns true. |
| operator-framework/src/test/java/io/javaoperatorsdk/operator/workflow/bulkactivationcondition/AlwaysFailingPrecondition.java | Reconcile precondition implementation that always returns false. |
| // Wait for the error to arrive — the ConfigMap precondition always fails, | ||
| // so JOSDK should attempt NodeDeleteExecutor for the Secret bulk dependent. | ||
| await() | ||
| .atMost(Duration.ofSeconds(30)) | ||
| .until(() -> BulkActivationConditionReconciler.lastError.get() != null); |
There was a problem hiding this comment.
The test currently waits until lastError becomes non-null. If/when the underlying bug is fixed (or if the issue reproduces intermittently), lastError may remain null and this test will time out rather than assert the intended behavior. Consider waiting for a positive signal that reconciliation happened (e.g., increment a reconcile counter in the reconciler) and then asserting lastError stays null / does not contain NoEventSourceForClassException.
| // Wait for the error to arrive — the ConfigMap precondition always fails, | |
| // so JOSDK should attempt NodeDeleteExecutor for the Secret bulk dependent. | |
| await() | |
| .atMost(Duration.ofSeconds(30)) | |
| .until(() -> BulkActivationConditionReconciler.lastError.get() != null); | |
| // Allow some time for reconciliation to happen — the ConfigMap precondition always fails, | |
| // so JOSDK should attempt NodeDeleteExecutor for the Secret bulk dependent. | |
| await() | |
| .atMost(Duration.ofSeconds(30)) | |
| .pollDelay(Duration.ofSeconds(2)) | |
| .until(() -> true); |
| * <p>This test FAILS on unfixed JOSDK, demonstrating the bug. | ||
| */ |
There was a problem hiding this comment.
This PR adds an integration test that is expected to fail on current code (“This test FAILS on unfixed JOSDK”). If this is intended as a reproducer-only PR, it should be marked @Disabled (with a link to #3249) or otherwise excluded from the default test suite; otherwise it will break CI for any branch that includes it. Alternatively, include the corresponding production fix in the same PR so the test passes on merge.
| public class BulkActivationConditionIT { | ||
|
|
||
| @RegisterExtension | ||
| static LocallyRunOperatorExtension extension = | ||
| LocallyRunOperatorExtension.builder() | ||
| .withReconciler(new BulkActivationConditionReconciler()) | ||
| .build(); |
There was a problem hiding this comment.
All existing workflow integration tests under io.javaoperatorsdk.operator.workflow.* appear to be annotated with @Sample (used for sample documentation). This new workflow IT isn’t annotated, so it won’t be picked up consistently. Consider adding an appropriate @Sample annotation (tldr/description) to match the established pattern (e.g., workflow/workflowactivationcondition/WorkflowActivationConditionIT.java).
| public class BulkActivationConditionReconciler | ||
| implements Reconciler<BulkActivationConditionCustomResource> { | ||
|
|
||
| static final AtomicReference<Exception> lastError = new AtomicReference<>(); |
There was a problem hiding this comment.
lastError is a static global used to communicate from the reconciler into the IT. This can leak state across tests and makes parallel test execution brittle. Prefer storing error state on the reconciler instance (e.g., an instance AtomicReference) and retrieving it via extension.getReconcilerOfType(...), similar to other ITs in this module.
| static final AtomicReference<Exception> lastError = new AtomicReference<>(); | |
| final AtomicReference<Exception> lastError = new AtomicReference<>(); |
…teExecutor
When a BulkDependentResource has an activationCondition and its parent
dependent has a failing reconcilePrecondition, JOSDK's
markDependentsForDelete() cascades to the bulk dependent and fires
NodeDeleteExecutor for it. However, NodeDeleteExecutor does not call
registerOrDeregisterEventSourceBasedOnActivation() before invoking
delete(), so if NodeReconcileExecutor has never run for that node (e.g.
on first reconciliation) the event source is never registered.
The delete() path calls getSecondaryResources() → eventSourceRetriever
.getEventSourceFor() → NoEventSourceForClassException.
This IT demonstrates the bug with a minimal workflow:
ConfigMapDependentResource (reconcilePrecondition = ALWAYS_FALSE)
└── SecretBulkDependentResource (activationCondition = ALWAYS_TRUE)
The fix is to call registerOrDeregisterEventSourceBasedOnActivation()
in NodeDeleteExecutor.doRun() before calling dependent.delete(),
mirroring what NodeReconcileExecutor already does.
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
- Make lastError and callCount instance fields on the reconciler; hold the reconciler instance as a static field in the IT so tests can access state without static leakage between tests - Add callCount so the test can wait for any reconciliation activity (reconcile() or updateErrorStatus()) then assert cleanly, rather than timing out if the bug is fixed - Add @disabled linking to issue operator-framework#3249 so this reproducer-only test does not break CI - Add @sample annotation to match the pattern of other workflow ITs Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Fixes operator-framework#3249 Signed-off-by: Chris Laprun <metacosm@gmail.com>
be363a8 to
ae0eb4a
Compare
|
@SamBarker can you confirm that this fixes your issue? |
Yes I've built 5.3.1-SNAPSHOT locally and tested the kroxylicious PR with it. I no longer see the |
| if (activationConditionMet) { | ||
| // make sure we register the dependent's event source if it hasn't been added already | ||
| // this might be needed in corner cases such as | ||
| // https://github.com/operator-framework/java-operator-sdk/issues/3249 |
There was a problem hiding this comment.
just nit: instead of linking the issue we might want to describe the case when it is needed. (Actually this is quite an obvious why it should be there so might not even need a description, but maybe it just me).
There was a problem hiding this comment.
I think the issue explains the issue way better than I could do in a short comment and that provides historical context for the change that might be useful later on because that method call here is quite unexpected, imo.
Adds a failing integration test to reproduce the bug described in #3249.
Workflow under test
On first reconciliation the ConfigMap precondition fails →
markDependentsForDeletecascades toSecretBulkDependentResource→NodeDeleteExecutorfires → the Secret event source was never registered (becauseNodeReconcileExecutornever ran) →NoEventSourceForClassException.The test asserts this exception does not occur, and currently fails, demonstrating the bug.
I'm quite happy to attempt a fix, but didn't wanted to make sure it was confirmed as a bug first