fix: add @Exclusive tag to prevent cluster-mutating tests from running in parallel (#2360)
```gherkin
@Flaky
@AzureArchive
@Exclusive
Scenario Outline: Pause and resume archiving to azure (PutObject after pause)
```
What step is problematic in this test? I feel like the only problematic one here is line 49, "create azure archive location".
No location is created here, so it should not be a problem, unless the issue is in the "pre" hook.
General thoughts:
- Waiting for you to rerun 3/4 times to analyze the real impact on both timing and flakiness.
- I still wish we could modify the zenko operator to not reconcile every backbeat pod when they are not concerned at all by the configuration change.
- One trick we may consider here: while we can't control the order of execution of the tests very much, I believe cucumber runs the tests from top to bottom of the file, and may also run the files from a to z (that's why the kafka cleaner scenario has its file called zzz.xxx 🌚). So it might be interesting to put all the problematic tests together at the end of the file instead of having them in the middle.

Other things:
- If we merge this PR, we need to update "HOW_TO_WRITE_TEST.MD": document the new Exclusive tag, and drop rule 3 about not reconfiguring the env during tests.
- William had a different mechanism based on locking a file so that all workers perform a given task one at a time; I think you can see the implementation in the Cli-testing folder. It is worth taking a look and comparing his solution with yours.

Edit: after spending time on this CI, I have also come to the same conclusion that the flakiness comes from those pods being killed and recreated many times while the tests are running, so this is definitely one of the big things we need to fix.
francoisferrand left a comment:
I have really mixed feelings on this one: we are just putting a band-aid on, not addressing the actual underlying issue (the system should remain stable even when creating locations!), just masking the problem and increasing build time...
Maybe this is acceptable as a temporary measure or to facilitate investigations, but only in that context, so work must not stop there:
- The system MUST remain stable and functional even through location creations, etc.
- Either there is a bug in these tests (not waiting on the right events or in the right way), or there is an issue in the software; either way, it must be fixed.
- Introducing exclusivity is the exact opposite of good practice for tests (tests must be idempotent, work in parallel, etc.).
```js
	return pickle.tags.some(t => t.name === tagName);
}

setParallelCanAssign((pickle, runningPickles) => {
```
When/how is this function used (i.e. what is a pickle)? Is it called at the beginning of each scenario, or before each step?
If we want to manage exclusivity, it would be better to handle it at the step level, to minimize the degradation...
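For context on the question above: in cucumber-js, a "pickle" is a compiled, ready-to-run scenario, and the `setParallelCanAssign` predicate is consulted by the coordinator each time an idle worker could pick up a new pickle, so it operates at scenario granularity rather than step granularity. Below is a minimal sketch of the exclusivity predicate described in this PR, decoupled from cucumber so it runs standalone with plain mock pickle objects (the structure is illustrative; in the real code the function body would be passed to `setParallelCanAssign`):

```javascript
// Illustrative sketch of the canAssign logic, standalone (no cucumber import).

const EXCLUSIVE_TAG = '@Exclusive';

function hasTag(pickle, tagName) {
  return pickle.tags.some(t => t.name === tagName);
}

// pickle: the scenario the coordinator wants to assign;
// runningPickles: scenarios currently executing on other workers.
function canAssign(pickle, runningPickles) {
  // An @Exclusive scenario may only start once every worker is idle.
  if (hasTag(pickle, EXCLUSIVE_TAG) && runningPickles.length > 0) {
    return false;
  }
  // While an @Exclusive scenario runs, nothing else may start.
  if (runningPickles.some(p => hasTag(p, EXCLUSIVE_TAG))) {
    return false;
  }
  return true;
}

// Mock pickles for illustration:
const exclusive = { tags: [{ name: '@Exclusive' }] };
const plain = { tags: [] };

console.log(canAssign(exclusive, []));       // true: all workers idle
console.log(canAssign(exclusive, [plain]));  // false: must wait for drain
console.log(canAssign(plain, [exclusive]));  // false: exclusive in flight
console.log(canAssign(plain, [plain]));      // true: normal parallelism
```

The real implementation would also fold in the existing `atMostOnePicklePerTag` fallback mentioned in the PR description; that part is omitted here for brevity.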
## Summary

Fixes intermittent CI failures in `ctst-end2end-sharded` caused by parallel cucumber workers interfering with each other when one worker runs scenarios that mutate cluster-wide state.

## Problem
The `ctst-end2end-sharded` job runs cucumber with 4 parallel workers (`--parallel $PARALLEL_RUNS`). Some test scenarios create or modify Zenko locations via the management API, which triggers operator reconciliation and rolling restarts of backbeat components (replication data processor, notification processors, sorbet, etc.).

When these cluster-mutating scenarios run in one worker, the other 3 workers' tests are affected: their backbeat pods get killed and recreated mid-flight, causing replication timeouts, kafka cleaner failures, and azure archive restore retry timeouts.
### Observed failure (run #8809)

8 out of 4418 scenarios failed:
### Root cause timeline

- The CRUD scenario creates location `e2e-azure-archive-2-non-versioned` via `POST /config/{id}/location`.
- The operator reconciles `backbeat-replication-data-processor` across 6 different ReplicaSets; the data processor is killed and recreated 15 times.
- A new pod, blocked on the `backbeat-config` secret (v21 doesn't exist yet), is killed; the processor is completely down for 6 minutes.

The CRUD scenario creates 3 locations + modifies 3 locations = 6 reconciliation rounds, each triggering a full rolling restart of all backbeat deployments. The `waitForZenkoToStabilize()` call in the CRUD test only blocks that specific worker; the other 3 workers are unaware that pods are being churned.

## Solution
Add an `@Exclusive` tag mechanism to cucumber's `setParallelCanAssign` that gives tagged scenarios exclusive access to all workers:

- While an `@Exclusive` scenario is running, no other scenario can start on any worker
- An `@Exclusive` scenario only starts when all other running scenarios have finished
- The existing `atMostOnePicklePerTag` logic for `@ColdStorage`, `@PRA`, etc. is preserved as a fallback

This is safe from races because the coordinator runs in a single Node.js process: `setParallelCanAssign` is called synchronously from the event loop when deciding work placement. Cucumber also has a built-in deadlock safety valve: if all workers go idle but pickles remain, it force-assigns the first one.

### Scenarios tagged with `@Exclusive`

## Alternatives considered
- **Move location creation to `configure-e2e-ctst.sh`**: would eliminate the problem for the azure archive CRUD test but doesn't generalize to other cluster-mutating scenarios (PRA, website). Would also require significant refactoring of the CRUD test itself.
- **Tag-based ordering (run mutating tests in a separate phase)**: cucumber doesn't natively support phased execution. Would require splitting into multiple `cucumber-js` invocations, losing the single-report output.
- **Reduce parallelism globally**: would slow down all tests, not just the problematic ones.
The `@Exclusive` approach is the most targeted: it only serializes the specific scenarios that cause cluster-wide churn, while allowing all other tests to run in parallel as before.

## Estimated performance impact
Based on a successful run (attempt 4 of #8809):
(`not @PRA`)

Current pipeline: ~82 min → estimated with `@Exclusive`: ~96 min (+17%).

Without retries, the cost drops to ~10%. This is a worthwhile tradeoff for eliminating a major source of CI flakiness that currently requires re-running the entire job (adding 82+ min per retry).
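As a quick sanity check on the overhead figure quoted above:

```javascript
// Verifying the quoted +17% pipeline overhead from the ~82 → ~96 min estimate.
const current = 82;       // minutes, current pipeline
const withExclusive = 96; // minutes, estimated with @Exclusive
const overheadPct = Math.round((withExclusive - current) / current * 100);
console.log(`+${overheadPct}%`); // +17%
```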