fix: add @Exclusive tag to prevent cluster-mutating tests from running in parallel #2360

Open
delthas wants to merge 1 commit into development/2.14 from improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag

Conversation


@delthas delthas commented Mar 20, 2026

Summary

Fixes intermittent CI failures in ctst-end2end-sharded caused by parallel cucumber workers interfering with each other when one worker runs scenarios that mutate cluster-wide state.

Problem

The ctst-end2end-sharded job runs cucumber with 4 parallel workers (--parallel $PARALLEL_RUNS). Some test scenarios create or modify Zenko locations via the management API, which triggers operator reconciliation and rolling restarts of backbeat components (replication data processor, notification processors, sorbet, etc.).
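For context, the invocation looks roughly like this (a sketch of the CI command; only the `--parallel $PARALLEL_RUNS` flag is confirmed by this PR, the feature path and worker count are illustrative assumptions):

```shell
# Hypothetical sketch of the ctst-end2end-sharded cucumber invocation.
# Only --parallel is quoted in this PR; other arguments are assumed.
PARALLEL_RUNS=4
yarn cucumber-js --parallel "$PARALLEL_RUNS" features/
```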

When these cluster-mutating scenarios run in one worker, the other 3 workers' tests are affected — their backbeat pods get killed and recreated mid-flight, causing replication timeouts, kafka cleaner failures, and azure archive restore retry timeouts.

Observed failure (run #8809)

8 out of 4418 scenarios failed:

  • 6 replication scenarios (s3utils + location stripping) — objects stuck in pending/processing state
  • 1 azure archive restore retry — timeout waiting for restored state
  • 1 kafka cleaner — topics not cleaned in time

Root cause timeline

  1. 11:08 — Azure archive CRUD test starts on worker pid:62, creates location e2e-azure-archive-2-non-versioned via POST /config/{id}/location
  2. 11:08–11:28 — Operator reconciles, triggering 23+ rolling update events for backbeat-replication-data-processor across 6 different ReplicaSets. The data processor is killed and recreated 15 times.
  3. 11:22–11:28 — A replication-data-processor pod fails to mount backbeat-config secret (v21 doesn't exist yet), is killed. Processor is completely down for 6 minutes.
  4. 11:28:39 — Final processor pod created, becomes ready at ~11:29
  5. 11:29:10 — Replication tests start on workers pid:48 and pid:54 — seconds after the processor came up. The freshly-started processor hasn't re-joined Kafka consumer groups yet.
  6. 11:34–11:46 — All 6 replication scenarios time out (300s) because the processor can't keep up.

The CRUD scenario creates 3 locations + modifies 3 locations = 6 reconciliation rounds, each triggering a full rolling restart of all backbeat deployments. The waitForZenkoToStabilize() call in the CRUD test only blocks that specific worker — the other 3 workers are unaware that pods are being churned.

Solution

Add an @Exclusive tag mechanism to cucumber's setParallelCanAssign that gives tagged scenarios exclusive access to all workers:

  • When an @Exclusive scenario is running, no other scenario can start on any worker
  • An @Exclusive scenario only starts when all other running scenarios have finished
  • The existing atMostOnePicklePerTag logic for @ColdStorage, @PRA, etc. is preserved as a fallback

This is safe from races because the coordinator runs in a single Node.js process — setParallelCanAssign is called synchronously from the event loop when deciding work placement. Cucumber also has a built-in deadlock safety valve: if all workers go idle but pickles remain, it force-assigns the first one.
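The gating predicate described above can be sketched as a standalone function (a minimal sketch reconstructed from this description, not the actual diff; the `canAssign` name and the fall-through to the existing per-tag logic are illustrative):

```javascript
// Sketch of an @Exclusive gating predicate for cucumber-js's
// setParallelCanAssign. Runs in the single coordinator process.
const EXCLUSIVE_TAG = '@Exclusive';

function hasTag(pickle, tagName) {
  return pickle.tags.some(t => t.name === tagName);
}

// Decide whether `pickle` may start while `runningPickles` are in flight.
function canAssign(pickle, runningPickles) {
  // An @Exclusive pickle may only start once every worker is idle.
  if (hasTag(pickle, EXCLUSIVE_TAG)) {
    return runningPickles.length === 0;
  }
  // While an @Exclusive pickle is running, nothing else may start.
  if (runningPickles.some(p => hasTag(p, EXCLUSIVE_TAG))) {
    return false;
  }
  // The real support code would fall through to the existing
  // atMostOnePicklePerTag checks (@ColdStorage, @PRA, etc.) here.
  return true;
}

// In the real support code this is registered with cucumber-js:
// const { setParallelCanAssign } = require('@cucumber/cucumber');
// setParallelCanAssign(canAssign);

module.exports = { canAssign };
```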

Scenarios tagged with @Exclusive

| Scenario | Feature | Mutation |
| --- | --- | --- |
| Create, read, update and delete azure archive location | azureArchive.feature | Creates 3 locations + modifies them → 6 reconciliation rounds |
| Pause and resume archiving to azure (PutObject after pause) | azureArchive.feature | Pauses/resumes lifecycle for a location |
| Bucket Website CRUD | bucketWebsite.feature | Adds endpoint to overlay (no stabilization wait) |
| PRA (nominal case) | pra.feature | Installs/uninstalls entire DR site |

Alternatives considered

  1. Move location creation to configure-e2e-ctst.sh — Would eliminate the problem for azure archive CRUD but doesn't generalize to other cluster-mutating scenarios (PRA, website). Would also require significant refactoring of the CRUD test itself.

  2. Tag-based ordering (run mutating tests in a separate phase) — Cucumber doesn't natively support phased execution. Would require splitting into multiple cucumber-js invocations, losing the single-report output.

  3. Reduce parallelism globally — Would slow down all tests, not just the problematic ones.

The @Exclusive approach is the most targeted: it only serializes the specific scenarios that cause cluster-wide churn, while allowing all other tests to run in parallel as before.

Estimated performance impact

Based on a successful run (attempt 4 of #8809):

| Scenario | Duration | Extra wall-clock if exclusive |
| --- | --- | --- |
| Azure CRUD (3 examples + 1 retry) | ~17 min | ~13 min |
| Pause/Resume (3 examples) | ~2 min | ~1.5 min |
| Bucket Website CRUD | ~1 s | ~0 s |
| PRA | N/A (excluded by `not @PRA`) | 0 |

Current pipeline: ~82 min → estimated with @Exclusive: ~96 min (+17%)

Without retries, the cost drops to ~10%. This is a worthwhile tradeoff for eliminating a major source of CI flakiness that currently requires re-running the entire job (adding 82+ min per retry).


bert-e commented Mar 20, 2026

Hello delthas,

My role is to assist you with the merge of this pull request. Please type `@bert-e help` to get information on this process, or consult the user documentation.

Available options

| name | description | privileged | authored |
| --- | --- | --- | --- |
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. | | |
| /bypass_author_approval | Bypass the pull request author's approval | | |
| /bypass_build_status | Bypass the build and test status | | |
| /bypass_commit_size | Bypass the check on the size of the changeset | TBA | |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix | | |
| /bypass_jira_check | Bypass the Jira issue check | | |
| /bypass_peer_approval | Bypass the pull request peers' approval | | |
| /bypass_leader_approval | Bypass the pull request leaders' approval | | |
| /approve | Instruct Bert-E that the author has approved the pull request. | | ✍️ |
| /create_pull_requests | Allow the creation of integration pull requests. | | |
| /create_integration_branches | Allow the creation of integration branches. | | |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead | | |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers | | |
| /wait | Instruct Bert-E not to run until further notice. | | |

Available commands

| name | description | privileged |
| --- | --- | --- |
| /help | Print Bert-E's manual in the pull request. | |
| /status | Print Bert-E's current status in the pull request | TBA |
| /clear | Remove all comments from Bert-E from the history | TBA |
| /retry | Re-start a fresh build | TBA |
| /build | Re-start a fresh build | TBA |
| /force_reset | Delete integration branches & pull requests, and restart merge process from the beginning. | |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. | |

Status report is not available.


bert-e commented Mar 20, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

delthas force-pushed the improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag branch from 57627e3 to ffb4756 on March 20, 2026 at 11:15
```gherkin
@Flaky
@AzureArchive
@Exclusive
Scenario Outline: Pause and resume archiving to azure (PutObject after pause)
```

What step is problematic in this test? I feel like the only problematic one here is line 49, "create azure archive location".


No location is created here, so it should not be a problem. Unless the issue is in the "pre" hook.


@SylvainSenechal SylvainSenechal left a comment


General thoughts:

  • Waiting for you to rerun 3/4 times to analyze the real impact on both timing and flakiness
  • I still wish we could modify the zenko operator to not reconcile every backbeat pod when they are not concerned at all by the configuration change
  • One trick we may consider here: while we can't control the order of test execution that much, I believe cucumber runs the tests from top to bottom of each file, and may also run the files from a to z (that's why the kafka cleaner scenario has its file called zzz.xxx 🌚). So it might be interesting to put all the problematic tests together at the end of the file instead of having them in the middle

Other things:

  • If we merge this PR, we need to update "HOW_TO_WRITE_TEST.MD": document the new Exclusive tag, and drop rule 3 about not reconfiguring the env during tests
  • William had a different mechanism based on locking a file so that all workers run a given task one at a time; I think you can see the implementation in the Cli-testing folder. Worth taking a look and comparing his solution with yours

Edit:
After spending time on this CI, I have also come to the same conclusion that the flakiness comes from those pods being killed and recreated many times while the tests are running, so this is definitely one of the big things we need to fix.


@francoisferrand francoisferrand left a comment


I have really mixed feelings on this one: we are just applying a band-aid, not addressing the actual underlying issue (the system should remain stable even when creating locations!!!), just masking the problem and increasing build time...

Maybe this is acceptable as a temporary measure or to facilitate investigations, but only in that context: work must not stop there.

  • The system MUST remain stable and functional even through location creations, etc.
  • Either there is a bug in these tests (not waiting on the right events or in the right way), or an issue in the software: but it must be fixed
  • Introducing exclusivity is the exact opposite of good practice for tests (tests must be idempotent, work in parallel, etc.)


```javascript
  return pickle.tags.some(t => t.name === tagName);
}

setParallelCanAssign((pickle, runningPickles) => {
```

When/how is this function used (i.e. what is a pickle)? Is it called at the beginning of each scenario, or before each "step"?

If we want to manage "exclusivity", it would be better to handle it at the "step" level, to minimize the degradation...
