fix: add @Exclusive tag to prevent cluster-mutating tests from running in parallel #2360

Open
delthas wants to merge 1 commit into development/2.14 from improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag

Conversation


@delthas delthas commented Mar 20, 2026

Summary

Fixes intermittent CI failures in ctst-end2end-sharded caused by parallel cucumber workers interfering with each other when one worker runs scenarios that mutate cluster-wide state.

Problem

The ctst-end2end-sharded job runs cucumber with 4 parallel workers (--parallel $PARALLEL_RUNS). Some test scenarios create or modify Zenko locations via the management API, which triggers operator reconciliation and rolling restarts of backbeat components (replication data processor, notification processors, sorbet, etc.).
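For context, the invocation looks roughly like this (a sketch of the CI command; only the `--parallel $PARALLEL_RUNS` flag is confirmed by this PR, the feature path and worker count are illustrative assumptions):

```shell
# Hypothetical sketch of the ctst-end2end-sharded cucumber invocation.
# Only --parallel is quoted in this PR; other arguments are assumed.
PARALLEL_RUNS=4
yarn cucumber-js --parallel "$PARALLEL_RUNS" features/
```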

When these cluster-mutating scenarios run in one worker, the other 3 workers' tests are affected — their backbeat pods get killed and recreated mid-flight, causing replication timeouts, kafka cleaner failures, and azure archive restore retry timeouts.

Observed failure (run #8809)

8 out of 4418 scenarios failed:

  • 6 replication scenarios (s3utils + location stripping) — objects stuck in pending/processing state
  • 1 azure archive restore retry — timeout waiting for restored state
  • 1 kafka cleaner — topics not cleaned in time

Root cause timeline

  1. 11:08 — Azure archive CRUD test starts on worker pid:62, creates location e2e-azure-archive-2-non-versioned via POST /config/{id}/location
  2. 11:08–11:28 — Operator reconciles, triggering 23+ rolling update events for backbeat-replication-data-processor across 6 different ReplicaSets. The data processor is killed and recreated 15 times.
  3. 11:22–11:28 — A replication-data-processor pod fails to mount backbeat-config secret (v21 doesn't exist yet), is killed. Processor is completely down for 6 minutes.
  4. 11:28:39 — Final processor pod created, becomes ready at ~11:29
  5. 11:29:10 — Replication tests start on workers pid:48 and pid:54 — seconds after the processor came up. The freshly-started processor hasn't re-joined Kafka consumer groups yet.
  6. 11:34–11:46 — All 6 replication scenarios time out (300s) because the processor can't keep up.

The CRUD scenario creates 3 locations + modifies 3 locations = 6 reconciliation rounds, each triggering a full rolling restart of all backbeat deployments. The waitForZenkoToStabilize() call in the CRUD test only blocks that specific worker — the other 3 workers are unaware that pods are being churned.

Solution

Add an @Exclusive tag mechanism to cucumber's setParallelCanAssign that gives tagged scenarios exclusive access to all workers:

  • When an @Exclusive scenario is running, no other scenario can start on any worker
  • An @Exclusive scenario only starts when all other running scenarios have finished
  • The existing atMostOnePicklePerTag logic for @ColdStorage, @PRA, etc. is preserved as a fallback

This is safe from races because the coordinator runs in a single Node.js process — setParallelCanAssign is called synchronously from the event loop when deciding work placement. Cucumber also has a built-in deadlock safety valve: if all workers go idle but pickles remain, it force-assigns the first one.
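The gating predicate described above can be sketched as a standalone function (a minimal sketch reconstructed from this description, not the actual diff; the `canAssign` name and the fall-through to the existing per-tag logic are illustrative):

```javascript
// Sketch of an @Exclusive gating predicate for cucumber-js's
// setParallelCanAssign. Runs in the single coordinator process.
const EXCLUSIVE_TAG = '@Exclusive';

function hasTag(pickle, tagName) {
  return pickle.tags.some(t => t.name === tagName);
}

// Decide whether `pickle` may start while `runningPickles` are in flight.
function canAssign(pickle, runningPickles) {
  // An @Exclusive pickle may only start once every worker is idle.
  if (hasTag(pickle, EXCLUSIVE_TAG)) {
    return runningPickles.length === 0;
  }
  // While an @Exclusive pickle is running, nothing else may start.
  if (runningPickles.some(p => hasTag(p, EXCLUSIVE_TAG))) {
    return false;
  }
  // The real support code would fall through to the existing
  // atMostOnePicklePerTag checks (@ColdStorage, @PRA, etc.) here.
  return true;
}

// In the real support code this is registered with cucumber-js:
// const { setParallelCanAssign } = require('@cucumber/cucumber');
// setParallelCanAssign(canAssign);

module.exports = { canAssign };
```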

Scenarios tagged with @Exclusive

| Scenario | Feature | Mutation |
| --- | --- | --- |
| Create, read, update and delete azure archive location | azureArchive.feature | Creates 3 locations + modifies them → 6 reconciliation rounds |
| Pause and resume archiving to azure (PutObject after pause) | azureArchive.feature | Pauses/resumes lifecycle for a location |
| Bucket Website CRUD | bucketWebsite.feature | Adds endpoint to overlay (no stabilization wait) |
| PRA (nominal case) | pra.feature | Installs/uninstalls entire DR site |

Alternatives considered

  1. Move location creation to configure-e2e-ctst.sh — Would eliminate the problem for azure archive CRUD but doesn't generalize to other cluster-mutating scenarios (PRA, website). Would also require significant refactoring of the CRUD test itself.

  2. Tag-based ordering (run mutating tests in a separate phase) — Cucumber doesn't natively support phased execution. Would require splitting into multiple cucumber-js invocations, losing the single-report output.

  3. Reduce parallelism globally — Would slow down all tests, not just the problematic ones.

The @Exclusive approach is the most targeted: it only serializes the specific scenarios that cause cluster-wide churn, while allowing all other tests to run in parallel as before.

Estimated performance impact

Based on a successful run (attempt 4 of #8809):

| Scenario | Duration | Extra wall-clock if exclusive |
| --- | --- | --- |
| Azure CRUD (3 examples + 1 retry) | ~17 min | ~13 min |
| Pause/Resume (3 examples) | ~2 min | ~1.5 min |
| Bucket Website CRUD | ~1 s | ~0 s |
| PRA | N/A (excluded by `not @PRA`) | 0 |

Current pipeline: ~82 min → estimated with @Exclusive: ~96 min (+17%)

Without retries, the cost drops to ~10%. This is a worthwhile tradeoff for eliminating a major source of CI flakiness that currently requires re-running the entire job (adding 82+ min per retry).


bert-e commented Mar 20, 2026

Hello delthas,

My role is to assist you with the merge of this pull request. Please type `@bert-e help` to get information on this process, or consult the user documentation.

Available options

| name | description | privileged | authored |
| --- | --- | --- | --- |
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. | | |
| /bypass_author_approval | Bypass the pull request author's approval | | |
| /bypass_build_status | Bypass the build and test status | | |
| /bypass_commit_size | Bypass the check on the size of the changeset | TBA | |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix | | |
| /bypass_jira_check | Bypass the Jira issue check | | |
| /bypass_peer_approval | Bypass the pull request peers' approval | | |
| /bypass_leader_approval | Bypass the pull request leaders' approval | | |
| /approve | Instruct Bert-E that the author has approved the pull request. | | ✍️ |
| /create_pull_requests | Allow the creation of integration pull requests. | | |
| /create_integration_branches | Allow the creation of integration branches. | | |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead | | |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers | | |
| /wait | Instruct Bert-E not to run until further notice. | | |

Available commands

| name | description | privileged |
| --- | --- | --- |
| /help | Print Bert-E's manual in the pull request. | |
| /status | Print Bert-E's current status in the pull request | TBA |
| /clear | Remove all comments from Bert-E from the history | TBA |
| /retry | Re-start a fresh build | TBA |
| /build | Re-start a fresh build | TBA |
| /force_reset | Delete integration branches & pull requests, and restart merge process from the beginning. | |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. | |

Status report is not available.


bert-e commented Mar 20, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

delthas force-pushed the improvement/ZENKO-5228/fix-ci-parallel-exclusive-tag branch from 57627e3 to ffb4756 on March 20, 2026 at 11:15
```gherkin
@Flaky
@AzureArchive
@Exclusive
Scenario Outline: Pause and resume archiving to azure (PutObject after pause)
```

What step is problematic in this test? I feel like the only problematic one here is line 49, "create azure archive location".


No location is created here, so it should not be a problem. Unless the issue is in the "pre" hook.


@SylvainSenechal SylvainSenechal left a comment


General thoughts:

  • Waiting for you to rerun 3/4 times to analyze the real impact on both timing and flakiness
  • I still wish we could modify the zenko operator to not reconcile every backbeat pod when they are not concerned at all by the configuration change
  • One trick we may consider here: while we can't control the order of test execution that much, I believe cucumber runs the tests from top to bottom of each file, and may also run the files from a to z (that's why the kafka cleaner scenario has its file called zzz.xxx 🌚). So it might be interesting to put all the problematic tests together at the end of the file instead of having them in the middle

Other things:

  • If we merge this PR, we need to update "HOW_TO_WRITE_TEST.MD": document the new Exclusive tag, and drop rule 3 about not reconfiguring the env during tests
  • William had a different mechanism based on locking a file so that all workers run a given task one at a time; I think you can see the implementation in the Cli-testing folder. Worth taking a look and comparing his solution with yours

Edit:
After spending time on this CI, I have also come to the same conclusion that the flakiness comes from those pods being killed and recreated many times while the tests are running, so this is definitely one of the big things we need to fix.


@francoisferrand francoisferrand left a comment


I have really mixed feelings on this one: we are just applying a band-aid, not addressing the actual underlying issue (the system should remain stable even when creating locations!!!), just masking the problem and increasing build time...

Maybe this is acceptable as a temporary measure or to facilitate investigations, but only in that context: work must not stop there.

  • The system MUST remain stable and functional even through location creations, etc.
  • Either there is a bug in these tests (not waiting on the right events or in the right way), or an issue in the software: but it must be fixed
  • Introducing exclusivity is the exact opposite of good practice for tests (tests must be idempotent, work in parallel, etc.)


```javascript
  return pickle.tags.some(t => t.name === tagName);
}

setParallelCanAssign((pickle, runningPickles) => {
```

When/how is this function used (i.e. what is a pickle)? Is it called at the beginning of each scenario, or before each "step"?

If we want to manage "exclusivity", it would be better to handle it at the "step" level, to minimize the degradation...
