fix(ci): sequence mongodb sharded deploy to prevent mongos hang (#2361)
Conversation
Delay mongos startup until configsvr is fully ready to avoid a race condition in the Bitnami mongodb-sharded entrypoint that causes a ~20% CI failure rate on ctst-end2end-sharded. Issue: ZENKO-5229
I have successfully merged the changeset of this pull request. Please check the status of the associated issue ZENKO-5229.
Summary
Fixes a ~20% flaky failure rate on `ctst-end2end-sharded` caused by a race condition in the Bitnami mongodb-sharded entrypoint during mongos startup.
Problem
When `kubectl apply` deploys all MongoDB sharded StatefulSets at once (configsvr, mongos, shards), mongos sometimes starts before configsvr's replica set is fully initialized. The Bitnami mongos entrypoint does:

1. `wait-for-port` on configsvr:27017, which succeeds as soon as the port is open
2. Logs `"Found MongoDB server listening at configsvr:27017 !"`
3. Runs `mongosh --host configsvr -u root -p <pw> admin` with `db.getUsers()` to verify the node is available
Step 3 is the problem: configsvr's port may be open while the replica set is still initializing (primary election, auth user creation). The `mongosh` call has no timeout in the entrypoint (`mongodb_execute_print_output` in `libmongodb-sharded.sh`), so it blocks forever waiting for a usable authenticated session.
Meanwhile, shard0-data-0 completes its own replica set init, stops `mongod`, and tries to connect to mongos to register itself as a shard. Since mongos never started, shard0 loops on `"timeout reached before the port went into state inuse"`. The liveness probe (`pgrep mongod`, `initialDelaySeconds: 60`, `failureThreshold: 2`) kills shard0 every ~2 minutes because `mongod` was stopped for reconfiguration. The 5-minute rollout timeout on mongos expires, and the job fails.
This was confirmed across 3 consecutive CI attempts (run 23301276674, attempts 2-4),
all showing identical behavior: configsvr healthy, mongos hung after configsvr
connection, shard0 crash-looping from liveness kills.
Why it's flaky (not deterministic): the race window is narrow. 80% of the time, configsvr completes replica set init before mongos reaches the `mongosh` auth check. 20% of the time, mongos wins the race and hangs.
Solution
Sequence the deploy so mongos only starts after configsvr is fully ready:

1. `kubectl apply` the full manifest (all StatefulSets created)
2. Scale the mongos StatefulSet down to 0 replicas
3. Wait for the configsvr rollout to complete (readiness means the `mongosh` auth succeeds = replica set fully initialized)
4. Scale mongos back up and wait for its rollout

This ensures that when mongos starts, configsvr is guaranteed to be fully initialized, so the `mongosh` auth check succeeds immediately.
Alternatives considered
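The sequencing can be sketched as a few `kubectl` commands. The StatefulSet names, manifest path, and timeout values below are illustrative assumptions, not taken from the actual CI job:

```shell
# Assumed resource names -- substitute the real chart's StatefulSets.
MANIFEST=mongodb-sharded.yaml
MONGOS=statefulset/mongodb-sharded-mongos
CONFIGSVR=statefulset/mongodb-sharded-configsvr

# 1. Create everything (configsvr, shards, mongos).
kubectl apply -f "$MANIFEST"

# 2. Immediately park mongos so its entrypoint cannot race configsvr init.
kubectl scale "$MONGOS" --replicas=0

# 3. Configsvr readiness runs the mongosh auth check, so a completed
#    rollout implies the replica set is fully initialized.
kubectl rollout status "$CONFIGSVR" --timeout=300s

# 4. Start mongos; its configsvr auth check now succeeds immediately.
kubectl scale "$MONGOS" --replicas=1
kubectl rollout status "$MONGOS" --timeout=300s
```

This stays entirely within `kubectl`, which is why no manifest parsing is needed.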
Enable `startupProbe` on mongos and shard data nodes: The Bitnami chart ships with `startupProbe.enabled: false` for mongos and shards (only configsvr has it enabled). Enabling it would prevent the liveness probe from killing shard0 during init, giving the deadlock more time to self-resolve. However, this only treats the symptom: it buys time but doesn't prevent the `mongosh` hang on mongos. If configsvr is slow enough, mongos would still hang indefinitely. We may still want to enable startupProbes separately as a defense-in-depth measure.
Wrap `mongosh` with `timeout` in the entrypoint: Would fix the root cause (the missing timeout), but requires patching the Bitnami container image or injecting a custom entrypoint script. Higher maintenance burden for a vendored chart.
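As a sketch of that alternative: the wrapper shape and the 60-second budget are assumptions, and the `mongosh` invocation just mirrors the call quoted above, not a tested command line. Coreutils `timeout` turns the infinite hang into a retryable failure with exit code 124.

```shell
# Hypothetical wrapper around the entrypoint's auth check; host and
# credentials are placeholders from the PR description.
check_configsvr() {
    timeout 60 mongosh --host configsvr -u root -p "$MONGODB_ROOT_PASSWORD" \
        admin --eval 'db.getUsers()'
}

# `timeout` semantics: when the deadline passes, the child is killed and
# the exit status is 124, so an outer retry loop can actually retry.
rc=0
timeout 1 sleep 30 || rc=$?
echo "exit: $rc"   # prints "exit: 124"
```

An exit code of 124 would let the existing retry machinery distinguish "hung" from "succeeded".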
Increase `MONGODB_INIT_RETRY_ATTEMPTS` / `MONGODB_INIT_RETRY_DELAY`: These control the `retry_while` wrapper around the `mongosh` call. However, since `mongosh` itself blocks indefinitely (the first attempt never returns), `retry_while` never gets a chance to retry. These settings would have no effect on the hang.
Split the manifest and apply resources separately: Would also work, but requires `yq` to filter multi-document YAML by resource kind. The scale-to-zero approach is simpler: it uses only `kubectl` and doesn't require parsing the manifest.