fix(ci): sequence mongodb sharded deploy to prevent mongos hang#2361

Merged
bert-e merged 1 commit into development/2.14 from bugfix/ZENKO-5229/fix-mongos-startup-race
Mar 21, 2026

Conversation

delthas (Contributor) commented Mar 20, 2026

Summary

Fixes a ~20% flaky failure rate on ctst-end2end-sharded caused by a race condition
in the Bitnami mongodb-sharded entrypoint during mongos startup.

Problem

When kubectl apply deploys all MongoDB sharded StatefulSets at once (configsvr,
mongos, shards), mongos sometimes starts before configsvr's replica set is fully
initialized. The Bitnami mongos entrypoint does:

  1. wait-for-port on configsvr:27017 — succeeds as soon as the port is open
  2. Prints "Found MongoDB server listening at configsvr:27017 !"
  3. Runs mongosh --host configsvr -u root -p <pw> admin with db.getUsers() to
    verify the node is available

Step 3 is the problem: configsvr's port may be open while the replica set is still
initializing (primary election, auth user creation). The mongosh call has no timeout
in the entrypoint (mongodb_execute_print_output in libmongodb-sharded.sh), so it
blocks forever waiting for a usable authenticated session.
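The three steps above can be condensed into a shell sketch. Names are simplified for illustration: the real entrypoint calls wait-for-port and routes the client through mongodb_execute_print_output in libmongodb-sharded.sh.

```shell
#!/usr/bin/env bash
# Condensed sketch of the mongos entrypoint's dependency wait
# (simplified names; not the verbatim Bitnami script).
wait_for_configsvr() {
  # Step 1: returns as soon as the TCP port is open, even if the
  # replica set behind it is still initializing.
  wait_for_port configsvr 27017
  # Step 2:
  echo "Found MongoDB server listening at configsvr:27017 !"
  # Step 3: no timeout around this call; if configsvr is still electing
  # a primary or creating the root user, it blocks forever.
  mongosh --host configsvr -u root -p "$MONGODB_ROOT_PASSWORD" \
    admin --eval 'db.getUsers()'
}
```

The gap between steps 1 and 3 is the race window: the port check passes on a half-initialized configsvr, and nothing bounds the authenticated check that follows.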

Meanwhile, shard0-data-0 completes its own replica set init, stops mongod, and tries
to connect to mongos to register itself as a shard. Since mongos never started, shard0
loops on "timeout reached before the port went into state inuse". The liveness probe
(pgrep mongod, initialDelaySeconds: 60, failureThreshold: 2) kills shard0 every
~2 minutes because mongod was stopped for reconfiguration. The 5-minute rollout
timeout on mongos expires, and the job fails.

This was confirmed across 3 consecutive CI attempts (run 23301276674, attempts 2-4),
all showing identical behavior: configsvr healthy, mongos hung after configsvr
connection, shard0 crash-looping from liveness kills.

Why it's flaky (not deterministic): the race window is narrow. 80% of the time,
configsvr completes replica set init before mongos reaches the mongosh auth check.
20% of the time, mongos wins the race and hangs.

Solution

Sequence the deploy so mongos only starts after configsvr is fully ready:

  1. kubectl apply the full manifest (all StatefulSets created)
  2. Immediately scale mongos to 0 replicas (configsvr and shards start normally)
  3. Wait for configsvr rollout (readiness probe = mongosh auth succeeds = replica set
    fully initialized)
  4. Scale mongos back to its original replica count
  5. Wait for mongos and shard rollouts

This ensures that when mongos starts, configsvr is guaranteed to be fully initialized,
so the mongosh auth check succeeds immediately.
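A minimal sketch of that sequence follows. The namespace, manifest path, StatefulSet names, and timeouts are illustrative; they must match whatever the chart actually renders in CI.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of the sequenced deploy (illustrative resource names).
deploy_sharded_mongodb() {
  local ns="$1" manifest="$2"

  # 1. Create all StatefulSets (configsvr, mongos, shards) in one apply.
  kubectl apply -n "$ns" -f "$manifest"

  # 2. Park mongos at 0 replicas before its pod can race configsvr,
  #    remembering the original replica count.
  local replicas
  replicas=$(kubectl get statefulset mongodb-sharded-mongos -n "$ns" \
    -o jsonpath='{.spec.replicas}')
  kubectl scale statefulset mongodb-sharded-mongos -n "$ns" --replicas=0

  # 3. Wait for configsvr: its readiness probe only passes once an
  #    authenticated mongosh session works, i.e. the replica set is
  #    fully initialized.
  kubectl rollout status statefulset mongodb-sharded-configsvr -n "$ns" --timeout=300s

  # 4. Restore mongos to its original replica count.
  kubectl scale statefulset mongodb-sharded-mongos -n "$ns" --replicas="$replicas"

  # 5. Wait for mongos and the shard data nodes.
  kubectl rollout status statefulset mongodb-sharded-mongos -n "$ns" --timeout=300s
  kubectl rollout status statefulset mongodb-sharded-shard0-data -n "$ns" --timeout=300s
}
```

Capturing the replica count before scaling to zero keeps the script agnostic to how many mongos replicas the manifest declares.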

Alternatives considered

Enable startupProbe on mongos and shard data nodes: The Bitnami chart ships with
startupProbe.enabled: false for mongos and shards (only configsvr has it enabled).
Enabling it would prevent the liveness probe from killing shard0 during init, giving
more time for the deadlock to self-resolve. However, this only treats the symptom — it
gives more time but doesn't prevent the mongosh hang on mongos. If configsvr is slow
enough, mongos would still hang indefinitely. We may still want to enable startupProbes
as a defense-in-depth measure separately.

Wrap mongosh with timeout in the entrypoint: Would fix the root cause (the
missing timeout), but requires patching the Bitnami container image or injecting a
custom entrypoint script. Higher maintenance burden for a vendored chart.
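For reference, such a wrapper would only need to bound the blocking call so that the surrounding retry loop regains control. This is a sketch, not chart code: the wrapper name and the 10-second default are illustrative choices.

```shell
# Illustrative wrapper: `timeout` (GNU coreutils) turns an indefinitely
# blocking call into exit status 124, so an outer retry loop can
# actually retry. The bound is configurable; 10s is an arbitrary default.
check_with_timeout() {
  timeout "${CHECK_TIMEOUT:-10}s" "$@"
}
# Intended use inside the entrypoint (mongosh invocation shown for
# illustration):
#   check_with_timeout mongosh --host configsvr -u root -p "$pw" \
#     admin --eval 'db.getUsers()'
```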

Increase MONGODB_INIT_RETRY_ATTEMPTS / MONGODB_INIT_RETRY_DELAY: These control
the retry_while wrapper around the mongosh call. However, since mongosh itself
blocks indefinitely (the first attempt never returns), retry_while never gets a
chance to retry. These settings would have no effect on the hang.

Split the manifest and apply resources separately: Would also work, but requires
yq to filter multi-document YAML by resource kind. The scale-to-zero approach is
simpler — it uses only kubectl and doesn't require parsing the manifest.

Delay mongos startup until configsvr is fully ready to avoid a race
condition in the Bitnami mongodb-sharded entrypoint that causes ~20%
CI failure rate on ctst-end2end-sharded.

Issue: ZENKO-5229
bert-e commented Mar 20, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

  • /after_pull_request: Wait for the given pull request id to be merged before continuing with the current one.
  • /bypass_author_approval: Bypass the pull request author's approval
  • /bypass_build_status: Bypass the build and test status
  • /bypass_commit_size: Bypass the check on the size of the changeset
  • /bypass_incompatible_branch: Bypass the check on the source branch prefix
  • /bypass_jira_check: Bypass the Jira issue check
  • /bypass_peer_approval: Bypass the pull request peers' approval
  • /bypass_leader_approval: Bypass the pull request leaders' approval
  • /approve: Instruct Bert-E that the author has approved the pull request.
  • /create_pull_requests: Allow the creation of integration pull requests.
  • /create_integration_branches: Allow the creation of integration branches.
  • /no_octopus: Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead
  • /unanimity: Change review acceptance criteria from one reviewer at least to all reviewers
  • /wait: Instruct Bert-E not to run until further notice.

Available commands

  • /help: Print Bert-E's manual in the pull request.
  • /status: Print Bert-E's current status in the pull request
  • /clear: Remove all comments from Bert-E from the history
  • /retry: Re-start a fresh build
  • /build: Re-start a fresh build
  • /force_reset: Delete integration branches & pull requests, and restart the merge process from the beginning.
  • /reset: Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

bert-e commented Mar 20, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

@delthas delthas requested a review from SylvainSenechal March 20, 2026 11:25
@delthas delthas requested review from a team, SylvainSenechal and maeldonn March 20, 2026 13:24
delthas commented Mar 20, 2026

/approve

bert-e commented Mar 20, 2026

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.14

The following branches will NOT be impacted:

  • development/2.10
  • development/2.11
  • development/2.12
  • development/2.13
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

bert-e commented Mar 21, 2026

I have successfully merged the changeset of this pull request
into the targeted development branches:

  • ✔️ development/2.14

The following branches have NOT changed:

  • development/2.10
  • development/2.11
  • development/2.12
  • development/2.13
  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

Please check the status of the associated issue ZENKO-5229.

Goodbye delthas.

@bert-e bert-e merged commit 5a53451 into development/2.14 Mar 21, 2026
26 of 28 checks passed
@bert-e bert-e deleted the bugfix/ZENKO-5229/fix-mongos-startup-race branch March 21, 2026 17:21