fix: bump Cruise Control to 2.5.123 for cgroup v2 compatibility by delthas · Pull Request #2359 · scality/Zenko

delthas · 2026-03-19T14:01:26Z

Summary

The Cruise Control pod (end2end-base-queue-cruisecontrol) enters CrashLoopBackOff on hosts with cgroup v2 (e.g. kernel 6.19+). This PR bumps the Cruise Control image from 2.5.101 to 2.5.123 to resolve it.

Fixes ZENKO-5227

Scope of impact

This only affects local development environments and the Zenko CI pipeline — anywhere the cluster runs on a host kernel that defaults to cgroup v2 (e.g. Arch Linux, Fedora, Ubuntu 22.04+, or any kernel 5.2+ with cgroup v2 enabled). ARTESCA production clusters are not affected: ARTESCA deploys on Rocky Linux 8.10 (kernel 4.18), which uses cgroup v1. The JDK's CgroupV2Subsystem code path is never reached on cgroup v1 hosts, so the bug cannot trigger there.

That said, this would become a blocker if ARTESCA ever moves to RHEL 9 / Rocky 9 (kernel 5.14+, cgroup v2 by default).
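To check which side of the cgroup v1/v2 line a given host falls on, a common heuristic is to look for a cgroup.controllers file at the root of /sys/fs/cgroup. That file is present when the unified (v2) hierarchy is mounted there and absent on cgroup v1 hosts such as Rocky Linux 8. The sketch below assumes that heuristic and uses a hypothetical class name, CgroupVersion. It is not the JDK's own detection logic, which (as far as we know) works from /proc/cgroups and /proc/self/mountinfo inside CgroupSubsystemFactory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CgroupVersion {
    // Heuristic (assumption): a cgroup2 mount at the root of /sys/fs/cgroup
    // exposes a cgroup.controllers file. Hybrid setups (v2 mounted only under
    // /sys/fs/cgroup/unified) are reported as "v1" here.
    static String detect(Path cgroupRoot) {
        if (Files.exists(cgroupRoot.resolve("cgroup.controllers"))) {
            return "v2";
        }
        if (Files.isDirectory(cgroupRoot)) {
            return "v1";
        }
        return "none"; // no cgroup filesystem found at the given path
    }

    public static void main(String[] args) {
        System.out.println("cgroup: " + detect(Paths.get("/sys/fs/cgroup")));
    }
}
```

Running it on the host (or inside the pod) gives a quick indication of whether the CgroupV2Subsystem code path is reachable at all.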

The error

The pod crashes on startup with a fatal JVM abort:

Exception in thread "main" java.lang.reflect.InvocationTargetException
Caused by: java.lang.NullPointerException
    at jdk.internal.platform.cgroupv2.CgroupV2Subsystem.getInstance(Unknown Source)
    at jdk.internal.platform.CgroupSubsystemFactory.create(Unknown Source)
    ...
    at java.lang.management.ManagementFactory.getOperatingSystemMXBean(Unknown Source)
    at io.prometheus.jmx.shaded.io.prometheus.client.hotspot.StandardExports.<init>(StandardExports.java:43)
    at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:30)

FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed

The JMX Prometheus Java agent (-javaagent) runs during JVM startup. It calls DefaultExports.initialize() → ManagementFactory.getOperatingSystemMXBean() → CgroupV2Subsystem.getInstance(), which throws a NullPointerException when parsing cgroup v2 filesystem entries. Since Java treats agent premain failures as fatal, the entire JVM aborts before Cruise Control can start.
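The failing call can be isolated from the agent. The minimal sketch below (hypothetical class name OsBeanProbe) calls the same JDK entry point that StandardExports reaches during premain, ManagementFactory.getOperatingSystemMXBean(). On an unaffected JDK it simply returns the platform bean; on an affected JDK 11 build running on a cgroup v2 host, the failure would be expected to surface from probe() in ordinary application code rather than as a fatal -javaagent abort. We have not run this sketch against the 2.5.101 image; the PR's own reproducer is java -XshowSettings:system.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OsBeanProbe {
    // Same JDK call that the JMX exporter's StandardExports constructor makes;
    // on affected builds, cgroup subsystem initialization happens underneath it.
    static OperatingSystemMXBean probe() {
        return ManagementFactory.getOperatingSystemMXBean();
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = probe();
        System.out.println(os.getName() + " " + os.getArch()
                + " cpus=" + os.getAvailableProcessors());
    }
}
```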

Root cause

The Cruise Control image 2.5.101 ships JDK 11.0.16.1 (Eclipse Temurin), which has a bug in its cgroup v2 support. The CgroupV2Subsystem.getInstance() method fails with an NPE on newer kernels (verified on 6.19.8-arch1-1 with cgroup2 mounts). This is reproducible even without the JMX agent — simply running java -XshowSettings:system in the container triggers the same crash.

This is not a JMX exporter version issue. The bug is in the JDK itself.

The fix

Bump cruise-control from 2.5.101 to 2.5.123 in solution/deps.yaml. The 2.5.123 image ships JDK 17.0.7 (Eclipse Temurin), which handles cgroup v2 correctly. Verified by running java -XshowSettings:system in a 2.5.123 container — it reads cgroup v2 metrics without error.
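The version facts in this PR (11.0.16.1 crashes, 11.0.19+ carries cgroup v2 fixes, 17.0.7 works) can be encoded as a quick runtime check. The sketch below is a heuristic built only from those data points, not an official compatibility matrix; the class and method names are hypothetical, and treating every feature release after 17 as safe is an assumption. A false result means "not known to be safe", not "known broken".

```java
public class JdkCgroupCheck {
    // Heuristic from this PR's findings (assumption, not an upstream matrix):
    //   JDK 11: 11.0.16.1 crashes, 11.0.19+ reported fixed
    //   JDK 17: 17.0.7 verified working
    //   later feature releases: assumed to keep the fix
    static boolean reportedCgroupV2Safe(Runtime.Version v) {
        int feature = v.feature();
        int update = v.update();
        if (feature == 11) {
            return update >= 19;
        }
        if (feature == 17) {
            return update >= 7;
        }
        return feature > 17;
    }

    public static void main(String[] args) {
        Runtime.Version current = Runtime.version();
        System.out.println(current + " reported cgroup v2 safe: "
                + reportedCgroupV2Safe(current));
    }
}
```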

The upstream changes between Cruise Control 2.5.101 and 2.5.123 (LinkedIn's cruise-control) are patch-level: CVE dependency bumps (snakeyaml, scala, Netty, org.json), bug fixes (leader CPU util, offline partitions, concurrency adjuster NPE), and non-breaking additions (partition movement metrics, per-broker concurrency adjuster). No config format changes, no removed APIs. The docker-cruise-control Dockerfile diff between the two tags is just a JDK 11→17 bump, a Node 16→20 bump (build-time only), and OCI labels.

Alternatives considered

Add -XX:-UseContainerSupport to KAFKA_OPTS: This disables the JDK's container/cgroup detection entirely, sidestepping the crash. Confirmed working. Rejected because it also disables memory/CPU limit awareness — the JVM would ignore container resource constraints, which could cause OOM kills or CPU overuse.
Upgrade only the JMX exporter (from 0.16.1 to 0.17.1+): Initially suspected as the fix, but the bug is in the JDK, not the exporter. Upgrading the exporter alone would not help since CgroupV2Subsystem.getInstance() is called by the JDK's ManagementFactory, not by the exporter directly.
Pin to a patched JDK 11 build: JDK 11.0.19+ has cgroup v2 fixes. However, there is no cruise-control image built with a patched JDK 11 — Banzai moved to JDK 17 starting with 2.5.113. Building a custom image would add maintenance burden for no benefit over using the upstream 2.5.123.
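The trade-off behind rejecting -XX:-UseContainerSupport can be observed from inside a running JVM. The sketch below (hypothetical class name ContainerSupportCheck) reads the flag through HotSpotDiagnosticMXBean and prints the CPU count and max heap that container awareness influences; comparing its output with and without the flag in a resource-limited pod shows whether the JVM is honoring the container limits. It assumes a HotSpot JVM; the flag is Linux-only, so elsewhere the method returns null.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class ContainerSupportCheck {
    // Returns "true" or "false" for -XX:UseContainerSupport, or null when the
    // running JVM does not expose that flag (HotSpot defines it on Linux only).
    static String useContainerSupport() {
        try {
            HotSpotDiagnosticMXBean diag =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (diag == null) {
                return null;
            }
            return diag.getVMOption("UseContainerSupport").getValue();
        } catch (IllegalArgumentException e) {
            // Unknown flag, or no HotSpot diagnostic bean on this JVM.
            return null;
        }
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // With container support on, the JVM derives these defaults from the
        // container's cgroup limits; with -XX:-UseContainerSupport it derives
        // them from the host (an explicit -Xmx still determines maxMemory).
        System.out.println("UseContainerSupport=" + useContainerSupport());
        System.out.println("availableProcessors=" + rt.availableProcessors());
        System.out.println("maxMemory=" + rt.maxMemory());
    }
}
```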

Issue: ZENKO-5227

bert-e · 2026-03-19T14:01:31Z

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

name	description	privileged	authored
`/after_pull_request`	Wait for the given pull request id to be merged before continuing with the current one.
`/bypass_author_approval`	Bypass the pull request author's approval	⭐
`/bypass_build_status`	Bypass the build and test status	⭐
`/bypass_commit_size`	Bypass the check on the size of the changeset `TBA`	⭐
`/bypass_incompatible_branch`	Bypass the check on the source branch prefix	⭐
`/bypass_jira_check`	Bypass the Jira issue check	⭐
`/bypass_peer_approval`	Bypass the pull request peers' approval	⭐
`/bypass_leader_approval`	Bypass the pull request leaders' approval	⭐
`/approve`	Instruct Bert-E that the author has approved the pull request.		✍️
`/create_pull_requests`	Allow the creation of integration pull requests.
`/create_integration_branches`	Allow the creation of integration branches.
`/no_octopus`	Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
`/unanimity`	Change review acceptance criteria from `one reviewer at least` to `all reviewers`
`/wait`	Instruct Bert-E not to run until further notice.

Available commands

name	description	privileged
`/help`	Print Bert-E's manual in the pull request.
`/status`	Print Bert-E's current status in the pull request `TBA`
`/clear`	Remove all comments from Bert-E from the history `TBA`
`/retry`	Re-start a fresh build `TBA`
`/build`	Re-start a fresh build `TBA`
`/force_reset`	Delete integration branches & pull requests, and restart merge process from the beginning.
`/reset`	Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

bert-e · 2026-03-19T14:01:39Z

Waiting for approval

The following approvals are needed before I can proceed with the merge:

the author
2 peers

francoisferrand

should we also switch to a "newer" (if any) adobe fork of this image?

delthas · 2026-03-20T08:19:42Z

should we also switch to a "newer" (if any) adobe fork of this image?

Adobe has newer versions:

┌──────────────────────────┬────────────┬──────────────────────────────┐
│           Tag            │ CC Version │             Date             │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 3.0.3-adbe-20251008      │ 3.0.3      │ Oct 2025                     │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 3.0.3-adbe-20250804      │ 3.0.3      │ Aug 2025 (koperator default) │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 2.5.133-adbe-20250818-rc │ 2.5.133    │ Aug 2025 (RC)                │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 2.5.133-adbe-20240806-rc │ 2.5.133    │ Aug 2024 (RC)                │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ 2.5.133-adbe-20240313    │ 2.5.133    │ Mar 2024                     │
└──────────────────────────┴────────────┴──────────────────────────────┘

But the switch to 3.0.3 sounds more involved. I'd suggest keeping the small increment for this MR, enough to get rid of the crash loop in Zenko CI, and considering a switch to the latest image when working on the move to the Adobe koperator fork.

delthas · 2026-03-20T08:19:50Z

/approve

bert-e · 2026-03-20T08:20:00Z

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

✔️ development/2.14

The following branches will NOT be impacted:

development/2.10
development/2.11
development/2.12
development/2.13
development/2.5
development/2.6
development/2.7
development/2.8
development/2.9

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

Any commit you add on the source branch will trigger a new cycle after the
current queue is merged.
Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

bert-e · 2026-03-20T12:27:41Z

I have successfully merged the changeset of this pull request
into targeted development branches:

✔️ development/2.14

The following branches have NOT changed:

development/2.10
development/2.11
development/2.12
development/2.13
development/2.5
development/2.6
development/2.7
development/2.8
development/2.9

Please check the status of the associated issue ZENKO-5227.

Goodbye delthas.

fix: bump Cruise Control 2.5.101 -> 2.5.123 for cgroup v2 compat

2247e0a

Issue: ZENKO-5227

SylvainSenechal approved these changes Mar 19, 2026


delthas requested review from a team, benzekrimaha and maeldonn March 19, 2026 14:05

francoisferrand approved these changes Mar 19, 2026


maeldonn approved these changes Mar 19, 2026


scality deleted a comment from bert-e Mar 20, 2026

bert-e merged commit b466cea into development/2.14 Mar 20, 2026
51 of 56 checks passed

bert-e deleted the bugfix/ZENKO-5227/fix-cc-crashloop-cgroupv2 branch March 20, 2026 12:27
