fix: bump Cruise Control to 2.5.123 for cgroup v2 compatibility #2359
francoisferrand left a comment:
Should we also switch to a "newer" (if any) Adobe fork of this image?
Adobe has newer versions, but the switch to 3.0.3 sounds more involved. I'd suggest keeping the small increment for this MR, which is enough to get rid of the crash loop in Zenko CI, and consider switching to the latest image when working on the move to the Adobe koperator fork.
I have successfully merged the changeset of this pull request. Please check the status of the associated issue ZENKO-5227. Goodbye delthas.
Summary
The Cruise Control pod (end2end-base-queue-cruisecontrol) enters CrashLoopBackOff on hosts with cgroup v2 (e.g. kernel 6.19+). This PR bumps the Cruise Control image from 2.5.101 to 2.5.123 to resolve it.
Fixes ZENKO-5227
Scope of impact
This only affects local development environments and the Zenko CI pipeline, that is, anywhere the cluster runs on a host kernel that defaults to cgroup v2 (e.g. Arch Linux, Fedora, Ubuntu 22.04+, or any kernel 5.2+ with cgroup v2 enabled). ARTESCA production clusters are not affected: ARTESCA deploys on Rocky Linux 8.10 (kernel 4.18), which uses cgroup v1. The JDK's CgroupV2Subsystem code path is never reached on cgroup v1 hosts, so the bug cannot trigger there.
That said, this would become a blocker if ARTESCA ever moves to RHEL 9 / Rocky 9 (kernel 5.14+, cgroup v2 by default).
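To see which side of this split a given host or container falls on, here is a minimal sketch. It assumes the conventional /sys/fs/cgroup mount point and relies on the fact that the root of a cgroup v2 unified hierarchy contains a cgroup.controllers file, which cgroup v1 mounts do not have. The class and method names are illustrative and not part of this PR; hybrid setups that mount v2 somewhere else are not handled.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper (not part of this PR): guesses the cgroup version
// from the layout of the cgroup mount point.
class CgroupVersionCheck {

    // Returns true when the given directory looks like the root of a
    // cgroup v2 unified hierarchy (it contains a cgroup.controllers file).
    static boolean isCgroupV2(Path cgroupRoot) {
        return Files.isRegularFile(cgroupRoot.resolve("cgroup.controllers"));
    }

    public static void main(String[] args) {
        // Assumption: cgroups are mounted at the conventional location
        // unless another path is passed as the first argument.
        Path root = Path.of(args.length > 0 ? args[0] : "/sys/fs/cgroup");
        String verdict = isCgroupV2(root) ? "cgroup v2" : "cgroup v1 or not mounted here";
        System.out.println(root + " -> " + verdict);
    }
}
```

This check only reads the filesystem and does not touch the JDK's own cgroup detection, so it should be usable even on a JDK affected by the bug described here.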
The error
The pod crashes on startup with a fatal JVM abort:
The JMX Prometheus Java agent (-javaagent) runs during JVM startup. It calls DefaultExports.initialize() → ManagementFactory.getOperatingSystemMXBean() → CgroupV2Subsystem.getInstance(), which throws a NullPointerException when parsing cgroup v2 filesystem entries. Since Java treats agent premain failures as fatal, the entire JVM aborts before Cruise Control can start.
Root cause
The Cruise Control image 2.5.101 ships JDK 11.0.16.1 (Eclipse Temurin), which has a bug in its cgroup v2 support. The CgroupV2Subsystem.getInstance() method fails with an NPE on newer kernels (verified on 6.19.8-arch1-1 with cgroup2 mounts). This is reproducible even without the JMX agent: simply running java -XshowSettings:system in the container triggers the same crash.
This is not a JMX exporter version issue. The bug is in the JDK itself.
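The JDK entry point named in the call chain can also be exercised directly, with no agent or exporter on the classpath. The following is a minimal sketch (the class name is illustrative) that only calls ManagementFactory.getOperatingSystemMXBean() and prints a few values. Going by the description above, on an affected JDK and host this call would be expected to fail, while on a fixed JDK it should print normally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Illustrative probe (not part of this PR): touches the same JDK entry
// point that the description says the JMX agent reaches during premain.
class OsMxBeanProbe {

    // Obtaining the bean is the step that, per the description, leads
    // into the JDK's cgroup detection on container-aware JDKs.
    static OperatingSystemMXBean probe() {
        return ManagementFactory.getOperatingSystemMXBean();
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = probe();
        System.out.println("java.version        = " + System.getProperty("java.version"));
        System.out.println("availableProcessors = " + os.getAvailableProcessors());
        System.out.println("systemLoadAverage   = " + os.getSystemLoadAverage());
    }
}
```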
The fix
Bump cruise-control from 2.5.101 to 2.5.123 in solution/deps.yaml. The 2.5.123 image ships JDK 17.0.7 (Eclipse Temurin), which handles cgroup v2 correctly. Verified by running java -XshowSettings:system in a 2.5.123 container: it reads cgroup v2 metrics without error.
The upstream changes between Cruise Control 2.5.101 and 2.5.123 (LinkedIn's cruise-control) are patch-level: CVE dependency bumps (snakeyaml, scala, Netty, org.json), bug fixes (leader CPU util, offline partitions, concurrency adjuster NPE), and non-breaking additions (partition movement metrics, per-broker concurrency adjuster). No config format changes, no removed APIs. The docker-cruise-control Dockerfile diff between the two tags is just a JDK 11→17 bump, a Node 16→20 bump (build-time only), and OCI labels.
Alternatives considered
Add -XX:-UseContainerSupport to KAFKA_OPTS: This disables the JDK's container/cgroup detection entirely, sidestepping the crash. Confirmed working. Rejected because it also disables memory/CPU limit awareness: the JVM would ignore container resource constraints, which could cause OOM kills or CPU overuse.
Upgrade only the JMX exporter (from 0.16.1 to 0.17.1+): Initially suspected as the fix, but the bug is in the JDK, not the exporter. Upgrading the exporter alone would not help since CgroupV2Subsystem.getInstance() is called by the JDK's ManagementFactory, not by the exporter directly.
Pin to a patched JDK 11 build: JDK 11.0.19+ has cgroup v2 fixes. However, there is no cruise-control image built with a patched JDK 11; Banzai moved to JDK 17 starting with 2.5.113. Building a custom image would add maintenance burden for no benefit over using the upstream 2.5.123.
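To make the trade-off of the -XX:-UseContainerSupport alternative concrete, here is a small sketch (the class name is illustrative) that prints the CPU count and maximum heap the JVM has settled on. The idea is to run it inside a resource-limited container twice, once normally and once with -XX:-UseContainerSupport. With the flag set, the values would be expected to reflect the host rather than the container limits, which is the behaviour the rejection above is concerned about.

```java
// Illustrative report (not part of this PR): shows the resource view
// the JVM derived at startup, which container support influences.
class JvmLimitsReport {

    // Summarises the CPU count and maximum heap size seen by this JVM.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long maxHeapMiB = rt.maxMemory() / (1024L * 1024L);
        return "availableProcessors=" + rt.availableProcessors()
                + " maxHeapMiB=" + maxHeapMiB;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```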