Description
Bug Report
The leader election process managed by JOSDK left a deployment with two replicas without a leader because etcd had performance issues. When we manually restarted the existing replicas (20 minutes later), they were able to elect a leader again.
What did you do?
- We have a JOSDK operator running as a Kubernetes Deployment with two replicas for high availability
- We use io.javaoperatorsdk.operator.api.config.LeaderElectionConfiguration to keep only one replica as the leader (see the initialisation code in the Additional context below)
What did you expect to see?
- If we have one pod running, it becomes the leader
- If we have more than one pod running, a leader election happens using a Kubernetes Lease (a sketch of how to inspect that Lease follows this list)
- We can have short periods without a leader, depending on the time configuration (lease duration)
- If something "bad" happens, we could eventually end up with no leader, but it should recover automatically within a short period
What did you see instead? Under which circumstances?
We had a period of bad etcd performance in our k8s cluster. During a short time, the cluster got more than 100 failed etcd proposals. This affected our operator (and probably many other systems).
The application logs for the leader pod show this error: io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null
Detailed stack trace
2026-01-30 03:17:24,919{UTC} [pool-7-thread-3] WARN i.f.k.c.e.l.LeaderElector - Exception occurred while acquiring lock 'LeaseLock: my-namespace - my-lease (my-lease-pod) retrying...'
io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:509)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:467)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleGet(BaseOperation.java:792)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.requireFromServer(BaseOperation.java:193)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:149)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:98)
at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ResourceLock.get(ResourceLock.java:48)
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:227)
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$renewWithTimeout$6(LeaderElector.java:207)
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$8(LeaderElector.java:292)
at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
at util.TokenAwareRunnable.run(TokenAwareRunnable.java:28)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: io.vertx.core.impl.NoStackTraceTimeoutException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null
At that point, the pod lost the leadership. Checking the logs of both available replicas, no special activity was found:
- Neither of the replicas took the lead
- Both replicas stayed alive during that time
This situation continued for 20 minutes, until we manually restarted the pods. After the restart, one of them took the leadership.
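For context, leaderElectionService.createLeaderCallbacks() (used in the initialisation code below) returns callbacks along these lines. This is a simplified, hypothetical sketch built on fabric8's LeaderCallbacks rather than the exact production code, but it illustrates the leadership-transition log lines we would expect to see around such an event:

import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LeaderElectionService {

    private static final Logger log = LoggerFactory.getLogger(LeaderElectionService.class);

    // Hypothetical sketch of createLeaderCallbacks(); the real implementation may differ.
    public LeaderCallbacks createLeaderCallbacks() {
        return new LeaderCallbacks(
                () -> log.info("Started leading"),                           // onStartLeading
                () -> log.warn("Stopped leading"),                           // onStopLeading
                newLeader -> log.info("New leader elected: {}", newLeader)); // onNewLeader
    }
}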
Environment
Kubernetes cluster type:
vanilla in Azure
$ Mention java-operator-sdk version from pom.xml file
JOSDK version 5.1.5
$ java -version
openjdk version "21.0.10" 2026-01-20 LTS
OpenJDK Runtime Environment Corretto-21.0.10.7.1 (build 21.0.10+7-LTS)
OpenJDK 64-Bit Server VM Corretto-21.0.10.7.1 (build 21.0.10+7-LTS, mixed mode, sharing)
$ kubectl version
Client Version: v1.31.13
Kustomize Version: v5.4.2
Server Version: v1.30.5
Possible Solution
I don't have a concrete proposal here, only an experience with JOSDK that suggests a possible improvement. I know the problem was probably caused on the k8s side, but focusing on the leader election feature, what I expected was automatic recovery within a short period (at most a few minutes), and this did not happen until the manual restart.
Additional context
Just in case it is required:
Configuration used
- Lease duration: 15s
- Renew deadline: 10s
- Retry period: 2s
Initialisation code
LeaderElectionConfiguration lec = new LeaderElectionConfiguration(
        leConfig.leaseName(),
        namespace,
        leConfig.leaseDurationDuration(),
        leConfig.renewDeadlineDuration(),
        leConfig.retryPeriodDuration(),
        podName,
        leaderElectionService.createLeaderCallbacks(),
        false // exitOnStopLeading - continue running when losing leadership
);

Operator operator = new Operator(override -> {
    override.checkingCRDAndValidateLocalModel(true);
    override.withConcurrentReconciliationThreads(1);
    override.withReconciliationTerminationTimeout(Duration.ofSeconds(30));
    override.withLeaderElectionConfiguration(lec);
});
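One thing that might be worth trying on our side, shared here only as a hedged sketch: the 10000ms in the stack trace appears to match the fabric8 client's default request timeout, which is the same value as our renew deadline, so a single lease GET can only fail after the deadline has effectively passed. Shortening the request timeout would give the elector a chance to retry within the renew deadline. This assumes ConfigurationServiceOverrider#withKubernetesClient is available in the JOSDK version we use; it is not what we currently run:

// Sketch only: build a fabric8 client with an explicit, shorter request timeout
// and hand it to the Operator via the configuration overrider.
io.fabric8.kubernetes.client.KubernetesClient client =
        new io.fabric8.kubernetes.client.KubernetesClientBuilder()
                .withConfig(new io.fabric8.kubernetes.client.ConfigBuilder()
                        .withRequestTimeout(5_000) // milliseconds, shorter than the 10s renew deadline
                        .build())
                .build();

Operator operator = new Operator(override -> {
    override.withKubernetesClient(client);
    override.withLeaderElectionConfiguration(lec);
});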