
Leader election does not auto-recover after lease acquisition timeout during etcd outage #3147

@dariocazas

Description


Bug Report

The leader election process managed by JOSDK left a deployment with two replicas without a leader after etcd had performance issues. When we manually restarted the existing replicas (20 minutes later), one of them was able to take over the leadership.

What did you do?

  • We have a JOSDK operator running as a k8s deployment with two replicas for high availability
  • We use io.javaoperatorsdk.operator.api.config.LeaderElectionConfiguration to keep only one replica as the leader

What did you expect to see?

  • If only one pod is running, it is the leader
  • If more than one pod is running, a leader election takes place using a k8s lease (a minimal sketch of that mechanism follows this list)
  • We can have short periods without a leader, depending on the time configuration (lease duration)
  • If something "bad" happens, we could eventually end up with no leader, but it should recover automatically within a short time
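
To make the expectation concrete: JOSDK delegates lease-based election to the fabric8 LeaderElector (visible in the stack trace below). A minimal standalone sketch with our timings would look roughly like this; the fabric8 class names are real, while the lease name, namespace and identity are placeholders:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;

import java.time.Duration;

public class LeaseElectionSketch {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.leaderElector()
                .withConfig(new LeaderElectionConfigBuilder()
                    .withLock(new LeaseLock("my-namespace", "my-lease", "my-lease-pod"))
                    // Same timings as our operator (see "Configuration used" below)
                    .withLeaseDuration(Duration.ofSeconds(15))
                    .withRenewDeadline(Duration.ofSeconds(10))
                    .withRetryPeriod(Duration.ofSeconds(2))
                    .withLeaderCallbacks(new LeaderCallbacks(
                        () -> System.out.println("started leading"),
                        () -> System.out.println("stopped leading"),
                        newLeader -> System.out.println("observed leader: " + newLeader)))
                    .build())
                .build()
                .run(); // blocks: keeps trying to acquire and then renew the Lease
        }
    }
}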

What did you see instead? Under which circumstances?

We had a period of bad etcd performance in our k8s cluster. Within a short time window, the cluster accumulated more than 100 failed etcd proposals. This affected our operator (and probably many other systems).

The application logs of the leader pod show this error: io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null

Detailed stack trace
2026-01-30 03:17:24,919{UTC} [pool-7-thread-3] WARN  i.f.k.c.e.l.LeaderElector - Exception occurred while acquiring lock 'LeaseLock: my-namespace - my-lease (my-lease-pod) retrying...'
    io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:509)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:467)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleGet(BaseOperation.java:792)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.requireFromServer(BaseOperation.java:193)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:149)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:98)
    at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ResourceLock.get(ResourceLock.java:48)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:227)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$renewWithTimeout$6(LeaderElector.java:207)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$8(LeaderElector.java:292)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at util.TokenAwareRunnable.run(TokenAwareRunnable.java:28)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: io.vertx.core.impl.NoStackTraceTimeoutException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null
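
For reference, the 10000ms in the exception appears to be the fabric8 client's HTTP request timeout (10 s by default), which in our case happens to coincide with the 10 s renew deadline. Raising it would not explain or fix the 20-minute hang, but if someone wants to rule that coincidence out, the request timeout can be changed when building the client; a sketch (the 30 s value is only illustrative, and we have not verified it changes the behaviour):

import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ClientWithLongerRequestTimeout {
    public static KubernetesClient build() {
        // Request timeout is expressed in milliseconds; the fabric8 default is 10_000
        Config config = new ConfigBuilder()
                .withRequestTimeout(30_000)
                .build();
        return new KubernetesClientBuilder().withConfig(config).build();
    }
}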

At this point, the pod lost the leadership. Checking the logs of both available replicas, no special activity was found:

  • None of the replicas took the lead
  • Both replicas stayed alive during that time

This situation continued for 20 minutes, until we manually restarted the pods. After the restart, one of them took the leadership.

Environment

Kubernetes cluster type:

Vanilla Kubernetes on Azure

$ Mention java-operator-sdk version from pom.xml file

JOSDK version 5.1.5

$ java -version
openjdk version "21.0.10" 2026-01-20 LTS
OpenJDK Runtime Environment Corretto-21.0.10.7.1 (build 21.0.10+7-LTS)
OpenJDK 64-Bit Server VM Corretto-21.0.10.7.1 (build 21.0.10+7-LTS, mixed mode, sharing)

$ kubectl version
Client Version: v1.31.13
Kustomize Version: v5.4.2
Server Version: v1.30.5

Possible Solution

I don't have a concrete proposal here, I'm only sharing an experience with JOSDK that suggests a possible improvement. I know the problem was probably caused on the k8s side, but focusing on the leader election feature, what I expected was an automatic recovery within a short period (at most a few minutes), and this did not happen until the manual restart.
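
One stopgap I can imagine on the application side (not a proposal for JOSDK itself): wire the restart we did manually into the leader callbacks, so that a replica which stops leading and never observes a new leader exits after a grace period and lets the Deployment recreate it. A rough sketch, reusing the fabric8 LeaderCallbacks we already pass through LeaderElectionConfiguration; the class, the GRACE value and the exit policy are our own assumptions, not JOSDK behaviour:

import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class SelfHealingLeaderCallbacks {

    // How long we tolerate "nobody is leading" before restarting this replica (tune to your SLO)
    private static final Duration GRACE = Duration.ofMinutes(2);

    private final ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean leaderObserved = new AtomicBoolean(false);

    public LeaderCallbacks create() {
        return new LeaderCallbacks(
            // onStartLeading: this replica became the leader
            () -> leaderObserved.set(true),
            // onStopLeading: start a grace timer; if no leader shows up, restart the pod
            () -> {
                leaderObserved.set(false);
                watchdog.schedule(() -> {
                    if (!leaderObserved.get()) {
                        // Exit non-zero so Kubernetes restarts the container,
                        // mimicking the manual restart that recovered leadership for us
                        System.exit(1);
                    }
                }, GRACE.toSeconds(), TimeUnit.SECONDS);
            },
            // onNewLeader: any replica (including this one) was observed holding the lease
            newLeader -> leaderObserved.set(true));
    }
}

This would not address the underlying problem in the election loop, it only automates the recovery that we currently perform by hand.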

Additional context

Just in case it is required:

Configuration used
  • Lease duration: 15s
  • Renew deadline: 10s
  • Retry period: 2s
Initialisation code
LeaderElectionConfiguration lec = new LeaderElectionConfiguration(
        leConfig.leaseName(),
        namespace,
        leConfig.leaseDurationDuration(),
        leConfig.renewDeadlineDuration(),
        leConfig.retryPeriodDuration(),
        podName,
        leaderElectionService.createLeaderCallbacks(),
        false // exitOnStopLeading - continue running when losing leadership
);
Operator operator = new Operator(override -> {
    override.checkingCRDAndValidateLocalModel(true);
    override.withConcurrentReconciliationThreads(1);
    override.withReconciliationTerminationTimeout(Duration.ofSeconds(30));
    override.withLeaderElectionConfiguration(lec);
});
