Description
Bug Report
The leader election process managed by JOSDK left a deployment with two replicas without a leader because etcd had performance issues. When we manually restarted the existing replicas (20 minutes later), they were able to elect a leader again.
What did you do?
- We have a JOSDK operator running as a Kubernetes Deployment with two replicas for high availability
- We use io.javaoperatorsdk.operator.api.config.LeaderElectionConfiguration to keep only one replica as the leader (see the initialisation code in the Additional context below)
What did you expect to see?
- If we have one pod running, it becomes the leader
- If we have more than one pod running, a leader election happens using a Kubernetes Lease (a sketch of how to inspect that Lease follows this list)
- We can have short periods without a leader, depending on the time configuration (lease duration)
- If something "bad" happens, we could eventually end up with no leader, but it should recover automatically within a short period
What did you see instead? Under which circumstances?
We had a period of bad etcd performance in our k8s cluster. During a short time, the cluster got more than 100 failed etcd proposals. This affected our operator (and probably many other systems).
The application logs for the leader pod show this error: io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null
Detailed stack trace
2026-01-30 03:17:24,919{UTC} [pool-7-thread-3] WARN i.f.k.c.e.l.LeaderElector - Exception occurred while acquiring lock 'LeaseLock: my-namespace - my-lease (my-lease-pod) retrying...'
io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:509)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:467)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleGet(BaseOperation.java:792)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.requireFromServer(BaseOperation.java:193)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:149)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:98)
at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ResourceLock.get(ResourceLock.java:48)
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:227)
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$renewWithTimeout$6(LeaderElector.java:207)
at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$8(LeaderElector.java:292)
at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
at util.TokenAwareRunnable.run(TokenAwareRunnable.java:28)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: io.vertx.core.impl.NoStackTraceTimeoutException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/my-namespace/leases/my-lease for server null
At that point, the pod lost the leadership. Checking the logs of both available replicas, no special activity was found:
- Neither of the replicas took the lead
- Both replicas stayed alive during that time
This situation continued for 20 minutes, until we manually restarted the pods. After the restart, one of them took the leadership.
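For context, leaderElectionService.createLeaderCallbacks() (used in the initialisation code below) returns callbacks along these lines. This is a simplified, hypothetical sketch built on fabric8's LeaderCallbacks rather than the exact production code, but it illustrates the leadership-transition log lines we would expect to see around such an event:

import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LeaderElectionService {

    private static final Logger log = LoggerFactory.getLogger(LeaderElectionService.class);

    // Hypothetical sketch of createLeaderCallbacks(); the real implementation may differ.
    public LeaderCallbacks createLeaderCallbacks() {
        return new LeaderCallbacks(
                () -> log.info("Started leading"),                           // onStartLeading
                () -> log.warn("Stopped leading"),                           // onStopLeading
                newLeader -> log.info("New leader elected: {}", newLeader)); // onNewLeader
    }
}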
Environment
Kubernetes cluster type:
vanilla in Azure
$ Mention java-operator-sdk version from pom.xml file
JOSDK version 5.1.5
$ java -version
openjdk version "21.0.10" 2026-01-20 LTS
OpenJDK Runtime Environment Corretto-21.0.10.7.1 (build 21.0.10+7-LTS)
OpenJDK 64-Bit Server VM Corretto-21.0.10.7.1 (build 21.0.10+7-LTS, mixed mode, sharing)
$ kubectl version
Client Version: v1.31.13
Kustomize Version: v5.4.2
Server Version: v1.30.5
Possible Solution
I don't have a concrete proposal here, only an experience with JOSDK that suggests a possible improvement. I know the problem was probably caused on the k8s side, but focusing on the leader election feature, what I expected was automatic recovery within a short period (at most a few minutes), and this did not happen until the manual restart.
Additional context
Just in case it is required:
Configuration used
- Lease duration: 15s
- Renew deadline: 10s
- Retry period: 2s
Initialisation code
LeaderElectionConfiguration lec = new LeaderElectionConfiguration(
        leConfig.leaseName(),
        namespace,
        leConfig.leaseDurationDuration(),
        leConfig.renewDeadlineDuration(),
        leConfig.retryPeriodDuration(),
        podName,
        leaderElectionService.createLeaderCallbacks(),
        false // exitOnStopLeading - continue running when losing leadership
);

Operator operator = new Operator(override -> {
    override.checkingCRDAndValidateLocalModel(true);
    override.withConcurrentReconciliationThreads(1);
    override.withReconciliationTerminationTimeout(Duration.ofSeconds(30));
    override.withLeaderElectionConfiguration(lec);
});
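One thing that might be worth trying on our side, shared here only as a hedged sketch: the 10000ms in the stack trace appears to match the fabric8 client's default request timeout, which is the same value as our renew deadline, so a single lease GET can only fail after the deadline has effectively passed. Shortening the request timeout would give the elector a chance to retry within the renew deadline. This assumes ConfigurationServiceOverrider#withKubernetesClient is available in the JOSDK version we use; it is not what we currently run:

// Sketch only: build a fabric8 client with an explicit, shorter request timeout
// and hand it to the Operator via the configuration overrider.
io.fabric8.kubernetes.client.KubernetesClient client =
        new io.fabric8.kubernetes.client.KubernetesClientBuilder()
                .withConfig(new io.fabric8.kubernetes.client.ConfigBuilder()
                        .withRequestTimeout(5_000) // milliseconds, shorter than the 10s renew deadline
                        .build())
                .build();

Operator operator = new Operator(override -> {
    override.withKubernetesClient(client);
    override.withLeaderElectionConfiguration(lec);
});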