Please fill out all the sections below for bug issues, otherwise it'll be closed as it won't be actionable for us to address.
Describe the bug
When operating in environments with high pod churn rates—where pods are frequently created and terminated—and multiple services are configured to reference a single deployment, we have observed a critical issue with Host Network Service (HNS) endpoint lifecycle management. Specifically, HNS endpoints are not being properly cleaned up and remain in a stale state even after their associated pods have been successfully terminated.
The problem becomes particularly severe when Kubernetes' IP address management (IPAM) reassigns an IP address that still has a stale HNS endpoint attached to it. When a new pod is scheduled and receives this recycled IP address, the presence of the stale endpoint configuration causes a conflict in the network stack. This results in complete network connectivity failure for the newly created pod, manifesting as DNS
resolution timeouts and inability to establish network connections. The pod appears healthy from a Kubernetes perspective but is effectively isolated from the network, unable to communicate with other services or resolve domain names.
To Reproduce
Steps to reproduce the behavior:
- Deploy a Kubernetes cluster with a limited number of available pod IPs, containing both Linux and Windows nodes.
- Deploy Linux pods that consume almost all of the available IPs.
- Deploy a Service of type LoadBalancer that refers to the Linux pods. This triggers kube-proxy to create remote HNS endpoints on the Windows nodes, each carrying a Linux pod IP.
- Scale the Linux pods down to 1.
- Repeat the scale-up/scale-down cycle multiple times.
- Deploy a Windows deployment with multiple replicas, with multiple Services referencing the deployment.
- Scale the Windows pods down to 1.
- Repeat the same scale-up/scale-down cycle multiple times.
- After a few days, stale remote HNS endpoints remain on the Windows nodes even though the pods have been terminated.
- If one of those IPs is assigned to another Windows pod in a later cycle, network connectivity on that pod fails.
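The failure mode in the steps above can be illustrated with a toy model in Python. This is not the actual HNS or kube-proxy code; the IP pool, endpoint IDs, and the "skipped cleanup" flag are all illustrative, standing in for whatever causes the endpoint delete to be lost under churn:

```python
# Toy model of the bug: a remote endpoint that is not removed on pod
# deletion collides with a pod that later receives the recycled IP.

ip_pool = ["10.0.0.1", "10.0.0.2"]      # small pool forces fast IP reuse

free_ips = list(ip_pool)
remote_endpoints = {}                    # IP -> endpoint ID (per Windows node)
conflicts = []                           # IPs that came up behind a stale endpoint

def create_pod():
    ip = free_ips.pop(0)
    # A recycled IP that still has a stale remote endpoint breaks the new pod.
    if ip in remote_endpoints:
        conflicts.append(ip)
    return ip

def add_remote_endpoint(ip, endpoint_id):
    remote_endpoints[ip] = endpoint_id

def delete_pod(ip, cleanup_succeeds=True):
    # The bug: under high churn, the endpoint delete is sometimes skipped.
    if cleanup_succeeds:
        remote_endpoints.pop(ip, None)
    free_ips.append(ip)

# Cycle 1: pod comes up, kube-proxy adds a remote endpoint for its IP,
# then the pod is deleted but the endpoint leaks.
ip = create_pod()
add_remote_endpoint(ip, "ep-1")
delete_pod(ip, cleanup_succeeds=False)

# Cycle 2: the pool drains and the leaked IP is recycled -> conflict.
ip2 = create_pod()
ip3 = create_pod()
print(conflicts)                         # ['10.0.0.1']
```

The conflict only surfaces once the pool is small enough (or churn is high enough) for the leaked IP to be handed out again, which matches why this takes days to appear.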
Expected behavior
There should not be any remote HNS endpoints left on the Windows nodes once the associated pods are terminated.
Configuration:
- Edition: WS 2022 and WS 2019
- Base Image being used: servercore/iis:windowsservercore
- Container engine: containerd
- HNS Version: Major: 13, Minor: 3
- CNI: aws-vpc-bridge (L2 bridge networking mode)
- Cloud Platform: Amazon EKS
Additional context
- My dev Kubernetes cluster has 2 Linux nodes and 2 Windows nodes. After a few days I noticed a few stale remote HNS endpoints on the Windows nodes:
Node: ip-192-168-7-47.us-west-1.compute.internal
PS C:\Windows\system32> Get-HnsEndpoint | Format-Table Id, Name, IPAddress, IsRemoteEndpoint
ID Name IPAddress IsRemoteEndpoint
-- ---- --------- ----------------
b10f186a-ee1d-40a1-bb8a-f532cab4d131 Ethernet 192.168.7.110 True
ba5b0702-d05a-4fec-869f-4352d33f1891 Ethernet 172.0.32.0 True
ddec4b0f-0cf6-4276-bd42-5edd075fd179 Ethernet 192.168.7.84 True
fb92e3ac-3e12-461d-b73f-3ed5270ac42d Ethernet 192.168.7.51 True
e5c241d5-6123-4a8b-b7b1-01fd442ead92 Ethernet 192.168.7.52 True
248b9d31-ae56-4543-b9d0-ef1fc42f2be4 Ethernet 192.168.7.41 True
5abe1c00-18ae-4219-a527-e7d06e4522d6 Ethernet 192.168.7.39 True
042595c0-0031-4344-bbf7-6dfe3bc95b9f Ethernet 192.168.7.36 True
Node: ip-192-168-7-56.us-west-1.compute.internal
PS C:\Windows\system32> Get-HnsEndpoint | Format-Table Id, Name, IPAddress, IsRemoteEndpoint
ID Name IPAddress IsRemoteEndpoint
-- ---- --------- ----------------
ea6140b7-eee0-42c1-9ba2-5d7a4b0fefb1 Ethernet 172.0.32.0 True
b6453839-0d26-4509-931a-e9224a5135e7 Ethernet 192.168.7.49 True
1bf35c74-0c1d-4d26-97ce-5580c509f946 Ethernet 192.168.7.110 True
2b9cc7b7-bd9c-4940-b4da-70c200452d82 Ethernet 192.168.7.84 True
96b49a8c-8c29-41a0-b5ba-512b23189816 Ethernet 192.168.7.39 True
67b7082d-b7e7-4593-a766-8b3a646c8fca Ethernet 192.168.7.52 True
9c003bd3-0e56-494c-b1f6-6ae82a267013 Ethernet 192.168.7.50 True
691d9ecc-b20b-4b8e-99eb-9cf6ea0ee22b Ethernet 192.168.7.41 True
ba497e17-aba8-4146-a5a0-b55fe5338bef Ethernet 192.168.7.36 True
- I launched all my nodes within the subnet CIDR range 192.168.7.32/27. Below are my running pods, with no Windows pods deployed:
❯ kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
amazon-guardduty aws-guardduty-agent-2cjnx 1/1 Running 9 (4h33m ago) 8d 192.168.7.60 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
amazon-guardduty aws-guardduty-agent-w7x9r 1/1 Running 9 (4h33m ago) 8d 192.168.7.53 ip-192-168-7-53.us-west-1.compute.internal <none> <none>
kube-system aws-node-sjcx6 2/2 Running 16 (4h33m ago) 8d 192.168.7.60 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system aws-node-zfws4 2/2 Running 16 (4h33m ago) 8d 192.168.7.53 ip-192-168-7-53.us-west-1.compute.internal <none> <none>
kube-system coredns-6b7fdfbc95-4v4n9 1/1 Running 9 (4h33m ago) 8d 192.168.7.41 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system coredns-6b7fdfbc95-zfhv2 1/1 Running 9 (4h33m ago) 8d 192.168.7.52 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system kube-proxy-jgkcg 1/1 Running 9 (4h33m ago) 8d 192.168.7.53 ip-192-168-7-53.us-west-1.compute.internal <none> <none>
kube-system kube-proxy-n64cg 1/1 Running 9 (4h33m ago) 8d 192.168.7.60 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system metrics-server-8559b8c95f-b7dmf 1/1 Running 9 (4h33m ago) 8d 192.168.7.36 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system metrics-server-8559b8c95f-j94d9 1/1 Running 9 (4h33m ago) 8d 192.168.7.39 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
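The cross-check between the endpoint tables and this pod list can be scripted. A minimal Python sketch, using the IPs hard-coded from the output above (a real check would query `Get-HnsEndpoint` and the API server instead; `172.0.32.0` is left out since it is not a pod IP in this cluster):

```python
# Endpoint IPs seen on both Windows nodes (from Get-HnsEndpoint output),
# minus the non-pod address 172.0.32.0.
endpoint_ips = {
    # Node ip-192-168-7-47
    "192.168.7.110", "192.168.7.84", "192.168.7.51", "192.168.7.52",
    "192.168.7.41", "192.168.7.39", "192.168.7.36",
    # Node ip-192-168-7-56 (overlapping IPs deduplicated by the set)
    "192.168.7.49", "192.168.7.50",
}

# IPs of currently running pods (from kubectl get pods -A -o wide).
pod_ips = {
    "192.168.7.60", "192.168.7.53",   # guardduty / aws-node (host-networked)
    "192.168.7.41", "192.168.7.52",   # coredns
    "192.168.7.36", "192.168.7.39",   # metrics-server
}

# Any endpoint IP with no running pod behind it is a candidate stale endpoint.
stale = sorted(endpoint_ips - pod_ips)
print(stale)
# ['192.168.7.110', '192.168.7.49', '192.168.7.50', '192.168.7.51', '192.168.7.84']
```

This matches the manual analysis below: the coredns and metrics-server IPs are backed by running pods, while the rest have nothing behind them.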
- Among the running pods, the metrics-server and coredns pods have remote endpoints, which are valid.
- The other endpoints, with IPs such as 192.168.7.49 and 192.168.7.51, should have been deleted, but they remain stale on the Windows nodes.
- I scaled the Windows deployment and verified the pod IPs. As shown below, one of the Windows pods was assigned the IP 192.168.7.49, and if I exec into it and run nslookup, it fails.
❯ kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
amazon-guardduty aws-guardduty-agent-2cjnx 1/1 Running 9 (4h42m ago) 8d 192.168.7.60 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
amazon-guardduty aws-guardduty-agent-w7x9r 1/1 Running 9 (4h42m ago) 8d 192.168.7.53 ip-192-168-7-53.us-west-1.compute.internal <none> <none>
default linux-deployment-8676b68d6f-g8gkf 1/1 Running 7 (4h42m ago) 6d22h 192.168.7.38 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-46gv6 1/1 Running 0 45m 192.168.7.61 ip-192-168-7-56.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-4knr2 1/1 Running 0 45m 192.168.7.43 ip-192-168-7-56.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-6g24f 1/1 Running 0 45m 192.168.7.62 ip-192-168-7-47.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-b2ddh 1/1 Running 0 45m 192.168.7.48 ip-192-168-7-47.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-bh6xt 1/1 Running 0 45m 192.168.7.42 ip-192-168-7-56.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-c77js 1/1 Running 0 45m 192.168.7.46 ip-192-168-7-47.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-lh24g 1/1 Running 0 45m 192.168.7.58 ip-192-168-7-47.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-s444z 1/1 Running 0 45m 192.168.7.49 ip-192-168-7-56.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-tbzqn 1/1 Running 0 45m 192.168.7.45 ip-192-168-7-56.us-west-1.compute.internal <none> <none>
default test-app-5d8ff7fc67-tgfv6 1/1 Running 0 45m 192.168.7.59 ip-192-168-7-47.us-west-1.compute.internal <none> <none>
kube-system aws-node-sjcx6 2/2 Running 16 (4h42m ago) 8d 192.168.7.60 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system aws-node-zfws4 2/2 Running 16 (4h42m ago) 8d 192.168.7.53 ip-192-168-7-53.us-west-1.compute.internal <none> <none>
kube-system coredns-6b7fdfbc95-4v4n9 1/1 Running 9 (4h42m ago) 8d 192.168.7.41 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system coredns-6b7fdfbc95-zfhv2 1/1 Running 9 (4h42m ago) 8d 192.168.7.52 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system kube-proxy-jgkcg 1/1 Running 9 (4h42m ago) 8d 192.168.7.53 ip-192-168-7-53.us-west-1.compute.internal <none> <none>
kube-system kube-proxy-n64cg 1/1 Running 9 (4h42m ago) 8d 192.168.7.60 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system metrics-server-8559b8c95f-b7dmf 1/1 Running 9 (4h42m ago) 8d 192.168.7.36 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
kube-system metrics-server-8559b8c95f-j94d9 1/1 Running 9 (4h42m ago) 8d 192.168.7.39 ip-192-168-7-60.us-west-1.compute.internal <none> <none>
❯ kubectl exec -it test-app-5d8ff7fc67-s444z -- powershell
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows
PS C:\> nslookup google.com
DNS request timed out.
timeout was 2 seconds.
Server: UnKnown
Address: 10.100.0.10
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to UnKnown timed-out
PS C:\> exit
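What the expected behavior amounts to is a reconcile pass: any remote endpoint whose IP no longer backs a Service endpoint should be removed. A hypothetical sketch of that logic in Python (the IDs and IPs are copied from the tables above for illustration; on a real node the deletion primitive would presumably be `Remove-HnsEndpoint`, applied with appropriate safeguards):

```python
# Hypothetical reconcile pass, NOT the actual kube-proxy/HNS code.

def reconcile(actual_endpoints, desired_ips):
    """Return IDs of remote endpoints whose IP backs no current Service endpoint."""
    return [eid for eid, ip in actual_endpoints.items() if ip not in desired_ips]

# Remote endpoints present on node ip-192-168-7-56 (subset of the table above).
actual = {
    "b6453839-0d26-4509-931a-e9224a5135e7": "192.168.7.49",  # stale
    "2b9cc7b7-bd9c-4940-b4da-70c200452d82": "192.168.7.84",  # stale
    "96b49a8c-8c29-41a0-b5ba-512b23189816": "192.168.7.39",  # metrics-server, valid
}

# Pod IPs currently referenced by Service EndpointSlices.
desired = {"192.168.7.39", "192.168.7.41", "192.168.7.52", "192.168.7.36"}

to_delete = reconcile(actual, desired)
print(to_delete)   # the two stale endpoint IDs
```

Whether this cleanup belongs in kube-proxy or HNS itself is exactly what we'd like clarified; today neither appears to run such a pass reliably under churn.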
- Attached are the HNS trace logs, collected using the collection script. Please let us know if you need more details or anything else.
9a98.zip - logs on Node ip-192-168-7-56.us-west-1.compute.internal
bf83.zip - logs on Node ip-192-168-7-47.us-west-1.compute.internal