1 change: 1 addition & 0 deletions TSG/EnvironmentValidator/Networking/README.md
@@ -14,6 +14,7 @@ For Azure Local Network Resources not related to Environment Validator, see [TSG
- [Troubleshoot: Cluster Storage Intent Exists](Troubleshoot-Network-Test-Cluster-StorageIntent-Exists.md)
- [Troubleshoot: Host Network Configuration Readiness](Troubleshoot-Network-Test-HostNetworkConfigurationReadiness.md)
- [Troubleshoot: Infrastructure IP Azure Endpoint Connection](Troubleshoot-Network-Test-InfraIP-Azure-Endpoint-Connection.md)
- [Troubleshoot: Infrastructure IP Connection Exception Found](Troubleshoot-Network-Test-Infra-IP-Connection-ExceptionFound.md)
- [Troubleshoot: Infrastructure IP DNS Client Readiness](Troubleshoot-Network-Test-InfraIP-DNS-Client-Readiness.md)
- [Troubleshoot: Infrastructure IP DNS Port 53 Connection](Troubleshoot-Network-Test-InfraIP-DNS-Port-53.md)
- [Troubleshoot: Infrastructure IP Hyper-V Readiness](Troubleshoot-Network-Test-InfraIP-Hyper-V-Readiness.md)
@@ -171,6 +171,78 @@ If you've resolved the underlying issue, you can retry the intent provisioning:

---

### Failure: Intent ConfigurationStatus Flipping Between Validating and Success

**Error Message:**
```text
Intent <IntentName> on host <NodeName> in pending state and hasn't stabilized. ConfigurationStatus: Validating ProvisioningStatus: <empty>
```

**Root Cause:** A known issue causes the local status returned by `Get-NetIntentStatus` to toggle between `Validating` and `Success` after a Global Intent failure. This means the intent never stays in a stable `Success` state long enough for the validator to pass, blocking solution updates.

You may observe this behavior when running `Get-NetIntentStatus` repeatedly — the `ConfigurationStatus` rapidly flips between `Validating` and `Success` on one or more nodes, while the `ProvisioningStatus` may appear empty during the `Validating` phase.

#### Remediation Steps

##### Step 1: Confirm the Flipping Behavior

1. Run `Get-NetIntentStatus` multiple times in quick succession and observe whether `ConfigurationStatus` alternates between `Validating` and `Success`:

```powershell
# Run multiple times to observe the flipping behavior
Get-NetIntentStatus | Format-Table IntentName, Host, ConfigurationStatus, ProvisioningStatus
```

2. If you see the status toggling between `Validating` and `Success` (rather than staying in `Failed` or remaining stuck in `Validating`), proceed with the mitigation below.
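
Eyeballing repeated output can miss a fast flip, so it may help to collect several samples and summarize them. The following is a minimal sketch, not part of Network ATC: `Get-FlippingIntent` is a hypothetical helper that assumes each sample exposes the `IntentName`, `Host`, and `ConfigurationStatus` properties shown above, and it reports every intent/host pair that showed more than one status.

```powershell
# Hypothetical helper (not a Network ATC cmdlet): summarizes status samples.
function Get-FlippingIntent {
    param(
        [Parameter(Mandatory)]
        [object[]]$Samples
    )

    $Samples | Group-Object -Property IntentName, Host | ForEach-Object {
        $group = $_.Group
        # Distinct ConfigurationStatus values seen for this intent/host pair
        $statuses = @($group | ForEach-Object { [string]$_.ConfigurationStatus } | Sort-Object -Unique)
        if ($statuses.Count -gt 1) {
            [pscustomobject]@{
                IntentName = $group[0].IntentName
                Host       = $group[0].Host
                Statuses   = $statuses -join ', '
            }
        }
    }
}
```

One way to use it is to gather samples with `1..10 | ForEach-Object { Get-NetIntentStatus; Start-Sleep -Seconds 3 }` and pass the result as `-Samples`. The sample count and interval are arbitrary choices. No output means no flipping was observed during the sampling window.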

##### Step 2: Reset Global Intent Overrides

The mitigation involves removing and re-adding the global cluster overrides for Network ATC. This stabilizes the intent status.

1. Check current global overrides:

```powershell
# Check current overrides
$globalIntent = Get-NetIntent -GlobalOverrides
$clusterOverride = $globalIntent.ClusterOverride
$clusterOverride
```

2. Create new cluster overrides and apply previous override values:

```powershell
$newClusterOverride = New-NetIntentGlobalClusterOverrides

# NOTE:
# Make sure each new override value matches the previous value, unless the previous value was empty.
# Set overrides on the object based on the old values, for example: $newClusterOverride.<prop> = <val>
# If all of the old properties are empty, you can set:
#   $newClusterOverride.EnableLiveMigrationNetworkSelection = $true
#   $newClusterOverride.EnableNetworkNaming = $true
# Do NOT set any other default values if the old properties are empty.
```

3. Remove old intent global overrides and re-add them:

```powershell
# Remove old intent global overrides
Remove-NetIntent -GlobalOverrides

# Re-add global cluster overrides
Add-NetIntent -GlobalClusterOverrides $newClusterOverride
```

4. Verify that the intent status is now stable:

```powershell
# Verify intent status is stable at Success
Get-NetIntentStatus | Format-Table IntentName, Host, ConfigurationStatus, ProvisioningStatus
```

Confirm that `ConfigurationStatus` remains `Success` and `ProvisioningStatus` is `Completed` across multiple checks.
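
The "multiple checks" can be scripted. The sketch below is one possible approach under stated assumptions: `Test-IntentStable` is a hypothetical helper, the default count and interval are arbitrary, and the `-Sampler` parameter exists only so the status source can be swapped out. It returns `$false` as soon as any sample is not `Success`/`Completed`.

```powershell
# Hypothetical helper (not a Network ATC cmdlet): polls intent status repeatedly.
function Test-IntentStable {
    param(
        [int]$Count = 5,
        [int]$IntervalSeconds = 5,
        # Assumption: by default, samples come from Get-NetIntentStatus
        [scriptblock]$Sampler = { Get-NetIntentStatus }
    )

    for ($i = 0; $i -lt $Count; $i++) {
        if ($i -gt 0) { Start-Sleep -Seconds $IntervalSeconds }

        # Any entry that is not Success/Completed counts as unstable
        $unstable = @(& $Sampler | Where-Object {
            [string]$_.ConfigurationStatus -ne 'Success' -or
            [string]$_.ProvisioningStatus -ne 'Completed'
        })
        if ($unstable.Count -gt 0) { return $false }
    }
    return $true
}
```

Running `Test-IntentStable -Count 10 -IntervalSeconds 3` on a cluster node should return `$true` only if every intent on every host stayed at `Success`/`Completed` for all ten samples. Note that it also returns `$true` if no intents are returned at all, so check the raw output at least once.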

---

### Transient States

The intent transient states include:
@@ -0,0 +1,173 @@
# AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_ExceptionFound

<table border="1" cellpadding="6" cellspacing="0" style="border-collapse:collapse; margin-bottom:1em;">
<tr>
<th style="text-align:left; width: 180px;">Name</th>
<td><strong>AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_ExceptionFound</strong></td>
</tr>
<tr>
<th style="text-align:left; width: 180px;">Severity</th>
<td><strong>Critical</strong></td>
</tr>
<tr>
<th style="text-align:left;">Applicable Scenarios</th>
<td><strong>Deployment (without ArcGateway), Upgrade (without ArcGateway)</strong></td>
</tr>
</table>

## Overview

This is a **catch-all exception handler** for the infrastructure IP pool connection validation. When the `Test-NwkInfraConnectionValidator_InfraIpPoolConnection` function encounters an unhandled exception during any phase of infrastructure IP connectivity validation, this result is returned.

Unlike other validators in the `NetworkInfraConnection` family that report specific issues (e.g., DNS failure, vNIC readiness), this validator fires when an **unexpected error** occurs — typically due to environmental issues such as VMSwitch configuration failures, Hyper-V component errors, or other system-level problems that prevent the validator from completing normally.

The exception message and stack trace are captured in the `AdditionalData.Detail` field.

## Requirements

1. Hyper-V role must be installed and functioning
2. VMSwitch must be configurable on the host
3. Network adapters specified in the management intent must be present and operational
4. Infrastructure IP pool must be valid and reachable
5. System must have sufficient resources to create test virtual network adapters

## Troubleshooting Steps

### Review Environment Validator Output

Review the Environment Validator output JSON. The `AdditionalData.Detail` field contains the exception message and stack trace, both of which are essential for identifying the root cause.

```json
{
  "Name": "AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_ExceptionFound",
  "DisplayName": "Exception found during infra IP pool connection validation.",
  "Title": "Exception found during infra IP pool connection validation.",
  "Status": 1,
  "Severity": 2,
  "Description": "Experienced exception during infra IP pool readiness validation. Please check information in AdditionalData.Detail section",
  "Remediation": "URI",
  "TargetResourceID": "Infra_IP_Connection_Exception",
  "TargetResourceName": "Infra_IP_Connection_Exception",
  "TargetResourceType": "InfraIpPool",
  "Timestamp": "<timestamp>",
  "AdditionalData": {
    "Source": "Infra_IP_Connection_Exception",
    "Resource": "Infra_IP_Connection_Exception",
    "Detail": "<Exception Message>\n\r<Stack Trace>",
    "Status": "FAILURE",
    "TimeStamp": "<timestamp>"
  }
}
```

> **Important:** Copy the full `AdditionalData.Detail` content. The exception message and stack trace are the primary clues for diagnosing the root cause.
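
If the validator output is available as JSON text, the `Detail` field can be pulled out programmatically. This is a rough sketch only: `Get-InfraIpExceptionDetail` is a hypothetical helper, it assumes the JSON has the shape shown above (a single result or an array of results), and how you obtain the JSON text depends on your environment.

```powershell
# Hypothetical helper: extracts the exception message and stack trace from validator JSON.
function Get-InfraIpExceptionDetail {
    param(
        [Parameter(Mandatory)]
        [string]$Json
    )

    $name = 'AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_ExceptionFound'
    $results = $Json | ConvertFrom-Json

    $results | Where-Object { $_.Name -eq $name } | ForEach-Object {
        # Detail holds the message first, then the stack trace after a line break
        $parts = @($_.AdditionalData.Detail -split '[\r\n]+', 2)
        [pscustomobject]@{
            Timestamp  = $_.Timestamp
            Message    = $parts[0]
            StackTrace = if ($parts.Count -gt 1) { $parts[1] } else { '' }
        }
    }
}
```

For example, `Get-InfraIpExceptionDetail -Json (Get-Content -Raw <path-to-output.json>)` would list one entry per `ExceptionFound` result, where the path is a placeholder for wherever your validator output is stored.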

---

### Failure: General Exception During Validation

If the exception message does not match a more specific scenario, such as the Hyper-V `Class not registered` error covered under "Re-install Hyper-V" below, the exception may have been thrown by any of the sub-steps in the infra IP validation process:

1. **VMSwitch creation/discovery** — Could not find or create a suitable external VMSwitch
2. **vNIC creation** — Could not create the test virtual network adapter on the VMSwitch
3. **IP assignment** — Could not assign infrastructure IP to the test vNIC
4. **Gateway ping** — Error while testing ICMP connectivity to the default gateway
5. **DNS testing** — Error while testing DNS server connectivity on port 53
6. **Endpoint connectivity** — Error while running curl.exe tests to Azure endpoints

#### Remediation Steps

##### 1. Read the Exception Details Carefully

The `AdditionalData.Detail` field contains both the exception message (first line) and the stack trace (subsequent lines). The stack trace shows which function threw the error.

**Common patterns in the stack trace:**

| Stack Trace Contains | Likely Cause |
|----------------------|--------------|
| `EnvValidatorNwkLibConfigureVMSwitchForTesting` | VMSwitch setup failed |
| `New-VMSwitch` | VMSwitch creation failed |
| `Add-VMNetworkAdapter` or `Get-VMNetworkAdapter` | vNIC management failed |
| `New-NetIPAddress` | IP assignment failed |
| `InfraIpCurlTestToEndpoint` | curl endpoint test failed |
| `Get-AzStackHciConnectivityTarget` | Failed to fetch endpoint manifest |
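
The table above can be applied mechanically to a copied `Detail` string. The following sketch is illustrative only: `Get-LikelyCause` is a hypothetical helper, and its pattern list simply mirrors the table, so it can only suggest a starting point and will return nothing for unlisted failures.

```powershell
# Hypothetical helper: maps stack trace contents to the likely causes listed in the table.
function Get-LikelyCause {
    param(
        [Parameter(Mandatory)]
        [string]$Detail
    )

    $patterns = [ordered]@{
        'EnvValidatorNwkLibConfigureVMSwitchForTesting' = 'VMSwitch setup failed'
        'New-VMSwitch'                                  = 'VMSwitch creation failed'
        'Add-VMNetworkAdapter'                          = 'vNIC management failed'
        'Get-VMNetworkAdapter'                          = 'vNIC management failed'
        'New-NetIPAddress'                              = 'IP assignment failed'
        'InfraIpCurlTestToEndpoint'                     = 'curl endpoint test failed'
        'Get-AzStackHciConnectivityTarget'              = 'Failed to fetch endpoint manifest'
    }

    foreach ($key in $patterns.Keys) {
        # Simple substring match against the full Detail text
        if ($Detail -like "*$key*") {
            [pscustomobject]@{
                Pattern     = $key
                LikelyCause = $patterns[$key]
            }
        }
    }
}
```

A stack trace can match more than one pattern, for example both the VMSwitch setup function and `New-VMSwitch`. In that case all matches are returned in table order.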

##### 2. Check System Prerequisites

```powershell
# Verify Hyper-V feature and management tools
Get-WindowsFeature -Name "Hyper-V", "Hyper-V-PowerShell" | Select-Object Name, Installed

# Check that the Hyper-V PowerShell cmdlets (such as Get-VMSwitch) are available
Get-Command Get-VMSwitch -ErrorAction SilentlyContinue | Select-Object Name, Source
```

##### 3. Review System Event Logs

```powershell
# Check for recent events in the System log related to networking/Hyper-V.
# Filtering by StartTime avoids missing events beyond the newest 100 entries.
$startTime = (Get-Date).AddHours(-2)
Get-WinEvent -FilterHashtable @{ LogName = 'System'; StartTime = $startTime } -ErrorAction SilentlyContinue |
    Where-Object { $_.Message -like "*network*" -or $_.Message -like "*Hyper-V*" -or $_.Message -like "*switch*" } |
    Select-Object TimeCreated, LevelDisplayName, Message |
    Format-Table -Wrap
```

##### 4. Re-install Hyper-V

If the exception message resembles the following, Hyper-V is not installed correctly on the machine:

```text
Failed while adding virtual Ethernet switch connections. Switch port create failed, switch = "ABCDEF00-1234-5678-ABCD-ABCDABCDABCD", port name = "ABCDEF00-1234-5678-ABCD-ABCDABCDABCD", port friendly name = "ConvergedSwitch(MYINTENT)": Class not registered (0x80040154)
```

Re-install the Hyper-V feature, rebooting after each step:

```powershell
Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V
# reboot machine
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V
# reboot machine again
```

##### 5. Retry the Validation

After addressing the underlying issue, re-run the Environment Validator.

---

## Additional Information

### How This Validator is Triggered

This result is produced by the top-level `catch` block in the `Test-NwkInfraConnectionValidator_InfraIpPoolConnection` function. The function performs these steps inside a `try` block:

1. Retrieves infrastructure IP range from IP pools
2. Validates Hyper-V and VMSwitch prerequisites
3. Finds or creates an external VMSwitch using management intent adapters
4. Creates a test vNIC on the VMSwitch
5. For each infrastructure IP (up to 9):
   - Assigns the IP to the test vNIC
   - Tests gateway connectivity (ICMP ping)
   - Tests DNS connectivity (port 53)
   - Tests Azure endpoint connectivity (curl.exe)
6. Cleans up test resources

If **any** unhandled exception occurs during steps 1–6, the catch block returns this `ExceptionFound` result with the exception details.
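
The catch-all pattern described above can be illustrated with a small sketch. This is not the validator's actual code: `Invoke-ValidationWithCatchAll` is a hypothetical function, the result object is a simplified version of the JSON shown earlier, and the `Detail` layout (message, line break, stack trace) is assumed from the sample output.

```powershell
# Illustration only: a top-level try/catch that converts any terminating error into one result.
function Invoke-ValidationWithCatchAll {
    param(
        [Parameter(Mandatory)]
        [scriptblock]$Steps
    )

    try {
        # All validation steps run inside the try block
        & $Steps
    }
    catch {
        # Any unhandled terminating error is reported as a single ExceptionFound result
        [pscustomobject]@{
            Name   = 'AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_ExceptionFound'
            Status = 'FAILURE'
            Detail = "$($_.Exception.Message)`n`r$($_.ScriptStackTrace)"
        }
    }
}
```

Because the first exception short-circuits everything after it, later steps never run and never report their own results. This is why this result can hide failures that the more specific validators would otherwise surface.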

### When This Validator is Skipped

The infrastructure IP connectivity validator (including all endpoint tests) is **skipped** when **ArcGateway is enabled**. ArcGateway provides alternative connectivity.

### Related Validators

This exception can mask failures that would normally be reported by these specific validators:

- `AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_Hyper_V_Readiness`
- `AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_VMSwitch_Readiness`
- `AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_MANAGEMENT_VNIC_Readiness`
- `AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_vNIC_Readiness`
- `AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_DNSClientServerAddress_Readiness`
- `AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_IPReadiness`
- `AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_DNS_Server_Port_53`
- `AzureLocal_NetworkInfraConnection_Test_Infra_IP_Connection_{ServiceName}`

### Related Documentation

- [Azure Local network requirements](https://learn.microsoft.com/azure/azure-local/concepts/host-network-requirements)