Skip to content

netvsp: eqe 135 handling clashing with fhr#2909

Open
erfrimod wants to merge 4 commits intomicrosoft:mainfrom
erfrimod:erfrimod/vf-reconfiguration-fhr
Open

netvsp: eqe 135 handling clashing with fhr#2909
erfrimod wants to merge 4 commits intomicrosoft:mainfrom
erfrimod:erfrimod/vf-reconfiguration-fhr

Conversation

@erfrimod
Copy link
Copy Markdown
Contributor

@erfrimod erfrimod commented Mar 7, 2026

Problems happen when device arrival / removal occur during VF Reconfiguration delays and waits. Modifying the behavior in VF Manager to consider device arrival/removal to indicate that the old VF is gone and the reconfiguration should be abandoned.

Example: EQE 135 arrives -> delay -> FHR -> device removal

  • If ManaDeviceRemoved fires during Reconfiguring, clears vf_reconfig_backoff → no more retries
  • If ManaDeviceArrived fires while vf_reconfig_backoff is Some, code panics (assert!)
  • If the uevent add occurs while state is Reconfiguring, the handling code maps it to Continue → event discarded

@erfrimod erfrimod requested a review from a team as a code owner March 7, 2026 01:49
Copilot AI review requested due to automatic review settings March 7, 2026 01:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts the netvsp VF manager worker’s reconfiguration state machine to better tolerate VF arrival/removal events that occur during VF reconfiguration delays (e.g., EQE 135 + FHR timing), avoiding panics and inappropriate retry behavior.

Changes:

  • Treat uevent “device exists” notifications as ManaDeviceArrived even while in Reconfiguring.
  • Replace a panic on device arrival during reconfig backoff with “abandon reconfiguration” behavior (clears backoff + logs).
  • On device removal, clear any outstanding reconfig backoff (with logging) rather than silently resetting.

Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs
Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 7, 2026

@ben-zen
Copy link
Copy Markdown
Contributor

ben-zen commented Mar 11, 2026

LGTM

Copilot AI review requested due to automatic review settings March 27, 2026 17:40
@erfrimod erfrimod force-pushed the erfrimod/vf-reconfiguration-fhr branch from 4f5102f to a11e0e3 Compare March 27, 2026 17:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs Outdated
Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs Outdated
Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs
Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs Outdated
let exists = Path::new(&device_path).exists();
match (vtl2_device_state, exists) {
(Vtl2DeviceState::Missing, true) => NextWorkItem::ManaDeviceArrived,
(Vtl2DeviceState::Reconfiguring, true) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should always return NextWorkItem::Continue. A device can't arrive until it is removed, and removal will move the state away from Vtl2DeviceState::Reconfiguring.

Copilot AI review requested due to automatic review settings March 27, 2026 22:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

openhcl/underhill_core/src/emuplat/netvsp.rs:783

  • ManaDeviceArrived is triggered purely by vtl2_device_state == Missing && exists, regardless of action. Since the uevent handler can emit UeventAction::Rescan (and the comment here notes rescans are not a reliable readiness signal), a rescan can spuriously drive an arrival/startup attempt. Consider gating the Missing && existsManaDeviceArrived transition on action == UeventAction::Add (and/or explicitly ignoring Rescan) to avoid false-positive arrivals.
                    let exists = Path::new(&device_path).exists();
                    match (vtl2_device_state, exists) {
                        (Vtl2DeviceState::Missing, true) => NextWorkItem::ManaDeviceArrived,
                        (state, false) => {
                            // Tracing to diagnose add that is not acted on due to missing device.
                            if action == UeventAction::Add {
                                 tracelimit::warn_ratelimited!(?state, ?action, exists, %device_path, "uevent received");
                            }

Comment thread openhcl/underhill_core/src/emuplat/netvsp.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants