Skip to content

device-plugin: gate Enabled label on SriovNetworkNodeState sync status#15

Open
beccccaboo wants to merge 9 commits intomasterfrom
fix/device-plugin-wait-for-node-state
Open

device-plugin: gate Enabled label on SriovNetworkNodeState sync status#15
beccccaboo wants to merge 9 commits intomasterfrom
fix/device-plugin-wait-for-node-state

Conversation

@beccccaboo
Copy link
Copy Markdown

@beccccaboo beccccaboo commented Apr 6, 2026

Problem

When a new node joins the cluster, the sriov-device-plugin pod can start before sriov-network-config-daemon finishes configuring VFs. The device plugin scans once at startup, finds 0 VFs, and never picks them up later — leaving the node without SR-IOV resources until the pod is manually restarted.

Solution

The device plugin DaemonSet uses nodeSelector: { sriovnetwork.openshift.io/device-plugin: Enabled } and only schedules on nodes with this label. The operator's SriovNetworkNodePolicyReconciler is exclusively responsible for setting this label.

This PR delays setting the label to Enabled until SriovNetworkNodeState.Status.SyncStatus == "Succeeded", ensuring the device plugin pod cannot start until VFs are fully configured.

A NodeStateSyncStatusPredicate watch on SriovNetworkNodeState triggers reconciliation as soon as the config daemon transitions to Succeeded, so the label is applied promptly.

Behaviour

Scenario SyncStatus Current label Result
New node, first sync "" / "InProgress" (none) Disabled; device plugin not scheduled
Config daemon finishes "Succeeded" Disabled Reconciler re-triggers → Enabled; device plugin starts, finds VFs
Node reconfiguring "InProgress" Enabled Label preserved; running device plugin not evicted
Already-configured node "Succeeded" Enabled No change

Changes

  • controllers/sriovnetworknodepolicy_controller.go: fetch SriovNetworkNodeState before labeling; only set Enabled when SyncStatus == Succeeded (or already Enabled). Add SriovNetworkNodeState watch with NodeStateSyncStatusPredicate.
  • controllers/helper.go: add NodeStateSyncStatusPredicate that fires only when SyncStatus changes.

🤖 Generated with Claude Code

Add a wait-for-node-state init container to the sriov-device-plugin
DaemonSet that blocks startup until SriovNetworkNodeState.status.syncStatus
is Succeeded for the node.

Without this, the device plugin can start before the config daemon finishes
creating and binding VFs to vfio-pci on a new node. When it finds 0 matching
devices it skips resource server creation entirely and never picks up VFs
created later, requiring a manual pod restart to surface the node annotation.

The init container polls the SriovNetworkNodeState CR every 10s using the
auto-mounted service account credentials. It exits immediately on already-
configured nodes where syncStatus is already Succeeded.

Also adds get permission on sriovnetworknodestates to the sriov-plugin Role
and exposes SRIOVNetworkConfigDaemonImage to the plugin manifest renderer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

beccccaboo and others added 4 commits April 6, 2026 10:28
ENVTEST_K8S_VERSION was referenced in Makefile test targets but never
defined, causing `setup-envtest use` to run with an empty version
argument. Newer versions of setup-envtest require an explicit version
and fail silently on empty input, leaving KUBEBUILDER_ASSETS unset.
envtest then falls back to searching PATH for etcd, which does not
exist on GitHub Actions runners, causing all envtest-based test suites
to fail at BeforeSuite.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The GCS 401 error is the root cause, not the missing version variable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The GCS bucket used by setup-envtest from controller-runtime@v0.16.3
now returns 401 Unauthorized, leaving KUBEBUILDER_ASSETS empty and
causing all envtest-based test suites to fail with "etcd not found".

Add an explicit step to both test jobs that installs setup-envtest from
release-0.19, which uses the GitHub-hosted envtest-releases.yaml index
instead of GCS, and pre-populates KUBEBUILDER_ASSETS in the job
environment before any make test-* targets run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@beccccaboo beccccaboo requested review from Yarosh and clarkzinzow April 6, 2026 17:55
Replace the init container approach with a controller-level fix. The
SriovNetworkNodePolicyReconciler now delays setting the
sriovnetwork.openshift.io/device-plugin=Enabled node label until
SriovNetworkNodeState.Status.SyncStatus == "Succeeded". Since the device
plugin DaemonSet uses this label as its nodeSelector, it cannot schedule
on a node until VFs are fully configured, eliminating the race between
the config daemon and device plugin on new nodes.

During reconfiguration (SyncStatus briefly InProgress on an already-labelled
node), the label is preserved to avoid evicting the running device plugin pod.

A NodeStateSyncStatusPredicate watch on SriovNetworkNodeState ensures the
reconciler re-triggers as soon as the config daemon transitions to Succeeded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@beccccaboo beccccaboo changed the title device-plugin: add init container to wait for node state sync device-plugin: gate Enabled label on SriovNetworkNodeState sync status Apr 6, 2026
Copy link
Copy Markdown

@Yarosh Yarosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this change looks good (see 1 comment I left) - I would:

  1. Test it in tcloud staging, with a full test- i.e.: scale down operator, delete sriovnetworknodestate, remove label from a node, scale operator back up and see it reconcile properly. Then roll this out to prod substrates
  2. Worth adding tests to this scenario - there are quite a few tests in this repo. Pretty sure there's a test for this device-plugin code that you can tweak

// Only enable the device plugin when the node state has successfully synced.
// If already Enabled (reconfiguration case), keep it Enabled to avoid evicting the running pod.
ns := &sriovnetworkv1.SriovNetworkNodeState{}
err = r.Get(ctx, types.NamespacedName{Namespace: vars.Namespace, Name: node.Name}, ns)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we move the check for currentLabel == constants.SriovDevicePluginLabelEnabled here, and just skip making any changes if already enabled?
You can short-circuit, and spare the etcd lookup for sriovnetworknodestate - saving some load on k8s api-server and etcd.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Implemented in 3f5b824

@lbogdan-together
Copy link
Copy Markdown

When sriov-network-config-daemon finishes configuring all interfaces, it restarts sriov-device-plugin, so I'm not sure what this PR is supposed to solve.

// restart device plugin pod
log.Log.Info("nodeStateSyncHandler(): restart device plugin pod")
if err := dn.restartDevicePluginPod(); err != nil {
log.Log.Error(err, "nodeStateSyncHandler(): fail to restart device plugin pod")
return err
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants