fix: driver and toolkit ready state file cleanup #2242

Open
shivakunv wants to merge 1 commit into main from cleanupdriverreadyfile

@shivakunv (Contributor) commented Mar 23, 2026

Description

Fix stale readiness file handling for driver and toolkit operands during restart/upgrade flows.

During an upgrade, the old driver-ready and toolkit-ready files can remain on the host because, unlike .driver-ctr-ready, they were not cleaned up during pod shutdown.
mig-manager only waits for those files to exist; it does not wait for a fresh validation of the new driver/toolkit state.
As a result, after a driver reinstall, mig-manager can act on the stale ready signals before the new driver/toolkit libraries are actually available.
Currently, the files are deleted only when the corresponding validator runs again:

toolkit-ready :

func (t *Toolkit) validate() error {
	// delete toolkit status file if already present
	err := deleteStatusFile(outputDirFlag + "/" + toolkitStatusFile)
	if err != nil {
		return err
	}

driver-ready:

func (d *Driver) validate() error {
	// delete driver status file if already present
	err := deleteStatusFile(outputDirFlag + "/" + driverStatusFile)
	if err != nil {
		return err
	}

This patch fixes that by also cleaning up the driver-ready and toolkit-ready files.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Manual validation on hardware with MIG-capable GPUs

Testing

I could not reproduce the issue in a test environment; it appears to be timing-dependent.
Validation of this change has so far been done by code-path analysis.
I will perform manual validation on hardware with MIG-capable GPUs.

Signed-off-by: Shiva Kumar (SW-CLOUD) <shivaku@nvidia.com>
@shivakunv force-pushed the cleanupdriverreadyfile branch from ae75ec3 to 21d3d41 on March 23, 2026 09:51
@coveralls commented Mar 23, 2026

Coverage Status

coverage: 27.711% (remained the same) when pulling 21d3d41 on cleanupdriverreadyfile into d5750f2 on main.

@shivakunv self-assigned this on Mar 23, 2026
@shivakunv marked this pull request as ready for review on March 24, 2026 03:21
@shivakunv (Contributor, Author) commented

PTAL @rahulait @rajathagasthya

@rahulait added this to the v26.3.1 milestone on Mar 25, 2026