Fix CI: apt update on runner, file lock race condition #341

Merged
bendudson merged 12 commits into master from ci-apt-update on Apr 14, 2026

Conversation

@mikekryjak (Collaborator) commented Mar 17, 2026

The first error is:

Ign:4 https://security.ubuntu.com/ubuntu noble-updates/main amd64 libcurl4-openssl-dev amd64 8.5.0-2ubuntu10.7
Err:4 mirror+file:/etc/apt/apt-mirrors.txt noble-updates/main amd64 libcurl4-openssl-dev amd64 8.5.0-2ubuntu10.7
  404  Not Found [IP: 52.161.185.214 80]
E: Failed to fetch mirror+file:/etc/apt/apt-mirrors.txt/pool/main/c/curl/libcurl4-openssl-dev_8.5.0-2ubuntu10.7_amd64.deb  404  Not Found [IP: 52.161.185.214 80]
Fetched 4720 kB in 2s (2450 kB/s)
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

I followed the error's advice and added sudo apt-get update, which seems to resolve the same issue in xHermes. This PR adds it to the xBOUT CI.
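The change can be sketched as a workflow step like the following (a hypothetical GitHub Actions fragment; the step name and package list are illustrative, taken from the error log above, and the actual xBOUT workflow may differ):

```yaml
# Refresh the package index before installing, so apt does not request
# package versions the mirror has already rotated out (the 404 above).
- name: Install system dependencies
  run: |
    sudo apt-get update
    sudo apt-get install -y libcurl4-openssl-dev
```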

The second issue is that the tests hang on test_boutdataset.py::TestSaveRestart::test_to_restart. I recreated the CI environment locally and reproduced the hang. Thanks to @dschwoerer's stack-trace debugging and the help of an LLM, I was able to narrow this down to an unsafely loaded dataset in that test. This resolves the issue on my end, at least.
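The failure pattern can be sketched with a stdlib-only toy (a hypothetical class, not xBOUT or xarray code): a lazily opened dataset keeps its file handle until the data is explicitly loaded, and a "safe" load materialises everything and releases the handle before anything else writes to the same file.

```python
import os
import tempfile

# Toy stand-in for lazy dataset loading (hypothetical, not the xBOUT API):
# the handle stays open until load() pulls the data into memory, mirroring
# how a lazily opened dataset can still hold a lock when a save starts.
class LazyDataset:
    def __init__(self, path):
        self._file = open(path)  # lazy: keep the handle, read nothing yet
        self.values = None

    def load(self):
        self.values = self._file.read()  # materialise the data in memory
        self._file.close()               # release the handle (and any lock)
        return self

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "dump.txt")
with open(path, "w") as f:
    f.write("1 2 3")

ds = LazyDataset(path).load()  # "safe" load: the handle is closed afterwards
assert ds._file.closed
assert ds.values == "1 2 3"
```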

mikekryjak and others added 3 commits March 17, 2026 18:39
Was erroring out on installing libcurl4-openssl-dev before.
@mikekryjak changed the title from "Fix CI: add sudo apt-get update to actions" to "Fix CI: apt update on runner, file lock race condition" on Apr 13, 2026
dschwoerer and others added 3 commits April 13, 2026 14:37
Cause of the next hang locally. This is a different file than last time, which suggests that many tests may need to be fixed. For now, I am pushing this to see if it is enough.
@mikekryjak (Collaborator, Author)

The tests were still failing locally, but only intermittently. I wrote a bash script that loops the tests until they fail and then prints a stack trace. This uncovered another cause in test_boutdataset, where a dataset is opened and then saved shortly afterwards. My LLM reckons the lazy open still held a lock on the data files, which clashed with the save operation. I made the fix here: 689613b
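The loop script can be sketched roughly like this (a hypothetical reconstruction, not the actual script; a real run would pass something like a pytest invocation, and pytest itself prints the traceback on the failing run):

```shell
#!/usr/bin/env sh
# Run a command repeatedly until it fails, then report the pass count.
# Usage sketch: run_until_fail python -m pytest xbout/tests -x
run_until_fail() {
    passes=0
    while "$@"; do
        passes=$((passes + 1))
    done
    echo "failed after $passes passing run(s)"
}
```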

I also cherry-picked @dschwoerer's timeout and stack trace from 62ab549.
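A "stack trace on timeout" guard like the cherry-picked one can be sketched with the standard library (a hypothetical, minimal version, not the actual commit): if the watched code runs past the limit, every thread's traceback is dumped, so CI logs show exactly where a test hangs.

```python
import faulthandler
import tempfile
import time

# Arm a watchdog: after 0.2 s, dump all thread tracebacks to `log`
# without killing the process, so the hang site appears in the logs.
log = tempfile.TemporaryFile(mode="w+")
faulthandler.dump_traceback_later(0.2, file=log, exit=False)

time.sleep(0.5)  # stand-in for a test body that hangs past the limit

faulthandler.cancel_dump_traceback_later()  # disarm once the "test" returns
log.seek(0)
report = log.read()
assert "Timeout" in report  # the dump begins with "Timeout (...)!"
```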

While the CI continues, I will keep looping the tests locally to see if I can find more. If I do, I will make all file loads safe in this test file.

There is still the mystery of why it fails every time on CI but only intermittently locally. My LLM suspects the slower runners make timing and file-locking issues worse.

@mikekryjak (Collaborator, Author) commented Apr 14, 2026

I added safe I/O all over test_boutdataset, but this didn't eliminate all the issues. I then found that boutdataset.save and boutdataset.to_restart were still using the netcdf4 backend, and the stack traces I checked were always getting stuck there:

manager = CachingFileManager(<class 'netCDF4._netCDF4.Dataset'>, '/tmp/pytest-of-runner/pytest-0/test_to_restart_change_npe0/BOU...r': True, 'diskless': False, 'persist': False, 'format': 'NETCDF4'}, manager_id='f9a0705c-71c2-4731-a9e3-b3a690f6f32e')

I changed them to use h5netcdf by default and ran my test loop for a few hours with no failures. Let's hope the CI now succeeds as well.

I had confused engine and filetype: they are two separate things. filetype should be NETCDF4, which uses the latest standard, while the engine used to write the file is separate. It is now hooked up to the same logic as the other saves and defaults to h5netcdf.
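The distinction can be illustrated with a stdlib-only toy (hypothetical names, not the xarray API): filetype describes the on-disk container, engine names the library that writes it, and two engines can produce the same filetype.

```python
# Toy saver (hypothetical, not xarray): "filetype" is what the bytes on disk
# conform to; "engine" is which library writes them. They vary independently.
def save(path, filetype="NETCDF4", engine="h5netcdf"):
    writers = {
        # stand-ins for the netCDF4-python and h5netcdf libraries
        "netcdf4": lambda p, ft: f"netCDF4 library wrote {ft} file {p}",
        "h5netcdf": lambda p, ft: f"h5netcdf library wrote {ft} file {p}",
    }
    if engine not in writers:
        raise ValueError(f"unknown engine {engine!r}")
    return writers[engine](path, filetype)

# Same on-disk format, two different writing engines:
assert save("a.nc") == "h5netcdf library wrote NETCDF4 file a.nc"
assert save("a.nc", engine="netcdf4") == "netCDF4 library wrote NETCDF4 file a.nc"
```
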
@mikekryjak (Collaborator, Author)

The changes made Python 3.13+ succeed while the older versions still fail. The logs point to yet another failure mode: open_boutdataset hangs, and then open_dataset in test_boutdataset.py hangs again. Neither of these used h5netcdf as a backend, and switching to it seemed to help with the other failures, so I changed the engine on all of them, consistent with the rest.

@mikekryjak (Collaborator, Author) commented Apr 14, 2026

@dschwoerer @ZedThree @bendudson I think I may have fixed the CI problem. It was hard to track down because it was actually several separate hangs, each very intermittent on my local machine. I had to run the tests in a loop for up to an hour to catch failures and move forward. I used an LLM to diagnose the stack traces.

Initially, I made I/O safer in the tests that were failing, which reduced the failure rate a lot locally but not in CI.

Finally, I found that every single hang was happening in netCDF4, not h5netcdf. Changing open_boutdataset.save and open_boutdataset.to_restart to use the engine logic to select h5netcdf by default, as well as adding engine options to the low-level xarray reads in the associated tests, eliminated the failures in CI. I can't guarantee that nothing is left, given the intermittency I observed, but maybe we can deal with that as and when it surfaces. I am leaving @dschwoerer's stack trace on timeout in, just in case.

I don't know which of the above changes was responsible for the fix. It could be that the safe I/O wasn't necessary, so this PR could be simplified with further testing. However, I would argue that safe I/O is best practice and we can just merge it.

Let me know what you want me to do.

@ZedThree (Member) left a comment


Thanks @mikekryjak! By "safe IO", I assume you mean the with context managers? In which case, yes, it's definitely best to have them everywhere we can, but I can see it's not always practical in some of these tests.

@bendudson bendudson merged commit db641dd into master Apr 14, 2026
13 checks passed
@bendudson bendudson deleted the ci-apt-update branch April 14, 2026 16:34