Fix CI: apt update on runner, file lock race condition #341
Conversation
CI was erroring out on installing libcurl4-openssl-dev before.
Fixes CI hang on Python >= 3.13
This is the cause of the next hang locally. The hang is in a different file than last time, which suggests that many tests may need to be fixed. For now, I am pushing this to see if it is enough.
The tests were still failing locally, but intermittently. I made a bash script which loops the tests until they fail and then prints a stack trace. I found another cause inside another test. I also cherry-picked @dschwoerer's timeout and stack trace from 62ab549. While the CI continues, I will keep looping the tests locally to see if I can find more; if I do, I will make all file loads safe in this test file. There is still the mystery of why it fails every time on CI but only intermittently locally. My LLM thinks it could be because the runners are slow, which could make timing and file-locking issues worse.
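A minimal sketch of the kind of loop script described above (the pytest target and flags are illustrative, not the actual script from this PR):

```shell
#!/usr/bin/env bash
# Loop the suspect test file until it fails, so the last run's output
# (including any timeout stack trace) is what's left on screen.
# The test path and pytest flags below are placeholders.
set -u
i=0
while python -m pytest xbout/tests/test_boutdataset.py -x -q; do
  i=$((i + 1))
  echo "pass $i succeeded, looping again..."
done
echo "failed after $i successful passes"
```

For intermittent hangs, pairing this with a per-run timeout (as in the cherry-picked commit) ensures the loop terminates instead of wedging.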
I added safe I/O all over test_boutdataset, but this didn't eliminate all issues. I then found that I changed them to use
I got confused between engine and filetype: they are two separate things. filetype should be NETCDF4, which uses the latest standard, but the engine that writes the file is a separate choice. It is now hooked up to the same tool as the other saves and defaults to h5netcdf.
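A small sketch of the distinction, assuming xarray and h5netcdf are installed (the dataset contents here are made up):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# `format` picks the on-disk netCDF standard; `engine` picks the library
# that does the writing. They are independent settings.
ds = xr.Dataset({"n": ("x", np.arange(3.0))})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "out.nc")
    # A NETCDF4-format file, written by the h5netcdf engine
    ds.to_netcdf(path, format="NETCDF4", engine="h5netcdf")
    # Read it back with the same engine for consistency
    with xr.open_dataset(path, engine="h5netcdf") as loaded:
        print(loaded["n"].values)
```

Using the same engine for reads and writes throughout the test suite avoids mixing two HDF5 stacks on the same file.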
9e92aba to 6a85150
The changes made 3.13+ succeed and the older versions fail. The logs point to yet another mode of failure: open_boutdataset hanging, and then once again open_dataset hanging in test_boutdataset.py. I noticed that neither of these uses h5netcdf as a backend, and that seemed to help for the other failures, so I changed the engine on all of them, consistent with the rest.
@dschwoerer @ZedThree @bendudson I think I may have fixed the CI problem. It was really hard because it was actually several separate hangs, all very intermittent on my local machine. I had to run the tests in a loop for up to an hour to catch failures and move forward, and I used an LLM to diagnose the stack traces. Initially, I made I/O safer in the tests which were failing, which reduced the fail rate a lot locally but not in CI. Finally, I found that every single hang was happening in netCDF4, not h5netcdf, so I changed the engine. I don't know which of the above was responsible for the fix: it could be that the safe I/O wasn't necessary, and so this PR could be simplified with further testing. However, I would argue that safe I/O is best practice and we can just merge it. Let me know what you want me to do.
ZedThree
left a comment
Thanks @mikekryjak! By "safe I/O", I assume you mean the `with` context managers? In which case, yes, definitely best to have them everywhere we can, but I can see it's not always practical in some of these tests.
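For illustration, a sketch of the unsafe vs safe load pattern being discussed, assuming xarray with any netCDF backend installed (the file and variable names are made up):

```python
import os
import tempfile

import numpy as np
import xarray as xr

ds = xr.Dataset({"n": ("x", np.arange(3.0))})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "restart.nc")
    ds.to_netcdf(path)

    # Unsafe: the underlying file handle stays open until the object is
    # garbage-collected, which can race with a later open/write of the
    # same file, especially on a slow CI runner.
    #   loaded = xr.open_dataset(path)

    # Safe: the handle is closed deterministically when the block exits.
    with xr.open_dataset(path) as loaded:
        values = loaded["n"].values
```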
The first error is:
I tried to follow the error's advice and add `sudo apt-get update`, which seems to resolve the same issue in xHermes. This PR adds it to the xBOUT CI.

The second issue is the tests hanging on test_boutdataset.py::TestSaveRestart::test_to_restart. I recreated the CI environment locally and reproduced the issue. Thanks to @dschwoerer's stack trace debug and the help of an LLM, I was then able to narrow this down to an unsafely loaded dataset in that test. This resolves the issue on my end, at least.
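A stdlib sketch of the kind of timeout-plus-stack-trace aid mentioned above (a hypothetical stand-in for the cherry-picked change, not the actual code from 62ab549; the timeout value is illustrative):

```python
import faulthandler
import sys

# If the process is still running after `timeout` seconds, dump every
# thread's stack trace to stderr so the hang site shows up in CI logs.
# With exit=True the process is then killed instead of hanging forever.
faulthandler.dump_traceback_later(timeout=300, exit=True, file=sys.stderr)

# ... run the test suite / suspect code here ...

# Cancel the watchdog once the risky section completes normally.
faulthandler.cancel_dump_traceback_later()
```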