Fix CI: apt update on runner, file lock race condition #341
Conversation
CI was erroring out on installing libcurl4-openssl-dev before.
Fixes CI hang on Python >= 3.13
This is the cause of the next hang locally. The hang is in a different file than last time, which suggests that many tests may need to be fixed. For now, I am pushing this to see if it is enough.
The tests were still failing locally, but intermittently. I made a bash script which loops the tests until they fail and then prints a stack trace. I found another cause inside another test. I also cherry-picked @dschwoerer's timeout and stack trace from 62ab549. While the CI continues, I will keep looping the tests locally to see if I can find more; if I do, I will make all file loads safe in this test file. There is still the mystery of why it fails every time on CI but only intermittently locally. My LLM thinks it could be because the runners are slow, which could make timing and file-locking issues worse.
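A minimal sketch of the kind of loop script described above (the pytest target and flags are illustrative, not the actual script from this PR):

```shell
#!/usr/bin/env bash
# Loop the suspect test file until it fails, so the last run's output
# (including any timeout stack trace) is what's left on screen.
# The test path and pytest flags below are placeholders.
set -u
i=0
while python -m pytest xbout/tests/test_boutdataset.py -x -q; do
  i=$((i + 1))
  echo "pass $i succeeded, looping again..."
done
echo "failed after $i successful passes"
```

For intermittent hangs, pairing this with a per-run timeout (as in the cherry-picked commit) ensures the loop terminates instead of wedging.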
I added safe I/O all over test_boutdataset, but this didn't eliminate all issues. I then found that I changed them to use
I got confused between engine and filetype: they are two separate things. filetype should be NETCDF4, which uses the latest standard, but the engine that writes the file is a separate choice. It is now hooked up to the same tool as the other saves and defaults to h5netcdf.
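A small sketch of the distinction, assuming xarray and h5netcdf are installed (the dataset contents here are made up):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# `format` picks the on-disk netCDF standard; `engine` picks the library
# that does the writing. They are independent settings.
ds = xr.Dataset({"n": ("x", np.arange(3.0))})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "out.nc")
    # A NETCDF4-format file, written by the h5netcdf engine
    ds.to_netcdf(path, format="NETCDF4", engine="h5netcdf")
    # Read it back with the same engine for consistency
    with xr.open_dataset(path, engine="h5netcdf") as loaded:
        print(loaded["n"].values)
```

Using the same engine for reads and writes throughout the test suite avoids mixing two HDF5 stacks on the same file.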
9e92aba to 6a85150
The changes made 3.13+ succeed and the older versions fail. The logs point to yet another mode of failure: open_boutdataset hanging, and then once again open_dataset hanging in test_boutdataset.py. I noticed that neither of these uses h5netcdf as a backend, and that seemed to help for the other failures, so I changed the engine on all of them, consistent with the rest.
@dschwoerer @ZedThree @bendudson I think I may have fixed the CI problem. It was really hard because it was actually several separate hangs, all very intermittent on my local machine. I had to run the tests in a loop for up to an hour to catch failures and move forward, and I used an LLM to diagnose the stack traces. Initially, I made I/O safer in the tests which were failing, which reduced the fail rate a lot locally but not in CI. Finally, I found that every single hang was happening in netCDF4, not h5netcdf, so I changed the engine. I don't know which of the above was responsible for the fix: it could be that the safe I/O wasn't necessary, and so this PR could be simplified with further testing. However, I would argue that safe I/O is best practice and we can just merge it. Let me know what you want me to do.
ZedThree
left a comment
Thanks @mikekryjak! By "safe I/O", I assume you mean the `with` context managers? In which case, yes, definitely best to have them everywhere we can, but I can see it's not always practical in some of these tests.
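For illustration, a sketch of the unsafe vs safe load pattern being discussed, assuming xarray with any netCDF backend installed (the file and variable names are made up):

```python
import os
import tempfile

import numpy as np
import xarray as xr

ds = xr.Dataset({"n": ("x", np.arange(3.0))})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "restart.nc")
    ds.to_netcdf(path)

    # Unsafe: the underlying file handle stays open until the object is
    # garbage-collected, which can race with a later open/write of the
    # same file, especially on a slow CI runner.
    #   loaded = xr.open_dataset(path)

    # Safe: the handle is closed deterministically when the block exits.
    with xr.open_dataset(path) as loaded:
        values = loaded["n"].values
```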
The first error is:
I tried to follow the error's advice and add `sudo apt-get update`, which seems to resolve the same issue in xHermes. This PR adds it to the xBOUT CI.

The second issue is the tests hanging on test_boutdataset.py::TestSaveRestart::test_to_restart. I recreated the CI environment locally and reproduced the issue. Thanks to @dschwoerer's stack trace debug and the help of an LLM, I was then able to narrow this down to an unsafely loaded dataset in that test. This resolves the issue on my end, at least.
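A stdlib sketch of the kind of timeout-plus-stack-trace aid mentioned above (a hypothetical stand-in for the cherry-picked change, not the actual code from 62ab549; the timeout value is illustrative):

```python
import faulthandler
import sys

# If the process is still running after `timeout` seconds, dump every
# thread's stack trace to stderr so the hang site shows up in CI logs.
# With exit=True the process is then killed instead of hanging forever.
faulthandler.dump_traceback_later(timeout=300, exit=True, file=sys.stderr)

# ... run the test suite / suspect code here ...

# Cancel the watchdog once the risky section completes normally.
faulthandler.cancel_dump_traceback_later()
```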