fix(silo): do not spawn an excessive number of threads to avoid arrow bugs#1110
fix(silo): do not spawn an excessive number of threads to avoid arrow bugs#1110
Conversation
|
This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog. Note that this might contain changes that are on main, but not yet released. Changelog: 0.9.10 (2026-02-12)Bug Fixes
|
39195e9 to
0e53d34
Compare
pflanze
left a comment
There was a problem hiding this comment.
I don't understand yet what the effect is, except vaguely that the future is executed later?
0e53d34 to
354f7d5
Compare
|
I reworked the implementation to instead use arrow's future API in a different way, to avoid threads waiting for one another |
4e9d95b to
3b86853
Compare
|
This does not resolve the issue of spurious crashes on the benchmarking server during high load. While the code is more idiomatic after the changes in this PR, I would this leave this PR as "Undecided" for now, because we do not want to risk the stability of SILO in e.g. Loculus. As these changes to the threading could lead to slight changes of behavior / memory load / query timeout characteristics, these dynamics are not "worth it" if the underlying bigger issue of the spurious crashes is not resolved at the same time |
|
If you would like to do changes like these but are afraid of breakage, collecting queries from e.g. Loculus and running benchmarking jobs with them (with api-query's checksumming) could confirm that things run (as) OK (as without the change). (Run it once single-threaded to collect the checksums, then switch to stresstesting and let it verify.) |
This PR removes code that was introduced to avoid arrow bugs.
The spawning of a thread for every query that is send, just to add some indirection should be avoided. Instead, I now introduce a guard that avoids that the source nodes in the arrow plan start generating batches, before the query plan execution is fully started (i.e.
arrow_plan.StartProducing()returns)This could avoid the situation where SILO sometimes crashes in the benchmarking, where a very high load of queries are send per second (more than curl could ever send). #1035