Skip to content

fix(silo): do not spawn an excessive number of threads to avoid arrow bugs#1110

Open
Taepper wants to merge 1 commit intomainfrom
avoid-many-threads
Open

fix(silo): do not spawn an excessive number of threads to avoid arrow bugs#1110
Taepper wants to merge 1 commit intomainfrom
avoid-many-threads

Conversation

@Taepper
Copy link
Copy Markdown
Collaborator

@Taepper Taepper commented Jan 19, 2026

This PR removes code that was introduced to avoid arrow bugs.

The spawning of a thread for every query that is send, just to add some indirection should be avoided. Instead, I now introduce a guard that avoids that the source nodes in the arrow plan start generating batches, before the query plan execution is fully started (i.e. arrow_plan.StartProducing() returns)

This could avoid the situation where SILO sometimes crashes in the benchmarking, where a very high load of queries are send per second (more than curl could ever send). #1035

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jan 19, 2026

This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog.

Note that this might contain changes that are on main, but not yet released.

Changelog:

0.9.10 (2026-02-12)

Bug Fixes

  • silo: do not spawn an excessive number of threads to avoid arrow bugs (3b86853)

@Taepper Taepper requested a review from pflanze January 19, 2026 14:29
@Taepper Taepper force-pushed the avoid-many-threads branch 2 times, most recently from 39195e9 to 0e53d34 Compare January 19, 2026 14:36
Copy link
Copy Markdown
Contributor

@pflanze pflanze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand yet what the effect is, except vaguely that the future is executed later?

Comment thread src/silo/query_engine/actions/simple_select_action.cpp Outdated
Comment thread src/silo/query_engine/exec_node/table_scan.h Outdated
Comment thread src/silo/query_engine/query_plan.h Outdated
Comment thread src/silo/query_engine/exec_node/produce_guard.h Outdated
@Taepper
Copy link
Copy Markdown
Collaborator Author

Taepper commented Feb 5, 2026

I reworked the implementation to instead use arrow's future API in a different way, to avoid threads waiting for one another

Copy link
Copy Markdown
Contributor

@pflanze pflanze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@Taepper
Copy link
Copy Markdown
Collaborator Author

Taepper commented Mar 3, 2026

This does not resolve the issue of spurious crashes on the benchmarking server during high load.

While the code is more idiomatic after the changes in this PR, I would this leave this PR as "Undecided" for now, because we do not want to risk the stability of SILO in e.g. Loculus. As these changes to the threading could lead to slight changes of behavior / memory load / query timeout characteristics, these dynamics are not "worth it" if the underlying bigger issue of the spurious crashes is not resolved at the same time

@pflanze
Copy link
Copy Markdown
Contributor

pflanze commented Mar 3, 2026

If you would like to do changes like these but are afraid of breakage, collecting queries from e.g. Loculus and running benchmarking jobs with them (with api-query's checksumming) could confirm that things run (as) OK (as without the change). (Run it once single-threaded to collect the checksums, then switch to stresstesting and let it verify.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants