Remove deadlocks by making the scheduler signal handler signal-safe #94

AlexJones0 · 2026-02-10T16:06:41Z

The scheduler's signal handler uses logging and threading primitives, which can occasionally deadlock and leave the process hanging on a SIGINT/SIGTERM. We also want a signal to interrupt our poll wait/sleep without busy waiting (for performance). That rules out time.sleep (a signal that arrives just before the sleep starts will not interrupt it, and checking a flag beforehand leaves a TOCTOU race between the check and the sleep). It also rules out signal.sigtimedwait (which handles signals that arrive during the wait itself, but misses signals that arrive outside it).

This leaves us with one workable solution - create an OS pipe, register a selector on its read file descriptor, and have the signal handler record the signal number in a flag and write to the write file descriptor. Querying the flag in the main loop always tells us whether a signal has arrived, and waiting on the fd means a signal reliably cuts the wait short, even if it arrives before the wait begins, while the wait itself stays blocking (i.e. not a busy wait). The signal handler is then minimal and async-signal-safe: it only sets a flag and writes to the pipe. The logging that used to happen in the handler is now dispatched from the main loop instead.
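For readers who don't want to dig through the diff, the pattern described above can be sketched roughly as follows. This is a minimal illustration for POSIX systems, not the code in this PR: the `SignalWaiter` class and its `received`, `wait` and `close` names are hypothetical, and the real scheduler will differ in structure and error handling.

```python
import os
import selectors
import signal


class SignalWaiter:
    """Hypothetical helper: wake a blocking wait from a minimal signal handler."""

    def __init__(self, signums=(signal.SIGINT, signal.SIGTERM)):
        # Non-blocking pipe: the handler must never block on a full pipe,
        # and draining must never block on an empty one.
        self._read_fd, self._write_fd = os.pipe()
        os.set_blocking(self._read_fd, False)
        os.set_blocking(self._write_fd, False)
        self._selector = selectors.DefaultSelector()
        self._selector.register(self._read_fd, selectors.EVENT_READ)
        self.received = None  # flag: number of the last signal seen, if any
        # Must run in the main thread; signal.signal() raises otherwise.
        self._previous = {s: signal.signal(s, self._handler) for s in signums}

    def _handler(self, signum, frame):
        # Deliberately minimal: no logging, no locks. Set the flag, poke the pipe.
        self.received = signum
        try:
            os.write(self._write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe is full, so a wakeup is already pending

    def wait(self, timeout):
        """Block for up to `timeout` seconds; return the signal number or None."""
        if self._selector.select(timeout):
            try:
                while os.read(self._read_fd, 1024):
                    pass  # drain pending wakeup bytes
            except BlockingIOError:
                pass  # pipe is empty again
        # The flag is not reset here: this sketch assumes the caller shuts
        # down after the first signal, and does its logging at that point.
        return self.received

    def close(self):
        # Restore the previous handlers and release the descriptors.
        for signum, handler in self._previous.items():
            signal.signal(signum, handler)
        self._selector.close()
        os.close(self._read_fd)
        os.close(self._write_fd)
```

A main loop would then call something like `signum = waiter.wait(poll_interval)` in place of `time.sleep(poll_interval)`, and do any logging and shutdown work once `signum` is not `None`. Because a byte written before the wait starts is still sitting in the pipe when `select` is called, an early signal makes the wait return immediately rather than being lost. The standard library's `signal.set_wakeup_fd` offers a related mechanism; whether it would suit this scheduler is not something this sketch tries to settle.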

Edit: The diff is unfortunately not very nice - it might be easier to view with your Git tooling of preference, or just compare the old and new code side-by-side.

This can be tested by running e.g. pytest -k test_signal --count 100 -n auto:

Before this PR, I got a result of: 17 xfailed, 586 xpassed in 30.84s
With this PR, I get a result of: 600 xpassed in 29.82s

machshev

Good spot! Thanks for fixing this @AlexJones0

See relevant comments - we can get deadlocks with logging and threading primitives rarely which causes the process to hang around 5% of the time on a SIGINT/SIGTERM, but we also want to be able to have the signal interrupt our poll wait/sleep without busy waiting (for performance), which means we also cannot use time.sleep (an early signal will not interrupt this, and a pre-check leads to ToC-ToU races), nor signal.sigtimedwait (registers its own handlers to handle signals inside the wait, but misses signals outside).

This leaves us with one clear solution - use an OS pipe and define a selector on the read file descriptor, and have the signal handler set a flag with the signal number and write to the write file descriptor. By querying the flag we always know if we have handled a signal in our main loop, and by using a fd we reliably skip the wait on a signal, where the wait is blocking (i.e. not a busy wait).

The signal handler is then minimal and async-signal-safe, just setting a flag and writing to the pipe. The relevant logging logic is moved to be dispatched by the main loop instead.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

With the scheduler signal handler fixed to be async-signal-safe, this test should no longer be flaky and can be expected to pass consistently. Running pytest -k test_signal --count 100 -n auto gives 600/600 passes locally.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

AlexJones0 requested review from hcallahan-lowrisc, machshev and rswarbrick February 10, 2026 16:08

machshev approved these changes Feb 11, 2026

AlexJones0 added 2 commits February 11, 2026 13:59

AlexJones0 force-pushed the signal_safe_fix branch from e245e31 to 4208e8f Compare February 11, 2026 14:02

AlexJones0 added this pull request to the merge queue Feb 12, 2026

Merged via the queue into lowRISC:master with commit 5400046 Feb 12, 2026
6 checks passed
