@AlexJones0 AlexJones0 commented Feb 10, 2026

Logging and threading primitives in the scheduler's signal handler can occasionally deadlock, which causes the process to hang on a SIGINT/SIGTERM. We also want a signal to interrupt our poll wait/sleep without busy waiting (for performance). That rules out time.sleep (a signal that arrives early will not interrupt it, and a pre-check could lead to TOCTOU races) as well as signal.sigtimedwait (which registers its own handlers to handle signals inside the wait, but misses signals outside).

This leaves us with one workable solution: use an OS pipe, define a selector on the read file descriptor, and have the signal handler set a flag with the signal number and write to the write file descriptor. By querying the flag, the main loop always knows whether a signal has been received, and because the wait is a blocking wait on a file descriptor (i.e. not a busy wait), a signal reliably cuts it short. The signal handler is then minimal and async-signal-safe: it only sets a flag and writes to the pipe. The relevant logging logic is moved out of the handler and is dispatched by the main loop instead.

Edit: The diff is unfortunately not very nice - it might be easier to view with your Git tooling of preference, or just compare the old and new code side-by-side.

This can be tested by running e.g. pytest -k test_signal --count 100 -n auto:

  • Before this PR, I got a result of: 17 xfailed, 586 xpassed in 30.84s
  • With this PR, I get a result of: 600 xpassed in 29.82s

@machshev machshev left a comment

Good spot! Thanks for fixing this @AlexJones0

See relevant comments. Logging and threading primitives in the signal
handler can occasionally deadlock, which causes the process to hang
around 5% of the time on a SIGINT/SIGTERM. We also want a signal to
interrupt our poll wait/sleep without busy waiting (for performance),
which rules out time.sleep (a signal that arrives early will not
interrupt it, and a pre-check leads to TOCTOU races) as well as
signal.sigtimedwait (registers its own handlers to handle signals inside
the wait, but misses signals outside).

This leaves us with one clear solution - use an OS pipe and define a
selector on the read file descriptor, and have the signal handler set a
flag with the signal number and write to the write file descriptor. By
querying the flag we always know if we have handled a signal in our main
loop, and by using a fd we reliably skip the wait on a signal, where the
wait is blocking (i.e. not a busy wait). The signal handler is then
minimal and async-signal-safe, just setting a flag and writing to the
pipe. The relevant logging logic is moved to be dispatched by the main
loop instead.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>

With the scheduler signal handler fixed to be async-signal-safe, this
test should no longer be flaky and can be expected to pass consistently.

Running
  pytest -k test_signal --count 100 -n auto
gives 600/600 passes locally.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
@AlexJones0 AlexJones0 added this pull request to the merge queue Feb 12, 2026
Merged via the queue into lowRISC:master with commit 5400046 Feb 12, 2026
6 checks passed