src: fix deadlock in NodePlatform::DrainTasks on process shutdown#61963
src: fix deadlock in NodePlatform::DrainTasks on process shutdown#61963RajeshKumar11 wants to merge 2 commits intonodejs:mainfrom
Conversation
Replace the unconditional BlockingDrain() in DrainTasks() with a 1ms timed drain loop that periodically flushes foreground tasks while waiting for outstanding kUserBlocking worker tasks to complete. This fixes a deadlock where a kUserBlocking worker task (e.g. Maglev JIT compilation) posts a foreground task and waits for it. If the foreground task is posted after the main thread flushes the foreground queue but before BlockingDrain() enters its indefinite sleep, both threads wait on each other — a mutual deadlock. With the timed drain, the main thread wakes every 1ms and flushes any foreground tasks that were posted in that window. In the common case, tasks complete before the timeout and the wait is woken immediately by NotifyOfOutstandingCompletion(), so there is no performance regression. Adds ConditionVariableBase::TimedWait() backed by uv_cond_timedwait() and TaskQueue::Locked::TimedBlockingDrain() to support this. Fixes: nodejs#54918
Fix clang-format alignment issues in node_mutex.h: - Split cond_timedwait parameters onto separate lines - Correct continuation alignment in TimedWait implementation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #61963 +/- ##
==========================================
+ Coverage 89.75% 89.77% +0.02%
==========================================
Files 675 674 -1
Lines 204721 205619 +898
Branches 39344 39422 +78
==========================================
+ Hits 183751 184600 +849
- Misses 13230 13275 +45
- Partials 7740 7744 +4
🚀 New features to boost your workflow:
|
The test mentioned in OP have already been fixed by #58047 and have stopped appearing in the CI, so this is not a valid reproduction. I think to reproduce the theoretical deadlock you'll need to force the race with a blocking task, which is a bit rare. I don't think the previous reproduction in #58047 (comment) even reproduce on the main branch, so this may have been a non-issue now. The reasoning in the OP looks like a hallucination, we do not use std::promise, also std::promise does not interact with v8 at all (even JS promise has nothing to do with this, the V8 task queue is a completely different concept from microtask queue). That kind of task only gets posted by V8 in e.g. special kind of GC or compilation, which doesn't seem to happen currently for Node.js (we don't do off-thread compilation). Considering that the issue 1 is not reproducible, doing a timed drain not only does not fix issue 2 in the comment, but make it even worse by introducing a wakeup overhead. |
|
Thank you @joyeecheung for the thorough and detailed review — this is exactly the kind of expert context that's hard to get from reading the code alone. You're right on all counts:
I'll close this PR. If the Thanks again for the explanation. |
Summary
Fixes #54918.
NodePlatform::DrainTasks()was callingBlockingDrain()which waitsindefinitely for outstanding
kUserBlockingworker tasks to complete.This creates a deadlock when:
kUserBlockingworker task (e.g. Maglev JIT compilation) completesits work and posts a foreground task, then waits for that foreground task
to run (e.g. via a
std::promise<>resolve).DrainTasks) flushes the foreground queue — missingthe newly-posted task because it was posted just after the flush.
BlockingDrain()'s indefinite sleep.Fix
Replace
BlockingDrain()withTimedBlockingDrain(1ms). On timeout, flushforeground tasks before retrying. This breaks the deadlock because the worker
task's foreground task will be flushed in the next 1ms window.
In the common (non-deadlock) case, tasks complete before the timeout and
NotifyOfOutstandingCompletion()wakes the wait immediately — noperformance regression.
New APIs added:
ConditionVariableBase::TimedWait()— wrapsuv_cond_timedwait()TaskQueue::Locked::TimedBlockingDrain()— timed version ofBlockingDrain()WorkerThreadsTaskRunner::TimedBlockingDrain()— delegates toTaskQueueTest
Stress-tested locally with 50 (and 1000) iterations of
test/parallel/test-http2-large-write-multiple-requests.js— all pass:Previously this test was a known reproducer for the deadlock on some machines.
Related
FIXME(54918)comment added in src: only block on user blocking worker tasks #58047@ronag in the issue — move the blocking wait off the hot path.