Skip to content

Comments

src: fix deadlock in NodePlatform::DrainTasks on process shutdown#61963

Closed
RajeshKumar11 wants to merge 2 commits intonodejs:mainfrom
RajeshKumar11:fix/54918-drain-tasks-deadlock
Closed

src: fix deadlock in NodePlatform::DrainTasks on process shutdown#61963
RajeshKumar11 wants to merge 2 commits intonodejs:mainfrom
RajeshKumar11:fix/54918-drain-tasks-deadlock

Conversation

@RajeshKumar11
Copy link
Contributor

Summary

Fixes #54918.

NodePlatform::DrainTasks() was calling BlockingDrain() which waits
indefinitely for outstanding kUserBlocking worker tasks to complete.
This creates a deadlock when:

  1. A kUserBlocking worker task (e.g. Maglev JIT compilation) completes
    its work and posts a foreground task, then waits for that foreground task
    to run (e.g. via a std::promise<> resolve).
  2. The main thread (in DrainTasks) flushes the foreground queue — missing
    the newly-posted task because it was posted just after the flush.
  3. The main thread enters BlockingDrain()'s indefinite sleep.
  4. Both threads are now waiting on each other: a classic mutual deadlock.

Fix

Replace BlockingDrain() with TimedBlockingDrain(1ms). On timeout, flush
foreground tasks before retrying. This breaks the deadlock because the worker
task's foreground task will be flushed in the next 1ms window.

In the common (non-deadlock) case, tasks complete before the timeout and
NotifyOfOutstandingCompletion() wakes the wait immediately — no
performance regression.

New APIs added:

  • ConditionVariableBase::TimedWait() — wraps uv_cond_timedwait()
  • TaskQueue::Locked::TimedBlockingDrain() — timed version of BlockingDrain()
  • WorkerThreadsTaskRunner::TimedBlockingDrain() — delegates to TaskQueue

Test

Stress-tested locally with 50 (and 1000) iterations of
test/parallel/test-http2-large-write-multiple-requests.js — all pass:

[00:02|% 100|+  50|-   0]: Done
All tests passed.

Previously this test was a known reproducer for the deadlock on some machines.

Related

Replace the unconditional BlockingDrain() in DrainTasks() with a 1ms
timed drain loop that periodically flushes foreground tasks while
waiting for outstanding kUserBlocking worker tasks to complete.

This fixes a deadlock where a kUserBlocking worker task (e.g. Maglev
JIT compilation) posts a foreground task and waits for it. If the
foreground task is posted after the main thread flushes the foreground
queue but before BlockingDrain() enters its indefinite sleep, both
threads wait on each other — a mutual deadlock.

With the timed drain, the main thread wakes every 1ms and flushes any
foreground tasks that were posted in that window. In the common case,
tasks complete before the timeout and the wait is woken immediately by
NotifyOfOutstandingCompletion(), so there is no performance regression.

Adds ConditionVariableBase::TimedWait() backed by uv_cond_timedwait()
and TaskQueue::Locked::TimedBlockingDrain() to support this.

Fixes: nodejs#54918
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Feb 24, 2026
Fix clang-format alignment issues in node_mutex.h:
- Split cond_timedwait parameters onto separate lines
- Correct continuation alignment in TimedWait implementation
@codecov
Copy link

codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.77%. Comparing base (ad945c5) to head (01323a3).
⚠️ Report is 51 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #61963      +/-   ##
==========================================
+ Coverage   89.75%   89.77%   +0.02%     
==========================================
  Files         675      674       -1     
  Lines      204721   205619     +898     
  Branches    39344    39422      +78     
==========================================
+ Hits       183751   184600     +849     
- Misses      13230    13275      +45     
- Partials     7740     7744       +4     
Files with missing lines Coverage Δ
src/node_mutex.h 97.91% <100.00%> (+0.94%) ⬆️
src/node_platform.cc 75.32% <100.00%> (-0.92%) ⬇️
src/node_platform.h 91.66% <ø> (ø)

... and 126 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ronag ronag requested a review from joyeecheung February 24, 2026 09:41
@joyeecheung
Copy link
Member

joyeecheung commented Feb 24, 2026

Previously this test was a known reproducer for the deadlock on some machines.

The test mentioned in OP have already been fixed by #58047 and have stopped appearing in the CI, so this is not a valid reproduction.

I think to reproduce the theoretical deadlock you'll need to force the race with a blocking task, which is a bit rare. I don't think the previous reproduction in #58047 (comment) even reproduce on the main branch, so this may have been a non-issue now.

The reasoning in the OP looks like a hallucination, we do not use std::promise, also std::promise does not interact with v8 at all (even JS promise has nothing to do with this, the V8 task queue is a completely different concept from microtask queue). That kind of task only gets posted by V8 in e.g. special kind of GC or compilation, which doesn't seem to happen currently for Node.js (we don't do off-thread compilation). Considering that the issue 1 is not reproducible, doing a timed drain not only does not fix issue 2 in the comment, but make it even worse by introducing a wakeup overhead.

@RajeshKumar11
Copy link
Contributor Author

Thank you @joyeecheung for the thorough and detailed review — this is exactly the kind of expert context that's hard to get from reading the code alone.

You're right on all counts:

  • The std::promise<> reference in the PR description was incorrect — I conflated a general concurrency pattern with how V8 actually implements task synchronization internally.
  • Citing Maglev JIT as an example was wrong; current Node.js doesn't run off-thread compilation in a way that would trigger this path.
  • The http2 test I cited was already made non-flaky by src: only block on user blocking worker tasks #58047 and was never a valid reproducer for the theoretical race I was trying to address.
  • And importantly, the timed drain doesn't address issue 2 from the original FIXME (blocking on tasks from unrelated isolates) and adds wakeup overhead that makes things worse.

I'll close this PR. If the --concurrent-maglev-high-priority-threads path you mentioned in #58047 ever becomes a real concern worth addressing, a proper fix would need the re-architecting approach you described (allowing foreground tasks posted by worker tasks to wake the main thread) rather than a polling loop.

Thanks again for the explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deadlock at process shutdown

3 participants