Skip to content

fix(batcher): fix deadlock on reinit#1518

Open
jpnurmi wants to merge 8 commits intomasterfrom
jpnurmi/fix/logs-reinit
Open

fix(batcher): fix deadlock on reinit#1518
jpnurmi wants to merge 8 commits intomasterfrom
jpnurmi/fix/logs-reinit

Conversation

@jpnurmi
Copy link
Collaborator

@jpnurmi jpnurmi commented Feb 13, 2026

sentry__batcher_flush() acquires g_options_lock via SENTRY_WITH_OPTIONS to get transport/run:

SENTRY_WITH_OPTIONS (options) {
if (crash_safe) {
// Write directly to disk to avoid transport queuing during
// crash
sentry__run_write_envelope(options->run, envelope);
sentry_envelope_free(envelope);
} else {
// Normal operation: use transport for HTTP transmission
sentry__capture_envelope(options->transport, envelope);
}
}

The batcher thread can never acquire this lock when the main thread holds it across sentry_close(), such as during re-init of the SDK:

Main thread                          Batcher thread
───────────                          ──────────────
sentry_init() [2nd call]
  lock(g_options_lock)  [depth 1]
  sentry_close()
    sentry__logs_shutdown()
      cond_wake ──────────────────► wakes up
      thread_join [BLOCKS]           flush_logs_queue()
           │                           SENTRY_WITH_OPTIONS
           │                             getref → lock [BLOCKS]
           │                             (different thread, recursive irrelevant)
        DEADLOCK

Close: #1515

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Feb 13, 2026

tests/test_unit.py::test_unit[logs_reinit] Test logs_reinit...                             [sentry] INFO using database path "C:\Users\jpnurmi\AppData\Local\Temp\pytest-of-jpnurmi\pytest-310\cmake0\.sentry-native"
[sentry] DEBUG processing and pruning old runs
[sentry] DEBUG Starting batching thread
[sentry] DEBUG batcher thread ready after 10 ms
[sentry] INFO LOG: log 0
[sentry] INFO LOG: log 1
[sentry] INFO LOG: log 2
[sentry] INFO LOG: log 3
[sentry] INFO LOG: log 4
[sentry] DEBUG beginning logs system shutdown

@supervacuus
Copy link
Collaborator

supervacuus commented Feb 13, 2026

Just a quick hint from my side: since both options->run and options->transport are stable during any batcher life-cycle, this code shouldn't use SENTRY_WITH_OPTIONS and instead get references to those two during init.

We have a couple of places that use SENTRY_WITH_OPTIONS unnecessarily, and this is one of them. There is no need to be contending options in this code.

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Feb 13, 2026

At first, I was thinking of passing transport and run to the batcher, but there's another SENTRY_WITH_OPTIONS in sentry_user_consent_get, which is called from sentry__capture_envelope. We might have to pass the whole options to be able to respect user consent for logs and metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Feb 13, 2026

Likewise, sentry__envelope_new uses SENTRY_WITH_OPTIONS too. 😓

(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000019efcb9b8 libsystem_kernel.dylib`__ulock_wait + 8
    frame #1: 0x000000019f00e06c libsystem_pthread.dylib`_pthread_join + 608
    frame #2: 0x0000000104302114 sentry_test_unit`sentry__batcher_shutdown_wait(batcher=0x0000000104364198, timeout=<unavailable>) at sentry_batcher.c:344:5 [opt]
    frame #3: 0x000000010430b764 sentry_test_unit`sentry__metrics_shutdown_wait(timeout=<unavailable>) at sentry_metrics.c:161:5 [opt]
    frame #4: 0x0000000104302aa4 sentry_test_unit`sentry_close at sentry_core.c:364:13 [opt]
    frame #5: 0x00000001043023e8 sentry_test_unit`sentry_init(options=0x000000012d609ea0) at sentry_core.c:172:5 [opt]
    frame #6: 0x000000010432a808 sentry_test_unit`test_sentry_metrics_reinit at test_metrics.c:439:5 [opt]
    frame #7: 0x000000010431e4a0 sentry_test_unit`test_do_run__(test=0x0000000104360e50, index=<unavailable>) at acutest.h:943:9 [opt]
    frame #8: 0x000000010431d99c sentry_test_unit`test_run__(test=<unavailable>, index=1, master_index=117) at acutest.h:1039:23 [opt]
    frame #9: 0x000000010431d988 sentry_test_unit`main(argc=<unavailable>, argv=0x000000016bb0aea0) at acutest.h:1623:13 [opt]
    frame #10: 0x000000019ec6ab98 dyld`start + 6076
  thread #2
    frame #0: 0x000000019efcb8b0 libsystem_kernel.dylib`__workq_kernreturn + 8
  thread #3, name = 'sentry-http'
    frame #0: 0x000000019efcd3cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019f00c09c libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x000000010430fb68 sentry_test_unit`sentry__cond_wait_timeout(cv=<unavailable>, mutex=<unavailable>, msecs=1000) at sentry_sync.h:349:12 [opt] [inlined]
    frame #3: 0x000000010430fb38 sentry_test_unit`worker_thread(data=0x000000012e805400) at sentry_sync.c:258:13 [opt]
    frame #4: 0x000000019f00bbc8 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x000000019efcc89c libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x000000019f008e14 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x000000019f006840 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 220
    frame #3: 0x0000000104302260 sentry_test_unit`sentry__options_getref at sentry_core.c:52:5 [opt]
    frame #4: 0x0000000104306bcc sentry_test_unit`sentry__envelope_new at sentry_envelope.c:196:5 [opt]
    frame #5: 0x0000000104301bc8 sentry_test_unit`sentry__batcher_flush(batcher=0x0000000104364198, crash_safe=false) at sentry_batcher.c:120:43 [opt]
    frame #6: 0x0000000104301f8c sentry_test_unit`batcher_thread_func(data=0x0000000104364198) at sentry_batcher.c:280:9 [opt]
    frame #7: 0x000000019f00bbc8 libsystem_pthread.dylib`_pthread_start + 136

@jpnurmi jpnurmi changed the title WIP: fix(logs): fix deadlock on reinit WIP: fix(batcher): fix deadlock on reinit Feb 13, 2026
jpnurmi and others added 4 commits February 13, 2026 13:52
…dlock

When sentry_init() is called while a batcher thread is mid-flush,
a deadlock occurs: the main thread holds g_options_lock and waits
for the batcher thread to join, while the batcher thread tries to
acquire g_options_lock via SENTRY_WITH_OPTIONS during flush.

Store the options pointer in the batcher at startup and use it
directly during flush instead of going through SENTRY_WITH_OPTIONS.
Add sentry__envelope_new_with_dsn() to create envelopes without
locking, and sentry__options_get_user_consent() for lock-free
consent checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the raw options pointer with individual fields (dsn,
transport, run, user_consent) to avoid unsynchronized access to
options members from the batcher thread. The dsn is incref'd,
and user_consent is a pointer to the atomic field (NULL when
consent is not required).

Revert sentry__options_get_user_consent since it is no longer
needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DSN pointer was extracted under g_options_lock but used after
the lock was released, racing with sentry_close freeing it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jpnurmi jpnurmi changed the title WIP: fix(batcher): fix deadlock on reinit fix(batcher): fix deadlock on reinit Feb 13, 2026
@jpnurmi jpnurmi marked this pull request as ready for review February 13, 2026 14:18
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

jpnurmi and others added 2 commits February 13, 2026 15:29
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mujacica
Copy link
Contributor

Nice catch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hang while shutting down logs system

3 participants