
Avoid errors from asynchronous (non-c++) clients#240

Merged
ryanofsky merged 3 commits into bitcoin-core:master from ryanofsky:pr/promise
Feb 20, 2026

Conversation

@ryanofsky
Collaborator

The PR avoids errors from non-C++ rust & python clients when they make asynchronous requests (bitcoin/bitcoin#33923) or disconnect uncleanly (bitcoin/bitcoin#34250). Neither of these errors is easy to trigger from libmultiprocess C++ clients, because of the library's blocking interface and RAII semantics, but both are fairly easy to trigger from rust and python, and there is a test triggering both of them in bitcoin/bitcoin#34284

@DrahtBot

DrahtBot commented Jan 21, 2026

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type         Reviewers
ACK          Sjors
Concept ACK  ismaelsadeeq

If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #218 (Better error and log messages by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

  • call.) -> call. [Extra stray parentheses make the sentence fragment confusing; remove the trailing ".)" to read correctly.]
  • the the -> the [Duplicate word "the" hinders comprehension; remove the extra "the".]

2026-02-20 16:59:02

@ryanofsky
Collaborator Author

Updated ba865d9 -> 92a2c5f (pr/promise.1 -> pr/promise.2, compare) fixing various CI errors (IWYU, compiler errors): https://github.com/bitcoin-core/libmultiprocess/actions/runs/21196838571

refactor: Add ProxyServer<Thread>::post() method

The change moves code responsible for posting work to the execution thread and
notifying the event loop thread after execution is finished, out of
type-context.h and into a new ProxyServer<Thread>::post() method.

This commit does not change behavior other than changing log messages, but
improvements are made in upcoming commits which extend the new method.
Allow simultaneous calls on same Context.thread

If multiple IPC requests happen at the same time specifying the same
Context.thread to run the requests on, queue the requests to execute in the
order they are received instead of raising a "thread busy" exception.

This change has no effect on C++ clients using libmultiprocess as a client
library, since the libmultiprocess client only makes blocking calls and creates
a server thread for every client thread, so it's not possible for there to be
multiple calls on the same server thread.

But this change may be useful for rust and python clients, as discussed in
bitcoin/bitcoin#33923
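
A minimal sketch of the queueing idea (hypothetical, simplified names; the real method also posts fn to a worker thread and routes results back through the event loop):

```cpp
#include <kj/async.h>
#include <kj/function.h>

struct ThreadQueue {
    // Resolves when the previously queued request has started; new requests
    // chain onto it instead of raising a "thread busy" exception.
    kj::Promise<void> m_thread_ready = kj::READY_NOW;

    kj::Promise<void> post(kj::Function<void()> fn)
    {
        auto ready = kj::newPromiseAndFulfiller<void>();
        auto result = m_thread_ready.then(
            [fn = kj::mv(fn), fulfiller = kj::mv(ready.fulfiller)]() mutable {
                fulfiller->fulfill(); // unblock the next queued request first
                fn();                 // then execute this request
            });
        m_thread_ready = kj::mv(ready.promise); // later calls wait on this one
        return result;
    }
};
```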
Prevent crash on unclean disconnect if abandoned IPC call returns interface pointer

This is a fix for bitcoin/bitcoin#34250, which reports that the bitcoin node
crashes if a rust stratumv2 mining client calls BlockTemplate.waitNext() and
disconnects without waiting for a response from the call, and if mempool fees
increase so the call returns a non-null BlockTemplate interface pointer.

The node would crash in this case while trying to call MakeProxyServer on the
returned BlockTemplate pointer, which would fail because MakeProxyServer would
try to use a reference to the Connection object that had been deleted as a
result of the disconnect.

The fix works by:

- Adding a Connection::m_canceler member variable and using it to cancel any
  IPC response promises that are pending when the connection is destroyed.

- Updating ProxyServer<Thread>::post to use promise.attach() as described at
  https://capnproto.org/cxxrpc.html#cancellation to detect cancellation and
  set a ServerContext::request_canceled variable.

- Updating ServerCall to check the ServerContext::request_canceled status after
  any C++ server method returns, and throw an exception if it is set.

- Updating type-context.h PassField() function to deal with the exception
  by catching and logging it.
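
For illustration, a minimal sketch of the promise.attach() pattern from the Cap'n Proto docs; the shared_ptr flag here is a simplified stand-in for the ServerContext::request_canceled plumbing, not the actual fix:

```cpp
#include <kj/async.h>
#include <kj/common.h>
#include <memory>

kj::Promise<void> withCancelDetection(kj::Promise<void> work,
                                      std::shared_ptr<bool> request_canceled)
{
    // kj::defer runs its lambda when the attached object is destroyed. If the
    // client disconnects, Cap'n Proto drops the outstanding promise, which
    // destroys the attachment and records the cancellation so later result
    // handling can check the flag and bail out.
    return work.attach(kj::defer([request_canceled] { *request_canceled = true; }));
}
```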
@ismaelsadeeq
Member

Concept ACK

@ryanofsky
Collaborator Author

Updated 92a2c5f -> 1f307fb (pr/promise.2 -> pr/promise.3, compare) fixing a TSAN error caused by deleting the ready_fulfiller promise fulfiller outside the event loop thread: https://github.com/bitcoin-core/libmultiprocess/actions/runs/21233968144/job/61097851891?pr=240

@ryanofsky
Collaborator Author

Have been debugging a TSAN race condition (https://github.com/bitcoin/bitcoin/actions/runs/21421194207/job/61680781122?pr=34422#step:11:4303) that the functional test in bitcoin/bitcoin#34422 detects in the new code; it is not triggered here because this PR does not currently include a test that triggers IPC request cancellation.

The race seems real and fixable, but also probably unlikely to cause problems in practice. In the control flow when an IPC call gets cancelled, TSAN sees the worker thread doing a CallContext::getParams call before the cancellation, triggering a read of the RpcCallContext::request pointer. Then after the cancellation occurs, the I/O thread wakes up and destroys the CallContext object, triggering a write of the RpcCallContext::request pointer. This is not actually dangerous because the read always happens before the write, but TSAN is unable to detect any "happens-before" relationship between the read and write events, so it considers this to be a race.

I think this should be fixable by having the worker thread acquire a lock in the brief period when it reads request parameters, before executing the IPC call, and also acquire the lock after executing the call when it is writing request results. Then the cancellation callback, if triggered, can acquire the same lock, so TSAN can be aware of the ordering between the two threads.

@Sjors
Member

Sjors commented Feb 10, 2026

should be fixable by having the worker thread acquire a lock in the brief period ...

@ryanofsky is that something you want to add here?

@ryanofsky
Collaborator Author

re: #240 (comment)

should be fixable by having the worker thread acquire a lock in the brief period ...

@ryanofsky is that something you want to add here?

Yes thanks, added a new commit which does this and fixes the TSAN error locally for me. Will update bitcoin/bitcoin#34422 and confirm it fixes the failure there too.


Rebased 1f307fb -> 1d7debf (pr/promise.3 -> pr/promise.4, compare) adding a result_value.reset() call to try to fix CI error https://github.com/bitcoin/bitcoin/actions/runs/21412348994/job/61652239914?pr=34422

Updated 1d7debf -> 4f42fc4 (pr/promise.4 -> pr/promise.5, compare) to fix CI error https://github.com/bitcoin/bitcoin/actions/runs/21421194207/job/61680781122?pr=34422

@theuni
Contributor

theuni commented Feb 12, 2026

@ryanofsky I'm having a bit of trouble understanding what's happening on which thread, but it seems like a mutex is not the most straightforward solution here? It's not immediately obvious to me what it's guarding, anyway. Perhaps a simple std::atomic_flag would be better to communicate "I've reached this point"? You could set/notify when the params are safe to delete, then in m_on_cancel you could just wait for it?

@ryanofsky
Collaborator Author

ryanofsky commented Feb 13, 2026

@ryanofsky I'm having a bit of trouble understanding what's happening on which thread, but it seems like a mutex is not the most straightforward solution here? It's not immediately obvious to me what it's guarding, anyway. Perhaps a simple std::atomic_flag would be better to communicate "I've reached this point"? You could set/notify when the params are safe to delete, then in m_on_cancel you could just wait for it?

Thanks for looking at this! I assume you are asking about the last commit 4f42fc4, which fixes the race condition detected by TSAN.

The issue being fixed in that commit is a race between two threads both potentially accessing capnproto request parameters in an incoming request.

  • One thread is a worker thread running the big invoke lambda, which is responsible for converting the capnproto request parameters into C++ arguments, calling a C++ method with those arguments, and then converting the C++ method's return value & output arguments (or exception) into a capnproto response.
  • The other thread is the capnproto event loop thread running the on_cancel lambda when there is a disconnect or cancellation. After the on_cancel lambda returns, capnproto deletes the request parameters.

The point of the mutex is to prevent the event loop thread from deleting the request parameters while the worker thread is still reading them.

The control flow is a little complicated because of the way parameters are unpacked recursively, but the idea of guarding access to the parameters with a mutex is pretty simple. The params_mutex is locked by the worker thread before it accesses the capnproto parameters, and unlocked by that thread after all the parameters have been unpacked and it is ready to call the C++ method that is supposed to be executed. Then on the event loop thread inside the on_cancel lambda, the same mutex is locked before it returns, so if there is a cancellation while the worker thread is unpacking parameters, the event loop thread is blocked and does not delete the parameters until after the worker thread has finished reading them.

What is odd about this fix is that the event loop thread is releasing the mutex before it actually deletes the parameters. This is suspicious, because if the mutex is not locked while they are deleted it does nothing to solve the reverse race condition in the case where the cancellation happens before the parameters start being read. But this case is handled by a separate cancel_monitor.m_cancelled check which raises an exception to prevent the worker thread from reading the parameters.

So the mutex should solve the delete-while-reading race condition in a fairly straightforward way, even if the control flow is complicated by a lot of recursion and callbacks, and even if it doesn't solve the related delete-before-reading race condition.
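
To make the protocol concrete, here is a minimal standalone sketch; the function names are stand-ins, not the actual libmultiprocess code, which spreads this logic across the invoke and on_cancel lambdas:

```cpp
#include <mutex>

std::mutex params_mutex;

void read_and_unpack_params() {} // stand-in: convert capnp params to C++ args
void call_cpp_method() {}        // stand-in: the possibly long-running method

// Worker thread: hold the mutex only for the brief parameter-reading window.
void invoke()
{
    std::unique_lock<std::mutex> params_lock{params_mutex};
    read_and_unpack_params();
    params_lock.unlock(); // release before the potentially slow method call
    call_cpp_method();
}

// Event loop thread: taking the same mutex blocks cancellation until the
// worker has finished reading, so the parameters are never deleted mid-read.
// It also gives TSAN the happens-before edge it could not otherwise see.
void on_cancel()
{
    std::unique_lock<std::mutex> params_lock{params_mutex};
    // after this returns, Cap'n Proto deletes the request parameters
}
```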

Tangent: One observation looking at the code is that while the delete-before-reading reverse case is being handled, it isn't being handled very well, because the exception raised is thrown inside the sync block instead of outside of it, and it should probably be an InterruptException instead of an MP_LOG call so it can be caught higher in the stack and the thread can be cleaned up. I'm guessing right now that code will cause the worker thread to hang and be leaked, never getting cleaned up. Ideally I could write a unit test trying to trigger this, but so far I've been struggling to figure out how to test different possible races and make sure they are handled well. It's not obvious how to arrange for the request to be cancelled before parameters are read. Anyway, this is only indirectly related to the TSAN race addressed in the commit, which is also unlikely to happen in practice and probably hasn't happened, even though TSAN was right to warn that the request parameters were previously being accessed from different threads without synchronization between them. (EDIT: This behavior is improved in the latest push pr/promise.6, so if a cancellation does happen before parameters are read it will now be handled cleanly and just result in an info message being logged.)

@theuni
Contributor

theuni commented Feb 13, 2026

@ryanofsky Thanks for the explanation, that was very helpful.

I'm still struggling to understand how this mutex is working though. It seems that since it's already been unlocked in invoke(), the mutex will already be unlocked on the worker thread when params_lock goes out of scope, which would cause a double unlock (undefined), no?

@Sjors
Member

Sjors commented Feb 13, 2026

I created a branch and had an agent splinter 558ae11 (refactor: Add ProxyServer::post() method) into multiple tiny commits: sjors/2026/02/558ae11-split

No need to use it here, but it might aid others in reviewing (with dimmed-zebra). Noticed a small behavior change in that it removes MP_LOG for thread busy. It's harmless since 92fed35 drops this exception anyway.

Also drew a little diagram to orient myself about where this function lives (IIUC):

[Screenshot 2026-02-13: diagram of where this function lives]

Review placeholder: the first two commits, up to 92fed35, mostly make sense to me now.

@theuni
Contributor

theuni commented Feb 13, 2026

I'm still struggling to understand how this mutex is working though. It seems that since it's already been unlocked in invoke(), the mutex will already be unlocked on the worker thread when params_lock goes out of scope, which would cause a double unlock (undefined), no?

Never mind, I see my mistake now. I was thinking it was the bare mutex that was being unlocked in invoke(), but it's the lock itself. Seems sane I suppose, just not exactly straightforward.
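
For reference, a tiny standalone example of why the early unlock is safe: std::unique_lock tracks ownership, so its destructor only unlocks a mutex it still owns.

```cpp
#include <mutex>

int main()
{
    std::mutex m;
    std::unique_lock<std::mutex> lock{m};
    lock.unlock(); // releases the mutex and clears the lock's ownership flag
    return 0;
}                  // destructor sees owns_lock() == false and does nothing
```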

auto result = kj::newPromiseAndFulfiller<T>(); // Signaled when fn() is called, with its return value.
bool posted = m_thread_context.waiter->post([this, fn = std::forward<Fn>(fn), result_fulfiller = kj::mv(result.fulfiller)]() mutable {
bool posted = m_thread_context.waiter->post([this, fn = std::forward<Fn>(fn), ready_fulfiller = kj::mv(ready_fulfiller), result_fulfiller = kj::mv(result.fulfiller)]() mutable {
m_loop->sync([ready_fulfiller = kj::mv(ready_fulfiller)]() mutable {
Member

In 92fed35 Allow simultaneous calls on same Context.thread: maybe add a comment that the goal is to add requested fn() calls to the promise chain as quickly as possible, so we call ready_fulfiller->fulfill(); before executing it.

Collaborator Author

re: #240 (comment)

maybe add a comment that the goal is to add requested fn() calls to the promise chain as quickly as possible, so we call ready_fulfiller->fulfill(); before executing it.

Thanks, added comment

Member

@Sjors Sjors left a comment

The third commit looks good to me as well.

// https://github.com/bitcoin/bitcoin/issues/34250 and there would be a
// crash if execution continued.
// TODO: Note this detection is racy because cancellation could happen
// after this check. However, fixing this would require changing the
Member

In 0a22425 Prevent crash on unclean disconnect if abandoned IPC call returns interface pointer: for waitNext() the time spent in this race-vulnerable state is minuscule compared to the multi-second wait?

In that case, let's not deal with it in this PR.

Collaborator Author

re: #240 (comment)

for waitNext() the time spent in this race-vulnerable state is minuscule compared to the multi second wait?

In that case, let's not deal with it in this PR.

Yes, this race, like the TSAN race fixed in 4f42fc4, would be pretty difficult to trigger unintentionally (or even intentionally). But it turns out the lock added in 4f42fc4 can be used to fix this race as well, so I extended it to cover both cases (the event loop deleting params on cancellation while they might still be being read, and the event loop deleting the response on cancellation while it might still be being written).

// client object.
server.m_context.loop->sync([&] {
auto self_dispose{kj::mv(self)};
if (erase_thread) {
Member

@Sjors Sjors Feb 17, 2026

In 0a22425 Prevent crash on unclean disconnect if abandoned IPC call returns interface pointer: can you add a comment here to explain why it's important to call sync() without doing anything in particular (where !erase_thread)?

IIUC it's to ensure InterruptException gets thrown as needed, so we can catch it below.

No test breaks (for this commit) if I do:

diff --git a/include/mp/type-context.h b/include/mp/type-context.h
index 134542e..440a3ef 100644
--- a/include/mp/type-context.h
+++ b/include/mp/type-context.h
@@ -113,5 +113,5 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
                     // makes another IPC call), so avoid modifying the map.
                     const bool erase_thread{inserted};
-                    KJ_DEFER(
+                    KJ_DEFER(if (erase_thread) {
                         // Erase the request_threads entry on the event loop
                         // thread with loop->sync(), so if the connection is
@@ -121,5 +121,4 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
                         server.m_context.loop->sync([&] {
                             auto self_dispose{kj::mv(self)};
-                            if (erase_thread) {
                             // Look up the thread again without using existing
                             // iterator since entry may no longer be there after
@@ -133,7 +132,6 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
                                 removed = request_threads.extract(server.m_context.connection);
                             }
-                            }
                         });
-                    );
+                    });
                     KJ_IF_MAYBE(exception, kj::runCatchingExceptions([&]{
                         try {

(but I guess that makes sense because the commit doesn't touch tests in general)

Collaborator Author

re: #240 (comment)

No test breaks (for this commit) if I do:

Interesting find, and I was surprised that diff works, but I think it only works because erase_thread is normally true. The only time erase_thread should be false is when a client calls a server method, and the server method calls back into the client (for example through a std::function parameter), and then the client makes another call back into the server on the same thread. This doesn't happen in any mining code, but IIRC there are some GUI interactions in bitcoin/bitcoin#10102 where it does.

The reason it's important to call loop->sync unconditionally here is to run the auto self_dispose{kj::mv(self)}; line, so the proxy server object is freed on the event loop thread rather than the worker thread, which would not be safe because the proxy server object is managed by capnproto. I added a comment about the self variable to clarify.

Member

@ismaelsadeeq ismaelsadeeq left a comment

Reviewed the first two commits, up to 92fed35.

Thanks @Sjors for your split of the first commit and the diagram; it was sweet to use those in the review.

kj::Promise<T> ProxyServer<Thread>::post(Fn&& fn)
{
{
auto ready = kj::newPromiseAndFulfiller<void>(); // Signaled when waiter is idle again.
Member

@ismaelsadeeq ismaelsadeeq Feb 17, 2026

In "Allow simultaneous calls on same Context.thread" 92fed35

Before this commit, ProxyServer<Thread>::post() threw immediately when the thread was busy.

After 92fed35, post() queues the call by chaining onto m_thread_ready instead of throwing, which enables pipelining but removes the only backpressure mechanism. There is now no bound on chain depth. Every call from a fast producer appends a new node to the chain, each holding a captured lambda with serialized arguments plus two kj::Promise objects; the chain grows without limit.

I'm curious whether it's something we should worry about in practice as a potential for OOM of the server, either by bug or deliberately?

I modified the existing test to simulate a slow server (10s sleep per request) with a large number of queued calls:

 setup.server->m_impl->m_int_fn = [&](int n) {
+    std::this_thread::sleep_for(std::chrono::seconds{10});
     assert(n == expected);
     expected += 100;
     return n;
 };

-std::atomic<size_t> running{3};
+std::atomic<size_t> running{1000001};

The mptest process consumed 17 GB RSS in under 30 seconds.

With while(true), the mptest was killed by OOM:

Out of memory: Killed process 2461509 (mptest)
total-vm:53,439,684kB anon-rss:53,617,500kB

Worth noting that in the mptest process the client and server share the same address space, so the 17 GB figure includes client-side memory (unsent/pending capnp requests) as well as server-side promise chain growth?

A potential naive solution, if this is deemed worth fixing, is a depth counter with a limit: increment before queuing, throw if the limit is exceeded, and decrement after the result is dispatched.

+    uint64_t m_maximum_promise_depth{1000};
+    uint64_t m_pending_tasks{0};

 template<typename T, typename Fn>
 kj::Promise<T> ProxyServer<Thread>::post(Fn&& fn)
 {
+    if (m_pending_tasks >= m_maximum_promise_depth) {
+        throw std::runtime_error("maximum promise depth reached");
+    }
+    m_pending_tasks += 1;
+
     auto ready = kj::newPromiseAndFulfiller<void>();
     auto ret = m_thread_ready.then([this, fn = std::forward<Fn>(fn),
                                     ready_fulfiller = kj::mv(ready.fulfiller)]() mutable {
         ...
-        m_loop->sync([&result_value, &exception, result_fulfiller = kj::mv(result_fulfiller)]() mutable {
+        m_loop->sync([this, &result_value, &exception, result_fulfiller = kj::mv(result_fulfiller)]() mutable {
             KJ_IF_MAYBE(e, exception) {
                 result_fulfiller->reject(kj::mv(*e));
             } else {
                 result_fulfiller->fulfill(kj::mv(*result_value));
                 result_value.reset();
             }
+            m_pending_tasks--;
             result_fulfiller = nullptr;
         });
     });

With the fix applied, running mptest with a modified running=1001 and a 10s server sleep holds RSS at a stable 17 MB and throws cleanly at depth 1000:

remote exception: std::exception: maximum promise depth reached
VmRSS: 17,484 kB

At running=1000001 with the fix, RSS reached ~9.5 GB from client memory alone.

This implies that in a real deployment where client and server are separate processes, the server would be protected by this fix, but the client could still exhaust its own memory independently. This fix addresses the server side only.

Maybe there is even a more interesting approach with kj?

Collaborator Author

re: #240 (comment)

removes the only backpressure mechanism

Thankfully, this shouldn't be the case. Cap'n Proto's API is nonblocking, so there is nothing the server can do to directly slow down the rate of client requests. The way flow control is implemented is by having the server send a response to each request after it is processed, and as long as the client waits for these responses, it can avoid sending requests faster than the server processes them, and can limit the number of requests that are queued up.
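
A hypothetical client-side sketch of that flow control (sendRequest is a stand-in for firing one capnp request, not a real libmultiprocess API): cap the number of in-flight requests by waiting for the oldest response before sending more.

```cpp
#include <kj/async.h>
#include <kj/function.h>
#include <cstddef>
#include <vector>

void pacedRequests(kj::Function<kj::Promise<void>()> sendRequest,
                   kj::WaitScope& waitScope, int total)
{
    constexpr std::size_t kMaxInFlight = 16;
    std::vector<kj::Promise<void>> in_flight;
    for (int i = 0; i < total; ++i) {
        if (in_flight.size() >= kMaxInFlight) {
            in_flight.front().wait(waitScope); // wait for the oldest response
            in_flight.erase(in_flight.begin());
        }
        in_flight.push_back(sendRequest());
    }
    for (auto& p : in_flight) p.wait(waitScope); // drain remaining responses
}
```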

I'm curious whether it's something we should worry about in practice as a potential for OOM of the server, either by bug or deliberate?

I do think it's something we may want to worry about, and we may want to add counters and hard caps on things like number of objects, connections, threads, and pending requests in the future.

I think your m_pending_tasks idea is basically the right approach, although we would probably want the counter to be more global, like one counter per UNIX socket path instead of one counter per thread object. This way badly behaved clients would not be able to circumvent the limit by creating many connections or many threads.

Initially in bitcoin/bitcoin#33923, I resisted the idea that clients should be able to send arbitrary numbers of requests and have them queued up on the server. But I changed my mind after realizing this is already how Cap'n Proto handles requests that run on the event loop thread, so this would just be making requests that run on worker threads behave consistently. Also, this behavior is potentially useful if clients need multiple pieces of data from the server, so they can just send all the requests at once instead of staggering them.

~ProxyServer();
kj::Promise<void> getName(GetNameContext context) override;

//! Run a callback function returning T on this thread.
Member

In "refactor: Add ProxyServer::post() method" 558ae11

nit: clarify that we return a T promise?

Collaborator Author

re: #240 (comment)

nit: clarify that we return a T promise?

Thanks, expanded the comment to describe the behavior and what's returned.


ProxyServer<Thread>::ProxyServer(ThreadContext& thread_context, std::thread&& thread)
: m_thread_context(thread_context), m_thread(std::move(thread))
ProxyServer<Thread>::ProxyServer(Connection& connection, ThreadContext& thread_context, std::thread&& thread)
Member

In "refactor: Add ProxyServer::post() method" 558ae11

Are we passing the whole connection here for future proofing, i.e. we need more things apart from m_loop later?

Collaborator Author

re: #240 (comment)

Are we passing the whole connection here for future profing, i.e we need more things apart from m_loop later?

That's one reason, and the other reason is that ProxyServer<Thread> is a specialization of the generic ProxyServer struct, which uses a connection pointer to be able to run cleanup code when the connection is destroyed, and I thought it would be good if the different ProxyServer specializations were more similar. The choice is admittedly pretty arbitrary, though.

Collaborator Author

@ryanofsky ryanofsky left a comment

Thanks for the detailed reviews!

Updated 4f42fc4 -> c436ae4 (pr/promise.5 -> pr/promise.6, compare) making suggested changes, and merging the 3rd and 4th commits and improving them. Instead of just handling the cancellation-while-reading-params race case detected by TSAN, the code now handles the cancellation-while-writing-response race case which was previously a TODO, and better handles the cancellation-before-reading-params race case, which was previously handled, but not very well, resulting in an uncaught exception. All of these cases would be very unlikely to happen, since params and response structs are only accessed very briefly before and after method execution, so conflicting accesses by the event loop thread would be unlikely. But they are good to handle. Other than these, the remaining changes were comment, logging, and naming improvements.


@ryanofsky ryanofsky force-pushed the pr/promise branch 2 times, most recently from 55e9eac to 186a6d3 Compare February 18, 2026 14:54
@ryanofsky ryanofsky marked this pull request as draft February 18, 2026 16:36
@ryanofsky ryanofsky marked this pull request as ready for review February 19, 2026 12:15
@ryanofsky
Collaborator Author

ryanofsky commented Feb 19, 2026

Updated 186a6d3 -> 4324218 (pr/promise.10 -> pr/promise.11, compare) to avoid a memory leak detected by ASAN https://github.com/bitcoin/bitcoin/actions/runs/22145238258/job/64020426758?pr=34422 and caused by the clang bug described in bitcoin/bitcoin#34422 (comment). This should be ready for review again with the fix.

Updated 4324218 -> 337fa9c (pr/promise.11 -> pr/promise.12, compare) fixing an uninitialized variable pointed out in #240 (comment)

@Sjors
Member

Sjors commented Feb 19, 2026

The new CallThen helper looks good to me, but I'll wait for the sanitizer issue to be resolved: bitcoin/bitcoin#34422 (comment)

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Feb 19, 2026
@ryanofsky
Collaborator Author

The new CallThen helper looks good to me, but I'll wait for the sanitizer issue to be resolved: bitcoin/bitcoin#34422 (comment)

Should be resolved now! (With help from Marco identifying the issue and probably saving me a lot of time debugging.)

@Sjors
Member

Sjors commented Feb 20, 2026

ACK 337fa9c

@ryanofsky
Collaborator Author

ryanofsky commented Feb 20, 2026

Updated 337fa9c -> ec788f1 (pr/promise.12 -> pr/promise.13, compare) just making small tweaks: renames, improved comments, updated commit messages, and improved exception logging

Updated ec788f1 -> 0174450 (pr/promise.13 -> pr/promise.14, compare) to fix IWYU errors https://github.com/bitcoin-core/libmultiprocess/actions/runs/22231928222/job/64313628070?pr=240

@Sjors
Member

Sjors commented Feb 20, 2026

utACK 0174450 if CI is happy. The additional "uncaught exception ([actual message])" seems useful.

ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Feb 20, 2026
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Feb 20, 2026
@ryanofsky
Collaborator Author

Planning to merge this soon to unblock bitcoin/bitcoin#34422 and #218, but would still welcome any more review and happy to make followup changes!

@ryanofsky ryanofsky merged commit 290702c into bitcoin-core:master Feb 20, 2026
10 checks passed
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Feb 20, 2026
…74ce

290702c74ce Merge bitcoin-core/libmultiprocess#240: Avoid errors from asynchronous (non-c++) clients
3a69d4755af Merge bitcoin-core/libmultiprocess#241: doc: Bump version number v7 -> v8
0174450ca2e Prevent crash on unclean disconnect if abandoned IPC call returns interface pointer
ddb5f74196f Allow simultaneous calls on same Context.thread
c4762c7b513 refactor: Add ProxyServer<Thread>::post() method
0ade1b40ac5 doc: Bump version number

git-subtree-dir: src/ipc/libmultiprocess
git-subtree-split: 290702c74cebde683aad5ca9c587e5bdb6873f66
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Feb 20, 2026
Upstream PR bitcoin-core/libmultiprocess#240 fixed various issues which require
updates to python IPC tests. Those changes are made in this commit.
Sjors added a commit to Sjors/bitcoin that referenced this pull request Feb 21, 2026
413f915979 ipc: Serialize null CTransactionRef as empty Data
586fa578be refactor: add missing includes to mp/type-data.h
290702c74c Merge bitcoin-core/libmultiprocess#240: Avoid errors from asynchronous (non-c++) clients
3a69d4755a Merge bitcoin-core/libmultiprocess#241: doc: Bump version number v7 -> v8
0174450ca2 Prevent crash on unclean disconnect if abandoned IPC call returns interface pointer
ddb5f74196 Allow simultaneous calls on same Context.thread
c4762c7b51 refactor: Add ProxyServer<Thread>::post() method
0ade1b40ac doc: Bump version number

git-subtree-dir: src/ipc/libmultiprocess
git-subtree-split: 413f91597932921176904492ca09673549750fd1
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Feb 21, 2026
Upstream PR bitcoin-core/libmultiprocess#240 fixed various issues which require
updates to python IPC tests. Those changes are made in this commit.
ryanofsky added a commit to ryanofsky/bitcoin that referenced this pull request Feb 23, 2026
Upstream PR bitcoin-core/libmultiprocess#240 fixed various issues which require
updates to python IPC tests. Those changes are made in this commit.
ryanofsky added a commit to ryanofsky/libmultiprocess that referenced this pull request Feb 24, 2026
As pointed out by janb84 in bitcoin/bitcoin#34422 (comment), it makes sense
for the on_cancel callback to lock cancel_mutex while it is assigning
request_canceled = true.

The lock and assignment were introduced in bitcoin-core#240, and in an earlier
version of that PR request_canceled was a std::atomic and the assignment
happened before the lock was acquired instead of after, so it was ok for the
lock to be unnamed and immediately released after being acquired.

But in the final version of bitcoin-core#240 request_canceled is an ordinary
non-atomic bool, and it should be assigned true with the lock held to prevent
a theoretical race condition where the Cap'n Proto event loop cancels the
request before the execution thread runs, and the execution thread sees the
old request_canceled = false value and then unsafely accesses deleted
parameters. A request being canceled that quickly, parameters being accessed
that slowly, and a stale request_canceled value being read even after the
execution thread holds the cancel_mutex lock should be very unlikely in
practice, but it could happen in theory and is good to fix.
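
A sketch of the ordering the commit message describes (names from the commit message; everything around them simplified):

```cpp
#include <mutex>

std::mutex cancel_mutex;
bool request_canceled = false; // ordinary bool, so all access must be locked

void on_cancel() // event loop thread
{
    const std::lock_guard<std::mutex> lock{cancel_mutex};
    request_canceled = true; // now assigned with the lock held
}

bool check_canceled() // execution thread, before touching the parameters
{
    const std::lock_guard<std::mutex> lock{cancel_mutex};
    return request_canceled;
}
```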
ryanofsky added a commit to ryanofsky/libmultiprocess that referenced this pull request Feb 24, 2026
