Merge durability actors to reduce per-tx scheduler handoffs#4767

Open
joshua-spacetime wants to merge 3 commits into master from joshua/merge-durability-actors-v2

Conversation

@joshua-spacetime
Collaborator

@joshua-spacetime joshua-spacetime commented Apr 9, 2026

Description of Changes

Before this change, local durability had two workers. The first worker reordered slightly out-of-order submissions by TxOffset and materialized the commitlog payload, then forwarded the payload to the second worker, which wrote it to the commitlog.

The problem was that the first worker did so little work that it was mostly parked, so every transaction paid the cost of waking it up. For very high-throughput workloads, this cost was significant and sat on the critical path: the database thread.

This patch merges the two workers so that the database thread feeds a mostly non-idle task.

API and ABI breaking changes

N/A

Expected complexity level and risk

I'll call it a 3 mainly because it touches the durability layer in some capacity, although overall it's a simplification since it removes the transaction reordering logic that previously existed to paper over a race condition that @kim had originally fixed in #4661.

Testing

Pure refactor

@joshua-spacetime joshua-spacetime requested a review from kim April 9, 2026 07:17
Contributor

@kim kim left a comment


The additional worker (under core) was introduced during work on the keynote benchmark, where it was found that moving the conversion TxData -> Transaction<T> off the database thread helped with throughput. So you're saying that this was biased by observing only the benchmark workload? Why is it an issue for single replica databases, but not replicated ones?

I also don't understand how the core worker can be on the database thread -- are you referring to just the overhead of waker.wake()? Shouldn't the receiver not be parked most of the time when there's high throughput?

In any case, I think moving the reordering further away from relational db (whose locking behavior it works around) is not a good idea for maintainability reasons. How about we do it in request_durability directly? It shouldn't be too much overhead, given that most transactions will be in the correct order anyways.

As for the TxData -> Transaction conversion, I think we want to just change the append_tx signature to something like this:

fn append_tx(&self, tx: impl Into<Transaction<Self::TxData>>);

@joshua-spacetime
Collaborator Author

joshua-spacetime commented Apr 9, 2026

@kim

So you're saying that this was biased by observing only the benchmark workload?

No, I'm saying that it just replaced that overhead with different overhead.

Why is it an issue for single replica databases, but not replicated ones?

It is an issue for replicated ones. I just haven't moved the required logic over into the replicated durability worker yet. I wanted to flesh out all the details using local durability first.

are you referring to just the overhead of waker.wake()? Shouldn't the receiver not be parked most of the time when there's high throughput?

Yes, I'm referring to the wake. But the database thread is the one doing the waking, which means that overhead is on the critical path. More accurately, the cost of going from TxData -> Transaction<T> is so low that an extremely high throughput was needed to keep the worker busy; otherwise it just stayed parked on the recv most of the time. The keynote benchmark does not actually have sufficient throughput to keep it busy, so while it no longer performed the conversion on the critical path, it paid the wake() cost on every transaction.

I think moving the reordering further away from relational db (whose locking behavior it works around) is not a good idea for maintainability reasons.

Can we avoid reordering altogether? As long as we release the lock only after we have sent the payload to the durability worker, it should be impossible to get out of order transactions, no?

@kim
Contributor

kim commented Apr 9, 2026

Can we avoid reordering altogether? As long as we release the lock only after we have sent the payload to the durability worker, it should be impossible to get out of order transactions, no?

The first version of the patch that introduced the reordering did that, but it involved passing a callback down to the datastore so that we can send to the durability channel before releasing or downgrading the lock. It was deemed to have an unwanted performance penalty, but we can benchmark it to judge if it’s worthwhile.

@kim
Contributor

kim commented Apr 9, 2026

Said first version is #4661

@joshua-spacetime force-pushed the joshua/merge-durability-actors-v2 branch 3 times, most recently from 8ab3811 to ee41165 on April 10, 2026 05:27
@joshua-spacetime
Collaborator Author

@kim I have resurrected your original fix from #4661.

It was deemed to have an unwanted performance penalty, but we can benchmark it to judge if it’s worthwhile.

I have benchmarked it using one of our higher throughput benchmarks and unsurprisingly it makes no difference since the next (reducer) transaction cannot run until we have handed the payload off to the durability layer, regardless of when we release the tx lock.

@joshua-spacetime joshua-spacetime requested a review from kim April 10, 2026 05:39
@kim
Contributor

kim commented Apr 10, 2026

I have resurrected your original fix from #4661.

I have benchmarked it using one of our higher throughput benchmarks and unsurprisingly it makes no difference.

Alright, let's go with this, then.

Do you think it is still worthwhile to move the conversion function off the database thread?

@joshua-spacetime
Collaborator Author

joshua-spacetime commented Apr 10, 2026

Do you think it is still worthwhile to move the conversion function off the database thread?

I could go either way. It's probably fine in most circumstances. It looks like there's a sort by table_id in there and an allocation for the reducer args. Probably not a big deal.

@kim
Contributor

kim commented Apr 10, 2026

Probably not a big deal.

I guess @Centril might disagree ;)

I’d be fine with a concrete type containing (Arc<TxData>, Option<ReducerArgs>) that can be into()'d into Transaction in the trait impl, but I think that might create a circular crate dependency. So maybe impl Into<Transaction> as proposed above is the way to go.

@Centril
Contributor

Centril commented Apr 10, 2026

I guess @Centril might disagree ;)

Could you summarize what I'd disagree with? 😅

@joshua-spacetime force-pushed the joshua/merge-durability-actors-v2 branch from e0c6141 to 5322bf0 on April 10, 2026 22:42
@joshua-spacetime
Collaborator Author

@kim how about this (it simplifies things quite a bit):

pub type PreparedTx<T> = Box<dyn IntoTransaction<T>>;

pub trait Durability: Send + Sync {
    type TxData;

    fn append_tx(&self, tx: PreparedTx<Self::TxData>);
}

@joshua-spacetime force-pushed the joshua/merge-durability-actors-v2 branch from 5322bf0 to a51854f on April 11, 2026 05:32
@joshua-spacetime force-pushed the joshua/merge-durability-actors-v2 branch from a51854f to 04b1a24 on April 11, 2026 05:34
@joshua-spacetime changed the base branch from joshua/ws/v3 to master on April 11, 2026 05:35