Merge durability actors to reduce per-tx scheduler handoffs#4767

Open
joshua-spacetime wants to merge 3 commits into master from joshua/merge-durability-actors-v2

Conversation

@joshua-spacetime
Collaborator

@joshua-spacetime joshua-spacetime commented Apr 9, 2026

Description of Changes

Before this change, local durability had two workers. The first worker reordered slightly out-of-order submissions by TxOffset and materialized the commitlog payload, then forwarded the payload to the second worker, which wrote it to the commitlog.

The problem was that the first worker did so little work that it was mostly parked, so every transaction paid the cost of waking it up. For very high-throughput workloads, this cost was significant and sat on the critical path: the database thread.

This patch merges the two workers so that the database thread feeds a mostly non-idle task.

API and ABI breaking changes

N/A

Expected complexity level and risk

I'll call it a 3 mainly because it touches the durability layer in some capacity, although overall it's a simplification since it removes the transaction reordering logic that previously existed to paper over a race condition that @kim had originally fixed in #4661.

Testing

Pure refactor

@joshua-spacetime joshua-spacetime requested a review from kim April 9, 2026 07:17
Contributor

@kim kim left a comment


The additional worker (under core) was introduced during work on the keynote benchmark, where it was found that moving the conversion TxData -> Transaction<T> off the database thread helped with throughput. So you're saying that this was biased by observing only the benchmark workload? Why is it an issue for single replica databases, but not replicated ones?

I also don't understand how the core worker can be on the database thread -- are you referring to just the overhead of waker.wake()? Shouldn't the receiver not be parked most of the time when there's high throughput?

In any case, I think moving the reordering further away from relational db (whose locking behavior it works around) is not a good idea for maintainability reasons. How about we do it in request_durability directly? It shouldn't be too much overhead, given that most transactions will be in the correct order anyways.

As for the TxData -> Transaction conversion, I think we want to just change the append_tx signature to something like this:

fn append_tx(&self, tx: impl Into<Transaction<Self::TxData>>);

@joshua-spacetime
Collaborator Author

joshua-spacetime commented Apr 9, 2026

@kim

So you're saying that this was biased by observing only the benchmark workload?

No, I'm saying that it just replaced that overhead with different overhead.

Why is it an issue for single replica databases, but not replicated ones?

It is an issue for replicated ones. I just haven't moved the required logic over into the replicated durability worker yet. I wanted to flesh out all the details using local durability first.

are you referring to just the overhead of waker.wake()? Shouldn't the receiver not be parked most of the time when there's high throughput?

Yes, I'm referring to the wake. But the database thread is the one doing the waking, which means that overhead is on the critical path. More accurately, the cost of going from TxData -> Transaction<T> is so low that an extremely high throughput was needed to keep the worker busy; otherwise it just stayed parked on the recv most of the time. The keynote benchmark does not actually have sufficient throughput to keep it busy, so while it no longer performed the conversion on the critical path, it paid the wake() cost on every transaction.

I think moving the reordering further away from relational db (whose locking behavior it works around) is not a good idea for maintainability reasons.

Can we avoid reordering altogether? As long as we release the lock only after we have sent the payload to the durability worker, it should be impossible to get out of order transactions, no?

@kim
Contributor

kim commented Apr 9, 2026

Can we avoid reordering altogether? As long as we release the lock only after we have sent the payload to the durability worker, it should be impossible to get out of order transactions, no?

The first version of the patch that introduced the reordering did that, but it involved passing a callback down to the datastore so that we can send to the durability channel before releasing or downgrading the lock. It was deemed to have an unwanted performance penalty, but we can benchmark it to judge if it’s worthwhile.

@kim
Contributor

kim commented Apr 9, 2026

Said first version is #4661

@joshua-spacetime force-pushed the joshua/merge-durability-actors-v2 branch 3 times, most recently from 8ab3811 to ee41165 on April 10, 2026 05:27
@joshua-spacetime
Collaborator Author

@kim I have resurrected your original fix from #4661.

It was deemed to have an unwanted performance penalty, but we can benchmark it to judge if it’s worthwhile.

I have benchmarked it using one of our higher throughput benchmarks and unsurprisingly it makes no difference since the next (reducer) transaction cannot run until we have handed the payload off to the durability layer, regardless of when we release the tx lock.

@joshua-spacetime joshua-spacetime requested a review from kim April 10, 2026 05:39
@kim
Contributor

kim commented Apr 10, 2026

I have resurrected your original fix from #4661.

I have benchmarked it using one of our higher throughput benchmarks and unsurprisingly it makes no difference.

Alright, let's go with this, then.

Do you think it is still worthwhile to move the conversion function off the database thread?

@joshua-spacetime
Collaborator Author

joshua-spacetime commented Apr 10, 2026

Do you think it is still worthwhile to move the conversion function off the database thread?

I could go either way. It's probably fine in most circumstances. It looks like there's a sort by table_id in there and an allocation for the reducer args. Probably not a big deal.

@kim
Contributor

kim commented Apr 10, 2026

Probably not a big deal.

I guess @Centril might disagree ;)

I’d be fine with a concrete type containing (Arc<TxData>, Option<ReducerArgs>) that can be into()'d into Transaction in the trait impl, but I think that might create a circular crate dependency. So maybe impl Into<Transaction> as proposed above is the way to go.

@Centril
Contributor

Centril commented Apr 10, 2026

I guess @Centril might disagree ;)

Could you summarize what I'd disagree with? 😅

@joshua-spacetime force-pushed the joshua/merge-durability-actors-v2 branch from e0c6141 to 5322bf0 on April 10, 2026 22:42
@joshua-spacetime
Collaborator Author

@kim how about this (it simplifies things quite a bit):

pub type PreparedTx<T> = Box<dyn IntoTransaction<T>>;

pub trait Durability: Send + Sync {
    type TxData;

    fn append_tx(&self, tx: PreparedTx<Self::TxData>);
}

@joshua-spacetime force-pushed the joshua/merge-durability-actors-v2 branch from 5322bf0 to a51854f on April 11, 2026 05:32
@joshua-spacetime force-pushed the joshua/merge-durability-actors-v2 branch from a51854f to 04b1a24 on April 11, 2026 05:34
@joshua-spacetime changed the base branch from joshua/ws/v3 to master on April 11, 2026 05:35