Merge durability actors to reduce per-tx scheduler handoffs #4767
joshua-spacetime wants to merge 3 commits into master from
Conversation
kim left a comment
The additional worker (under core) was introduced during work on the keynote benchmark, where it was found that moving the conversion TxData -> Transaction<T> off the database thread helped with throughput. So you're saying that this was biased by observing only the benchmark workload? Why is it an issue for single replica databases, but not replicated ones?
I also don't understand how the core worker can be on the database thread -- are you referring to just the overhead of waker.wake()? Shouldn't the receiver be unparked most of the time when there's high throughput?
In any case, I think moving the reordering further away from relational db (whose locking behavior it works around) is not a good idea for maintainability reasons. How about we do it in request_durability directly? It shouldn't be too much overhead, given that most transactions will be in the correct order anyways.
As for the TxData -> Transaction conversion, I think we want to just change the append_tx signature to something like this:

```rust
fn append_tx(&self, tx: impl Into<Transaction<Self::TxData>>);
```
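For illustration, here is a minimal, self-contained sketch of that shape, with toy stand-ins (`Transaction`, `TxData`, and `Log` are invented here, not the real SpacetimeDB types). One consequence worth noting: `impl Into<..>` in a trait method desugars to a generic parameter, so a trait written this way is no longer usable as `dyn Durability`.

```rust
use std::cell::RefCell;

/// Toy stand-in for the real `Transaction<T>` payload.
struct Transaction(u64);

/// Toy stand-in for the datastore's raw `TxData`.
struct TxData(u64);

impl From<TxData> for Transaction {
    fn from(d: TxData) -> Self {
        Transaction(d.0)
    }
}

/// `impl Into<..>` in a trait method is sugar for a generic method;
/// call sites stay tidy, but the trait cannot be made into a trait object.
trait Durability {
    fn append_tx(&self, tx: impl Into<Transaction>);
}

/// Toy impl that defers the TxData -> Transaction conversion to the
/// durability side, rather than the caller's (database) thread.
struct Log(RefCell<Vec<u64>>);

impl Durability for Log {
    fn append_tx(&self, tx: impl Into<Transaction>) {
        self.0.borrow_mut().push(tx.into().0);
    }
}

fn main() {
    let log = Log(RefCell::new(Vec::new()));
    log.append_tx(TxData(7)); // conversion happens inside append_tx
    assert_eq!(*log.0.borrow(), vec![7]);
}
```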
No, I'm saying that it just replaced that overhead with different overhead.
It is an issue for replicated ones. I just haven't moved the required logic over into the replicated durability worker yet. I wanted to flesh out all the details using local durability first.
Yes, I'm referring to the wake. But the database thread is the one doing the waking, which means that overhead is on the critical path. More accurately, the cost of going from
Can we avoid reordering altogether? As long as we release the lock only after we have sent the payload to the durability worker, it should be impossible to get out of order transactions, no?
The first version of the patch that introduced the reordering did that, but it involved passing a callback down to the datastore so that we can send to the durability channel before releasing or downgrading the lock. It was deemed to have an unwanted performance penalty, but we can benchmark it to judge if it’s worthwhile.
Said first version is #4661
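The ordering argument above can be sketched in miniature: if the offset is assigned and sent to the durability channel while the commit lock is still held, arrivals are necessarily in offset order, and no downstream reordering is needed. All names here are illustrative stand-ins, and std's `mpsc` stands in for whatever channel the real code uses.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    let (send, recv) = mpsc::channel::<u64>();
    let commit_lock = Arc::new(Mutex::new(0u64)); // next tx offset

    let mut handles = Vec::new();
    for _ in 0..4 {
        let send = send.clone();
        let commit_lock = Arc::clone(&commit_lock);
        handles.push(thread::spawn(move || {
            for _ in 0..100 {
                // Assign the offset and send it while the lock is held;
                // releasing first would allow out-of-order arrival.
                let mut next = commit_lock.lock().unwrap();
                let offset = *next;
                *next += 1;
                send.send(offset).unwrap();
                // lock released here, after the send
            }
        }));
    }
    drop(send);
    for h in handles {
        h.join().unwrap();
    }

    // Every offset arrives in strictly increasing order.
    let received: Vec<u64> = recv.iter().collect();
    assert_eq!(received.len(), 400);
    assert!(received.windows(2).all(|w| w[0] < w[1]));
}
```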
8ab3811 to ee41165
@kim I have resurrected your original fix from #4661.
I have benchmarked it using one of our higher throughput benchmarks and unsurprisingly it makes no difference since the next (reducer) transaction cannot run until we have handed the payload off to the durability layer, regardless of when we release the tx lock.
Alright, let's go with this, then. Do you think it is still worthwhile to move the conversion function off the database thread?
I could go either way. It's probably fine in most circumstances. It looks like there's a sort by
I guess @Centril might disagree ;) I’d be fine with a concrete type containing
Could you summarize what I'd disagree with? 😅
e0c6141 to 5322bf0
@kim how about this (it simplifies things quite a bit):

```rust
pub type PreparedTx<T> = Box<dyn IntoTransaction<T>>;

pub trait Durability: Send + Sync {
    type TxData;
    fn append_tx(&self, tx: PreparedTx<Self::TxData>);
}
```
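A minimal toy sketch of how that boxed shape could hang together. `IntoTransaction`, `LocalDurability`, and `RawTxData` are stand-ins invented for illustration, not the real definitions; the point is that the trait object keeps `Durability` usable as `dyn Durability` while still deferring the payload conversion to the durability side.

```rust
use std::sync::Mutex;

/// Toy stand-in for the materialized `Transaction<TxData>` payload.
struct Transaction(u64);

/// Deferred conversion: the caller hands over something that can
/// become a transaction, without paying for the conversion itself.
trait IntoTransaction<T> {
    fn into_transaction(self: Box<Self>) -> T;
}

type PreparedTx<T> = Box<dyn IntoTransaction<T>>;

trait Durability: Send + Sync {
    type TxData;
    fn append_tx(&self, tx: PreparedTx<Self::TxData>);
}

/// Toy durability impl: conversion happens here, off the caller's path.
struct LocalDurability {
    log: Mutex<Vec<Transaction>>,
}

impl Durability for LocalDurability {
    type TxData = Transaction;
    fn append_tx(&self, tx: PreparedTx<Transaction>) {
        self.log.lock().unwrap().push(tx.into_transaction());
    }
}

/// Toy stand-in for the datastore's raw tx data.
struct RawTxData(u64);

impl IntoTransaction<Transaction> for RawTxData {
    fn into_transaction(self: Box<Self>) -> Transaction {
        Transaction(self.0)
    }
}

fn main() {
    let d = LocalDurability { log: Mutex::new(Vec::new()) };
    d.append_tx(Box::new(RawTxData(42)));
    assert_eq!(d.log.lock().unwrap()[0].0, 42);
}
```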
5322bf0 to a51854f
And remove intermediate worker from generic durability path entirely.
a51854f to 04b1a24
Description of Changes
Before this change, local durability had two workers. The first worker reordered slightly out-of-order submissions by TxOffset and materialized the commitlog payload, then forwarded the payload to the second worker, which wrote to the commitlog. The problem was that the first worker did so little work that it was mostly parked, so every transaction paid the cost of waking it up. For very high throughput workloads, this cost was significant and on the critical path - the database thread.
This patch merges the two workers so that the database thread feeds a mostly non-idle task.
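In miniature, the merged shape is a single consumer on one channel that both materializes the payload and appends it, so the database thread only pays for a cheap send. This is a standalone sketch with invented stand-in types (`TxData`, `CommitlogPayload`, `materialize`), using std's `mpsc` and a thread in place of the real channel and task.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the datastore's raw tx data.
struct TxData(u64);
/// Stand-in for the materialized commitlog payload.
struct CommitlogPayload(u64);

/// Stand-in for the TxData -> payload conversion that used to run
/// in a separate, mostly idle worker.
fn materialize(tx: TxData) -> CommitlogPayload {
    CommitlogPayload(tx.0)
}

fn main() {
    let (tx, rx) = mpsc::channel::<TxData>();

    // One worker does both steps that were previously split across two
    // actors: materialize the payload, then write it. Under load the
    // queue stays non-empty, so the worker rarely parks and the sender
    // rarely pays for a wakeup.
    let writer = thread::spawn(move || {
        let mut log = Vec::new();
        while let Ok(txdata) = rx.recv() {
            let payload = materialize(txdata);
            log.push(payload.0); // stands in for the commitlog append
        }
        log
    });

    for i in 0..3u64 {
        tx.send(TxData(i)).unwrap(); // cheap handoff on the database thread
    }
    drop(tx); // close the channel so the worker exits

    let log = writer.join().unwrap();
    assert_eq!(log, vec![0, 1, 2]);
}
```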
API and ABI breaking changes
N/A
Expected complexity level and risk
I'll call it a 3, mainly because it touches the durability layer in some capacity, although overall it's a simplification since it removes the transaction reordering logic that previously existed to paper over a race condition that @kim had originally fixed in #4661.

Testing
Pure refactor