A handful of optimizations for the DRC collector #12974
fitzgen wants to merge 7 commits into bytecodealliance:main
Conversation
Also add fast-path entry points that take a `u32` size directly that has already been rounded to the free list's alignment. Altogether, this shaves off ~309B instructions retired (48%) from the benchmark in bytecodealliance#11141
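As a rough illustration of the precondition those fast paths rely on, here is a minimal sketch (the helper name and alignment constant are hypothetical, not Wasmtime's actual code) of rounding a requested size up to the free list's alignment once, so the `u32`-taking entry point can skip that work:

```rust
// Hypothetical constant: assume the free list aligns allocations to 8 bytes.
const FREE_LIST_ALIGN: u32 = 8;

/// Round `size` up to the free list's alignment so a fast-path entry
/// point taking a pre-rounded `u32` can skip this work per allocation.
fn round_up_to_align(size: u32) -> u32 {
    debug_assert!(FREE_LIST_ALIGN.is_power_of_two());
    // Classic power-of-two round-up; callers must ensure no overflow.
    (size + FREE_LIST_ALIGN - 1) & !(FREE_LIST_ALIGN - 1)
}

fn main() {
    assert_eq!(round_up_to_align(1), 8);
    assert_eq!(round_up_to_align(8), 8);
    assert_eq!(round_up_to_align(13), 16);
}
```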
Ideally we would just use a `SecondaryMap<VMSharedTypeIndex, TraceInfo>` here but allocating `O(num engine types)` space inside a store that uses only a couple types seems not great. So instead, we just have a fixed size cache that is probably big enough for most things in practice.
Inline `dec_ref`, `trace_gc_ref`, and `dealloc` into `dec_ref_and_maybe_dealloc`'s main loop so that we read the `VMDrcHeader` once per object to get its `ref_count`, type index, and `object_size`, avoiding three separate GC heap accesses and bounds checks per freed object. For struct tracing, read `gc_ref` fields directly from the heap slice at known offsets instead of going through `gc_object_data` → `object_range` → `object_size`, which would re-read the `object_size` from the header. 301,333,979,721 -> 291,038,676,119 instructions (~3.4% improvement)
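The single-pass idea can be sketched as follows. This is a deliberately simplified model, not Wasmtime's actual code: `Header`, `Heap`, and the free-list representation are all hypothetical stand-ins. The point is that one header read yields the ref count, type index, and size, instead of three separate bounds-checked heap accesses:

```rust
#[derive(Clone, Copy)]
struct Header {
    ref_count: u32,
    ty: u32,          // stand-in for the type index
    object_size: u32,
}

struct Heap {
    headers: Vec<Header>, // stand-in for the bounds-checked GC heap
}

impl Heap {
    /// Decrement the ref count and, if it hits zero, free the object.
    /// A single header read supplies everything the free path needs.
    fn dec_ref_and_maybe_dealloc(&mut self, idx: usize, free_list: &mut Vec<(u32, u32)>) {
        let h = self.headers[idx]; // one bounds-checked access
        let new_count = h.ref_count - 1;
        if new_count == 0 {
            // `ty` and `object_size` come from the same read; no re-fetch.
            free_list.push((h.ty, h.object_size));
        } else {
            self.headers[idx].ref_count = new_count;
        }
    }
}

fn main() {
    let mut heap = Heap { headers: vec![Header { ref_count: 1, ty: 3, object_size: 24 }] };
    let mut free_list = Vec::new();
    heap.dec_ref_and_maybe_dealloc(0, &mut free_list);
    assert_eq!(free_list, vec![(3, 24)]);
}
```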
…exists

When the GC store is already initialized and the allocation succeeds, avoid the async machinery entirely. This avoids the overhead of taking/restoring fiber async state pointers on every allocation. 291,038,676,119 -> 230,503,364,489 instructions (~20.8% improvement)
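The shape of that fast path can be sketched like this (all names here are hypothetical; Wasmtime's actual types and slow path differ): return synchronously when the store exists and the bump allocation fits, and only fall back to the expensive slow path when the store needs initialization or the heap needs a GC/growth:

```rust
struct GcStore {
    next: u32,  // bump pointer into the heap
    limit: u32, // end of the currently available region
}

/// Fast path: `Some(offset)` when the store is initialized and the
/// allocation fits; `None` means the caller must take the slow
/// (potentially async) path that initializes the store or runs a GC.
fn try_fast_alloc(store: &mut Option<GcStore>, size: u32) -> Option<u32> {
    let s = store.as_mut()?; // store not yet initialized: slow path
    if s.limit - s.next >= size {
        let offset = s.next;
        s.next += size;
        Some(offset) // no fiber/async state touched on this path
    } else {
        None // needs GC or heap growth: slow path
    }
}

fn main() {
    let mut store = Some(GcStore { next: 0, limit: 64 });
    assert_eq!(try_fast_alloc(&mut store, 16), Some(0));
    assert_eq!(try_fast_alloc(&mut store, 16), Some(16));
    assert_eq!(try_fast_alloc(&mut store, 64), None);
}
```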
Avoids converting `ModuleInternedTypeIndex` to `VMSharedTypeIndex` in host code, which requires lookups in the instance's module's `TypeCollection`. We already have helpers to do this conversion inline in JIT code. 230,503,364,489 -> 216,937,168,529 instructions (~5.9% improvement)
Moves the `externref` host-data cleanup inside the `ty.is_none()` branch of `dec_ref_and_maybe_dealloc`, since only `externref`s have host data. Additionally, the type check is somewhat expensive, since it involves additional bounds-checked reads from the GC heap.
This issue or pull request has been labeled: "wasmtime:api", "wasmtime:ref-types". The following users have been cc'd because of those labels: @fitzgen.
alexcrichton left a comment
I need to spend more time looking at "Combine dec_ref, trace, and dealloc into single-pass loop", but this is one thing I noticed. The later commits seem fine though.
This is another case, though, where in-wasm GC allocation, GC mark/sweep, etc. would, I suspect, remove a huge amount of the overhead, since the host has to dance around "the heap could be corrupt at any time", which loses a lot of perf I believe. I realize that's a big undertaking, but we may want to discuss more seriously in a meeting at some point whether it's table stakes for shipping GC.
```rust
/// Get the trace information associated with the given type index.
pub fn get(&mut self, ty: VMSharedTypeIndex) -> &TraceInfo {
    if let Some((ty2, info)) = self.cache[Self::cache_index(ty)]
        && ty == ty2
    {
        return info;
    }

    self.get_slow(ty)
}

#[inline]
fn cache_index(ty: VMSharedTypeIndex) -> usize {
    let bits = ty.bits();
    let bits = usize::try_from(bits).unwrap();
    bits % Self::CACHE_CAPACITY
}
```
This seems like it's a bit of a poor-man's hash map here. Since this is already using a hash map, what's the performance of using a custom hasher where the hash of VMSharedTypeIndex is just its bit value?
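The suggestion can be sketched with a custom hasher whose hash of a `u32` key is just its bit value, plugged into a standard `HashMap`. This is a hedged illustration of the general technique, not the reviewer's code; `IdentityHasher` and `IdentityMap` are hypothetical names, and the real key would be `VMSharedTypeIndex` rather than a bare `u32`:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical identity hasher: the "hash" of a u32 key is the key itself.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        // u32 keys hash via `write_u32`, so this should never be reached.
        unreachable!("IdentityHasher only supports u32 keys")
    }
    fn write_u32(&mut self, n: u32) {
        self.0 = u64::from(n);
    }
}

/// A HashMap keyed by u32 whose hash is just the key's bit value.
type IdentityMap<V> = HashMap<u32, V, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let mut m: IdentityMap<&str> = IdentityMap::default();
    m.insert(42, "trace-info");
    assert_eq!(m.get(&42), Some(&"trace-info"));
    assert_eq!(m.get(&7), None);
}
```

Since the hash is the identity, lookups avoid the per-key hashing cost that SipHash-based maps pay, at the cost of depending on the keys being well distributed.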
Happy to discuss at a meeting, I'll add an item, but I find it super surprising that we would even entertain the idea of blocking enabling the GC proposal by default on self-hosting the free list (or, even worse from a time-to-shipping perspective, self-hosting the whole collector runtime).
Depends on #12969
See each commit message for details.
More coming soon after this.