Conversation
Polar Signals Profiling Results: Latest Run
Previous Runs (6)
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.980x ➖
datafusion / vortex-file-compressed (0.980x ➖, 1↑ 0↓)
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.047x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.040x ➖, 0↑ 2↓)
datafusion / parquet (1.049x ➖, 0↑ 2↓)
datafusion / arrow (1.080x ➖, 0↑ 8↓)
duckdb / vortex-file-compressed (1.075x ➖, 0↑ 5↓)
duckdb / vortex-compact (1.057x ➖, 0↑ 1↓)
duckdb / parquet (1.024x ➖, 2↑ 3↓)
duckdb / duckdb (1.045x ➖, 0↑ 2↓)
Full attributed analysis
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.993x ➖, 1↑ 1↓)
datafusion / vortex-compact (0.988x ➖, 0↑ 0↓)
datafusion / parquet (0.979x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.972x ➖, 2↑ 0↓)
duckdb / vortex-compact (1.026x ➖, 0↑ 2↓)
duckdb / parquet (1.000x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.937x ➖, 25↑ 1↓)
datafusion / vortex-compact (0.979x ➖, 4↑ 0↓)
datafusion / parquet (0.999x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (0.918x ➖, 28↑ 0↓)
duckdb / vortex-compact (0.916x ➖, 36↑ 0↓)
duckdb / parquet (0.937x ➖, 19↑ 0↓)
duckdb / duckdb (0.925x ➖, 23↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.106x ➖, 1↑ 9↓)
datafusion / vortex-compact (0.775x ➖, 9↑ 1↓)
datafusion / parquet (0.862x ➖, 5↑ 0↓)
duckdb / vortex-file-compressed (0.883x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.912x ➖, 0↑ 0↓)
duckdb / parquet (0.936x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.061x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.057x ➖, 0↑ 0↓)
datafusion / parquet (1.046x ➖, 0↑ 1↓)
datafusion / arrow (1.085x ➖, 0↑ 4↓)
duckdb / vortex-file-compressed (0.930x ➖, 6↑ 0↓)
duckdb / vortex-compact (0.956x ➖, 8↑ 0↓)
duckdb / parquet (1.031x ➖, 0↑ 2↓)
duckdb / duckdb (0.977x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Random Access
Vortex (geomean): 0.887x ✅
unknown / unknown (0.970x ➖, 6↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.991x ➖, 1↑ 1↓)
duckdb / vortex-compact (0.964x ➖, 1↑ 0↓)
duckdb / parquet (0.989x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.917x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.897x ➖, 1↑ 0↓)
datafusion / parquet (0.852x ➖, 4↑ 0↓)
duckdb / vortex-file-compressed (0.925x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.913x ➖, 1↑ 0↓)
duckdb / parquet (0.987x ➖, 0↑ 1↓)
Full attributed analysis
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.968x ➖, 3↑ 0↓)
datafusion / parquet (1.022x ➖, 0↑ 5↓)
duckdb / vortex-file-compressed (1.009x ➖, 2↑ 2↓)
duckdb / parquet (1.003x ➖, 0↑ 0↓)
duckdb / duckdb (0.989x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Compression
Vortex (geomean): 1.008x ➖
unknown / unknown (1.006x ➖, 0↑ 3↓)
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.907x ➖, 3↑ 2↓)
datafusion / vortex-compact (0.847x ➖, 2↑ 0↓)
datafusion / parquet (0.889x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.943x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.919x ➖, 0↑ 0↓)
duckdb / parquet (0.903x ➖, 0↑ 0↓)
Full attributed analysis
Force-pushed bfb7f6c to 2f94e47
Merging this PR will not alter performance
Force-pushed d45d3ee to 187e742
Force-pushed 187e742 to d25093e
Force-pushed 1600b75 to ff9e7bc
Force-pushed 311ade1 to 682de48
Force-pushed 9c98ea4 to 084fba4
Please can you make one PR that moves the files in one go, and another that does the real changes?
Force-pushed 084fba4 to 0b30df9
@joseph-isaacs It's not super clear that that is possible... Even though things have been pulled out of
Edit: We will keep this PR open but also have parallel PRs that will pull out just the new
Edit: Split out into #7104
Edit: That ^ was a terrible idea
Force-pushed 3a62499 to 349d259
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Force-pushed 349d259 to ed4cd52
I like the direction a lot.
But I think we should more closely follow the conventions we already have for plugins, i.e. a vtable for Scheme with all the associated machinery and an associated type for options.
I'd also like to think about how we can define compression graphs, or "pipelines" for compressing arrays of certain types.
It might also be worth replacing stats with AggregateFnRef. We're going to do this in the main vortex-array crate soon, and we may as well start doing stats this way within the compressor if we're going to change it a lot.
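To make the vtable suggestion concrete, here is a rough Rust sketch of a `Scheme` vtable trait with an associated type for its options. Every name in it (`SchemeVTable`, `DictScheme`, `DictOptions`, the `estimate` heuristic) is hypothetical and chosen for illustration only; it is not the actual Vortex API.

```rust
/// Sketch of a plugin vtable: each scheme declares an associated
/// options type, mirroring the convention suggested in the review.
trait SchemeVTable {
    type Options: Default;

    fn id(&self) -> &'static str;
    /// Estimate the compressed size in bytes for a given input size.
    fn estimate(&self, input_bytes: usize, opts: &Self::Options) -> usize;
}

/// Example scheme with scheme-specific options (all names invented).
struct DictScheme;

#[derive(Default)]
struct DictOptions {
    /// Made-up knob: give up on dictionary encoding above this cardinality.
    max_distinct: usize,
}

impl SchemeVTable for DictScheme {
    type Options = DictOptions;

    fn id(&self) -> &'static str {
        "dict"
    }

    fn estimate(&self, input_bytes: usize, _opts: &Self::Options) -> usize {
        input_bytes / 2 // placeholder heuristic, not a real estimator
    }
}

fn main() {
    let opts = DictOptions { max_distinct: 1024 };
    println!("{}: {} bytes", DictScheme.id(), DictScheme.estimate(1000, &opts));
}
```

The associated `Options` type keeps per-scheme configuration statically typed on the implementation side, while registration machinery can still erase it behind a uniform interface.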
Summary: Extensible and Pluggable Compressor
Tracking Issue: #6872
This is a major step in supporting extension types as a first-class feature in Vortex.
The `vortex-btrblocks` compressor currently depends on every encoding crate in the workspace, and extension types (Vector, UUID, Tensor, JSON) have no mechanism for type-specific compression.
This PR introduces a new `vortex-compressor` crate that extracts the encoding-agnostic compression framework, inverting the dependency graph so that encoding crates can implement a single `Scheme` trait and register themselves with the compressor. Additionally, `vortex-btrblocks` remains the "batteries-included" default compressor and depends on `vortex-compressor`.
Basically, the entire compressor has been rewritten. Below are the major changes, but there are also many smaller logical changes that were needed to make sense with this new compressor.
Theoretically, there was a way to hack pluggability into the existing compressor without a complete rewrite, but I determined that it would not provide the level of expressiveness needed to fully support extension types and encodings as first-class citizens. I could be wrong (and this was all a waste of time), but I also found a lot of strange things in the existing compressor that didn't make much sense and are now eliminated in this new compressor.
Changes
- A `Scheme` trait replaces the old type-specific `IntegerScheme`/`FloatScheme`/`StringScheme` traits and `IntCode`/`FloatCode`/`StringCode` enums. Schemes are identified by an opaque `SchemeId` (obtained only via `SchemeExt::id()`). The old `Compressor`/`CompressorExt`/`CanonicalCompressor` traits and `IntCompressor`/`FloatCompressor`/`StringCompressor` structs are replaced by a `CascadingCompressor` that selects from a vec of `&'static dyn Scheme`.
- The `vortex-compressor` crate contains the framework (trait definitions, cascading compressor, stats, sampling) with zero encoding dependencies (other than built-in ones from `vortex-array`).
- An `ArrayAndStats` bundle replaces the old pattern of passing arrays and stats caches separately. Stats are generated lazily on first access via typed methods (`integer_stats()`, `float_stats()`, `string_stats()`). Each scheme declares any expensive required stats via `stats_options()` (specifically, distinct values and their frequencies via a hash map), and the compressor merges all eligible schemes' options before generating stats so that expensive computations only run when needed.
- `vortex-btrblocks` remains the batteries-included writer, except it now depends on `vortex-compressor` for the compression logic and registers all encoding-specific schemes (BitPacking, FoR, ALP, FSST, etc.) that we host in `encodings/`.
- An exclusion system replaces the old `new_excludes` vectors. Schemes declare `descendant_exclusions` (push) and `ancestor_exclusions` (pull) to prevent incompatible combinations in the cascade chain. The compressor enforces these automatically, along with self-exclusion (no scheme appears twice in a chain). We do this specifically to avoid a dependency cycle.
- `compress_child` encapsulates cascade budget tracking. Schemes call `compressor.compress_child(array, &ctx, self.id(), child_index)` instead of manually building contexts and calling `compress_canonical`. If the cascade budget is exhausted, the child is returned as-is.
- The special-cased branches of `compress_canonical` become `Scheme` implementations (`DecimalScheme`, `TemporalScheme`), registered in `ALL_SCHEMES` just like any other scheme.

Note that essentially none of the scheme logic was changed (the estimation and compression logic is mostly identical to before). The main changes are to the framework around schemes.
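The exclusion rules (self-exclusion plus push/pull lists) can be sketched in isolation. The `Scheme` trait below is a toy stand-in: `descendant_exclusions`/`ancestor_exclusions` mirror the names in this description, while the concrete schemes and the `eligible` helper are invented for illustration and are not the real Vortex types.

```rust
type SchemeId = &'static str;

/// Toy version of the scheme trait, reduced to the exclusion hooks.
trait Scheme {
    fn id(&self) -> SchemeId;
    /// Schemes this one refuses to have beneath it in a chain (push).
    fn descendant_exclusions(&self) -> &'static [SchemeId] { &[] }
    /// Schemes this one refuses to have above it in a chain (pull).
    fn ancestor_exclusions(&self) -> &'static [SchemeId] { &[] }
}

/// Is `candidate` allowed as the next link beneath the chain `ancestors`?
fn eligible(candidate: &dyn Scheme, ancestors: &[&dyn Scheme]) -> bool {
    // Self-exclusion: no scheme appears twice in a chain.
    if ancestors.iter().any(|a| a.id() == candidate.id()) {
        return false;
    }
    // Push: some ancestor excluded this candidate as a descendant.
    if ancestors.iter().any(|a| a.descendant_exclusions().contains(&candidate.id())) {
        return false;
    }
    // Pull: the candidate excludes one of its ancestors.
    if ancestors.iter().any(|a| candidate.ancestor_exclusions().contains(&a.id())) {
        return false;
    }
    true
}

struct Dict;
impl Scheme for Dict {
    fn id(&self) -> SchemeId { "dict" }
    // Hypothetical rule: never cascade run-end beneath dict.
    fn descendant_exclusions(&self) -> &'static [SchemeId] { &["runend"] }
}

struct RunEnd;
impl Scheme for RunEnd {
    fn id(&self) -> SchemeId { "runend" }
}

fn main() {
    let (dict, runend) = (Dict, RunEnd);
    assert!(!eligible(&runend, &[&dict])); // pushed out by dict
    assert!(eligible(&dict, &[&runend])); // allowed
    assert!(!eligible(&dict, &[&dict])); // self-exclusion
    println!("exclusion checks pass");
}
```

Because every rule only consults the chain of ancestors, eligibility can be checked locally at each cascade step without global state.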
API Changes
TODO: need to verify which APIs we want to avoid breaking from `vortex-btrblocks` (how much can we keep the same by re-exporting).
Testing
Existing tests never broke, so that's a good sign.
TODO: still need to test the exclusion system a bit better, and it might also be beneficial to have more microbenchmarks for the compressor.
Notes
For reviewers: I would just look at the whole `vortex-compressor` and `vortex-btrblocks` crates instead of the git diff, since basically everything has changed.
Some other observations:
The current mechanism by which we stop cascading is not very smart (there are 3 levels and then that's it). I believe the old compressor had this simply to keep the search bounded.
With this new compressor and its exclusion system (where we always exclude the scheme that we just applied from future cascades), the entire search space is now bounded, and it's actually a tree (not even a DAG!). If there are cases where we actually want to apply the same scheme to children (maybe, for some reason I don't understand, you want dict applied to the run ends of a run-end encoding of dict codes, but I doubt that this is good), we can get around that by creating a new leaf scheme (though again, this sounds bad anyways).
So we can be a lot smarter with deciding how to search because the search space is known on construction of the compressor. Instead of a hardcoded level, we could make decisions based on how expensive running a compression scheme is. I think there is lots of potential here for improvement.
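To see why the search space is finite and tree-shaped once applied schemes are excluded, here is a toy enumeration assuming only self-exclusion (the real push/pull exclusions would only prune this tree further); the scheme names are placeholders.

```rust
/// Enumerate every legal cascade chain over `schemes`, where a scheme
/// may not repeat within a chain (self-exclusion). Each recursive call
/// extends `prefix` by one not-yet-used scheme, so the chains form a tree.
fn chains<'a>(schemes: &[&'a str], prefix: Vec<&'a str>, out: &mut Vec<Vec<String>>) {
    // Record the current chain (the empty chain counts too).
    out.push(prefix.iter().map(|s| s.to_string()).collect());
    for &s in schemes {
        if !prefix.contains(&s) {
            let mut next = prefix.clone();
            next.push(s);
            chains(schemes, next, out);
        }
    }
}

fn main() {
    let mut out = Vec::new();
    chains(&["dict", "runend", "fsst"], Vec::new(), &mut out);
    // With n schemes and self-exclusion only, the chain count is
    // sum over k of n!/(n-k)!; for n = 3 that's 1 + 3 + 6 + 6 = 16.
    assert_eq!(out.len(), 16);
    println!("{} legal chains", out.len());
}
```

Since this whole tree is known when the compressor is constructed, a smarter search policy (e.g. weighting branches by how expensive a scheme is to try) could replace the hardcoded level limit.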