Optimize splitPart scalar function to reduce allocation#17660
Optimize splitPart scalar function to reduce allocation#17660Jackie-Jiang merged 5 commits intoapache:masterfrom
Conversation
| * TODO: Revisit if index should be one-based (both Presto and Postgres use one-based index, which starts with 1) | ||
| * @param input | ||
| * @param delimiter | ||
| * @param input the input String to be split into parts. |
There was a problem hiding this comment.
copied verbatim from the the overloaded splitPart function below to keep in sync.
33336f1 to
5a2b4f1
Compare
| @@ -0,0 +1,160 @@ | |||
| /** | |||
| * Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
I can remove this if required. Added it for my own validation of the perf
5a2b4f1 to
24644cf
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #17660 +/- ##
============================================
- Coverage 63.29% 63.22% -0.08%
- Complexity 1466 1525 +59
============================================
Files 3190 3194 +4
Lines 191904 193705 +1801
Branches 29390 29809 +419
============================================
+ Hits 121474 122469 +995
- Misses 60940 61622 +682
- Partials 9490 9614 +124
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I will debug these tests and update the PR. I will also try to get the existing coverage to 100% so we can be confident before migrating |
|
Try rebasing. The test failures might not be related |
Updated the splitPart function to use direct string traversal instead of allocating a full String[] array, significantly reducing memory pressure in high-throughput query scenarios. Performance improvements: - Space complexity: O(n) → O(1) - Time complexity: O(n) for both implementations Added JMH benchmarks to demonstrate improvements and regressions. The primary regression is in the backward index case with a large index value. This is the uncommon case and maybe worth the tradeoff since the memory allocation in the common case is now significantly reduced.
b4272e2 to
cb95a56
Compare
|
Done. I also added full test coverage which uncovered a behavior change with this new implementation, let me know your thoughts @Jackie-Jiang Behavioral difference from old implementationThe new implementation differs from
I feel that the new implementation is more correct, rather than returning null. wdyt? |
There was a problem hiding this comment.
Pull request overview
This PR optimizes the splitPart scalar function in pinot-common to avoid allocating a full String[] split result, reducing memory pressure for high-throughput query workloads. It also adds a JMH benchmark and expands unit test coverage for additional edge cases.
Changes:
- Reimplemented
StringFunctions.splitPart(input, delimiter, index)using direct string traversal (including a specialized negative-index fast path for single-char delimiters). - Added JMH benchmarks to compare the new vs old
splitPartimplementations. - Added many new
splitPartunit test cases covering edge scenarios (empty delimiter, consecutive delimiters, leading/trailing delimiters, multi-char delimiters, empty input).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java |
Replaces splitPart implementation with allocation-reduced traversal logic and adds a negative-index optimized helper. |
pinot-common/src/test/java/org/apache/pinot/common/function/scalar/StringFunctionsTest.java |
Expands splitPart test matrix for more delimiter/index boundary cases. |
pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkSplitPart.java |
Introduces a JMH benchmark to measure old vs new splitPart performance across scenarios. |
| _delimiter = ":"; | ||
| _input = buildString(50, repeatStr(_delimiter, 10)); |
There was a problem hiding this comment.
In the multi_char_delim scenario, _delimiter is set to ":" but _input is built using repeatStr(_delimiter, 10) as the delimiter between fields. This means the benchmark isn't actually exercising a multi-character delimiter for splitPart (it passes a single-char delimiter that happens to appear consecutively), so the results can be misleading. Consider setting _delimiter to the repeated string (or use something like "::") and building _input with that same _delimiter.
| _delimiter = ":"; | |
| _input = buildString(50, repeatStr(_delimiter, 10)); | |
| _delimiter = repeatStr(":", 10); | |
| _input = buildString(50, _delimiter); |
There was a problem hiding this comment.
this is semantically the same. its more readable this way so I will address it but the logic is the same
| case "large_index": | ||
| // Extract 80th element from 100-field string | ||
| _delimiter = ","; | ||
| _input = buildString(10000, _delimiter); | ||
| _index = 790; | ||
| break; | ||
| case "negative_index": | ||
| // Extract last element from 100-field string | ||
| _delimiter = ","; | ||
| _input = buildString(100, _delimiter); | ||
| _index = -1; | ||
| break; | ||
| case "large_negative_index": | ||
| // Extract last element from 100-field string | ||
| _delimiter = ","; | ||
| _input = buildString(10000, _delimiter); | ||
| _index = -7242; | ||
| break; | ||
| case "adjacent_delimiters": | ||
| // String with many consecutive delimiters | ||
| _input = "field0+++field1+++field2+++field3"; | ||
| _delimiter = "+"; | ||
| _index = 1; | ||
| break; | ||
| case "many_fields": | ||
| // Extract 5th element from 1000-field string | ||
| _delimiter = ","; | ||
| _input = buildString(10000, _delimiter); | ||
| _index = 4; |
There was a problem hiding this comment.
Several scenario comments don't match the actual parameters (e.g., large_index/many_fields mention 100/1000 fields but build 10,000; large_index says 80th but uses index 790). Align the comments (or the parameters) so the benchmark scenarios remain self-explanatory.
There was a problem hiding this comment.
thanks for this. updated the comments.
| start += delimLen; | ||
| } | ||
| // optimization for negative index with single-char delimiter since common case | ||
| // multi-char delimiter with negative index can be handled in future since less common and more complex |
There was a problem hiding this comment.
-index will overflow when index == Integer.MIN_VALUE, causing incorrect behavior for that edge case (and potentially passing a negative value into the helper). Add an explicit guard for index == Integer.MIN_VALUE (e.g., return "null" early) or handle the absolute-value conversion without overflow.
| // multi-char delimiter with negative index can be handled in future since less common and more complex | |
| // multi-char delimiter with negative index can be handled in future since less common and more complex | |
| if (index == Integer.MIN_VALUE) { | |
| return "null"; | |
| } |
…d multi_char_delim scenario
| } else { | ||
| // compare with {@link BenchmarkSplitPart} for future changes | ||
| int delimLen = delimiter.length(); | ||
| if (delimLen == 0) { |
There was a problem hiding this comment.
This seems like a behavior change. When delimiter is empty, the current behavior splits on whitespace.
Can you also check the behavior on PostgreSQL?
There was a problem hiding this comment.
Thanks for the review! I checked the behavior for postgres;
When the delimiter is an empty string:
- PostgreSQL: returns the entire input for field 1/-1, empty string for other fields
- Pinot (old): splits on whitespace — this was a side effect of StringUtils.splitByWholeSeparator treating empty ("") the same as
null
I've updated the implementation to preserve the existing whitespace-splitting behavior for the empty delimiter case by falling back to StringUtils.splitByWholeSeparator. The changes in this PR now only applies to non-empty delimiters, so there's no behavioral change.
If we want to align with PostgreSQL's empty-delimiter semantics, that can be a separate PR since it's a behavioral change independent of this performance optimization.
There was a problem hiding this comment.
Sounds good. Could you please also add a test using random string generator and some simple delimiters to ensure the correctness of the new impl?
|
|
||
| // skip leading delimiters | ||
| int start = 0; | ||
| int len = input.length(); |
There was a problem hiding this comment.
For empty input (""), this implementation returns an empty string for index 0 and -1 (because start == len and substring(len) is ""). That differs from the existing splitPart(input, delimiter, limit, index) behavior (which returns "null" for the same cases with a large limit) and is likely an unintended semantic change for splitPart without limit. Consider adding an explicit empty-input guard (return "null") to keep the 3-arg and 4-arg splitPart variants consistent.
| int len = input.length(); | |
| int len = input.length(); | |
| if (len == 0) { | |
| // Align behavior with splitPart(input, delimiter, limit, index) for empty input | |
| return "null"; | |
| } |
There was a problem hiding this comment.
Also addressed as part of Jackie's comment regarding behavior change. Thanks 👍
| {"", ".", 0, 100, "", "null"}, | ||
| {"", ".", -1, 100, "", "null"}, | ||
| {"", ".", -2, 100, "null", "null"}, | ||
|
|
||
| // Empty input with multi-char delimiter and negative index | ||
| {"", "::", -1, 100, "", "null"}, |
There was a problem hiding this comment.
These new test cases assert that StringFunctions.splitPart("", ".", 0/-1) returns "" (empty string). The 4-arg splitPart(input, delimiter, limit, index) currently returns "null" for the same inputs with limit=100, so this bakes in a behavioral divergence between the overloads. If the intent is to preserve existing semantics, update these expectations to match the overload (and the prior implementation) rather than introducing a silent behavior change for empty input.
| {"", ".", 0, 100, "", "null"}, | |
| {"", ".", -1, 100, "", "null"}, | |
| {"", ".", -2, 100, "null", "null"}, | |
| // Empty input with multi-char delimiter and negative index | |
| {"", "::", -1, 100, "", "null"}, | |
| {"", ".", 0, 100, "null", "null"}, | |
| {"", ".", -1, 100, "null", "null"}, | |
| {"", ".", -2, 100, "null", "null"}, | |
| // Empty input with multi-char delimiter and negative index | |
| {"", "::", -1, 100, "null", "null"}, |
There was a problem hiding this comment.
Good call! I was also doing this per #17660 (comment)
| } else if (index < 0 && index >= -splitString.length) { | ||
| return splitString[splitString.length + index]; | ||
| } else { | ||
| // compare with {@link BenchmarkSplitPart} for future changes |
There was a problem hiding this comment.
The // compare with {@link BenchmarkSplitPart} note uses Javadoc {@link ...} syntax in a non-Javadoc comment and references a class in the perf module. Consider switching to plain text (or a URL) to avoid implying a resolvable Javadoc symbol and reduce cross-module coupling in production code comments.
| // compare with {@link BenchmarkSplitPart} for future changes | |
| // compare with BenchmarkSplitPart (perf module) for future changes |
| String input, char delimiter, int index, int len, int start) { | ||
| // input is empty or contains only delimiters | ||
| if (start == len) { | ||
| return index == 1 ? "" : "null"; |
There was a problem hiding this comment.
Is this correct? Why is it returning empty string?
There was a problem hiding this comment.
Good question—
Even in the old implementation, splitByWholeSeparator("+++++", "+") returns [""] — a single-element array with one empty string (i.e. the trailing content after all delimiters are consumed, which is an empty char). So index=-1 maps to parts[0] = "". This is also why I found the StringUtils behavior confusing..
Reproducible in jshell:
❯ jshell --class-path pinot-server/target/pinot-server-pkg/lib/commons-lang3-3.20.0.jar
| Welcome to JShell -- Version 17.0.5
| For an introduction type: /help intro
jshell> import org.apache.commons.lang3.StringUtils;
jshell> StringUtils.splitByWholeSeparator("+++++", "+")
$2 ==> String[1] { "" }
Also see intelliJ test coverage where it shows the randomized test (which uses the older StringUtils version for the expected value) hitting this branch as well:
There was a problem hiding this comment.
We also do have test cases that hit both these branches:
{"+++++", "+", -1, 100, "", ""},
{"+++++", "+", -2, 100, "null", "null"},
Optimized the splitPart function to use direct string traversal instead of allocating a full String[] array, significantly reducing memory pressure in high-throughput query scenarios.
Performance improvements:
Added JMH benchmarks to demonstrate improvements and regressions. The primary regression is in the backward index case so I added
splitPartNegativeIdxSingleCharDelim(b4272e2) which increases complexity but makes the kernel much fasterAddresses #17362
JMH Results
Yeah, the common cases are more than 10x faster 🚀 . The original results showed a regression with
negative_index, hence b4272e2. However, this code may be harder to manage so that is the trade-off. The below results are from all commits in the PRAllocation via
-prof gc