[client]Fix memory leak if oom when decompress data.#2647
[client]Fix memory leak if oom when decompress data.#2647loserwang1024 wants to merge 2 commits intoapache:mainfrom
Conversation
dd1a57b to
6daa427
Compare
|
@platinumhamburg @wuchong , CC |
| // vectorSchemaRootMap, | ||
| // for example temporary buffers left behind when an OOM occurs during decompression | ||
| // (see https://github.com/apache/fluss/issues/2646). | ||
| bufferAllocator.releaseBytes(bufferAllocator.getAllocatedMemory()); |
There was a problem hiding this comment.
Thanks for the fix! The approach is consistent with RecordAccumulator.close().
One concern: releaseBytes() only adjusts the allocator's accounting but doesn't actually free the underlying direct memory - those buffers will
remain until GC. This fix prevents the IllegalStateException on close, which is good for avoiding cascading failures, but the actual memory may still
leak temporarily.
For a more robust solution, consider adding try-finally in ZstdArrowCompressionCodec.doDecompress() to ensure temp buffers are released on failure.
But the current fix is acceptable as a defensive measure.
There was a problem hiding this comment.
@platinumhamburg , I modified this PR and add a patched version of Arrow's {@code VectorLoader} to fix this problem. Detail see VectorLoaderTest
0b07ab6 to
0e740b6
Compare
platinumhamburg
left a comment
There was a problem hiding this comment.
Great work identifying and fixing this Arrow buffer leak! I verified that the upstream Arrow-Java main branch
still has this bug in VectorLoader.loadBuffers() — the decompression loop sits outside the try block, and
buf.close() only runs on the success path. So this patch is definitely necessary.
The fix logic is correct: moving the decompression loop inside the try block and releasing buffers in the finally
clause properly handles both the success path (loadFieldBuffers retains +1, finally close -1) and the error path
(finally close frees all already-decompressed buffers).
Below are only one suggestion to improve maintainability.
| * | ||
| * @see <a href="https://github.com/apache/fluss/issues/2646">FLUSS-2646</a> | ||
| */ | ||
| public class VectorLoader { |
There was a problem hiding this comment.
I'd suggest changing the approach from class-shadowing to an explicitly named class under a
Fluss-owned package. Specifically: rename this to FlussVectorLoader and move it to
org.apache.fluss.record, then update the call site in ArrowUtils.createArrowReader().
I've verified locally that all Arrow APIs used in loadBuffers() — TypeLayout, FieldVector,
Collections2, Preconditions, etc. — are public in the shaded JAR, so there's no package-visibility
blocker. Compilation and all 3 tests pass with this change.
Why this is better than class-shadowing:
-
Explicit over implicit. The current approach relies on classpath ordering to silently
override the shaded Arrow JAR's VectorLoader. WithFlussVectorLoader, the call site in
ArrowUtils makes it obvious that a patched loader is being used — no hidden magic. -
Compile-time safety on Arrow upgrades. If Arrow's VectorLoader API changes in a future
version (e.g. the upstreammainbranch already added avariadicBufferCountsparameter),
the shadowed class would silently diverge. With an explicit class, any API incompatibility
surfaces as a compile error immediately. -
No risk of classpath conflicts. Class-shadowing behavior depends on JAR ordering, which
can vary across build tools, IDEs, and deployment environments. An explicitly named class
eliminates this uncertainty entirely.
The change is straightforward:
- Rename
VectorLoader→FlussVectorLoaderinorg.apache.fluss.record - Update import + usage in
ArrowUtils.createArrowReader() - Update import + usage in
VectorLoaderTest
Please also add a TODO in the class Javadoc for future cleanup:
// TODO(FLUSS-2646): Remove this patched VectorLoader once Apache Arrow fixes
// the buffer leak in loadBuffers(). Current Arrow version: 15.0.0
Additionally, I'd recommend filing an issue in https://github.com/apache/arrow-java/issues to
push for an upstream fix — that way we have a clear signal for when this patch can be retired.

Purpose
Linked issue: close #2646
Brief change log
Tests
API and Format
Documentation