[core] Extract Parquet stats from in-memory footer instead of re-reading file #7489
Open
majian1998 wants to merge 1 commit into apache:master
Purpose
Closes #7467.
Currently, Paimon extracts Parquet column statistics by re-reading the file footer after the writer is closed. On object stores like OSS, a newly closed file may not be immediately visible for reading due to eventual consistency, causing FileNotFoundException.
This PR caches the ParquetMetadata in memory during writer close (via ParquetWriter.getFooter()), and passes it through the stats extraction pipeline to avoid re-reading the file. When the in-memory metadata is unavailable (e.g. non-Parquet formats), it falls back to the original file-based extraction path.
Tests
ParquetInMemoryStatsTest#testInMemoryStatsMatchFileStats: verifies in-memory stats match file-based stats across multiple data types
ParquetInMemoryStatsTest#testInMemoryStatsFallbackWhenMetadataIsNull: verifies null metadata falls back to file reading
ParquetInMemoryStatsTest#testWriterMetadataDefaultIsNull: verifies the default FormatWriter returns null metadata
API and Format
No. All new interface methods use default implementations with null/fallback behavior. No breaking changes to public APIs or storage format.
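The pattern described under Purpose (cache metadata at close, default to null, fall back to the file) can be sketched roughly as follows. This is an illustration only and not the code in this PR: `SketchWriter`, `CachingWriter`, `Footer`, and `extractStats` are hypothetical names, and a plain text file with a row count stands in for a Parquet file and its `ParquetMetadata`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class InMemoryStatsSketch {

    /** Hypothetical stand-in for a Parquet footer; it carries only a row count. */
    static final class Footer {
        final long rowCount;
        Footer(long rowCount) { this.rowCount = rowCount; }
    }

    /** Hypothetical writer interface, not Paimon's actual FormatWriter. */
    interface SketchWriter {
        void write(String row);
        void close() throws IOException;

        /** Default: no in-memory metadata, so callers fall back to the file. */
        default Footer writerMetadata() { return null; }
    }

    /** Writer that keeps its footer in memory once it has been closed. */
    static final class CachingWriter implements SketchWriter {
        private final Path path;
        private final List<String> rows = new ArrayList<>();
        private Footer footer; // stays null until close() runs

        CachingWriter(Path path) { this.path = path; }

        @Override public void write(String row) { rows.add(row); }

        @Override public void close() throws IOException {
            Files.write(path, rows);          // persist the data
            footer = new Footer(rows.size()); // cache the "footer" in memory
        }

        @Override public Footer writerMetadata() { return footer; }
    }

    /** Prefers the in-memory footer; re-reads the file only when it is null. */
    static Footer extractStats(Path path, Footer inMemory) throws IOException {
        if (inMemory != null) {
            return inMemory;
        }
        return new Footer(Files.readAllLines(path).size());
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("stats-sketch", ".txt");
        CachingWriter writer = new CachingWriter(path);
        writer.write("a");
        writer.write("b");
        writer.close();
        // Deleting the file simulates an object store where it is not yet visible.
        Files.delete(path);
        Footer stats = extractStats(path, writer.writerMetadata());
        System.out.println("rows=" + stats.rowCount);
    }
}
```

Because the default method returns null, a writer that does not override it needs no changes and still takes the file-based path.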
Documentation
No.
Generative AI tooling
Generated-by: Claude Opus 4.6