[core] Extract Parquet stats from in-memory footer instead of re-reading file #7489

Open
majian1998 wants to merge 1 commit into apache:master from majian1998:fix/parquet-inmemory-stats-extraction

@majian1998
Contributor

Purpose

Closes #7467.

Currently, Paimon extracts Parquet column statistics by re-reading the file footer after the writer is closed. On object stores such as OSS, a newly closed file may not be immediately visible to readers because of eventual consistency, so the re-read can fail with a FileNotFoundException.

This PR caches the ParquetMetadata in memory during writer close (via ParquetWriter.getFooter()), and passes it through the stats extraction pipeline to avoid re-reading the file. When the in-memory metadata is unavailable (e.g. non-Parquet formats), it falls back to the original file-based extraction path.
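
For reviewers, the control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FooterStats` and `StatsExtractorSketch` are hypothetical stand-ins (the real change passes a `ParquetMetadata` obtained from `ParquetWriter.getFooter()`), and the file-based path is modeled as a plain `Supplier`.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the footer captured at writer close
// (the PR itself caches a ParquetMetadata instance).
final class FooterStats {
    final long rowCount;

    FooterStats(long rowCount) {
        this.rowCount = rowCount;
    }
}

// Hypothetical extractor: prefer the cached footer, fall back to reading the file.
final class StatsExtractorSketch {
    // Counts how often the file-based path was taken (for illustration only).
    int fileReads = 0;

    FooterStats extract(FooterStats cachedFooter, Supplier<FooterStats> fileReader) {
        if (cachedFooter != null) {
            // In-memory path: no request is sent to the object store.
            return cachedFooter;
        }
        // Fallback path, e.g. for formats that do not expose a footer.
        fileReads++;
        return fileReader.get();
    }
}
```

With a non-null cached footer the supplier is never invoked, so a file that is not yet visible on the object store cannot cause a failure during stats extraction.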

Tests

  • ParquetInMemoryStatsTest#testInMemoryStatsMatchFileStats — verifies in-memory stats match file-based stats across multiple data types
  • ParquetInMemoryStatsTest#testInMemoryStatsFallbackWhenMetadataIsNull — verifies null metadata falls back to file reading
  • ParquetInMemoryStatsTest#testWriterMetadataDefaultIsNull — verifies default FormatWriter returns null metadata

API and Format

No. All new interface methods use default implementations with null/fallback behavior. No breaking changes to public APIs or storage format.
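
The compatibility argument rests on Java default methods. A rough sketch, assuming hypothetical names (`WriterSketch`, `PlainWriter`, `FooterCachingWriter`, and `writerMetadata()` are illustrative only and are not the identifiers used in this PR):

```java
// Hypothetical writer interface; only the default-method pattern matters here.
interface WriterSketch {
    void close();

    // Newly added method: writers that keep no footer simply report null.
    default Object writerMetadata() {
        return null;
    }
}

// An existing implementation compiles unchanged and inherits the null default.
final class PlainWriter implements WriterSketch {
    @Override
    public void close() {
        // Nothing to cache for this format.
    }
}

// A Parquet-like writer overrides the method to expose what it cached at close.
final class FooterCachingWriter implements WriterSketch {
    private Object footer;

    @Override
    public void close() {
        // Placeholder value standing in for the real footer object.
        footer = "footer";
    }

    @Override
    public Object writerMetadata() {
        return footer;
    }
}
```

Callers treat a null result as "no in-memory metadata" and take the file-based path, which is why existing writer implementations need no changes.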

Documentation

No.

Generative AI tooling

Generated-by: Claude Opus 4.6

@JingsongLi JingsongLi closed this Mar 20, 2026
@JingsongLi JingsongLi reopened this Mar 20, 2026

Development

Successfully merging this pull request may close these issues.

[Bug] the close() of the OSS stream may not guarantee the file is visible/readable on OSS immediately

2 participants