Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@JFinis has been working on a proposal to better store statistics for floating point values in Parquet. The most recent proposal is here
In order to change the format, there need to be at least two open source implementations of a proposal.
There is also some question (see this link from @tustvold) about how complex this would be to implement and get right.
Describe the solution you'd like
I would like to implement a draft of the specification in apache/parquet-format#514 in arrow-rs to show it is possible and keep the Rust implementation on the leading edge of implementation.
Describe alternatives you've considered
We would also need to implement the nan_count field, along with filtering out NaNs when writing statistics for floats.
Some good tests would be to
- Write floating point data (specified below) to a parquet file
- Read the metadata back and verify min/max values and nan_count for the following cases
  - A column with no NaN values
  - A column with a single +NaN value (should not appear in stats)
  - A column with a single -NaN value (should not appear in stats)
  - A column of only NaN values
  - A column with Inf and some +/- NaNs
  - A column with -Inf and some +/- NaNs
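To make the expected behavior in the cases above concrete, here is a minimal sketch of the NaN filtering in plain Rust. It is not the arrow-rs or parquet crate API: `FloatStats` and `float_stats` are hypothetical names used only for illustration, and it operates on an in-memory slice rather than a real Parquet file. It also assumes that an all-NaN column ends up with no min/max at all; what the draft specification actually requires in that case would need to be checked against apache/parquet-format#514.

```rust
/// Hypothetical summary of one float column; NOT the arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
pub struct FloatStats {
    /// Smallest non-NaN value, or None if there is none (assumption).
    pub min: Option<f64>,
    /// Largest non-NaN value, or None if there is none (assumption).
    pub max: Option<f64>,
    /// Number of NaN values seen, regardless of sign bit.
    pub nan_count: u64,
}

/// Compute min/max while filtering out NaNs and counting them separately.
pub fn float_stats(values: &[f64]) -> FloatStats {
    let mut stats = FloatStats { min: None, max: None, nan_count: 0 };
    for &v in values {
        // `is_nan` is true for both +NaN and -NaN.
        if v.is_nan() {
            stats.nan_count += 1;
            continue;
        }
        // Keep the current bound unless the new value beats it.
        stats.min = Some(match stats.min {
            Some(m) if m <= v => m,
            _ => v,
        });
        stats.max = Some(match stats.max {
            Some(m) if m >= v => m,
            _ => v,
        });
    }
    stats
}
```

With this sketch, `float_stats(&[1.0, f64::NAN, 2.0])` yields a min of 1.0, a max of 2.0, and a nan_count of 1, which is the shape of result the real tests would read back from the file metadata.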
Additional context