Describe the bug, including details regarding any error messages, version, and platform.
Tested with parquet-cli v1.16.0 on OpenJDK 11 and 17.
- nation.dict-malformed.parquet
> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat nation.dict-malformed.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file nation.dict-malformed.parquet
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
at org.apache.parquet.cli.Main.run(Main.java:169)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: java.lang.RuntimeException: Failed while reading Parquet file: nation.dict-malformed.parquet
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:360)
at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
... 3 more
Caused by: java.io.EOFException
at org.apache.parquet.bytes.SingleBufferInputStream.sliceBuffers(SingleBufferInputStream.java:134)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAsBytesInput(ParquetFileReader.java:2100)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1990)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1920)
at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1454)
at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1188)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:1135)
at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1380)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
... 6 more
This seems related to an issue with an older version of the Java writer: apache/arrow#42298
It has been fixed in the C++/Python implementation by apache/parquet-cpp#209 (the file loads fine with pyArrow), but possibly not in the Hadoop reader.
Links to the fix and its test in the current version:
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/file_reader.cc#L199
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/reader_test.cc#L977
Note that the file can be read by the old parquet-tools (tested with v1.10.1).
- fixed_length_byte_array.parquet
pyArrow (and parquet-tools) also fail to read this one, so I think it is a wider problem. I'll open an issue on their repository; this report is mainly to let you know.
> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat fixed_length_byte_array.parquet
{"flba_field": [0, 0, 3, -24]}
[...]
{"flba_field": [0, 0, 3, -122]}
Unknown error
java.lang.RuntimeException: Failed on record 90 in file fixed_length_byte_array.parquet
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
at org.apache.parquet.cli.Main.run(Main.java:169)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 92 in block 0 in file file:/home/ccleva/dev/tlabs-data/tablesaw-parquet/target/test/data/parquet-testing-master/data/fixed_length_byte_array.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
at org.apache.parquet.cli.BaseCommand$1$1.next(BaseCommand.java:350)
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [flba_field] required fixed_len_byte_array(4) flba_field at value 92 out of 1000, 92 out of 100 in currentPage. repetition level: 0, definition level: 0
at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:604)
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:477)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:425)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:249)
... 7 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read bytes at offset 364
at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:47)
at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:411)
at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:579)
... 12 more
Caused by: java.io.EOFException
at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:45)
... 14 more
I recreated the file using the script in apache/parquet-testing#31 in case it was corrupted, but got the same result (at a different offset).
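For anyone who wants to rule out a grossly truncated download before digging into the reader, here is a minimal sketch of a framing check. It relies only on the Parquet file layout (a 4-byte `PAR1` magic at both ends and a 4-byte little-endian footer length just before the trailing magic). The function name `check_parquet_framing` is hypothetical, the check assumes an unencrypted footer, and it says nothing about page-level problems like the ones in the traces above.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def check_parquet_framing(path):
    """Return (ok, message) for the outer framing of a Parquet file.

    Hypothetical helper: it only validates the magic bytes and the footer
    length, not the column chunks or pages inside the file.
    """
    size = os.path.getsize(path)
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if size < 12:
        return False, "file too small to be a Parquet file"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        return False, "missing PAR1 magic at start or end"
    # The 4 bytes before the trailing magic hold the footer length.
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len + 12 > size:
        return False, "footer length exceeds file size"
    return True, "framing looks consistent (footer is %d bytes)" % footer_len
```

A passing result only means the file was not cut off or mangled at the ends; a file can pass this check and still fail in the reader as shown above.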
Component(s)
No response