
Hadoop parquet reader (and parquet-CLI) fails on some files from parquet-testing #3336

@ccleva

Description


Describe the bug, including details regarding any error messages, version, and platform.

Tested using v1.16.0 on OpenJDK 11 and 17.

  1. nation.dict-malformed.parquet
> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat nation.dict-malformed.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file nation.dict-malformed.parquet
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
	at org.apache.parquet.cli.Main.run(Main.java:169)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: java.lang.RuntimeException: Failed while reading Parquet file: nation.dict-malformed.parquet
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:360)
	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
	... 3 more
Caused by: java.io.EOFException
	at org.apache.parquet.bytes.SingleBufferInputStream.sliceBuffers(SingleBufferInputStream.java:134)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAsBytesInput(ParquetFileReader.java:2100)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1990)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1920)
	at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1454)
	at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:1188)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:1135)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1380)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
	... 6 more

This seems related to an issue with an older version of the Java writer: apache/arrow#42298

It was fixed on the C++/Python side by apache/parquet-cpp#209 (the file loads fine with PyArrow), but apparently not in the Hadoop reader?

Links to the fix and its test in the current version:
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/file_reader.cc#L199
https://github.com/apache/arrow/blob/64f2055ffb68e5077420f4253e76d78952438cab/cpp/src/parquet/reader_test.cc#L977

Note that the file can be read by the old parquet-tools (tested with v1.10.1).
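For reference, my reading of the guard added by the linked C++ fix, sketched as plain Python (the function name `column_chunk_start` is mine; the two offsets correspond to the `data_page_offset` and `dictionary_page_offset` fields in the column chunk's Thrift metadata):

```python
def column_chunk_start(data_page_offset, dictionary_page_offset):
    """Pick the byte offset where a column chunk's pages begin.

    Some old writers set dictionary_page_offset to a bogus value
    (0, or an offset *after* the first data page), so the dictionary
    offset is only trusted when it is positive and precedes the data
    pages; otherwise reading starts at the first data page.
    """
    if dictionary_page_offset is not None and 0 < dictionary_page_offset < data_page_offset:
        return dictionary_page_offset
    return data_page_offset

# Well-formed chunk: the dictionary page precedes the data pages.
assert column_chunk_start(data_page_offset=128, dictionary_page_offset=4) == 4
# Malformed metadata (as in nation.dict-malformed.parquet): ignore the bogus offset.
assert column_chunk_start(data_page_offset=4, dictionary_page_offset=128) == 4
assert column_chunk_start(data_page_offset=4, dictionary_page_offset=0) == 4
```

A reader without this guard seeks to the bad dictionary offset and runs off the end of the chunk's buffer, which would match the `EOFException` above.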

  2. fixed_length_byte_array.parquet

PyArrow (and parquet-tools) also fail to read this one, so I think it's a wider problem. I'll open an issue on their repository; this one is mostly to let you know.

> java -cp parquet-cli-1.16.0.jar:dependency/* org.apache.parquet.cli.Main cat fixed_length_byte_array.parquet
{"flba_field": [0, 0, 3, -24]}
[...]
{"flba_field": [0, 0, 3, -122]}
Unknown error
java.lang.RuntimeException: Failed on record 90 in file fixed_length_byte_array.parquet
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
	at org.apache.parquet.cli.Main.run(Main.java:169)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:197)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 92 in block 0 in file file:/home/ccleva/dev/tlabs-data/tablesaw-parquet/target/test/data/parquet-testing-master/data/fixed_length_byte_array.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
	at org.apache.parquet.cli.BaseCommand$1$1.next(BaseCommand.java:350)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
	... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [flba_field] required fixed_len_byte_array(4) flba_field at value 92 out of 1000, 92 out of 100 in currentPage. repetition level: 0, definition level: 0
	at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:604)
	at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
	at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:477)
	at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:425)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:249)
	... 7 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read bytes at offset 364
	at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:47)
	at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:411)
	at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:579)
	... 12 more
Caused by: java.io.EOFException
	at org.apache.parquet.bytes.SingleBufferInputStream.slice(SingleBufferInputStream.java:116)
	at org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesReader.readBytes(FixedLenByteArrayPlainValuesReader.java:45)
	... 14 more

I recreated the file using the script in apache/parquet-testing#31 in case it was corrupted, but got the same result (at a different offset).
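One cheap way to rule out simple truncation when the reader hits EOF like this is to check the file-layout invariants from the Parquet format spec: the file must start and end with the 4-byte magic `b"PAR1"`, and a 4-byte little-endian footer length sits just before the trailing magic. A stdlib-only sketch (the helper name `check_parquet_trailer` is hypothetical):

```python
import struct

def check_parquet_trailer(buf: bytes) -> int:
    """Sanity-check a Parquet file's trailer and return the footer length.

    Layout per the Parquet spec:
      b"PAR1" | ...pages... | footer | footer_len (uint32 LE) | b"PAR1"
    """
    if len(buf) < 12 or buf[:4] != b"PAR1" or buf[-4:] != b"PAR1":
        raise ValueError("missing PAR1 magic")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    # Footer plus its length word and trailing magic must fit after the
    # 4-byte leading magic.
    if footer_len + 8 > len(buf) - 4:
        raise ValueError("footer length exceeds file size; file truncated?")
    return footer_len
```

This only validates the trailer, not the page data, so a file like the one above can pass this check and still fail mid-column; it passing would point at the page contents or metadata offsets rather than truncation.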

Component(s)

No response
