feat: add list type support to ArrowWriter #149
jding-xyz wants to merge 7 commits into apache:main
Conversation
Could you add a test for this?
Pull request overview
Adds Arrow IPC writer support for list<T> (including nested list<struct<...>>) by extending schema/type serialization and record-batch node/buffer/data emission, plus a couple of related correctness fixes in streaming output and nested builders.
Changes:
- Add FlatBuffers schema/type emission for Arrow list in ArrowWriterHelper.
- Extend ArrowWriter to serialize list field children, field nodes, buffers, and nested record-batch data; fix nested field-node ordering and ensure schema message padding in streaming format.
- Fix nested appendAny behavior for StructArrayBuilder/ListArrayBuilder and make the ArrowField initializer public for external schema construction.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Sources/Arrow/ArrowWriterHelper.swift | Map Arrow list to FlatBuffers type enum and create List FB table. |
| Sources/Arrow/ArrowWriter.swift | Serialize list children/nodes/buffers/data; fix nested node ordering; pad schema metadata in streaming. |
| Sources/Arrow/ArrowArrayBuilder.swift | Override appendAny in nested builders to correctly distribute to child builders. |
| Sources/Arrow/ArrowSchema.swift | Make ArrowField initializer public for external schema construction. |
    // FlatBuffer vectors use prepend semantics: last-written element becomes
    // the first when read. Arrow IPC requires depth-first pre-order (parent
    // before children), so children must be written before their parent here.
    if let nestedType = column.type as? ArrowTypeStruct {
        let nestedArray = column.array as? NestedArray
        if let nestedFields = nestedArray?.fields {
            writeFieldNodes(nestedType.fields, columns: nestedFields, offsets: &offsets, fbb: &fbb)
        }
    } else if let listType = column.type as? ArrowTypeList {
        if let nestedArray = column.array as? NestedArray, let valuesHolder = nestedArray.values {
            writeFieldNodes([listType.elementField], columns: [valuesHolder], offsets: &offsets, fbb: &fbb)
        }
    }
    let fieldNode =
        org_apache_arrow_flatbuf_FieldNode(length: Int64(column.length),
                                           nullCount: Int64(column.nullCount))
    offsets.append(fbb.create(struct: fieldNode))
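The ordering these lines enforce can be illustrated with a small stdlib-only Python sketch (a hypothetical column model, not the ArrowWriter types): each column emits its own field node before recursing into its children, which is exactly the depth-first pre-order Arrow IPC requires.

```python
# Hypothetical column model; each column contributes one field node, and
# nested columns (struct/list) contribute their children's nodes after it.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    length: int
    null_count: int
    children: list = field(default_factory=list)

def field_nodes(columns):
    """Flatten columns into field nodes in depth-first pre-order."""
    nodes = []
    for col in columns:
        nodes.append((col.name, col.length, col.null_count))  # parent first
        nodes.extend(field_nodes(col.children))               # then children
    return nodes

# a batch with a flat column and a list column whose child holds 7 values
batch = [Column("ints", 3, 0),
         Column("tags", 3, 1, children=[Column("item", 7, 0)])]
assert field_nodes(batch) == [("ints", 3, 0), ("tags", 3, 1), ("item", 7, 0)]
```

A reader walking the nodes alongside the flattened buffers depends on this order; emitting a child node before its parent is what external readers reject.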
Add IPC-level unit tests for list columns (at least list<int32> and list<struct<...>>) that round-trip through ArrowWriter.writeStreaming/ArrowReader.readStreaming. This change adds list handling and adjusts nested field-node ordering; without regression tests it’s easy to reintroduce invalid node/buffer ordering that external readers reject.
Added testListInt32RBInMemoryToFromStream and testListStructRBInMemoryToFromStream in IPCTests.
    public func writeStreaming(_ info: ArrowWriter.Info) -> Result<Data, ArrowError> {
        let writer: any DataWriter = InMemDataWriter()
        switch toMessage(info.schema) {
-       case .success(let schemaData):
+       case .success(var schemaData):
+           addPadForAlignment(&schemaData)
            withUnsafeBytes(of: CONTINUATIONMARKER.littleEndian) { writer.append(Data($0)) }
            withUnsafeBytes(of: UInt32(schemaData.count).littleEndian) { writer.append(Data($0)) }
            writer.append(schemaData)
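The invariant this change restores is easy to state concretely: message metadata length must be a multiple of 8 so the record-batch body that follows starts at an 8-byte-aligned offset. A minimal stdlib-only Python sketch of the padding rule (an analogue of what addPadForAlignment does, not its actual implementation):

```python
# Pad a metadata payload with zero bytes up to the next 8-byte boundary,
# as the Arrow IPC encapsulated-message format requires.
def pad_to_8(data: bytes) -> bytes:
    pad = (8 - len(data) % 8) % 8   # 0..7 bytes of padding
    return data + b"\x00" * pad

meta = b"\x01" * 13                  # pretend 13-byte schema metadata
padded = pad_to_8(meta)
assert len(padded) % 8 == 0          # body offset is now 8-byte aligned
```

Strict zero-copy readers (e.g. pyarrow backed by the Rust FFI, as the commit notes) rely on this alignment when mapping buffers directly out of the stream.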
Consider adding a regression test that asserts the schema message metadata length written by writeStreaming is padded to an 8-byte multiple (and therefore the first record-batch body starts at an 8-byte-aligned offset). This is a spec requirement and a subtle invariant that can regress without a targeted test.
Added testStreamingSchemaMetadataPadding
    public override func appendAny(_ val: Any?) {
        self.append(val as? [Any?])
    }
Add a unit test that exercises appendAny on StructArrayBuilder and ListArrayBuilder (especially list<struct<...>>), since this change fixes a correctness issue where appendAny previously bypassed child-builder distribution. Existing tests cover appendAny for primitive builders but not nested builders, so this could regress without targeted coverage.
Added testStructArrayBuilderAppendAny and testListOfStructAppendAny
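The bug this override fixes can be reproduced in a small Python analogue (hypothetical classes, not the Swift builders): the base class's type-erased entry point writes to its own buffer directly, skipping the subclass append that distributes values to child builders; delegating it to append restores the distribution.

```python
# Hypothetical analogue of the nested-builder dispatch bug and its fix.
class BaseBuilder:
    def __init__(self):
        self.buffer = []
    def append(self, val):
        self.buffer.append(val)
    def append_any(self, val):           # base: writes the buffer directly,
        self.buffer.append(val)          # bypassing subclass overrides

class StructBuilder(BaseBuilder):
    def __init__(self, child_names):
        super().__init__()
        self.children = {n: BaseBuilder() for n in child_names}
    def append(self, val):               # distributes to child builders
        super().append(val is not None)  # own buffer tracks validity
        for name, child in self.children.items():
            child.append(None if val is None else val.get(name))
    def append_any(self, val):           # the fix: delegate to append
        self.append(val)

b = StructBuilder(["x", "y"])
b.append_any({"x": 1, "y": 2})
assert b.children["x"].buffer == [1]     # child builders now receive data
```

Without the override, the struct's own buffer grows while every child stays empty, which is exactly the "struct arrays with empty children" symptom described in the commit message below.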
    public override func appendAny(_ val: Any?) {
        self.append(val as? [Any?])
    }
Add a unit test that calls ListArrayBuilder.appendAny with non-nil lists and verifies the produced array’s values child contains the appended elements. This override is fixing a nested-builder correctness bug; having a regression test here would help prevent silent reintroductions.
Added testListArrayBuilderAppendAny.
ArrowWriterHelper.toFBTypeEnum and toFBType now handle .list, and ArrowWriter.writeField/writeFieldNodes/writeBufferInfo/writeRecordBatchData recurse into ArrowTypeList children the same way they already do for ArrowTypeStruct. This unblocks writing schemas that use list<T> columns (e.g. the cloud stream server 'raw' schema with list<float32>).

Also moves the nested-type recursion in writeBufferInfo and writeRecordBatchData outside the inner buffer loop so it runs once per column rather than once per buffer.

Made-with: Cursor
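For background on what the writer must emit per list column: Arrow stores list<T> as a validity bitmap plus an offsets buffer over a separate child values array, so after emitting the list's own buffers the writer has to recurse into the child. A stdlib-only Python sketch of the offsets layout (illustrative only, not the library's encoding helpers):

```python
# Build the logical pieces of an Arrow list<T> column for a list of
# Python lists: validity flags, offsets into the values child, and the
# flattened child values themselves.
def list_buffers(lists):
    validity, offsets, values = [], [0], []
    for lst in lists:
        validity.append(lst is not None)
        if lst is not None:
            values.extend(lst)
        offsets.append(len(values))   # null slots repeat the last offset
    return validity, offsets, values

v, o, vals = list_buffers([[1, 2], None, [3]])
assert v == [True, False, True]
assert o == [0, 2, 2, 3]              # slot i spans values[o[i]:o[i+1]]
assert vals == [1, 2, 3]
```

Note the child array's length (here 3) differs from the list column's length (also 3 only by coincidence of this example); the writer must report each as its own field node, which is why the recursion above matters.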
writeFieldNodes iterates in reverse to match FlatBuffers prepend semantics, but wrote parent field nodes before recursing into children. This placed child field nodes before their parent in the final vector, violating Arrow IPC's depth-first pre-order requirement. pyarrow rejects such streams with "Array length did not match record batch length". Move child recursion (for both struct and list types) before the parent fbb.create call so children are prepended first and end up after their parent in the resulting vector. Made-with: Cursor
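The prepend behavior described above can be sketched directly: because each write goes to the front of the vector, writing children before their parent leaves the parent first (pre-order) in the finished vector.

```python
# FlatBuffers-style vectors are built back-to-front: each write prepends,
# so the last element written is the first element read back.
vec = []
for node in ["child_b", "child_a", "parent"]:    # write order
    vec.insert(0, node)                          # prepend semantics
assert vec == ["parent", "child_a", "child_b"]   # read order: pre-order
```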
The Arrow IPC spec requires all message metadata to be padded to a multiple of 8 bytes. writeFile already does this via addPadForAlignment, but writeStreaming did not. Without padding, the record batch body can start at a non-8-byte-aligned offset, causing alignment violations when strict readers (pyarrow + Rust FFI) create zero-copy buffers from the stream. Made-with: Cursor
The base ArrowArrayBuilder.appendAny calls bufferBuilder.append directly, bypassing the overridden append methods in StructArrayBuilder and ListArrayBuilder that distribute values to child builders. This causes list<struct<...>> columns to produce struct arrays with empty children, since the struct's field builders never receive data. Override appendAny in both classes to delegate to self.append, ensuring child builder distribution runs for nested types like list<struct<...>>. Made-with: Cursor
ArrowField is a public class but its initializer was internal, preventing external consumers from constructing ArrowTypeStruct instances with explicit fields. This is needed for list<struct<...>> column schemas. Made-with: Cursor
- IPC round-trip tests for list<int32> and list<struct<string, float>> via writeStreaming/readStreaming with full data value verification
- Regression test asserting streaming schema metadata is padded to 8-byte multiples (Arrow IPC spec requirement)
- Unit tests for StructArrayBuilder.appendAny and ListArrayBuilder.appendAny verifying child builder distribution
- End-to-end test for list<struct<...>> appendAny chain

Made-with: Cursor
Force-pushed cbb1006 to df1c4dc
Done. Added IPC round-trip tests and appendAny unit tests. All tests pass with
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
@willtemperley Could you review this?
Summary
ArrowWriter cannot serialize list<T> columns. This PR adds list type support and fixes several related issues:

- List type support (ArrowWriterHelper.swift, ArrowWriter.swift): Added .list to toFBTypeEnum/toFBType, and list handling in writeField, writeFieldNodes, writeBufferInfo, and writeRecordBatchData, mirroring the existing struct pattern.
- Field-node ordering fix (ArrowWriter.swift): Moved child recursion before the parent fbb.create in writeFieldNodes so the IPC output follows depth-first pre-order as required by the spec. Without this, pyarrow rejects the stream.
- Streaming alignment fix (ArrowWriter.swift): Added addPadForAlignment in writeStreaming to match writeFile; the spec requires 8-byte-aligned metadata.
- appendAny fix for nested builders (ArrowArrayBuilder.swift): Overrode appendAny in StructArrayBuilder and ListArrayBuilder to delegate to self.append, so child builders receive data correctly.
- Public ArrowField init (ArrowSchema.swift): Made ArrowField.init public so external consumers can construct ArrowTypeStruct with explicit fields.

Testing
Added IPC round-trip tests (testListInt32RBInMemoryToFromStream, testListStructRBInMemoryToFromStream), a streaming schema padding assertion (testStreamingSchemaMetadataPadding), and appendAny unit tests for nested builders (testStructArrayBuilderAppendAny, testListArrayBuilderAppendAny, testListOfStructAppendAny). Also validated end-to-end against pyarrow. All existing tests continue to pass.

AI disclosure
An AI coding assistant (Cursor with Claude) was used during development. All changes were manually reviewed, debugged, and verified against the Arrow IPC spec and by round-tripping against pyarrow.