
feat: add list type support to ArrowWriter #149

Open
jding-xyz wants to merge 7 commits into apache:main from jding-xyz:feat/list-type-writer-support

Conversation

@jding-xyz
Contributor

@jding-xyz commented Mar 19, 2026

Summary

ArrowWriter cannot serialize list<T> columns. This PR adds list type support and fixes several related issues:

  • List type writer support (ArrowWriterHelper.swift, ArrowWriter.swift): Added .list to toFBTypeEnum/toFBType, and list handling in writeField, writeFieldNodes, writeBufferInfo, writeRecordBatchData — mirrors the existing struct pattern.
  • Field-node ordering fix (ArrowWriter.swift): Moved child recursion before parent fbb.create in writeFieldNodes so the IPC output follows depth-first pre-order as required by the spec. Without this, pyarrow rejects the stream.
  • Streaming schema padding (ArrowWriter.swift): Added addPadForAlignment in writeStreaming to match writeFile — the spec requires 8-byte-aligned metadata.
  • appendAny fix for nested builders (ArrowArrayBuilder.swift): Overrode appendAny in StructArrayBuilder and ListArrayBuilder to delegate to self.append, so child builders receive data correctly.
  • Public ArrowField init (ArrowSchema.swift): Made ArrowField.init public so external consumers can construct ArrowTypeStruct with explicit fields.
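For context on what the writer now has to emit: an Arrow list<T> column is encoded as an int32 offsets buffer over a flattened values child, where offsets[i]..offsets[i+1] delimits row i. A minimal sketch of that layout in plain Swift, independent of this codebase:

```swift
// Build the Arrow-style offsets buffer for a list<int32> column.
// offsets[i]..offsets[i+1] delimits row i's slice of the values child.
func listOffsets(_ rows: [[Int32]]) -> (offsets: [Int32], values: [Int32]) {
    var offsets: [Int32] = [0]
    var values: [Int32] = []
    for row in rows {
        values.append(contentsOf: row)
        offsets.append(Int32(values.count))
    }
    return (offsets, values)
}

let (offsets, values) = listOffsets([[1, 2], [3], []])
// offsets == [0, 2, 3, 3], values == [1, 2, 3]
```

This is why the writer must recurse into the list's child: the values live in a separate child array with its own field node and buffers.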

Testing

Added IPC round-trip tests (testListInt32RBInMemoryToFromStream, testListStructRBInMemoryToFromStream), a streaming schema padding assertion (testStreamingSchemaMetadataPadding), and appendAny unit tests for nested builders (testStructArrayBuilderAppendAny, testListArrayBuilderAppendAny, testListOfStructAppendAny). Also validated end-to-end against pyarrow. All existing tests continue to pass.

AI disclosure

An AI coding assistant (Cursor with Claude) was used during development. All changes were manually reviewed, debugged, and verified against the Arrow IPC spec and by round-tripping against pyarrow.

@kou kou requested a review from Copilot March 25, 2026 09:42
@kou
Member

kou commented Mar 25, 2026

Could you add a test for this?

Contributor

Copilot AI left a comment


Pull request overview

Adds Arrow IPC writer support for list<T> (including nested list<struct<...>>) by extending schema/type serialization and record-batch node/buffer/data emission, plus a couple of related correctness fixes in streaming output and nested builders.

Changes:

  • Add FlatBuffers schema/type emission for Arrow list in ArrowWriterHelper.
  • Extend ArrowWriter to serialize list field children, field nodes, buffers, and nested record-batch data; fix nested field-node ordering and ensure schema message padding in streaming format.
  • Fix nested appendAny behavior for StructArrayBuilder / ListArrayBuilder and make ArrowField initializer public for external schema construction.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

  • Sources/Arrow/ArrowWriterHelper.swift — Map Arrow list to FlatBuffers type enum and create List FB table.
  • Sources/Arrow/ArrowWriter.swift — Serialize list children/nodes/buffers/data; fix nested node ordering; pad schema metadata in streaming.
  • Sources/Arrow/ArrowArrayBuilder.swift — Override appendAny in nested builders to correctly distribute to child builders.
  • Sources/Arrow/ArrowSchema.swift — Make ArrowField initializer public for external schema construction.


Comment on lines +188 to +204
// FlatBuffer vectors use prepend semantics: last-written element becomes
// the first when read. Arrow IPC requires depth-first pre-order (parent
// before children), so children must be written before their parent here.
if let nestedType = column.type as? ArrowTypeStruct {
    let nestedArray = column.array as? NestedArray
    if let nestedFields = nestedArray?.fields {
        writeFieldNodes(nestedType.fields, columns: nestedFields, offsets: &offsets, fbb: &fbb)
    }
} else if let listType = column.type as? ArrowTypeList {
    if let nestedArray = column.array as? NestedArray, let valuesHolder = nestedArray.values {
        writeFieldNodes([listType.elementField], columns: [valuesHolder], offsets: &offsets, fbb: &fbb)
    }
}
let fieldNode =
    org_apache_arrow_flatbuf_FieldNode(length: Int64(column.length),
                                       nullCount: Int64(column.nullCount))
offsets.append(fbb.create(struct: fieldNode))

Copilot AI Mar 25, 2026


Add IPC-level unit tests for list columns (at least list<int32> and list<struct<...>>) that round-trip through ArrowWriter.writeStreaming/ArrowReader.readStreaming. This change adds list handling and adjusts nested field-node ordering; without regression tests it’s easy to reintroduce invalid node/buffer ordering that external readers reject.

Contributor Author


Added testListInt32RBInMemoryToFromStream and testListStructRBInMemoryToFromStream in IPCTests.

Comment on lines 372 to 379
 public func writeStreaming(_ info: ArrowWriter.Info) -> Result<Data, ArrowError> {
     let writer: any DataWriter = InMemDataWriter()
     switch toMessage(info.schema) {
-    case .success(let schemaData):
+    case .success(var schemaData):
+        addPadForAlignment(&schemaData)
         withUnsafeBytes(of: CONTINUATIONMARKER.littleEndian) {writer.append(Data($0))}
         withUnsafeBytes(of: UInt32(schemaData.count).littleEndian) {writer.append(Data($0))}
         writer.append(schemaData)

Copilot AI Mar 25, 2026


Consider adding a regression test that asserts the schema message metadata length written by writeStreaming is padded to an 8-byte multiple (and therefore the first record-batch body starts at an 8-byte-aligned offset). This is a spec requirement and a subtle invariant that can regress without a targeted test.

Contributor Author


Added testStreamingSchemaMetadataPadding.

Comment on lines +149 to +151
public override func appendAny(_ val: Any?) {
self.append(val as? [Any?])
}

Copilot AI Mar 25, 2026


Add a unit test that exercises appendAny on StructArrayBuilder and ListArrayBuilder (especially list<struct<...>>), since this change fixes a correctness issue where appendAny previously bypassed child-builder distribution. Existing tests cover appendAny for primitive builders but not nested builders, so this could regress without targeted coverage.

Contributor Author


Added testStructArrayBuilderAppendAny and testListOfStructAppendAny.

Comment on lines +193 to +195
public override func appendAny(_ val: Any?) {
self.append(val as? [Any?])
}

Copilot AI Mar 25, 2026


Add a unit test that calls ListArrayBuilder.appendAny with non-nil lists and verifies the produced array’s values child contains the appended elements. This override is fixing a nested-builder correctness bug; having a regression test here would help prevent silent reintroductions.

Contributor Author


Added testListArrayBuilderAppendAny.

ArrowWriterHelper.toFBTypeEnum and toFBType now handle .list,
and ArrowWriter.writeField/writeFieldNodes/writeBufferInfo/
writeRecordBatchData recurse into ArrowTypeList children the
same way they already do for ArrowTypeStruct.

This unblocks writing schemas that use list<T> columns
(e.g. the cloud stream server 'raw' schema with list<float32>).

Also moves the nested-type recursion in writeBufferInfo and
writeRecordBatchData outside the inner buffer loop so it runs
once per column rather than once per buffer.

Made-with: Cursor
writeFieldNodes iterates in reverse to match FlatBuffers prepend
semantics, but wrote parent field nodes before recursing into
children. This placed child field nodes before their parent in
the final vector, violating Arrow IPC's depth-first pre-order
requirement. pyarrow rejects such streams with "Array length did
not match record batch length".

Move child recursion (for both struct and list types) before the
parent fbb.create call so children are prepended first and end up
after their parent in the resulting vector.

Made-with: Cursor
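The ordering argument above can be seen with a simplified model of a FlatBuffers-style vector (plain Swift, not the real FlatBufferBuilder API): elements are built back-to-front, so whatever is written last is read first.

```swift
// Simplified model of FlatBuffers prepend semantics: each write puts the
// new element in front of everything written so far, so write order is
// the reverse of read order.
struct PrependVector<T> {
    private(set) var elements: [T] = []
    mutating func prepend(_ value: T) { elements.insert(value, at: 0) }
}

var nodes = PrependVector<String>()
// The fixed writeFieldNodes order: recurse into children first (in
// reverse, to preserve their relative order), then write the parent.
nodes.prepend("child.b")
nodes.prepend("child.a")
nodes.prepend("parent")
// Read order is parent-first, the depth-first pre-order the spec requires:
// nodes.elements == ["parent", "child.a", "child.b"]
```

Writing the parent first, as the old code did, would instead leave the children in front of their parent on the read side.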
The Arrow IPC spec requires all message metadata to be padded to
a multiple of 8 bytes. writeFile already does this via
addPadForAlignment, but writeStreaming did not. Without padding,
the record batch body can start at a non-8-byte-aligned offset,
causing alignment violations when strict readers (pyarrow + Rust
FFI) create zero-copy buffers from the stream.

Made-with: Cursor
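The padding rule itself is simple to state in code. A minimal sketch of the 8-byte alignment step (the function name here is illustrative, mirroring what addPadForAlignment does in the writer):

```swift
import Foundation

// Pad a metadata blob to the next multiple of 8 bytes with zeros, as the
// Arrow IPC format requires for message metadata, so the record-batch
// body that follows starts at an 8-byte-aligned offset.
func padTo8(_ data: inout Data) {
    let remainder = data.count % 8
    if remainder != 0 {
        data.append(Data(repeating: 0, count: 8 - remainder))
    }
}

var metadata = Data(repeating: 0xFF, count: 13)
padTo8(&metadata)
// metadata.count == 16
```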
The base ArrowArrayBuilder.appendAny calls bufferBuilder.append directly,
bypassing the overridden append methods in StructArrayBuilder and
ListArrayBuilder that distribute values to child builders. This causes
list<struct<...>> columns to produce struct arrays with empty children,
since the struct's field builders never receive data.

Override appendAny in both classes to delegate to self.append, ensuring
child builder distribution runs for nested types like list<struct<...>>.

Made-with: Cursor
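The bug class here is plain dynamic dispatch: if the base class's appendAny writes to its own buffer directly, an overridden append(_:) in a subclass never runs. A self-contained sketch (names illustrative, not the real Arrow builder API):

```swift
// Base builder whose appendAny delegates to append, so subclasses that
// override append (to distribute values to child builders) see the data.
class Builder {
    var baseCount = 0
    func append(_ value: Int?) { baseCount += 1 }
    func appendAny(_ value: Any?) { append(value as? Int) } // delegate, don't bypass
}

class NestedBuilder: Builder {
    var childCount = 0
    override func append(_ value: Int?) {
        super.append(value)
        childCount += 1 // stands in for distributing to child builders
    }
}

let builder: Builder = NestedBuilder()
builder.appendAny(42)
// Delegation dispatches to the override: childCount == 1, baseCount == 1.
```

Had appendAny written to the base buffer directly, childCount would stay 0, which is the "empty children" symptom the commit describes.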
ArrowField is a public class but its initializer was internal, preventing
external consumers from constructing ArrowTypeStruct instances with
explicit fields. This is needed for list<struct<...>> column schemas.

Made-with: Cursor
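This is a consequence of Swift access control: members of a public class default to internal, so without an explicit public initializer the class cannot be instantiated from another module. A minimal illustration (not the real ArrowField declaration):

```swift
// Members of a public class default to internal access, so the
// initializer must be marked public explicitly to be callable from
// outside the module.
public final class Field {
    public let name: String
    public init(_ name: String) { self.name = name } // previously internal
}

let field = Field("ids")
// field.name == "ids"
```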
- IPC round-trip tests for list<int32> and list<struct<string, float>>
  via writeStreaming/readStreaming with full data value verification
- Regression test asserting streaming schema metadata is padded to
  8-byte multiples (Arrow IPC spec requirement)
- Unit tests for StructArrayBuilder.appendAny and
  ListArrayBuilder.appendAny verifying child builder distribution
- End-to-end test for list<struct<...>> appendAny chain

Made-with: Cursor
@jding-xyz force-pushed the feat/list-type-writer-support branch from cbb1006 to df1c4dc on March 26, 2026 at 06:44
@jding-xyz
Contributor Author

> Could you add a test for this?

Done. Added IPC round-trip tests and appendAny unit tests. All tests pass with `swift test`.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.



@kou
Member

kou commented Mar 30, 2026

@willtemperley Could you review this?
