feat: add list type support to ArrowWriter #149
jding-xyz wants to merge 7 commits into apache:main
Conversation
Could you add a test for this?
Pull request overview
Adds Arrow IPC writer support for list<T> (including nested list<struct<...>>) by extending schema/type serialization and record-batch node/buffer/data emission, plus a couple of related correctness fixes in streaming output and nested builders.
Changes:
- Add FlatBuffers schema/type emission for Arrow list in ArrowWriterHelper.
- Extend ArrowWriter to serialize list field children, field nodes, buffers, and nested record-batch data; fix nested field-node ordering and ensure schema message padding in streaming format.
- Fix nested appendAny behavior for StructArrayBuilder/ListArrayBuilder and make the ArrowField initializer public for external schema construction.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Sources/Arrow/ArrowWriterHelper.swift | Map Arrow list to FlatBuffers type enum and create List FB table. |
| Sources/Arrow/ArrowWriter.swift | Serialize list children/nodes/buffers/data; fix nested node ordering; pad schema metadata in streaming. |
| Sources/Arrow/ArrowArrayBuilder.swift | Override appendAny in nested builders to correctly distribute to child builders. |
| Sources/Arrow/ArrowSchema.swift | Make ArrowField initializer public for external schema construction. |
    // FlatBuffer vectors use prepend semantics: last-written element becomes
    // the first when read. Arrow IPC requires depth-first pre-order (parent
    // before children), so children must be written before their parent here.
    if let nestedType = column.type as? ArrowTypeStruct {
        let nestedArray = column.array as? NestedArray
        if let nestedFields = nestedArray?.fields {
            writeFieldNodes(nestedType.fields, columns: nestedFields, offsets: &offsets, fbb: &fbb)
        }
    } else if let listType = column.type as? ArrowTypeList {
        if let nestedArray = column.array as? NestedArray, let valuesHolder = nestedArray.values {
            writeFieldNodes([listType.elementField], columns: [valuesHolder], offsets: &offsets, fbb: &fbb)
        }
    }
    let fieldNode =
        org_apache_arrow_flatbuf_FieldNode(length: Int64(column.length),
                                           nullCount: Int64(column.nullCount))
    offsets.append(fbb.create(struct: fieldNode))
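The ordering these lines enforce can be illustrated with a small stdlib-only Python sketch (a hypothetical column model, not the ArrowWriter types): each column emits its own field node before recursing into its children, which is exactly the depth-first pre-order Arrow IPC requires.

```python
# Hypothetical column model; each column contributes one field node, and
# nested columns (struct/list) contribute their children's nodes after it.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    length: int
    null_count: int
    children: list = field(default_factory=list)

def field_nodes(columns):
    """Flatten columns into field nodes in depth-first pre-order."""
    nodes = []
    for col in columns:
        nodes.append((col.name, col.length, col.null_count))  # parent first
        nodes.extend(field_nodes(col.children))               # then children
    return nodes

# a batch with a flat column and a list column whose child holds 7 values
batch = [Column("ints", 3, 0),
         Column("tags", 3, 1, children=[Column("item", 7, 0)])]
assert field_nodes(batch) == [("ints", 3, 0), ("tags", 3, 1), ("item", 7, 0)]
```

A reader walking the nodes alongside the flattened buffers depends on this order; emitting a child node before its parent is what external readers reject.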
Add IPC-level unit tests for list columns (at least list<int32> and list<struct<...>>) that round-trip through ArrowWriter.writeStreaming/ArrowReader.readStreaming. This change adds list handling and adjusts nested field-node ordering; without regression tests it’s easy to reintroduce invalid node/buffer ordering that external readers reject.
Added testListInt32RBInMemoryToFromStream and testListStructRBInMemoryToFromStream in IPCTests.
    public func writeStreaming(_ info: ArrowWriter.Info) -> Result<Data, ArrowError> {
        let writer: any DataWriter = InMemDataWriter()
        switch toMessage(info.schema) {
-       case .success(let schemaData):
+       case .success(var schemaData):
+           addPadForAlignment(&schemaData)
            withUnsafeBytes(of: CONTINUATIONMARKER.littleEndian) { writer.append(Data($0)) }
            withUnsafeBytes(of: UInt32(schemaData.count).littleEndian) { writer.append(Data($0)) }
            writer.append(schemaData)
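The invariant this change restores is easy to state concretely: message metadata length must be a multiple of 8 so the record-batch body that follows starts at an 8-byte-aligned offset. A minimal stdlib-only Python sketch of the padding rule (an analogue of what addPadForAlignment does, not its actual implementation):

```python
# Pad a metadata payload with zero bytes up to the next 8-byte boundary,
# as the Arrow IPC encapsulated-message format requires.
def pad_to_8(data: bytes) -> bytes:
    pad = (8 - len(data) % 8) % 8   # 0..7 bytes of padding
    return data + b"\x00" * pad

meta = b"\x01" * 13                  # pretend 13-byte schema metadata
padded = pad_to_8(meta)
assert len(padded) % 8 == 0          # body offset is now 8-byte aligned
```

Strict zero-copy readers (e.g. pyarrow backed by the Rust FFI, as the commit notes) rely on this alignment when mapping buffers directly out of the stream.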
Consider adding a regression test that asserts the schema message metadata length written by writeStreaming is padded to an 8-byte multiple (and therefore the first record-batch body starts at an 8-byte-aligned offset). This is a spec requirement and a subtle invariant that can regress without a targeted test.
Added testStreamingSchemaMetadataPadding
    public override func appendAny(_ val: Any?) {
        self.append(val as? [Any?])
    }
Add a unit test that exercises appendAny on StructArrayBuilder and ListArrayBuilder (especially list<struct<...>>), since this change fixes a correctness issue where appendAny previously bypassed child-builder distribution. Existing tests cover appendAny for primitive builders but not nested builders, so this could regress without targeted coverage.
Added testStructArrayBuilderAppendAny and testListOfStructAppendAny
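The bug this override fixes can be reproduced in a small Python analogue (hypothetical classes, not the Swift builders): the base class's type-erased entry point writes to its own buffer directly, skipping the subclass append that distributes values to child builders; delegating it to append restores the distribution.

```python
# Hypothetical analogue of the nested-builder dispatch bug and its fix.
class BaseBuilder:
    def __init__(self):
        self.buffer = []
    def append(self, val):
        self.buffer.append(val)
    def append_any(self, val):           # base: writes the buffer directly,
        self.buffer.append(val)          # bypassing subclass overrides

class StructBuilder(BaseBuilder):
    def __init__(self, child_names):
        super().__init__()
        self.children = {n: BaseBuilder() for n in child_names}
    def append(self, val):               # distributes to child builders
        super().append(val is not None)  # own buffer tracks validity
        for name, child in self.children.items():
            child.append(None if val is None else val.get(name))
    def append_any(self, val):           # the fix: delegate to append
        self.append(val)

b = StructBuilder(["x", "y"])
b.append_any({"x": 1, "y": 2})
assert b.children["x"].buffer == [1]     # child builders now receive data
```

Without the override, the struct's own buffer grows while every child stays empty, which is exactly the "struct arrays with empty children" symptom described in the commit message below.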
    public override func appendAny(_ val: Any?) {
        self.append(val as? [Any?])
    }
Add a unit test that calls ListArrayBuilder.appendAny with non-nil lists and verifies the produced array’s values child contains the appended elements. This override is fixing a nested-builder correctness bug; having a regression test here would help prevent silent reintroductions.
Added testListArrayBuilderAppendAny.
ArrowWriterHelper.toFBTypeEnum and toFBType now handle .list, and ArrowWriter.writeField/writeFieldNodes/writeBufferInfo/writeRecordBatchData recurse into ArrowTypeList children the same way they already do for ArrowTypeStruct. This unblocks writing schemas that use list<T> columns (e.g. the cloud stream server 'raw' schema with list<float32>).

Also moves the nested-type recursion in writeBufferInfo and writeRecordBatchData outside the inner buffer loop so it runs once per column rather than once per buffer.

Made-with: Cursor
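For background on what the writer must emit per list column: Arrow stores list<T> as a validity bitmap plus an offsets buffer over a separate child values array, so after emitting the list's own buffers the writer has to recurse into the child. A stdlib-only Python sketch of the offsets layout (illustrative only, not the library's encoding helpers):

```python
# Build the logical pieces of an Arrow list<T> column for a list of
# Python lists: validity flags, offsets into the values child, and the
# flattened child values themselves.
def list_buffers(lists):
    validity, offsets, values = [], [0], []
    for lst in lists:
        validity.append(lst is not None)
        if lst is not None:
            values.extend(lst)
        offsets.append(len(values))   # null slots repeat the last offset
    return validity, offsets, values

v, o, vals = list_buffers([[1, 2], None, [3]])
assert v == [True, False, True]
assert o == [0, 2, 2, 3]              # slot i spans values[o[i]:o[i+1]]
assert vals == [1, 2, 3]
```

Note the child array's length (here 3) differs from the list column's length (also 3 only by coincidence of this example); the writer must report each as its own field node, which is why the recursion above matters.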
writeFieldNodes iterates in reverse to match FlatBuffers prepend semantics, but wrote parent field nodes before recursing into children. This placed child field nodes before their parent in the final vector, violating Arrow IPC's depth-first pre-order requirement. pyarrow rejects such streams with "Array length did not match record batch length". Move child recursion (for both struct and list types) before the parent fbb.create call so children are prepended first and end up after their parent in the resulting vector. Made-with: Cursor
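The prepend behavior described above can be sketched directly: because each write goes to the front of the vector, writing children before their parent leaves the parent first (pre-order) in the finished vector.

```python
# FlatBuffers-style vectors are built back-to-front: each write prepends,
# so the last element written is the first element read back.
vec = []
for node in ["child_b", "child_a", "parent"]:    # write order
    vec.insert(0, node)                          # prepend semantics
assert vec == ["parent", "child_a", "child_b"]   # read order: pre-order
```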
The Arrow IPC spec requires all message metadata to be padded to a multiple of 8 bytes. writeFile already does this via addPadForAlignment, but writeStreaming did not. Without padding, the record batch body can start at a non-8-byte-aligned offset, causing alignment violations when strict readers (pyarrow + Rust FFI) create zero-copy buffers from the stream. Made-with: Cursor
The base ArrowArrayBuilder.appendAny calls bufferBuilder.append directly, bypassing the overridden append methods in StructArrayBuilder and ListArrayBuilder that distribute values to child builders. This causes list<struct<...>> columns to produce struct arrays with empty children, since the struct's field builders never receive data. Override appendAny in both classes to delegate to self.append, ensuring child builder distribution runs for nested types like list<struct<...>>. Made-with: Cursor
ArrowField is a public class but its initializer was internal, preventing external consumers from constructing ArrowTypeStruct instances with explicit fields. This is needed for list<struct<...>> column schemas. Made-with: Cursor
- IPC round-trip tests for list<int32> and list<struct<string, float>> via writeStreaming/readStreaming with full data value verification
- Regression test asserting streaming schema metadata is padded to 8-byte multiples (Arrow IPC spec requirement)
- Unit tests for StructArrayBuilder.appendAny and ListArrayBuilder.appendAny verifying child builder distribution
- End-to-end test for list<struct<...>> appendAny chain

Made-with: Cursor
Force-pushed cbb1006 to df1c4dc
Done. Added IPC round-trip tests and appendAny unit tests. All tests pass with
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
@willtemperley Could you review this?
Summary
ArrowWriter cannot serialize list<T> columns. This PR adds list type support and fixes several related issues:

- List type support (ArrowWriterHelper.swift, ArrowWriter.swift): Added .list to toFBTypeEnum/toFBType, and list handling in writeField, writeFieldNodes, writeBufferInfo, and writeRecordBatchData, mirroring the existing struct pattern.
- Field-node ordering fix (ArrowWriter.swift): Moved child recursion before the parent fbb.create in writeFieldNodes so the IPC output follows depth-first pre-order as required by the spec. Without this, pyarrow rejects the stream.
- Streaming alignment fix (ArrowWriter.swift): Added addPadForAlignment in writeStreaming to match writeFile; the spec requires 8-byte-aligned metadata.
- appendAny fix for nested builders (ArrowArrayBuilder.swift): Overrode appendAny in StructArrayBuilder and ListArrayBuilder to delegate to self.append, so child builders receive data correctly.
- Public ArrowField init (ArrowSchema.swift): Made ArrowField.init public so external consumers can construct ArrowTypeStruct with explicit fields.

Testing
Added IPC round-trip tests (testListInt32RBInMemoryToFromStream, testListStructRBInMemoryToFromStream), a streaming schema padding assertion (testStreamingSchemaMetadataPadding), and appendAny unit tests for nested builders (testStructArrayBuilderAppendAny, testListArrayBuilderAppendAny, testListOfStructAppendAny). Also validated end-to-end against pyarrow. All existing tests continue to pass.

AI disclosure
An AI coding assistant (Cursor with Claude) was used during development. All changes were manually reviewed, debugged, and verified against the Arrow IPC spec and by round-tripping against pyarrow.