fix: correct VariableBufferBuilder null handling, first-value resize, and offset semantics #152
jding-xyz wants to merge 8 commits into apache:main
Conversation
Preserve and update nulls.length during resize and append so all three variable-width buffers follow the same bookkeeping pattern. Add a regression test for nil-first then long-value sequences. Made-with: Cursor
@willtemperley @maltzsama Could you review this?
Pull request overview
Fixes correctness and Arrow-spec compliance issues for variable-width (String/Binary) arrays by enforcing N+1 offsets semantics, correct null offset behavior, and consistent logical length tracking across builders, reader, and C data import.
Changes:
- Reworked `VariableBufferBuilder.append`/`resize`/`finish` to maintain Arrow invariants (N+1 offsets, nulls don't advance offsets, values/offsets/nulls logical lengths updated on each append).
- Updated `ArrowReader`/`ArrowCImporter`/`ArrowData` to interpret variable-width buffers using N+1 offsets and derive value buffer lengths from the final offset.
- Added regression tests for long-first-value, null offset behavior, binary parity, and IPC streaming round-trip.
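For readers unfamiliar with the variable-width layout, the N+1 offsets convention these changes enforce can be sketched in Python (illustrative only; the actual implementation is Swift, and `encode_variable_width` is a hypothetical helper, not part of this PR):

```python
# Sketch of the Arrow variable-width layout this PR enforces. For N rows
# there are N+1 Int32 offsets; a null row repeats the previous offset and
# contributes no value bytes; the values buffer length equals the last offset.

def encode_variable_width(rows):
    """Build (validity, offsets, values) for a list of bytes-or-None rows."""
    validity, offsets, values = [], [0], bytearray()
    for row in rows:
        if row is None:
            validity.append(0)
            offsets.append(offsets[-1])   # null: repeat previous offset
        else:
            validity.append(1)
            values.extend(row)
            offsets.append(len(values))   # next offset = running byte total
    return validity, offsets, bytes(values)

validity, offsets, values = encode_variable_width([b"a", None, b"bbb"])
# offsets == [0, 1, 1, 4]; values == b"abbb"; len(offsets) == len(rows) + 1
```

Note that even an empty array carries one offset (the initial `0`), which is why the array length can always be derived as `len(offsets) - 1`.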
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| Tests/ArrowTests/IPCTests.swift | Adds IPC streaming round-trip regression test for long first string + null handling. |
| Tests/ArrowTests/CDataTests.swift | Loosens capacity assertion to account for new/variable growth behavior. |
| Tests/ArrowTests/ArrayTests.swift | Adds raw-buffer assertions for offsets/value-length semantics for String/Binary builders. |
| Sources/Arrow/ArrowReader.swift | Reads N+1 offsets and sizes value buffer by last offset for variable-width arrays. |
| Sources/Arrow/ArrowData.swift | Derives variable-width array length from offsets length (offsets.count - 1). |
| Sources/Arrow/ArrowCImporter.swift | Imports variable-width offsets as N+1 entries (and derives value length from last offset). |
| Sources/Arrow/ArrowBufferBuilder.swift | Fixes variable-width builder semantics: null handling, first-value resizing, length tracking, N+1 offsets. |
| Sources/Arrow/ArrowBuffer.swift | Adds internal updateLength helper to keep logical lengths in sync. |
Comments suppressed due to low confidence (1)
Sources/Arrow/ArrowCImporter.swift:144
- In the C data importer, `appendToBuffer` sets `ArrowBuffer.capacity` equal to the passed `length`, but `capacity` is treated as a byte count elsewhere (e.g., the IPC writer uses `buffer.capacity` for buffer sizes). Passing `length + 1` for the offsets buffer makes `capacity` represent an element count, not bytes, which can lead to incorrect serialization sizes for imported arrays. Consider updating `appendToBuffer` to accept an element size (or an explicit byteLength) so the offsets buffer uses `(length + 1) * MemoryLayout<Int32>.stride` bytes for capacity (while keeping `length` as the element count).
appendToBuffer(cArray.buffers[0], arrowBuffers: &arrowBuffers, length: (length + 7) / 8)
appendToBuffer(cArray.buffers[1], arrowBuffers: &arrowBuffers, length: length + 1)
let lastOffsetLength = cArray.buffers[1]!
.advanced(by: Int(length) * MemoryLayout<Int32>.stride)
.load(as: Int32.self)
appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))
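The element-count vs. byte-count distinction in this comment can be illustrated with a small Python sketch (hypothetical helper name; the real code is Swift):

```python
# Illustrates the reviewer's point: for an Int32 offsets buffer, the element
# count and the byte capacity differ by the element stride (4 bytes). Using
# the element count where bytes are expected undersizes the buffer.

INT32_STRIDE = 4  # MemoryLayout<Int32>.stride in Swift

def offsets_byte_capacity(row_count):
    """Byte capacity needed for the N+1 Int32 offsets of N rows."""
    return (row_count + 1) * INT32_STRIDE

# 10 rows need 11 offsets, i.e. 44 bytes of capacity, not 11.
```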
 let arrowOffsetBuffer = makeBuffer(offsetBuffer, fileData: loadInfo.fileData,
-                                   length: UInt(node.length), messageOffset: loadInfo.messageOffset)
+                                   length: UInt(node.length + 1), messageOffset: loadInfo.messageOffset)
 let lastOffset = arrowOffsetBuffer.rawPointer
     .advanced(by: Int(node.length) * MemoryLayout<Int32>.stride)
     .load(as: Int32.self)
 let arrowValueBuffer = makeBuffer(valueBuffer, fileData: loadInfo.fileData,
-                                  length: UInt(node.length), messageOffset: loadInfo.messageOffset)
+                                  length: UInt(lastOffset), messageOffset: loadInfo.messageOffset)
loadVariableData derives the values buffer logical length from the last offset without validating it. If the IPC input is malformed, lastOffset can be negative or larger than the underlying valueBuffer.length, and the subsequent UInt(lastOffset) conversion / later value reads can crash or read out of bounds. Add guards to ensure the offset buffer is large enough to read the last offset, lastOffset >= 0, and lastOffset <= valueBuffer.length (bytes) before creating the value buffer (otherwise return .failure(.invalid(...))).
Added guards to reject negative offsets and offsets exceeding the value buffer length, returning .failure(.invalid(...)) with a descriptive message. This is consistent with the existing guard chain in the method.
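A minimal Python sketch of the guard logic described above (illustrative only; the actual guards are Swift `guard` statements returning `.failure(.invalid(...))`, and `validate_last_offset` is a hypothetical name):

```python
# Sketch of the last-offset validation: reject a last offset that is
# negative or larger than the values buffer before using it to size the
# values buffer, so malformed IPC input fails with an error instead of
# trapping or reading out of bounds.

def validate_last_offset(last_offset, value_buffer_len):
    """Return last_offset if it is a valid byte length for the values buffer."""
    if last_offset < 0:
        raise ValueError(
            f"Negative last offset ({last_offset}) in variable-width buffer")
    if last_offset > value_buffer_len:
        raise ValueError(
            f"Last offset ({last_offset}) exceeds value buffer length ({value_buffer_len})")
    return last_offset
```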
Sources/Arrow/ArrowData.swift
Outdated
let arrayLength: UInt
switch arrowType.info {
case .variableInfo:
    arrayLength = max(buffers[1].length, 1) - 1
The variable-width ArrowData convenience init silently maps an offsets buffer length of 0 to an array length of 0 via max(buffers[1].length, 1) - 1. That avoids underflow but can also mask invalid input (variable-width arrays should have at least 1 offset). Consider validating buffers.count >= 2 and buffers[1].length >= 1 for .variableInfo and throwing an .invalid(...) error when the invariant is violated.
Suggested change:

-    arrayLength = max(buffers[1].length, 1) - 1
+    guard buffers.count >= 2 else {
+        throw ArrowError.invalid("Variable-width ArrowData requires at least two buffers (null bitmap and offsets).")
+    }
+    let offsetsLength = buffers[1].length
+    guard offsetsLength >= 1 else {
+        throw ArrowError.invalid("Variable-width ArrowData requires a non-empty offsets buffer.")
+    }
+    arrayLength = offsetsLength - 1
Modified as suggested. Also updated checkHolderForType in the test to provide spec-compliant dummy buffers for variable-width types.
if isNull {
if let val = newValue {
    let binData = getBytesFor(val)!
    nextOffset += Int32(binData.count)
VariableBufferBuilder.append uses Int32 offsets and does nextOffset += Int32(binData.count). If a single value (or the running total) exceeds Int32.max, the Int32(...) conversion will trap or the addition can overflow, causing a crash. Consider explicitly checking that binData.count and the computed nextOffset stay within Int32.max (and fail fast with a clear error/precondition), or gate this builder to types that are guaranteed to fit Int32 offsets.
Suggested change:

-    nextOffset += Int32(binData.count)
+    // Ensure per-value size fits into Int32.
+    precondition(
+        binData.count <= Int(Int32.max),
+        "VariableBufferBuilder.append: element size exceeds Int32.max"
+    )
+    // Compute the next offset in a wider type to detect overflow.
+    let proposedOffset = Int64(currentOffset) + Int64(binData.count)
+    precondition(
+        proposedOffset <= Int64(Int32.max),
+        "VariableBufferBuilder.append: total buffer size exceeds Int32.max"
+    )
+    nextOffset = Int32(proposedOffset)
Modified as suggested.
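A Python sketch of the adopted overflow check (illustrative; Python integers are unbounded, so the Swift `Int64` widening is modeled by comparing against `Int32.max` explicitly, and `next_offset` is a hypothetical name):

```python
# Model of the overflow guard: check the individual element size first
# (more specific diagnostic), then the cumulative total, before committing
# the new offset.

INT32_MAX = 2**31 - 1  # Swift Int32.max

def next_offset(current_offset, value_byte_count):
    """Compute the next Arrow offset, failing fast on Int32 overflow."""
    if value_byte_count > INT32_MAX:
        raise OverflowError("element size exceeds Int32.max")
    # Unbounded arithmetic here plays the role of the Int64 widening.
    proposed = current_offset + value_byte_count
    if proposed > INT32_MAX:
        raise OverflowError("total buffer size exceeds Int32.max")
    return proposed
```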
Address review feedback on PR apache#152:
- Validate lastOffset is non-negative and within value buffer bounds before constructing the value buffer in ArrowReader.loadVariableData
- Replace silent max(_, 1) - 1 underflow guard in ArrowData with an explicit throw when the offsets buffer is missing or empty
- Widen offset arithmetic to Int64 in VariableBufferBuilder.append and add a precondition for Int32 overflow
- Update checkHolderForType test to provide spec-compliant buffers for variable-width types (3 buffers with a valid initial offset)

Made-with: Cursor
Adopt Copilot's suggested form: two separate guards for missing buffers vs. empty offsets buffer, giving clearer diagnostics on failure. Made-with: Cursor
Add a separate precondition for the individual element size before checking the cumulative total, giving a more specific diagnostic when a single value exceeds Int32.max. Made-with: Cursor
I think this is a real issue, but it needs more consideration and is out of scope for this PR. We should address it later.
Addressed all review comments. Please take a look. Thanks!
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
Sources/Arrow/ArrowCImporter.swift:144
- In the variable-width import path, `lastOffsetLength` is loaded as `Int32` and then immediately converted to `UInt` for the values buffer length. If the C data is malformed (negative last offset) this will trap. Also `cArray.buffers[1]!` will crash if the offsets buffer pointer is nil. Add guards to ensure the offsets buffer exists and `lastOffsetLength >= 0` (and ideally that it's consistent with the provided value buffer) before constructing the Swift buffers.
appendToBuffer(cArray.buffers[1], arrowBuffers: &arrowBuffers, length: length + 1)
let lastOffsetLength = cArray.buffers[1]!
.advanced(by: Int(length) * MemoryLayout<Int32>.stride)
.load(as: Int32.self)
appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))
Sources/Arrow/ArrowReader.swift
Outdated
guard lastOffset >= 0 else {
    return .failure(.invalid("Negative last offset (\(lastOffset)) in variable-width buffer"))
}
guard lastOffset <= valueBuffer.length else {
loadVariableData compares lastOffset (Int32) to valueBuffer.length (Int64). As written (lastOffset <= valueBuffer.length) this won’t compile in Swift due to mismatched integer types. Convert lastOffset to Int64 (or valueBuffer.length to Int32 after range-checking) before comparing.
Suggested change:

-    guard lastOffset <= valueBuffer.length else {
+    guard Int64(lastOffset) <= valueBuffer.length else {
Good finding. Modified as suggested.
Sources/Arrow/ArrowData.swift
Outdated
guard buffers.count >= 2 else {
    throw ArrowError.invalid(
        "Variable-width ArrowData requires at least two buffers (null bitmap and offsets).")
For .variableInfo, this convenience init only validates buffers.count >= 2, but the rest of the codebase (e.g. StringArray/BinaryArray) assumes the [nullBitmap, offsets, values] layout and will access buffers[2]. Allowing only 2 buffers can create an ArrowData that crashes later at runtime. Consider requiring buffers.count >= 3 for variable-width types (or otherwise validating that the values buffer exists).
Suggested change:

-    guard buffers.count >= 2 else {
-        throw ArrowError.invalid(
-            "Variable-width ArrowData requires at least two buffers (null bitmap and offsets).")
+    guard buffers.count >= 3 else {
+        throw ArrowError.invalid(
+            "Variable-width ArrowData requires at least three buffers (null bitmap, offsets, and values).")
Good catch. Modified as suggested.
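The combined buffer-layout invariants from this review thread can be sketched in Python (illustrative; `variable_width_length` is a hypothetical name, not the Swift API):

```python
# Model of the variable-width invariants: the array needs the three-buffer
# [validity, offsets, values] layout, the offsets buffer must hold at least
# one entry, and the array length is derived as len(offsets) - 1.

def variable_width_length(buffers):
    """Validate a variable-width buffer layout and return the row count."""
    if len(buffers) < 3:
        raise ValueError(
            "Variable-width data requires validity, offsets, and values buffers")
    offsets = buffers[1]
    if len(offsets) < 1:
        raise ValueError(
            "Variable-width data requires a non-empty offsets buffer")
    return len(offsets) - 1

# Offsets [0, 1, 1, 4] describe 3 rows.
```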
Made-with: Cursor
- Use explicit Int64 conversion in lastOffset comparison for clarity
- Require 3 buffers (not 2) for variable-width types since downstream code accesses buffers[2] for the values buffer

Made-with: Cursor
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
Sources/Arrow/ArrowCImporter.swift:144
- In `ArrowCImporter`, the offsets buffer is appended with `length: length + 1`, but `appendToBuffer` sets `capacity` equal to `length`. For offsets, `capacity` needs to be the byte size (`(length + 1) * MemoryLayout<Int32>.stride`) so that downstream `getBufferDataSizes` / IPC writing includes the full offsets buffer. Consider changing `appendToBuffer` to accept both a logical element count and a byteCapacity (or compute byteCapacity based on buffer kind) and set `ArrowBuffer.capacity` appropriately.
appendToBuffer(cArray.buffers[0], arrowBuffers: &arrowBuffers, length: (length + 7) / 8)
appendToBuffer(cArray.buffers[1], arrowBuffers: &arrowBuffers, length: length + 1)
let lastOffsetLength = cArray.buffers[1]!
.advanced(by: Int(length) * MemoryLayout<Int32>.stride)
.load(as: Int32.self)
appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))
Sources/Arrow/ArrowCImporter.swift:144
- `lastOffsetLength` is read from the imported offsets buffer and then converted to `UInt` for the values buffer length. If the C data contains a negative last offset, `UInt(lastOffsetLength)` will trap at runtime (DoS). Add a `guard lastOffsetLength >= 0` (and ideally validate monotonic offsets / that the last offset fits within the provided values buffer) and return `.failure(.invalid(...))` on malformed input.
let lastOffsetLength = cArray.buffers[1]!
.advanced(by: Int(length) * MemoryLayout<Int32>.stride)
.load(as: Int32.self)
appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))
let emptyPtr = UnsafeMutableRawPointer.allocate(byteCount: 0, alignment: .zero)
let buffers: [ArrowBuffer]
switch checkType.info {
case .variableInfo:
    let offsetPtr = UnsafeMutableRawPointer.allocate(byteCount: MemoryLayout<Int32>.stride, alignment: 4)
    offsetPtr.storeBytes(of: Int32(0), as: Int32.self)
    buffers = [
        ArrowBuffer(length: 0, capacity: 0, rawPointer: emptyPtr),
        ArrowBuffer(length: 1, capacity: UInt(MemoryLayout<Int32>.stride), rawPointer: offsetPtr),
        ArrowBuffer(length: 0, capacity: 0,
                    rawPointer: UnsafeMutableRawPointer.allocate(byteCount: 0, alignment: .zero))
    ]
default:
    buffers = [
        ArrowBuffer(length: 0, capacity: 0, rawPointer: emptyPtr),
        ArrowBuffer(length: 0, capacity: 0,
                    rawPointer: UnsafeMutableRawPointer.allocate(byteCount: 0, alignment: .zero))
    ]
}
checkHolderForType reuses the same emptyPtr for multiple ArrowBuffer instances that all default to isMemoryOwner: true. When these buffers are deallocated this will double-free the same pointer and can crash tests / trigger undefined behavior. Allocate a distinct empty pointer per buffer (e.g. use ArrowBuffer.createEmptyBuffer()), or mark shared buffers with isMemoryOwner: false so only one owner deallocates.
@kou Unfortunately I believe Arrow Swift has serious architectural issues, as exemplified here, which is why I have gone ahead with an almost complete rewrite: [1] https://github.com/willtemperley/swift-arrow/ I have completely rewritten the type system, with the new version based on arrow-rs. In addition to the types supported by Arrow Swift, I've added binary views and large-offset types. I've built a test harness that uses the Arrow-Gold files, and it passes the gold tests, including a round-trip through pyarrow. I've also made the IPC zero-copy; Arrow Swift currently copies all file data into in-memory buffers. Maybe we could come up with a roadmap to create an Arrow-Swift v2 using this work, but under the Apache umbrella.
OK. Could you open an issue that focuses on how to proceed with it?
Problem
`VariableBufferBuilder<T>` (used by both `StringArrayBuilder` and `BinaryArrayBuilder`) has several correctness issues that can cause memory corruption and produce Arrow-invalid output:

1. **First-value buffer overflow.** The `append()` method only checks values buffer capacity inside `if index > 0`, so the first non-nil value can overflow the initial 64-byte allocation if it is larger than that. This corrupts memory silently.
2. **Null slots write phantom bytes.** Null entries create a 4-byte zero placeholder and copy it into the values buffer. Per the Arrow columnar format, null variable-width slots should repeat the previous offset and append zero value bytes.
3. **Offsets buffer is sized for N entries instead of N+1.** The Arrow spec requires N+1 offsets for N rows. The old `finish()` creates an offsets buffer with `length` entries, and the reader similarly expects `length` offsets, leading to an off-by-one in the offset buffer.
4. **Logical buffer lengths are never updated.** `values.length` and `offsets.length` stay at their initial values (0) after construction. `finish()` then uses stale lengths; in particular, the values buffer is created with `length = 0` regardless of actual content.
5. **Resize checks the wrong metric.** `value_resize()` compares against `values.length` (stale logical length) instead of `values.capacity` (actual allocation).

These issues were discovered while using `StringArrayBuilder` in a production IPC-based data pipeline, where the first string value in a column was typically longer than 64 bytes, causing silent memory corruption during batch construction.

Fix
Rewrote `VariableBufferBuilder<T>.append(_:)` around explicit Arrow invariants rather than patching individual branches:

- `currentOffset` is read from the offsets buffer uniformly for all indices (including row 0), instead of special-casing the first row.
- Non-null value: `nextOffset = currentOffset + byteCount`, ensure values buffer capacity, copy bytes.
- Null: `nextOffset = currentOffset`, and no value bytes are written.
- `self.length`, `nulls.length`, `offsets.length`, and `values.length` are updated consistently after every append.
- The `Int32(0)` entry is written at construction (the required first offset).

Updated the corresponding resize and finish methods:

- `value_resize()` now checks `capacity` instead of `length`.
- `resize()` decouples null bitmap and offset buffer growth, checks against `capacity`, and preserves logical lengths across reallocation.
- `finish()` creates the offsets buffer with `length + 1` entries.

Propagated the N+1 offset convention to the reader and C data importer:

- `ArrowReader`: the offset buffer length is now `node.length + 1`; the values buffer length is derived from the last offset entry.
- `ArrowCImporter`: the offset buffer length changed from `length` to `length + 1`.
- `ArrowData` convenience init: for variable-width types, derives the array length as `offsets.length - 1`.

Testing
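The regression tests below exercise builder behavior that can be sketched in Python (an illustrative model with hypothetical names, not the Swift implementation; doubling the allocation on growth is one plausible policy):

```python
# Model of the fixed append flow: read the current offset uniformly, let a
# null repeat it without writing bytes, grow the values buffer by doubling
# when needed (even for the very first value), and keep the logical length
# in sync on every append.

class VariableBuilderSketch:
    def __init__(self, initial_capacity=64):
        self.offsets = [0]                  # required first offset, set at construction
        self.values = bytearray(initial_capacity)
        self.values_len = 0                 # logical length, tracked separately from capacity
        self.validity = []

    def append(self, value):
        current = self.offsets[-1]          # uniform for all rows, including row 0
        if value is None:
            self.validity.append(0)
            self.offsets.append(current)    # null: repeat previous offset, write no bytes
            return
        needed = self.values_len + len(value)
        while needed > len(self.values):    # compare against capacity, not stale length
            self.values.extend(bytearray(len(self.values) or 1))  # double the allocation
        self.values[self.values_len:needed] = value
        self.values_len = needed
        self.validity.append(1)
        self.offsets.append(needed)         # N rows -> N+1 offsets
```

With an initial capacity of 4, appending a 256-byte first value triggers growth instead of overflowing, and a null row leaves the values buffer untouched.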
Added targeted regression tests with raw-buffer assertions (not just same-library round-trip checks):
- A 256-byte first value verifies offsets `[0, 256]` and values buffer length = 256.
- `["a", nil, "bbb"]` verifies offsets = `[0, 1, 1, 4]` and values buffer length = 4.
- `[nil, 512-byte string]` verifies offsets = `[0, 0, 512]`.
- Binary parity via `BinaryArrayBuilder`.
- A streaming IPC round-trip of `[256-byte string, nil, "tail"]` through `writeStreaming`/`readStreaming`, verifying schema, row count, decoded values, and null handling.

All existing tests continue to pass. The 5 pre-existing failures on `main` (missing generated test data files) are unrelated.

AI disclosure
This fix was developed with AI assistance (Cursor with Claude). The bug was identified through real-world usage in a project that uses Arrow IPC for data serialization: string columns with longer first values caused memory corruption during batch construction. The root cause was then traced by manual code inspection of `VariableBufferBuilder.append()` against the Arrow columnar format spec, and documented in a detailed plan before any code was written. All generated code was reviewed line by line, traced through concrete examples, and validated against `swift test`. The approach of writing regression tests first, then fixing the implementation, was deliberate so the tests could catch regressions independently of the fix.

Made with Cursor