fix: correct VariableBufferBuilder null handling, first-value resize, and offset semantics#152

Open
jding-xyz wants to merge 8 commits into apache:main from jding-xyz:fix/variable-buffer-builder-safety

Conversation

@jding-xyz
Contributor

Problem

VariableBufferBuilder<T> (used by both StringArrayBuilder and BinaryArrayBuilder) has several correctness issues that can cause memory corruption and produce Arrow-invalid output:

  1. First-value buffer overflow. The append() method only checks values-buffer capacity inside an if index > 0 branch, so a first non-nil value larger than the initial 64-byte allocation overflows it. This corrupts memory silently.

  2. Null slots write phantom bytes. Null entries create a 4-byte zero placeholder and copy it into the values buffer. Per the Arrow columnar format, null variable-width slots should repeat the previous offset and append zero value bytes.

  3. Offsets buffer is sized for N entries instead of N+1. The Arrow spec requires N+1 offsets for N rows. The old finish() creates an offsets buffer with length entries, and the reader similarly expects length offsets, leading to an off-by-one in the offset buffer.

  4. Logical buffer lengths are never updated. values.length and offsets.length stay at their initial values (0) after construction. finish() then uses stale lengths — in particular, the values buffer is created with length = 0 regardless of actual content.

  5. Resize checks the wrong metric. value_resize() compares against values.length (stale logical length) instead of values.capacity (actual allocation).

These issues were discovered while using StringArrayBuilder in a production IPC-based data pipeline, where the first string value in a column was typically longer than 64 bytes, causing silent memory corruption during batch construction.
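The layout these bugs violate can be stated compactly. As a language-neutral illustration (a Python sketch with hypothetical names, not code from this PR or from the Swift implementation), this is the variable-width layout the Arrow spec requires: N+1 offsets for N rows, nulls repeating the previous offset, and no placeholder bytes in the values buffer:

```python
def build_variable_width(values):
    """Spec-correct Arrow variable-width layout: N+1 offsets,
    nulls repeat the previous offset and contribute no value bytes."""
    offsets = [0]          # the required initial offset
    data = bytearray()
    validity = []
    for v in values:
        if v is None:
            offsets.append(offsets[-1])  # null: repeat previous offset
            validity.append(False)
        else:
            data.extend(v)
            offsets.append(offsets[-1] + len(v))
            validity.append(True)
    return offsets, bytes(data), validity

offsets, data, validity = build_variable_width([b"a", None, b"bbb"])
assert offsets == [0, 1, 1, 4]   # the null does not advance the offset
assert data == b"abbb"           # no 4-byte phantom placeholder
assert len(offsets) == 3 + 1     # N+1 offsets for N rows
```

Under the buggy behavior described in point 2, the same input would instead produce a values buffer of 8 bytes ("a", four zero bytes, "bbb"), which is not a valid Arrow layout.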

Fix

Rewrote VariableBufferBuilder<T>.append(_:) around explicit Arrow invariants rather than patching individual branches:

  • Read currentOffset from the offsets buffer uniformly for all indices (including row 0), instead of special-casing the first row.
  • For non-null values: compute nextOffset = currentOffset + byteCount, ensure values buffer capacity, copy bytes.
  • For null values: keep nextOffset = currentOffset, do not write any bytes.
  • Store the next offset, then update self.length, nulls.length, offsets.length, and values.length consistently after every append.
  • Initialize the offsets buffer with a single Int32(0) entry at construction (the required first offset).
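The bullets above can be sketched as a rough Python model (hypothetical names; the actual implementation is the Swift VariableBufferBuilder, and the 64-byte initial capacity is taken from the problem description): the current offset is read uniformly for every row, capacity is checked against the actual allocation rather than the stale logical length, and logical lengths are updated after every append:

```python
class VariableBufferBuilderModel:
    """Python model of the fixed append path; illustrative only."""
    INITIAL_CAPACITY = 64

    def __init__(self):
        self.capacity = self.INITIAL_CAPACITY  # actual allocation, in bytes
        self.length = 0                        # logical length, starts at 0
        self.offsets = [0]                     # initialized with the first offset
        self.values = bytearray(self.capacity)

    def _ensure_value_capacity(self, needed):
        # Check against capacity, not the stale logical length (issue 5).
        if needed > self.capacity:
            new_capacity = self.capacity
            while new_capacity < needed:
                new_capacity *= 2
            self.values.extend(bytearray(new_capacity - self.capacity))
            self.capacity = new_capacity

    def append(self, value):
        current = self.offsets[-1]        # uniform for all rows, including row 0
        if value is None:
            self.offsets.append(current)  # null: no bytes written
            return
        next_offset = current + len(value)
        self._ensure_value_capacity(next_offset)  # also runs for the first value
        self.values[current:next_offset] = value
        self.offsets.append(next_offset)
        self.length = next_offset         # logical length kept in sync (issue 4)

b = VariableBufferBuilderModel()
b.append(b"x" * 256)          # first value larger than the 64-byte allocation
assert b.offsets == [0, 256]
assert b.capacity >= 256      # resized instead of silently overflowing
assert b.length == 256
```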

Updated the corresponding resize and finish methods:

  • value_resize() now checks capacity instead of length.
  • resize() decouples null bitmap and offset buffer growth, checks against capacity, and preserves logical lengths across reallocation.
  • finish() creates the offsets buffer with length + 1 entries.

Propagated the N+1 offset convention to the reader and C data importer:

  • ArrowReader: offset buffer length is now node.length + 1; values buffer length is derived from the last offset entry.
  • ArrowCImporter: offset buffer length changed from length to length + 1.
  • ArrowData convenience init: for variable-width types, derives array length as offsets.length - 1.
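The reader-side convention can be sketched the same way (again a hedged Python model with illustrative names, not the ArrowReader code): under N+1 offsets, the array length falls out as len(offsets) - 1 and the values byte length is the last offset entry:

```python
def decode_variable_width(offsets, data, validity):
    """Read back a variable-width column under the N+1 offset convention."""
    assert len(offsets) >= 1 and offsets[0] == 0
    n_rows = len(offsets) - 1        # array length derived from offsets
    assert offsets[-1] <= len(data)  # last offset bounds the values buffer
    rows = []
    for i in range(n_rows):
        if not validity[i]:
            rows.append(None)
        else:
            rows.append(data[offsets[i]:offsets[i + 1]])
    return rows

rows = decode_variable_width([0, 1, 1, 4], b"abbb", [True, False, True])
assert rows == [b"a", None, b"bbb"]
```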

Testing

Added targeted regression tests with raw-buffer assertions (not just same-library round-trip checks):

  • Long first value: 256-byte first string, verifies offsets = [0, 256] and values buffer length = 256.
  • Null does not advance offsets: ["a", nil, "bbb"] verifies offsets = [0, 1, 1, 4] and values buffer length = 4.
  • Null first, then long value: [nil, 512-byte string] verifies offsets = [0, 0, 512].
  • Binary parity: same null-handling test for BinaryArrayBuilder.
  • IPC round-trip: [256-byte string, nil, "tail"] through writeStreaming / readStreaming, verifying schema, row count, decoded values, and null handling.

All existing tests continue to pass. The 5 pre-existing failures on main (missing generated test data files) are unrelated.

AI disclosure

This fix was developed with AI assistance (Cursor with Claude). The bug was identified through real-world usage in a project that uses Arrow IPC for data serialization — string columns with longer first values caused memory corruption during batch construction. The root cause was then traced by manual code inspection of VariableBufferBuilder.append() against the Arrow columnar format spec, and documented in a detailed plan before any code was written. All generated code was reviewed line-by-line, traced through concrete examples, and validated against swift test. The approach — writing regression tests first, then fixing the implementation — was deliberate to ensure the tests could catch regressions independently of the fix.

Made with Cursor

Preserve and update nulls.length during resize and append so all three
variable-width buffers follow the same bookkeeping pattern. Add a
regression test for nil-first then long-value sequences.

Made-with: Cursor
@kou
Member

kou commented Mar 29, 2026

@willtemperley @maltzsama Could you review this?

Contributor

Copilot AI left a comment


Pull request overview

Fixes correctness and Arrow-spec compliance issues for variable-width (String/Binary) arrays by enforcing N+1 offsets semantics, correct null offset behavior, and consistent logical length tracking across builders, reader, and C data import.

Changes:

  • Reworked VariableBufferBuilder.append/resize/finish to maintain Arrow invariants (N+1 offsets, nulls don’t advance offsets, values/offsets/nulls logical lengths updated each append).
  • Updated ArrowReader / ArrowCImporter / ArrowData to interpret variable-width buffers using N+1 offsets and derive value buffer lengths from the final offset.
  • Added regression tests for long-first-value, null offset behavior, binary parity, and IPC streaming round-trip.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

  • Tests/ArrowTests/IPCTests.swift: Adds IPC streaming round-trip regression test for long first string + null handling.
  • Tests/ArrowTests/CDataTests.swift: Loosens capacity assertion to account for new/variable growth behavior.
  • Tests/ArrowTests/ArrayTests.swift: Adds raw-buffer assertions for offsets/value-length semantics for String/Binary builders.
  • Sources/Arrow/ArrowReader.swift: Reads N+1 offsets and sizes value buffer by last offset for variable-width arrays.
  • Sources/Arrow/ArrowData.swift: Derives variable-width array length from offsets length (offsets.count - 1).
  • Sources/Arrow/ArrowCImporter.swift: Imports variable-width offsets as N+1 entries (and derives value length from last offset).
  • Sources/Arrow/ArrowBufferBuilder.swift: Fixes variable-width builder semantics: null handling, first-value resizing, length tracking, N+1 offsets.
  • Sources/Arrow/ArrowBuffer.swift: Adds internal updateLength helper to keep logical lengths in sync.
Comments suppressed due to low confidence (1)

Sources/Arrow/ArrowCImporter.swift:144

  • In the C data importer, appendToBuffer sets ArrowBuffer.capacity equal to the passed length, but capacity is treated as a byte count elsewhere (e.g., IPC writer uses buffer.capacity for buffer sizes). Passing length + 1 for the offsets buffer makes capacity represent an element count, not bytes, which can lead to incorrect serialization sizes for imported arrays. Consider updating appendToBuffer to accept an element size (or explicit byteLength) so the offsets buffer uses (length + 1) * MemoryLayout<Int32>.stride bytes for capacity (while keeping length as the element count).
                appendToBuffer(cArray.buffers[0], arrowBuffers: &arrowBuffers, length: (length + 7) / 8)
                appendToBuffer(cArray.buffers[1], arrowBuffers: &arrowBuffers, length: length + 1)
                let lastOffsetLength = cArray.buffers[1]!
                    .advanced(by: Int(length) * MemoryLayout<Int32>.stride)
                    .load(as: Int32.self)
                appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))


Comment on lines 197 to +203
      let arrowOffsetBuffer = makeBuffer(offsetBuffer, fileData: loadInfo.fileData,
    -                                    length: UInt(node.length), messageOffset: loadInfo.messageOffset)
    +                                    length: UInt(node.length + 1), messageOffset: loadInfo.messageOffset)
    + let lastOffset = arrowOffsetBuffer.rawPointer
    +     .advanced(by: Int(node.length) * MemoryLayout<Int32>.stride)
    +     .load(as: Int32.self)
      let arrowValueBuffer = makeBuffer(valueBuffer, fileData: loadInfo.fileData,
    -                                    length: UInt(node.length), messageOffset: loadInfo.messageOffset)
    +                                    length: UInt(lastOffset), messageOffset: loadInfo.messageOffset)

Copilot AI Mar 29, 2026


loadVariableData derives the values buffer logical length from the last offset without validating it. If the IPC input is malformed, lastOffset can be negative or larger than the underlying valueBuffer.length, and the subsequent UInt(lastOffset) conversion / later value reads can crash or read out of bounds. Add guards to ensure the offset buffer is large enough to read the last offset, lastOffset >= 0, and lastOffset <= valueBuffer.length (bytes) before creating the value buffer (otherwise return .failure(.invalid(...))).

Contributor Author


Added guards to reject negative offsets and offsets exceeding the value buffer length, returning .failure(.invalid(...)) with a descriptive message. This is consistent with the existing guard chain in the method.

    let arrayLength: UInt
    switch arrowType.info {
    case .variableInfo:
        arrayLength = max(buffers[1].length, 1) - 1

Copilot AI Mar 29, 2026


The variable-width ArrowData convenience init silently maps an offsets buffer length of 0 to an array length of 0 via max(buffers[1].length, 1) - 1. That avoids underflow but can also mask invalid input (variable-width arrays should have at least 1 offset). Consider validating buffers.count >= 2 and buffers[1].length >= 1 for .variableInfo and throwing an .invalid(...) error when the invariant is violated.

Suggested change

    - arrayLength = max(buffers[1].length, 1) - 1
    + guard buffers.count >= 2 else {
    +     throw ArrowError.invalid("Variable-width ArrowData requires at least two buffers (null bitmap and offsets).")
    + }
    + let offsetsLength = buffers[1].length
    + guard offsetsLength >= 1 else {
    +     throw ArrowError.invalid("Variable-width ArrowData requires a non-empty offsets buffer.")
    + }
    + arrayLength = offsetsLength - 1

Contributor Author


Modified as suggested. Also updated checkHolderForType in the test to provide spec-compliant dummy buffers for variable-width types.

if isNull {
if let val = newValue {
let binData = getBytesFor(val)!
nextOffset += Int32(binData.count)

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VariableBufferBuilder.append uses Int32 offsets and does nextOffset += Int32(binData.count). If a single value (or the running total) exceeds Int32.max, the Int32(...) conversion will trap or the addition can overflow, causing a crash. Consider explicitly checking that binData.count and the computed nextOffset stay within Int32.max (and fail fast with a clear error/precondition), or gate this builder to types that are guaranteed to fit Int32 offsets.

Suggested change

    - nextOffset += Int32(binData.count)
    + // Ensure per-value size fits into Int32.
    + precondition(
    +     binData.count <= Int(Int32.max),
    +     "VariableBufferBuilder.append: element size exceeds Int32.max"
    + )
    + // Compute the next offset in a wider type to detect overflow.
    + let proposedOffset = Int64(currentOffset) + Int64(binData.count)
    + precondition(
    +     proposedOffset <= Int64(Int32.max),
    +     "VariableBufferBuilder.append: total buffer size exceeds Int32.max"
    + )
    + nextOffset = Int32(proposedOffset)

Contributor Author


Modified as suggested.

Address review feedback on PR apache#152:
- Validate lastOffset is non-negative and within value buffer bounds
  before constructing the value buffer in ArrowReader.loadVariableData
- Replace silent max(_, 1) - 1 underflow guard in ArrowData with an
  explicit throw when the offsets buffer is missing or empty
- Widen offset arithmetic to Int64 in VariableBufferBuilder.append and
  add a precondition for Int32 overflow
- Update checkHolderForType test to provide spec-compliant buffers for
  variable-width types (3 buffers with a valid initial offset)

Made-with: Cursor
Adopt Copilot's suggested form: two separate guards for missing buffers
vs. empty offsets buffer, giving clearer diagnostics on failure.

Made-with: Cursor
Add a separate precondition for the individual element size before
checking the cumulative total, giving a more specific diagnostic when
a single value exceeds Int32.max.

Made-with: Cursor
@jding-xyz
Contributor Author

Sources/Arrow/ArrowCImporter.swift:144

  • In the C data importer, appendToBuffer sets ArrowBuffer.capacity equal to the passed length, but capacity is treated as a byte count elsewhere (e.g., IPC writer uses buffer.capacity for buffer sizes). Passing length + 1 for the offsets buffer makes capacity represent an element count, not bytes, which can lead to incorrect serialization sizes for imported arrays. Consider updating appendToBuffer to accept an element size (or explicit byteLength) so the offsets buffer uses (length + 1) * MemoryLayout<Int32>.stride bytes for capacity (while keeping length as the element count).
                appendToBuffer(cArray.buffers[0], arrowBuffers: &arrowBuffers, length: (length + 7) / 8)
                appendToBuffer(cArray.buffers[1], arrowBuffers: &arrowBuffers, length: length + 1)
                let lastOffsetLength = cArray.buffers[1]!
                    .advanced(by: Int(length) * MemoryLayout<Int32>.stride)
                    .load(as: Int32.self)
                appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))


I think this is a real issue but it needs more consideration and out of scope of this PR. We should address it later.

@jding-xyz
Contributor Author

Addressed all review comments. Please take a look. Thanks!

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

Sources/Arrow/ArrowCImporter.swift:144

  • In the variable-width import path, lastOffsetLength is loaded as Int32 and then immediately converted to UInt for the values buffer length. If the C data is malformed (negative last offset) this will trap. Also cArray.buffers[1]! will crash if the offsets buffer pointer is nil. Add guards to ensure the offsets buffer exists and lastOffsetLength >= 0 (and ideally that it’s consistent with the provided value buffer) before constructing the Swift buffers.
                appendToBuffer(cArray.buffers[1], arrowBuffers: &arrowBuffers, length: length + 1)
                let lastOffsetLength = cArray.buffers[1]!
                    .advanced(by: Int(length) * MemoryLayout<Int32>.stride)
                    .load(as: Int32.self)
                appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))


    guard lastOffset >= 0 else {
        return .failure(.invalid("Negative last offset (\(lastOffset)) in variable-width buffer"))
    }
    guard lastOffset <= valueBuffer.length else {

Copilot AI Mar 30, 2026


loadVariableData compares lastOffset (Int32) to valueBuffer.length (Int64). As written (lastOffset <= valueBuffer.length) this won’t compile in Swift due to mismatched integer types. Convert lastOffset to Int64 (or valueBuffer.length to Int32 after range-checking) before comparing.

Suggested change

    - guard lastOffset <= valueBuffer.length else {
    + guard Int64(lastOffset) <= valueBuffer.length else {

Contributor Author


Good finding. Modified as suggested.

Comment on lines +32 to +34
    guard buffers.count >= 2 else {
        throw ArrowError.invalid(
            "Variable-width ArrowData requires at least two buffers (null bitmap and offsets).")

Copilot AI Mar 30, 2026


For .variableInfo, this convenience init only validates buffers.count >= 2, but the rest of the codebase (e.g. StringArray/BinaryArray) assumes the [nullBitmap, offsets, values] layout and will access buffers[2]. Allowing only 2 buffers can create an ArrowData that crashes later at runtime. Consider requiring buffers.count >= 3 for variable-width types (or otherwise validating that the values buffer exists).

Suggested change

    - guard buffers.count >= 2 else {
    -     throw ArrowError.invalid(
    -         "Variable-width ArrowData requires at least two buffers (null bitmap and offsets).")
    + guard buffers.count >= 3 else {
    +     throw ArrowError.invalid(
    +         "Variable-width ArrowData requires at least three buffers (null bitmap, offsets, and values).")

Contributor Author


Good catch. Modified as suggested.

- Use explicit Int64 conversion in lastOffset comparison for clarity
- Require 3 buffers (not 2) for variable-width types since downstream
  code accesses buffers[2] for the values buffer

Made-with: Cursor
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

Sources/Arrow/ArrowCImporter.swift:144

  • In ArrowCImporter, the offsets buffer is appended with length: length + 1, but appendToBuffer sets capacity equal to length. For offsets, capacity needs to be the byte size ((length + 1) * MemoryLayout<Int32>.stride) so that downstream getBufferDataSizes / IPC writing includes the full offsets buffer. Consider changing appendToBuffer to accept both a logical element count and a byteCapacity (or compute byteCapacity based on buffer kind) and set ArrowBuffer.capacity appropriately.
                appendToBuffer(cArray.buffers[0], arrowBuffers: &arrowBuffers, length: (length + 7) / 8)
                appendToBuffer(cArray.buffers[1], arrowBuffers: &arrowBuffers, length: length + 1)
                let lastOffsetLength = cArray.buffers[1]!
                    .advanced(by: Int(length) * MemoryLayout<Int32>.stride)
                    .load(as: Int32.self)
                appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))

Sources/Arrow/ArrowCImporter.swift:144

  • lastOffsetLength is read from the imported offsets buffer and then converted to UInt for the values buffer length. If the C data contains a negative last offset, UInt(lastOffsetLength) will trap at runtime (DoS). Add a guard lastOffsetLength >= 0 (and ideally validate monotonic offsets / that lastOffset fits within the provided values buffer) and return .failure(.invalid(...)) on malformed input.
                let lastOffsetLength = cArray.buffers[1]!
                    .advanced(by: Int(length) * MemoryLayout<Int32>.stride)
                    .load(as: Int32.self)
                appendToBuffer(cArray.buffers[2], arrowBuffers: &arrowBuffers, length: UInt(lastOffsetLength))


Comment on lines +414 to +432
    let emptyPtr = UnsafeMutableRawPointer.allocate(byteCount: 0, alignment: .zero)
    let buffers: [ArrowBuffer]
    switch checkType.info {
    case .variableInfo:
        let offsetPtr = UnsafeMutableRawPointer.allocate(byteCount: MemoryLayout<Int32>.stride, alignment: 4)
        offsetPtr.storeBytes(of: Int32(0), as: Int32.self)
        buffers = [
            ArrowBuffer(length: 0, capacity: 0, rawPointer: emptyPtr),
            ArrowBuffer(length: 1, capacity: UInt(MemoryLayout<Int32>.stride), rawPointer: offsetPtr),
            ArrowBuffer(length: 0, capacity: 0,
                        rawPointer: UnsafeMutableRawPointer.allocate(byteCount: 0, alignment: .zero))
        ]
    default:
        buffers = [
            ArrowBuffer(length: 0, capacity: 0, rawPointer: emptyPtr),
            ArrowBuffer(length: 0, capacity: 0,
                        rawPointer: UnsafeMutableRawPointer.allocate(byteCount: 0, alignment: .zero))
        ]
    }

Copilot AI Mar 30, 2026


checkHolderForType reuses the same emptyPtr for multiple ArrowBuffer instances that all default to isMemoryOwner: true. When these buffers are deallocated this will double-free the same pointer and can crash tests / trigger undefined behavior. Allocate a distinct empty pointer per buffer (e.g. use ArrowBuffer.createEmptyBuffer()), or mark shared buffers with isMemoryOwner: false so only one owner deallocates.

@willtemperley
Contributor

@kou Unfortunately I believe Arrow Swift has serious architectural issues, as exemplified here, which is why I have gone ahead with an almost complete rewrite:

[1] https://github.com/willtemperley/swift-arrow/

I have completely rewritten the type system, with the new version based on arrow-rs.

In addition to the types supported by Arrow Swift, I've added binary views and large-offset types. I've built a test harness that uses the Arrow gold files, and it passes the following gold tests, which include a round-trip through pyarrow:

"generated_binary",
"generated_binary_no_batches",
"generated_binary_view",
"generated_binary_zerolength",
"generated_custom_metadata",
"generated_datetime",
"generated_duplicate_fieldnames",
"generated_nested",
"generated_nested_large_offsets",
"generated_recursive_nested",
"generated_large_binary",
"generated_map",
"generated_duration",
"generated_primitive",
"generated_primitive_no_batches",
"generated_primitive_zerolength",

Also, I've made the IPC path zero-copy. Arrow Swift currently copies all file data into in-memory buffers.

Maybe we could come up with a roadmap to create an Arrow-Swift v2 using this work, but under the Apache umbrella.

@kou
Member

kou commented Mar 30, 2026

OK. Could you open an issue that focuses on how to proceed with it?
