This starts work towards supporting the C data interface for the arrow format, as documented [here](https://arrow.apache.org/docs/format/CDataInterface.html#). Currently, this PR includes struct definitions and basic methods for getting a pointer to an `ArrowSchema`/`ArrowArray` C-compatible struct that can then be populated by another implementation. For example, with this PR, you can do:

```julia
using Arrow, PyCall
pd = pyimport("pandas")
pa = pyimport("pyarrow")
# build a pyarrow RecordBatch from a small pandas DataFrame
df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
# pyarrow fills in the C struct located at the address we pass it
sch = Arrow.CData.getschema() do ptr
    rb.schema._export_to_c(Int(ptr))
end
arr = Arrow.CData.getarray() do ptr
    rb._export_to_c(Int(ptr))
end
```

Currently, these `ArrowSchema`/`ArrowArray` structs are pretty bare-bones, but this at least lays some groundwork for integration. Things we still need/want to make all this nicer to use/work with:

* Type format string parsing/converting: we need to parse the type format strings as outlined [here](https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings) to figure out what type of data we'll get in the arrays. It'd probably be best to add a `type` field to the `ArrowSchema` struct that we'd populate when converting from `CArrowSchema` -> `ArrowSchema`
* Add a method like `Arrow.ArrowVector(::ArrowSchema, ::ArrowArray)` that produces a concrete `ArrowVector` subtype, like `Arrow.Primitive`, `Arrow.List`, etc. This will be a bit tricky, because we have to follow all the same columnar layout trickery that we currently handle for IPC in the table.jl `build` methods. Perhaps we can refactor all that so we can re-use some code? Otherwise, we might just need to reimplement a bunch of that logic specifically for converting `ArrowArray`s.
* That should give a robust consuming story; for producing, we probably need a definition like `Arrow.ArrowSchema(a::Arrow.ArrowVector)` that produces a valid `ArrowSchema`, and then overloads per `ArrowVector` subtype, like `Arrow.ArrowArray(x::Arrow.Primitive)`, that produce the right `ArrowArray` for a concrete arrow array
* Then the last piece we need is just figuring out the right mechanics for providing a pointer to the `CArrowSchema`, `CArrowArray` structs once they're populated

If anyone would like to help out, I'm happy to provide as much guidance as possible so others can get their feet wet in some arrow spec nitty-gritty.
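To make the format-string item in the list above a bit more concrete, here is a minimal sketch of what the parsing step might look like for a few of the simplest cases. The names `PRIMITIVE_FORMATS` and `parseformat` are hypothetical and are not part of this PR or of Arrow.jl; the single-character codes come from the format-string table in the C data interface spec, and the choice of Julia types on the right-hand side (for example `Missing` for the null type) is only an assumption for illustration. Nested, temporal, and decimal formats are deliberately left out.

```julia
# Hypothetical lookup table (not Arrow.jl API): single-character format
# strings from the C data interface spec mapped to plausible Julia types.
const PRIMITIVE_FORMATS = Dict{String,DataType}(
    "n" => Missing,                      # null
    "b" => Bool,                         # boolean
    "c" => Int8,    "C" => UInt8,
    "s" => Int16,   "S" => UInt16,
    "i" => Int32,   "I" => UInt32,
    "l" => Int64,   "L" => UInt64,
    "e" => Float16, "f" => Float32, "g" => Float64,
    "u" => String,  "U" => String,       # utf-8 and large utf-8 strings
)

# Hypothetical helper: return a Julia type for a (small) subset of formats.
function parseformat(fmt::AbstractString)
    if haskey(PRIMITIVE_FORMATS, fmt)
        return PRIMITIVE_FORMATS[fmt]
    elseif fmt == "z" || fmt == "Z"
        # binary and large binary
        return Vector{UInt8}
    elseif startswith(fmt, "w:")
        # fixed-width binary, e.g. "w:16" means 16 bytes per element
        n = parse(Int, fmt[3:end])
        return NTuple{n,UInt8}
    end
    # nested ("+l", "+s", ...), temporal ("tdD", ...) and decimal ("d:...")
    # formats are not handled in this sketch
    throw(ArgumentError("unsupported format string: $fmt"))
end
```

Something like `parseformat(unsafe_string(c.format))` could then be used to fill the proposed `type` field while converting a `CArrowSchema` to an `ArrowSchema`, assuming the C struct's `format` field is reachable as a `Cstring`.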
Codecov Report
@@ Coverage Diff @@
## main #178 +/- ##
==========================================
- Coverage 81.34% 79.15% -2.20%
==========================================
Files 25 26 +1
Lines 3034 3118 +84
==========================================
Hits 2468 2468
- Misses 566 650 +84
Closing due to staleness: this PR is 5 years old and has been superseded by #561, which provides a more complete C Data Interface implementation.