[Java] How dictionaries work - roundtrip Java-Python#327
[Java] How dictionaries work - roundtrip Java-Python#327vibhatha wants to merge 12 commits intoapache:mainfrom
Conversation
|
Hi @vibhatha, just reviewing error logs and error mention that arrow-memory-netty is not available on the nightly packages, and that is true, as you can see at https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-memory-netty/, which is very rare, but it needs to be reviewed. Alternatively, you could start using the 14.0.0-SNAPSHOT version which can be configured in the arrow-cookbook/java/source/conf.py file as follows: |
If I understand correctly, should I make that change in this PR itself? Or should we create a separate PR for that? |
This change is being added to #320 as well, but it is true, this must have been created independently. |
|
@davisusanibar sounds good and I will rebase once this PR is merged. Thank you 👍 |
7befe38 to
c35eefd
Compare
|
@danepitkin @davisusanibar This PR is ready for review, please take a look. |
danepitkin
left a comment
There was a problem hiding this comment.
Thanks @vibhatha ! This will be a great addition to the cook book. I'm not super familiar with the C Data interface / jpype so reviewing this was a good learning opportunity for me.
| data | ||
| avro | ||
| jdbc | ||
| python_java |
There was a problem hiding this comment.
Would "c data interface" potentially be a better name? The section below uses pyarrow/jpype as a python example, but in the future we could also add other language interfaces.
There was a problem hiding this comment.
@danepitkin
Thank you for the quick review and suggestions. And I agree with your comment. Will make the changed.
| In the Java component, the MapValuesConsumer class receives data from the Python component through C Data. | ||
| It then updates the data and sends it back to the Python component. | ||
|
|
||
| .. code-block:: java |
There was a problem hiding this comment.
Java component is tested by custom directives created with testcode and testoutput.
During internal works, Java code is executed by Jshell and output defined by system.out.print on the testcode is compared to the value defined on the test output, such as:
.. testcode::
System.out.print("testme");
.. testoutput::
testme
| return this.vector; | ||
| } | ||
|
|
||
| public void update(long c_array_ptr, long c_schema_ptr) { |
There was a problem hiding this comment.
Could be an option to also validate Python call as as part of Java testing using .. testcode:: and .. testoutput:: directives?
Something like this:
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.CDataDictionaryProvider;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;
public class MapValuesConsumer {
private final static BufferAllocator allocator = new RootAllocator();
private final CDataDictionaryProvider provider;
private FieldVector vector;
public MapValuesConsumer(CDataDictionaryProvider provider) {
this.provider = provider;
}
public MapValuesConsumer() {
this.provider = null;
}
public static BufferAllocator getAllocatorForJavaConsumer() {
return allocator;
}
public FieldVector getVector() {
return this.vector;
}
public void update(long c_array_ptr, long c_schema_ptr) {
ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
this.vector = Data.importVector(allocator, arrow_array, arrow_schema, this.provider);
this.doWorkInJava(vector);
}
public void update2(long c_array_ptr, long c_schema_ptr) {
ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
this.vector = Data.importVector(allocator, arrow_array, arrow_schema, null);
this.doWorkInJava(vector);
}
private void doWorkInJava(FieldVector vector) {
System.out.println("Doing work in Java");
BigIntVector bigIntVector = (BigIntVector)vector;
bigIntVector.setSafe(0, 2);
}
public static void main(String[] args) {
simulateAsAJavaConsumers();
}
final static BigIntVector intVector =
new BigIntVector("internal_test", allocator);
public static BigIntVector getIntVectorForJavaConsumers() {
intVector.allocateNew(3);
intVector.set(0, 1);
intVector.set(1, 7);
intVector.set(2, 93);
intVector.setValueCount(3);
return intVector;
}
public static void simulateAsAJavaConsumers() {
MapValuesConsumer mvc = new MapValuesConsumer();//FIXME! Use constructor with dictionary provider
try (
ArrowArray arrowArray = ArrowArray.allocateNew(allocator);
ArrowSchema arrowSchema = ArrowSchema.allocateNew(allocator)
) {
//FIXME! Add custo logic to emulate a dictionary provider adding
Data.exportVector(allocator, getIntVectorForJavaConsumers(), null, arrowArray, arrowSchema);
mvc.update2(arrowArray.memoryAddress(), arrowSchema.memoryAddress());
try (FieldVector valueVectors = Data.importVector(allocator, arrowArray, arrowSchema, null);) {
System.out.print(valueVectors); //FIXME! Validate on .. testoutput::
}
}
intVector.close(); //FIXME! Expose this method also to be called by end python program
allocator.close();
}
}
There was a problem hiding this comment.
Thank you for the insight, @davisusanibar. I will work on this.
There was a problem hiding this comment.
@davisusanibar this code doesn't seem to be working, but I get your idea.
68d140b to
46bba74
Compare
|
@danepitkin I am addressing the reviews and thank you for reviewing this PR. |
danepitkin
left a comment
There was a problem hiding this comment.
Nice work @vibhatha ! A few small comments, but overall LGTM otherwise.
|
@danepitkin I am working on updating this PR for Java doctests. I will address the reviews as well. |
| This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java, | ||
| and finally re-accessed and validated in Python for data consistency. |
There was a problem hiding this comment.
| This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java, | |
| and finally re-accessed and validated in Python for data consistency. | |
| This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java, | |
| and finally re-accessed and validated in Python for data consistency. |
There was a problem hiding this comment.
This description is misleading. You cannot mutate data in the C Data Interface.
There was a problem hiding this comment.
I changed the wording.
This section demonstrates a data roundtrip where C Data interface is being used to provide
the seamless access to data across language boundaries.| In the Java component, the MapValuesConsumer class receives data from the Python component through C Data. | ||
| It then updates the data and sends it back to the Python component. | ||
|
|
||
| .. testcode:: |
| In the Java component, the MapValuesConsumer class receives data from the Python component through C Data. | ||
| It then updates the data and sends it back to the Python component. |
There was a problem hiding this comment.
| In the Java component, the MapValuesConsumer class receives data from the Python component through C Data. | |
| It then updates the data and sends it back to the Python component. | |
| In the Java component, the MapValuesConsumer class receives data from the Python component through C Data. | |
| It then updates the data and sends it back to the Python component. |
There was a problem hiding this comment.
Changed the wording. Thanks for catching this.
|
@lidavidm thanks a lot for your feedback. I will address these. |
|
Given we have no roundtrip examples at all, I don't see why we are starting with dictionaries vs a simpler type |
Could it be consider as a simpler type #325? |
Good point. Although, I am merely picking up an existing issue. |
This PR closes #213