arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Avro -> TensorFlow
Date Sat, 01 Aug 2020 02:52:52 GMT
Thanks Cindy,
Feedback would be appreciated.  I also filed
https://issues.apache.org/jira/browse/ARROW-9613 so that the conversion can
potentially be more efficient.

On Wed, Jul 29, 2020 at 4:16 AM Cindy McMullen <cmcmullen@twitter.com>
wrote:

> Thanks, Micah, for your thoughtful response.  We'll give it a try and let
> you know how it goes.
>
> -- Cindy
>
> On Tue, Jul 28, 2020 at 10:20 PM Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
>> Hi Cindy,
>> I haven't tried this but the best guidance I can give is the following:
>> 1.  Create an appropriate decoder using Avro's DecoderFactory [1].
>> 2.  Construct an Arrow adapter with a schema and the decoder.  There are
>> some examples in the unit tests [2].
>> 3.  Adapt the method Uwe describes in his blog post about
>> JDBC [3] to use the adapter.  From there I think you can use the
>> TensorFlow APIs (sorry, I've not used them, but my understanding is TF
>> only has Python APIs?)
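>> Roughly, steps 1 and 2 might look like the following (an untested
>> sketch; the class names are taken from the adapter's unit tests in [2],
>> so treat the exact API as an assumption, and the one-field schema and
>> record are just stand-ins for your real data):

```java
import java.io.ByteArrayOutputStream;

import org.apache.arrow.AvroToArrow;
import org.apache.arrow.AvroToArrowConfig;
import org.apache.arrow.AvroToArrowConfigBuilder;
import org.apache.arrow.AvroToArrowVectorIterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroToArrowExample {
  public static void main(String[] args) throws Exception {
    // A toy Avro schema with a single int field (stand-in for your *.avsc).
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"int\"}]}");

    // Serialize one record so we have a byte[] to work with.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    GenericData.Record record = new GenericData.Record(schema);
    record.put("id", 42);
    new GenericDatumWriter<GenericData.Record>(schema).write(record, encoder);
    encoder.flush();

    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // Step 1: a decoder over the serialized bytes.
      BinaryDecoder decoder =
          DecoderFactory.get().binaryDecoder(out.toByteArray(), null);

      // Step 2: the adapter iterator, which yields a VectorSchemaRoot
      // per batch of decoded records.
      AvroToArrowConfig config = new AvroToArrowConfigBuilder(allocator).build();
      try (AvroToArrowVectorIterator it =
               AvroToArrow.avroToArrowIterator(schema, decoder, config)) {
        while (it.hasNext()) {
          try (VectorSchemaRoot root = it.next()) {
            System.out.println(root.contentToTSVString());
          }
        }
      }
    }
  }
}
```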
>>
>> If number 3 doesn't work for you due to environment constraints, you
>> could write out an Arrow file using the file writer [4] and see if the
>> examples listed in [5] help.
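>> For the file route, writer usage looks roughly like this (an untested
>> sketch; the hand-built IntVector stands in for data produced by the
>> adapter, and the output file name is made up):

```java
import java.io.FileOutputStream;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;

public class WriteArrowFile {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // Stand-in data; in practice this comes from the Avro adapter.
      IntVector ids = new IntVector("id", allocator);
      ids.allocateNew(3);
      for (int i = 0; i < 3; i++) {
        ids.set(i, i);
      }
      ids.setValueCount(3);

      try (VectorSchemaRoot root = VectorSchemaRoot.of(ids);
           FileOutputStream out = new FileOutputStream("example.arrow");
           ArrowFileWriter writer =
               new ArrowFileWriter(root, null, out.getChannel())) {
        root.setRowCount(3);
        writer.start();
        writer.writeBatch();  // one writeBatch() call per populated root
        writer.end();
      }
    }
  }
}
```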
>>
>> One thing to note: I believe the Avro adapter library currently has an
>> impedance mismatch with the ArrowFileWriter.  The adapter returns a new
>> VectorSchemaRoot per batch, while the writer libraries are designed around
>> loading/unloading a single VectorSchemaRoot.  I think the method with the
>> least overhead for transferring the data is to create a VectorUnloader
>> [6] per VectorSchemaRoot, convert it to a record batch, and then load it
>> into the writer's VectorSchemaRoot.  This will unfortunately cause some
>> amount of memory churn due to the extra allocations.
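>> Per batch, that unload/load shuffle might look like this (a sketch;
>> `adapterRoot` is a root returned by the adapter, and `writerRoot` is
>> the one the ArrowFileWriter was constructed with):

```java
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

public final class BatchShuttle {
  // Copy the contents of one adapter-produced root into the writer's root.
  public static void copyBatch(VectorSchemaRoot adapterRoot,
                               VectorSchemaRoot writerRoot) {
    VectorUnloader unloader = new VectorUnloader(adapterRoot);
    try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
      new VectorLoader(writerRoot).load(batch);
      // ...then call writeBatch() on the ArrowFileWriter that was
      // constructed with writerRoot, and close adapterRoot.
    }
  }
}
```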
>>
>> There is a short overview of working with Arrow in general available at [7].
>>
>> Hope this helps,
>> Micah
>>
>> [1]
>> https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
>> [2]
>> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
>> [3]
>> https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
>> [4]
>> https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
>> [5]
>> https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
>> [6]
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
>> [7] https://arrow.apache.org/docs/java/
>>
>> On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <cmcmullen@twitter.com>
>> wrote:
>>
>>> Hi -
>>>
>>> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
>>> file or SpecificRecord Java class) that I'd like to send to TensorFlow as
>>> input tensors, preferably via Arrow.  Can you suggest some existing
>>> adapters or code patterns (Java or Scala) that I can use?
>>>
>>> Thanks -
>>>
>>> -- Cindy
>>>
>>
