arrow-user mailing list archives

From Cindy McMullen <cmcmul...@twitter.com>
Subject Re: Avro -> TensorFlow
Date Wed, 29 Jul 2020 11:16:08 GMT
Thanks, Micah, for your thoughtful response.  We'll give it a try and let
you know how it goes.

-- Cindy

On Tue, Jul 28, 2020 at 10:20 PM Micah Kornfield <emkornfield@gmail.com>
wrote:

> Hi Cindy,
> I haven't tried this, but the best guidance I can give is the following:
> 1. Create an appropriate decoder using Avro's DecoderFactory [1].
> 2. Construct an Arrow adapter with a schema and the decoder.  There are
> some examples in the unit tests [2].
> 3. Adapt the method Uwe describes in his blog post about JDBC [3] to use
> the adapter.  From there I think you can use the TensorFlow APIs (sorry,
> I've not used them, but my understanding is that TF only has Python
> APIs?)
>
> If number 3 doesn't work for you due to environment constraints, you could
> write out an Arrow file using the file writer [4] and see whether the
> examples listed in [5] help.
>
> One thing to note: I believe the Avro adapter library currently has an
> impedance mismatch with the ArrowFileWriter.  The adapter returns a new
> VectorSchemaRoot per batch, while the writer libraries are designed around
> loading/unloading a single VectorSchemaRoot.  I think the method with the
> least overhead for transferring the data is to create a VectorUnloader
> [6] per VectorSchemaRoot, convert it to a record batch, and then load it
> into the writer's VectorSchemaRoot.  This will unfortunately cause some
> memory churn due to extra allocations.
>
> There is a short overview of working with Arrow generally available at [7].
>
> Hope this helps,
> Micah
>
> [1]
> https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
> [2]
> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
> [3]
> https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
> [4]
> https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
> [5]
> https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
> [6]
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
> [7] https://arrow.apache.org/docs/java/
>
> On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <cmcmullen@twitter.com>
> wrote:
>
>> Hi -
>>
>> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
>> file or SpecificRecord Java class) that I'd like to send to TensorFlow as
>> input tensors, preferably via Arrow.  Can you suggest some existing
>> adapters or code patterns (Java or Scala) that I can use?
>>
>> Thanks -
>>
>> -- Cindy
>>
>
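
Steps 1 and 2 of Micah's reply might look roughly like the sketch below.
It is untested against the thread's setup and rests on several assumptions:
the byte[] holds raw binary-encoded Avro records (not an Avro container
file), the schema is a record schema, and the adapter classes live in the
org.apache.arrow package as they did in the unit test linked at [2] (later
Arrow releases may place them elsewhere). The class name AvroBytesToArrow
and the method countRows are hypothetical names used only for illustration.

```java
import org.apache.arrow.AvroToArrow;
import org.apache.arrow.AvroToArrowConfig;
import org.apache.arrow.AvroToArrowConfigBuilder;
import org.apache.arrow.AvroToArrowVectorIterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

/** Hypothetical helper: decodes Avro bytes into Arrow batches. */
public final class AvroBytesToArrow {
  private AvroBytesToArrow() {}

  /**
   * Walks the Arrow batches produced from the Avro bytes and returns the
   * total number of rows seen. A real pipeline would hand each batch to
   * a consumer instead of only counting rows.
   */
  public static long countRows(Schema avroSchema, byte[] avroBytes, int batchSize)
      throws Exception {
    // Step 1: an Avro decoder over the in-memory bytes (assumed raw binary encoding).
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);

    long rows = 0;
    try (BufferAllocator allocator = new RootAllocator()) {
      // Step 2: the Arrow adapter, configured with an allocator and a batch size.
      AvroToArrowConfig config =
          new AvroToArrowConfigBuilder(allocator).setTargetBatchSize(batchSize).build();
      try (AvroToArrowVectorIterator batches =
          AvroToArrow.avroToArrowIterator(avroSchema, decoder, config)) {
        while (batches.hasNext()) {
          // The adapter hands back a new VectorSchemaRoot per batch;
          // the caller is responsible for closing each one.
          try (VectorSchemaRoot root = batches.next()) {
            rows += root.getRowCount();
          }
        }
      }
    }
    return rows;
  }
}
```

If the schema is only available as a SpecificRecord class, the Avro Schema
can usually be obtained from the generated class's getClassSchema() method.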

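
The VectorUnloader approach Micah describes for the file-writer path could
be sketched as follows. This is a minimal, untested illustration rather
than a recommended implementation: the class name BatchesToArrowFile and
its write method are hypothetical, the Arrow schema is assumed to be known
up front (for example, taken from the first batch's getSchema()), and no
dictionary-encoded fields are assumed, which is why the dictionary provider
is passed as null.

```java
import java.io.IOException;
import java.nio.channels.WritableByteChannel;
import java.util.Iterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.Schema;

/** Hypothetical helper: funnels per-batch roots into one writer root. */
public final class BatchesToArrowFile {
  private BatchesToArrowFile() {}

  /**
   * Writes every batch to an Arrow file on the given channel and returns
   * the number of batches written. Each incoming root is closed after use.
   * Closing the writer also closes the channel.
   */
  public static int write(
      Schema schema,
      Iterator<VectorSchemaRoot> batches,
      BufferAllocator allocator,
      WritableByteChannel out)
      throws IOException {
    int written = 0;
    // The writer is built around a single root that is reloaded for every batch.
    try (VectorSchemaRoot writerRoot = VectorSchemaRoot.create(schema, allocator);
        ArrowFileWriter writer = new ArrowFileWriter(writerRoot, null, out)) {
      VectorLoader loader = new VectorLoader(writerRoot);
      writer.start();
      while (batches.hasNext()) {
        try (VectorSchemaRoot batch = batches.next();
            // Unload the per-batch root into a record batch ...
            ArrowRecordBatch recordBatch = new VectorUnloader(batch).getRecordBatch()) {
          // ... load it into the writer's root, then write that root out.
          loader.load(recordBatch);
          writer.writeBatch();
          written++;
        }
      }
      writer.end();
    }
    return written;
  }
}
```

The iterator returned by the Avro adapter could be passed as the batches
argument, since it yields one VectorSchemaRoot per batch.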