arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Avro -> TensorFlow
Date Wed, 29 Jul 2020 04:19:53 GMT
Hi Cindy,
I haven't tried this, but the best guidance I can give is the following:
1.  Create an appropriate decoder using Avro's DecoderFactory [1].
2.  Construct an Arrow adapter with a schema and the decoder.  There are
some examples in the unit tests [2].
3.  Adapt the method Uwe describes in his blog post about JDBC [3] to use
the adapter.  From there I think you can use the TensorFlow APIs (sorry,
I've not used them, but my understanding is that TF only has Python
APIs?)
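
Steps 1 and 2 might look roughly like the following. This is an untested sketch, not code from the Arrow project: it assumes the byte[] holds raw Avro binary records (not an Avro container file), that the adapter classes live in the org.apache.arrow package as in the unit test linked in [2] (the package may differ in other Arrow releases), and the class name AvroBytesToArrow and method countRows are made up for illustration.

```java
import org.apache.arrow.AvroToArrow;
import org.apache.arrow.AvroToArrowConfig;
import org.apache.arrow.AvroToArrowConfigBuilder;
import org.apache.arrow.AvroToArrowVectorIterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

// Hypothetical helper: the class and method names are illustrative only.
public class AvroBytesToArrow {

    /** Decodes raw Avro binary records into Arrow batches and counts the rows. */
    public static long countRows(byte[] avroBytes, Schema schema) throws Exception {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
            // Step 1: a decoder over the serialized bytes.
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);

            // Step 2: the adapter, built from the Avro schema and the decoder.
            AvroToArrowConfig config = new AvroToArrowConfigBuilder(allocator).build();
            long rows = 0;
            try (AvroToArrowVectorIterator batches =
                     AvroToArrow.avroToArrowIterator(schema, decoder, config)) {
                while (batches.hasNext()) {
                    // Each call hands back a new root that the caller must close.
                    try (VectorSchemaRoot root = batches.next()) {
                        rows += root.getRowCount();
                    }
                }
            }
            return rows;
        }
    }
}
```

In a real pipeline the loop body would hand each VectorSchemaRoot to the next stage (for example the JVM-to-Python bridge from [3]) instead of only counting rows.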

If step 3 doesn't work for you due to environment constraints, you could
write out an Arrow file using the file writer [4] and see whether the
examples listed in [5] help.

One thing to note is that I believe the Avro adapter library currently has
an impedance mismatch with the ArrowFileWriter.  The adapter returns a new
VectorSchemaRoot per batch, and the writer libraries are designed around
loading/unloading a single VectorSchemaRoot.  I think the method with the
least overhead for transferring the data is to create a VectorUnloader
[6] per VectorSchemaRoot, convert it to a record batch and then load it
into the writer's VectorSchemaRoot.  This will unfortunately cause some
amount of memory churn due to extra allocations.
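
The unload/load pattern described above could be sketched as follows. This is an untested illustration under several assumptions: the batches come from any Iterator<VectorSchemaRoot> (such as the Avro adapter's iterator), none of the fields are dictionary-encoded (so a null DictionaryProvider is passed to the writer, which would likely not work for Avro enums), and the class name BatchesToArrowFile and method write are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.WritableByteChannel;
import java.util.Iterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

// Hypothetical helper: the class and method names are illustrative only.
public class BatchesToArrowFile {

    /** Copies every incoming batch into one writer-owned root; returns the batch count. */
    public static int write(Iterator<VectorSchemaRoot> batches,
                            BufferAllocator allocator,
                            WritableByteChannel out) throws IOException {
        if (!batches.hasNext()) {
            return 0;
        }
        VectorSchemaRoot current = batches.next();
        int written = 0;
        // The writer is bound to a single root, so create one with the same schema.
        try (VectorSchemaRoot writerRoot =
                 VectorSchemaRoot.create(current.getSchema(), allocator);
             ArrowFileWriter writer = new ArrowFileWriter(writerRoot, null, out)) {
            VectorLoader loader = new VectorLoader(writerRoot);
            writer.start();
            while (current != null) {
                // Unload the source root into a record batch, then load that
                // batch into the writer's root and flush it to the file.
                try (VectorSchemaRoot source = current;
                     ArrowRecordBatch batch = new VectorUnloader(source).getRecordBatch()) {
                    loader.load(batch);
                    writer.writeBatch();
                    written++;
                }
                current = batches.hasNext() ? batches.next() : null;
            }
            writer.end();
        }
        return written;
    }
}
```

Each source root and record batch is closed as soon as it has been loaded into the writer's root, which keeps the extra allocations mentioned above short-lived.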

There is a short overview of working with Arrow in general at [7].

Hope this helps,
Micah

[1] https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
[2] https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
[3] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
[4] https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
[5] https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
[6] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
[7] https://arrow.apache.org/docs/java/

On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <cmcmullen@twitter.com>
wrote:

> Hi -
>
> I've got a byte[] of serialized Avro, along w/ the Avro Schema (*.avsc
> file or SpecificRecord Java class) that I'd like to send to TensorFlow as
> input tensors, preferably via Arrow.  Can you suggest some existing
> adapters or code patterns (Java or Scala) that I can use?
>
> Thanks -
>
> -- Cindy
>
