arrow-user mailing list archives

From Bryan Cutler <cutl...@gmail.com>
Subject Re: tensorflow-io Arrow Datasets and thoughts on support for tensor columns
Date Wed, 27 Mar 2019 18:18:24 GMT
Thanks Wes!  I am most interested in the last option, adding Tensor as a
logical type, but if it makes sense to embed tensors in a BinaryArray as a
first step, that would still be useful too.  I'll work on a design doc with a
use case and report back. I know there are a lot of different efforts going
on right now and I hate to pile more on, but I would appreciate any time for
feedback and review.

Best Regards,
Bryan

On Mon, Mar 25, 2019 at 2:36 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi Bryan,
>
> I agree this would be useful to work out.
>
> There are a few options:
>
> * Sending multiple tensors as a sequence of encapsulated IPC messages
> (as described in
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst).
> There is no conflict with the columnar streaming protocol that
> prevents this
> * Embedding tensors in BinaryArray columns in some way (e.g. as an
> ExtensionType, which we have now in C++)
> * Adding Tensor as a logical type (this is essentially ARROW-1614)
>
> I would like to understand the use cases more precisely. Perhaps you
> can write a design document that describes the use cases in detail and
> proposes a solution? This doesn't fall anywhere on my list of 2019
> priorities but I'm happy to give feedback on discussions and review
> PRs where relevant.
>
> In conjunction with embedding sequences of tensors in a BinaryArray,
> we would probably need to first develop a LargeBinaryArray with 64-bit
> offsets, so that buffers can be arbitrarily large (well, within 64-bit
> address space at least)
>
> - Wes
>
> On Fri, Mar 22, 2019 at 1:24 PM Bryan Cutler <cutlerb@gmail.com> wrote:
> >
> > Hi All,
> >
> > Recently I have been working with the TensorFlow SIG-IO community to
> > introduce Apache Arrow based Datasets for bringing Arrow data into
> > TensorFlow. SIG-IO is a community-maintained repository focused on
> > input/output support for TF, see https://github.com/tensorflow/io (a lot
> > of formats from contrib/ ended up here).  Since it is community driven, if
> > anyone is interested, participation is highly encouraged!
> >
> > I'm bringing this up for a couple of reasons. First, I want to make sure
> > that this stays in line with any related efforts within the Arrow project,
> > and I welcome any feedback. Second, the initial response has been great and
> > people are excited about using Arrow and looking to use it in other areas
> > of TF, but I've noticed there has been some confusion about how Arrow
> > handles tensor data. Specifically, people often assume that tensors can be
> > part of a RecordBatch and readily used in an Arrow stream.
> >
> > I know we have talked about making tensors a logical type for columnar
> > data before in
> > https://lists.apache.org/thread.html/6cc86d50d92dbd21d6fc34e34485afb3cab4956fbc0d61ff9b99ea27@%3Cdev.arrow.apache.org%3E
> > and there is a JIRA ARROW-1614, but since there is work needed to fully
> > support the current spec for 1.0, I don't think it has moved forward much.
> > I'm wondering if maybe now is a better time to start working on this?  I
> > think having built-in support for tensor columns would really help to
> > increase adoption of Arrow in frameworks that use tensor data. What are
> > other people's thoughts?
> >
> > Best Regards,
> > Bryan
> >
>

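The BinaryArray option and the offset-width concern raised in the thread can be illustrated with a small sketch. It is not Arrow code and does not use Arrow's IPC tensor format: the per-tensor encoding (a dimension count, the dimensions, then row-major float64 values) and every function name below are hypothetical, chosen only to show how tensors could sit in a variable-size binary column addressed by an offsets buffer, and why signed 32-bit offsets cap the total size of that column.

```python
import struct

# Largest offset representable with signed 32-bit offsets.
INT32_MAX = 2**31 - 1


def encode_tensor(shape, values):
    """Hypothetical encoding: ndim, each dim (int64), then row-major float64 data."""
    count = 1
    for dim in shape:
        count *= dim
    if count != len(values):
        raise ValueError("shape does not match number of values")
    header = struct.pack(f"<q{len(shape)}q", len(shape), *shape)
    return header + struct.pack(f"<{count}d", *values)


def decode_tensor(blob):
    """Inverse of encode_tensor; returns (shape, values)."""
    (ndim,) = struct.unpack_from("<q", blob, 0)
    shape = struct.unpack_from(f"<{ndim}q", blob, 8)
    count = 1
    for dim in shape:
        count *= dim
    values = struct.unpack_from(f"<{count}d", blob, 8 + 8 * ndim)
    return tuple(shape), list(values)


def build_binary_column(blobs):
    """Concatenate blobs into one data buffer plus len(blobs) + 1 offsets."""
    offsets = [0]
    data = bytearray()
    for blob in blobs:
        data += blob
        offsets.append(len(data))
    return offsets, bytes(data)


def get_value(offsets, data, i):
    """Slice element i out of the data buffer using adjacent offsets."""
    return data[offsets[i]:offsets[i + 1]]


def fits_int32_offsets(total_bytes):
    """True if the final offset of a column this large fits in a signed int32."""
    return total_bytes <= INT32_MAX


if __name__ == "__main__":
    column = [
        encode_tensor((2, 2), [1.0, 2.0, 3.0, 4.0]),
        encode_tensor((3,), [5.0, 6.0, 7.0]),
    ]
    offsets, data = build_binary_column(column)
    print(offsets)
    print(decode_tensor(get_value(offsets, data, 1)))
    # 256 tensors of 1024 x 1024 float64 values, each with a 24-byte header.
    print(fits_int32_offsets(256 * (24 + 1024 * 1024 * 8)))
```

Under this encoding, 256 tensors of 1024 x 1024 float64 values already exceed what a signed 32-bit offset can address, which is the reason a 64-bit-offset variant is suggested above before embedding sequences of tensors this way.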