arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: tensorflow-io Arrow Datasets and thoughts on support for tensor columns
Date Tue, 26 Mar 2019 02:36:10 GMT
hi Bryan,

I agree this would be useful to work out.

There are a few options:

* Sending multiple tensors as a sequence of encapsulated IPC messages
(as described in
https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst).
Nothing in the columnar streaming protocol conflicts with this or
prevents it.
* Embedding tensors in BinaryArray columns in some way (e.g. as an
ExtensionType, which we have now in C++)
* Adding Tensor as a logical type (this is essentially ARROW-1614)
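
As a rough illustration of the second option, the sketch below packs
tensors into the kind of layout a variable-length binary column uses: one
contiguous data buffer plus n+1 offsets. It is plain Python, not the Arrow
API. The helper names and the small header (ndim, dims, then float32
values) are assumptions made up for this example; they are not Arrow's
Tensor message format or an actual ExtensionType.

```python
import struct


def encode_tensor(shape, values):
    """Serialize a float32 tensor as: ndim, dims..., row-major values.

    Hypothetical header layout, for illustration only.
    """
    count = 1
    for dim in shape:
        count *= dim
    if count != len(values):
        raise ValueError("shape does not match number of values")
    header = struct.pack("<I", len(shape)) + struct.pack(f"<{len(shape)}q", *shape)
    return header + struct.pack(f"<{len(values)}f", *values)


def decode_tensor(blob):
    """Inverse of encode_tensor: returns (shape, values)."""
    (ndim,) = struct.unpack_from("<I", blob, 0)
    shape = struct.unpack_from(f"<{ndim}q", blob, 4)
    count = 1
    for dim in shape:
        count *= dim
    values = struct.unpack_from(f"<{count}f", blob, 4 + 8 * ndim)
    return tuple(shape), list(values)


def build_binary_column(blobs):
    """Concatenate blobs into one data buffer with n+1 offsets."""
    offsets = [0]
    data = bytearray()
    for blob in blobs:
        data += blob
        offsets.append(len(data))
    return offsets, bytes(data)


def get_value(offsets, data, i):
    """Slice the i-th value out of the data buffer using the offsets."""
    return data[offsets[i]:offsets[i + 1]]


offsets, data = build_binary_column([
    encode_tensor((2, 2), [1.0, 2.0, 3.0, 4.0]),
    encode_tensor((3,), [5.0, 6.0, 7.0]),
])
shape, values = decode_tensor(get_value(offsets, data, 1))
```

Each row can hold a tensor of a different shape, but a reader has to know
the header convention to interpret the bytes, which is the kind of
information an extension type or a logical Tensor type would carry.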

I would like to understand the use cases more precisely. Perhaps you
can write a design document that describes the use cases in detail and
a proposed solution? This doesn't fall anywhere on my list of 2019
priorities, but I'm happy to give feedback on discussions and review
PRs where relevant.

In conjunction with embedding sequences of tensors in a BinaryArray,
we would probably need to first develop a LargeBinaryArray with 64-bit
offsets, so that buffers can be arbitrarily large (well, within 64-bit
address space at least).
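
To make the size limit concrete, here is a back-of-the-envelope sketch. It
assumes signed 32-bit offsets (so at most 2**31 - 1 bytes in one array's
data buffer) and counts raw tensor bytes only, with no per-tensor header.
The function name is hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit offset can hold


def max_tensors_per_array(shape, itemsize, offset_max):
    """How many equally sized tensors fit before the last offset overflows."""
    tensor_bytes = itemsize
    for dim in shape:
        tensor_bytes *= dim
    return offset_max // tensor_bytes


# 1024 x 1024 float32 tensors are 4 MiB each.
small_limit = max_tensors_per_array((1024, 1024), 4, INT32_MAX)  # 511
large_limit = max_tensors_per_array((1024, 1024), 4, INT64_MAX)
```

With 32-bit offsets a single array tops out at 511 such tensors, so larger
collections would have to be split into chunks; 64-bit offsets remove that
limit for practical purposes.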

- Wes

On Fri, Mar 22, 2019 at 1:24 PM Bryan Cutler <cutlerb@gmail.com> wrote:
>
> Hi All,
>
> Recently I have been working with the TensorFlow SIG-IO community to
> introduce Apache Arrow based Datasets for bringing Arrow data into
> TensorFlow. SIG-IO is a community maintained repository focused on
> input/output support for TF, see https://github.com/tensorflow/io (a
> lot of formats from contrib/ ended up here). Since it is community
> driven, if anyone is interested, participation is highly encouraged!
>
> I'm bringing this up for a couple of reasons. First, I want to make
> sure that this stays in line with any related efforts within the Arrow
> project, and I welcome any feedback. Secondly, the initial response has
> been great and people are excited about using Arrow and looking to use
> it in other areas of TF, but I've noticed there has been some confusion
> about how Arrow handles tensor data. Specifically, it gets assumed that
> tensors could be part of a RecordBatch and could be readily used in an
> Arrow stream.
>
> I know we have talked about making tensors a logical type for columnar
> data before in
> https://lists.apache.org/thread.html/6cc86d50d92dbd21d6fc34e34485afb3cab4956fbc0d61ff9b99ea27@%3Cdev.arrow.apache.org%3E
> and there is a JIRA ARROW-1614, but since there is work needed to fully
> support the current spec for 1.0, I don't think it has moved forward
> much. I'm wondering if maybe now is a better time to start working on
> this? I think having built-in support for tensor columns would really
> help to increase adoption of Arrow in frameworks that use tensor data.
> What are other people's thoughts?
>
> Best Regards,
> Bryan
>
