From: Bryan Cutler
Date: Wed, 27 Mar 2019 11:18:24 -0700
Subject: Re: tensorflow-io Arrow Datasets and thoughts on support for tensor columns
To: user@arrow.apache.org
Cc: dev@arrow.apache.org

Thanks Wes! I am most interested in the last option, adding Tensor as a
logical type, but if it makes sense to embed as a BinaryArray for a first
step then that would still be useful too. I'll work on a design doc with a
use case and report back. I know there are a lot of different efforts going
on right now and I hate to pile more on, but I appreciate the time for
feedback and review.

Best Regards,
Bryan

On Mon, Mar 25, 2019 at 2:36 PM Wes McKinney wrote:
> hi Bryan,
>
> I agree this would be useful to work out.
>
> There are a few options:
>
> * Sending multiple tensors as a sequence of encapsulated IPC messages
> (as described in
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst).
> There is no conflict with the columnar streaming protocol that
> prevents this
> * Embedding tensors in BinaryArray columns in some way (e.g. as an
> ExtensionType, which we have now in C++)
> * Adding Tensor as a logical type (this is essentially ARROW-1614)
>
> I would like to understand the use cases more precisely. Perhaps you
> can write a design document that describes the use cases in detail and
> a proposed solution? This doesn't fall anywhere on my list of 2019
> priorities but I'm happy to give feedback on discussions and review
> PRs where relevant.
>
> In conjunction with embedding sequences of tensors in a BinaryArray,
> we would probably need to first develop a LargeBinaryArray with 64-bit
> offsets, so that buffers can be arbitrarily large (well, within 64-bit
> address space at least)
>
> - Wes
>
> On Fri, Mar 22, 2019 at 1:24 PM Bryan Cutler wrote:
> >
> > Hi All,
> >
> > Recently I have been working with the TensorFlow SIG-IO community to
> > introduce Apache Arrow-based Datasets for bringing Arrow data into
> > TensorFlow. SIG-IO is a community-maintained repository focused on
> > input/output support for TF, see https://github.com/tensorflow/io (a
> > lot of formats from contrib/ ended up here). Since it is community
> > driven, if anyone is interested, participation is highly encouraged!
> >
> > I'm bringing this up for a couple of reasons. First, I want to make
> > sure that this stays in line with any related efforts within the Arrow
> > project and welcome any feedback. Secondly, the initial response has
> > been great and people are excited about using Arrow and looking to use
> > it in other areas of TF, but I've noticed there has been some confusion
> > about how Arrow handles tensor data. Specifically, it gets assumed that
> > tensors could be part of a RecordBatch and could be readily used in an
> > Arrow stream.
> >
> > I know we have talked about making tensors a logical type for columnar
> > data before in
> > https://lists.apache.org/thread.html/6cc86d50d92dbd21d6fc34e34485afb3cab4956fbc0d61ff9b99ea27@%3Cdev.arrow.apache.org%3E
> > and there is a JIRA ARROW-1614, but since there is work needed to fully
> > support the current spec for 1.0, I don't think it has moved forward
> > much. I'm wondering if maybe now is a better time to start working on
> > this? I think having built-in support for tensor columns would really
> > help to increase adoption of Arrow in frameworks that use tensor data.
> > What are other people's thoughts?
> >
> > Best Regards,
> > Bryan
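
To make the BinaryArray option from the thread concrete, here is a minimal sketch. It is not Arrow's API and not something proposed in the thread: it uses only the Python standard library to mimic a variable-length binary column (one contiguous data buffer plus 32-bit offsets) holding serialized tensors. The header layout (dimension count, then dimensions, then float32 data) and the names `encode_tensor`, `build_binary_column` and `decode_tensor` are assumptions made purely for illustration.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed offset can hold


def encode_tensor(shape, values):
    """Serialize a float32 tensor as: ndim (int64), each dim (int64), data.

    This header layout is hypothetical, chosen only for illustration.
    """
    header = struct.pack("<q", len(shape)) + struct.pack(f"<{len(shape)}q", *shape)
    body = struct.pack(f"<{len(values)}f", *values)
    return header + body


def build_binary_column(blobs):
    """Lay out byte strings like a variable-length binary column:
    one contiguous data buffer plus n + 1 offsets delimiting each value."""
    offsets = [0]
    for blob in blobs:
        end = offsets[-1] + len(blob)
        if end > INT32_MAX:
            # This is the case that 64-bit offsets would lift.
            raise OverflowError("data exceeds what 32-bit offsets can address")
        offsets.append(end)
    return offsets, b"".join(blobs)


def decode_tensor(offsets, data, i):
    """Slice value i out of the data buffer and rebuild (shape, values)."""
    blob = data[offsets[i]:offsets[i + 1]]
    (ndim,) = struct.unpack_from("<q", blob, 0)
    shape = struct.unpack_from(f"<{ndim}q", blob, 8)
    count = 1
    for dim in shape:
        count *= dim
    values = struct.unpack_from(f"<{count}f", blob, 8 + 8 * ndim)
    return shape, values


# Two tensors of different shapes stored as two values of one column.
tensors = [((2, 2), [1.0, 2.0, 3.0, 4.0]), ((3,), [5.0, 6.0, 7.0])]
offsets, data = build_binary_column([encode_tensor(s, v) for s, v in tensors])
```

With 32-bit signed offsets, a single column of this kind can address at most 2**31 - 1 bytes of tensor data in total, which is the limit that the LargeBinaryArray remark in the thread is about.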