From user-return-494-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Mon Jun 8 07:21:56 2020
Mailing-List: contact user-help@arrow.apache.org; run by ezmlm
Reply-To: user@arrow.apache.org
MIME-Version: 1.0
From: Yue Ni
Date: Mon, 8 Jun 2020 15:21:16 +0800
Subject: Apply arrow compute functions to an iterator like data structure for Array
To: user@arrow.apache.org
Content-Type: multipart/alternative;
boundary="000000000000008ff905a78d7774"

--000000000000008ff905a78d7774
Content-Type: text/plain; charset="UTF-8"

Hi there,

I am experimenting with some computation over a stream of record batches and would like to use some functions in arrow::compute. Currently, the functions in arrow::compute accept a *Datum* in their API, which allows users to pass 1) an Array, 2) a ChunkedArray, or 3) a Table. In my case, however, I have a stream of record batches to read from a Java-iterator-like interface: it lets you read one new batch at a time using a "next" function.

I can adapt it to the arrow::RecordBatchReader interface, and I wonder how I can apply Arrow compute functions such as "sum" or "dictionary encode" to a record batch stream like this.

Is this possible, and what is the recommended way to do it in Arrow? I am aware that I can put multiple non-contiguous arrays into a ChunkedArray and consume it with the Arrow compute functions, but that requires consuming the entire stream and buffering every batch in memory, because a vector of Arrays has to be constructed from the record batch stream first (if I understand ChunkedArray correctly). In many cases this is unnecessary. For "sum", for example, I think only the global sum state and the array from the current batch need to be in memory, so I would like to know whether there is an alternative approach.

Thanks.

Regards,
Yue

--000000000000008ff905a78d7774
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--000000000000008ff905a78d7774--
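
To make the streaming "sum" described in the message concrete, here is a minimal sketch in plain C++17. It does not use Arrow at all: `Batch`, `BatchStream`, `SumState`, and `StreamingSum` are hypothetical names invented for illustration, not part of arrow::compute or arrow::RecordBatchReader. The sketch only shows the shape of the computation being asked about, where a running state is updated once per batch and each batch can be dropped as soon as it has been consumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for one numeric column of one record batch.
using Batch = std::vector<int64_t>;

// Hypothetical iterator-like source. Next() yields one batch at a time and
// std::nullopt once the stream is exhausted. For the sake of a self-contained
// demo the batches are stored up front; a real source would read them lazily.
class BatchStream {
 public:
  explicit BatchStream(std::vector<Batch> batches)
      : batches_(std::move(batches)) {}

  std::optional<Batch> Next() {
    if (pos_ >= batches_.size()) return std::nullopt;
    return std::move(batches_[pos_++]);
  }

 private:
  std::vector<Batch> batches_;
  std::size_t pos_ = 0;
};

// Running aggregate state: the only data kept across batches.
class SumState {
 public:
  // Fold the values of the current batch into the running sum.
  void Consume(const Batch& batch) {
    for (int64_t v : batch) sum_ += v;
  }

  // Return the sum over everything consumed so far.
  int64_t Finalize() const { return sum_; }

 private:
  int64_t sum_ = 0;
};

// Drain the stream batch by batch; each batch goes out of scope at the end
// of its loop iteration, so it is never buffered alongside later batches.
int64_t StreamingSum(BatchStream& stream) {
  SumState state;
  while (auto batch = stream.Next()) {
    state.Consume(*batch);
  }
  return state.Finalize();
}
```

An operation such as "dictionary encode" would need a larger state (the dictionary built so far) but could follow the same consume-then-finalize pattern. Whether arrow::compute exposes its kernels in this incremental form is exactly the open question of the message.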