From: Yue Ni <niyue.com@gmail.com>
Date: Mon, 8 Jun 2020 21:44:42 +0800
Subject: Re: Apply arrow compute functions to an iterator-like data structure for Array
To: user@arrow.apache.org

Thanks for the pointer, Wes. I will look into it and will keep an eye on the new PRs/JIRAs on arrow/compute.

On Mon, Jun 8, 2020 at 9:26 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> I'm presently working on building general streaming / iterative
> function execution machinery that should eventually serve this use
> case (see recent patches in arrow/compute/* and related JIRAs), but
> there are not yet APIs available that do exactly what you're looking
> for. Depending on your appetite for low-level development, you can look
> at the details of how functions like DictionaryEncode are executed on
> chunked data (see ExecBatchIterator and VectorExecutor in
> compute/exec_internal.h / compute/exec.cc).
>
> On Mon, Jun 8, 2020 at 2:21 AM Yue Ni <niyue.com@gmail.com> wrote:
> >
> > Hi there,
> >
> > I am experimenting with some computation over a stream of record batches
> > and would like to use some functions in arrow::compute. Currently, the
> > functions in arrow::compute accept the *Datum* data structure in their
> > APIs, which lets users pass 1) an Array, 2) a ChunkedArray, or 3) a Table.
> > However, in my case I have a stream of record batches read from a Java
> > iterator-like interface; basically, it lets you read one batch at a time
> > using a "next" function.
> >
> > I can adapt it to the arrow::RecordBatchReader interface, and I wonder
> > how I can apply arrow compute functions like "sum"/"dictionary encode"
> > to record batch streams like this.
> >
> > Is this possible, and what is the recommended way to do it in Arrow? I am
> > aware that I can put multiple non-contiguous arrays into a ChunkedArray
> > and consume that with the arrow compute functions, but that requires
> > consuming the whole stream and buffering it all in memory, because users
> > need to construct a vector of Arrays from the record batch stream (if I
> > understand ChunkedArray correctly), and that is unnecessary in many cases.
> > For "sum", for example, only the global sum state and the array from the
> > current batch need to be in memory, so I would like to know whether there
> > is an alternative approach. Thanks.
> >
> > Regards,
> > Yue
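
To make the streaming pattern the question describes concrete, here is a minimal sketch in C++, assuming the new arrow::compute API from the patches Wes mentions (CallFunction returning Result<Datum>) and a non-null float64 column. NextFnBatchReader and StreamingSum are illustrative names, not existing Arrow APIs: the adapter wraps a "next"-style batch source as an arrow::RecordBatchReader, and the sum keeps only the running total plus the current batch in memory.

#include <functional>
#include <memory>
#include <utility>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Hypothetical adapter: exposes a "next"-style batch source (e.g. one backed
// by a Java iterator) as an arrow::RecordBatchReader. The callback returns
// nullptr when the stream is exhausted.
class NextFnBatchReader : public arrow::RecordBatchReader {
 public:
  NextFnBatchReader(std::shared_ptr<arrow::Schema> schema,
                    std::function<std::shared_ptr<arrow::RecordBatch>()> next)
      : schema_(std::move(schema)), next_(std::move(next)) {}

  std::shared_ptr<arrow::Schema> schema() const override { return schema_; }

  arrow::Status ReadNext(std::shared_ptr<arrow::RecordBatch>* batch) override {
    *batch = next_();  // nullptr signals end of stream
    return arrow::Status::OK();
  }

 private:
  std::shared_ptr<arrow::Schema> schema_;
  std::function<std::shared_ptr<arrow::RecordBatch>()> next_;
};

// Illustrative streaming sum: applies the "sum" kernel batch by batch and
// folds the partial results into a running total, so only the current batch
// and one double are held in memory at any time.
arrow::Result<double> StreamingSum(arrow::RecordBatchReader* reader,
                                   int column_index) {
  double total = 0.0;
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    ARROW_ASSIGN_OR_RAISE(
        arrow::Datum partial,
        arrow::compute::CallFunction("sum", {batch->column(column_index)}));
    if (partial.scalar()->is_valid) {  // an all-null batch yields a null scalar
      total += std::static_pointer_cast<arrow::DoubleScalar>(partial.scalar())->value;
    }
  }
  return total;
}

This hand-rolled folding only works for aggregations whose state composes trivially across batches (sum, count, min/max); something like "mean" needs the sum and the count carried separately, which is the kind of per-function state that general streaming execution machinery would manage.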
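
For contrast, a sketch of the ChunkedArray route described in the question: drain the whole stream into a Table and hand one of its columns (a ChunkedArray, which is a valid Datum) to a compute function such as dictionary_encode. EncodeWholeStream is an illustrative name, and the obvious cost is that every batch stays buffered in memory.

#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Illustrative buffering approach: collect every batch, build a Table, then
// run the kernel over the resulting ChunkedArray column in one call.
arrow::Result<arrow::Datum> EncodeWholeStream(arrow::RecordBatchReader* reader,
                                              int column_index) {
  std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;
    batches.push_back(batch);  // everything is held in memory from here on
  }
  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::Table> table,
      arrow::Table::FromRecordBatches(reader->schema(), batches));
  // table->column(i) is a ChunkedArray; Datum accepts it directly.
  return arrow::compute::CallFunction("dictionary_encode",
                                      {table->column(column_index)});
}

Depending on the Arrow version, each output chunk may carry its own dictionary, so check the result before assuming a single unified dictionary across chunks.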