From: Micah Kornfield
Reply-To: emkornfield@gmail.com
Date: Sun, 28 Feb 2021 14:09:25 -0800
Subject: Re: [C++] - How to extract indices of nested MapArray
To: user@arrow.apache.org
In-Reply-To: <6A4907F2-7327-4BA6-B3C6-C1A5BE0C5412@icloud.com>

Hi Yeshwanth,

I think you can do the first part of the filtering using the Equals kernel
and the IsIn kernel on the child arrays of the Map. I took a quick look, but
I don't think there is anything implemented that would allow you to map the
resulting bitmaps to the parent lists. It seems that we would want to add an
"Any" function for List<Bool> that returns a Bool array indicating, for each
list, whether any of its elements are true. There is already one for flat
Boolean Arrays [1], but I don't think that is useful here.

So I think the logic that you would ultimately want, in pseudo-code, is:

    children_bitmap = Equals(map.key, "some string") && IsIn(map.struct.id, ["aaa", "bee", "see"])
    list = MakeList(map.offsets, children_bitmap)
    final_selection = Any(list)

Is the new kernel something you would be interested in contributing?

-Micah

[1] https://github.com/apache/arrow/pull/8294

On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram <yeshsriram@icloud.com> wrote:

> I am using C++/Arrow to filter large Parquet files, and I'm able to do this
> successfully. The current PoC implementation is based on nested for loops,
> which I would like to avoid. Instead I would like to use the built-in
> filter/take functions, or get some recommendations on how to extract
> (take functions?) arrays of indices or booleans to filter out rows.
>
> The input (data) array/column type is MapArray[key:String,
> value:StructArray[id:String, …]]
>
> The input filter is {filter_key: “some string”, filter_ids: [“aaa”,
> “bee”, “see”, ..]}
>   - where filter_key and filter_ids are to match the contents of the input
>     MapArray
>
> The output I'm looking for is either an array of booleans or indices of the
> input array that match the input filter.
>
> Thank you
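For illustration, the three pseudo-code steps in the reply above can be sketched on plain std::vector data. This is only a sketch under stated assumptions and does not use any Arrow API: it assumes the map's flattened keys, the struct's flattened id values, and the list offsets have already been copied into ordinary vectors. MatchEntries and AnyPerList are hypothetical names introduced here; in an Arrow-based version the first step would presumably be done with the Equals and IsIn kernels mentioned in the reply, and the second step is the per-list "Any" that the reply proposes adding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not an Arrow API): one flag per flattened map entry,
// true when the entry's key equals filter_key AND its id is in filter_ids.
// Mirrors: children_bitmap = Equals(map.key, ...) && IsIn(map.struct.id, ...)
std::vector<bool> MatchEntries(const std::vector<std::string>& keys,
                               const std::vector<std::string>& ids,
                               const std::string& filter_key,
                               const std::unordered_set<std::string>& filter_ids) {
  assert(keys.size() == ids.size());  // one id per key in the flattened children
  std::vector<bool> children_bitmap(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    children_bitmap[i] = keys[i] == filter_key && filter_ids.count(ids[i]) > 0;
  }
  return children_bitmap;
}

// Hypothetical helper (not an Arrow API): row i owns the child entries in
// [offsets[i], offsets[i + 1]). The result has one flag per row, true when
// any of that row's entries matched. An empty row yields false.
// Mirrors: final_selection = Any(MakeList(map.offsets, children_bitmap))
std::vector<bool> AnyPerList(const std::vector<std::int32_t>& offsets,
                             const std::vector<bool>& children_bitmap) {
  std::vector<bool> selection;
  if (offsets.empty()) return selection;
  assert(offsets.front() >= 0);
  assert(static_cast<std::size_t>(offsets.back()) <= children_bitmap.size());
  selection.reserve(offsets.size() - 1);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    assert(offsets[i] <= offsets[i + 1]);  // offsets must be non-decreasing
    bool any = false;
    for (std::int32_t j = offsets[i]; j < offsets[i + 1] && !any; ++j) {
      any = children_bitmap[static_cast<std::size_t>(j)];
    }
    selection.push_back(any);
  }
  return selection;
}
```

Calling MatchEntries on the flattened keys and ids and passing the result to AnyPerList together with the offsets gives one boolean per input row, which is the "array of booleans" form of the output asked for in the original question. The row indices, if needed instead, are the positions of the true entries in that result.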