From: Yeshwanth Sriram
Subject: Re: [C++] - How to extract indices of nested MapArray
Date: Wed, 3 Mar 2021 15:41:16 -0800
To: user@arrow.apache.org, emkornfield@gmail.com

Hi Micah,

Thank you for the detailed response. Apologies for not responding earlier.

a.) I looked at the latencies with and without filtering, based on just the for-each loops, and the latency is dominated by the Parquet write operation. So I’m going to go with what I have, which already provides a substantial improvement for my use case.

b.) I would like to contribute by implementing ANY over booleans as an Arrow compute kernel. I am waiting for permission to come through.

I’m also interested in contributing to an Azure/ADLS filesystem, but the library I was looking at, https://github.com/Azure/azure-sdk-for-cpp, is C++14. Is C++14 a no-go as a dependency in Arrow (even a conditional one)?

Thank you
Yesh

On Feb 28, 2021, at 2:09 PM, Micah Kornfield <emkornfield@gmail.com> wrote:

Hi Yeshwanth,
I think you can do the first part of the filtering using the Equals kernel and the IsIn kernel on the child arrays of the Map. I took a quick look, but I don't think anything is implemented that would allow you to map the resulting bitmaps back to the parent lists. It seems that we would want to add an "Any" function for List<Bool> that returns a Bool array indicating, for each list, whether any of its elements are true. There is already one for flat Boolean arrays [1], but I don't think that is useful here.

So I think the logic that you would ultimately want, in pseudo-code, is:

children_bitmap = Equals(map.key, "some string") && IsIn(map.struct.id, ["aaa", "bee", "see"])
list = MakeList(map.offsets, children_bitmap)
final_selection = Any(list)
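
A minimal sketch of the three pseudo-code steps above, written in plain C++ over std::vector stand-ins instead of Arrow arrays or kernels. The names FlatMapColumn, ChildMatches and AnyPerList are hypothetical and are not part of the Arrow API; the sketch only illustrates how the map offsets turn a per-entry bitmap into a per-row selection, assuming each row's entries occupy the range [offsets[r], offsets[r + 1]).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a Map column (not an Arrow type):
// offsets delimit each row's entries; keys and ids are the flattened children.
struct FlatMapColumn {
  std::vector<int32_t> offsets;   // size = number of rows + 1
  std::vector<std::string> keys;  // flattened map keys
  std::vector<std::string> ids;   // flattened value.id fields
};

// Step 1: children_bitmap = Equals(key, filter_key) && IsIn(id, filter_ids)
std::vector<bool> ChildMatches(const FlatMapColumn& col,
                               const std::string& filter_key,
                               const std::unordered_set<std::string>& filter_ids) {
  assert(col.keys.size() == col.ids.size());
  std::vector<bool> out(col.keys.size(), false);
  for (std::size_t i = 0; i < col.keys.size(); ++i) {
    out[i] = col.keys[i] == filter_key && filter_ids.count(col.ids[i]) > 0;
  }
  return out;
}

// Steps 2 and 3: view the child bitmap through the offsets as one list per
// row, then reduce each list with "any". Empty rows yield false.
std::vector<bool> AnyPerList(const std::vector<int32_t>& offsets,
                             const std::vector<bool>& child) {
  assert(!offsets.empty());
  std::vector<bool> out(offsets.size() - 1, false);
  for (std::size_t r = 0; r + 1 < offsets.size(); ++r) {
    for (int32_t i = offsets[r]; i < offsets[r + 1]; ++i) {
      if (child[static_cast<std::size_t>(i)]) {
        out[r] = true;
        break;  // one matching entry is enough for this row
      }
    }
  }
  return out;
}
```

The result of AnyPerList has one boolean per input row, so it has the shape of the final_selection mask in the pseudo-code. A real kernel would also need to decide how null rows and null entries are treated, which this sketch ignores.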

Is the new kernel something you would be interested in contributing?

-Micah

[1] https://github.com/apache/arrow/pull/8294

On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram <yeshsriram@icloud.com> wrote:
I am using C++/Arrow to filter large Parquet files, and I’m able to do this successfully. The current PoC implementation is based on nested for loops, which I would like to avoid. Instead, I would like to use the built-in filter/take functions, or get some recommendations on how to extract (with the take functions?) arrays of indices or booleans to filter out rows.

The input (data) array/column type is MapArray[key:String, value:StructArray[id:String, …]]

The input filter is a {filter_key: "some string", filter_ids: ["aaa", "bee", "see", ..] }
  - Where filter_key and filter_ids are matched against the contents of the input MapArray

The output I’m looking for is either an array of booleans or an array of indices into the input array, marking the rows that match the input filter.

Thank you
