arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [C++] - How to extract indices of nested MapArray
Date Wed, 03 Mar 2021 23:54:24 GMT
I think C++14 is fine for optional dependencies and shouldn't block
any development work right now. Note that we should be able to upgrade
to require a minimum of C++14 as soon as April or May of this year
since we will stop having to support one of the last gcc < 5
toolchains (for R 3.5 IIUC)

On Wed, Mar 3, 2021 at 5:41 PM Yeshwanth Sriram <yeshsriram@icloud.com> wrote:
>
> Hi Micah,
>
> Thank you for the detailed response. Apologies for not responding earlier.
>
> a.) I looked at the latencies with and without filtering, using just the foreach
> loops, and the latency is dominated by the Parquet write operation. So I’m going to
> go with what I have, which already provides a substantial improvement for my use case.
>
> b.) I would like to contribute by implementing ANY over booleans as an Arrow compute
> kernel. I am waiting for permission to come through.
>
> I’m also interested in contributing to the Azure/ADLS filesystem, but the library I was
> looking at is C++14: https://github.com/Azure/azure-sdk-for-cpp . Is C++14 a no-go as a
> dependency in Arrow (even a conditional one)?
>
> Thank you
> Yesh
>
> On Feb 28, 2021, at 2:09 PM, Micah Kornfield <emkornfield@gmail.com> wrote:
>
> Hi Yeshwanth,
> I think you can do the first part of the filtering using the Equals kernel and IsIn kernel
> on the child arrays of the Map. I took a quick look but I don't think that there is anything
> implemented that would allow you to map the resulting bitmaps to the parent lists. It seems
> that we would want to add an "Any" function for List<Bool> that returns a Bool array
> if any of the elements are true. There is already one for flat Boolean Arrays [1] but I don't
> think that is useful here.
>
> So I think the logic that you would ultimately want in pseudo-code:
>
> children_bitmap = Equals(map.key, "some string") && IsIn(map.struct.id, ["aaa", "bee", "see"])
> list = MakeList(map.offsets, children_bitmap)
> final_selection = Any(list)
>
> Is the new Kernel something you would be interested in contributing?
>
> -Micah
>
> [1] https://github.com/apache/arrow/pull/8294
>
> On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram <yeshsriram@icloud.com> wrote:
>>
>> I am using C++/Arrow to filter large Parquet files, and I’m able to do this successfully.
>> The current PoC implementation is based on nested for loops, which I would like to avoid.
>> Instead I would like to use the built-in filter/take functions, or get some recommendations
>> on how to extract arrays of indices or booleans (for the take functions?) to filter out rows.
>>
>> The input (data) array/column type is MapArray[key:String, value:StructArray[id:String, …]]
>>
>> The input filter is a {filter_key: “some string”, filter_ids: [“aaa”, “bee”, “see”, ..] }
>>   - where filter_key and filter_ids are to be matched against the contents of the input MapArray
>>
>> The output I’m looking for is either an array of booleans or the indices of the input
>> array that match the input filter.
>>
>> Thank you
>
>
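
For reference, the three steps in the pseudo-code from Micah's reply (build a per-entry bitmap, regroup it by the map offsets, reduce each group with "any") can be sketched in plain C++ on flattened buffers. This is only an illustration of the logic under stated assumptions, not Arrow code: FlatMapColumn, MatchEntries and AnyPerRow are hypothetical names, no Arrow API is called, and nulls are ignored entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical flattened view of a Map<String, Struct<id: String, ...>> column.
// Row i owns the map entries in the half-open range [offsets[i], offsets[i + 1]).
struct FlatMapColumn {
  std::vector<int32_t> offsets;   // length = number of rows + 1
  std::vector<std::string> keys;  // one key per map entry
  std::vector<std::string> ids;   // one struct "id" field per map entry
};

// Step 1: per-entry mask, the analogue of Equals(key) && IsIn(id).
std::vector<bool> MatchEntries(const FlatMapColumn& col,
                               const std::string& filter_key,
                               const std::unordered_set<std::string>& filter_ids) {
  std::vector<bool> mask(col.keys.size());
  for (std::size_t j = 0; j < col.keys.size(); ++j) {
    mask[j] = col.keys[j] == filter_key && filter_ids.count(col.ids[j]) > 0;
  }
  return mask;
}

// Steps 2 and 3: regroup the entry mask by the offsets and reduce each row
// with "any". An empty row yields false.
std::vector<bool> AnyPerRow(const std::vector<int32_t>& offsets,
                            const std::vector<bool>& child_mask) {
  std::vector<bool> selected;
  if (offsets.empty()) return selected;
  selected.reserve(offsets.size() - 1);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    bool any = false;
    // Stop scanning a row as soon as one matching entry is found.
    for (int32_t j = offsets[i]; j < offsets[i + 1] && !any; ++j) {
      any = child_mask[static_cast<std::size_t>(j)];
    }
    selected.push_back(any);
  }
  return selected;
}
```

Calling AnyPerRow(col.offsets, MatchEntries(col, key, ids)) gives one boolean per input row, which is the "array of booleans" form of the output asked for in the original question. The sketch compiles as C++11, so it does not depend on the C++14 question discussed above.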
