arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [C++] - How to extract indices of nested MapArray
Date Sun, 28 Feb 2021 22:09:25 GMT
Hi Yeshwanth,
I think you can do the first part of the filtering using the Equals kernel
and IsIn kernel on the child arrays of the Map.  I took a quick look but I
don't think that there is anything implemented that would allow you to map
the resulting bitmaps to the parent lists. It seems that we would want to
add an "Any" function for List<Bool> that returns a Bool array with one
value per list, true if any of that list's elements are true. There is
already one for flat Boolean Arrays [1], but it reduces the whole array to
a single value, so I don't think it is useful here.

So I think the logic that you would ultimately want, in pseudo-code, is:

children_bitmap = Equals(map.key, "some string") &&
                  IsIn(map.struct.id, ["aaa", "bee", "see"])
list = MakeList(map.offsets, children_bitmap)
final_selection = Any(list)
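
As a rough illustration of what the last two steps compute, here is a
minimal standalone C++ sketch. It does not use Arrow: AnyPerList is a
hypothetical helper name, and plain std::vector values stand in for the
map's offsets buffer and the combined child bitmap. Nulls are ignored, and
an empty row is assumed to yield false.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an Arrow API: reduces a per-entry boolean mask
// to a per-row mask. Row i owns the child entries in the half-open range
// [offsets[i], offsets[i + 1]), as in a list/map offsets buffer.
std::vector<bool> AnyPerList(const std::vector<int32_t>& offsets,
                             const std::vector<bool>& child_mask) {
  // Offsets must have at least one entry and must not point past the mask.
  assert(!offsets.empty());
  assert(static_cast<std::size_t>(offsets.back()) <= child_mask.size());

  std::vector<bool> out(offsets.size() - 1, false);
  for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
    for (int32_t j = offsets[row]; j < offsets[row + 1]; ++j) {
      if (child_mask[j]) {
        out[row] = true;  // one matching entry is enough for this row
        break;
      }
    }
  }
  return out;
}
```

For example, offsets {0, 2, 2, 5} with child mask {false, true, false,
false, false} describe three rows (the second one empty) and give
{true, false, false}. A real kernel would also have to decide how null
rows and null entries are treated.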

Is the new Kernel something you would be interested in contributing?

-Micah

[1] https://github.com/apache/arrow/pull/8294

On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram <yeshsriram@icloud.com>
wrote:

> I'm using C++/Arrow to filter large Parquet files, and I'm able to do this
> successfully. The current PoC implementation is based on nested for loops,
> which I would like to avoid. Instead, I'd like to use the built-in
> filter/take functions, or get some recommendations on how to extract
> arrays of indices or booleans to filter out rows.
>
> The input (data) array/column type is MapArray[key:String,
> value:StructArray[id:String, …]]
>
> The input filter is a {filter_key: “some string”, filter_ids: [“aaa”,
> “bee”, “see”, ..] }
>   - Where filter_key and filter_ids are to match the contents of the input
>     MapArray
>
> The output I’m looking for is either an array of booleans or the indices
> of the input array that match the input filter.
>
> Thank you
