Hi Micah,

Thank you for the detailed response. I apologize for not responding earlier.

a.) I looked at the latencies with and without filtering, using just the for-each loops, and the latency is dominated by the Parquet write operation. So I’m going to go with what I have, which already provides a substantial improvement for my use case.

b.) I would like to contribute by implementing ANY over booleans as an Arrow compute kernel. I am waiting for permission to come through.

I’m also interested in contributing an Azure/ADLS filesystem, but the library I was looking at (https://github.com/Azure/azure-sdk-for-cpp) requires C++14. Is C++14 a no-go as a dependency in Arrow (even a conditional one)?

Thank you
Yesh

On Feb 28, 2021, at 2:09 PM, Micah Kornfield <emkornfield@gmail.com> wrote:

Hi Yeshwanth,
I think you can do the first part of the filtering using the Equals kernel and the IsIn kernel on the child arrays of the Map. I took a quick look, but I don't think there is anything implemented that would allow you to map the resulting bitmaps back to the parent lists. It seems that we would want to add an "Any" function for List<Bool> that returns a Bool array indicating, for each list, whether any of its elements is true. There is already one for flat Boolean arrays [1], but I don't think that is useful here.

So I think the logic that you would ultimately want, in pseudo-code, is:

children_bitmap = Equals(map.key, "some string") && IsIn(map.struct.id, ["aaa", "bee", "see"])
list = MakeList(map.offsets, children_bitmap)
final_selection = Any(list)
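
To make the three steps concrete, here is a minimal sketch in plain C++ that does not use the Arrow API at all. FlatMapColumn, ChildMask and AnyPerRow are hypothetical names invented for illustration, the column is modelled as plain std::vector buffers, and nulls are ignored. It only shows the intended semantics: a per-entry mask followed by an "any" reduction over each row's offset range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical flattened view of the map column (not an Arrow type):
// row i owns the entries in [offsets[i], offsets[i + 1]).
struct FlatMapColumn {
  std::vector<int32_t> offsets;   // size is num_rows + 1
  std::vector<std::string> keys;  // map key, one per entry
  std::vector<std::string> ids;   // value.id, one per entry
};

// Step 1: per-entry mask, i.e. Equals(key, filter_key) && IsIn(id, filter_ids).
std::vector<bool> ChildMask(const FlatMapColumn& col,
                            const std::string& filter_key,
                            const std::unordered_set<std::string>& filter_ids) {
  assert(col.keys.size() == col.ids.size());
  std::vector<bool> mask(col.keys.size());
  for (std::size_t i = 0; i < col.keys.size(); ++i) {
    mask[i] = col.keys[i] == filter_key && filter_ids.count(col.ids[i]) > 0;
  }
  return mask;
}

// Steps 2 and 3: regroup the mask by the parent offsets and reduce each
// group with "any". A row with no entries yields false.
std::vector<bool> AnyPerRow(const std::vector<int32_t>& offsets,
                            const std::vector<bool>& child_mask) {
  assert(!offsets.empty());
  assert(static_cast<std::size_t>(offsets.back()) <= child_mask.size());
  std::vector<bool> out(offsets.size() - 1, false);
  for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
    for (int32_t j = offsets[row]; j < offsets[row + 1]; ++j) {
      if (child_mask[j]) {
        out[row] = true;
        break;  // one match is enough for this row
      }
    }
  }
  return out;
}
```

Calling AnyPerRow(col.offsets, ChildMask(col, key, ids)) gives one boolean per input row, which is the kind of selection mask a filter step would consume. Note that in this sketch the key and the id must match within the same map entry.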

Is the new Kernel something you would be interested in contributing? 

-Micah


On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram <yeshsriram@icloud.com> wrote:
I'm using C++/Arrow to filter large Parquet files, and I’m able to do this successfully. The current PoC implementation is based on nested for loops, which I would like to avoid. Instead, I would like to use the built-in filter/take functions, or get some recommendations on how to extract arrays of indices or booleans (for the take functions?) to filter out rows.

The input (data) array/column type is MapArray[key:String, value:StructArray[id:String, …]]

The input filter is {filter_key: “some string”, filter_ids: [“aaa”, “bee”, “see”, ..] }
  - where filter_key and filter_ids are matched against the contents of the input MapArray

The output I’m looking for is either an array of booleans or an array of indices into the input array for the rows that match the input filter.

Thank you