arrow-user mailing list archives

From Elad Rosenheim <>
Subject Filtering list/map arrays
Date Fri, 21 May 2021 14:05:55 GMT

One of the gaps I currently have in Funnel Rocket is supporting nested
columns, as in: given a Parquet file with a column of type List(int64), be
able to find rows where the list holds a specific int element.

Right now, the need is fortunately limited to lists of primitives (mostly
int) and maps of string->string, rather than any arbitrary complexity.

Currently, I load Parquet files via pyarrow, then call to_pandas() and run
multiple filters on the DataFrame.

After reading Uwe's blog post and looking at the Fletcher project, it
seems the "proper" way to do it would be:

* Write an ExtensionDtype/ExtensionArray that can wrap an Arrow ChunkedArray
made of ListArrays. I'm not even sure what the operator should be for lookup
in a list: should I treat list_series == 123 as "for each list in this
series, look for the element 123 in it"?

* Potentially use a @jitclass for more performant lookup, as Uwe has

* For now, for any abstract method I'm not sure what to do with - start
with raising an exception, then run some unit tests based on my project's
needs, and see that they pass :-/

* When calling Table.to_pandas(), supply a types_mapper argument to map the
specific supported types to the appropriate extension class.

* If it seems to work, figure out if I've missed something important in the
concrete classes :-/

Am I getting this right, more or less?

Thanks a lot,
