Adding to the performance scenario, I also implemented some operators on top of the Arrow compute API.
I also observed similar performance when compared to Numpy and Pandas.
Underneath Pandas, though, what I observed was the use of NumPy ops,
so this would mean that Pandas should have performance similar to NumPy in filtering cases. Is this a correct assumption?
The filter compute function itself was very fast, though. Most of the time is spent creating the mask when there are multiple columns.
For about 10M records, I observed a ratio of 1.5 in execution time between the Arrow compute based filtering method and Pandas.
Is this performance gap due to vectorization or some other factor?