Adding to the performance scenario: I also implemented some operators on top of the Arrow compute API, and I observed performance similar to NumPy and Pandas.
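
For context, the kind of operator I mean looks roughly like this (a minimal sketch; filter_equal is just an illustrative name, not my actual implementation):

import pyarrow as pa
import pyarrow.compute as pc

def filter_equal(table: pa.Table, column: str, value) -> pa.Table:
    # Build a boolean mask with an Arrow compute kernel, then apply it
    # with Table.filter; no conversion to Pandas/NumPy is involved.
    mask = pc.equal(table.column(column), value)
    return table.filter(mask)

t = pa.table({'index': [1, 2, 3, 2], 'value': [10, 20, 30, 40]})
print(filter_equal(t, 'index', 2).to_pydict())
# {'index': [2, 2], 'value': [20, 40]}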

But underneath, I observed that Pandas delegates these operations to NumPy ops:

(https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/ops/array_ops.py#L195,
https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/series.py#L4999)
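
To illustrate the delegation (a small self-contained check, not taken from the Pandas source):

import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, 3], dtype='int64')
# The Series comparison and the NumPy ufunc produce the same mask,
# because Pandas dispatches the comparison to NumPy underneath.
assert np.array_equal((s == 3).to_numpy(), np.equal(s.to_numpy(), 3))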

@Wes

So this would mean that Pandas may have performance similar to NumPy in these filtering cases. Is this a correct assumption?
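
One way to test that assumption would be to time just the mask creation on the same data in both (a sketch; sizes and values are arbitrary):

import timeit
import numpy as np
import pandas as pd

arr = np.random.randint(0, 1000, size=10_000_000)
df = pd.DataFrame({'index': arr})

# Time only the boolean-mask creation, the step Pandas hands to NumPy.
t_np = timeit.timeit(lambda: arr == 515, number=20)
t_pd = timeit.timeit(lambda: df['index'] == 515, number=20)
print("numpy: %.3fs, pandas: %.3fs" % (t_np, t_pd))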

But the filter compute function itself was very fast; most of the time is spent on creating the mask when there are multiple columns.
For about 10M records, I observed roughly a 1.5x ratio in execution time between the Arrow-compute-based filtering method and Pandas.
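
What I mean by the multi-column case is roughly this (a sketch; the column names are made up):

import pyarrow as pa
import pyarrow.compute as pc

t = pa.table({'a': [1, 2, 3, 2], 'b': [10, 20, 30, 20]})
# One equality kernel per column, combined with and_; the mask creation
# is the costly part here, while the final filter itself is fast.
mask = pc.and_(pc.equal(t.column('a'), 2), pc.equal(t.column('b'), 20))
print(t.filter(mask).to_pydict())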

Is the performance gap due to vectorization or some other factor?


With Regards, 
Vibhatha Abeykoon


On Wed, Nov 11, 2020 at 2:36 PM Jason Sachs <jmsachs@gmail.com> wrote:
Ugh, let me reformat that since the PonyMail browser interface thinks ">>>" is a triply quoted message.

<<< t = pa.Table.from_pandas(df0)
<<< t
pyarrow.Table
timestamp: int64
index: int32
value: int64
<<< import pyarrow.compute as pc
<<< def select_by_index(table, ival):
     value_index = table.column('index')
     index_type = value_index.type.to_pandas_dtype()
     mask = pc.equal(value_index, index_type(ival))
     return table.filter(mask)
<<< %timeit t2 = select_by_index(t, 515)
2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
<<< %timeit t2 = select_by_index(t, 3)
8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
<<< %timeit df0[df0['index'] == 515]
1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
<<< %timeit df0[df0['index'] == 3]
10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
<<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
                                 np.count_nonzero(df0['index'] == 3),
                                 np.count_nonzero(df0['index'] == 515)))
ALL:1225000, 3:200000, 515:195
<<< df0.memory_usage()
Index            128
timestamp    9800000
index        4900000
value        9800000
dtype: int64