arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Tabular ID query (subframe selection based on an integer ID)
Date Thu, 12 Nov 2020 15:40:11 GMT
In my setup here I did:

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

num_rows = 10_000_000
data = np.random.randn(num_rows)

df = pd.DataFrame({'data{}'.format(i): data
                   for i in range(100)})

df['key'] = np.random.randint(0, 100, size=num_rows)

rb = pa.record_batch(df)
t = pa.table(df)

I found that the performance of filtering a record batch is very similar:

In [22]: %timeit df[df.key == 5]
71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Whereas the performance of filtering a table is absolutely abysmal (no
idea what's going on here)

In [23]: %timeit t.filter(pc.equal(t[-1], 5))
961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A few obvious notes:

* Evidently, these code paths haven't been greatly optimized, so
someone ought to take a look at this
* Everything here is single-threaded in Arrow-land. The end-goal for
all of this is to parallelize everything (predicate evaluation,
filtering) on the CPU thread pool

On Wed, Nov 11, 2020 at 4:27 PM Vibhatha Abeykoon <vibhatha@gmail.com> wrote:
>
> Adding to the performance scenario, I also implemented some operators on top of the Arrow compute API.
> I also observed similar performance when compared to Numpy and Pandas.
>
> But underneath, what I observed was that Pandas uses NumPy ops:
>
> (https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/ops/array_ops.py#L195,
> https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/series.py#L4999)
>
> @Wes
>
> So this would mean that Pandas may have similar performance to NumPy in filtering cases. Is this a correct assumption?
>
> But the filter compute function itself was very fast. Most of the time is spent on creating the mask when there are multiple columns.
> For about 10M records I observed a 1.5x ratio of execution time between the Arrow-compute-based filtering method and Pandas.
>
> Is the performance gap due to vectorization or some other factor?
>
>
> With Regards,
> Vibhatha Abeykoon
>
>
> On Wed, Nov 11, 2020 at 2:36 PM Jason Sachs <jmsachs@gmail.com> wrote:
>>
>> Ugh, let me reformat that, since the PonyMail browser interface thinks ">>>" is a triply quoted message.
>>
>> <<< t = pa.Table.from_pandas(df0)
>> <<< t
>> pyarrow.Table
>> timestamp: int64
>> index: int32
>> value: int64
>> <<< import pyarrow.compute as pc
>> <<< def select_by_index(table, ival):
>>      value_index = table.column('index')
>>      index_type = value_index.type.to_pandas_dtype()
>>      mask = pc.equal(value_index, index_type(ival))
>>      return table.filter(mask)
>> <<< %timeit t2 = select_by_index(t, 515)
>> 2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< %timeit t2 = select_by_index(t, 3)
>> 8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< %timeit df0[df0['index'] == 515]
>> 1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>> <<< %timeit df0[df0['index'] == 3]
>> 10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
>>                                  np.count_nonzero(df0['index'] == 3),
>>                                  np.count_nonzero(df0['index'] == 515)))
>> ALL:1225000, 3:200000, 515:195
>> <<< df0.memory_usage()
>> Index            128
>> timestamp    9800000
>> index        4900000
>> value        9800000
>> dtype: int64
>>
