arrow-user mailing list archives

From Jason Sachs <jmsa...@gmail.com>
Subject Re: Tabular ID query (subframe selection based on an integer ID)
Date Wed, 11 Nov 2020 19:33:27 GMT
On 2020/11/11 18:17:32, Wes McKinney <wesmckinn@gmail.com> wrote: 
> You should be able to use the kernels available in pyarrow.compute to
> do this -- there might be a few that are missing, but if you can't
> find what you need please open a Jira issue so it goes into the
> backlog
> 
> On Wed, Nov 11, 2020 at 11:43 AM Jason Sachs <jmsachs@gmail.com> wrote:
> >
> > I do a lot of the following operation:
> >
> >     subframe = df[df['ID'] == k]
> >
> > where df is a Pandas DataFrame with a small number of columns but a moderately large
> > number of rows (say 200K - 5M). The columns are usually simple... for example's sake let's
> > call them int64 TIMESTAMP, uint32 ID, int64 VALUE.
> >
> > I am moving the source data to Parquet format. I don't really care whether I do
> > this in PyArrow or Pandas, but I need to perform these subframe selections frequently and
> > would like to speed them up. (The idea being, load the data into memory once, and then expect
> > to perform subframe selection anywhere from 10 - 1000 times to extract appropriate data for
> > further processing.)
> >
> > Is there a suggested method? Any ideas?
> >
> > I've tried
> >
> >     subframe = df.query('ID == %d' % k)
> >
> > and flirted with the idea of using Gandiva as per https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/
> > but it looks a bit rough + I had to manually tweak the types of literal constants to support
> > something other than a float64.
> 

I'm a bit of a beginner in pyarrow, so I have something that kind of works (see https://stackoverflow.com/questions/64581590),
but it doesn't seem to be significantly more efficient than pandas filtering; Joris filed https://issues.apache.org/jira/browse/ARROW-10423.

Below is an example in Jupyter (DataFrame df0 is a 1.2M-row dataset where some IDs have 200K rows
each and others have only 195 rows, so it covers both dense and sparse content). I'm not sure how
to copy+paste the output so that input and output are clearly delineated; I've manually added
">>>" prompts to help.

I guess the good news is that Pandas does pretty well on its own. The whole dataset takes
about 24.5 MB; a 10 ms query works out to roughly 2.45 GB/s of brute-force scanning, so I suppose
I shouldn't complain too much.
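For context, the setup isn't shown in the session below; a dataset with roughly the same shape
could be built like this (the ID values and the random values here are made up, only the sizes
and dtypes are meant to match df0):

    import numpy as np
    import pandas as pd
    import pyarrow as pa

    # A handful of dense IDs (~200K rows each) plus many sparse IDs (~195 rows each),
    # giving roughly 1.2M rows with int64/int32/int64 columns as in the schema below.
    dense_ids = np.repeat(np.arange(6, dtype=np.int32), 200_000)
    sparse_ids = np.repeat(np.arange(500, 628, dtype=np.int32), 195)
    ids = np.concatenate([dense_ids, sparse_ids])
    rng = np.random.default_rng(0)
    df0 = pd.DataFrame({
        'timestamp': np.arange(len(ids), dtype=np.int64),
        'index': ids,
        'value': rng.integers(0, 1000, size=len(ids), dtype=np.int64),
    })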

>>> t = pa.Table.from_pandas(df0)
>>> t
pyarrow.Table
timestamp: int64
index: int32
value: int64
>>> import pyarrow.compute as pc
>>> def select_by_index(table, ival):
...     value_index = table.column('index')
...     index_type = value_index.type.to_pandas_dtype()
...     mask = pc.equal(value_index, index_type(ival))
...     return table.filter(mask)
>>> %timeit t2 = select_by_index(t, 515)
2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit t2 = select_by_index(t, 3)
8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df0[df0['index'] == 515]
1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df0[df0['index'] == 3]
10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> print("ALL:%d, 3:%d, 515:%d" % (len(df0),
                                np.count_nonzero(df0['index'] == 3),
                                np.count_nonzero(df0['index'] == 515)))
ALL:1225000, 3:200000, 515:195
>>> df0.memory_usage()
Index            128
timestamp    9800000
index        4900000
value        9800000
dtype: int64
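
For the downstream processing itself, the filtered Table converts back to a DataFrame via
to_pandas() (not timed above, shown just to round out the example):

>>> t2 = select_by_index(t, 515)
>>> subframe = t2.to_pandas()
>>> len(subframe)
195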


