arrow-user mailing list archives

From Jason Sachs <jmsa...@gmail.com>
Subject Re: Tabular ID query (subframe selection based on an integer ID)
Date Wed, 11 Nov 2020 19:36:35 GMT
Ugh, let me reformat that, since the PonyMail browser interface treats ">>>" as a
triple-quoted message.

<<< t = pa.Table.from_pandas(df0)
<<< t
pyarrow.Table
timestamp: int64
index: int32
value: int64
<<< import pyarrow.compute as pc
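<<< def select_by_index(table, ival):
        value_index = table.column('index')
        # Convert ival to the column's own dtype (int32 here) before comparing
        index_type = value_index.type.to_pandas_dtype()
        mask = pc.equal(value_index, index_type(ival))
        # Keep only the rows where the mask is true
        return table.filter(mask)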
<<< %timeit t2 = select_by_index(t, 515)
2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
<<< %timeit t2 = select_by_index(t, 3)
8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
<<< %timeit df0[df0['index'] == 515]
1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
<<< %timeit df0[df0['index'] == 3]
10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
<<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
                                 np.count_nonzero(df0['index'] == 3),
                                 np.count_nonzero(df0['index'] == 515)))
ALL:1225000, 3:200000, 515:195
<<< df0.memory_usage()
Index            128
timestamp    9800000
index        4900000
value        9800000
dtype: int64

