arrow-user mailing list archives

From Jason Sachs <jmsa...@gmail.com>
Subject Tabular ID query (subframe selection based on an integer ID)
Date Wed, 11 Nov 2020 17:43:25 GMT
I do a lot of the following operation:

    subframe = df[df['ID'] == k]

where df is a pandas DataFrame with a small number of columns but a moderately large number
of rows (say 200K to 5M). The columns are usually simple; for example's sake, let's call them
int64 TIMESTAMP, uint32 ID, int64 VALUE.

I am moving the source data to Parquet format. I don't really care whether I do this in PyArrow
or pandas, but I need to perform these subframe selections frequently and would like to speed
them up. The idea is to load the data into memory once and then perform subframe selection
anywhere from 10 to 1000 times to extract the appropriate data for further processing.

Is there a suggested method? Any ideas?
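
For the load-once, select-many pattern described above, one possible approach (a sketch,
not a benchmarked recommendation) is to pay a one-time cost to map each ID to its row
positions and then use positional indexing for every later lookup. The synthetic data, the
name positions_by_id, and the helper select_by_id are hypothetical and only mirror the
column layout given above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; column names follow the example above.
rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    "TIMESTAMP": np.arange(n, dtype=np.int64),
    "ID": rng.integers(0, 500, size=n).astype(np.uint32),
    "VALUE": rng.integers(-1000, 1000, size=n, dtype=np.int64),
})

# One-time cost: map each distinct ID to the integer row positions where it occurs.
positions_by_id = df.groupby("ID").indices


def select_by_id(frame, positions, k):
    """Return the rows of `frame` whose ID equals k (empty frame if k is absent)."""
    idx = positions.get(k)
    if idx is None:
        return frame.iloc[0:0]
    return frame.iloc[idx]


k = 42
subframe = select_by_id(df, positions_by_id, k)
```

Each lookup then avoids scanning the whole ID column, at the price of holding the position
arrays in memory. Whether that actually beats the boolean mask depends on the data and
should be measured.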

I've tried

    subframe = df.query('ID == %d' % k)

and flirted with the idea of using Gandiva, as per https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/,
but it looks a bit rough, and I had to manually tweak the types of literal constants to support
anything other than float64.
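
An alternative that needs no JIT or expression engine, sketched here under the assumption
that re-ordering a copy of the frame is acceptable: sort once by ID so each ID occupies a
contiguous block, then locate the block with a binary search. The tiny frame and the helper
select_sorted are hypothetical and for illustration only.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the same column types as in the post.
df = pd.DataFrame({
    "TIMESTAMP": np.array([10, 11, 12, 13, 14, 15], dtype=np.int64),
    "ID": np.array([7, 3, 7, 5, 3, 7], dtype=np.uint32),
    "VALUE": np.array([100, 200, 300, 400, 500, 600], dtype=np.int64),
})

# One-time cost: a stable sort by ID keeps the original order within each ID.
sorted_df = df.sort_values("ID", kind="stable")
sorted_ids = sorted_df["ID"].to_numpy()


def select_sorted(k):
    """Return the contiguous block of rows whose ID equals k."""
    lo = np.searchsorted(sorted_ids, k, side="left")
    hi = np.searchsorted(sorted_ids, k, side="right")
    return sorted_df.iloc[lo:hi]


subframe = select_sorted(7)
```

Because the result is a positional slice of the sorted frame, each lookup costs two binary
searches rather than a full comparison over the ID column.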
