arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Tabular ID query (subframe selection based on an integer ID)
Date Wed, 11 Nov 2020 18:17:32 GMT
You should be able to do this with the kernels available in
pyarrow.compute. A few may be missing; if you can't find what you
need, please open a Jira issue so it goes into the backlog.

On Wed, Nov 11, 2020 at 11:43 AM Jason Sachs <jmsachs@gmail.com> wrote:
>
> I do a lot of the following operation:
>
>     subframe = df[df['ID'] == k]
>
> where df is a Pandas DataFrame with a small number of columns but a moderately large
> number of rows (say 200K - 5M). The columns are usually simple... for example's sake let's
> call them int64 TIMESTAMP, uint32 ID, int64 VALUE.
>
> I am moving the source data to Parquet format. I don't really care whether I do this
> in PyArrow or Pandas, but I need to perform these subframe selections frequently and would
> like to speed them up. (The idea being, load the data into memory once, and then expect to
> perform subframe selection anywhere from 10 - 1000 times to extract appropriate data for further
> processing.)
>
> Is there a suggested method? Any ideas?
>
> I've tried
>
>     subframe = df.query('ID == %d' % k)
>
> and flirted with the idea of using Gandiva as per https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/
> but it looks a bit rough, and I had to manually tweak the types of literal constants to support
> something other than a float64.
