Date: Wed, 11 Nov 2020 17:43:25 -0000
From: Jason Sachs
To: user@arrow.apache.org
Subject: Tabular ID query (subframe selection based on an integer ID)

I do a lot of the following operation:

    subframe = df[df['ID'] == k]

where df is a Pandas DataFrame with a small number of columns but a moderately large number of rows (say 200K - 5M). The columns are usually simple; for example's sake, let's call them int64 TIMESTAMP, uint32 ID, int64 VALUE.

I am moving the source data to Parquet format. I don't really care whether I do this in PyArrow or Pandas, but I need to perform these subframe selections frequently and would like to speed them up. (The idea is to load the data into memory once and then perform subframe selection anywhere from 10 to 1000 times to extract the appropriate data for further processing.)

Is there a suggested method? Any ideas? I've tried

    subframe = df.query('ID == %d' % k)

and flirted with the idea of using Gandiva as per https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/ but it looks a bit rough, and I had to manually tweak the types of literal constants to support something other than float64.
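
For concreteness, the workflow I have in mind looks roughly like this (the Parquet file name and the ID values below are just placeholders):

    import pyarrow.parquet as pq

    # One-time load: Parquet -> Arrow Table -> pandas DataFrame
    # (columns: int64 TIMESTAMP, uint32 ID, int64 VALUE).
    df = pq.read_table('data.parquet').to_pandas()

    def select_id(df, k):
        # The per-ID subframe selection I would like to speed up.
        return df[df['ID'] == k]

    # Repeated selection, anywhere from 10 to 1000 times per loaded file.
    for k in (3, 17, 42):
        subframe = select_id(df, k)
        # ... further processing on subframe ...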