Date: Wed, 11 Nov 2020 17:43:25 -0000
From: Jason Sachs
To: user@arrow.apache.org
Subject: Tabular ID query (subframe selection based on an integer ID)

I do a lot of the following operation:

    subframe = df[df['ID'] == k]

where df is a Pandas DataFrame with a small number of columns but a moderately large number of rows (say 200K - 5M). The columns are usually simple; for example's sake, let's call them int64 TIMESTAMP, uint32 ID, int64 VALUE.

I am moving the source data to Parquet format. I don't really care whether I do this in PyArrow or Pandas, but I need to perform these subframe selections frequently and would like to speed them up. (The idea is to load the data into memory once and then perform subframe selection anywhere from 10 to 1000 times to extract the appropriate data for further processing.)

Is there a suggested method? Any ideas? I've tried

    subframe = df.query('ID == %d' % k)

and flirted with the idea of using Gandiva as per https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/ but it looks a bit rough, and I had to manually tweak the types of literal constants to support something other than float64.
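
For concreteness, the workflow I have in mind looks roughly like this (the Parquet file name and the ID values below are just placeholders):

    import pyarrow.parquet as pq

    # One-time load: Parquet -> Arrow Table -> pandas DataFrame
    # (columns: int64 TIMESTAMP, uint32 ID, int64 VALUE).
    df = pq.read_table('data.parquet').to_pandas()

    def select_id(df, k):
        # The per-ID subframe selection I would like to speed up.
        return df[df['ID'] == k]

    # Repeated selection, anywhere from 10 to 1000 times per loaded file.
    for k in (3, 17, 42):
        subframe = select_id(df, k)
        # ... further processing on subframe ...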