From: Wes McKinney
Date: Fri, 1 Jan 2021 17:47:16 -0600
Subject: Re: Optimising pandas relational ops with pyarrow
To: user@arrow.apache.org

Note that many of us think it's important to have canonical
implementations of important algorithms (aggregate / hash aggregate,
joins, sorts, etc.) in the Apache project and available to e.g.
pyarrow users, as opposed to having to direct them to a third party
project. I've been unable to do this work myself given my other
responsibilities, but I will be continuing to direct funding /
engineering time from my organization toward these goals. I hope that
others from the community can join in to help out to make the work go
faster.

On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov wrote:
>
> Hi, thanks for the pointers. We tried cylondata already. We find it hard to build, some lack of tests for Java, seems like sort and filter not supported yet...
> We are short on time that is why we can’t afford to build own ci/cd for cylondata...
> Project looks very promising and for now it’s a huge technical risk for us.
>
>
> On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon wrote:
>>
>> Checkout https://cylondata.org/.
>>
>> We have also worked on this problem in both sequential and distributed execution mode. An early DataFrame API is also available.
>>
>> [1]. https://cylondata.org/docs/python
>> [2]. https://cylondata.org/docs/python_api_docs
>>
>>
>> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger wrote:
>>>
>>> Ivan,
>>>
>>> The Clojure dataset abstraction does not copy the data, uses mmap, and is generally extremely fast for aggregate group-by operations. Just FYI.
>>>
>>>
>>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov wrote:
>>>>
>>>> Hi!
>>>> I plan to:
>>>> - join
>>>> - group by
>>>> - filter
>>>> data using pyarrow (new to it). The idea is to get better performance and memory utilisation ( apache arrow columnar compression) compared to pandas.
>>>> Seems like pyarrow has no support for joining two Tables / Dataset by key so I have to fallback to pandas.
>>>> I don’t really follow how pyarrow <-> pandas integration works. Will pandas rely on apache arrow data structure? I’m fine with using only these flat types for columns to avoid "corner cases"
>>>> - string
>>>> - int
>>>> - long
>>>> - decimal
>>>>
>>>> I have a feeling that pandas will copy all data from apache arrow and double the size (according to the doc). Did I get it right?
>>>> What is the right way to join, groupBy and filter several "Tables" / "Datasets" utilizing pyarrow (underlying apache arrow) power?
>>>>
>>>> Thank you!
>>
>> --
>> Vibhatha Abeykoon