From user-return-869-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Fri Jan 1 23:36:45 2021
Mailing-List: contact user-help@arrow.apache.org; run by ezmlm
Reply-To: user@arrow.apache.org
MIME-Version: 1.0
From: Ivan Petrov
Date: Sat, 2 Jan 2021 00:36:29 +0100
Subject: Re: Optimising pandas relational ops with pyarrow
To: user@arrow.apache.org
Content-Type: multipart/alternative; boundary="0000000000004c3f2505b7df360c"

--0000000000004c3f2505b7df360c
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

Hi, thanks for the pointers. We tried cylondata already. We found it hard
to build, tests for Java are lacking, and sort and filter do not seem to be
supported yet...
We are short on time, which is why we can’t afford to build our own CI/CD
for cylondata...
The project looks very promising, but for now it’s a huge technical risk
for us.

On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon wrote:

> Check out https://cylondata.org/.
>
> We have also worked on this problem in both sequential and distributed
> execution modes. An early DataFrame API is also available.
>
> [1]. https://cylondata.org/docs/python
> [2]. https://cylondata.org/docs/python_api_docs
>
>
> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger wrote:
>
>> Ivan,
>>
>> The Clojure dataset abstraction does not copy the data, uses mmap, and is
>> generally extremely fast for aggregate group-by operations. Just FYI.
>>
>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov wrote:
>>
>>> Hi!
>>> I plan to:
>>> - join
>>> - group by
>>> - filter
>>> data using pyarrow (I am new to it). The idea is to get better
>>> performance and memory utilisation (Apache Arrow columnar compression)
>>> compared to pandas.
>>> It seems pyarrow has no support for joining two Tables / Datasets by
>>> key, so I have to fall back to pandas.
>>> I don’t really follow how the pyarrow <-> pandas integration works. Will
>>> pandas rely on the Apache Arrow data structures? I’m fine with using
>>> only these flat types for columns to avoid "corner cases":
>>> - string
>>> - int
>>> - long
>>> - decimal
>>>
>>> I have a feeling that pandas will copy all data from Apache Arrow and
>>> double the size (according to the doc). Did I get it right?
>>> What is the right way to join, groupBy and filter several "Tables" /
>>> "Datasets" utilizing pyarrow (underlying Apache Arrow) power?
>>>
>>> Thank you!
>>>
>> --
> Vibhatha Abeykoon
>

--0000000000004c3f2505b7df360c
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--0000000000004c3f2505b7df360c--