From user-return-871-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Sat Jan 2 00:11:35 2021 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mxout1-ec2-va.apache.org (mxout1-ec2-va.apache.org [3.227.148.255]) by mx-eu-01.ponee.io (Postfix) with ESMTPS id 7371218065C for ; Sat, 2 Jan 2021 01:11:35 +0100 (CET) Received: from mail.apache.org (mailroute1-lw-us.apache.org [207.244.88.153]) by mxout1-ec2-va.apache.org (ASF Mail Server at mxout1-ec2-va.apache.org) with SMTP id 9D5D94530C for ; Sat, 2 Jan 2021 00:11:34 +0000 (UTC) Received: (qmail 38250 invoked by uid 500); 2 Jan 2021 00:11:34 -0000 Mailing-List: contact user-help@arrow.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@arrow.apache.org Delivered-To: mailing list user@arrow.apache.org Received: (qmail 38240 invoked by uid 99); 2 Jan 2021 00:11:34 -0000 Received: from spamproc1-he-de.apache.org (HELO spamproc1-he-de.apache.org) (116.203.196.100) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 02 Jan 2021 00:11:34 +0000 Received: from localhost (localhost [127.0.0.1]) by spamproc1-he-de.apache.org (ASF Mail Server at spamproc1-he-de.apache.org) with ESMTP id 839B41FF3A1 for ; Sat, 2 Jan 2021 00:11:33 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamproc1-he-de.apache.org X-Spam-Flag: NO X-Spam-Score: 0 X-Spam-Level: X-Spam-Status: No, score=0 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamproc1-he-de.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-he-de.apache.org ([116.203.227.195]) by localhost (spamproc1-he-de.apache.org [116.203.196.100]) (amavisd-new, port 10024) with ESMTP id 0UXVQ7Syuj5K for ; Sat, 2 Jan 2021 00:11:32 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=2607:f8b0:4864:20::d32; helo=mail-io1-xd32.google.com; envelope-from=capacytron@gmail.com; receiver= Received: from mail-io1-xd32.google.com (mail-io1-xd32.google.com [IPv6:2607:f8b0:4864:20::d32]) by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id 9915A7FBC4 for ; Sat, 2 Jan 2021 00:11:32 +0000 (UTC) Received: by mail-io1-xd32.google.com with SMTP id i18so20003631ioa.1 for ; Fri, 01 Jan 2021 16:11:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=QUnkyGFOmoLqSsk9YyE7RdUaRPHTL426y1Mc0MLzCZU=; b=dNmrO3Dg0E6tkEulPgf5cl2e1XfTb+ZSI5DxCVx8ZUaX6heAIZ/MxqrCXFQP9DIkIY 2i+mYuVIPKUlSqWcCSVlz0l8pHsHG467Q7SodXyj+uPAJKpMHNNj0lSVCQevnpjn8kV5 c20yLnqYfsHXhmi2p4NbbmBrpK0+KRIH0wG282gcRH/TDxk6ynPw34twWkOmajNAeZs4 k6UeDq9jnm5nOzhNLfKxFfmDk3YypLXiGXSC6a0zKe5FpNclQLtUC1m4C82pYV9+u95O u7WqJGHMd5jXYrPx3yhTZMqGWWOZFaL+VsQO4VFRuemUA3G6BKjwkv4czZp5SjcLtRM9 w/7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=QUnkyGFOmoLqSsk9YyE7RdUaRPHTL426y1Mc0MLzCZU=; b=KCqsfbSmoyQhzrYy6AYSRp3jR9hHiFuyCpIE6ocwDOQAEE903+IoXgY09RzFzJAmEa 1Sl00SXwFQ0NBhjaldBDaZuG5U2RLWrbPHyFp8T2PDuS5kM/A82OlxCr7h5pfa/oiY0U YIeK3o1sw9eWtv48pB220FcHaslm9mZV4OHFk5Ua2u6FP+nKEgo2v1sizF8YdSYrW0H8 00tK2rH7PvYHSERNpSG108Tm6P0VCQlOBMXx0DMsi1mfWvpgjaOglhEMDPZ6TVFCdBQb m3G7SFZNfBE+UhlfYSkACvhKsFbZjGtRXeDmzqigGUUpbeG5VQYTeJLIeuY6zviGnwu4 LYwA== X-Gm-Message-State: AOAM531oV1K0tFLc6igmxjX+ms4EheyS30MXtDFQtWRzbTXrEavTpjDx cEm1kD24THUSRmie6oZVuz4+kiArLeC9BN5acLqk0Ck2Iug= X-Google-Smtp-Source: ABdhPJyIruZzfpuM8Ml/rrgW2+MXLlFt0aMtB7bRYoL0xDtOxfic4l3UHT+KyZq7fkvqd0cdQosT/2wavD83Snssv0Q= X-Received: by 2002:a6b:f112:: with SMTP id e18mr51534285iog.195.1609546291250; Fri, 01 Jan 2021 16:11:31 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Ivan Petrov Date: Sat, 2 Jan 2021 01:11:20 +0100 Message-ID: Subject: Re: Optimising pandas relational ops with pyarrow To: user@arrow.apache.org Content-Type: multipart/alternative; boundary="000000000000f0896805b7dfb2c2" --000000000000f0896805b7dfb2c2 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable I can help with Java-scala, and have 0 exp in c++. It=E2=80=99s another ris= k for us, we have several JVM experts and 0 c++ guys. Efficient Distributed join is a mess btw ;) we would have to solve oom problems and do it though disk. Impala passed this painful stage 8 years ago... On Sat, 2 Jan 2021 at 00:48, Wes McKinney wrote: > Note that many of us think it's important to have canonical > implementations of important algorithms (aggregate / hash aggregate, > joins, sorts, etc.) in the Apache project and available to e.g. > pyarrow users, as opposed to having to direct them to a third party > project. I've been unable to do this work myself given my other > responsibilities, but I will be continuing to direct funding / > engineering time from my organization toward these goals. I hope that > others from the community can join in to help out to make the work go > faster. > > On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov wrote: > > > > Hi, thanks for the pointers. We tried cylondata already. We find it har= d > to build, some lack of tests for Java, seems like sort and filter not > supported yet... > > We are short on time that is why we can=E2=80=99t afford to build own c= i/cd for > cylondata... > > Project looks very promising and for now it=E2=80=99s a huge technical = risk for > us. > > > > > > On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon > wrote: > >> > >> Checkout https://cylondata.org/. > >> > >> We have also worked on this problem in both sequential and distributed > execution mode. An early DataFrame API is also available. > >> > >> [1]. https://cylondata.org/docs/python > >> [2]. https://cylondata.org/docs/python_api_docs > >> > >> > >> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger > wrote: > >>> > >>> Ivan, > >>> > >>> The Clojure dataset abstraction does not copy the data, uses mmap, an= d > is generally extremely fast for aggregate group-by operations. Just FYI. > >>> > >>> > >>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov > wrote: > >>>> > >>>> Hi! > >>>> I plan to: > >>>> - join > >>>> - group by > >>>> - filter > >>>> data using pyarrow (new to it). The idea is to get better performanc= e > and memory utilisation ( apache arrow columnar compression) compared to > pandas. > >>>> Seems like pyarrow has no support for joining two Tables / Dataset b= y > key so I have to fallback to pandas. > >>>> I don=E2=80=99t really follow how pyarrow <-> pandas integration wor= ks. Will > pandas rely on apache arrow data structure? I=E2=80=99m fine with using o= nly these > flat types for columns to avoid "corner cases" > >>>> - string > >>>> - int > >>>> - long > >>>> - decimal > >>>> > >>>> I have a feeling that pandas will copy all data from apache arrow an= d > double the size (according to the doc). Did I get it right? > >>>> What is the right way to join, groupBy and filter several "Tables" / > "Datasets" utilizing pyarrow (underlying apache arrow) power? > >>>> > >>>> Thank you! > >> > >> -- > >> Vibhatha Abeykoon > --000000000000f0896805b7dfb2c2 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
I can help with Java-scala, and have 0 exp in c++. It=E2= =80=99s another risk for us, we have several JVM experts and 0 c++ guys. Ef= ficient Distributed join is a mess btw ;) we would have to solve oom proble= ms and do it though disk. Impala passed this painful stage 8 years ago...

On Sat, 2 Jan 2021 at 00:48, Wes McKinney <wesmckinn@gmail.com> wrote:
Note that many of us think it's important to have canonic= al
implementations of important algorithms (aggregate / hash aggregate,
joins, sorts, etc.) in the Apache project and available to e.g.
pyarrow users, as opposed to having to direct them to a third party
project. I've been unable to do this work myself given my other
responsibilities, but I will be continuing to direct funding /
engineering time from my organization toward these goals. I hope that
others from the community can join in to help out to make the work go
faster.

On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov <capacytron@gmail.com> wrote:
>
> Hi, thanks for the pointers. We tried cylondata already. We find it ha= rd to build, some lack of tests for Java, seems like sort and filter not su= pported yet...
> We are short on time that is why we can=E2=80=99t afford to build own = ci/cd for cylondata...
> Project looks very promising and for now it=E2=80=99s a huge technical= risk for us.
>
>
> On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <vibhatha@gmail.com> wrote:
>>
>> Checkout https://cylondata.org/.
>>
>> We have also worked on this problem in both sequential and distrib= uted execution mode. An early DataFrame API is also available.
>>
>> [1]. https://cylondata.org/docs/python
>> [2]. https://cylondata.org/docs/python_api_docs
>>
>>
>> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <
chris@techascent.com> w= rote:
>>>
>>> Ivan,
>>>
>>> The Clojure dataset abstraction does not copy the data, uses m= map, and is generally extremely fast for aggregate group-by operations. Jus= t FYI.
>>>
>>>
>>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <capacytron@gmail.com> wr= ote:
>>>>
>>>> Hi!
>>>> I plan to:
>>>> -=C2=A0 join
>>>> - group by
>>>> - filter
>>>> data using pyarrow (new to it). The idea is to get better = performance and memory utilisation ( apache arrow columnar compression) com= pared to pandas.
>>>> Seems like pyarrow has no support for joining two Tables /= Dataset by key so I have to fallback to pandas.
>>>> I don=E2=80=99t really follow how pyarrow <-> pandas= integration works. Will pandas rely on apache arrow data structure? I=E2= =80=99m fine with using only these flat types for columns to avoid "co= rner cases"
>>>> - string
>>>> - int
>>>> - long
>>>> - decimal
>>>>
>>>> I have a feeling that pandas will copy all data from apach= e arrow and double the size (according to the doc). Did I get it right?
>>>> What is the right way to join, groupBy and filter several = "Tables" / "Datasets" utilizing pyarrow (underlying apa= che arrow) power?
>>>>
>>>> Thank you!
>>
>> --
>> Vibhatha Abeykoon
--000000000000f0896805b7dfb2c2--