From user-return-871-archive-asf-public=cust-asf.ponee.io@arrow.apache.org  Sat Jan  2 00:11:35 2021
Return-Path: <user-return-871-archive-asf-public=cust-asf.ponee.io@arrow.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mxout1-ec2-va.apache.org (mxout1-ec2-va.apache.org [3.227.148.255])
	by mx-eu-01.ponee.io (Postfix) with ESMTPS id 7371218065C
	for <archive-asf-public@cust-asf.ponee.io>; Sat,  2 Jan 2021 01:11:35 +0100 (CET)
Received: from mail.apache.org (mailroute1-lw-us.apache.org [207.244.88.153])
	by mxout1-ec2-va.apache.org (ASF Mail Server at mxout1-ec2-va.apache.org) with SMTP id 9D5D94530C
	for <archive-asf-public@cust-asf.ponee.io>; Sat,  2 Jan 2021 00:11:34 +0000 (UTC)
Received: (qmail 38250 invoked by uid 500); 2 Jan 2021 00:11:34 -0000
Mailing-List: contact user-help@arrow.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:user-help@arrow.apache.org>
List-Unsubscribe: <mailto:user-unsubscribe@arrow.apache.org>
List-Post: <mailto:user@arrow.apache.org>
List-Id: <user.arrow.apache.org>
Reply-To: user@arrow.apache.org
Delivered-To: mailing list user@arrow.apache.org
Received: (qmail 38240 invoked by uid 99); 2 Jan 2021 00:11:34 -0000
Received: from spamproc1-he-de.apache.org (HELO spamproc1-he-de.apache.org) (116.203.196.100)
    by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 02 Jan 2021 00:11:34 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamproc1-he-de.apache.org (ASF Mail Server at spamproc1-he-de.apache.org) with ESMTP id 839B41FF3A1
	for <user@arrow.apache.org>; Sat,  2 Jan 2021 00:11:33 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamproc1-he-de.apache.org
X-Spam-Flag: NO
X-Spam-Score: 0
X-Spam-Level:
X-Spam-Status: No, score=0 tagged_above=-999 required=6.31
	tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
	DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, SPF_PASS=-0.001,
	URIBL_BLOCKED=0.001] autolearn=disabled
Authentication-Results: spamproc1-he-de.apache.org (amavisd-new);
	dkim=pass (2048-bit key) header.d=gmail.com
Received: from mx1-he-de.apache.org ([116.203.227.195])
	by localhost (spamproc1-he-de.apache.org [116.203.196.100]) (amavisd-new, port 10024)
	with ESMTP id 0UXVQ7Syuj5K for <user@arrow.apache.org>;
	Sat,  2 Jan 2021 00:11:32 +0000 (UTC)
Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=2607:f8b0:4864:20::d32; helo=mail-io1-xd32.google.com; envelope-from=capacytron@gmail.com; receiver=<UNKNOWN> 
Received: from mail-io1-xd32.google.com (mail-io1-xd32.google.com [IPv6:2607:f8b0:4864:20::d32])
	by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id 9915A7FBC4
	for <user@arrow.apache.org>; Sat,  2 Jan 2021 00:11:32 +0000 (UTC)
Received: by mail-io1-xd32.google.com with SMTP id i18so20003631ioa.1
        for <user@arrow.apache.org>; Fri, 01 Jan 2021 16:11:32 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
        bh=QUnkyGFOmoLqSsk9YyE7RdUaRPHTL426y1Mc0MLzCZU=;
        b=dNmrO3Dg0E6tkEulPgf5cl2e1XfTb+ZSI5DxCVx8ZUaX6heAIZ/MxqrCXFQP9DIkIY
         2i+mYuVIPKUlSqWcCSVlz0l8pHsHG467Q7SodXyj+uPAJKpMHNNj0lSVCQevnpjn8kV5
         c20yLnqYfsHXhmi2p4NbbmBrpK0+KRIH0wG282gcRH/TDxk6ynPw34twWkOmajNAeZs4
         k6UeDq9jnm5nOzhNLfKxFfmDk3YypLXiGXSC6a0zKe5FpNclQLtUC1m4C82pYV9+u95O
         u7WqJGHMd5jXYrPx3yhTZMqGWWOZFaL+VsQO4VFRuemUA3G6BKjwkv4czZp5SjcLtRM9
         w/7Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to;
        bh=QUnkyGFOmoLqSsk9YyE7RdUaRPHTL426y1Mc0MLzCZU=;
        b=KCqsfbSmoyQhzrYy6AYSRp3jR9hHiFuyCpIE6ocwDOQAEE903+IoXgY09RzFzJAmEa
         1Sl00SXwFQ0NBhjaldBDaZuG5U2RLWrbPHyFp8T2PDuS5kM/A82OlxCr7h5pfa/oiY0U
         YIeK3o1sw9eWtv48pB220FcHaslm9mZV4OHFk5Ua2u6FP+nKEgo2v1sizF8YdSYrW0H8
         00tK2rH7PvYHSERNpSG108Tm6P0VCQlOBMXx0DMsi1mfWvpgjaOglhEMDPZ6TVFCdBQb
         m3G7SFZNfBE+UhlfYSkACvhKsFbZjGtRXeDmzqigGUUpbeG5VQYTeJLIeuY6zviGnwu4
         LYwA==
X-Gm-Message-State: AOAM531oV1K0tFLc6igmxjX+ms4EheyS30MXtDFQtWRzbTXrEavTpjDx
	cEm1kD24THUSRmie6oZVuz4+kiArLeC9BN5acLqk0Ck2Iug=
X-Google-Smtp-Source: ABdhPJyIruZzfpuM8Ml/rrgW2+MXLlFt0aMtB7bRYoL0xDtOxfic4l3UHT+KyZq7fkvqd0cdQosT/2wavD83Snssv0Q=
X-Received: by 2002:a6b:f112:: with SMTP id e18mr51534285iog.195.1609546291250;
 Fri, 01 Jan 2021 16:11:31 -0800 (PST)
MIME-Version: 1.0
References: <CAEARqsGHH3p+7MV=mzcniEqcyZQa0viZ_nXRo+e0q9UWNar6Dg@mail.gmail.com>
 <CADbpEJth+cvTViRCHHsuH8oQUkriBwEh1HDX9Z6jPm6isEMhjw@mail.gmail.com>
 <CABC-QTntL8WWqFeq7M=23_KZbNsD+attzGbQFviB9EgBV+u9AQ@mail.gmail.com>
 <CAEARqsEz5zq7KqKp7Vw2Xr0j33MvHcGbsJBicj9k8wqxGnzgCw@mail.gmail.com> <CAJPUwMBwPA=yW9DAbkvwU71FaPNDE_TCM8Hyy2JGAJ5R7xytfw@mail.gmail.com>
In-Reply-To: <CAJPUwMBwPA=yW9DAbkvwU71FaPNDE_TCM8Hyy2JGAJ5R7xytfw@mail.gmail.com>
From: Ivan Petrov <capacytron@gmail.com>
Date: Sat, 2 Jan 2021 01:11:20 +0100
Message-ID: <CAEARqsGUB8ogifg3eysVQF-GXdJt3oGHiLsDQTD-6WOy944x3A@mail.gmail.com>
Subject: Re: Optimising pandas relational ops with pyarrow
To: user@arrow.apache.org
Content-Type: multipart/alternative; boundary="000000000000f0896805b7dfb2c2"

--000000000000f0896805b7dfb2c2
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I can help with Java-scala, and have 0 exp in c++. It=E2=80=99s another ris=
k for
us, we have several JVM experts and 0 c++ guys. Efficient Distributed join
is a mess btw ;) we would have to solve oom problems and do it though disk.
Impala passed this painful stage 8 years ago...

On Sat, 2 Jan 2021 at 00:48, Wes McKinney <wesmckinn@gmail.com> wrote:

> Note that many of us think it's important to have canonical
> implementations of important algorithms (aggregate / hash aggregate,
> joins, sorts, etc.) in the Apache project and available to e.g.
> pyarrow users, as opposed to having to direct them to a third party
> project. I've been unable to do this work myself given my other
> responsibilities, but I will be continuing to direct funding /
> engineering time from my organization toward these goals. I hope that
> others from the community can join in to help out to make the work go
> faster.
>
> On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov <capacytron@gmail.com> wrote:
> >
> > Hi, thanks for the pointers. We tried cylondata already. We find it har=
d
> to build, some lack of tests for Java, seems like sort and filter not
> supported yet...
> > We are short on time that is why we can=E2=80=99t afford to build own c=
i/cd for
> cylondata...
> > Project looks very promising and for now it=E2=80=99s a huge technical =
risk for
> us.
> >
> >
> > On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <vibhatha@gmail.com>
> wrote:
> >>
> >> Checkout https://cylondata.org/.
> >>
> >> We have also worked on this problem in both sequential and distributed
> execution mode. An early DataFrame API is also available.
> >>
> >> [1]. https://cylondata.org/docs/python
> >> [2]. https://cylondata.org/docs/python_api_docs
> >>
> >>
> >> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <chris@techascent.com=
>
> wrote:
> >>>
> >>> Ivan,
> >>>
> >>> The Clojure dataset abstraction does not copy the data, uses mmap, an=
d
> is generally extremely fast for aggregate group-by operations. Just FYI.
> >>>
> >>>
> >>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <capacytron@gmail.com>
> wrote:
> >>>>
> >>>> Hi!
> >>>> I plan to:
> >>>> -  join
> >>>> - group by
> >>>> - filter
> >>>> data using pyarrow (new to it). The idea is to get better performanc=
e
> and memory utilisation ( apache arrow columnar compression) compared to
> pandas.
> >>>> Seems like pyarrow has no support for joining two Tables / Dataset b=
y
> key so I have to fallback to pandas.
> >>>> I don=E2=80=99t really follow how pyarrow <-> pandas integration wor=
ks. Will
> pandas rely on apache arrow data structure? I=E2=80=99m fine with using o=
nly these
> flat types for columns to avoid "corner cases"
> >>>> - string
> >>>> - int
> >>>> - long
> >>>> - decimal
> >>>>
> >>>> I have a feeling that pandas will copy all data from apache arrow an=
d
> double the size (according to the doc). Did I get it right?
> >>>> What is the right way to join, groupBy and filter several "Tables" /
> "Datasets" utilizing pyarrow (underlying apache arrow) power?
> >>>>
> >>>> Thank you!
> >>
> >> --
> >> Vibhatha Abeykoon
>

--000000000000f0896805b7dfb2c2
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"auto">I can help with Java-scala, and have 0 exp in c++. It=E2=
=80=99s another risk for us, we have several JVM experts and 0 c++ guys. Ef=
ficient Distributed join is a mess btw ;) we would have to solve oom proble=
ms and do it though disk. Impala passed this painful stage 8 years ago...</=
div><div><br><div class=3D"gmail_quote"><div dir=3D"ltr" class=3D"gmail_att=
r">On Sat, 2 Jan 2021 at 00:48, Wes McKinney &lt;<a href=3D"mailto:wesmckin=
n@gmail.com">wesmckinn@gmail.com</a>&gt; wrote:<br></div><blockquote class=
=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padd=
ing-left:1ex">Note that many of us think it&#39;s important to have canonic=
al<br>
implementations of important algorithms (aggregate / hash aggregate,<br>
joins, sorts, etc.) in the Apache project and available to e.g.<br>
pyarrow users, as opposed to having to direct them to a third party<br>
project. I&#39;ve been unable to do this work myself given my other<br>
responsibilities, but I will be continuing to direct funding /<br>
engineering time from my organization toward these goals. I hope that<br>
others from the community can join in to help out to make the work go<br>
faster.<br>
<br>
On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov &lt;<a href=3D"mailto:capacytron=
@gmail.com" target=3D"_blank">capacytron@gmail.com</a>&gt; wrote:<br>
&gt;<br>
&gt; Hi, thanks for the pointers. We tried cylondata already. We find it ha=
rd to build, some lack of tests for Java, seems like sort and filter not su=
pported yet...<br>
&gt; We are short on time that is why we can=E2=80=99t afford to build own =
ci/cd for cylondata...<br>
&gt; Project looks very promising and for now it=E2=80=99s a huge technical=
 risk for us.<br>
&gt;<br>
&gt;<br>
&gt; On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon &lt;<a href=3D"mailto:v=
ibhatha@gmail.com" target=3D"_blank">vibhatha@gmail.com</a>&gt; wrote:<br>
&gt;&gt;<br>
&gt;&gt; Checkout <a href=3D"https://cylondata.org/" rel=3D"noreferrer" tar=
get=3D"_blank">https://cylondata.org/</a>.<br>
&gt;&gt;<br>
&gt;&gt; We have also worked on this problem in both sequential and distrib=
uted execution mode. An early DataFrame API is also available.<br>
&gt;&gt;<br>
&gt;&gt; [1]. <a href=3D"https://cylondata.org/docs/python" rel=3D"noreferr=
er" target=3D"_blank">https://cylondata.org/docs/python</a><br>
&gt;&gt; [2]. <a href=3D"https://cylondata.org/docs/python_api_docs" rel=3D=
"noreferrer" target=3D"_blank">https://cylondata.org/docs/python_api_docs</=
a><br>
&gt;&gt;<br>
&gt;&gt;<br>
&gt;&gt; On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger &lt;<a href=3D"ma=
ilto:chris@techascent.com" target=3D"_blank">chris@techascent.com</a>&gt; w=
rote:<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; Ivan,<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; The Clojure dataset abstraction does not copy the data, uses m=
map, and is generally extremely fast for aggregate group-by operations. Jus=
t FYI.<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov &lt;<a href=3D"mai=
lto:capacytron@gmail.com" target=3D"_blank">capacytron@gmail.com</a>&gt; wr=
ote:<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; Hi!<br>
&gt;&gt;&gt;&gt; I plan to:<br>
&gt;&gt;&gt;&gt; -=C2=A0 join<br>
&gt;&gt;&gt;&gt; - group by<br>
&gt;&gt;&gt;&gt; - filter<br>
&gt;&gt;&gt;&gt; data using pyarrow (new to it). The idea is to get better =
performance and memory utilisation ( apache arrow columnar compression) com=
pared to pandas.<br>
&gt;&gt;&gt;&gt; Seems like pyarrow has no support for joining two Tables /=
 Dataset by key so I have to fallback to pandas.<br>
&gt;&gt;&gt;&gt; I don=E2=80=99t really follow how pyarrow &lt;-&gt; pandas=
 integration works. Will pandas rely on apache arrow data structure? I=E2=
=80=99m fine with using only these flat types for columns to avoid &quot;co=
rner cases&quot;<br>
&gt;&gt;&gt;&gt; - string<br>
&gt;&gt;&gt;&gt; - int<br>
&gt;&gt;&gt;&gt; - long<br>
&gt;&gt;&gt;&gt; - decimal<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; I have a feeling that pandas will copy all data from apach=
e arrow and double the size (according to the doc). Did I get it right?<br>
&gt;&gt;&gt;&gt; What is the right way to join, groupBy and filter several =
&quot;Tables&quot; / &quot;Datasets&quot; utilizing pyarrow (underlying apa=
che arrow) power?<br>
&gt;&gt;&gt;&gt;<br>
&gt;&gt;&gt;&gt; Thank you!<br>
&gt;&gt;<br>
&gt;&gt; --<br>
&gt;&gt; Vibhatha Abeykoon<br>
</blockquote></div></div>

--000000000000f0896805b7dfb2c2--