Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
MIME-Version: 1.0
In-Reply-To: <CAAdrtT2aREK-7qA43+_G6eDiV0fP_Cx2q9_Sqv2m4eQ3ROMhyg@mail.gmail.com>
References: <CADno-Ro-gSgtLMSFJzp6LFdCBZhtb_EFKWaicsYzTyD8ZKeC=A@mail.gmail.com>
 <CAGr9p8B-=564Ar5g-hD8hooAmB-mDH634zKTaabhAobmiW824g@mail.gmail.com> <CAAdrtT2aREK-7qA43+_G6eDiV0fP_Cx2q9_Sqv2m4eQ3ROMhyg@mail.gmail.com>
From: Patrick Brunmayr <jay@kpibench.com>
Date: Fri, 24 Feb 2017 15:08:26 +0100
Message-ID: <CADno-RppsGjZ2vxc4fQv=7vUuaCwMY1KL6Dga3aSgX-DxRKDPQ@mail.gmail.com>
Subject: Re: Difference between partition and groupBy
To: user@flink.apache.org
Content-Type: multipart/alternative; boundary=001a114076b080164805494743d4
archived-at: Fri, 24 Feb 2017 14:08:34 -0000

--001a114076b080164805494743d4
Content-Type: text/plain; charset=UTF-8

Thank you for that answer. It helped me a lot.

2017-02-23 22:10 GMT+01:00 Fabian Hueske <fhueske@gmail.com>:

> Hi Patrick,
>
> as Robert said, partitionBy() shuffles the data such that all records with
> the same key end up in the same partition. That's all it does.
> groupBy() also prepares the data in each partition to be processed per
> key. For example, if you run a groupReduce after a groupBy(), the data is
> first shuffled (just like partitionBy()) and then in each partition sorted
> to organize it by key. So groupBy() does more than partitionBy() because it
> organizes the data in each partition to be processed by key.
>
> Moreover, groupBy() alone is not a complete operation but just "prepares"
> a following operation. It must be called with a reduce or combine operator.
> In contrast, partitionBy() is by itself complete.
> So the difference between partitionBy() and groupBy() is more than just an
> API thing.
>
> Hope that helps,
> Fabian
>
> 2017-02-23 21:51 GMT+01:00 Robert Metzger <rmetzger@apache.org>:
>
>> Hi Patrick,
>>
>> I think (but I'm not 100% sure) it's not a difference in what the engine
>> does in the end; it's more of an API thing. When you are grouping, you can
>> perform operations such as reducing afterwards.
>> On a partitioned dataset, you can do stuff like processing each partition
>> in parallel, or sorting them.
>>
>> The parallelism is independent of the partitioning or grouping. Usually
>> there are more partitions than parallel instances, so each instance will
>> take care of multiple partitions.
>>
>>
>>
>> On Thu, Feb 23, 2017 at 6:16 PM, Patrick Brunmayr <jay@kpibench.com>
>> wrote:
>>
>>> What is the basic difference between partitioning datasets by key and
>>> grouping them by key?
>>>
>>> Does it make a difference in terms of parallelism?
>>>
>>> Thx
>>>
>>
>>
>
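
For anyone finding this thread later, here is a rough sketch of the
distinction Fabian describes. It is plain Python, not Flink code. The
function names (partition_by_key, group_reduce) and the hash-modulo routing
are assumptions made for illustration only and are not part of any Flink
API. The first function only shuffles records so that equal keys land in
the same partition. The second does the same shuffle, then sorts each
partition by key and hands each key's records to a reduce function.

```python
from itertools import groupby


def partition_by_key(records, key_fn, num_partitions):
    """Shuffle only: route each record to a partition chosen from its key.

    Records with the same key always land in the same partition, but
    nothing is sorted or grouped inside a partition.
    """
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(key_fn(rec)) % num_partitions].append(rec)
    return partitions


def group_reduce(records, key_fn, reduce_fn, num_partitions):
    """Shuffle, then organize each partition by key and reduce per key.

    Grouping alone yields nothing here; it only makes sense together
    with the reduce function that consumes each group.
    """
    results = []
    for part in partition_by_key(records, key_fn, num_partitions):
        # Sorting puts all records of one key next to each other.
        ordered = sorted(part, key=key_fn)
        for key, group in groupby(ordered, key=key_fn):
            results.append(reduce_fn(key, list(group)))
    return results


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Partitioning alone is a complete step: the result is the shuffled data.
parts = partition_by_key(records, lambda r: r[0], 3)

# Grouping needs a follow-up operation, here a per-key sum.
totals = group_reduce(
    records,
    lambda r: r[0],
    lambda key, group: (key, sum(value for _, value in group)),
    3,
)
print(sorted(totals))  # [('a', 4), ('b', 7), ('c', 4)]
```

Which partition a given key ends up in depends on the hash and may differ
between runs; the per-key totals do not.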


--001a114076b080164805494743d4--