From: Robert Schmidtke <ro.schmidtke@gmail.com>
Date: Fri, 13 Jan 2017 15:14:04 +0100
Subject: Re: Terminology: Split, Group and Partition
To: user@flink.apache.org

Hi Fabian,

thanks for the quick and comprehensive reply. I'll have a look at the ExecutionPlan using your suggestion to check what actually gets computed, and I'll use the properties as well. If I stumble across something else I'll let you know.
Many thanks again!
Robert

On Fri, Jan 13, 2017 at 2:40 PM, Fabian Hueske <fhueske@gmail.com> wrote:
Hi Robert,

let me first describe what splits, groups, and partitions are.

* Partition: This is basically all data that goes through the same task instance. If you have an operator with a parallelism of 80, you have 80 partitions. When you call sortPartition(), you get 80 sorted streams; when you call mapPartition, you iterate over all records in one partition.
* Split: Splits are a concept of InputFormats. An InputFormat can process several splits. All splits that are processed by the same data source task make up the partition of that task, so a split is a subset of a partition. In your case, where each task reads exactly one split, the split is equivalent to the partition.
* Group: A group is based on the groupBy attribute and hence is data-driven; it does not depend on the parallelism. A groupReduce requires a partitioning such that all records with the same grouping attribute are sent to the same operator, i.e., all are part of the same partition. Depending on the number of distinct grouping keys (and the hash function), a partition can have zero, one, or more groups. (The sketch right below illustrates the partition/group distinction.)
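
To make the distinction concrete, here is a minimal, self-contained sketch against the DataSet API. The class name and the toy data are invented for illustration:

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class PartitionVsGroup {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // 4 parallel task instances -> 4 partitions

        DataSet<Tuple2<String, Long>> records = env.fromElements(
                new Tuple2<>("host1", 1L), new Tuple2<>("host1", 2L),
                new Tuple2<>("host2", 3L), new Tuple2<>("host2", 4L),
                new Tuple2<>("host3", 5L));

        // Per *partition*: one call per parallel instance, i.e. 4 counts here,
        // no matter what the data looks like (an empty partition yields 0).
        records.mapPartition(new MapPartitionFunction<Tuple2<String, Long>, Long>() {
            @Override
            public void mapPartition(Iterable<Tuple2<String, Long>> values, Collector<Long> out) {
                long count = 0L;
                for (Tuple2<String, Long> ignored : values) {
                    count++;
                }
                out.collect(count);
            }
        }).print();

        // Per *group*: one result per distinct key, i.e. exactly 3 sums here
        // (host1, host2, host3), independent of the parallelism.
        records.groupBy(0).reduceGroup(
                new GroupReduceFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
            @Override
            public void reduce(Iterable<Tuple2<String, Long>> values,
                               Collector<Tuple2<String, Long>> out) {
                String host = null;
                long sum = 0L;
                for (Tuple2<String, Long> v : values) {
                    host = v.f0;
                    sum += v.f1;
                }
                out.collect(new Tuple2<>(host, sum));
            }
        }).print();
    }
}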

Now coming to your use case: you have 80 sources running on 5 machines, and all sources on the same machine produce records with the same grouping key (hostname). You can actually give Flink a hint that the data returned by a split is partitioned, grouped, or sorted in a specific way. This works as follows:

// String is the hostname, Integer is the parallel id of the source task
DataSet<Tuple3<String, Integer, Long>> input = env.createInput(yourFormat);
SplitDataProperties<Tuple3<String, Integer, Long>> splitProps =
    ((DataSource<Tuple3<String, Integer, Long>>) input).getSplitDataProperties();
splitProps.splitsGroupedBy(0, 1);
splitProps.splitsPartitionedBy(0, 1);

With this info, Flink knows that the data returned by our source is partitioned and grouped. Now you can do groupBy(0,1).groupReduce(XXX) to run a local groupReduce operation on each of the 80 tasks (hostname and parallel index result in 80 keys) and locally reduce the data.
The next step would be another .groupBy(0).groupReduce(), which gives 5 groups (one per hostname) of 16 locally reduced records each, distributed across your tasks. Both steps are sketched below.
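
Put together, the two steps could look roughly like this. The merge logic here is invented (it just sums the third field); substitute your actual aggregation. Imports are as in the snippet above, plus Tuple3, GroupReduceFunction, and Collector:

// Hypothetical merge function: keeps hostname and parallel id, sums the value.
public static class SumMerge implements
        GroupReduceFunction<Tuple3<String, Integer, Long>, Tuple3<String, Integer, Long>> {
    @Override
    public void reduce(Iterable<Tuple3<String, Integer, Long>> values,
                       Collector<Tuple3<String, Integer, Long>> out) {
        String host = null;
        Integer pid = null;
        long sum = 0L;
        for (Tuple3<String, Integer, Long> v : values) {
            host = v.f0;
            pid = v.f1;
            sum += v.f2;
        }
        out.collect(new Tuple3<String, Integer, Long>(host, pid, sum));
    }
}

// Step 1: 80 keys (5 hosts x 16 parallel ids); with the split properties
// from above, the optimizer can run this without a shuffle.
DataSet<Tuple3<String, Integer, Long>> local =
        input.groupBy(0, 1).reduceGroup(new SumMerge());

// Step 2: 5 keys (one per hostname); this step ships data over the network.
DataSet<Tuple3<String, Integer, Long>> perHost =
        local.groupBy(0).reduceGroup(new SumMerge());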

However, you have to be careful with the SplitDataProperties. If you get them wrong, the optimizer makes false assumptions and the resulting plan might not compute what you are looking for.
I'd recommend reading the JavaDocs and playing a bit with this feature to see how it behaves. ExecutionEnvironment.getExecutionPlan() can help to figure out what is happening.
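
For example, roughly:

// Prints the optimizer's execution plan as JSON for inspection,
// so you can check where the plan actually inserts shuffles.
System.out.println(env.getExecutionPlan());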

Best,
Fabian


2017-01-13 12:14 GMT+01:00 Robert Schmidtke <ro.schmidtke@gmail.com>:
Hi all,

I'm having some trouble grasping the meaning of, and the difference between, the following concepts:

- Split
- Group
- Partition

Let me elaborate a bit on the problem I'm trying to solve here. In my tests I'm using a 5-node cluster, on which I'm running Flink 1.1.3 in standalone mode. Each node has 64G of memory and 32 cores. I'm starting the JobManager on one node, and a TaskManager on each node. I'm assigning 16 slots to each TaskManager, so the overall parallelism is 80 (= 5 TMs x 16 slots).

The data I want to process resides in a local folder on each worker with the same path (say /tmp/input). There can be arbitrarily many input files in each worker's folder. I have written a custom input format that round-robin assigns the files to each of the 16 local input splits (https://github.com/robert-schmidtke/hdfs-statistics-adapter/blob/master/sfs-analysis/src/main/java/de/zib/sfs/analysis/io/SfsInputFormat.java) to obtain a total of 80 input splits that need processing. Each split reads zero or more files, parsing the contents into records that are emitted correctly. This works as expected.
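
For illustration, the round-robin assignment boils down to something like this (a simplified sketch, not the actual SfsInputFormat code; error handling omitted):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Distribute all files in /tmp/input over 16 per-host splits, round-robin.
File[] files = new File("/tmp/input").listFiles();
int splitsPerHost = 16;
List<List<File>> filesPerSplit = new ArrayList<List<File>>(splitsPerHost);
for (int i = 0; i < splitsPerHost; i++) {
    filesPerSplit.add(new ArrayList<File>());
}
for (int i = 0; i < files.length; i++) {
    filesPerSplit.get(i % splitsPerHost).add(files[i]); // file i -> split i % 16
}
// Split j then reads the files in filesPerSplit.get(j), possibly none.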

Now we're getting to the questions. How do these 80 input splits relate to groups and partitions? My understanding is that a partition is a subset of my DataSet<X> that is local to each node, i.e., if I were to repartition the data according to some scheme, a shuffle across workers would occur. After reading all the data, I have 80 partitions, correct?
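
For example, I'd expect an explicit repartitioning like the following to cause such a shuffle (Record is a placeholder for my actual record type):

// Hypothetical: redistribute records across workers by hashing the hostname field.
DataSet<Record> repartitioned = records.partitionByHash("hostname");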

What is less clear to me is the concept of a group, i.e. the result of a groupBy operation. The input files I have are produced on each worker by some other process. I first want to do pre-aggregation (I hope that's the term) on each node before sending data over the network. The records I'm processing contain a 'hostname' attribute, which is set to the hostname of the worker that processes the data, because the DataSources are local. That means the records produced by the worker on host1 always contain the attribute hostname=host1. Similarly for the other 4 workers.

Now what happens if I do a groupBy("hostname")? How do the workers realize that no network transfer is necessary? Is a group a logical abstraction, or a physical one? (In my understanding, a partition is physical because it's local to exactly one worker.)

What I'd like to do next is a reduceGroup to merge multiple records into one (some custom, yet straightforward, aggregation) and emit another record for every couple of input records. Am I correct in assuming that the Iterable<X> values passed to the reduce function all have the same hostname value? That is, will the operation have a parallelism of 80, where 5x16 operations will have the same hostname value? Because I have 16 splits per host, the 16 reduces on host1 should all receive values with hostname=host1, correct? And after the operation has finished, will the reduced groups (now actual DataSets again) still be local to the workers?
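
In code, what I have in mind is roughly this (Record, getHostname, and mergeWith are placeholders for my actual types and logic):

records.groupBy("hostname").reduceGroup(new GroupReduceFunction<Record, Record>() {
    @Override
    public void reduce(Iterable<Record> values, Collector<Record> out) {
        Record merged = null;
        for (Record r : values) {
            // My assumption: r.getHostname() is identical for every record here.
            merged = (merged == null) ? r : merged.mergeWith(r);
        }
        if (merged != null) {
            out.collect(merged);
        }
    }
});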

This is quite a lot to work on, I have to admit. I'm happy for any hints, advice and feedback on this. If there's need for clarification I'd be happy to provide more information.

Thanks a lot in advance!

Robert

--
My GPG Key ID: 336E2680




--
My GPG Key ID: 336E2680