Date: Mon, 10 Nov 2014 12:05:47 +0100
Subject: Re: PartitionByHash and usage of KeySelector
From: Fabian Hueske
To: user@flink.incubator.apache.org

Yes, if you split the data set manually (maybe using filter) into multiple data sets, you could use Cross. However, Cross is a binary operation, so you would need to use it as a self-cross, which would produce symmetric pairs just like the join. I'm not sure I would do this in a single job, i.e., run all cross operations concurrently. It might be better to partition the data up-front and run a separate job for each group. Best, Fabian 2014-11-10 11:08 GMT+01:00 Stefano Bortoli : > Thanks a lot Fabian. You clarified many points. Currently I am trying to run > the job relying on a global index built with SOLR. It worked on a dataset > of about 1M records, but it failed with an obscure exception on the one of > 9.2M. If I cannot make it work, I will go back to the grouping approach. > > Just a question. If I create a dataset for each group of a dataset, then I > could use Cross on each of the groups. Right? However, I guess it would > be smarter to have a reduceGroup capable of generating just the pairs that > would need to be compared. > > Thanks a lot again. Keep up the great work! :-) > > saluti, > Stefano > > > 2014-11-10 10:50 GMT+01:00 Fabian Hueske : > >> Hi Stefano, >> >> I'm not sure if we use the same terminology here. What you call >> partitioning might be called grouping in Flink's API / documentation. >> >> Grouping builds groups of elements that share the same key. This is a >> deterministic operation. >> Partitioning distributes elements over a set of machines / parallel >> workers. 
If this is done using hash partitioning, Flink determines the >> parallel worker for an element by hashing the element's partition key ( >> mod(hash(key), #workers) ). Consequently, all elements with the same >> partition key will be shipped to the same worker, BUT so will all other >> elements for which mod(hash(key), #workers) happens to be the same. If you >> apply a mapPartition over these partitions, all of these elements will be >> mixed together. If the number of workers (or the hash function) >> changes, the partitions will look different. When grouping, all elements of the >> group have the same key (and all elements with that key are in the >> group). >> >> Flink's cross operator builds a dataset-wide cross product. It does not >> respect groups (or partitions). If you want to build a cross product within >> a group, you can do that with a groupReduce, which requires holding all >> elements of the group in memory or manually spilling them to disk in your UDF. >> Alternatively, you can use a self join (join a data set with itself), which >> will give you all pairs of the cross product in individual function calls. However, >> Flink does not currently treat self joins specially, so the >> performance could be further optimized. You'll also get symmetric pairs (a-b, b-a, >> a-a, b-b, for two elements a, b with the same join key). >> >> If it is possible to combine the macro-parameter keys and the >> minor-blocking keys into a single key, you could specify a key-selector >> function x() and either do >> - dataSet.groupBy(x).reduceGroup( *read full group into memory, and apply >> expensive function to each pair of elements* ); or >> - dataSet.join(dataSet).where(x).equalTo(x).with( *check for symmetric >> pairs and apply expensive compare function* ). >> >> BTW, there was a similar use case a few days back on the mailing list. >> Might be worth reading that thread [1]. 
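The first option above (read the full group into memory, then compare each pair) can be sketched in plain Java. This is only an illustration of what the reduceGroup UDF would do internally, not Flink API code; the type `String` and the group contents are placeholders:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GroupCross {
    // Emit all unordered pairs within one group, as the reduceGroup UDF
    // would after collecting the full group into memory.
    static List<String[]> pairsWithinGroup(List<String> group) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            // Start j at i + 1 to skip self-pairs (a-a) and the
            // symmetric duplicates (b-a) that a self join would emit.
            for (int j = i + 1; j < group.size(); j++) {
                pairs.add(new String[]{group.get(i), group.get(j)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // One group whose elements all share the same key x(e).
        List<String> group = Arrays.asList("a", "b", "c");
        for (String[] p : pairsWithinGroup(group)) {
            System.out.println(p[0] + "-" + p[1]); // a-b, a-c, b-c
        }
    }
}
```

Note that for a group of n elements this still materializes n*(n-1)/2 pairs, which is why large groups need the spill-to-disk handling Fabian mentions.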
>> Since this is the second time that this issue has come up, we might >> consider adding better support for group-wise cross operations. >> >> Cheers, Fabian >> >> [1] >> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/load-balancing-groups-td2287.html
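The mod(hash(key), #workers) partitioning scheme Fabian describes can be demonstrated with plain Java. This is a sketch, not Flink's internal code; `Math.floorMod` is used here so negative hash codes still map to a valid worker index:

```java
import java.util.Arrays;
import java.util.List;

public class HashPartitionDemo {
    // Worker index for a key: mod(hash(key), #workers), as described above.
    static int workerFor(String key, int numWorkers) {
        return Math.floorMod(key.hashCode(), numWorkers);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "date");

        // Distinct keys can land on the same worker whenever their hashes
        // collide modulo the worker count -- a partition is NOT a group.
        for (String k : keys) {
            System.out.println(k + " -> worker " + workerFor(k, 4));
        }

        // Changing the number of workers reshuffles the assignment, which is
        // why the same data set can partition differently between runs.
        for (String k : keys) {
            System.out.println(k + " -> worker " + workerFor(k, 3));
        }
    }
}
```

In contrast, groupBy(key) guarantees that exactly the elements sharing a key end up in one group, independent of the worker count.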