Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
Received-SPF: pass (athena.apache.org: domain of ewenstephan@gmail.com
 designates 209.85.223.171 as permitted sender)
MIME-Version: 1.0
Sender: ewenstephan@gmail.com
In-Reply-To: 
 <CADSBNAiiHGo=_Rd1sMh0s2TSnE2N2u0tGhKnQVpFc+gHFnUEFw@mail.gmail.com>
References: 
 <CADSBNAj0=4EP_SLjp4=aSYBkzistraGhL5gsZrhhpt=q4UGdrQ@mail.gmail.com>
	<CANC1h_uuKX+bgyhe2w-Exo3y68CSdHEnbbq6=v7OL+xKTDDFFg@mail.gmail.com>
	<CADSBNAjrOUA3xVvi=9UT+2dnQwGC0h5M3-cwztQk_EvFJX=q0w@mail.gmail.com>
	<CANC1h_unxN2Uh4ff-j0oEa_ynbxfTBo_p8ee-6soTyufgr=Ufg@mail.gmail.com>
	<CADSBNAixhAjrS1Qcx=8urECLXt5-nCowFqiz5GS-dvS9XGOFPQ@mail.gmail.com>
	<CADSBNAiiHGo=_Rd1sMh0s2TSnE2N2u0tGhKnQVpFc+gHFnUEFw@mail.gmail.com>
Date: Mon, 16 Mar 2015 08:58:53 +0100
Message-ID: 
 <CANC1h_vMUVxhxrUb9dC6B23SzrWT=Mk0ZbNMD8qQpxMX4-+Xbw@mail.gmail.com>
Subject: Re: Sort tuple dataset
From: Stephan Ewen <sewen@apache.org>
To: user@flink.apache.org
Content-Type: multipart/alternative; boundary=001a113f29bcb172f005116338a7

--001a113f29bcb172f005116338a7
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I think that depends on your use case. If you want to work on the entire
dataset as a whole anyway, you can assign a dummy key (like 0) to all
elements, group by that key, and sort the group on the actual value.
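
To make the dummy-key idea concrete, here is a minimal plain-Java sketch of
the semantics. It is not Flink code, and the class and method names are made
up for illustration. It models "group by key, then sort inside each group"
and shows why grouping on the element itself gives no overall order, while a
constant key puts everything into one group that can be sorted as a whole.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of "group by key, then sort within each group".
// Hypothetical helper for illustration only; it does not use the Flink API.
class DummyKeySortSketch {

    // Assigns each value to a group via keyOf, then sorts every group in
    // descending order. Groups are kept in first-seen order, so there is
    // no sorted order across groups.
    static <K> Map<K, List<Integer>> groupAndSortDescending(
            List<Integer> values, Function<Integer, K> keyOf) {
        Map<K, List<Integer>> groups = new LinkedHashMap<>();
        for (Integer value : values) {
            groups.computeIfAbsent(keyOf.apply(value), k -> new ArrayList<>()).add(value);
        }
        for (List<Integer> group : groups.values()) {
            group.sort(Collections.reverseOrder());
        }
        return groups;
    }

    // Dummy-key variant: every element gets key 0, so there is exactly one
    // group, and sorting that group sorts the whole input.
    static List<Integer> sortViaDummyKey(List<Integer> values) {
        Map<Integer, List<Integer>> groups = groupAndSortDescending(values, v -> 0);
        return groups.getOrDefault(0, new ArrayList<>());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(2, 1, 5, 3, 4, 5);

        // Grouping on the element itself: tiny groups, no global order.
        // prints {2=[2], 1=[1], 5=[5, 5], 3=[3], 4=[4]}
        System.out.println(groupAndSortDescending(input, v -> v));

        // Grouping on a constant key: one group, fully sorted.
        // prints [5, 5, 4, 3, 2, 1]
        System.out.println(sortViaDummyKey(input));
    }
}
```

In an actual Flink 0.8.1 job the grouping would still have to be followed by
a reduce or aggregate function, as noted further down in this thread, for the
sorted group to be emitted. Since all elements share one key, the sort is not
parallel, so it is best applied after the data size has been reduced.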

What exactly is your use case? Does the above solution work there?

On 15.03.2015 at 17:39, "Kristoffer Sjögren" <stoffe@gmail.com> wrote:

> After building Flink 0.9-SNAPSHOT from source, DataSet.sortPartition is
> indeed working as expected.
>
> This is fine, but it raises the question of how to go about sorting in
> 0.8.1.
>
> On Sun, Mar 15, 2015 at 5:05 PM, Kristoffer Sjögren <stoffe@gmail.com>
> wrote:
>
>> That's the thing, there is no DataSet.sortPartition method in 0.8.1.
>> Looking through the git history shows that sortPartition was added on
>> the 20th of February, so I think that's 0.9-SNAPSHOT?
>>
>>
>> On Sun, Mar 15, 2015 at 4:51 PM, Stephan Ewen <sewen@apache.org> wrote:
>>
>>> Hi!
>>>
>>> I think sort partition is the right thing, if you have only one
>>> partition (which makes sense, if you want a total order). It is not a
>>> parallel operation any more, so use it only after the data size has
>>> been reduced (filters / aggregations).
>>>
>>> What about "data.sortPartition().setParallelism(1)"?
>>>
>>> Does that work for you?
>>>
>>> Greetings,
>>> Stephan
>>>
>>>
>>> On Sun, Mar 15, 2015 at 4:47 PM, Kristoffer Sjögren <stoffe@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your answer. I guess I'm a bit infected by writing too much
>>>> Crunch code, and I also suspected that getDataSet() was the wrong thing
>>>> to do :-)
>>>>
>>>> However I was expecting DataSet.sortPartition to do the sorting, but
>>>> this method is missing in 0.8.1?
>>>>
>>>> Do you have a minimal example? I was looking through the tests but most
>>>> of them use sortPartition.
>>>>
>>>> Cheers,
>>>> -Kristoffer
>>>>
>>>>
>>>>
>>>> On Sun, Mar 15, 2015 at 4:22 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>>
>>>>> Hi Kristoffer!
>>>>>
>>>>> There are a few issues with that code:
>>>>>
>>>>> 1) Grouping and then calling "sort group" sorts within the group. In
>>>>> your case, you group by the entire element, so each group holds only
>>>>> one distinct value: the element itself. Sorting inside the group does
>>>>> not make any difference. There is no order across groups.
>>>>>
>>>>> 2) This code never groups and sorts. The calls to
>>>>> "groupBy(0).sortGroup(0, Order.DESCENDING)" do not group and sort by
>>>>> themselves; they set up a grouping to be used with a reduce or
>>>>> aggregate function. The "getDataSet()" call gets you the data set the
>>>>> grouping was defined on, which is the original input.
>>>>>
>>>>> To see an illustration of this, get the program plan
>>>>> (env.getExecutionPlan()). You can render it using the html file
>>>>> "tools/planVisualizer.html".
>>>>>
>>>>> Greetings,
>>>>> Stephan
>>>>>
>>>>>
>>>>> On Sun, Mar 15, 2015 at 3:29 PM, Kristoffer Sjögren <stoffe@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> This is silly but I can't understand why the following code doesn't
>>>>>> sort the collection of integers. It seems to be a reasonable thing to
>>>>>> do from an API perspective?
>>>>>>
>>>>>> Cheers,
>>>>>> -Kristoffer
>>>>>>
>>>>>> final ExecutionEnvironment env =
>>>>>>     ExecutionEnvironment.getExecutionEnvironment();
>>>>>> env.fromCollection(Lists.newArrayList(2, 1, 5, 3, 4, 5))
>>>>>>     .map(new MapFunction<Integer, Tuple1<Integer>>() {
>>>>>>       @Override
>>>>>>       public Tuple1<Integer> map(Integer value) throws Exception {
>>>>>>         return new Tuple1<Integer>(value);
>>>>>>       }
>>>>>>     })
>>>>>>     .groupBy(0).sortGroup(0, Order.DESCENDING).getDataSet().print();
>>>>>> env.execute();


--001a113f29bcb172f005116338a7--