Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
MIME-Version: 1.0
References: <CADvKAvs+pMihULgc72iJ7cVuEMt72U7sKBPg6FZfJB6n8XU9uA@mail.gmail.com>
 <CAAdrtT3zmeA1NcTWJTtRdiYn7nHFD717uWKs2Q8cS0WeYNdTmA@mail.gmail.com>
 <CADXjeyDZciGcK61HfJ0Rt_d+t4-V9b-QcHo=HPHXrBGxQfbQmA@mail.gmail.com> <CAAdrtT2RgPGi_uN3Q7QsB95OJrsvDtRfO9rSA9JZHcXTzbLkUQ@mail.gmail.com>
In-Reply-To: <CAAdrtT2RgPGi_uN3Q7QsB95OJrsvDtRfO9rSA9JZHcXTzbLkUQ@mail.gmail.com>
From: Aljoscha Krettek <aljoscha@apache.org>
Date: Tue, 24 Jan 2017 16:52:53 +0000
Message-ID: <CANMXwW0NNFrwe8tNKd6OgC=uYg-xijza876W6+kBBxDMXjPLcw@mail.gmail.com>
Subject: Re: How to get top N elements in a DataSet?
To: user@flink.apache.org
Content-Type: multipart/alternative; boundary=001a113ebfa41716a40546d9f388
archived-at: Tue, 24 Jan 2017 16:53:07 -0000

--001a113ebfa41716a40546d9f388
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

@Fabian, I think there's a typo in your code. Shouldn't it be:

dataset // assuming some partitioning that can be reused to avoid a shuffle
  .sortPartition(1, Order.DESCENDING)
  .mapPartition(new ReturnFirstTen())
  .sortPartition(1, Order.DESCENDING)
  .mapPartition(new ReturnFirstTen()).setParallelism(1)

i.e. the second MapPartition has to be parallelism=1
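
To make the two-phase idea concrete, here is a rough sketch in plain Java
with no Flink dependency. It is only an illustration, not what
ReturnFirstTen actually looks like: the names TopNSketch, Rated, topN and
globalTopN are made up for this example, and Rated merely stands in for a
two-field tuple of id and rating. It uses a bounded heap, as Fabian
suggests below, instead of sorting whole partitions, and it copies each
element before storing it, which is what you would need with object reuse
enabled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class TopNSketch {

    /** Hypothetical stand-in for a two-field tuple (id, rating). */
    static final class Rated {
        final int id;
        final double rating;

        Rated(int id, double rating) {
            this.id = id;
            this.rating = rating;
        }
    }

    /** Returns the n highest-rated elements, best first, using a bounded min-heap. */
    static List<Rated> topN(Iterable<Rated> values, int n) {
        PriorityQueue<Rated> heap =
                new PriorityQueue<>((a, b) -> Double.compare(a.rating, b.rating));
        for (Rated r : values) {
            // Store a copy rather than the incoming instance; a runtime that
            // reuses objects may overwrite the instance it handed us.
            heap.add(new Rated(r.id, r.rating));
            if (heap.size() > n) {
                heap.poll(); // drop the lowest rating currently kept
            }
        }
        List<Rated> result = new ArrayList<>(heap);
        result.sort((a, b) -> Double.compare(b.rating, a.rating));
        return result;
    }

    /** Phase 1: top n of every partition. Phase 2: top n of the merged candidates. */
    static List<Rated> globalTopN(List<List<Rated>> partitions, int n) {
        List<Rated> candidates = new ArrayList<>();
        for (List<Rated> partition : partitions) {
            candidates.addAll(topN(partition, n)); // would run in parallel
        }
        return topN(candidates, n); // would run with parallelism 1
    }

    public static void main(String[] args) {
        List<List<Rated>> partitions = Arrays.asList(
                Arrays.asList(new Rated(1, 3.5), new Rated(2, 4.9), new Rated(3, 1.0)),
                Arrays.asList(new Rated(4, 4.7), new Rated(5, 2.2)),
                Arrays.asList(new Rated(6, 5.0)));
        for (Rated r : globalTopN(partitions, 2)) {
            System.out.println(r.id + " " + r.rating);
        }
    }
}
```

The final step only ever sees at most n elements per partition, so doing
it in a single task is cheap.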


On Tue, 24 Jan 2017 at 11:57 Fabian Hueske <fhueske@gmail.com> wrote:

> You are of course right Gabor.
> @Ivan, you can use a heap in the MapPartitionFunction to collect the top
> 10 elements (note that you need to create deep-copies if object reuse is
> enabled [1]).
>
> Best, Fabian
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/batch/index.html#operating-on-data-objects-in-functions
>
>
> 2017-01-24 11:49 GMT+01:00 Gábor Gévay <ggab90@gmail.com>:
>
> Hello,
>
> Btw. there is a Jira about this:
> https://issues.apache.org/jira/browse/FLINK-2549
> Note that the discussion there suggests a more efficient approach,
> which doesn't involve sorting the entire partitions.
>
> And if I remember correctly, this question comes up from time to time
> on the mailing list.
>
> Best,
> Gábor
>
>
>
> 2017-01-24 11:35 GMT+01:00 Fabian Hueske <fhueske@gmail.com>:
> > Hi Ivan,
> >
> > I think you can use MapPartition for that.
> > So basically:
> >
> > dataset // assuming some partitioning that can be reused to avoid a shuffle
> >   .sortPartition(1, Order.DESCENDING)
> >   .mapPartition(new ReturnFirstTen())
> >   .sortPartition(1, Order.DESCENDING).setParallelism(1)
> >   .mapPartition(new ReturnFirstTen())
> >
> > Best, Fabian
> >
> >
> > 2017-01-24 10:10 GMT+01:00 Ivan Mushketyk <ivan.mushketik@gmail.com>:
> >>
> >> Hi,
> >>
> >> I have a dataset of tuples with two fields, ids and ratings, and I need to
> >> find the 10 elements with the highest rating in this dataset. I found a
> >> solution, but I think it's suboptimal and there should be a better way
> >> to do it.
> >>
> >> The best thing that I came up with is to partition the dataset by rating,
> >> sort locally, and write the partitioned dataset to disk:
> >>
> >> dataset
> >> .partitionCustom(new Partitioner<Double>() {
> >>   @Override
> >>   public int partition(Double key, int numPartitions) {
> >>     return key.intValue() % numPartitions;
> >>   }
> >> }, 1) // partition by rating
> >> .setParallelism(5)
> >> .sortPartition(1, Order.DESCENDING) // locally sort by rating
> >> .writeAsText("..."); // write the partitioned dataset to disk
> >>
> >> This will store tuples in sorted files with names 5, 4, 3, ... that
> >> contain ratings in ranges (5, 4], (4, 3], and so on. Then I can read the
> >> sorted data from disk and find the N elements with the highest rating.
> >> Is there a way to do the same but without writing a partitioned dataset
> >> to disk?
> >>
> >> I tried to use "first(10)" but it seems to give top 10 items from a
> >> random partition. Is there a way to get the top N elements from every
> >> partition? Then I could locally sort the top values from every partition
> >> and find the top 10 global values.
> >>
> >> Best regards,
> >> Ivan.
> >>
> >>
> >
>
>
>

--001a113ebfa41716a40546d9f388--