Subject: Re: Best way to join with inequalities (historical data)
From: Stephan Ewen <ewenstephan@gmail.com>
To: user@flink.apache.org
Date: Mon, 4 May 2015 14:54:47 +0200

When you use cross() it is always good to use crossWithHuge or crossWithTiny to tell the system which side is small. It cannot always infer that automagically at this point.
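
As a rough sketch, assuming the historical set really is the small side, that hint applied to the cross+filter from the original mail below would look like:

    // Same cross + filter as in the original question, but with a size hint:
    // 'historical' is declared as the tiny side, so it gets replicated to all nodes.
    events.crossWithTiny(historical).filter((crossedRow) ->
            crossedRow.f0.f1 >= crossedRow.f1.f0 && crossedRow.f0.f1 <= crossedRow.f1.f1);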

If you have an optimized structure for the lookups, go with a broadcast variable and a map() function.
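
A minimal sketch of that broadcast-variable approach, reusing the events/historical datasets from Arnaud's original mail (the broadcast name "historical", the output shape, and the linear scan are illustrative assumptions, not from this thread):

    import java.util.List;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.configuration.Configuration;

    // Each parallel mapper receives the full historical set as a broadcast
    // variable and looks up the label that is valid at the event's timestamp.
    DataSet<Tuple3<String, Long, String>> labelled = events
        .map(new RichMapFunction<Tuple2<String, Long>, Tuple3<String, Long, String>>() {
            private List<Tuple3<Long, Long, String>> intervals;

            @Override
            public void open(Configuration parameters) {
                intervals = getRuntimeContext().getBroadcastVariable("historical");
            }

            @Override
            public Tuple3<String, Long, String> map(Tuple2<String, Long> event) {
                // Linear scan for brevity; sorting the intervals by start time
                // gives the log(n) lookup Arnaud mentions below.
                for (Tuple3<Long, Long, String> iv : intervals) {
                    if (event.f1 >= iv.f0 && event.f1 <= iv.f1) {
                        return new Tuple3<>(event.f0, event.f1, iv.f2);
                    }
                }
                return new Tuple3<>(event.f0, event.f1, null);
            }
        })
        .withBroadcastSet(historical, "historical");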

On Mon, May 4, 2015 at 1:40 PM, LINZ, Arnaud <ALINZ@bouyguestelecom.fr> wrote:
Hi,
Thanks. The use case I have right now does not require too much magic; my historical data set is small enough to fit in RAM, I'll spread it over each node and use a simple mapping with a log(n) lookup. It was more a theoretical question.
If my dataset becomes too large, I may use some hashing techniques (for instance at day level) and cut the intervals at hash frontiers by duplicating the row to prevent overlapping.
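
A rough sketch of that day-level splitting, with the datasets from the original question (the day granularity, the names, and the output shape are assumptions, not something agreed in this thread):

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.api.java.tuple.Tuple4;
    import org.apache.flink.util.Collector;

    final long DAY_MS = 24L * 60 * 60 * 1000;

    // Explode each interval into one slice per day it covers, keyed by that day.
    DataSet<Tuple4<Long, Long, Long, String>> slices = historical
        .flatMap(new FlatMapFunction<Tuple3<Long, Long, String>, Tuple4<Long, Long, Long, String>>() {
            @Override
            public void flatMap(Tuple3<Long, Long, String> iv,
                                Collector<Tuple4<Long, Long, Long, String>> out) {
                for (long day = iv.f0 / DAY_MS; day <= iv.f1 / DAY_MS; day++) {
                    out.collect(new Tuple4<>(day, iv.f0, iv.f1, iv.f2));
                }
            }
        });

    // Equi-join events and slices on the day, then keep only real matches.
    DataSet<Tuple2<Tuple2<String, Long>, Tuple4<Long, Long, Long, String>>> joined = events
        .join(slices)
        .where(new KeySelector<Tuple2<String, Long>, Long>() {
            @Override
            public Long getKey(Tuple2<String, Long> e) { return e.f1 / DAY_MS; }
        })
        .equalTo(new KeySelector<Tuple4<Long, Long, Long, String>, Long>() {
            @Override
            public Long getKey(Tuple4<Long, Long, Long, String> s) { return s.f0; }
        })
        .filter(pair -> pair.f0.f1 >= pair.f1.f1 && pair.f0.f1 <= pair.f1.f2);

Since an event falls into exactly one day bucket and each interval contributes at most one slice per day, every (event, interval) pair is matched at most once, so no de-duplication is needed afterwards.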

Arnaud




-----Original Message-----
From: Matthias J. Sax [mailto:mjsax@informatik.hu-berlin.de]
Sent: Monday, May 4, 2015 11:52
To: user@flink.apache.org
Subject: Re: Best way to join with inequalities (historical data)

Hi,

there is no dedicated system support to express this kind of join.

However, you could perform some "hand-wired" optimization by partitioning your input data into distinct intervals. It might be tricky, though, especially if the time ranges in your "range-key" dataset are overlapping everywhere (-> data replication necessary for overlapping parts).

But it might be worth the effort if you can't get the job done using cross-product. How large are your data sets? What hardware are you using?


-Matthias


On 05/04/2015 10:47 AM, LINZ, Arnaud wrote:
> Hello,
>
>
>
> I was wondering how to join large data sets on inequalities.
>
>
>
> Let's say I have a data set whose “keys” are two timestamps (start time
> & end time of validity) and whose value is a label:
>
>         final DataSet<Tuple3<Long, Long, String>> historical = …;
>
>
>
> I also have events, with an event name and a timestamp:
>
>         final DataSet<Tuple2<String, Long>> events = …;
>
>
>
> I want to join my events with my historical data to get the “active”
> label for the time of the event.
>
> The simple way is to use a cross product + a filter:
>
>
>
> events.cross(historical).filter((crossedRow) -> {
>
>         return (crossedRow.f0.f1 >= crossedRow.f1.f0) &&
>                (crossedRow.f0.f1 <= crossedRow.f1.f1);
>
>         })
>
>
>
> But that's not efficient with 2 big data sets…
>
>
>
> How would you code that?
>
>
>
> Greetings,
>
> Arnaud
>
>
>
>
>
>
>
>
>
>
> ------------------------------------------------------------------------
>
> The integrity of this message cannot be guaranteed on the Internet.
> The company that sent this message cannot therefore be held liable for
> its content nor attachments. Any unauthorized use or dissemination is
> prohibited. If you are not the intended recipient of this message,
> then please delete it and notify the sender.

