Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
Received-SPF: pass (athena.apache.org: message received from 54.164.171.186
 which is an MX secondary for user@flink.apache.org)
MIME-Version: 1.0
In-Reply-To: 
 <CAELUF_AP3VZzUT0z64-2trn6+BAiQGn-grJPFihUh+gCKcug=Q@mail.gmail.com>
References: 
 <CAELUF_D4ndHOHJkjvn2ayW8WhXqGAJRm0snG3bX_wwS2=x=hCQ@mail.gmail.com>
	<CAAdrtT3QRTn7uOJAX+_h5+-NVrcw+4Xx1yXU5LDSDaO7H-mKJg@mail.gmail.com>
	<CAELUF_AP3VZzUT0z64-2trn6+BAiQGn-grJPFihUh+gCKcug=Q@mail.gmail.com>
Date: Mon, 4 May 2015 14:59:29 +0200
Message-ID: 
 <CAAdrtT1M7dAa1S459hiVMQXy9NC+jpLGmdyWqK_j1o4xXjJAjA@mail.gmail.com>
Subject: Re: filter().project() vs flatMap()
From: Fabian Hueske <fhueske@gmail.com>
To: user@flink.apache.org
Content-Type: multipart/alternative; boundary=089e0112cdfee609de0515412119

--089e0112cdfee609de0515412119
Content-Type: text/plain; charset=UTF-8

That might help with cardinality estimation for cost-based optimization,
for example when deciding on join strategies (broadcast vs. repartition,
or which input becomes the build side of a hash join).
However, as Stephan said, there are many cases where it does not make a
difference, e.g. if the input cardinality of the filter (or the size of the
other join input) is unknown.

I think the chances are low that it makes a difference.
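
To make the equivalence from the original question concrete, here is a
minimal sketch. It deliberately uses plain Java streams as a stand-in, not
the Flink API, and the Row type, the threshold of 50, and all method names
are hypothetical. Both variants keep the rows that pass the predicate and
project them to (id, score), so the output never has more elements than the
input.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FilterProjectVsFlatMap {

    // Hypothetical three-field row standing in for a tuple.
    static class Row {
        final int id;
        final String name;
        final int score;

        Row(int id, String name, int score) {
            this.id = id;
            this.name = name;
            this.score = score;
        }
    }

    // Projection: keep only fields 0 and 2 (id and score).
    static Map.Entry<Integer, Integer> project(Row r) {
        return new AbstractMap.SimpleImmutableEntry<>(r.id, r.score);
    }

    // Variant 1: filter first, then project in a separate step.
    static List<Map.Entry<Integer, Integer>> filterThenProject(List<Row> rows) {
        return rows.stream()
                .filter(r -> r.score > 50)
                .map(FilterProjectVsFlatMap::project)
                .collect(Collectors.toList());
    }

    // Helper for variant 2: emits zero or one projected element per row.
    static Stream<Map.Entry<Integer, Integer>> keepAndProject(Row r) {
        return r.score > 50 ? Stream.of(project(r)) : Stream.empty();
    }

    // Variant 2: a single flatMap does the filtering and the projection.
    static List<Map.Entry<Integer, Integer>> flatMapOnly(List<Row> rows) {
        return rows.stream()
                .flatMap(FilterProjectVsFlatMap::keepAndProject)
                .collect(Collectors.toList());
    }

    // Small made-up input used only for illustration.
    static List<Row> sample() {
        return Arrays.asList(
                new Row(1, "a", 80),
                new Row(2, "b", 30),
                new Row(3, "c", 65));
    }

    public static void main(String[] args) {
        List<Row> rows = sample();
        System.out.println(filterThenProject(rows));
        System.out.println(flatMapOnly(rows));
    }
}
```

The difference is only in what the operator signature reveals: a filter can
emit at most one element per input, while a flatMap may in general emit any
number, so the bound on the output size is visible to the system only in
the first variant.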


2015-05-04 14:53 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:

> Thanks Sebastian and Fabian for the feedback, just one last question:
> what changes from the system's point of view if it knows that the number
> of output tuples is <= the number of input tuples?
> Is there any optimization that Flink can apply to the pipeline?
>
> On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <fhueske@gmail.com> wrote:
>
>> It should not make a difference. I think it's just personal taste.
>>
>> If your filter condition is simple enough, I'd go with Flink's Table API
>> because it does not require you to define a FilterFunction or
>> FlatMapFunction.
>>
>>
>> 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>
>>> Hi Flinkers,
>>>
>>> I'd like to know whether it's better to perform a filter.project or a
>>> flatMap to filter tuples and do some projection after the filter.
>>> Functionally they are equivalent, but maybe I'm missing something.
>>>
>>> Thanks in advance,
>>> Flavio
>>>
>>
>>
>
>

--089e0112cdfee609de0515412119--