Return-Path: X-Original-To: apmail-flink-user-archive@minotaur.apache.org Delivered-To: apmail-flink-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 33FA1185ED for ; Mon, 4 May 2015 13:00:03 +0000 (UTC) Received: (qmail 66462 invoked by uid 500); 4 May 2015 13:00:03 -0000 Delivered-To: apmail-flink-user-archive@flink.apache.org Received: (qmail 66396 invoked by uid 500); 4 May 2015 13:00:03 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@flink.apache.org Delivered-To: mailing list user@flink.apache.org Received: (qmail 66385 invoked by uid 99); 4 May 2015 13:00:02 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 04 May 2015 13:00:02 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: message received from 54.164.171.186 which is an MX secondary for user@flink.apache.org) Received: from [54.164.171.186] (HELO mx1-us-east.apache.org) (54.164.171.186) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 04 May 2015 12:59:57 +0000 Received: from mail-la0-f44.google.com (mail-la0-f44.google.com [209.85.215.44]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id DBEF342ACD for ; Mon, 4 May 2015 12:59:36 +0000 (UTC) Received: by lagv1 with SMTP id v1so103302221lag.3 for ; Mon, 04 May 2015 05:59:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=+X8NyJT9I/bVI0NsK2+rmWi6ShrIJqpytaFjMbZ0uig=; b=QqyILYV4Q6xQDc4Y4gPTuCGfK4Bd1Lkq5+FaHgZTAvfdmQyX03ZPYqTMlz/iDTs/bZ jAhEG5Tc9gEgllVhk0hGLbud9A3I4WfP7eQ3ulc19IyepJbhcSpG1SkafoAqwaHT/P+l X6SJRosH/Gk6wUEsJduX8UXPZLBLCxi+q65ZffuVxvXTkH4D+rYlLNiJvvtTJkp/hra4 cz7EHxljCr3fCxoDYsjEVRjyLDU2U1I+2/M9tuZ+4OHaV7wfmwBgyU42hLuQRrWz0DBs HP4DkQPN4Ys46ADCjAYsKUF2nb3i3NWtmiSz1BcWJh7MKDLr3lhlrG4jh/v0xIpgxDXz Dmfw== MIME-Version: 1.0 X-Received: by 10.112.142.232 with SMTP id rz8mr18874185lbb.74.1430744369332; Mon, 04 May 2015 05:59:29 -0700 (PDT) Received: by 10.152.225.171 with HTTP; Mon, 4 May 2015 05:59:29 -0700 (PDT) In-Reply-To: References: Date: Mon, 4 May 2015 14:59:29 +0200 Message-ID: Subject: Re: filter().project() vs flatMap() From: Fabian Hueske To: user@flink.apache.org Content-Type: multipart/alternative; boundary=089e0112cdfee609de0515412119 X-Virus-Checked: Checked by ClamAV on apache.org --089e0112cdfee609de0515412119 Content-Type: text/plain; charset=UTF-8 That might help with cardinality estimation for cost-based optimization. For example when deciding about join strategies (broadcast vs. repartition, build-side of a hash join). However, as Stephan said, there are many cases where it does not make a difference, e.g. if the input cardinality of the filter (or the size of the other join input) is unknown. I think, chances are low that it makes a difference. 2015-05-04 14:53 GMT+02:00 Flavio Pompermaier : > Thanks Sebastian and Fabian for the feedback, just one last question: > what does change from the system point of view to know that the output > tuples is <= the number of input tuples? > Is there any optimization that Flink can apply to the pipeline? > > On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske wrote: > >> It should not make a difference. I think its just personal taste. >> >> If your filter condition is simple enough, I'd go with Flink's Table API >> because it does not require to define a Filter or FlatMapFunction. >> >> >> 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier : >> >>> Hi Flinkers, >>> >>> I'd like to know whether it's better to perform a filter.project or a >>> flatMap to filter tuples and do some projection after the filter. >>> Functionally they are equivalent but maybe I'm ignoring something.. >>> >>> Thanks in advance, >>> Flavio >>> >> >> > > --089e0112cdfee609de0515412119 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
That might help with cardinality estimation for cost-= based optimization. For example when deciding about join strategies (broadc= ast vs. repartition, build-side of a hash join).
However, as Stephan sai= d, there are many cases where it does not make a difference, e.g. if the in= put cardinality of the filter (or the size of the other join input) is unkn= own.

I think, chances are low that it makes a difference.
=


2015-05-04 14:53 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it= >:
Thanks = Sebastian and Fabian for the feedback, just one last question:
what doe= s change from the system point of view to know that the =C2=A0output tuples is <=3D the number of input tuples?
Is there any optimization that Flin= k can apply to the pipeline?

On Mon, May 4, 2015 at 2:49 PM, Fa= bian Hueske <fhueske@gmail.com> wrote:
It should = not make a difference. I think its just personal taste.

If your filt= er condition is simple enough, I'd go with Flink's Table API becaus= e it does not require to define a Filter or FlatMapFunction.

<= /div>

20= 15-05-04 14:43 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it= >:
Hi Flinkers= ,

I'd like to know whether it's better to perform a filter.proje= ct or a flatMap to filter tuples and do some projection after the filter. F= unctionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio




--089e0112cdfee609de0515412119--