Return-Path: X-Original-To: apmail-spark-user-archive@minotaur.apache.org Delivered-To: apmail-spark-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1FF3D1887A for ; Tue, 21 Jul 2015 01:20:34 +0000 (UTC) Received: (qmail 64674 invoked by uid 500); 21 Jul 2015 01:20:30 -0000 Delivered-To: apmail-spark-user-archive@spark.apache.org Received: (qmail 64576 invoked by uid 500); 21 Jul 2015 01:20:30 -0000 Mailing-List: contact user-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@spark.apache.org Received: (qmail 64566 invoked by uid 99); 21 Jul 2015 01:20:30 -0000 Received: from Unknown (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 21 Jul 2015 01:20:30 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 8E3B11A7359 for ; Tue, 21 Jul 2015 01:20:29 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 4.001 X-Spam-Level: **** X-Spam-Status: No, score=4.001 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=3, KAM_LAZY_DOMAIN_SECURITY=1, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-us-west.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id GemkQNsZBQxC for ; Tue, 21 Jul 2015 01:20:17 +0000 (UTC) Received: from mail-ig0-f173.google.com (mail-ig0-f173.google.com [209.85.213.173]) by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org) with ESMTPS id 1564D2C6C6 for ; Tue, 21 Jul 2015 01:20:17 +0000 (UTC) Received: by iggf3 with SMTP id f3so95197641igg.1 for ; Mon, 20 Jul 2015 18:19:31 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc:content-type; bh=FD61etXLc+OTc0zkfEQf3Joculv5xQF2OGf4wZ7srng=; b=bYHy0AkEyYfjX/upgA9D+16eQLBynmBIsMha64qd5Nn9wOLD9pa7Bi+GCeZZ3+cuYN nApCGZNuWBoiYC7a0hZI04r2tog7bqZTtUeTchMGKkKH84nrQNfvziZ3uO7NPlGbyF/j zkVse+yYeMrVU+jsuBaM5OK/uVqo7HvSJykw+PRqXSFnvhjvyCt33B+GlaxBX2qjp9VQ TrRQqTDnk8jj8RFq8sD9COUSeVTmqDYZg2aDCvH42S99rDHGFYUoI5Xzxe9kyXsotXiM m2G5jvyBE6WDq4X9/NWd79z3MPqR1RLN1eANyrHaeUjkTElxpUDv6nvT7HVA4NNonzH2 cDLg== X-Gm-Message-State: ALoCoQmw8XHQpMu1XhKDB8T5xP/qzZ9KoZCUeaULy2r9GoY4/UNQVj9rfyT11Vu/SzpWUp2p0uaP MIME-Version: 1.0 X-Received: by 10.107.9.65 with SMTP id j62mr13499972ioi.173.1437441571225; Mon, 20 Jul 2015 18:19:31 -0700 (PDT) Received: by 10.36.37.82 with HTTP; Mon, 20 Jul 2015 18:19:31 -0700 (PDT) X-Originating-IP: [216.13.177.90] In-Reply-To: <045D8FD556C73347A47F956EE65F82202ABADC01@S11MAILD013N1.sh11.lan> References: <045D8FD556C73347A47F956EE65F82202ABAD9C7@S11MAILD013N1.sh11.lan> <045D8FD556C73347A47F956EE65F82202ABADC01@S11MAILD013N1.sh11.lan> Date: Mon, 20 Jul 2015 18:19:31 -0700 Message-ID: Subject: Re: Data frames select and where clause dependency From: Mike Trienis To: Mohammed Guller Cc: Harish Butani , Michael Armbrust , "user@spark.apache.org" Content-Type: multipart/alternative; boundary=001a113e7d6e3d1920051b5872f5 --001a113e7d6e3d1920051b5872f5 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Definitely, thanks Mohammed. On Mon, Jul 20, 2015 at 5:47 PM, Mohammed Guller wrote: > Thanks, Harish. > > > > Mike =E2=80=93 this would be a cleaner version for your use case: > > df.filter(df("filter_field") =3D=3D=3D "value").select("field1").show() > > > > Mohammed > > > > *From:* Harish Butani [mailto:rhbutani.spark@gmail.com] > *Sent:* Monday, July 20, 2015 5:37 PM > *To:* Mohammed Guller > *Cc:* Michael Armbrust; Mike Trienis; user@spark.apache.org > > *Subject:* Re: Data frames select and where clause dependency > > > > Yes via: org.apache.spark.sql.catalyst.optimizer.ColumnPruning > > See DefaultOptimizer.batches for list of logical rewrites. > > > > You can see the optimized plan by printing: df.queryExecution.optimizedPl= an > > > > On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller > wrote: > > Michael, > > How would the Catalyst optimizer optimize this version? > > df.filter(df("filter_field") =3D=3D=3D "value").select("field1").show() > > Would it still read all the columns in df or would it read only > =E2=80=9Cfilter_field=E2=80=9D and =E2=80=9Cfield1=E2=80=9D since only tw= o columns are used (assuming other > columns from df are not used anywhere else)? > > > > Mohammed > > > > *From:* Michael Armbrust [mailto:michael@databricks.com] > *Sent:* Friday, July 17, 2015 1:39 PM > *To:* Mike Trienis > *Cc:* user@spark.apache.org > *Subject:* Re: Data frames select and where clause dependency > > > > Each operation on a dataframe is completely independent and doesn't know > what operations happened before it. When you do a selection, you are > removing other columns from the dataframe and so the filter has nothing t= o > operate on. > > > > On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis > wrote: > > I'd like to understand why the where field must exist in the select > clause. > > > > For example, the following select statement works fine > > - df.select("field1", "filter_field").filter(df("filter_field") =3D=3D= =3D > "value").show() > > However, the next one fails with the error "in operator !Filter > (filter_field#60 =3D value);" > > - df.select("field1").filter(df("filter_field") =3D=3D=3D "value").sho= w() > > As a work-around, it seems that I can do the following > > - df.select("field1", "filter_field").filter(df("filter_field") =3D=3D= =3D > "value").drop("filter_field").show() > > > > Thanks, Mike. > > > > > --001a113e7d6e3d1920051b5872f5 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Definitely, thanks Mohammed.=C2=A0

On Mon, Jul 20, 2015 at 5:47 PM, Mo= hammed Guller <mohammed@glassbeam.com> wrote:

Thanks, Harish.=

=C2=A0

Mike =E2=80=93 this would= be a cleaner version for your use case:

df.filter(df("filter_field") =3D=3D=3D &qu= ot;value").select("field1").show()

=C2=A0

Mohammed=

=C2=A0

From: Harish B= utani [mailto:rhbutani.spark@gmail.com]
Sent: Monday, July 20, 2015 5:37 PM
To: Mohammed Guller
Cc: Michael Armbrust; Mike Trienis; user@spark.apache.org


Subject: Re: Data frames select and where clause dependency

=C2=A0

Yes via: =C2=A0org.apache.spark.sql.catalyst.optimiz= er.ColumnPruning

See DefaultOptimizer.batches for list of logical rew= rites.

=C2=A0

You can see the optimized plan by printing: df.query= Execution.optimizedPlan

=C2=A0

On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller <= ;mohammed@glass= beam.com> wrote:

Michael,

How would the Catalyst optimizer optimize this versi= on?

df.filter(df("filter_field") =3D=3D=3D &qu= ot;value").select("field1").show()

Would it still read all t= he columns in df or would it read only =E2=80=9Cfilter_field=E2=80=9D and = =E2=80=9Cfield1=E2=80=9D since only two columns are used (assuming other columns from df are not used any= where else)?

=C2=A0

Mohammed=

=C2=A0

From: Michael = Armbrust [mailto:michael@databricks.com]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@= spark.apache.org
Subject: Re: Data frames select and where clause dependency

=C2=A0

Each operation on a dataframe is completely independ= ent and doesn't know what operations happened before it.=C2=A0 When you= do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on.

=C2=A0

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis <<= a href=3D"mailto:mike.trienis@orcsol.com" target=3D"_blank">mike.trienis@or= csol.com> wrote:

I'd like to understand why the where field must = exist in the select clause.=C2=A0

=C2=A0

For example, the following select statement works fi= ne

  • df.select("field1", "filter_field").filter(df("fil= ter_field") =3D=3D=3D "value").show()
  • However, the next one fails with the error "in = operator !Filter (filter_field#60 =3D value);"

    • df.select("field1").filter(df("filter_field") =3D=3D=3D= "value").show()

    As a work-around, it seems that I can do the followi= ng

    • df.select("field1", "filter_field").filter(df("fil= ter_field") =3D=3D=3D "value").drop("filter_field"= ).show()

    =C2=A0

    Thanks, Mike.=C2=A0

=C2=A0

=C2=A0


--001a113e7d6e3d1920051b5872f5--