Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 498EF200C5C for ; Thu, 20 Apr 2017 15:31:23 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 48187160B9F; Thu, 20 Apr 2017 13:31:23 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 690BB160B91 for ; Thu, 20 Apr 2017 15:31:22 +0200 (CEST) Received: (qmail 34498 invoked by uid 500); 20 Apr 2017 13:31:21 -0000 Mailing-List: contact dev-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list dev@spark.apache.org Received: (qmail 34486 invoked by uid 99); 20 Apr 2017 13:31:20 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 20 Apr 2017 13:31:20 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 51CC5180A5A for ; Thu, 20 Apr 2017 13:31:20 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.379 X-Spam-Level: ** X-Spam-Status: No, score=2.379 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (1024-bit key) header.d=shopify.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id K5JVaYsUyt3w for ; Thu, 20 Apr 2017 13:31:18 +0000 (UTC) Received: from mail-qt0-f173.google.com (mail-qt0-f173.google.com [209.85.216.173]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 599715F1D5 for ; Thu, 20 Apr 2017 13:31:17 +0000 (UTC) Received: by mail-qt0-f173.google.com with SMTP id g60so44847890qtd.3 for ; Thu, 20 Apr 2017 06:31:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=shopify.com; s=google; h=mime-version:from:date:message-id:subject:to; bh=4gSNEWAVFKaWqFLfHdYr+EDlkhrrShIAA1yOW21zF0g=; b=Wg3JSsiuUPBgLQpQpNXq1y9tEAmCepj4v+MLFJIEr0X+fuPHetglyunuYFHLpKOYi6 k/+MtmSyejbteyK7jpCKNrIWO/LpUy4ePxQSwcYpy2EEyNKTQZ7clIUhwPJVCIra/5T4 IRpN2ZTepPGUv2TyEh4UM933ZjqDMIU6hLHPM= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=4gSNEWAVFKaWqFLfHdYr+EDlkhrrShIAA1yOW21zF0g=; b=Z9NnLZuxZh8TcpfmvHyCl7CPdOfdHtd1JGYf1O6u274y3pPPVoER7YwZfqCxkbPjqF 69ahO26QsqWm/Jz+TEXi6O/RekNd53TPguLy2heb1srF9K/gJAOnOjM9unM6nArcbaW/ hUZeo0zyvQqb3VUuprDKKcWqPpcHjVMxDwtpPlvfS764ar8yR88DTQzpsk9S5+aTT2Yi caxCbdZiMWqwtaXxUR2z2BUsRMcLMr8pQkT8+X46IkKzLX+vjV0lpuUTZi0Tb14iJbq+ Zi8L0mZs+2hlqbSJ6EZGFVPk2hA9JbZpCwLoQDfIpGcI7A2qxYUUt6aLauX0f+1ZA8zc F90g== X-Gm-Message-State: AN3rC/5bVecPBpqT6HO7C3bZvmOqOGsb2yef3HJu/UzfT905QXlKYkS+ dMqQuPoohojn48fhKYy2N32YwtRqU5UosIo= X-Received: by 10.237.42.22 with SMTP id c22mr8052115qtd.120.1492695076177; Thu, 20 Apr 2017 06:31:16 -0700 (PDT) MIME-Version: 1.0 From: Michael Styles Date: Thu, 20 Apr 2017 13:31:05 +0000 Message-ID: Subject: New Optimizer Hint To: dev@spark.apache.org Content-Type: multipart/alternative; boundary=001a11455eacc6253b054d9927de archived-at: Thu, 20 Apr 2017 13:31:23 -0000 --001a11455eacc6253b054d9927de Content-Type: text/plain; charset=UTF-8 Hello, I am in the process of putting together a PR that introduces a new hint called NO_COLLAPSE. This hint is essentially identical to Oracle's NO_MERGE hint. Let me first give an example of why I am proposing this. df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Analyzed Logical Plan == c1: string, c2: string Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Optimized Logical Plan == Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- LogicalRDD [id#80L, user_agent#81] == Physical Plan == *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- Scan ExistingRDD[id#80L,user_agent#81] user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. I'm proposing a hint that prevent the optimizer from collapsing adjacent projections. A new function called 'no_collapse' would be introduced for this purpose. Consider the following example and generated query plan. df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_agent"]))) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Analyzed Logical Plan == c1: string, c2: string Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Optimized Logical Plan == Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Physical Plan == *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- *Project [UDF(user_agent#65) AS ua#69] +- Scan ExistingRDD[id#64L,user_agent#65] As can be seen from the query plan, the user-defined function is now evaluated once per row. I would like to get some feedback on this proposal. Thanks. --001a11455eacc6253b054d9927de Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hello,=C2=A0

I am in the pro= cess of putting together a PR that introduces a new hint called NO_COLLAPSE= . This hint is essentially identical to Oracle's NO_MERGE hint.=C2=A0

Let me first give an example of why I am proposing = this.=C2=A0

df1 =3D sc.sql.createDataFrame([(1, &q= uot;abc")], ["id", "user_agent"])=C2=A0
= df2 =3D df1.withColumn("ua", user_agent_details(df1["user_ag= ent"]))=C2=A0
df3 =3D df2.select(df2["ua"].device_= form_factor.alias("c1"), df2["ua"].browser_version.alia= s("c2"))=C2=A0
df3.explain(True)=C2=A0

=3D=3D Parsed Logical Plan =3D=3D=C2=A0
'Project [ua= #85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91]=C2=A0
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]=C2= =A0
=C2=A0 =C2=A0+- LogicalRDD [id#80L, user_agent#81]=C2=A0

=3D=3D Analyzed Logical Plan =3D=3D=C2=A0
c1= : string, c2: string=C2=A0
Project [ua#85.device_form_factor AS c= 1#90, ua#85.browser_version AS c2#91]=C2=A0
+- Project [id#80L, u= ser_agent#81, UDF(user_agent#81) AS ua#85]=C2=A0
=C2=A0 =C2=A0+- = LogicalRDD [id#80L, user_agent#81]=C2=A0

=3D=3D Op= timized Logical Plan =3D=3D=C2=A0
Project [UDF(user_agent#81).dev= ice_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91]=C2= =A0
+- LogicalRDD [id#80L, user_agent#81]=C2=A0

=3D=3D Physical Plan =3D=3D=C2=A0
*Project [UDF(user_agen= t#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2= #91]=C2=A0
+- Scan ExistingRDD[id#80L,user_agent#81]=C2=A0
<= div>
user_agent_details is a user-defined function that retur= ns a struct. As can be seen from the generated query plan, the function is = being executed multiple times which could lead to performance issues. This = is due to the CollapseProject optimizer rule that collapses adjacent projec= tions.=C2=A0

I'm proposing a hint that prevent= the optimizer from collapsing adjacent projections. A new function called = 'no_collapse' would be introduced for this purpose. Consider the fo= llowing example and generated query plan.=C2=A0

df= 1 =3D sc.sql.createDataFrame([(1, "abc")], ["id", "= ;user_agent"])=C2=A0
df2 =3D F.no_collapse(df1.withColumn(&q= uot;ua", user_agent_details(df1["user_agent"])))=C2=A0
=
df3 =3D df2.select(df2["ua"].device_form_factor.alias("= c1"), df2["ua"].browser_version.alias("c2"))=C2=A0=
df3.explain(True)=C2=A0

=3D=3D Parsed L= ogical Plan =3D=3D=C2=A0
'Project [ua#69[device_form_factor] = AS c1#75, ua#69[browser_version] AS c2#76]=C2=A0
+- NoCollapseHin= t=C2=A0
=C2=A0 =C2=A0+- Project [id#64L, user_agent#65, UDF(user_= agent#65) AS ua#69]=C2=A0
=C2=A0 =C2=A0 =C2=A0 +- LogicalRDD [id#= 64L, user_agent#65]=C2=A0

=3D=3D Analyzed Logical = Plan =3D=3D=C2=A0
c1: string, c2: string=C2=A0
Project = [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76]=C2=A0
+- NoCollapseHint=C2=A0
=C2=A0 =C2=A0+- Project [id#64L,= user_agent#65, UDF(user_agent#65) AS ua#69]=C2=A0
=C2=A0 =C2=A0 = =C2=A0 +- LogicalRDD [id#64L, user_agent#65]=C2=A0

=3D=3D Optimized Logical Plan =3D=3D=C2=A0
Project [ua#69.device= _form_factor AS c1#75, ua#69.browser_version AS c2#76]=C2=A0
+- N= oCollapseHint=C2=A0
=C2=A0 =C2=A0+- Project [UDF(user_agent#65) A= S ua#69]=C2=A0
=C2=A0 =C2=A0 =C2=A0 +- LogicalRDD [id#64L, user_a= gent#65]=C2=A0

=3D=3D Physical Plan =3D=3D=C2=A0
*Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version= AS c2#76]=C2=A0
+- *Project [UDF(user_agent#65) AS ua#69]=C2=A0<= /div>
=C2=A0 =C2=A0+- Scan ExistingRDD[id#64L,user_agent#65]=C2=A0

As can be seen from the query plan, the user-defined f= unction is now evaluated once per row.=C2=A0

I wou= ld like to get some feedback on this proposal.=C2=A0

Thanks.

--001a11455eacc6253b054d9927de--