Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id D38C4200C5C for ; Thu, 20 Apr 2017 17:33:08 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id D21D1160B9F; Thu, 20 Apr 2017 15:33:08 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 249CC160B91 for ; Thu, 20 Apr 2017 17:33:07 +0200 (CEST) Received: (qmail 67437 invoked by uid 500); 20 Apr 2017 15:33:07 -0000 Mailing-List: contact issues-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@spark.apache.org Received: (qmail 67428 invoked by uid 99); 20 Apr 2017 15:33:07 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 20 Apr 2017 15:33:07 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id CC1A01AFA2A for ; Thu, 20 Apr 2017 15:33:06 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id oz1W44kSMsLp for ; Thu, 20 Apr 2017 15:33:05 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id 52BF75FBBB for ; Thu, 20 Apr 2017 15:33:05 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id C38B4E05A2 for ; Thu, 20 Apr 2017 15:33:04 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 816BA21B53 for ; Thu, 20 Apr 2017 15:33:04 +0000 (UTC) Date: Thu, 20 Apr 2017 15:33:04 +0000 (UTC) From: "Apache Spark (JIRA)" To: issues@spark.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Assigned] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Thu, 20 Apr 2017 15:33:09 -0000 [ https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20413: ------------------------------------ Assignee: Apache Spark > New Optimizer Hint to prevent collapsing of adjacent projections > ---------------------------------------------------------------- > > Key: SPARK-20413 > URL: https://issues.apache.org/jira/browse/SPARK-20413 > Project: Spark > Issue Type: Improvement > Components: Optimizer, PySpark, SQL > Affects Versions: 2.1.0 > Reporter: Michael Styles > Assignee: Apache Spark > > I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This hint is essentially identical to Oracle's NO_MERGE hint. > Let me first give an example of why I am proposing this. > df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) > df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) > df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) > df3.explain(True) > == Parsed Logical Plan == > 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] > +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] > +- LogicalRDD [id#80L, user_agent#81] > == Analyzed Logical Plan == > c1: string, c2: string > Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] > +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] > +- LogicalRDD [id#80L, user_agent#81] > == Optimized Logical Plan == > Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] > +- LogicalRDD [id#80L, user_agent#81] > == Physical Plan == > *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] > +- Scan ExistingRDD[id#80L,user_agent#81] > user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. > I'm proposing a hint that prevent the optimizer from collapsing adjacent projections. A new function called 'no_collapse' would be introduced for this purpose. Consider the following example and generated query plan. > df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) > df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_agent"]))) > df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) > df3.explain(True) > == Parsed Logical Plan == > 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] > +- NoCollapseHint > +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] > +- LogicalRDD [id#64L, user_agent#65] > == Analyzed Logical Plan == > c1: string, c2: string > Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] > +- NoCollapseHint > +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] > +- LogicalRDD [id#64L, user_agent#65] > == Optimized Logical Plan == > Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] > +- NoCollapseHint > +- Project [UDF(user_agent#65) AS ua#69] > +- LogicalRDD [id#64L, user_agent#65] > == Physical Plan == > *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] > +- *Project [UDF(user_agent#65) AS ua#69] > +- Scan ExistingRDD[id#64L,user_agent#65] > As can be seen from the query plan, the user-defined function is now evaluated once per row. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org For additional commands, e-mail: issues-help@spark.apache.org