From: viirya
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract...
Date: Wed, 24 Jan 2018 07:11:38 +0000 (UTC)

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/20379

    [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-parameter UDFs from aggregate

## What changes were proposed in this pull request?

The `ExtractPythonUDFFromAggregate` rule extracts Python UDFs in a logical Aggregate that depend on an aggregate expression or a grouping key. However, Python UDFs that depend on neither of those should also be extracted, to avoid the issue reported in the JIRA.
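For context, the check the rule uses to decide which expressions to pull out of the Aggregate looks roughly like the sketch below. This is a paraphrase of the rule's internals in `org.apache.spark.sql.execution.python`, not the verbatim diff of this PR, and the method bodies are reproduced from memory, so treat it as illustrative. The `e.references.isEmpty` disjunct is the behavior this change adds: it catches UDFs that reference no input attributes, i.e. zero-parameter UDFs.

```scala
// Paraphrased sketch of the extraction check inside the
// ExtractPythonUDFFromAggregate rule (sql/core); not the exact patch.
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.Aggregate
import org.apache.spark.sql.execution.python.PythonUDF

// An expression "belongs" to the Aggregate if it is an aggregate
// expression or matches one of the grouping keys.
private def belongAggregate(e: Expression, agg: Aggregate): Boolean = {
  e.isInstanceOf[AggregateExpression] ||
    agg.groupingExpressions.exists(_.semanticEquals(e))
}

private def hasPythonUdfOverAggregate(expr: Expression, agg: Aggregate): Boolean = {
  expr.find {
    // `e.references.isEmpty` is the new condition: a zero-parameter UDF
    // references no attributes, so before the fix it failed both checks,
    // stayed inside the Aggregate, and its output attribute could not be
    // bound later by HashAggregateExec.
    e => e.isInstanceOf[PythonUDF] &&
      (e.references.isEmpty || e.find(belongAggregate(_, agg)).isDefined)
  }.isDefined
}
```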
A small code snippet to reproduce the issue looks like:

```python
import pyspark.sql.functions as f

df = spark.createDataFrame([(1, 2), (3, 4)])
f_udf = f.udf(lambda: str("const_str"))
df2 = df.distinct().withColumn("a", f_udf())
df2.show()
```

The following exception is raised:

```
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#50
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
```

This exception is raised because `HashAggregateExec` tries to bind the aliased Python UDF expression (e.g., `pythonUDF0#50 AS a#44`) to the grouping key, where the UDF's output attribute is not available.

## How was this patch tested?

Added a test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23177-backport-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20379.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20379

----

commit a66b5e0c4b81444974f02c7154111b47a1a5137c
Author: Liang-Chi Hsieh
Date:   2018-01-24T06:50:53Z

    Extract parameter-less UDFs from aggregate.

----