spark-reviews mailing list archives

From HyukjinKwon <...@git.apache.org>
Subject [GitHub] spark pull request #18615: [SPARK-21394][PYTHON] Reviving callable object su...
Date Wed, 12 Jul 2017 21:53:22 GMT
GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/18615

    [SPARK-21394][PYTHON] Reviving callable object support in UDF in PySpark

    ## What changes were proposed in this pull request?
    
    This PR proposes to exclude `__name__` from the tuple of attributes that `functools.wraps`
    copies directly from the wrapped function to the wrapper function, and to set the wrapper's
    name from `self._name` instead (`func.__name__` or `obj.__class__.__name__`).
    
    SPARK-19161 unintentionally broke support for callable objects as UDFs in Python, as shown below:
    
    ```python
    from pyspark.sql import functions
    
    
    class F(object):
        def __call__(self, x):
            return x
    
    foo = F()
    udf = functions.udf(foo)
    ```
    
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/sql/functions.py", line 2142, in udf
        return _udf(f=f, returnType=returnType)
      File ".../spark/python/pyspark/sql/functions.py", line 2133, in _udf
        return udf_obj._wrapped()
      File ".../spark/python/pyspark/sql/functions.py", line 2090, in _wrapped
        @functools.wraps(self.func)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py", line 33, in update_wrapper
        setattr(wrapper, attr, getattr(wrapped, attr))
    AttributeError: F instance has no attribute '__name__'
    ```
    
    This worked in Spark 2.1:
    
    ```python
    from pyspark.sql import functions
    
    
    class F(object):
        def __call__(self, x):
            return x
    
    foo = F()
    udf = functions.udf(foo)
    spark.range(1).select(udf("id")).show()
    ```
    
    ```
    +-----+
    |F(id)|
    +-----+
    |    0|
    +-----+
    ```
    
    **After**
    
    ```python
    from pyspark.sql import functions
    
    
    class F(object):
        def __call__(self, x):
            return x
    
    foo = F()
    udf = functions.udf(foo)
    spark.range(1).select(udf("id")).show()
    ```
    
    ```
    +-----+
    |F(id)|
    +-----+
    |    0|
    +-----+
    ```
    
    ## How was this patch tested?
    
    Unit tests in `python/pyspark/sql/tests.py`.
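
The approach in the description above can be sketched outside Spark with plain `functools`. This is a minimal illustration under stated assumptions, not the patch itself: `display_name` and `wrap_callable` are hypothetical names, and the fallback to the class name follows the `func.__name__` or `obj.__class__.__name__` rule given in the description.

```python
import functools


def display_name(func):
    # Hypothetical helper: plain functions carry __name__, while callable
    # instances usually do not, so fall back to the class name.
    if hasattr(func, "__name__"):
        return func.__name__
    return func.__class__.__name__


def wrap_callable(func):
    # Copy the default wrapper attributes except __name__, so that a callable
    # object without __name__ is never asked for it.
    assigned = tuple(a for a in functools.WRAPPER_ASSIGNMENTS if a != "__name__")

    @functools.wraps(func, assigned=assigned)
    def wrapper(*args):
        return func(*args)

    # Assign the name manually after wrapping.
    wrapper.__name__ = display_name(func)
    return wrapper


class F(object):
    def __call__(self, x):
        return x


def plain(x):
    return x


wrapped_obj = wrap_callable(F())   # callable instance, no __name__ of its own
wrapped_fn = wrap_callable(plain)  # ordinary function
```

Here `wrapped_obj.__name__` is taken from the class (`F`) and `wrapped_fn.__name__` from the function (`plain`). The `update_wrapper` in Python 2.7, the version in the traceback above, calls `getattr(wrapped, attr)` for every name in `assigned`, which is why leaving `__name__` out of that tuple avoids the `AttributeError`.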

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark callable-object

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18615.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18615
    
----
commit 6d3ef484ff026b10973df2bd3163849540891af9
Author: hyukjinkwon <gurwls223@gmail.com>
Date:   2017-07-12T21:38:37Z

    Reviving callable objects support in UDF in PySpark

----


