spark-reviews mailing list archives

From BryanCutler <...@git.apache.org>
Subject [GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Date Mon, 02 Oct 2017 20:37:29 GMT
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18732#discussion_r142248106
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
     @since(2.3)
     def pandas_udf(f=None, returnType=StringType()):
         """
    -    Creates a :class:`Column` expression representing a user defined function (UDF) that accepts
    -    `Pandas.Series` as input arguments and outputs a `Pandas.Series` of the same length.
    +    Creates a :class:`Column` expression representing a vectorized user defined function (UDF).
    +
    +    The user-defined function can define one of the following transformations:
    +    1. One or more `pandas.Series` -> A `pandas.Series`
    +
    +       This udf is used with `DataFrame.withColumn` and `DataFrame.select`.
    +       The returnType should be a primitive data type, e.g., DoubleType()
    +
    +       Example:
    +
    +       >>> from pyspark.sql.types import IntegerType, StringType
    +       >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
    +       >>> @pandas_udf(returnType=StringType())
    +       ... def to_upper(s):
    +       ...     return s.str.upper()
    +       ...
    +       >>> @pandas_udf(returnType="integer")
    +       ... def add_one(x):
    +       ...     return x + 1
    +       ...
    +       >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
    +       >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
    +       ...     .show()  # doctest: +SKIP
    +       +----------+--------------+------------+
    +       |slen(name)|to_upper(name)|add_one(age)|
    +       +----------+--------------+------------+
    +       |         8|      JOHN DOE|          22|
    +       +----------+--------------+------------+
    +
    +    2. A `pandas.DataFrame` -> A `pandas.DataFrame`
    +
    +       This udf is used with `GroupedData.apply`
    +       The returnType should be a StructType describing the schema of the returned
    +       `pandas.DataFrame`.
    +
    +       Example:
    +
    +       >>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 4.0)], ("id", "v"))
    +       >>> @pandas_udf(returnType=df.schema)
    +       ... def normalize(df):
    +       ...     v = df.v
    +       ...     ret = df.assign(v=(v - v.mean()) / v.std())
    +       ...     return ret
    +       >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
    +       +---+-------------------+
    +       | id|                  v|
    +       +---+-------------------+
    +       |  1|-0.7071067811865475|
    +       |  1| 0.7071067811865475|
    +       |  2|-0.7071067811865475|
    +       |  2| 0.7071067811865475|
    +       +---+-------------------+
    +
    +
    +    .. note:: The user-defined functions must be deterministic.
     
         :param f: python function if used as a standalone function
         :param returnType: a :class:`pyspark.sql.types.DataType` object
     
    -    >>> from pyspark.sql.types import IntegerType, StringType
    -    >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
    -    >>> @pandas_udf(returnType=StringType())
    -    ... def to_upper(s):
    -    ...     return s.str.upper()
    -    ...
    -    >>> @pandas_udf(returnType="integer")
    -    ... def add_one(x):
    -    ...     return x + 1
    -    ...
    -    >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
    -    >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\
    -    ...     .show()  # doctest: +SKIP
    -    +----------+--------------+------------+
    -    |slen(name)|to_upper(name)|add_one(age)|
    -    +----------+--------------+------------+
    -    |         8|      JOHN DOE|          22|
    -    +----------+--------------+------------+
         """
    +    import pandas as pd
    +    if isinstance(returnType, pd.Series):
    --- End diff ---
    
    I'm still not crazy about this option.  First, you are expecting the type to be a `pandas.Series` of dtypes, so I think you would need to check the data type of the Series here too, just to make sure the user doesn't pass in some other kind of Series.
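    A minimal sketch of the kind of check I mean (the helper name is made up, and it only recognizes plain numpy dtypes, not pandas extension dtypes):
    ```
    import numpy as np
    import pandas as pd

    def _is_dtype_series(obj):
        # True only for a Series whose values are all numpy dtypes,
        # e.g. the result of `pd.DataFrame.dtypes` - not a Series of data.
        return (isinstance(obj, pd.Series)
                and len(obj) > 0
                and all(isinstance(v, np.dtype) for v in obj))

    pdf = pd.DataFrame({"id": [1], "v": [1.0]})
    _is_dtype_series(pdf.dtypes)  # True
    _is_dtype_series(pdf["v"])    # False
    ```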
    Second, I tend to think of a UDF declaration as being static, especially with a decorator, so it seems a little weird to do any work in the declaration.  For instance:
    ```
    sample_df = df.filter(df.id == 1).toPandas()
    
    @pandas_udf(returnType=foo(sample_df))
    def foo(df):
        ret = ...  # some transformation on the input pd.DataFrame
        return ret
    ```
    
    Is this allowed?  When exactly is `foo(sample_df)` called here?
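    To make the timing concrete: decorator arguments are evaluated eagerly, when the `def` statement is executed, so `foo(sample_df)` above would run before the name `foo` is even bound - as written it would raise a `NameError` unless some earlier `foo` happens to be in scope.  A tiny standalone demo of the evaluation order (names made up, nothing Spark-specific):
    ```
    def takes_schema(schema):
        # stand-in for a decorator factory like pandas_udf
        print("decorator argument evaluated:", schema)
        def wrap(f):
            return f
        return wrap

    def make_schema():
        print("make_schema called at definition time")
        return "double"

    @takes_schema(make_schema())  # make_schema() runs right here
    def bar(x):
        return x + 1

    # Printed as soon as this module runs, before bar is ever called:
    #   make_schema called at definition time
    #   decorator argument evaluated: double
    ```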

