spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] cloud-fan commented on a change in pull request #27466: [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints
Date Thu, 06 Feb 2020 08:44:29 GMT
cloud-fan commented on a change in pull request #27466: [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints
URL: https://github.com/apache/spark/pull/27466#discussion_r375702068
 
 

 ##########
 File path: docs/sql-pyspark-pandas-with-arrow.md
 ##########
 @@ -65,132 +65,204 @@ Spark will fall back to create the DataFrame without Arrow.
 
 ## Pandas UDFs (a.k.a. Vectorized UDFs)
 
-Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and
-Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator
-or to wrap the function, no additional configuration is required. Currently, there are two types of
-Pandas UDF: Scalar and Grouped Map.
+Pandas UDFs are user defined functions that are executed by Spark using
+Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas
+UDF is defined using the `pandas_udf` as a decorator or to wrap the function, and no additional
+configuration is required. A Pandas UDF behaves as a regular PySpark function API in general.
 
-### Scalar
+Before Spark 3.0, Pandas UDFs used to be defined with `PandasUDFType`. From Spark 3.0
+with Python 3.6+, you can also use [Python type hints](https://www.python.org/dev/peps/pep-0484).
+Using Python type hints are preferred and using `PandasUDFType` will be deprecated in
+the future release.
 
-Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such
-as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return
-a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting
-columns into batches and calling the function for each batch as a subset of the data, then
-concatenating the results together.
 
-The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns.
+The below combinations of the type hints are supported for Pandas UDFs. Note that the type hint should
+be `pandas.Series` in all cases but there is one variant case that `pandas.DataFrame` should be mapped
+as its input or output type hint instead when the input or output column is of `StructType`.
 
 Review comment:
   This is still somewhat confusing: what if I have 3 input columns and only one of them is a struct? Should the type hint be `Series, DataFrame, Series -> Series`?
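
For reference, the Spark 3.0 style that the quoted documentation describes (`pandas_udf` used as a decorator plus Python type hints) would look roughly like the sketch below. It reuses the product-of-two-columns task mentioned in the old docs text; the column names `a` and `b` and the `long` return type are illustrative assumptions, not taken from the PR.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Series -> Series Pandas UDF: the Spark return type goes to the decorator,
# while the pandas types come from the Python type hints (Spark 3.0+ style,
# no PandasUDFType needed).
@pandas_udf("long")
def multiply(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

# Illustrative usage on a DataFrame with two long columns named a and b:
# df.select(multiply(df.a, df.b))
```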

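The mixed case the reviewer raises would, under the rule quoted in the diff (a `StructType` column maps to `pandas.DataFrame`), presumably take the `Series, DataFrame, Series -> Series` shape sketched below. The schema, column names, and struct field access here are hypothetical illustrations of the question, not something stated in the PR.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Hypothetical schema: a (long), s (struct<x: long, y: long>), b (long).
# Only the struct column would be hinted as pandas.DataFrame; the other
# inputs and the output stay pandas.Series.
@pandas_udf("long")
def combine(a: pd.Series, s: pd.DataFrame, b: pd.Series) -> pd.Series:
    # In this sketch, s carries one pandas column per struct field.
    return a + s["x"] + b
```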
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

