spark-commits mailing list archives

Subject spark git commit: [SPARK-24444][DOCS][PYTHON] Improve Pandas UDF docs to explain column assignment
Date Fri, 01 Jun 2018 03:59:05 GMT
Repository: spark
Updated Branches:
  refs/heads/master cbaa72913 -> b2d022656

[SPARK-24444][DOCS][PYTHON] Improve Pandas UDF docs to explain column assignment

## What changes were proposed in this pull request?

Added sections to pandas_udf docs, in the grouped map section, to indicate columns are assigned
by position.
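Since the grouped map output schema is applied by position, a small pandas sketch (hypothetical `ids`/`data` values, with names mirroring the example in the updated docs) shows the two constructions the docs recommend for pinning column order:

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical sample values; only the column names mirror the docs' example.
ids = [1, 2]
data = [1.0, 2.0]

# Recommended: fix the column order explicitly with the `columns` keyword.
df1 = pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])

# Alternative: an OrderedDict fixes the order by construction.
df2 = pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))

print(list(df1.columns))  # ['id', 'a']
print(list(df2.columns))  # ['id', 'a']
```

Because Spark matches these columns to the `StructType` fields positionally rather than by name, a `pandas.DataFrame` whose columns come out in a different order than the declared schema would silently misassign data.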

## How was this patch tested?


Author: Bryan Cutler <>

Closes #21471 from BryanCutler/arrow-doc-pandas_udf-column_by_pos-SPARK-21427.


Branch: refs/heads/master
Commit: b2d022656298c7a39ff3e84b04f813d5f315cb95
Parents: cbaa729
Author: Bryan Cutler <>
Authored: Fri Jun 1 11:58:59 2018 +0800
Committer: hyukjinkwon <>
Committed: Fri Jun 1 11:58:59 2018 +0800

 docs/   | 9 +++++++++
 python/pyspark/sql/ | 9 ++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/docs/ b/docs/
index 5060086..4d8a738 100644
--- a/docs/
+++ b/docs/
@@ -1752,6 +1752,15 @@ To use `groupBy().apply()`, the user needs to define the following:
 * A Python function that defines the computation for each group.
 * A `StructType` object or a string that defines the schema of the output `DataFrame`.
 
+The output schema will be applied to the columns of the returned `pandas.DataFrame` in order by position,
+not by name. This means that the columns in the `pandas.DataFrame` must be indexed so that their
+position matches the corresponding field in the schema.
+
+Note that when creating a new `pandas.DataFrame` using a dictionary, the actual position of the column
+can differ from the order that it was placed in the dictionary. It is recommended in this case to
+explicitly define the column order using the `columns` keyword, e.g.
+`pandas.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])`, or alternatively use an `OrderedDict`.
+
 Note that all data for a group will be loaded into memory before the function is applied. This can
 lead to out of memory exceptions, especially if the group sizes are skewed. The configuration
 [maxRecordsPerBatch](#setting-arrow-batch-size) is not applied on groups and it is up to the user
diff --git a/python/pyspark/sql/ b/python/pyspark/sql/
index efcce25..fd656c5 100644
--- a/python/pyspark/sql/
+++ b/python/pyspark/sql/
@@ -2500,7 +2500,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
        A grouped map UDF defines transformation: A `pandas.DataFrame` -> A `pandas.DataFrame`
        The returnType should be a :class:`StructType` describing the schema of the returned
        `pandas.DataFrame`.
-       The length of the returned `pandas.DataFrame` can be arbitrary.
+       The length of the returned `pandas.DataFrame` can be arbitrary and the columns must be
+       indexed so that their position matches the corresponding field in the schema.

        Grouped map UDFs are used with :meth:`pyspark.sql.GroupedData.apply`.
@@ -2548,6 +2549,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
        |  2|6.0|

+       .. note:: If returning a new `pandas.DataFrame` constructed with a dictionary, it is
+           recommended to explicitly index the columns by name to ensure the positions are correct,
+           or alternatively use an `OrderedDict`.
+           For example, `pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])` or
+           `pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))`.
+
        .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
