spark-reviews mailing list archives

From cloud-fan <>
Subject [GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Date Sat, 27 Jan 2018 03:11:09 GMT
Github user cloud-fan commented on a diff in the pull request:
    --- Diff: docs/ ---
    @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
     You may run `./bin/spark-sql --help` for a complete list of all available options.
    +# PySpark Usage Guide for Pandas with Apache Arrow
    +## Apache Arrow in Spark
    +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer data between JVM and Python processes. This currently is most beneficial to Python users that work with Pandas/NumPy data. Its usage is not automatic and might require some minor changes to configuration or code to take full advantage and ensure compatibility. This guide will give a high-level description of how to use Arrow in Spark and highlight any differences when working with Arrow-enabled data.
    +### Ensure PyArrow Installed
    +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow is installed and available on all cluster nodes. The currently supported version is 0.8.0. You can install it using pip or conda from the conda-forge channel. See the PyArrow [installation]( for details.
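    +A quick way to confirm that the installation is visible to Python on a given node is to import PyArrow and check its version; this is a minimal sketch, not part of the guide's example file:
    +```python
    +import pyarrow
    +
    +# The guide above states 0.8.0 is the currently supported version.
    +print(pyarrow.__version__)
    +```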
    +## Enabling for Conversion to/from Pandas
    +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call `toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`. To use Arrow when executing these calls, users need to first set the Spark configuration 'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% include_example dataframe_with_arrow python/sql/ %}
    +</div>
    +</div>
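    +As a rough illustration of what the included example demonstrates, the round trip might look like the following minimal sketch (the variable names and the random data are illustrative only):
    +```python
    +import numpy as np
    +import pandas as pd
    +from pyspark.sql import SparkSession
    +
    +spark = SparkSession.builder.getOrCreate()
    +
    +# Enable Arrow-based columnar data transfers (disabled by default)
    +spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    +
    +# Generate a Pandas DataFrame of random values
    +pdf = pd.DataFrame(np.random.rand(100, 3))
    +
    +# Create a Spark DataFrame from a Pandas DataFrame using Arrow
    +df = spark.createDataFrame(pdf)
    +
    +# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
    +result_pdf = df.select("*").toPandas()
    +```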
    +Using the above optimizations with Arrow will produce the same results as when Arrow is not enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the DataFrame to the driver program, and should be done on a small subset of the data. Not all Spark data types are currently supported, and an error can be raised if a column has an unsupported type; see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`, Spark will fall back to creating the DataFrame without Arrow.
    +## Pandas UDFs (a.k.a. Vectorized UDFs)
    +Pandas UDFs are user-defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data. A Pandas UDF is defined using the keyword `pandas_udf` as a decorator or to wrap the function; no additional configuration is required. Currently, there are two types of Pandas UDF: Scalar and Group Map.
    +### Scalar
    +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting columns into batches, calling the function for each batch as a subset of the data, and then concatenating the results together.
    +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns.
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% include_example scalar_pandas_udf python/sql/ %}
    +</div>
    +</div>
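    +A minimal sketch of such a scalar Pandas UDF, assuming a local `SparkSession` (the function and column names are illustrative, not taken from the example file):
    +```python
    +import pandas as pd
    +from pyspark.sql import SparkSession
    +from pyspark.sql.functions import col, pandas_udf
    +from pyspark.sql.types import LongType
    +
    +spark = SparkSession.builder.getOrCreate()
    +
    +# Declare the function and create a scalar Pandas UDF from it
    +def multiply_func(a, b):
    +    return a * b
    +
    +multiply = pandas_udf(multiply_func, returnType=LongType())
    +
    +# The wrapped function also works on local Pandas data
    +x = pd.Series([1, 2, 3])
    +print(multiply_func(x, x))  # 1, 4, 9
    +
    +# Execute the function as a Spark vectorized UDF
    +df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
    +df.select(multiply(col("x"), col("x"))).show()
    +```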
    +### Group Map
    +Group map Pandas UDFs are used with `groupBy().apply()`, which implements the "split-apply-combine" pattern. Split-apply-combine consists of three steps:
    +* Split the data into groups by using `DataFrame.groupBy`.
    +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The input data contains all the rows and columns for each group.
    +* Combine the results into a new `DataFrame`.
    +To use `groupBy().apply()`, the user needs to define the following:
    +* A Python function that defines the computation for each group.
    +* A `StructType` object or a string that defines the schema of the output `DataFrame`.
    +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group.
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% include_example group_map_pandas_udf python/sql/ %}
    +</div>
    +</div>
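    +A minimal sketch of such a group map UDF; note that released Spark versions spell the UDF type `PandasUDFType.GROUPED_MAP`, which is used below (the data and function names are illustrative):
    +```python
    +from pyspark.sql import SparkSession
    +from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +spark = SparkSession.builder.getOrCreate()
    +
    +df = spark.createDataFrame(
    +    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +    ("id", "v"))
    +
    +# The schema string describes the pandas.DataFrame returned for each group
    +@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    +def subtract_mean(pdf):
    +    # pdf is a pandas.DataFrame holding all rows and columns of one group
    +    return pdf.assign(v=pdf.v - pdf.v.mean())
    +
    +df.groupby("id").apply(subtract_mean).show()
    +```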
    +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf).
    +## Usage Notes
    +### Supported SQL-Arrow Types
    +Currently, all Spark SQL data types are supported except `MapType`, `ArrayType` of `TimestampType`, and nested `StructType`.
    +### Setting Arrow Batch Size
    +Data partitions in Spark are converted into Arrow record batches, which can temporarily lead to high memory usage in the JVM. To avoid possible out-of-memory exceptions, the size of the Arrow record batches can be adjusted by setting the conf "spark.sql.execution.arrow.maxRecordsPerBatch" to an integer that will determine the maximum number of rows for each batch.
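    +As a minimal sketch, lowering the batch size before a conversion might look like this (the default is 10000 records per batch; 5000 below is just an illustrative choice):
    +```python
    +from pyspark.sql import SparkSession
    +
    +spark = SparkSession.builder.getOrCreate()
    +spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    +
    +# Default is 10000; a smaller batch size reduces peak JVM memory usage
    +spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
    +
    +result_pdf = spark.range(100000).toPandas()  # sent in batches of <= 5000 rows
    +```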
    --- End diff --
    We went with `maxRecordsPerBatch` because it's easy to implement; otherwise we may need some way to estimate/calculate the memory consumption of the Arrow data. @BryanCutler is it easy to do?

