spark-reviews mailing list archives

From BryanCutler <>
Subject [GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...
Date Fri, 26 Jan 2018 17:44:11 GMT
Github user BryanCutler commented on a diff in the pull request:
    --- Diff: docs/ ---
    @@ -1640,6 +1640,129 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
     You may run `./bin/spark-sql --help` for a complete list of all available
    +# PySpark Usage Guide for Pandas with Arrow
    +## Arrow in Spark
    +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
    +data between JVM and Python processes. This currently is most beneficial to Python users that
    +work with Pandas/NumPy data. Its usage is not automatic and might require some minor
    +changes to configuration or code to take full advantage and ensure compatibility. This guide will
    +give a high-level description of how to use Arrow in Spark and highlight any differences when
    +working with Arrow-enabled data.
    +### Ensure PyArrow Installed
    +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
    +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
    +is installed and available on all cluster nodes. The current supported version is 0.8.0.
    +You can install using pip or conda from the conda-forge channel. See PyArrow
    +[installation]( for details.
    +## Enabling for Conversion to/from Pandas
    +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
    +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
    +To use Arrow when executing these calls, users need to first set the Spark configuration
    +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% include_example dataframe_with_arrow python/sql/ %}
    +Using the above optimizations with Arrow will produce the same results as when Arrow is not
    +enabled. Not all Spark data types are currently supported and an error will be raised if a column
    --- End diff --
    Good point, maybe it should be mentioned that it will fall back
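
    For reference, a minimal sketch of the conversion path the quoted diff describes. It assumes an existing `SparkSession` and a pandas DataFrame. `roundtrip_with_arrow` and `demo` are hypothetical helper names used only for illustration, not part of PySpark. The configuration key is the one quoted in the diff, and PyArrow must be installed for the Arrow path to be used.

```python
# Configuration key quoted in the diff above; the diff states it is
# disabled by default.
ARROW_CONF_KEY = "spark.sql.execution.arrow.enabled"


def roundtrip_with_arrow(spark, pandas_df):
    """Enable the Arrow setting, then convert pandas -> Spark -> pandas.

    Hypothetical helper for illustration; `spark` is expected to be a
    SparkSession and `pandas_df` a pandas DataFrame.
    """
    # Turn on the Arrow optimization for this session.
    spark.conf.set(ARROW_CONF_KEY, "true")
    # Create a Spark DataFrame from the pandas DataFrame.
    spark_df = spark.createDataFrame(pandas_df)
    # Convert the Spark DataFrame back to a pandas DataFrame.
    return spark_df.toPandas()


def demo():
    """Illustrative local run; needs pyspark, pandas, numpy and PyArrow."""
    import numpy as np
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("arrow-sketch")  # assumed application name
        .getOrCreate()
    )
    pdf = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
    result = roundtrip_with_arrow(spark, pdf)
    print(result.shape)
    spark.stop()
```

    `demo()` is not called automatically; run it only where pyspark, pandas, numpy and PyArrow are installed.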

