arrow-commits mailing list archives

From w...@apache.org
Subject arrow-site git commit: Add Spark+Arrow blog post
Date Thu, 27 Jul 2017 15:29:01 GMT
Repository: arrow-site
Updated Branches:
  refs/heads/asf-site 893662492 -> b3256101e


Add Spark+Arrow blog post


Project: http://git-wip-us.apache.org/repos/asf/arrow-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow-site/commit/b3256101
Tree: http://git-wip-us.apache.org/repos/asf/arrow-site/tree/b3256101
Diff: http://git-wip-us.apache.org/repos/asf/arrow-site/diff/b3256101

Branch: refs/heads/asf-site
Commit: b3256101ec4d38cec8a96de22b9b55dccc8c0b98
Parents: 8936624
Author: Wes McKinney <wes.mckinney@twosigma.com>
Authored: Thu Jul 27 11:28:56 2017 -0400
Committer: Wes McKinney <wes.mckinney@twosigma.com>
Committed: Thu Jul 27 11:28:56 2017 -0400

----------------------------------------------------------------------
 blog/2017/07/26/spark-arrow/index.html | 259 ++++++++++++++++++++++++++++
 blog/index.html                        | 151 ++++++++++++++++
 feed.xml                               | 121 ++++++++++++-
 3 files changed, 530 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow-site/blob/b3256101/blog/2017/07/26/spark-arrow/index.html
----------------------------------------------------------------------
diff --git a/blog/2017/07/26/spark-arrow/index.html b/blog/2017/07/26/spark-arrow/index.html
new file mode 100644
index 0000000..2b483f9
--- /dev/null
+++ b/blog/2017/07/26/spark-arrow/index.html
@@ -0,0 +1,259 @@
+<!DOCTYPE html>
+<html lang="en-US">
+  <head>
+    <meta charset="UTF-8">
+    <title>Speeding up PySpark with Apache Arrow</title>
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <meta name="generator" content="Jekyll v3.4.3">
+    <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
+    <link rel="icon" type="image/x-icon" href="/favicon.ico">
+
+    <title>Apache Arrow Homepage</title>
+    <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+    <link href="/css/main.css" rel="stylesheet">
+    <link href="/css/syntax.css" rel="stylesheet">
+    <script src="https://code.jquery.com/jquery-3.2.1.min.js"
+            integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4="
+            crossorigin="anonymous"></script>
+    <script src="/assets/javascripts/bootstrap.min.js"></script>
+  </head>
+
+
+
+<body class="wrap">
+  <div class="container">
+    <nav class="navbar navbar-default">
+  <div class="container-fluid">
+    <div class="navbar-header">
+      <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+      <a class="navbar-brand" href="/">Apache Arrow&#8482;&nbsp;&nbsp;&nbsp;</a>
+    </div>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="arrow-navbar">
+      <ul class="nav navbar-nav">
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Project Links<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/install/">Install</a></li>
+            <li><a href="/blog/">Blog</a></li>
+            <li><a href="/release/">Releases</a></li>
+            <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li>
+            <li><a href="https://github.com/apache/arrow">Source Code</a></li>
+            <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing
List</a></li>
+            <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li>
+            <li><a href="/committers/">Committers</a></li>
+          </ul>
+        </li>
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Specification<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/memory_layout.html">Memory Layout</a></li>
+            <li><a href="/docs/metadata.html">Metadata</a></li>
+            <li><a href="/docs/ipc.html">Messaging / IPC</a></li>
+          </ul>
+        </li>
+
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Documentation<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/python">Python</a></li>
+            <li><a href="/docs/cpp">C++ API</a></li>
+            <li><a href="/docs/java">Java API</a></li>
+            <li><a href="/docs/c_glib">C GLib API</a></li>
+          </ul>
+        </li>
+        <!-- <li><a href="/blog">Blog</a></li> -->
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">ASF Links<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="http://www.apache.org/">ASF Website</a></li>
+            <li><a href="http://www.apache.org/licenses/">License</a></li>
+            <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li>
+            <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+            <li><a href="http://www.apache.org/security/">Security</a></li>
+          </ul>
+        </li>
+      </ul>
+      <a href="http://www.apache.org/">
+        <img style="float:right;" src="/img/asf_logo.svg" width="120px"/>
+      </a>
+      </div><!-- /.navbar-collapse -->
+    </div>
+  </nav>
+
+
+    <h2>
+      Speeding up PySpark with Apache Arrow
+      <a href="/blog/2017/07/26/spark-arrow/" class="permalink" title="Permalink">∞</a>
+    </h2>
+
+    
+
+    <div class="panel">
+      <div class="panel-body">
+        <div>
+          <span class="label label-default">Published</span>
+          <span class="published">
+            <i class="fa fa-calendar"></i>
+            26 Jul 2017
+          </span>
+        </div>
+        <div>
+          <span class="label label-default">By</span>
+          <a href="http://people.apache.org/~BryanCutler"><i class="fa fa-user"></i>
 (BryanCutler)</a>
+        </div>
+      </div>
+    </div>
+
+    <!--
+
+-->
+
+<p><em><a href="https://github.com/BryanCutler">Bryan Cutler</a>
is a software engineer at IBM’s Spark Technology Center <a href="http://www.spark.tc/">STC</a></em></p>
+
+<p>Beginning with <a href="https://spark.apache.org/">Apache Spark</a> version 2.3,
+<a href="https://arrow.apache.org/">Apache Arrow</a> will be a supported dependency and
+begin to offer increased performance with columnar data transfer. If you are a Spark
+user who prefers to work in Python and Pandas, this is cause for excitement! The
+initial work is limited to collecting a Spark DataFrame with
+<code class="highlighter-rouge">toPandas()</code>, which I will discuss below; however,
+there are many additional improvements that are currently <a href="https://issues.apache.org/jira/issues/?filter=12335725&amp;jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC">underway</a>.</p>
+
+<h1 id="optimizing-spark-conversion-to-pandas">Optimizing Spark Conversion to Pandas</h1>
+
+<p>The previous way of converting a Spark DataFrame to Pandas with <code class="highlighter-rouge">DataFrame.toPandas()</code>
+in PySpark was painfully inefficient. Basically, it worked by first collecting all
+rows to the Spark driver. Next, each row was serialized into Python’s pickle format
+and sent to a Python worker process, where the rows were unpickled into one huge list
+of tuples. Finally, a Pandas DataFrame was created from that list using
+<code class="highlighter-rouge">pandas.DataFrame.from_records()</code>.</p>
+
+<p>This all might seem like standard procedure, but it suffers from two glaring
+issues: 1) even using cPickle, Python serialization is a slow process, and 2) creating
+a <code class="highlighter-rouge">pandas.DataFrame</code> using <code class="highlighter-rouge">from_records</code> must slowly iterate over the list of pure
+Python data and convert each value to Pandas format. See <a href="https://gist.github.com/wesm/0cb5531b1c2e346a0007">here</a> for a detailed
+analysis.</p>
+
+<p>Here is where Arrow really shines to help optimize these steps: 1) once the data is
+in the Arrow memory format, there is no longer any need to serialize or pickle it, as
+Arrow data can be sent directly to the Python process; 2) when the Arrow data is
+received in Python, pyarrow can use zero-copy methods to create a <code class="highlighter-rouge">pandas.DataFrame</code> from entire
+chunks of data at once instead of converting individual scalar values. Additionally,
+the conversion to Arrow data can be pushed back from the driver to the Spark
+executors, which perform it in parallel on the JVM, drastically reducing the load on
+the driver.</p>
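+
+<p>As a rough illustration of that zero-copy step, here is a small sketch using the
+pyarrow API directly, independent of Spark; the table contents are made up:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>import pyarrow as pa
+
+# Build a small Arrow table from columnar data; in Spark, the equivalent
+# record batches are produced by the JVM executors and sent to Python.
+table = pa.Table.from_arrays(
+    [pa.array([0, 1, 2]), pa.array([0.12, 0.98, 0.42])],
+    names=["id", "x"])
+
+# to_pandas() converts entire columns at once, zero-copy where possible.
+pdf = table.to_pandas()
+</code></pre>
+</div>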
+
+<p>As of the merging of <a href="https://issues.apache.org/jira/browse/SPARK-13534">SPARK-13534</a>, the use of Arrow when calling <code class="highlighter-rouge">toPandas()</code>
+needs to be enabled by setting the SQLConf “spark.sql.execution.arrow.enable” to
+“true”. Let’s look at a simple usage example.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
+      /_/
+
+Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
+SparkSession available as 'spark'.
+
+In [1]: from pyspark.sql.functions import rand
+   ...: df = spark.range(1 &lt;&lt; 22).toDF("id").withColumn("x", rand())
+   ...: df.printSchema()
+   ...: 
+root
+ |-- id: long (nullable = false)
+ |-- x: double (nullable = false)
+
+
+In [2]: %time pdf = df.toPandas()
+CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
+Wall time: 20.7 s
+
+In [3]: spark.conf.set("spark.sql.execution.arrow.enable", "true")
+
+In [4]: %time pdf = df.toPandas()
+CPU times: user 40 ms, sys: 32 ms, total: 72 ms                                 
+Wall time: 737 ms
+
+In [5]: pdf.describe()
+Out[5]: 
+                 id             x
+count  4.194304e+06  4.194304e+06
+mean   2.097152e+06  4.998996e-01
+std    1.210791e+06  2.887247e-01
+min    0.000000e+00  8.291929e-07
+25%    1.048576e+06  2.498116e-01
+50%    2.097152e+06  4.999210e-01
+75%    3.145727e+06  7.498380e-01
+max    4.194303e+06  9.999996e-01
+</code></pre>
+</div>
+
+<p>This example was run locally on my laptop using Spark defaults, so the times
+shown should not be taken as precise. Even so, it is clear there is a huge performance
+boost: Arrow takes something that was excruciatingly slow and speeds it up to the
+point of being barely noticeable.</p>
+
+<h1 id="notes-on-usage">Notes on Usage</h1>
+
+<p>Here are some things to keep in mind before making use of this new feature. At the
+time of this writing, pyarrow is not installed automatically with pyspark and needs to
+be installed manually; see the installation <a href="https://github.com/apache/arrow/blob/master/site/install.md">instructions</a>.
+It is planned to add pyarrow as a pyspark dependency so that
+<code class="highlighter-rouge">&gt; pip install pyspark</code> will also install pyarrow.</p>
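+
+<p>Until then, assuming a pip-based environment (conda packages are also available),
+installing it is a one-liner:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>pip install pyarrow
+</code></pre>
+</div>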
+
+<p>Currently, the controlling SQLConf is disabled by default. This can be enabled
+programmatically as in the example above or by adding the line
+“spark.sql.execution.arrow.enable=true” to <code class="highlighter-rouge">SPARK_HOME/conf/spark-defaults.conf</code>.</p>
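+
+<p>For reference, here is a sketch of the corresponding entry; note that
+<code class="highlighter-rouge">spark-defaults.conf</code> accepts
+whitespace-separated property/value pairs:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code># in SPARK_HOME/conf/spark-defaults.conf
+spark.sql.execution.arrow.enable  true
+</code></pre>
+</div>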
+
+<p>Also, not all Spark data types are currently supported; support is limited to
+primitive types for now. Expanded type support is in the works and is also expected to
+land in the Spark 2.3 release.</p>
+
+<h1 id="future-improvements">Future Improvements</h1>
+
+<p>As mentioned, this was just a first step in using Arrow to make life easier for
+Spark Python users. A few exciting initiatives in the works are to allow for
+vectorized UDF evaluation (<a href="https://issues.apache.org/jira/browse/SPARK-21190">SPARK-21190</a>, <a href="https://issues.apache.org/jira/browse/SPARK-21404">SPARK-21404</a>), and the ability
+to apply a function on grouped data using a Pandas DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20396">SPARK-20396</a>).
+Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the
+other direction when creating a Spark DataFrame from an existing Pandas
+DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20791">SPARK-20791</a>). Stay tuned for more!</p>
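+
+<p>For context, the pandas-to-Spark path that SPARK-20791 aims to accelerate is the
+existing <code class="highlighter-rouge">createDataFrame</code> call, sketched below
+assuming an active <code class="highlighter-rouge">SparkSession</code> named
+<code class="highlighter-rouge">spark</code>:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>import pandas as pd
+
+pdf = pd.DataFrame({"id": range(4), "x": [0.1, 0.2, 0.3, 0.4]})
+
+# Today this serializes the pandas data row by row; with Arrow the same
+# call could transfer columnar batches instead.
+df = spark.createDataFrame(pdf)
+</code></pre>
+</div>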
+
+<h1 id="collaborators">Collaborators</h1>
+
+<p>Reaching this first milestone was a group effort from both the Apache Arrow and
+Spark communities. Thanks to the hard work of <a href="https://github.com/wesm">Wes McKinney</a>, <a href="https://github.com/icexelloss">Li Jin</a>,
+<a href="https://github.com/holdenk">Holden Karau</a>, Reynold Xin, Wenchen Fan,
+Shane Knapp, and the many others who helped push this effort forward.</p>
+
+
+
+    <hr/>
+<footer class="footer">
+  <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p>
+  <p>&copy; 2017 Apache Software Foundation</p>
+</footer>
+
+  </div>
+</body>
+</html>

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/b3256101/blog/index.html
----------------------------------------------------------------------
diff --git a/blog/index.html b/blog/index.html
index c8a810a..1350659 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -111,6 +111,157 @@
     
   <div class="container">
     <h2>
+      Speeding up PySpark with Apache Arrow
+      <a href="/blog/2017/07/26/spark-arrow/" class="permalink" title="Permalink">∞</a>
+    </h2>
+
+    
+
+    <div class="panel">
+      <div class="panel-body">
+        <div>
+          <span class="label label-default">Published</span>
+          <span class="published">
+            <i class="fa fa-calendar"></i>
+            26 Jul 2017
+          </span>
+        </div>
+        <div>
+          <span class="label label-default">By</span>
+          <a href=""><i class="fa fa-user"></i>  (BryanCutler)</a>
+        </div>
+      </div>
+    </div>
+    <!--
+
+-->
+
+<p><em><a href="https://github.com/BryanCutler">Bryan Cutler</a>
is a software engineer at IBM’s Spark Technology Center <a href="http://www.spark.tc/">STC</a></em></p>
+
+<p>Beginning with <a href="https://spark.apache.org/">Apache Spark</a> version 2.3,
+<a href="https://arrow.apache.org/">Apache Arrow</a> will be a supported dependency and
+begin to offer increased performance with columnar data transfer. If you are a Spark
+user who prefers to work in Python and Pandas, this is cause for excitement! The
+initial work is limited to collecting a Spark DataFrame with
+<code class="highlighter-rouge">toPandas()</code>, which I will discuss below; however,
+there are many additional improvements that are currently <a href="https://issues.apache.org/jira/issues/?filter=12335725&amp;jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC">underway</a>.</p>
+
+<h1 id="optimizing-spark-conversion-to-pandas">Optimizing Spark Conversion to Pandas</h1>
+
+<p>The previous way of converting a Spark DataFrame to Pandas with <code class="highlighter-rouge">DataFrame.toPandas()</code>
+in PySpark was painfully inefficient. Basically, it worked by first collecting all
+rows to the Spark driver. Next, each row was serialized into Python’s pickle format
+and sent to a Python worker process, where the rows were unpickled into one huge list
+of tuples. Finally, a Pandas DataFrame was created from that list using
+<code class="highlighter-rouge">pandas.DataFrame.from_records()</code>.</p>
+
+<p>This all might seem like standard procedure, but it suffers from two glaring
+issues: 1) even using cPickle, Python serialization is a slow process, and 2) creating
+a <code class="highlighter-rouge">pandas.DataFrame</code> using <code class="highlighter-rouge">from_records</code> must slowly iterate over the list of pure
+Python data and convert each value to Pandas format. See <a href="https://gist.github.com/wesm/0cb5531b1c2e346a0007">here</a> for a detailed
+analysis.</p>
+
+<p>Here is where Arrow really shines to help optimize these steps: 1) once the data is
+in the Arrow memory format, there is no longer any need to serialize or pickle it, as
+Arrow data can be sent directly to the Python process; 2) when the Arrow data is
+received in Python, pyarrow can use zero-copy methods to create a <code class="highlighter-rouge">pandas.DataFrame</code> from entire
+chunks of data at once instead of converting individual scalar values. Additionally,
+the conversion to Arrow data can be pushed back from the driver to the Spark
+executors, which perform it in parallel on the JVM, drastically reducing the load on
+the driver.</p>
+
+<p>As of the merging of <a href="https://issues.apache.org/jira/browse/SPARK-13534">SPARK-13534</a>, the use of Arrow when calling <code class="highlighter-rouge">toPandas()</code>
+needs to be enabled by setting the SQLConf “spark.sql.execution.arrow.enable” to
+“true”. Let’s look at a simple usage example.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
+      /_/
+
+Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
+SparkSession available as 'spark'.
+
+In [1]: from pyspark.sql.functions import rand
+   ...: df = spark.range(1 &lt;&lt; 22).toDF("id").withColumn("x", rand())
+   ...: df.printSchema()
+   ...: 
+root
+ |-- id: long (nullable = false)
+ |-- x: double (nullable = false)
+
+
+In [2]: %time pdf = df.toPandas()
+CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
+Wall time: 20.7 s
+
+In [3]: spark.conf.set("spark.sql.execution.arrow.enable", "true")
+
+In [4]: %time pdf = df.toPandas()
+CPU times: user 40 ms, sys: 32 ms, total: 72 ms                                 
+Wall time: 737 ms
+
+In [5]: pdf.describe()
+Out[5]: 
+                 id             x
+count  4.194304e+06  4.194304e+06
+mean   2.097152e+06  4.998996e-01
+std    1.210791e+06  2.887247e-01
+min    0.000000e+00  8.291929e-07
+25%    1.048576e+06  2.498116e-01
+50%    2.097152e+06  4.999210e-01
+75%    3.145727e+06  7.498380e-01
+max    4.194303e+06  9.999996e-01
+</code></pre>
+</div>
+
+<p>This example was run locally on my laptop using Spark defaults, so the times
+shown should not be taken as precise. Even so, it is clear there is a huge performance
+boost: Arrow takes something that was excruciatingly slow and speeds it up to the
+point of being barely noticeable.</p>
+
+<h1 id="notes-on-usage">Notes on Usage</h1>
+
+<p>Here are some things to keep in mind before making use of this new feature. At the
+time of this writing, pyarrow is not installed automatically with pyspark and needs to
+be installed manually; see the installation <a href="https://github.com/apache/arrow/blob/master/site/install.md">instructions</a>.
+It is planned to add pyarrow as a pyspark dependency so that
+<code class="highlighter-rouge">&gt; pip install pyspark</code> will also install pyarrow.</p>
+
+<p>Currently, the controlling SQLConf is disabled by default. This can be enabled
+programmatically as in the example above or by adding the line
+“spark.sql.execution.arrow.enable=true” to <code class="highlighter-rouge">SPARK_HOME/conf/spark-defaults.conf</code>.</p>
+
+<p>Also, not all Spark data types are currently supported; support is limited to
+primitive types for now. Expanded type support is in the works and is also expected to
+land in the Spark 2.3 release.</p>
+
+<h1 id="future-improvements">Future Improvements</h1>
+
+<p>As mentioned, this was just a first step in using Arrow to make life easier for
+Spark Python users. A few exciting initiatives in the works are to allow for
+vectorized UDF evaluation (<a href="https://issues.apache.org/jira/browse/SPARK-21190">SPARK-21190</a>, <a href="https://issues.apache.org/jira/browse/SPARK-21404">SPARK-21404</a>), and the ability
+to apply a function on grouped data using a Pandas DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20396">SPARK-20396</a>).
+Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the
+other direction when creating a Spark DataFrame from an existing Pandas
+DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20791">SPARK-20791</a>). Stay tuned for more!</p>
+
+<h1 id="collaborators">Collaborators</h1>
+
+<p>Reaching this first milestone was a group effort from both the Apache Arrow and
+Spark communities. Thanks to the hard work of <a href="https://github.com/wesm">Wes McKinney</a>, <a href="https://github.com/icexelloss">Li Jin</a>,
+<a href="https://github.com/holdenk">Holden Karau</a>, Reynold Xin, Wenchen Fan,
+Shane Knapp, and the many others who helped push this effort forward.</p>
+
+
+  </div>
+
+  
+
+  
+    
+  <div class="container">
+    <h2>
       Apache Arrow 0.5.0 Release
       <a href="/blog/2017/07/25/0.5.0-release/" class="permalink" title="Permalink">∞</a>
     </h2>

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/b3256101/feed.xml
----------------------------------------------------------------------
diff --git a/feed.xml b/feed.xml
index c999bac..f01301e 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,123 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"
><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link
href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate"
type="text/html" /><updated>2017-07-25T13:38:25-04:00</updated><id>/</id><entry><title
type="html">Apache Arrow 0.5.0 Release</title><link href="/blog/2017/07/25/0.5.0-release/"
rel="alternate" type="text/html" title="Apache Arrow 0.5.0 Release" /><published>2017-07-25T00:00:00-04:00</published><updated>2017-07-25T00:00:00-04:00</updated><id>/blog/2017/07/25/0.5.0-release</id><content
type="html" xml:base="/blog/2017/07/25/0.5.0-release/">&lt;!--
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"
><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link
href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate"
type="text/html" /><updated>2017-07-27T11:28:36-04:00</updated><id>/</id><entry><title
type="html">Speeding up PySpark with Apache Arrow</title><link href="/blog/2017/07/26/spark-arrow/"
rel="alternate" type="text/html" title="Speeding up PySpark with Apache Arrow" /><published>2017-07-26T12:00:00-04:00</published><updated>2017-07-26T12:00:00-04:00</updated><id>/blog/2017/07/26/spark-arrow</id><content
type="html" xml:base="/blog/2017/07/26/spark-arrow/">&lt;!--
+
+--&gt;
+
+&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://github.com/BryanCutler&quot;&gt;Bryan Cutler&lt;/a&gt; is a software engineer at IBM’s Spark Technology Center (&lt;a href=&quot;http://www.spark.tc/&quot;&gt;STC&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;Beginning with &lt;a href=&quot;https://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; version 2.3,
+&lt;a href=&quot;https://arrow.apache.org/&quot;&gt;Apache Arrow&lt;/a&gt; will be a supported dependency and
+begin to offer increased performance with columnar data transfer. If you are a Spark
+user who prefers to work in Python and Pandas, this is cause for excitement! The
+initial work is limited to collecting a Spark DataFrame with
+&lt;code class=&quot;highlighter-rouge&quot;&gt;toPandas()&lt;/code&gt;, which I will discuss below; however,
+there are many additional improvements that are currently &lt;a href=&quot;https://issues.apache.org/jira/issues/?filter=12335725&amp;amp;jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC&quot;&gt;underway&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h1 id=&quot;optimizing-spark-conversion-to-pandas&quot;&gt;Optimizing Spark Conversion to Pandas&lt;/h1&gt;
+
+&lt;p&gt;The previous way of converting a Spark DataFrame to Pandas with &lt;code class=&quot;highlighter-rouge&quot;&gt;DataFrame.toPandas()&lt;/code&gt;
+in PySpark was painfully inefficient. Basically, it worked by first collecting all
+rows to the Spark driver. Next, each row was serialized into Python’s pickle format
+and sent to a Python worker process, where the rows were unpickled into one huge list
+of tuples. Finally, a Pandas DataFrame was created from that list using
+&lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.DataFrame.from_records()&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;This all might seem like standard procedure, but it suffers from two glaring
+issues: 1) even using cPickle, Python serialization is a slow process, and 2) creating
+a &lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.DataFrame&lt;/code&gt; using &lt;code class=&quot;highlighter-rouge&quot;&gt;from_records&lt;/code&gt; must slowly iterate over the list of pure
+Python data and convert each value to Pandas format. See &lt;a href=&quot;https://gist.github.com/wesm/0cb5531b1c2e346a0007&quot;&gt;here&lt;/a&gt; for a detailed
+analysis.&lt;/p&gt;
+
+&lt;p&gt;Here is where Arrow really shines to help optimize these steps: 1) once the data is
+in the Arrow memory format, there is no longer any need to serialize or pickle it, as
+Arrow data can be sent directly to the Python process; 2) when the Arrow data is
+received in Python, pyarrow can use zero-copy methods to create a &lt;code class=&quot;highlighter-rouge&quot;&gt;pandas.DataFrame&lt;/code&gt; from entire
+chunks of data at once instead of converting individual scalar values. Additionally,
+the conversion to Arrow data can be pushed back from the driver to the Spark
+executors, which perform it in parallel on the JVM, drastically reducing the load on
+the driver.&lt;/p&gt;
+
+&lt;p&gt;As of the merging of &lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-13534&quot;&gt;SPARK-13534&lt;/a&gt;, the use of Arrow when calling &lt;code class=&quot;highlighter-rouge&quot;&gt;toPandas()&lt;/code&gt;
+needs to be enabled by setting the SQLConf “spark.sql.execution.arrow.enable” to
+“true”. Let’s look at a simple usage example.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
+      /_/
+
+Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
+SparkSession available as 'spark'.
+
+In [1]: from pyspark.sql.functions import rand
+   ...: df = spark.range(1 &amp;lt;&amp;lt; 22).toDF(&quot;id&quot;).withColumn(&quot;x&quot;, rand())
+   ...: df.printSchema()
+   ...: 
+root
+ |-- id: long (nullable = false)
+ |-- x: double (nullable = false)
+
+
+In [2]: %time pdf = df.toPandas()
+CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
+Wall time: 20.7 s
+
+In [3]: spark.conf.set(&quot;spark.sql.execution.arrow.enable&quot;, &quot;true&quot;)
+
+In [4]: %time pdf = df.toPandas()
+CPU times: user 40 ms, sys: 32 ms, total: 72 ms                                 
+Wall time: 737 ms
+
+In [5]: pdf.describe()
+Out[5]: 
+                 id             x
+count  4.194304e+06  4.194304e+06
+mean   2.097152e+06  4.998996e-01
+std    1.210791e+06  2.887247e-01
+min    0.000000e+00  8.291929e-07
+25%    1.048576e+06  2.498116e-01
+50%    2.097152e+06  4.999210e-01
+75%    3.145727e+06  7.498380e-01
+max    4.194303e+06  9.999996e-01
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;This example was run locally on my laptop using Spark defaults, so the times
+shown should not be taken as precise. Even so, it is clear there is a huge performance
+boost: Arrow takes something that was excruciatingly slow and speeds it up to the
+point of being barely noticeable.&lt;/p&gt;
+
+&lt;h1 id=&quot;notes-on-usage&quot;&gt;Notes on Usage&lt;/h1&gt;
+
+&lt;p&gt;Here are some things to keep in mind before making use of this new feature. At
+the time of this writing, pyarrow is not installed automatically with pyspark and
+needs to be installed manually; see the installation &lt;a href=&quot;https://github.com/apache/arrow/blob/master/site/install.md&quot;&gt;instructions&lt;/a&gt;.
+It is planned to add pyarrow as a pyspark dependency so that
+&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;gt; pip install pyspark&lt;/code&gt; will also install pyarrow.&lt;/p&gt;
+
+&lt;p&gt;Currently, the controlling SQLConf is disabled by default. This can be enabled
+programmatically as in the example above or by adding the line
+“spark.sql.execution.arrow.enable=true” to &lt;code class=&quot;highlighter-rouge&quot;&gt;SPARK_HOME/conf/spark-defaults.conf&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;Also, not all Spark data types are currently supported; support is limited to
+primitive types for now. Expanded type support is in the works and is also expected to
+land in the Spark 2.3 release.&lt;/p&gt;
+
+&lt;h1 id=&quot;future-improvements&quot;&gt;Future Improvements&lt;/h1&gt;
+
+&lt;p&gt;As mentioned, this was just a first step in using Arrow to make life easier for
+Spark Python users. A few exciting initiatives in the works are to allow for
+vectorized UDF evaluation (&lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-21190&quot;&gt;SPARK-21190&lt;/a&gt;, &lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-21404&quot;&gt;SPARK-21404&lt;/a&gt;), and the ability
+to apply a function on grouped data using a Pandas DataFrame (&lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-20396&quot;&gt;SPARK-20396&lt;/a&gt;).
+Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the
+other direction when creating a Spark DataFrame from an existing Pandas
+DataFrame (&lt;a href=&quot;https://issues.apache.org/jira/browse/SPARK-20791&quot;&gt;SPARK-20791&lt;/a&gt;). Stay tuned for more!&lt;/p&gt;
+
+&lt;h1 id=&quot;collaborators&quot;&gt;Collaborators&lt;/h1&gt;
+
+&lt;p&gt;Reaching this first milestone was a group effort from both the Apache Arrow and
+Spark communities. Thanks to the hard work of &lt;a href=&quot;https://github.com/wesm&quot;&gt;Wes McKinney&lt;/a&gt;, &lt;a href=&quot;https://github.com/icexelloss&quot;&gt;Li Jin&lt;/a&gt;,
+&lt;a href=&quot;https://github.com/holdenk&quot;&gt;Holden Karau&lt;/a&gt;, Reynold Xin, Wenchen Fan, Shane Knapp, and the many others who helped push this effort forward.&lt;/p&gt;</content><author><name>BryanCutler</name></author></entry><entry><title
type="html">Apache Arrow 0.5.0 Release</title><link href="/blog/2017/07/25/0.5.0-release/"
rel="alternate" type="text/html" title="Apache Arrow 0.5.0 Release" /><published>2017-07-25T00:00:00-04:00</published><updated>2017-07-25T00:00:00-04:00</updated><id>/blog/2017/07/25/0.5.0-release</id><content
type="html" xml:base="/blog/2017/07/25/0.5.0-release/">&lt;!--
 
 --&gt;
 

