arrow-commits mailing list archives

From w...@apache.org
Subject arrow-site git commit: Add turbodbc guest blog post
Date Fri, 16 Jun 2017 08:42:21 GMT
Repository: arrow-site
Updated Branches:
  refs/heads/asf-site 2316712e9 -> 22ab633ab


Add turbodbc guest blog post


Project: http://git-wip-us.apache.org/repos/asf/arrow-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow-site/commit/22ab633a
Tree: http://git-wip-us.apache.org/repos/asf/arrow-site/tree/22ab633a
Diff: http://git-wip-us.apache.org/repos/asf/arrow-site/diff/22ab633a

Branch: refs/heads/asf-site
Commit: 22ab633ab6352c8b38026cfe0e73fb8289d56a5e
Parents: 2316712
Author: Wes McKinney <wes.mckinney@twosigma.com>
Authored: Fri Jun 16 04:42:10 2017 -0400
Committer: Wes McKinney <wes.mckinney@twosigma.com>
Committed: Fri Jun 16 04:42:10 2017 -0400

----------------------------------------------------------------------
 blog/2017/06/16/turbodbc-arrow/index.html | 219 +++++++++++++++++++++++++
 blog/index.html                           | 111 +++++++++++++
 docs/ipc.html                             |  29 +---
 docs/memory_layout.html                   |  18 +-
 feed.xml                                  |  81 ++++++++-
 img/turbodbc_arrow.png                    | Bin 0 -> 75697 bytes
 6 files changed, 425 insertions(+), 33 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/blog/2017/06/16/turbodbc-arrow/index.html
----------------------------------------------------------------------
diff --git a/blog/2017/06/16/turbodbc-arrow/index.html b/blog/2017/06/16/turbodbc-arrow/index.html
new file mode 100644
index 0000000..84e3d43
--- /dev/null
+++ b/blog/2017/06/16/turbodbc-arrow/index.html
@@ -0,0 +1,219 @@
+<!DOCTYPE html>
+<html lang="en-US">
+  <head>
+    <meta charset="UTF-8">
+    <title>Connecting Relational Databases to the Apache Arrow World with turbodbc</title>
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <meta name="generator" content="Jekyll v3.4.3">
+    <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
+    <link rel="icon" type="image/x-icon" href="/favicon.ico">
+
+    <title>Apache Arrow Homepage</title>
+    <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+    <link href="/css/main.css" rel="stylesheet">
+    <link href="/css/syntax.css" rel="stylesheet">
+    <script src="https://code.jquery.com/jquery-3.2.1.min.js"
+            integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4="
+            crossorigin="anonymous"></script>
+    <script src="/assets/javascripts/bootstrap.min.js"></script>
+  </head>
+
+
+
+<body class="wrap">
+  <div class="container">
+    <nav class="navbar navbar-default">
+  <div class="container-fluid">
+    <div class="navbar-header">
+      <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+      <a class="navbar-brand" href="/">Apache Arrow&#8482;&nbsp;&nbsp;&nbsp;</a>
+    </div>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="arrow-navbar">
+      <ul class="nav navbar-nav">
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Project Links<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/install/">Install</a></li>
+            <li><a href="/blog/">Blog</a></li>
+            <li><a href="/release/">Releases</a></li>
+            <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li>
+            <li><a href="https://github.com/apache/arrow">Source Code</a></li>
+            <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li>
+            <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li>
+            <li><a href="/committers/">Committers</a></li>
+          </ul>
+        </li>
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Specification<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/memory_layout.html">Memory Layout</a></li>
+            <li><a href="/docs/metadata.html">Metadata</a></li>
+            <li><a href="/docs/ipc.html">Messaging / IPC</a></li>
+          </ul>
+        </li>
+
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Documentation<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/python">Python</a></li>
+            <li><a href="/docs/cpp">C++ API</a></li>
+            <li><a href="/docs/java">Java API</a></li>
+            <li><a href="/docs/c_glib">C GLib API</a></li>
+          </ul>
+        </li>
+        <!-- <li><a href="/blog">Blog</a></li> -->
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">ASF Links<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="http://www.apache.org/">ASF Website</a></li>
+            <li><a href="http://www.apache.org/licenses/">License</a></li>
+            <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li>
+            <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+            <li><a href="http://www.apache.org/security/">Security</a></li>
+          </ul>
+        </li>
+      </ul>
+      <a href="http://www.apache.org/">
+        <img style="float:right;" src="/img/asf_logo.svg" width="120px"/>
+      </a>
+      </div><!-- /.navbar-collapse -->
+    </div>
+  </nav>
+
+
+    <h2>
+      Connecting Relational Databases to the Apache Arrow World with turbodbc
+      <a href="/blog/2017/06/16/turbodbc-arrow/" class="permalink" title="Permalink">∞</a>
+    </h2>
+
+    
+
+    <div class="panel">
+      <div class="panel-body">
+        <div>
+          <span class="label label-default">Published</span>
+          <span class="published">
+            <i class="fa fa-calendar"></i>
+            16 Jun 2017
+          </span>
+        </div>
+        <div>
+          <span class="label label-default">By</span>
+          <a href="http://github.com/MathMagique"><i class="fa fa-user"></i> Michael König (MathMagique)</a>
+        </div>
+      </div>
+    </div>
+
+    <!--
+
+-->
+
+<p><em><a href="https://github.com/mathmagique">Michael König</a> is the lead developer of the <a href="https://github.com/blue-yonder/turbodbc">turbodbc project</a></em></p>
+
+<p>The <a href="https://arrow.apache.org/">Apache Arrow</a> project set out to become the universal data layer for
+column-oriented data processing systems without incurring serialization costs
+or compromising on performance on a more general level. While relational
+databases still lag behind in Apache Arrow adoption, the Python database module
+<a href="https://github.com/blue-yonder/turbodbc">turbodbc</a> brings Apache Arrow support to these databases using a much
+older, more specialized data exchange layer: <a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity">ODBC</a>.</p>
+
+<p>ODBC is a database interface that offers developers the option to transfer data
+either in row-wise or column-wise fashion. Previous Python ODBC modules typically
+use the row-wise approach, and often trade repeated database roundtrips for simplified
+buffer handling. This makes them less suited for data-intensive applications,
+particularly when interfacing with modern columnar analytical databases.</p>
+
+<p>In contrast, turbodbc was designed to leverage columnar data processing from day
+one. Naturally, this implies using the columnar portion of the ODBC API. Equally
+important, however, is to find new ways of providing columnar data to Python users
+that exceed the capabilities of the row-wise API mandated by Python’s <a href="https://www.python.org/dev/peps/pep-0249/">PEP 249</a>.
+Turbodbc has adopted Apache Arrow for this very task with the recently released
+version 2.0.0:</p>
+
+<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">turbodbc</span> <span class="kn">import</span> <span class="n">connect</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">connection</span> <span class="o">=</span> <span class="n">connect</span><span class="p">(</span><span class="n">dsn</span><span class="o">=</span><span class="s">"My columnar database"</span><span class="p">)</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span> <span class="o">=</span> <span class="n">connection</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT some_integers, some_strings FROM my_table"</span><span class="p">)</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span><span class="o">.</span><span class="n">fetchallarrow</span><span class="p">()</span>
+<span class="n">pyarrow</span><span class="o">.</span><span class="n">Table</span>
+<span class="n">some_integers</span><span class="p">:</span> <span class="n">int64</span>
+<span class="n">some_strings</span><span class="p">:</span> <span class="n">string</span>
+</code></pre>
+</div>
+
+<p>With this new addition, the data flow for a result set of a typical SELECT query
+is like this:</p>
+<ul>
+  <li>The database prepares the result set and exposes it to the ODBC driver using
+either row-wise or column-wise storage.</li>
+  <li>Turbodbc has the ODBC driver write chunks of the result set into columnar buffers.</li>
+  <li>These buffers are exposed to turbodbc’s Apache Arrow frontend. This frontend
+will create an Arrow table and fill in the buffered values.</li>
+  <li>The previous steps are repeated until the entire result set is retrieved.</li>
+</ul>
+
+<p><img src="/img/turbodbc_arrow.png" alt="Data flow from relational databases to Python with turbodbc and the Apache Arrow frontend" class="img-responsive" width="75%" /></p>
+
+<p>In practice, it is possible to achieve the following ideal situation: A 64-bit integer
+column is stored as one contiguous block of memory in a columnar database. A huge chunk
+of 64-bit integers is transferred over the network and the ODBC driver directly writes
+it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates these values
+by copying the entire 64-bit buffer into a free portion of an Arrow table’s 64-bit
+integer column.</p>
+
+<p>Moving data from the database to an Arrow table and, thus, providing it to the Python
+user can be as simple as copying memory blocks around, megabytes equivalent to hundreds of
+thousands of rows at a time. The absence of serialization and conversion logic renders
+the process extremely efficient.</p>
+
+<p>Once the data is stored in an Arrow table, Python users can continue to do some
+actual work. They can convert it into a <a href="https://arrow.apache.org/docs/python/pandas.html">Pandas DataFrame</a> for data analysis
+(using a quick <code class="highlighter-rouge">table.to_pandas()</code>), pass it on to other data processing
+systems such as <a href="http://spark.apache.org/">Apache Spark</a> or <a href="http://impala.apache.org/">Apache Impala (incubating)</a>, or store
+it in the <a href="http://parquet.apache.org/">Apache Parquet</a> file format. This way, non-Python systems are
+efficiently connected with relational databases.</p>
+
+<p>In the future, turbodbc’s Arrow support will be extended to use more
+sophisticated features such as <a href="https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding">dictionary-encoded</a> string fields. We also
+plan to pick smaller than 64-bit <a href="https://arrow.apache.org/docs/metadata.html#integers">data types</a> where possible. Last but not
+least, Arrow support will be extended to cover the reverse direction of data
+flow, so that Python users can quickly insert Arrow tables into relational
+databases.</p>
+
+<p>If you would like to learn more about turbodbc, check out the <a href="https://github.com/blue-yonder/turbodbc">GitHub project</a> and the
+<a href="http://turbodbc.readthedocs.io/">project documentation</a>. If you want to learn more about how turbodbc implements the
+nitty-gritty details, check out parts <a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">one</a> and <a href="https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/">two</a> of the
+<a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">“Making of turbodbc”</a> series at <a href="https://tech.blue-yonder.com/">Blue Yonder’s technology blog</a>.</p>
+
+
+
+    <hr/>
+<footer class="footer">
+  <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p>
+  <p>&copy; 2017 Apache Software Foundation</p>
+</footer>
+
+  </div>
+</body>
+</html>
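
The chunked data flow described in the blog post added above (the driver fills a columnar buffer, the frontend copies it into the table's column, and the steps repeat) can be sketched in plain Python. This is only an illustration under stated assumptions: the standard-library `array` module stands in for both turbodbc's buffers and an Arrow 64-bit integer column, and `fetch_chunks` and `accumulate` are hypothetical names, not part of the turbodbc or pyarrow API.

```python
from array import array


def fetch_chunks(values, chunk_size):
    """Hypothetical stand-in for the ODBC driver filling a columnar buffer chunk by chunk."""
    for start in range(0, len(values), chunk_size):
        # Typecode 'q' is a signed 64-bit integer, matching the int64 column in the post.
        yield array("q", values[start:start + chunk_size])


def accumulate(chunks):
    """Append each buffer to one contiguous 64-bit integer column."""
    column = array("q")
    for buffer in chunks:
        # The whole buffer is appended in one call; values are not converted one by one.
        column.extend(buffer)
    return column


column = accumulate(fetch_chunks(list(range(10)), chunk_size=4))
print(column.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In the real module, `cursor.fetchallarrow()` hides this loop and returns a `pyarrow.Table`, as shown in the post.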

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/blog/index.html
----------------------------------------------------------------------
diff --git a/blog/index.html b/blog/index.html
index 2102219..9b2c972 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -111,6 +111,117 @@
     
   <div class="container">
     <h2>
+      Connecting Relational Databases to the Apache Arrow World with turbodbc
+      <a href="/blog/2017/06/16/turbodbc-arrow/" class="permalink" title="Permalink">∞</a>
+    </h2>
+
+    
+
+    <div class="panel">
+      <div class="panel-body">
+        <div>
+          <span class="label label-default">Published</span>
+          <span class="published">
+            <i class="fa fa-calendar"></i>
+            16 Jun 2017
+          </span>
+        </div>
+        <div>
+          <span class="label label-default">By</span>
+          <a href="http://github.com/MathMagique"><i class="fa fa-user"></i> Michael König (MathMagique)</a>
+        </div>
+      </div>
+    </div>
+    <!--
+
+-->
+
+<p><em><a href="https://github.com/mathmagique">Michael König</a> is the lead developer of the <a href="https://github.com/blue-yonder/turbodbc">turbodbc project</a></em></p>
+
+<p>The <a href="https://arrow.apache.org/">Apache Arrow</a> project set out to become the universal data layer for
+column-oriented data processing systems without incurring serialization costs
+or compromising on performance on a more general level. While relational
+databases still lag behind in Apache Arrow adoption, the Python database module
+<a href="https://github.com/blue-yonder/turbodbc">turbodbc</a> brings Apache Arrow support to these databases using a much
+older, more specialized data exchange layer: <a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity">ODBC</a>.</p>
+
+<p>ODBC is a database interface that offers developers the option to transfer data
+either in row-wise or column-wise fashion. Previous Python ODBC modules typically
+use the row-wise approach, and often trade repeated database roundtrips for simplified
+buffer handling. This makes them less suited for data-intensive applications,
+particularly when interfacing with modern columnar analytical databases.</p>
+
+<p>In contrast, turbodbc was designed to leverage columnar data processing from day
+one. Naturally, this implies using the columnar portion of the ODBC API. Equally
+important, however, is to find new ways of providing columnar data to Python users
+that exceed the capabilities of the row-wise API mandated by Python’s <a href="https://www.python.org/dev/peps/pep-0249/">PEP 249</a>.
+Turbodbc has adopted Apache Arrow for this very task with the recently released
+version 2.0.0:</p>
+
+<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">turbodbc</span> <span class="kn">import</span> <span class="n">connect</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">connection</span> <span class="o">=</span> <span class="n">connect</span><span class="p">(</span><span class="n">dsn</span><span class="o">=</span><span class="s">"My columnar database"</span><span class="p">)</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span> <span class="o">=</span> <span class="n">connection</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT some_integers, some_strings FROM my_table"</span><span class="p">)</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span><span class="o">.</span><span class="n">fetchallarrow</span><span class="p">()</span>
+<span class="n">pyarrow</span><span class="o">.</span><span class="n">Table</span>
+<span class="n">some_integers</span><span class="p">:</span> <span class="n">int64</span>
+<span class="n">some_strings</span><span class="p">:</span> <span class="n">string</span>
+</code></pre>
+</div>
+
+<p>With this new addition, the data flow for a result set of a typical SELECT query
+is like this:</p>
+<ul>
+  <li>The database prepares the result set and exposes it to the ODBC driver using
+either row-wise or column-wise storage.</li>
+  <li>Turbodbc has the ODBC driver write chunks of the result set into columnar buffers.</li>
+  <li>These buffers are exposed to turbodbc’s Apache Arrow frontend. This frontend
+will create an Arrow table and fill in the buffered values.</li>
+  <li>The previous steps are repeated until the entire result set is retrieved.</li>
+</ul>
+
+<p><img src="/img/turbodbc_arrow.png" alt="Data flow from relational databases to Python with turbodbc and the Apache Arrow frontend" class="img-responsive" width="75%" /></p>
+
+<p>In practice, it is possible to achieve the following ideal situation: A 64-bit integer
+column is stored as one contiguous block of memory in a columnar database. A huge chunk
+of 64-bit integers is transferred over the network and the ODBC driver directly writes
+it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates these values
+by copying the entire 64-bit buffer into a free portion of an Arrow table’s 64-bit
+integer column.</p>
+
+<p>Moving data from the database to an Arrow table and, thus, providing it to the Python
+user can be as simple as copying memory blocks around, megabytes equivalent to hundreds of
+thousands of rows at a time. The absence of serialization and conversion logic renders
+the process extremely efficient.</p>
+
+<p>Once the data is stored in an Arrow table, Python users can continue to do some
+actual work. They can convert it into a <a href="https://arrow.apache.org/docs/python/pandas.html">Pandas DataFrame</a> for data analysis
+(using a quick <code class="highlighter-rouge">table.to_pandas()</code>), pass it on to other data processing
+systems such as <a href="http://spark.apache.org/">Apache Spark</a> or <a href="http://impala.apache.org/">Apache Impala (incubating)</a>, or store
+it in the <a href="http://parquet.apache.org/">Apache Parquet</a> file format. This way, non-Python systems are
+efficiently connected with relational databases.</p>
+
+<p>In the future, turbodbc’s Arrow support will be extended to use more
+sophisticated features such as <a href="https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding">dictionary-encoded</a> string fields. We also
+plan to pick smaller than 64-bit <a href="https://arrow.apache.org/docs/metadata.html#integers">data types</a> where possible. Last but not
+least, Arrow support will be extended to cover the reverse direction of data
+flow, so that Python users can quickly insert Arrow tables into relational
+databases.</p>
+
+<p>If you would like to learn more about turbodbc, check out the <a href="https://github.com/blue-yonder/turbodbc">GitHub project</a> and the
+<a href="http://turbodbc.readthedocs.io/">project documentation</a>. If you want to learn more about how turbodbc implements the
+nitty-gritty details, check out parts <a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">one</a> and <a href="https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/">two</a> of the
+<a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">“Making of turbodbc”</a> series at <a href="https://tech.blue-yonder.com/">Blue Yonder’s technology blog</a>.</p>
+
+
+  </div>
+
+  
+
+  
+    
+  <div class="container">
+    <h2>
       Apache Arrow 0.4.1 Release
       <a href="/blog/2017/06/14/0.4.1-release/" class="permalink" title="Permalink">∞</a>
     </h2>

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/docs/ipc.html
----------------------------------------------------------------------
diff --git a/docs/ipc.html b/docs/ipc.html
index ffbe491..9a0a246 100644
--- a/docs/ipc.html
+++ b/docs/ipc.html
@@ -194,11 +194,9 @@ as an <code class="highlighter-rouge">int32</code> or simply closing the stream
 <p>We define a “file format” supporting random access in a very similar format to
 the streaming format. The file starts and ends with a magic string <code class="highlighter-rouge">ARROW1</code>
 (plus padding). What follows in the file is identical to the stream format. At
-the end of the file, we write a <em>footer</em> containing a redundant copy of the
-schema (which is a part of the streaming format) plus memory offsets and sizes
-for each of the data blocks in the file. This enables random access any record
-batch in the file. See <a href="https://github.com/apache/arrow/blob/master/format/File.fbs">format/File.fbs</a> for the precise details of the file
-footer.</p>
+the end of the file, we write a <em>footer</em> including offsets and sizes for each
+of the data blocks in the file, so that random access is possible. See
+<a href="https://github.com/apache/arrow/blob/master/format/File.fbs">format/File.fbs</a> for the precise details of the file footer.</p>
 
 <p>Schematically we have:</p>
 
@@ -270,24 +268,9 @@ flatbuffer, and any padding bytes</li>
 
 <h3 id="dictionary-batches">Dictionary Batches</h3>
 
-<p>Dictionaries are written in the stream and file formats as a sequence of record
-batches, each having a single field. The complete semantic schema for a
-sequence of record batches, therefore, consists of the schema along with all of
-the dictionaries. The dictionary types are found in the schema, so it is
-necessary to read the schema to first determine the dictionary types so that
-the dictionaries can be properly interpreted.</p>
-
-<div class="highlighter-rouge"><pre class="highlight"><code>table DictionaryBatch {
-  id: long;
-  data: RecordBatch;
-}
-</code></pre>
-</div>
-
-<p>The dictionary <code class="highlighter-rouge">id</code> in the message metadata can be referenced one or more times
-in the schema, so that dictionaries can even be used for multiple fields. See
-the <a href="https://github.com/apache/arrow/blob/master/format/Layout.md">Physical Layout</a> document for more about the semantics of
-dictionary-encoded data.</p>
+<p>Dictionary batches have not yet been implemented, while they are provided for
+in the metadata. For the time being, the <code class="highlighter-rouge">DICTIONARY</code> segments shown above in
+the file do not appear in any of the file implementations.</p>
 
 <h3 id="tensor-multi-dimensional-array-message-format">Tensor (Multi-dimensional Array) Message Format</h3>
 

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/docs/memory_layout.html
----------------------------------------------------------------------
diff --git a/docs/memory_layout.html b/docs/memory_layout.html
index 32c5b92..0f7f819 100644
--- a/docs/memory_layout.html
+++ b/docs/memory_layout.html
@@ -161,7 +161,7 @@ array of some array with a nested type.</li>
 <ul>
   <li>A physical memory layout enabling zero-deserialization data interchange
 amongst a variety of systems handling flat and nested columnar data, including
-such systems as Spark, Drill, Impala, Kudu, Ibis, ODBC protocols, and
+such systems as Spark, Drill, Impala, Kudu, Ibis, Spark, ODBC protocols, and
 proprietary systems that utilize the open source components.</li>
   <li>All array slots are accessible in constant time, with complexity growing
 linearly in the nesting level</li>
@@ -231,7 +231,7 @@ data-structures over 64 bytes (which will be a common case for Arrow Arrays).</l
 
 <p>Requiring padding to a multiple of 64 bytes allows for using <a href="https://software.intel.com/en-us/node/600110">SIMD</a> instructions
 consistently in loops without additional conditional checks.
-This should allow for simpler and more efficient code.
+This should allow for simpler and more efficient code.<br />
 The specific padding length was chosen because it matches the largest known
 SIMD instruction registers available as of April 2016 (Intel AVX-512).
 Guaranteed padding can also allow certain compilers
@@ -265,7 +265,7 @@ signed integer, as it may be as large as the array length.</p>
 <p>Any relative type can have null value slots, whether primitive or nested type.</p>
 
 <p>An array with nulls must have a contiguous memory buffer, known as the null (or
-validity) bitmap, whose length is a multiple of 64 bytes (as discussed above)
+validity) bitmap, whose length is a multiple of 64 bytes (as discussed above)<br />
 and large enough to have at least 1 bit for each array
 slot.</p>
 
@@ -322,7 +322,7 @@ does not need to be adjacent in memory to the values buffer.</p>
 
   |Byte 0 (validity bitmap) | Bytes 1-63            |
   |-------------------------|-----------------------|
-  | 00011011                | 0 (padding)           |
+  |00011011                 | 0 (padding)           |
 
 * Value Buffer:
 
@@ -497,16 +497,16 @@ primitive value array having Int32 logical type.</char></p>
 <div class="highlighter-rouge"><pre class="highlight"><code>* Length: 4, Null count: 1
 * Null bitmap buffer:
 
-  |Byte 0 (validity bitmap) | Bytes 1-63            |
-  |-------------------------|-----------------------|
-  | 00001011                | 0 (padding)           |
+  | Byte 0 (validity bitmap) | Bytes 1-7   | Bytes 8-63  |
+  |--------------------------|-------------|-------------|
+  | 00001011                 | 0 (padding) | unspecified |
 
 * Children arrays:
   * field-0 array (`List&lt;char&gt;`):
     * Length: 4, Null count: 1
     * Null bitmap buffer:
 
-      | Byte 0 (validity bitmap) | Bytes 1-63            |
+      | Byte 0 (validity bitmap) | Bytes 1-7             |
       |--------------------------|-----------------------|
       | 00001101                 | 0 (padding)           |
 
@@ -678,7 +678,7 @@ union, it has some advantages that may be desirable in certain use cases:</p>
 
       |Byte 0 (validity bitmap) | Bytes 1-63            |
       |-------------------------|-----------------------|
-      | 00001010                | 0 (padding)           |
+      |00001010                 | 0 (padding)           |
 
     * Value buffer:
 

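The validity bitmaps shown in the docs/memory_layout.html hunks above can be reproduced with a short sketch. It assumes the least-significant-bit numbering used by the Arrow layout specification and the 64-byte padding stated in the surrounding text; `validity_bitmap` is a hypothetical helper written for illustration, not an Arrow API.

```python
def validity_bitmap(is_valid, alignment=64):
    """Build a validity bitmap: bit (slot % 8) of byte (slot // 8) is 1 if the slot is non-null.

    The result is zero-padded to a multiple of `alignment` bytes (assumed 64 here).
    """
    n_bytes = (len(is_valid) + 7) // 8
    padded_len = -(-n_bytes // alignment) * alignment  # round up to the alignment
    bitmap = bytearray(padded_len)
    for slot, valid in enumerate(is_valid):
        if valid:
            bitmap[slot // 8] |= 1 << (slot % 8)
    return bytes(bitmap)


# Length 4, null count 1, with the null in slot 2 (as in the struct example above).
bitmap = validity_bitmap([True, True, False, True])
print(format(bitmap[0], "08b"))  # 00001011
print(len(bitmap))               # 64
```

Reading the printed byte from right to left gives the slots in order, which is why the null in slot 2 shows up as the third bit from the right.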
http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/feed.xml
----------------------------------------------------------------------
diff --git a/feed.xml b/feed.xml
index d3e09bd..d4b7a37 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,83 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-06-14T10:33:53-04:00</updated><id>/</id><entry><title type="html">Apache Arrow 0.4.1 Release</title><link href="/blog/2017/06/14/0.4.1-release/" rel="alternate" type="text/html" title="Apache Arrow 0.4.1 Release" /><published>2017-06-14T10:00:00-04:00</published><updated>2017-06-14T10:00:00-04:00</updated><id>/blog/2017/06/14/0.4.1-release</id><content type="html" xml:base="/blog/2017/06/14/0.4.1-release/">&lt;!--
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-06-16T04:40:53-04:00</updated><id>/</id><entry><title type="html">Connecting Relational Databases to the Apache Arrow World with turbodbc</title><link href="/blog/2017/06/16/turbodbc-arrow/" rel="alternate" type="text/html" title="Connecting Relational Databases to the Apache Arrow World with turbodbc" /><published>2017-06-16T04:00:00-04:00</published><updated>2017-06-16T04:00:00-04:00</updated><id>/blog/2017/06/16/turbodbc-arrow</id><content type="html" xml:base="/blog/2017/06/16/turbodbc-arrow/">&lt;!--
+
+--&gt;
+
+&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://github.com/mathmagique&quot;&gt;Michael König&lt;/a&gt; is the lead developer of the &lt;a href=&quot;https://github.com/blue-yonder/turbodbc&quot;&gt;turbodbc project&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;The &lt;a href=&quot;https://arrow.apache.org/&quot;&gt;Apache Arrow&lt;/a&gt; project set out to become the universal data layer for
+column-oriented data processing systems without incurring serialization costs
+or compromising on performance on a more general level. While relational
+databases still lag behind in Apache Arrow adoption, the Python database module
+&lt;a href=&quot;https://github.com/blue-yonder/turbodbc&quot;&gt;turbodbc&lt;/a&gt; brings Apache Arrow support to these databases using a much
+older, more specialized data exchange layer: &lt;a href=&quot;https://en.wikipedia.org/wiki/Open_Database_Connectivity&quot;&gt;ODBC&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;ODBC is a database interface that offers developers the option to transfer data
+either in row-wise or column-wise fashion. Previous Python ODBC modules typically
+use the row-wise approach, and often trade repeated database roundtrips for simplified
+buffer handling. This makes them less suited for data-intensive applications,
+particularly when interfacing with modern columnar analytical databases.&lt;/p&gt;
+
+&lt;p&gt;In contrast, turbodbc was designed to leverage columnar data processing from day
+one. Naturally, this implies using the columnar portion of the ODBC API. Equally
+important, however, is to find new ways of providing columnar data to Python users
+that exceed the capabilities of the row-wise API mandated by Python’s &lt;a href=&quot;https://www.python.org/dev/peps/pep-0249/&quot;&gt;PEP 249&lt;/a&gt;.
+Turbodbc has adopted Apache Arrow for this very task with the recently released
+version 2.0.0:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;pre
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;turbodbc&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span
class=&quot;n&quot;&gt;connect&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;connection&lt;/span&gt; &lt;span
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dsn&lt;/span&gt;&lt;span
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;My
columnar database&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cursor&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;connection&lt;/span&gt;&lt;span
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cursor&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;()&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cursor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span
class=&quot;n&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span
class=&quot;s&quot;&gt;&quot;SELECT some_integers, some_strings FROM my_table&quot;&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cursor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span
class=&quot;n&quot;&gt;fetchallarrow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;pyarrow&lt;/span&gt;&lt;span
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Table&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;some_integers&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;int64&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;some_strings&lt;/span&gt;&lt;span
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;With this new addition, the data flow for the result set of a typical SELECT
query
+looks like this:&lt;/p&gt;
+&lt;ul&gt;
+  &lt;li&gt;The database prepares the result set and exposes it to the ODBC driver
using
+either row-wise or column-wise storage.&lt;/li&gt;
+  &lt;li&gt;Turbodbc has the ODBC driver write chunks of the result set into columnar
buffers.&lt;/li&gt;
+  &lt;li&gt;These buffers are exposed to turbodbc’s Apache Arrow frontend. This
frontend
+will create an Arrow table and fill in the buffered values.&lt;/li&gt;
+  &lt;li&gt;The previous steps are repeated until the entire result set is retrieved.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img src=&quot;/img/turbodbc_arrow.png&quot; alt=&quot;Data
flow from relational databases to Python with turbodbc and the Apache Arrow frontend&quot;
class=&quot;img-responsive&quot; width=&quot;75%&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;In practice, it is possible to achieve the following ideal situation: A
64-bit integer
+column is stored as one contiguous block of memory in a columnar database. A huge chunk
+of 64-bit integers is transferred over the network and the ODBC driver directly writes
+it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates these values
+by copying the entire 64-bit buffer into a free portion of an Arrow table’s 64-bit
+integer column.&lt;/p&gt;
+
+&lt;p&gt;Moving data from the database to an Arrow table and, thus, providing it
to the Python
+user can be as simple as copying memory blocks around, megabytes at a time, equivalent to
+hundreds of thousands of rows. The absence of serialization and conversion logic renders
+the process extremely efficient.&lt;/p&gt;
+
+&lt;p&gt;Once the data is stored in an Arrow table, Python users can continue to
do some
+actual work. They can convert it into a &lt;a href=&quot;https://arrow.apache.org/docs/python/pandas.html&quot;&gt;Pandas
DataFrame&lt;/a&gt; for data analysis
+(using a quick &lt;code class=&quot;highlighter-rouge&quot;&gt;table.to_pandas()&lt;/code&gt;),
pass it on to other data processing
+systems such as &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache
Spark&lt;/a&gt; or &lt;a href=&quot;http://impala.apache.org/&quot;&gt;Apache
Impala (incubating)&lt;/a&gt;, or store
+it in the &lt;a href=&quot;http://parquet.apache.org/&quot;&gt;Apache Parquet&lt;/a&gt;
file format. This way, non-Python systems are
+efficiently connected with relational databases.&lt;/p&gt;
+
+&lt;p&gt;In the future, turbodbc’s Arrow support will be extended to use more
+sophisticated features such as &lt;a href=&quot;https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding&quot;&gt;dictionary-encoded&lt;/a&gt;
string fields. We also
+plan to pick &lt;a href=&quot;https://arrow.apache.org/docs/metadata.html#integers&quot;&gt;data
types&lt;/a&gt; smaller than 64 bits where possible. Last but not
+least, Arrow support will be extended to cover the reverse direction of data
+flow, so that Python users can quickly insert Arrow tables into relational
+databases.&lt;/p&gt;
+
+&lt;p&gt;If you would like to learn more about turbodbc, check out the &lt;a
href=&quot;https://github.com/blue-yonder/turbodbc&quot;&gt;GitHub project&lt;/a&gt;
and the
+&lt;a href=&quot;http://turbodbc.readthedocs.io/&quot;&gt;project documentation&lt;/a&gt;.
If you want to learn more about how turbodbc implements the
+nitty-gritty details, check out parts &lt;a href=&quot;https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/&quot;&gt;one&lt;/a&gt;
and &lt;a href=&quot;https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/&quot;&gt;two&lt;/a&gt;
of the
+&lt;a href=&quot;https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/&quot;&gt;“Making
of turbodbc”&lt;/a&gt; series at &lt;a href=&quot;https://tech.blue-yonder.com/&quot;&gt;Blue
Yonder’s technology blog&lt;/a&gt;.&lt;/p&gt;</content><author><name>MathMagique</name></author></entry><entry><title
type="html">Apache Arrow 0.4.1 Release</title><link href="/blog/2017/06/14/0.4.1-release/"
rel="alternate" type="text/html" title="Apache Arrow 0.4.1 Release" /><published>2017-06-14T10:00:00-04:00</published><updated>2017-06-14T10:00:00-04:00</updated><id>/blog/2017/06/14/0.4.1-release</id><content
type="html" xml:base="/blog/2017/06/14/0.4.1-release/">&lt;!--
 
 --&gt;
 

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/img/turbodbc_arrow.png
----------------------------------------------------------------------
diff --git a/img/turbodbc_arrow.png b/img/turbodbc_arrow.png
new file mode 100644
index 0000000..b534bf9
Binary files /dev/null and b/img/turbodbc_arrow.png differ

