arrow-commits mailing list archives

Subject arrow git commit: ARROW-1122: [Website] Add turbodbc + arrow blog post
Date Fri, 16 Jun 2017 00:48:12 GMT
Repository: arrow
Updated Branches:
  refs/heads/master d1de66bda -> 5b66c25c9

ARROW-1122: [Website] Add turbodbc + arrow blog post

Closes #757

Change-Id: Idb147f5681e725f8852108ae3a1b870afc14973c


Branch: refs/heads/master
Commit: 5b66c25c9f7ee5aeb5ce2646af4fcf3ccde702fd
Parents: d1de66b
Author: Michael König <>
Authored: Thu Jun 15 13:21:49 2017 +0200
Committer: Wes McKinney <>
Committed: Thu Jun 15 20:46:05 2017 -0400

 site/_data/contributors.yml              |   3 +
 site/_posts/ | 117 ++++++++++++++++++++++++++
 site/img/turbodbc_arrow.png              | Bin 0 -> 75697 bytes
 3 files changed, 120 insertions(+)
diff --git a/site/_data/contributors.yml b/site/_data/contributors.yml
index 7bed83d..2a8d95b 100644
--- a/site/_data/contributors.yml
+++ b/site/_data/contributors.yml
@@ -29,4 +29,7 @@
   apacheId: julienledem
   githubId: julienledem
   role: PMC
+- name: Michael König
+  apacheId: MathMagique
+  githubId: MathMagique
 # End contributors.yml
diff --git a/site/_posts/ b/site/_posts/
new file mode 100644
index 0000000..438e56a
--- /dev/null
+++ b/site/_posts/
@@ -0,0 +1,117 @@
+---
+layout: post
+title: "Connecting Relational Databases to the Apache Arrow World with turbodbc"
+date: "2017-06-16 00:00:00 +0200"
+author: MathMagique
+categories: [application]
+---
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+*[Michael König][15] is the lead developer of the [turbodbc project][2].*
+The [Apache Arrow][1] project set out to become the universal data layer for
+column-oriented data processing systems: a common format that avoids
+serialization costs without compromising performance. While relational
+databases still lag behind in Apache Arrow adoption, the Python database module
+[turbodbc][2] brings Apache Arrow support to these databases using a much
+older, more specialized data exchange layer: [ODBC][3].
+ODBC is a database interface that lets developers transfer data in either
+row-wise or column-wise fashion. Previous Python ODBC modules typically use the
+row-wise approach, often trading repeated database roundtrips for simplified
+buffer handling. This makes them less suited for data-intensive applications,
+particularly when interfacing with modern columnar analytical databases.
+In contrast, turbodbc was designed to leverage columnar data processing from day
+one. Naturally, this implies using the columnar portion of the ODBC API. Equally
+important, however, is to find new ways of providing columnar data to Python users
+that exceed the capabilities of the row-wise API mandated by Python’s [PEP 249][4].
+Turbodbc has adopted Apache Arrow for this very task with the recently released
+version 2.0.0:
+{% highlight python %}
+>>> from turbodbc import connect
+>>> connection = connect(dsn="My columnar database")
+>>> cursor = connection.cursor()
+>>> cursor.execute("SELECT some_integers, some_strings FROM my_table")
+>>> cursor.fetchallarrow()
+pyarrow.Table
+some_integers: int64
+some_strings: string
+{% endhighlight %}
+With this new addition, the data flow for the result set of a typical SELECT
+query looks like this:
+*   The database prepares the result set and exposes it to the ODBC driver using
+    either row-wise or column-wise storage.
+*   Turbodbc has the ODBC driver write chunks of the result set into columnar buffers.
+*   These buffers are exposed to turbodbc's Apache Arrow frontend. This frontend
+    will create an Arrow table and fill in the buffered values.
+*   The previous steps are repeated until the entire result set is retrieved.
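The chunked, columnar fetch loop described above can be sketched in plain Python. This is purely illustrative: turbodbc implements these steps in C++ against the ODBC API, and the names here (`fetch_chunk`, `fetch_all_columnar`, `CHUNK_ROWS`) are hypothetical stand-ins, not turbodbc internals.

```python
# Illustrative sketch of a chunked, columnar fetch loop.
# The "driver" fills fixed-size column buffers; each filled chunk is
# appended column-wise to the growing result until the set is exhausted.

CHUNK_ROWS = 3  # real buffers hold many thousands of rows


def fetch_chunk(result_set, start):
    """Stand-in for the ODBC driver writing one chunk into columnar buffers."""
    return {name: column[start:start + CHUNK_ROWS]
            for name, column in result_set.items()}


def fetch_all_columnar(result_set):
    n_rows = len(next(iter(result_set.values())))
    columns = {name: [] for name in result_set}
    start = 0
    while start < n_rows:                      # repeat until exhausted
        chunk = fetch_chunk(result_set, start)
        for name, values in chunk.items():     # append buffers column-wise
            columns[name].extend(values)
        start += CHUNK_ROWS
    return columns


table = fetch_all_columnar({
    "some_integers": [1, 2, 3, 4, 5],
    "some_strings": ["a", "b", "c", "d", "e"],
})
```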
+![Data flow from relational databases to Python with turbodbc and the Apache Arrow frontend]({{ site.url }}/img/turbodbc_arrow.png){:class="img-responsive" width="75%"}
+In practice, it is possible to achieve the following ideal situation: A 64-bit integer
+column is stored as one contiguous block of memory in a columnar database. A huge chunk
+of 64-bit integers is transferred over the network and the ODBC driver directly writes
+it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates these values
+by copying the entire 64-bit buffer into a free portion of an Arrow table's 64-bit
+integer column.
+Moving data from the database to an Arrow table and, thus, providing it to the
+Python user can be as simple as copying memory blocks around, megabytes
+equivalent to hundreds of thousands of rows at a time. The absence of
+serialization and conversion logic renders the process extremely efficient.
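The block-copy idea can be demonstrated with Python's standard library alone. This is a toy stand-in for what happens inside turbodbc's C++ core: a contiguous buffer of 64-bit integers is copied into a free region of a column in a single operation, with no per-value conversion.

```python
from array import array

# A chunk of signed 64-bit integers (typecode 'q'), standing in for what
# the ODBC driver delivers, and a pre-allocated target column with a
# partially filled prefix.
chunk = array('q', [10, 20, 30, 40])
column = array('q', [0] * 8)          # free portion of the result column
filled = 2                            # rows already accumulated
column[:filled] = array('q', [1, 2])

# One contiguous memory copy into the free region, no per-value work:
memoryview(column)[filled:filled + len(chunk)] = memoryview(chunk)
```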
+Once the data is stored in an Arrow table, Python users can continue to do some
+actual work. They can convert it into a [Pandas DataFrame][5] for data analysis
+(using a quick `table.to_pandas()`), pass it on to other data processing
+systems such as [Apache Spark][6] or [Apache Impala (incubating)][7], or store
+it in the [Apache Parquet][8] file format. This way, non-Python systems are
+efficiently connected with relational databases.
+In the future, turbodbc’s Arrow support will be extended to use more
+sophisticated features such as [dictionary-encoded][9] string fields. We also
+plan to pick smaller than 64-bit [data types][10] where possible. Last but not
+least, Arrow support will be extended to cover the reverse direction of data
+flow, so that Python users can quickly insert Arrow tables into relational
+databases.
+If you would like to learn more about turbodbc, check out the [GitHub project][2] and the
+[project documentation][11]. If you want to learn more about how turbodbc implements the
+nitty-gritty details, check out parts [one][12] and [two][13] of the
+["Making of turbodbc"][12] series at [Blue Yonder's technology blog][14].
\ No newline at end of file
diff --git a/site/img/turbodbc_arrow.png b/site/img/turbodbc_arrow.png
new file mode 100644
index 0000000..b534bf9
Binary files /dev/null and b/site/img/turbodbc_arrow.png differ
