arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject IO considerations for PyArrow
Date Wed, 01 Jun 2016 00:44:33 GMT
hi folks,

I wanted to bring up what is likely to become an issue very soon in
the context of our work to provide an Arrow-based Parquet interface
for Python Arrow users.

https://github.com/apache/arrow/pull/83

At the moment, parquet-cpp features an API that enables reading a file
from local disk (using C standard library calls):

https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111

This is fine for now; however, we will quickly need to deal with a few
additional sources of data:

1) File-like Python objects (i.e. objects that have `seek`, `tell`,
and `read` methods)
2) Remote blob stores: HDFS and S3

Implementing #1 at present is a routine exercise in using the Python C
API. #2 is less straightforward -- one approach others have taken is
to wrap remote storage in separate Python file-like classes so that it
can be read as if it were a local file. This has multiple downsides
(see the sketch after this list):

- read/seek/tell calls must cross up into the Python interpreter and
back down into the C++ layer
- bytes buffered by read calls get copied into Python bytes objects
(see PyBytes_FromStringAndSize)
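
To make this concrete, here is a minimal sketch (class and method
names are mine, not from the PR) of what adapting a Python file-like
object looks like from C++ -- it shows both why #1 is routine and
where the GIL traffic and extra byte copy above come from:

// Hypothetical sketch, not code from the PR: adapting a Python
// file-like object from C++ via the Python C API
#include <Python.h>

#include <cstdint>
#include <cstring>

// Every call has to take the GIL, and bytes returned by read() get
// copied out of a PyBytes object a second time
class PyFileAdapter {
 public:
  explicit PyFileAdapter(PyObject* file) : file_(file) { Py_INCREF(file_); }
  ~PyFileAdapter() { Py_DECREF(file_); }

  void Seek(int64_t position) {
    PyGILState_STATE gil = PyGILState_Ensure();
    PyObject* result = PyObject_CallMethod(
        file_, "seek", "L", static_cast<long long>(position));
    Py_XDECREF(result);
    PyGILState_Release(gil);
  }

  // Copies up to nbytes into out; returns the number of bytes read
  int64_t Read(int64_t nbytes, uint8_t* out) {
    PyGILState_STATE gil = PyGILState_Ensure();
    PyObject* bytes = PyObject_CallMethod(
        file_, "read", "L", static_cast<long long>(nbytes));
    char* data = nullptr;
    Py_ssize_t length = 0;
    if (bytes != nullptr &&
        PyBytes_AsStringAndSize(bytes, &data, &length) == 0) {
      // second copy: out of the PyBytes buffer into the caller's buffer
      std::memcpy(out, data, static_cast<size_t>(length));
    }
    Py_XDECREF(bytes);
    PyGILState_Release(gil);
    return length;
  }

 private:
  PyObject* file_;
};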

Beyond the GIL / concurrency issues, there is an efficiency loss here
that can be remedied by instead implementing:

- A direct C/C++-level interface to remote blob stores (independent of
the Python interpreter). These can then buffer bytes directly in the
form requested by other C++ consumer libraries (like parquet-cpp)

- A Python file-like interface on top of that, so that users can still
get at the bytes in pure Python if they want (for example, some
functions, like pandas.read_csv, primarily deal with file-like things)

This is a clearly superior solution, and has been notably pursued in
recent times by Dato's SFrame library (BSD 3-clause):

https://github.com/dato-code/SFrame/tree/master/oss_src/fileio
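
To make the first point concrete, here is a minimal sketch of the kind
of C++-level interface I have in mind; none of these class names exist
in arrow or parquet-cpp today, they are purely illustrative. An HDFS
(libhdfs) or S3 implementation would provide the same interface and
buffer directly into C++ memory:

// Illustrative only -- not an existing arrow / parquet-cpp API. A
// common C++-level source interface that local files and remote blob
// stores can implement, so consumers like parquet-cpp never touch
// Python objects.
#include <cstdint>
#include <cstdio>

class RandomAccessSource {
 public:
  virtual ~RandomAccessSource() = default;
  virtual int64_t Size() = 0;
  virtual void Seek(int64_t position) = 0;
  // Reads up to nbytes into a caller-provided C++ buffer
  virtual int64_t Read(int64_t nbytes, uint8_t* out) = 0;
};

// Local-disk implementation on top of the C standard library,
// analogous to what parquet-cpp's reader does today. An HdfsSource or
// S3Source would look identical from the consumer's point of view.
class LocalFileSource : public RandomAccessSource {
 public:
  explicit LocalFileSource(const char* path)
      : file_(std::fopen(path, "rb")) {}
  ~LocalFileSource() override {
    if (file_ != nullptr) std::fclose(file_);
  }

  int64_t Size() override {
    std::fseek(file_, 0, SEEK_END);
    int64_t size = std::ftell(file_);
    std::fseek(file_, 0, SEEK_SET);
    return size;
  }

  void Seek(int64_t position) override {
    std::fseek(file_, static_cast<long>(position), SEEK_SET);
  }

  int64_t Read(int64_t nbytes, uint8_t* out) override {
    return static_cast<int64_t>(
        std::fread(out, 1, static_cast<std::size_t>(nbytes), file_));
  }

 private:
  std::FILE* file_;
};

The second point then amounts to wrapping whichever implementation is
in use in a small Python class exposing read/seek/tell, so things like
pandas.read_csv keep working.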

The problem, however, is the inevitable scope creep for the PyArrow
Python package. Unlike in some other programming languages, Python
programmers face a substantial development-complexity burden if they
choose to break libraries containing C extensions into smaller
components, because those libraries must define "internal" C APIs for
one another to connect through. A notable example is NumPy
(http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose C
API is already being used in PyArrow.
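
For illustration, this is roughly what consuming such an "internal" C
API looks like -- a toy extension module (names made up) that calls
into NumPy's exported function table, the same mechanism PyArrow
relies on today. Splitting PyArrow into several packages would mean
maintaining this kind of coupling between our own components:

// Purely illustrative: a tiny extension module that uses NumPy's
// exported C API (the import_array() capsule mechanism)
#include <Python.h>

#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include <numpy/arrayobject.h>

static PyObject* make_array(PyObject* self, PyObject* args) {
  npy_intp length = 16;
  // PyArray_SimpleNew dispatches through NumPy's C API function table
  return PyArray_SimpleNew(1, &length, NPY_FLOAT64);
}

static PyMethodDef methods[] = {
    {"make_array", make_array, METH_NOARGS, "Return a new float64 array"},
    {nullptr, nullptr, 0, nullptr}};

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT, "example_ext", nullptr, -1, methods};

PyMODINIT_FUNC PyInit_example_ext(void) {
  // import_array() loads NumPy's function-pointer table; every
  // extension using the NumPy C API must call it before any PyArray_*
  import_array();
  return PyModule_Create(&moduledef);
}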

I've been thinking about this problem for several weeks, and my net
recommendation is that we embrace the scope creep in PyArrow (as long
as we try to make optional features, e.g. low-level S3 / libhdfs
integration, "opt-in" rather than required for all users). I'd like to
hear from some others, though (e.g. Uwe, Micah, etc.).

thanks,
Wes
