arrow-dev mailing list archives

From: Uwe Korn <uw...@xhochy.com>
Subject: Re: IO considerations for PyArrow
Date: Fri, 03 Jun 2016 13:26:20 GMT
Hello,

I would also embrace the scope creep. Since we will deal with a lot of 
data, the cross-language I/O path will matter significantly for 
end-to-end performance. We definitely have to be careful to make the 
dependencies toggleable in the build system: you should be able to 
easily get a build with all dependencies, but it should also be possible 
to be very selective about which ones are included in a build.
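
To make this concrete, here is a minimal sketch of how opt-in compilation 
could look on the C++ side; the PYARROW_WITH_HDFS / PYARROW_WITH_S3 flags and 
the registration functions are invented for this example and are not existing 
build options:

    // Hypothetical compile-time toggles set by the build system; only the
    // backends a user asked for end up in the binary and on its link line.
    void RegisterHdfsBackend();  // defined in the (optional) HDFS module
    void RegisterS3Backend();    // defined in the (optional) S3 module

    void RegisterOptionalIoBackends() {
    #ifdef PYARROW_WITH_HDFS
      RegisterHdfsBackend();
    #endif
    #ifdef PYARROW_WITH_S3
      RegisterS3Backend();
    #endif
    }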

For HDFS and S3 support, I'm not sure whether arrow-cpp, pyarrow, or 
parquet-cpp is the right place for the C++ implementation. For 
arrow-cpp it would be the same scope creep as for PyArrow, but the code 
could then already be used by C++ Arrow users; in parquet-cpp these IO 
classes would also be helpful for non-Arrow users. For the moment I would 
put the C++ implementations into arrow-cpp, as this keeps the scope 
creep in Arrow itself while already providing value to C++ users and to 
other languages building on that layer.

Cheers,

Uwe

On 01.06.16 02:44, Wes McKinney wrote:
> hi folks,
>
> I wanted to bring up what is likely to become an issue very soon in
> the context of our work to provide an Arrow-based Parquet interface
> for Python Arrow users.
>
> https://github.com/apache/arrow/pull/83
>
> At the moment, parquet-cpp features an API that enables reading a file
> from local disk (using C standard library calls):
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111
>
> This is fine for now, however we will quickly need to deal with a few
> additional sources of data:
>
> 1) File-like Python objects (i.e. an object that has `seek`, `tell`,
> and `read` methods)
> 2) Remote blob stores: HDFS and S3
>
> Implementing #1 at present is a routine exercise in using the Python C
> API. #2 is less so -- one approach that has been taken by others is to
> create separate Python wrapper classes that make remote storage look
> file-like. This has multiple downsides:
>
> - read/seek/tell calls must cross up into the Python interpreter and
> back down into the C++ layer
> - bytes buffered by read calls get copied into Python bytes objects
> (see PyBytes_FromStringAndSize)
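
For illustration, here is a bare-bones sketch of wrapping a Python file-like
object behind a C++ read interface via the Python C API. The class is invented
for this example (it is not the actual PyArrow/parquet-cpp code), but it shows
where the interpreter round-trips and the extra byte copies come from:

    #include <Python.h>
    #include <cstdint>
    #include <cstring>
    #include <stdexcept>

    // Hypothetical wrapper: every call re-enters the interpreter (and thus
    // needs the GIL), and every read copies out of a Python bytes object --
    // the two downsides listed above.
    class PyFileSource {
     public:
      explicit PyFileSource(PyObject* file) : file_(file) { Py_INCREF(file_); }
      ~PyFileSource() { Py_DECREF(file_); }

      void Seek(int64_t position) {
        PyObject* result = PyObject_CallMethod(
            file_, "seek", "L", static_cast<long long>(position));
        if (result == nullptr) throw std::runtime_error("seek() failed");
        Py_DECREF(result);
      }

      int64_t Read(int64_t nbytes, uint8_t* out) {
        PyObject* result = PyObject_CallMethod(
            file_, "read", "L", static_cast<long long>(nbytes));
        if (result == nullptr) throw std::runtime_error("read() failed");
        char* data;
        Py_ssize_t length;
        if (PyBytes_AsStringAndSize(result, &data, &length) != 0) {
          Py_DECREF(result);
          throw std::runtime_error("read() did not return bytes");
        }
        std::memcpy(out, data, length);  // second copy, into the caller's buffer
        Py_DECREF(result);
        return length;
      }

     private:
      PyObject* file_;  // Python file-like object; we hold a reference
    };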
>
> Beyond the GIL / concurrency issues, there is an efficiency loss that
> can be remedied by instead implementing:
>
> - A direct C/C++-level interface (independent of the Python interpreter)
> to remote blob stores. This can buffer bytes directly in the
> form requested by other C++ consumer libraries (like parquet-cpp)
>
> - A Python file-like interface on top, so that users can still get
> at the bytes in pure Python if they want (for example: some functions,
> like pandas.read_csv, primarily deal with file-like things)
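
Roughly, the C++-level abstraction could look like the following. The
interface and the HdfsSource stub are illustrative rather than an existing
Arrow API; the point is that bytes stay in native buffers that consumers such
as parquet-cpp can read directly, with a Python file-like facade layered on
top only when users want it:

    #include <cstdint>

    // Illustrative interface; the names are made up for this sketch.
    class RandomAccessSource {
     public:
      virtual ~RandomAccessSource() = default;
      virtual int64_t Size() const = 0;
      virtual int64_t Tell() const = 0;
      virtual void Seek(int64_t position) = 0;
      // Reads up to nbytes into out and returns the number of bytes read.
      virtual int64_t Read(int64_t nbytes, uint8_t* out) = 0;
    };

    // A backend that talks to libhdfs (or an S3 client) directly, with no
    // interpreter involvement; parquet-cpp would consume it as-is.
    class HdfsSource : public RandomAccessSource {
      // ... implemented in terms of hdfsOpenFile / hdfsPread / hdfsSeek ...
    };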
>
> This is a clearly superior solution, and has been notably pursued in
> recent times by Dato's SFrame library (BSD 3-clause):
>
> https://github.com/dato-code/SFrame/tree/master/oss_src/fileio
>
> The problem, however, is the inevitable scope creep for the PyArrow
> Python package. Unlike in some other programming languages, Python
> programmers face a substantial development complexity burden if they
> choose to break libraries containing C extensions into smaller
> components, as the libraries must define "internal" C APIs for each
> other to connect to. A notable example is NumPy
> (http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose C
> API is already being used in PyArrow.
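
As a reminder of what that coupling looks like in practice, here is a minimal,
standalone extension module using NumPy's C API (a generic example, not
PyArrow's actual code):

    #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
    #include <Python.h>
    #include <numpy/arrayobject.h>

    // Allocates a 1-D float64 array of length 10 through NumPy's C API.
    static PyObject* make_empty_array(PyObject* self, PyObject* args) {
      npy_intp dims[1] = {10};
      return PyArray_SimpleNew(1, dims, NPY_FLOAT64);
    }

    static PyMethodDef methods[] = {
        {"make_empty_array", make_empty_array, METH_NOARGS, "demo"},
        {NULL, NULL, 0, NULL}};

    static struct PyModuleDef moduledef = {
        PyModuleDef_HEAD_INIT, "demo", NULL, -1, methods};

    PyMODINIT_FUNC PyInit_demo(void) {
      import_array();  // must run before any other NumPy C API call
      return PyModule_Create(&moduledef);
    }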
>
> I've been thinking about this problem for several weeks, and my net
> recommendation is that we embrace the scope creep in PyArrow (as long
> as we try to make optional features, e.g. low-level S3 / libhdfs
> integration, "opt-in" versus required for all users). I'd like to hear
> from some others, though (e.g. Uwe, Micah, etc.).
>
> thanks,
> Wes

