arrow-dev mailing list archives

From Wes McKinney <>
Subject Re: IO considerations for PyArrow
Date Thu, 16 Jun 2016 01:35:54 GMT
Hi folks,

I put some more thought into the "IO problem" as it relates to Arrow in
C++ (and transitively, Python) and wrote a short Google document with
my thoughts on it:

Feedback greatly appreciated! This will be on my critical path in the
near future, so I would like to know whether I'm approaching the problem
the right way and whether we are in alignment (we can then break things
down into a bunch of JIRAs).

(I can also post this doc directly to the mailing list; I thought the
initial discussion would be simpler in a GDoc.)

Thank you

On Wed, Jun 8, 2016 at 4:11 PM, Wes McKinney <> wrote:
> On Fri, Jun 3, 2016 at 10:16 AM, Micah Kornfield <> wrote:
>> Hi Wes,
>> At what level do you imagine the "opt-in" happening? Right now it
>> seems like it would be fairly straightforward at build time. However,
>> when we start packaging pyarrow for distribution, how do you imagine it
>> will work? (If [1] already answers this, please let me know; I've been
>> meaning to take a look at it.)
> Where packaging and distribution are concerned, it'd be easiest to
> provide non-picky users with a kitchen-sink build; otherwise,
> developers could create precisely the build they want with CMake
> flags, I guess. If certain libraries aren't found, we wouldn't
> fail the build by default, for example.
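
The opt-in build behavior described above could look something like the
following CMake fragment. This is only a sketch under assumptions: the
option names (`ARROW_IO`, `ARROW_HDFS`) and the `HDFS` find module are
hypothetical, not actual Arrow build flags.

```cmake
# Hypothetical flag names: a kitchen-sink default with per-component opt-out.
option(ARROW_IO "Build the arrow_io subsystem" ON)
option(ARROW_HDFS "Build the HDFS connector (requires libhdfs)" ON)

if(ARROW_HDFS)
  find_package(HDFS)    # hypothetical find module for libhdfs
  if(NOT HDFS_FOUND)
    # Degrade gracefully instead of failing the build by default.
    message(STATUS "libhdfs not found; skipping the HDFS connector")
    set(ARROW_HDFS OFF)
  endif()
endif()
```

A packager could then ship the kitchen-sink build, while a developer
disables components explicitly (e.g. `-DARROW_HDFS=OFF`).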
>> I need to grok the Python code base a little bit more to understand
>> the implications of the scope creep and the pain around taking a more
>> fine-grained component approach.  But in general my experience has
>> been that packaging things together while maintaining clear internal
>> code boundaries for later separation is a good pragmatic approach.
> I'd propose creating an `arrow_io` leaf shared library where we can
> create a small IO subsystem for reuse amongst different data
> connectors. We can leave things fairly coarse grained for the time
> being and break things up later if it becomes onerous for other Arrow
> developer-users.
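
To make the `arrow_io` idea concrete, here is a minimal sketch of what a
shared read interface plus one in-memory implementation might look like.
The names (`ReadableFile`, `BufferReader`, `Read`/`Seek`/`Tell`/`Size`)
are illustrative assumptions for this discussion, not the actual Arrow
C++ API.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical abstract read interface that local-file, memory-mapped,
// and HDFS-backed connectors could all implement.
class ReadableFile {
 public:
  virtual ~ReadableFile() = default;
  // Read up to nbytes into out; returns the number of bytes actually read.
  virtual int64_t Read(int64_t nbytes, uint8_t* out) = 0;
  virtual int64_t Tell() const = 0;
  virtual void Seek(int64_t position) = 0;
  virtual int64_t Size() const = 0;
};

// In-memory implementation: the kind of small leaf component that other
// data connectors in arrow_io would mirror.
class BufferReader : public ReadableFile {
 public:
  explicit BufferReader(std::vector<uint8_t> data) : data_(std::move(data)) {}

  int64_t Read(int64_t nbytes, uint8_t* out) override {
    int64_t available = static_cast<int64_t>(data_.size()) - pos_;
    int64_t n = nbytes < available ? nbytes : available;
    std::memcpy(out, data_.data() + pos_, static_cast<size_t>(n));
    pos_ += n;
    return n;
  }
  int64_t Tell() const override { return pos_; }
  void Seek(int64_t position) override { pos_ = position; }
  int64_t Size() const override { return static_cast<int64_t>(data_.size()); }

 private:
  std::vector<uint8_t> data_;
  int64_t pos_ = 0;
};
```

Keeping the interface this coarse-grained matches the "break things up
later" approach: connectors depend only on the abstract class, so the
library can be split without changing callers.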
>> As a side note, hopefully, we'll be able to re-use some existing
>> projects to do the heavy lifting for blob store integration.  SFrame
>> is one option [2] and [3] might be worth investigating as well (both
>> appear to be Apache 2.0 licensed).
> While requiring Java + $HADOOP_HOME for HDFS connectivity (wrapper
> around libhdfs) doesn't excite me that much, the prospect of bugs (or
> secure cluster issues) creeping up from a 3rd-party HDFS client
> without the ability to escalate problems to the Apache Hadoop team
> worries me even more. There is a new official C++ HDFS client in the
> works after the libhdfs3 patch was not accepted, so this may be
> worth pursuing once it matures.
> Thoughts on this welcome.
> - Wes
>> Thanks,
>> -Micah
>> [1]
>> [2]
>> [3]
