arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: IO considerations for PyArrow
Date Wed, 08 Jun 2016 23:11:20 GMT
On Fri, Jun 3, 2016 at 10:16 AM, Micah Kornfield <emkornfield@gmail.com> wrote:
> Hi Wes,
>
> At what level do you imagine the "opt-in" happening?  Right now it
> seems like it would be fairly straightforward at build time.  However,
> when we start packaging pyarrow for distribution, how do you imagine it
> will work? (If [1] already answers this, please let me know; I've been
> meaning to take a look at it.)
>

Where packaging and distribution are concerned, it'd be easiest to
provide non-picky users with a kitchen-sink build, but otherwise
developers could create precisely the build they want with CMake
flags, I guess. For example, if certain optional libraries aren't
found, then we wouldn't fail the build by default.
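
To make the "don't fail by default" behavior concrete, here is a rough
sketch in Python of the decision logic only. It is not CMake and not
anything that exists in Arrow: the function name, the component names,
and the `strict` switch are all hypothetical.

```python
# Hypothetical sketch: none of these names are real Arrow or CMake options.
def resolve_components(requested, found_libraries, strict=False):
    """Decide which optional components to enable.

    requested:       dict mapping component name -> library it needs
    found_libraries: set of library names detected on the system
    strict:          if True, a missing library is an error, not a skip
    """
    enabled, skipped = [], []
    for component, library in sorted(requested.items()):
        if library in found_libraries:
            enabled.append(component)
        elif strict:
            raise RuntimeError(
                f"{component} requires {library}, which was not found"
            )
        else:
            # Default behavior: quietly leave the component out of the build.
            skipped.append(component)
    return enabled, skipped


if __name__ == "__main__":
    requested = {"hdfs": "libhdfs", "s3": "aws-sdk-cpp"}
    enabled, skipped = resolve_components(requested, {"libhdfs"})
    print("enabled:", enabled)
    print("skipped:", skipped)
```

A kitchen-sink package would be built with every library present, while
a developer who needs a guarantee could use the strict mode to turn a
missing dependency into a hard failure.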

> I need to grok the Python code base a little bit more to understand
> the implications of the scope creep and the pain around taking a more
> fine-grained component approach.  But in general my experience has
> been that packaging things together while maintaining clear internal
> code boundaries for later separation is a good pragmatic approach.
>

I'd propose creating an `arrow_io` leaf shared library where we can
create a small IO subsystem for reuse amongst different data
connectors. We can leave things fairly coarse grained for the time
being and break things up later if it becomes onerous for other Arrow
developer-users.
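
As a rough illustration of what "a small IO subsystem for reuse" could
mean, here is a minimal Python sketch. The real library would be C++,
and every name below (`RandomAccessReader`, `MemoryReader`,
`LocalFileReader`, `read_tail`) is made up for the example rather than
taken from any Arrow API. The point is only that a connector written
against one small interface works with any backend.

```python
import abc
import os
import tempfile


class RandomAccessReader(abc.ABC):
    """Hypothetical minimal interface a shared IO layer might expose."""

    @abc.abstractmethod
    def size(self) -> int:
        """Total number of bytes available."""

    @abc.abstractmethod
    def read_at(self, offset: int, nbytes: int) -> bytes:
        """Read up to nbytes starting at offset."""


class MemoryReader(RandomAccessReader):
    """Backend over an in-memory bytes object."""

    def __init__(self, data: bytes):
        self._data = data

    def size(self) -> int:
        return len(self._data)

    def read_at(self, offset: int, nbytes: int) -> bytes:
        return self._data[offset:offset + nbytes]


class LocalFileReader(RandomAccessReader):
    """Backend over a file on the local filesystem."""

    def __init__(self, path: str):
        self._path = path

    def size(self) -> int:
        return os.path.getsize(self._path)

    def read_at(self, offset: int, nbytes: int) -> bytes:
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(nbytes)


def read_tail(reader: RandomAccessReader, nbytes: int) -> bytes:
    """Connector-side helper: read the last nbytes from any backend."""
    total = reader.size()
    start = max(0, total - nbytes)
    return reader.read_at(start, total - start)


if __name__ == "__main__":
    print(read_tail(MemoryReader(b"hello world"), 5))
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"abcdef")
    print(read_tail(LocalFileReader(tmp.name), 3))
    os.remove(tmp.name)
```

An HDFS or blob-store backend would be one more subclass, which is why
keeping this layer in a single leaf library seems manageable for now.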

> As a side note, hopefully, we'll be able to re-use some existing
> projects to do the heavy lifting for blob store integration.  SFrame
> is one option [2] and [3] might be worth investigating as well (both
> appear to be Apache 2.0 licensed).

While requiring Java + $HADOOP_HOME for HDFS connectivity (wrapper
around libhdfs) doesn't excite me that much, the prospect of bugs (or
secure cluster issues) creeping up from a 3rd-party HDFS client
without the ability to escalate problems to the Apache Hadoop team
worries me even more. There is a new official C++ HDFS client in the
works after the libhdfs3 patch was not accepted
(https://issues.apache.org/jira/browse/HDFS-8707), so this may be
worth pursuing once it matures.
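
For the libhdfs route, the Java + $HADOOP_HOME requirement at least
lends itself to an up-front check. The sketch below is only an
illustration: the function name is hypothetical, and treating either
JAVA_HOME or a `java` executable on PATH as "Java is present" is my
assumption, not a statement of what libhdfs itself needs to load.

```python
import os
import shutil


def libhdfs_prerequisites(env=None):
    """Return a list of problems; an empty list means the basics look present.

    Hypothetical helper. The checks are assumptions for illustration and
    do not guarantee that libhdfs will actually load.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("HADOOP_HOME"):
        problems.append("HADOOP_HOME is not set")
    has_java_home = bool(env.get("JAVA_HOME"))
    has_java_on_path = shutil.which("java", path=env.get("PATH", "")) is not None
    if not (has_java_home or has_java_on_path):
        problems.append("no Java found (JAVA_HOME unset, no java on PATH)")
    return problems


if __name__ == "__main__":
    for problem in libhdfs_prerequisites():
        print("missing:", problem)
```

Reporting these problems when the HDFS connector is first used would
make the Java dependency less surprising, even if it doesn't remove it.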

Thoughts on this welcome.

- Wes

>
> Thanks,
> -Micah
>
> [1] https://github.com/apache/arrow/pull/79/files
> [2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
> [3] https://github.com/aws/aws-sdk-cpp
>
>
