hadoop-hdfs-issues mailing list archives

From "James Clampffer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9758) libhdfs++: Implement Python bindings
Date Wed, 27 Apr 2016 16:26:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260423#comment-15260423
] 

James Clampffer commented on HDFS-9758:
---------------------------------------

Here are some of my thoughts on this; let me know what you think.  I haven't done much
in Python 3.x, so some of my assumptions might not hold true there.

My thinking was to focus on supporting CPython via ctypes, at least initially.  I have a patch
where I hacked together a demo of how this could be done; I'll dig it up and post it later today
(it doesn't support iterable files or readline() and isn't optimized, but otherwise works well
enough).  My overall opinion is that we should make it as easy as possible to access HDFS from
Python, so less configuration and fewer dependencies are really important for getting
people to use it.  Naturally, if a small amount of configuration leads to a huge performance
boost, then it's worth considering.
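
On the missing readline() and iteration support: one possible way to get both without writing
them by hand is to expose reads through the standard io module.  This is only a sketch under
assumptions, not what the demo patch does.  ReadOnlyRemoteFile, open_buffered and the read_at
callable are made-up names; read_at stands in for whatever positional read call the C API offers.

```python
import io


class ReadOnlyRemoteFile(io.RawIOBase):
    """Hypothetical raw file object backed by a positional read function."""

    def __init__(self, read_at):
        super().__init__()
        self._read_at = read_at  # assumed callable(offset, size) -> bytes
        self._offset = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Fill the caller's buffer; returning 0 signals end of file.
        chunk = self._read_at(self._offset, len(buffer))
        buffer[: len(chunk)] = chunk
        self._offset += len(chunk)
        return len(chunk)


def open_buffered(read_at):
    # BufferedReader layers readline() and line iteration over readinto().
    return io.BufferedReader(ReadOnlyRemoteFile(read_at))
```

With an in-memory stand-in such as `data = b"a\nbb\n"` and
`open_buffered(lambda off, n: data[off:off + n])`, the returned object can be iterated line by line.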

I think CPython is the best place to focus simply because of its ubiquity.  PyPy is a cool
project but doesn't come installed by default on many Linux distributions as far as I know.
CPython ships with ctypes, so that's one less dependency to bring in (unless CFFI is also
included as a default library), but as you said you're pretty much stuck writing C wrapper
functions for everything.  I don't think that's a dealbreaker, as forcing a C API walls off
exceptions and other things that shouldn't be getting into the interpreter anyway.  Does Cython
get you a whole lot of benefits over something like ctypes?  I don't have experience with
it.
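
To make the ctypes route concrete, here is a minimal sketch of the pattern: load a shared
library, declare a C signature, and turn a C error return into a Python exception on the Python
side of the boundary.  It uses libc's close() on a POSIX system purely as a stand-in; a real
binding would load the libhdfs++ shared library and declare its C API functions the same way.
The names _raise_on_error and close_fd are made up for this example.

```python
import ctypes
import ctypes.util
import os

# Stand-in library (assumption): libc on a POSIX system.  A real binding
# would load the libhdfs++ shared object here instead.
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def _raise_on_error(result, func, args):
    # The C side only reports an error code; the exception is raised in Python.
    if result == -1:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


# Declare the C signature: int close(int fd);
_libc.close.argtypes = [ctypes.c_int]
_libc.close.restype = ctypes.c_int
_libc.close.errcheck = _raise_on_error


def close_fd(fd):
    """Hypothetical wrapper: close a descriptor, raising OSError on failure."""
    return _libc.close(fd)
```

Nothing beyond the standard library is needed, which is the main appeal; the cost is that
every function exposed this way needs a plain C signature.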

Boost.Python or a pure Python extension would most likely be the cleanest and most performant
way of doing this sort of thing, at the expense of extra complexity.  I've also heard that
Hadoop and Boost generally don't mix, but we've already made an exception for boost::asio (maybe
that's different because it's header-only?).  The only concern I'd have with both would be
that they tie the module to the libhdfs++ C++ ABI, so we'd have to be careful about compatibility.
I could see writing a module being a big benefit, because then we could hook into the GC to
properly support garbage-collected async operations.
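
For comparison, the GC concern also shows up with ctypes: a callback object handed to C has to
stay referenced from Python until the C side is done with it, otherwise it can be collected
while an operation is still in flight.  Below is a rough sketch of one way to handle that by
keeping callbacks in a registry until completion.  The callback signature, PendingOps, and its
methods are assumptions for illustration; complete() simulates the library invoking the callback.

```python
import ctypes
import itertools

# Hypothetical completion callback signature: void (*)(int status)
COMPLETION_CB = ctypes.CFUNCTYPE(None, ctypes.c_int)


class PendingOps:
    """Keeps ctypes callback objects alive while operations are in flight."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._live = {}

    def register(self, py_callback):
        # Storing the CFUNCTYPE object prevents it from being collected.
        op_id = next(self._ids)
        c_callback = COMPLETION_CB(py_callback)
        self._live[op_id] = c_callback
        return op_id, c_callback

    def complete(self, op_id, status):
        # Stand-in for the C library firing the callback; drop the reference after.
        c_callback = self._live.pop(op_id)
        c_callback(status)

    def __len__(self):
        return len(self._live)
```

A native extension module could instead manage these references through the interpreter's own
reference counting, which is the benefit mentioned above.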

I think it's important to make sure at least some of this work can help implement bindings for
other languages, but I think most approaches would do that in one way or another.  I'm partial
to building language-specific wrappers over the C API, just because most scripting languages
have a way of calling C functions.

> libhdfs++: Implement Python bindings
> ------------------------------------
>
>                 Key: HDFS-9758
>                 URL: https://issues.apache.org/jira/browse/HDFS-9758
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>
> It'd be really useful to have bindings for various scripting languages.  Python would
> be a good start because of its popularity and how easy it is to interact with shared libraries
> using the ctypes module.  I think bindings for the V8 engine that Node.js uses would be a close
> second in terms of expanding the potential user base.
> Probably worth starting with just adding a synchronous API and building from there to
> avoid interactions with Python's garbage collector until the bindings prove to be solid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
