hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10388) Pure native hadoop client
Date Thu, 27 Mar 2014 19:06:19 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949803#comment-13949803 ]

Colin Patrick McCabe commented on HADOOP-10388:
-----------------------------------------------

bq. What will the project structure look like? A separate top-level hadoop-native-client project?
Or separate code files in the existing common/hdfs/yarn dirs?

I think a separate top-level project is best, since this will allow the YARN native client
and the HDFS native client to share code much more easily.  They will have a lot of shared
code.  We can have a Maven profile that causes this subproject to be built.

bq. Why the names libhdfs-core.so and libyarn-core.so? It's a client library; it doesn't sound
like "core."

I guess my thinking here is that these libraries speak the core Hadoop protocols.  I
am open to other names if you have something better.  One problem with choosing a name is
that "libhdfs" and "libhadoop" are already taken.  We also already have directories named
"native," so that would be confusing as well.  We also need a name that is fairly short, since
it will appear in header file names, object names, and so on.  We could do "libhdfs-ng.so",
I guess.

bq. In short, what libraries are planned to be used?

libuv, libprotobuf-c, something for XML parsing, and something for URI parsing.
CMake already ships with a unit test framework, CTest, so we can use that.
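
To make the libuv part concrete, here's a rough sketch (not actual project code) of the kind
of event-loop-driven connection setup the client's RPC layer would be built on.  Names like
on_connect are illustrative, and 8020 is just the usual NameNode RPC port:

{code}
/* Minimal libuv sketch: an async TCP connect driven by an event loop. */
#include <stdio.h>
#include <uv.h>

static void on_connect(uv_connect_t *req, int status) {
    if (status < 0) {
        fprintf(stderr, "connect failed: %s\n", uv_strerror(status));
    } else {
        printf("connected; the RPC handshake would start here\n");
    }
    uv_close((uv_handle_t *)req->handle, NULL);
}

int main(void) {
    uv_loop_t *loop = uv_default_loop();
    uv_tcp_t sock;
    uv_connect_t req;
    struct sockaddr_in addr;

    uv_tcp_init(loop, &sock);
    uv_ip4_addr("127.0.0.1", 8020, &addr);   /* NameNode RPC address */
    uv_tcp_connect(&req, &sock, (const struct sockaddr *)&addr, on_connect);

    /* runs callbacks until no active handles remain */
    return uv_run(loop, UV_RUN_DEFAULT);
}
{code}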

bq. I'd like the library to be lightweight; some people just want a header file and a statically
linked library (a few MB in size) to be able to read/write from HDFS.  So the heavier features,
the XML library (config file parsing), URI parsing (cross-FileSystem symlinks), and the thread
pool, had better be optional, not required.

I agree that having an option for static linking would be good.  We also need to think carefully
about compatibility and what the header file will look like.

The reason for supporting config file parsing is that we want this library to be a drop-in
replacement for libhdfs.so.  libhdfs.so is a JNI-based library used by a lot of C and C++
projects such as fuse_dfs and Impala.  libhdfs.so reads configuration XML files in the
usual way just by invoking the Java Configuration code.  If this library is not a drop-in
replacement for libhdfs.so, most projects simply will not be able to use it.  The other reason
for supporting config file parsing is that, well, you need some way of configuring the client!
If we end up re-inventing the configuration wheel in a different way, that will not be good
for anyone.
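
For reference, being a drop-in replacement means preserving the existing libhdfs C API, which
client code uses roughly like this (the path and the "default" connection string here are
just illustrative):

{code}
/* Sketch of the existing libhdfs API that the native library must keep working. */
#include <fcntl.h>
#include <stdio.h>
#include "hdfs.h"

int main(void) {
    /* "default" tells libhdfs to read fs.defaultFS from the XML config */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) return 1;

    hdfsFile f = hdfsOpenFile(fs, "/tmp/example.txt", O_RDONLY, 0, 0, 0);
    if (f) {
        char buf[4096];
        tSize n = hdfsRead(fs, f, buf, sizeof(buf));
        printf("read %d bytes\n", (int)n);
        hdfsCloseFile(fs, f);
    }
    return hdfsDisconnect(fs);
}
{code}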

Some clients may not want to read XML files, but simply set all the configuration keys themselves.
That's fine, and we can support this.  We can even make the XML-reading code optional if
you want.
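
Today's libhdfs already exposes that style of programmatic configuration through the
hdfsBuilder calls, and the native library could do the same.  A rough sketch (the host and
the config key shown are just examples):

{code}
/* Sketch: configuring the client entirely in code, with no XML files. */
#include <stdio.h>
#include "hdfs.h"

int main(void) {
    struct hdfsBuilder *bld = hdfsNewBuilder();
    hdfsBuilderSetNameNode(bld, "namenode.example.com");
    hdfsBuilderSetNameNodePort(bld, 8020);
    /* override an individual config key instead of reading *-site.xml */
    hdfsBuilderConfSetStr(bld, "dfs.client.read.shortcircuit", "true");

    hdfsFS fs = hdfsBuilderConnect(bld);   /* the builder is consumed here */
    if (!fs) {
        fprintf(stderr, "connect failed\n");
        return 1;
    }
    return hdfsDisconnect(fs);
}
{code}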

Thread pools and async I/O, I'm afraid, are something we can't live without.  The HDFS client
needs to do certain operations in the background.  If you study the existing DFSOutputStream
code, you'll see that the DFSOutputStream does transfers in the background while the client
continues to fill a buffer.  This is essential to get good performance, since otherwise we'd
have to stop and wait for the packet to be written to all 3 datanodes in the pipeline every
time our 64 KB packet filled up.  Take a look at the existing HDFS client code to get a sense
for what a native client would be like.
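
The shape of that background-transfer pattern, in miniature: the caller keeps filling packets
while a writer thread drains them, which is exactly what we need the thread pool and async
I/O for.  (A simplified sketch, not the actual DFSOutputStream logic:)

{code}
/* Producer/consumer sketch: the caller fills 64 KB packets while a
 * background thread drains them, instead of blocking on each packet. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define PACKET_SIZE (64 * 1024)
#define QUEUE_DEPTH 8

static char queue[QUEUE_DEPTH][PACKET_SIZE];
static int head, tail, count, done;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

static void *writer_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (count == 0 && !done)
            pthread_cond_wait(&not_empty, &mu);
        if (count == 0 && done) {
            pthread_mutex_unlock(&mu);
            break;
        }
        head = (head + 1) % QUEUE_DEPTH;   /* pop a packet */
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&mu);
        /* the real client would send the packet down the datanode pipeline */
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, writer_thread, NULL);
    for (int i = 0; i < 100; i++) {        /* caller keeps producing */
        pthread_mutex_lock(&mu);
        while (count == QUEUE_DEPTH)       /* only blocks when the queue is full */
            pthread_cond_wait(&not_full, &mu);
        memset(queue[tail], 0, PACKET_SIZE);
        tail = (tail + 1) % QUEUE_DEPTH;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mu);
    }
    pthread_mutex_lock(&mu);
    done = 1;
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&mu);
    pthread_join(tid, NULL);
    printf("all packets flushed in the background\n");
    return 0;
}
{code}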

> Pure native hadoop client
> -------------------------
>
>                 Key: HADOOP-10388
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10388
>             Project: Hadoop Common
>          Issue Type: New Feature
>    Affects Versions: HADOOP-10388
>            Reporter: Binglin Chang
>            Assignee: Colin Patrick McCabe
>
> A pure native hadoop client has the following use cases/advantages:
> 1.  writing YARN applications in C++
> 2.  direct access to HDFS, without extra proxy overhead, compared to the web/NFS interfaces
> 3.  wrapping the native library to support more languages, e.g. Python
> 4.  lightweight, with a small footprint compared to the several hundred MB of the JDK and
> Hadoop libraries with their various dependencies



--
This message was sent by Atlassian JIRA
(v6.2#6252)
