hadoop-hdfs-dev mailing list archives

From Jim Clampffer <james.clampf...@gmail.com>
Subject Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
Date Thu, 01 Mar 2018 18:04:56 GMT
Thanks for the feedback Chris and Kai!

Chris, do you mean potentially landing this in its current state and
handling some of the rough edges after?  I could see this working just
because there's no impact on any existing code.

With regard to your questions, Kai:
There isn't a good doc for the internal architecture yet; I just reassigned
HDFS-9115 to myself to handle that.  Are there any specific areas you'd
like to know about so I can prioritize those?
Here are some header files that include a lot of comments and should help
out for now:
-hdfspp.h - main header for the C++ API
-filesystem.h and filehandle.h - describes some rules about object
lifetimes and threading from the API point of view (most classes have
comments describing any restrictions on threading, locking, and lifecycle).
-rpc_engine.h and rpc_connection.h begin getting into the async RPC

1) Yes, it's a reimplementation of the entire client in C++.  Using
libhdfs3 as a reference helps a lot here but it's still a lot of work.
2) EC isn't supported now, though that'd be great to have, and I agree that
it's going to take a lot of effort to implement.  Right now if you tried
to read an EC file I think you'd get some unhelpful error out of the block
reader, but I don't have an EC-enabled cluster set up to test.  Adding an
explicit "not supported" message would be straightforward.
3) libhdfs++ reuses all of the minidfscluster tests that libhdfs already
had, so we get consistency checks on the C API.  There are a few new tests
that also get run on both libhdfs and libhdfs++ and make sure the expected
output is the same too.
4) I agree, I just haven't had a chance to look into the distribution build
to see how to do it.  HDFS-9465 is tracking this.
5) Not yet (HDFS-8765).
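
For anyone unfamiliar with the "asynchronous API with synchronous shims"
design mentioned in the quoted proposal below, here is a minimal sketch of
the general pattern.  It is not libhdfs++ code: EventLoop, ReadResult,
async_read and sync_read are hypothetical names, the payload is faked, and
a toy single-threaded task queue stands in for the boost::asio event loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Hypothetical names throughout; this is not the libhdfs++ API.
struct ReadResult {
    bool ok = false;
    std::string data;
};

// Toy single-threaded event loop standing in for a real reactor.
class EventLoop {
public:
    void post(std::function<void()> task) { tasks_.push(std::move(task)); }

    // Runs one queued task; returns false when the queue is empty.
    bool run_one() {
        if (tasks_.empty()) return false;
        std::function<void()> task = std::move(tasks_.front());
        tasks_.pop();
        task();
        return true;
    }

private:
    std::queue<std::function<void()>> tasks_;
};

// Asynchronous primitive: queues the work and returns immediately.
// The callback fires later, when the loop runs the queued task.
void async_read(EventLoop& loop, std::string path, std::size_t length,
                std::function<void(ReadResult)> on_done) {
    assert(on_done && "a completion callback is required");
    loop.post([path = std::move(path), length, on_done = std::move(on_done)]() {
        ReadResult r;
        r.ok = !path.empty();                  // empty path = simulated failure
        if (r.ok) r.data.assign(length, 'x');  // fake payload, no real I/O
        on_done(std::move(r));
    });
}

// Synchronous shim: start the async call, then drive the loop until
// the completion callback has stored a result.
ReadResult sync_read(EventLoop& loop, const std::string& path,
                     std::size_t length) {
    bool finished = false;
    ReadResult result;
    async_read(loop, path, length, [&](ReadResult r) {
        result = std::move(r);
        finished = true;
    });
    while (!finished && loop.run_one()) {
    }
    return result;
}
```

The point of the layering is that the blocking call adds no second code
path: it is only the async call plus a wait, so both styles share the same
implementation underneath.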


On Thu, Mar 1, 2018 at 4:28 AM, 郑锴(铁杰) <zhengkai.zk@alibaba-inc.com> wrote:

> The work sounds solid and great! +1 to have this.
> Is there any quick doc to take a glance at? Some quick questions to be
> familiar with:
> 1. Seems the client is all implemented in c++ without any Java codes (so
> no JVM overhead), which means lots of work, rewriting HDFS client. Right?
> 2.  Guess erasure coding feature isn't supported, as it'd involve
> significant development, right? If yes, what will it say when read erasure
> coded file?
> 3. Is there any building/testing mechanism to enforce the consistency
> between the c++ part and Java part?
> 4. I thought the public header and lib should be exported when building
> the distribution package, otherwise hard to use the new C api.
> 5. Is the short-circuit read supported?
> Thanks.
> Regards,
> Kai
> ------------------------------------------------------------------
> From: Chris Douglas <cdouglas@apache.org>
> Sent: Thursday, March 1, 2018 05:08
> To: Jim Clampffer <james.clampffer@gmail.com>
> Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
> Subject: Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
> +1
> Let's get this done. We've had many false starts on a native HDFS
> client. This is a good base to build on. -C
> On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer
> <james.clampffer@gmail.com> wrote:
> > Hi everyone,
> >
> > I'd like to start a thread to discuss merging HDFS-8707, aka libhdfs++,
> > into trunk.  I originally sent a similar email out last October but it
> > sounds like it was buried by discussions about other feature merges that
> > were going on at the time.
> >
> > libhdfs++ is an HDFS client written in C++ designed to be used in
> > applications that are written in non-JVM based languages.  In its current
> > state it supports Kerberos-authenticated reads from HDFS and has been used
> > in production clusters for over a year, so it has had a significant amount
> > of burn-in time.  The HDFS-8707 branch has been around for about 2 years
> > now, so I'd like to know people's thoughts on what it would take to merge
> > the current branch and handle writes and encrypted reads in a new one.
> >
> > Current notable features:
> >   -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as a
> > drop-in replacement for clients that only need read support (until
> > libhdfs++ also supports writes).
> >   -An asynchronous C++ API with synchronous shims on top if the client
> > application wants to do blocking operations.  Internally a single thread
> > (optionally more) uses select/epoll by way of boost::asio to watch
> > thousands of sockets without the overhead of spawning threads to emulate
> > async operation.
> >   -Kerberos/SASL authentication support
> >   -HA namenode support
> >   -A set of utility programs that mirror the HDFS CLI utilities, e.g.
> > "./hdfs dfs -chmod".  The major benefit of these is that tool startup
> > is ~3 orders of magnitude faster (<1ms vs hundreds of ms) and the tools
> > occupy a lot less memory since they aren't dealing with the JVM.  This
> > makes it possible to do things like write a simple bash script that stats
> > a file, applies some rules to the result, and decides if it should move
> > it, in a way that scales to thousands of files without being penalized
> > with O(N) JVM startups.
> >   -Cancelable reads.  This has proven to be very useful in multiuser
> > applications that (pre)fetch large blocks of data but need to remain
> > responsive for interactive users.  Rather than waiting for a large and/or
> > slow read to finish, it will return immediately and the associated
> > resources (buffer, file descriptor) become available for the rest of the
> > application to use.
> >
> > There are a couple of known issues: the doc build isn't integrated with
> > the rest of hadoop and the public API headers aren't being exported when
> > building a distribution.  A short term solution for missing docs is to go
> > through the libhdfs(3) compatible API and use the libhdfs docs.  Other
> > than a few modifications to the pom files to integrate the build, the
> > changes are isolated to a new directory, so the chance of causing any
> > regressions in the rest of the code is minimal.
> >
> > Please share your thoughts, thanks!
