hadoop-hdfs-dev mailing list archives

From Vinayakumar B <vinayakum...@apache.org>
Subject Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
Date Thu, 01 Mar 2018 18:31:19 GMT
This would definitely be a great addition. Kudos to everyone for their
contributions.

I am not a C++ expert, so I cannot vote on the code.

  ---A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as
a drop-in replacement for clients that only need read support (until libhdfs++
also supports writes).

Wouldn't it be nice to have write support as well before merge...?
If everyone feels it's okay to have read support alone for now, I am okay with that.

On 1 Mar 2018 11:35 pm, "Jim Clampffer" <james.clampffer@gmail.com> wrote:

> Thanks for the feedback Chris and Kai!
>
> Chris, do you mean potentially landing this in its current state and
> handling some of the rough edges after?  I could see this working just
> because there's no impact on any existing code.
>
> With regard to your questions Kai:
> There isn't a good doc for the internal architecture yet; I just reassigned
> HDFS-9115 to myself to handle that.  Are there any specific areas you'd
> like to know about so I can prioritize those?
> Here are some header files that include a lot of comments that should help
> out for now:
> -hdfspp.h - main header for the C++ API
> -filesystem.h and filehandle.h - describe some rules about object
> lifetimes and threading from the API point of view (most classes have
> comments describing any restrictions on threading, locking, and lifecycle).
> -rpc_engine.h and rpc_connection.h - begin getting into the async RPC
> implementation.
>
>
> 1) Yes, it's a reimplementation of the entire client in C++.  Using
> libhdfs3 as a reference helps a lot here but it's still a lot of work.
> 2) EC isn't supported now, though that'd be great to have, and I agree that
> it's going to take a lot of effort to implement.  Right now if you tried
> to read an EC file I think you'd get some unhelpful error out of the block
> reader, but I don't have an EC-enabled cluster set up to test.  Adding an
> explicit "not supported" message would be straightforward.
> 3) libhdfs++ reuses all of the minidfscluster tests that libhdfs already
> had, so we get consistency checks on the C API.  There are a few new tests
> that also get run on both libhdfs and libhdfs++ and make sure the expected
> output is the same too.
> 4) I agree, I just haven't had a chance to look into the distribution build
> to see how to do it.  HDFS-9465 is tracking this.
> 5) Not yet (HDFS-8765).
>
> Regards,
> James
>
>
>
>
> On Thu, Mar 1, 2018 at 4:28 AM, 郑锴(铁杰) <zhengkai.zk@alibaba-inc.com>
> wrote:
>
> > The work sounds solid and great! +1 to have this.
> >
> > Is there a quick doc to glance at? A few quick questions to get
> > familiar with it:
> > 1. It seems the client is implemented entirely in C++ without any Java
> > code (so no JVM overhead), which means a lot of work rewriting the HDFS
> > client. Right?
> > 2. I guess the erasure coding feature isn't supported, as it'd involve
> > significant development, right? If so, what will it report when reading
> > an erasure coded file?
> > 3. Is there any build/test mechanism to enforce consistency between
> > the C++ part and the Java part?
> > 4. I think the public header and lib should be exported when building
> > the distribution package; otherwise it's hard to use the new C API.
> > 5. Is short-circuit read supported?
> >
> > Thanks.
> >
> >
> > Regards,
> > Kai
> >
> > ------------------------------------------------------------------
> > From: Chris Douglas <cdouglas@apache.org>
> > Sent: Thursday, March 1, 2018 05:08
> > To: Jim Clampffer <james.clampffer@gmail.com>
> > Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
> > Subject: Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
> >
> > +1
> >
> > Let's get this done. We've had many false starts on a native HDFS
> > client. This is a good base to build on. -C
> >
> > On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer
> > <james.clampffer@gmail.com> wrote:
> > > Hi everyone,
> > >
> > > I'd like to start a thread to discuss merging HDFS-8707, aka
> > > libhdfs++, into trunk.  I originally sent a similar email out last
> > > October but it sounds like it was buried by discussions about other
> > > feature merges that were going on at the time.
> > >
> > > libhdfs++ is an HDFS client written in C++ designed to be used in
> > > applications that are written in non-JVM based languages.  In its
> > > current state it supports Kerberos authenticated reads from HDFS and
> > > has been used in production clusters for over a year, so it has had a
> > > significant amount of burn-in time.  The HDFS-8707 branch has been
> > > around for about 2 years now, so I'd like to know people's thoughts on
> > > what it would take to merge the current branch and handle writes and
> > > encrypted reads in a new one.
> > >
> > > Current notable features:
> > >   -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve
> > > as a drop-in replacement for clients that only need read support
> > > (until libhdfs++ also supports writes).
> > >   -An asynchronous C++ API with synchronous shims on top if the client
> > > application wants to do blocking operations.  Internally a single
> > > thread (optionally more) uses select/epoll by way of boost::asio to
> > > watch thousands of sockets without the overhead of spawning threads to
> > > emulate async operation.
> > >   -Kerberos/SASL authentication support
> > >   -HA namenode support
> > >   -A set of utility programs that mirror the HDFS CLI utilities e.g.
> > > "./hdfs dfs -chmod".  The major benefit of these is that tool startup
> > > is ~3 orders of magnitude faster (<1ms vs hundreds of ms) and the tools
> > > occupy a lot less memory since they aren't dealing with the JVM.  This
> > > makes it possible to do things like write a simple bash script that
> > > stats a file, applies some rules to the result, and decides if it
> > > should move it in a way that scales to thousands of files without
> > > being penalized with O(N) JVM startups.
> > >   -Cancelable reads.  This has proven to be very useful in multiuser
> > > applications that (pre)fetch large blocks of data but need to remain
> > > responsive for interactive users.  Rather than waiting for a large
> > > and/or slow read to finish, a canceled read will return immediately
> > > and the associated resources (buffer, file descriptor) become
> > > available for the rest of the application to use.
> > >
> > > There are a couple of known issues: the doc build isn't integrated
> > > with the rest of hadoop and the public API headers aren't being
> > > exported when building a distribution.  A short term solution for
> > > missing docs is to go through the libhdfs(3) compatible API and use
> > > the libhdfs docs.  Other than a few modifications to the pom files to
> > > integrate the build, the changes are isolated to a new directory, so
> > > the chance of causing any regressions in the rest of the code is
> > > minimal.
> > >
> > > Please share your thoughts, thanks!
>
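
For readers unfamiliar with the "asynchronous API with synchronous shims"
design mentioned in the feature list, here is a minimal sketch of the general
pattern in standard C++. Every name in it (FakeAsyncFile, AsyncRead, Read,
ReadCallback) is hypothetical and is not taken from the libhdfs++ headers. A
plain worker thread stands in for the boost::asio event loop, and an in-memory
string stands in for a remote file, so treat it as an illustration of the idea
rather than the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <string>
#include <thread>
#include <utility>

// Hypothetical callback type: reports success/failure plus the bytes read.
using ReadCallback = std::function<void(bool ok, std::string data)>;

// Hypothetical stand-in for a file handle; NOT the libhdfs++ API.
class FakeAsyncFile {
 public:
  explicit FakeAsyncFile(std::string contents)
      : contents_(std::move(contents)) {}

  ~FakeAsyncFile() {
    if (worker_.joinable()) worker_.join();
  }

  // Asynchronous primitive: returns immediately, result arrives via callback.
  // Not thread-safe; one outstanding read at a time in this sketch.
  void AsyncRead(std::size_t offset, std::size_t len, ReadCallback cb) {
    if (worker_.joinable()) worker_.join();  // wait for any previous read
    worker_ = std::thread([this, offset, len, cb = std::move(cb)] {
      if (offset > contents_.size()) {
        cb(false, std::string());  // out-of-range read reported as failure
        return;
      }
      cb(true, contents_.substr(offset, len));
    });
  }

  // Synchronous shim: starts the async read, then blocks on a future until
  // the callback has delivered the result.
  bool Read(std::size_t offset, std::size_t len, std::string* out) {
    assert(out != nullptr);
    using Result = std::pair<bool, std::string>;
    // shared_ptr keeps the promise alive until the callback is destroyed.
    auto done = std::make_shared<std::promise<Result>>();
    std::future<Result> fut = done->get_future();
    AsyncRead(offset, len, [done](bool ok, std::string data) {
      done->set_value(Result(ok, std::move(data)));
    });
    Result result = fut.get();  // blocks the caller, not the worker
    *out = std::move(result.second);
    return result.first;
  }

 private:
  std::string contents_;
  std::thread worker_;
};
```

A blocking caller would construct a FakeAsyncFile and call Read(offset, len,
&out); a caller that must stay responsive would call AsyncRead directly and
handle the result in the callback. Layering the blocking call on top of the
async one means only a single code path actually performs the I/O.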
