hadoop-hdfs-dev mailing list archives

From Chris Douglas <cdoug...@apache.org>
Subject Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
Date Thu, 01 Mar 2018 18:54:46 GMT
On Thu, Mar 1, 2018 at 10:04 AM, Jim Clampffer
<james.clampffer@gmail.com> wrote:
> Chris, do you mean potentially landing this in its current state and
> handling some of the rough edges after?  I could see this working just
> because there's no impact on any existing code.

Yes. Better to get this committed and released than to polish it in
the branch. -C

> With regards to your questions Kai:
> There isn't a good doc for the internal architecture yet; I just reassigned
> HDFS-9115 to myself to handle that.  Are there any specific areas you'd like
> to know about so I can prioritize those?
> Here are some header files with a lot of comments that should help out
> for now:
> -hdfspp.h - main header for the C++ API
> -filesystem.h and filehandle.h - describes some rules about object lifetimes
> and threading from the API point of view (most classes have comments
> describing any restrictions on threading, locking, and lifecycle).
> -rpc_engine.h and rpc_connection.h begin getting into the async RPC
> implementation.
> 1) Yes, it's a reimplementation of the entire client in C++.  Using libhdfs3
> as a reference helps a lot here but it's still a lot of work.
> 2) EC isn't supported now, though that'd be great to have, and I agree that
> it's going to take a lot of effort to implement.  Right now if you tried
> to read an EC file I think you'd get an unhelpful error out of the block
> reader, but I don't have an EC-enabled cluster set up to test.  Adding an
> explicit "not supported" message would be straightforward.
> 3) libhdfs++ reuses all of the minidfscluster tests that libhdfs already had
> so we get consistency checks on the C API.  There are a few new tests that
> also run on both libhdfs and libhdfs++ and make sure the expected output
> is the same too.
> 4) I agree, I just haven't had a chance to look into the distribution build
> to see how to do it.  HDFS-9465 is tracking this.
> 5) Not yet (HDFS-8765).
> Regards,
> James
> On Thu, Mar 1, 2018 at 4:28 AM, 郑锴(铁杰) <zhengkai.zk@alibaba-inc.com> wrote:
>> The work sounds solid and great! + to have this.
>> Is there any quick doc to take a glance at? Some quick questions to be
>> familiar with:
>> 1. Seems the client is all implemented in C++ without any Java code (so
>> no JVM overhead), which means lots of work rewriting the HDFS client. Right?
>> 2. Guess the erasure coding feature isn't supported, as it'd involve
>> significant development, right? If yes, what will it say when reading an
>> erasure-coded file?
>> 3. Is there any building/testing mechanism to enforce the consistency
>> between the c++ part and Java part?
>> 4. I thought the public headers and lib should be exported when building
>> the distribution package; otherwise it's hard to use the new C API.
>> 5. Is the short-circuit read supported?
>> Thanks.
>> Regards,
>> Kai
>> ------------------------------------------------------------------
>> From: Chris Douglas <cdouglas@apache.org>
>> Sent: Thursday, March 1, 2018 05:08
>> To: Jim Clampffer <james.clampffer@gmail.com>
>> Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
>> Subject: Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
>> +1
>> Let's get this done. We've had many false starts on a native HDFS
>> client. This is a good base to build on. -C
>> On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer
>> <james.clampffer@gmail.com> wrote:
>> > Hi everyone,
>> >
>> > I'd like to start a thread to discuss merging HDFS-8707, aka libhdfs++,
>> > into trunk.  I originally sent a similar email out last October, but it
>> > sounds like it was buried by discussions about other feature merges that
>> > were going on at the time.
>> >
>> > libhdfs++ is an HDFS client written in C++, designed to be used in
>> > applications written in non-JVM languages.  In its current state it
>> > supports Kerberos-authenticated reads from HDFS and has been used in
>> > production clusters for over a year, so it has had a significant amount
>> > of burn-in time.  The HDFS-8707 branch has been around for about 2 years
>> > now, so I'd like to know people's thoughts on what it would take to
>> > merge the current branch and handle writes and encrypted reads in a new
>> > one.
>> >
>> > Current notable features:
>> >   -A libhdfs/libhdfs3-compatible C API that allows libhdfs++ to serve
>> > as a drop-in replacement for clients that only need read support (until
>> > libhdfs++ also supports writes).
>> >   -An asynchronous C++ API with synchronous shims on top if the client
>> > application wants to do blocking operations.  Internally a single thread
>> > (optionally more) uses select/epoll by way of boost::asio to watch
>> > thousands of sockets without the overhead of spawning threads to emulate
>> > async operation.
>> >   -Kerberos/SASL authentication support
>> >   -HA namenode support
>> >   -A set of utility programs that mirror the HDFS CLI utilities, e.g.
>> > "./hdfs dfs -chmod".  The major benefit of these is that tool startup
>> > time is ~3 orders of magnitude faster (<1ms vs hundreds of ms) and they
>> > occupy a lot less memory since there's no JVM involved.  This makes it
>> > possible to do things like write a simple bash script that stats a file,
>> > applies some rules to the result, and decides whether to move it, in a
>> > way that scales to thousands of files without being penalized with O(N)
>> > JVM startups.
>> >   -Cancelable reads.  This has proven to be very useful in multiuser
>> > applications that (pre)fetch large blocks of data but need to remain
>> > responsive for interactive users.  Rather than waiting for a large
>> > and/or slow read to finish, it will return immediately and the
>> > associated resources (buffer, file descriptor) become available for
>> > the rest of the application to use.
>> >
>> > There are a couple of known issues: the doc build isn't integrated with
>> > the rest of Hadoop, and the public API headers aren't being exported
>> > when building a distribution.  A short-term solution for the missing
>> > docs is to go through the libhdfs(3)-compatible API and use the libhdfs
>> > docs.  Other than a few modifications to the pom files to integrate the
>> > build, the changes are isolated to a new directory, so the chance of
>> > causing any regressions in the rest of the code is minimal.
>> >
>> > Please share your thoughts, thanks!
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
>> For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org

To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org
