hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-6994) libhdfs3 - A native C/C++ HDFS client
Date Fri, 20 Feb 2015 18:47:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329333#comment-14329333
] 

Colin Patrick McCabe edited comment on HDFS-6994 at 2/20/15 6:47 PM:
---------------------------------------------------------------------

bq. [~wheat9] wrote: I'm concerned about this. What are the guarantees of the APIs for the
releases? Are the APIs / ABIs going to be compatible once we remove exceptions in later versions?
Can the user simply do a drop-in replacement to upgrade? For the part of libhdfs binding the
answer might be yes, but my general impression is no due to the complexity of SEH on Windows
and various quirks on the implementation of the C++ exceptions.

The current plan is to expose only the existing {{libhdfs.h}} API for now.  Since this is
a C API, it does not include exceptions, clearly.  So I do not think there will be a problem
with this.
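
As a rough illustration of why a C API keeps exceptions away from callers, here is a minimal sketch of a C entry point wrapping C++ code that throws internally.  The names {{demoOpen}} and {{openInternal}} are hypothetical and are not part of {{libhdfs.h}}; the errno mapping is likewise just an assumption for the example.

```cpp
#include <cassert>
#include <cerrno>
#include <new>
#include <stdexcept>
#include <string>

namespace {
// Hypothetical internal C++ routine that reports failure by throwing.
int openInternal(const std::string& path) {
    if (path.empty()) {
        throw std::invalid_argument("empty path");
    }
    return 3;  // pretend file handle, for illustration only
}
}  // namespace

// Hypothetical C entry point (NOT a real libhdfs.h function).
// No exception may cross this boundary: every failure becomes a
// -1 return value plus an errno code.
extern "C" int demoOpen(const char* path) {
    try {
        if (path == nullptr) {
            errno = EINVAL;
            return -1;
        }
        int handle = openInternal(path);
        assert(handle >= 0);  // internal invariant of this sketch
        return handle;
    } catch (const std::invalid_argument&) {
        errno = EINVAL;
    } catch (const std::bad_alloc&) {
        errno = ENOMEM;
    } catch (...) {
        errno = EIO;  // anything unexpected is reported as a generic I/O error
    }
    return -1;
}
```

Because callers only ever see the return value and errno, the internals could later switch from exceptions to status codes without changing this signature.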

What we are discussing is eliminating the use of exceptions internally.  Since this happens
at a level which is not visible to users, it can certainly be done later if we want to.  However,
I would like to see it fixed sooner rather than later.  It is important to stick to a consistent
coding style, and we want to work on the robustness.

I have proposed a C++ API for libhdfs and libhdfs3 at https://issues.apache.org/jira/browse/HDFS-7207.
I would welcome more discussion there.  Note that my API does not use exceptions and does
not require C++11 (although it can make use of C++11 features if they are available).
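
To show what an exception-free C++ interface that avoids C++11 can look like in general, here is a minimal sketch using a status return value and an out-parameter.  The {{Status}} class and {{parsePort}} function are hypothetical names made up for this example; they are not taken from the HDFS-7207 proposal.

```cpp
#include <cassert>
#include <string>

// Hypothetical status type (illustrative only, not the HDFS-7207 API).
class Status {
public:
    static Status OK() { return Status(0, ""); }
    static Status Error(int code, const std::string& msg) {
        return Status(code, msg);
    }
    bool ok() const { return code_ == 0; }
    int code() const { return code_; }
    const std::string& message() const { return msg_; }

private:
    Status(int code, const std::string& msg) : code_(code), msg_(msg) {}
    int code_;
    std::string msg_;
};

// Hypothetical operation: the result goes into an out-parameter and the
// outcome is the return value, so no exception is needed to report errors.
// On failure, *port is left untouched.
Status parsePort(const std::string& text, int* port) {
    assert(port != 0);
    if (text.empty()) {
        return Status::Error(1, "empty port");
    }
    int value = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        char c = text[i];
        if (c < '0' || c > '9') {
            return Status::Error(2, "non-digit in port");
        }
        value = value * 10 + (c - '0');
        if (value > 65535) {
            return Status::Error(3, "port out of range");
        }
    }
    *port = value;
    return Status::OK();
}
```

A caller checks {{ok()}} on the returned status before using the out-parameter.  Nothing here needs C++11, which is the kind of constraint described above.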

bq. [asynchronous api discussion]

If you look at a high-performance HDFS client like Impala or HAWQ, they are fine with synchronous
APIs.  Why?  Well, most of the time your read performance is limited by the bandwidth of the
local disks (high-performance clients always try to do local reads, and use short-circuit
and mmap if possible).  A local hard disk can't handle more than maybe 100 seeks a second,
and the more seeks you do, the lower your bandwidth will be.
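
The seek arithmetic can be sketched with a back-of-the-envelope model.  This assumes roughly 10 ms per seek (about 100 seeks a second, as above) and a sequential rate of 100 MB/s; the 100 MB/s figure and the function name {{effectiveMBps}} are assumptions for illustration, not measurements.

```cpp
#include <cassert>

// Effective throughput in MB/s when every read pays for one seek.
// readMB:          size of each read, in megabytes
// seekSeconds:     assumed cost of one seek (0.01 s ~ 100 seeks/second)
// sequentialMBps:  assumed sequential transfer rate of the disk
double effectiveMBps(double readMB, double seekSeconds, double sequentialMBps) {
    assert(readMB > 0 && seekSeconds >= 0 && sequentialMBps > 0);
    double transferSeconds = readMB / sequentialMBps;
    return readMB / (seekSeconds + transferSeconds);
}
```

Under these assumptions, {{effectiveMBps(1.0, 0.01, 100.0)}} gives 50 MB/s, while 64 KB reads ({{effectiveMBps(0.0625, 0.01, 100.0)}}) give only about 5.9 MB/s: the smaller the reads, the more the seek cost dominates.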

There is also the CPU aspect: what are you doing with the data?  Sure you can have 10,000
async requests going with 1 thread, but if that thread is actually doing anything with the
data, you can cut a few zeroes off of that.  And then you're back to an amount of concurrent
reads that can be comfortably done synchronously.
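
A very rough way to see this: if each request waits on I/O for some latency and then needs some CPU time on a single thread, that thread can only usefully keep about 1 + latency / CPU-time requests in flight.  This is a simplified steady-state model, and the function name {{usefulInFlight}} and the sample numbers are assumptions chosen for illustration.

```cpp
#include <cassert>

// Approximate number of requests a single thread can usefully keep in
// flight: while one request waits ioLatencyMs for I/O, the thread has
// time to process ioLatencyMs / cpuMsPerRequest other requests.
// Simplified model: ignores queueing, scheduling, and disk limits.
double usefulInFlight(double ioLatencyMs, double cpuMsPerRequest) {
    assert(ioLatencyMs >= 0 && cpuMsPerRequest > 0);
    return 1.0 + ioLatencyMs / cpuMsPerRequest;
}
```

With 10 ms of I/O latency and 0.001 ms of CPU per request, the model gives about 10,000 requests in flight; at 1 ms of CPU per request it drops to 11, a number that a small pool of synchronous threads handles comfortably.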

Async APIs work best for cases where you are doing very, very little processing on each request.
So an async web server like nginx, which is written in highly optimized straight C (no ++),
can squeeze a few more pages per second out of reducing its thread count.  But in a DB it's
tougher (and as you mentioned, it also makes the code much more complex).

So while we should probably consider an async client at some point, I think it is much lower
priority than other things (like finishing the existing native client and merging it).


> libhdfs3 - A native C/C++ HDFS client
> -------------------------------------
>
>                 Key: HDFS-6994
>                 URL: https://issues.apache.org/jira/browse/HDFS-6994
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs-client
>            Reporter: Zhanwei Wang
>            Assignee: Zhanwei Wang
>         Attachments: HDFS-6994-rpc-8.patch, HDFS-6994.patch
>
>
> Hi All
> I just got permission to open source libhdfs3, which is a native C/C++ HDFS client based on the Hadoop RPC protocol and the HDFS Data Transfer Protocol.
> libhdfs3 provides a libhdfs-style C interface and a C++ interface. It supports both Hadoop RPC versions 8 and 9, as well as NameNode HA and Kerberos authentication.
> libhdfs3 is currently used by HAWQ from Pivotal.
> I'd like to integrate libhdfs3 into the HDFS source code to benefit others.
> You can find the libhdfs3 code on GitHub:
> https://github.com/PivotalRD/libhdfs3
> http://pivotalrd.github.io/libhdfs3/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
