hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-2656) Implement a pure c client based on webhdfs
Date Wed, 22 Aug 2012 18:40:42 GMT

     [ https://issues.apache.org/jira/browse/HDFS-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-2656:
----------------------------

    Attachment: teragen_terasort_teravalidate_performance.png
                HDFS-2656.patch

An update on libwebhdfs. The main change is to keep the same write semantics
as libhdfs (thanks to Zhanwei for pointing that out): once a client opens a file for
writing or appending, no other client should be able to open the same file for writing
until the first client closes it. This is achieved by maintaining a single HTTP connection to
the corresponding datanode between the open and close operations. The patch also addresses part
of Nicholas's comments by trying to remove some of the unnecessary memory copying.

For the performance measurement, before directly comparing libhdfs
with libwebhdfs, we first ran teragen, terasort, and teravalidate on a 3-node mini cluster (data
size 100,000,000) and compared the performance of DFSClient and WebHdfs. The measurement
results are attached. The main performance bottleneck for webhdfs appears to be reading
(teravalidate), which takes more than 3 times as long as with DFSClient. This may be because webhdfs
currently uses a datanode as a proxy node for reading data, even when the read crosses block boundaries.

The next steps are to 1) test, fix, and improve the current code, 2) develop
a smarter read mechanism on the client side (i.e., identify the block locations of
a large file in the client), and 3) improve client read performance by reducing
the number of HTTP connections created.

Looking forward to your comments!
                
> Implement a pure c client based on webhdfs
> ------------------------------------------
>
>                 Key: HDFS-2656
>                 URL: https://issues.apache.org/jira/browse/HDFS-2656
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: webhdfs
>            Reporter: Zhanwei.Wang
>         Attachments: HDFS-2656.patch, HDFS-2656.patch, HDFS-2656.unfinished.patch, teragen_terasort_teravalidate_performance.png
>
>
> Currently, the implementation of libhdfs is based on JNI. The overhead of JVM seems a
> little big, and libhdfs can also not be used in the environment without hdfs.
> It seems a good idea to implement a pure c client by wrapping webhdfs. It also can be
> used to access different version of hdfs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
