hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3672) support for persistent connections to improve random read performance.
Date Mon, 14 Jul 2008 23:04:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613483#action_12613483 ]

Raghu Angadi commented on HADOOP-3672:

> Currently, once you ask a datanode to start sending you data, it keeps sending that block
> until you close the connection or the entire block has been sent.

One clarification: the above is true only if the client does not tell the datanode how much
data it wants (i.e. normal reads). For pread()s, the client tells the datanode how much data
it wants and the datanode sends only that much. The description of the jira makes me think
this is about preads.

Regarding the rest of the comment, are you proposing RPCs for all datanode transfers? If I
remember correctly, HBase was considering not using RPCs for large data transfers.

> However, with sufficiently large buffers, round-trip delays introduced by RPC might not
> be significant.
> For example, round-trip delays might be significant for 8k buffers but not for 128k buffers.

I guess it depends on the actual design/implementation. If the datanode sends one buffer per
RPC and the client sends the next RPC only after consuming the first response, then most of
the overhead comes from the datanode being idle half the time (assuming client and datanode
can produce and consume at the same rate), and this overhead does not depend much on buffer
size.
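
To make the two effects concrete, here is a rough back-of-the-envelope model in Java. It is
only a sketch under stated assumptions: the class and method names are hypothetical, the
100 MB/s rate and 0.2 ms round trip are made-up inputs rather than measurements, and it
assumes a strict request/response scheme in which the datanode does nothing while the client
consumes a buffer.

```java
public class StopAndWaitModel {
    /**
     * Fraction of wall-clock time the datanode spends producing data when every
     * buffer needs its own request and the client asks for the next buffer only
     * after consuming the previous one. All inputs are hypothetical.
     */
    static double datanodeUtilization(long bufferBytes, double bytesPerSec, double rttSec) {
        double produce = bufferBytes / bytesPerSec; // datanode reads and sends one buffer
        double consume = bufferBytes / bytesPerSec; // client consumes it at the same rate
        // One cycle = produce + consume + one round trip; the datanode is busy only
        // during the "produce" part of the cycle.
        return produce / (produce + consume + rttSec);
    }

    public static void main(String[] args) {
        double rate = 100e6; // assumed 100 MB/s on both sides
        double rtt = 0.0002; // assumed 0.2 ms round trip
        for (long size : new long[] {8 * 1024, 128 * 1024}) {
            System.out.printf("%d-byte buffers: utilization %.3f with rtt=0, %.3f with rtt=%.1f ms%n",
                    size,
                    datanodeUtilization(size, rate, 0.0),
                    datanodeUtilization(size, rate, rtt),
                    rtt * 1000);
        }
    }
}
```

In this model the round-trip term shrinks relative to the rest as buffers grow, which matches
the 8k vs 128k remark, but with the round trip set to zero the datanode is still busy only
half the time for any buffer size. Removing that half would need requests to overlap with
consumption, which this sketch does not model.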

I think RPCs for data transfer come up pretty often; maybe someone should open a jira with a
detailed design so that it is easier to discuss the specifics.

> support for persistent connections to improve random read performance.
> ----------------------------------------------------------------------
>                 Key: HADOOP-3672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3672
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>         Environment: Linux 2.6.9-55  , Dual Core Opteron 280 2.4Ghz , 4GB memory
>            Reporter: George Wu
>         Attachments: pread_test.java
> preads() establish new connections per request. yourkit java profiles show that this
> connection overhead is pretty significant on the DataNode.
> I wrote a simple microbenchmark program which does many iterations of pread() from different
> offsets of a large file. I hacked DFSClient/DataNode code to re-use the same connection/DataNode
> request handler thread. The performance improvement was 7% when the data is served from disk
> and 80% when the data is served from the OS page cache.
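
The kind of comparison described above can be sketched with plain java.net sockets on the
loopback interface. This is not the attached pread_test.java and does not touch
DFSClient/DataNode code: the class name, the 8-byte (offset, length) request format, and the
in-memory DATA array standing in for a block are all assumptions made for illustration. It
makes no claim about reproducing the 7% or 80% figures.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.ByteBuffer;

public class PersistentConnectionSketch {
    static final byte[] DATA = new byte[1 << 20]; // stand-in for a block held in memory
    static { for (int i = 0; i < DATA.length; i++) DATA[i] = (byte) i; }

    /** Serves (offset, length) requests on one connection until the client closes it. */
    static void serve(Socket s) {
        try (Socket sock = s) {
            DataInputStream in = new DataInputStream(sock.getInputStream());
            OutputStream out = sock.getOutputStream();
            while (true) {
                int offset;
                try { offset = in.readInt(); } catch (EOFException e) { return; } // client is done
                int length = in.readInt();
                out.write(DATA, offset, length); // send exactly what was asked for
                out.flush();
            }
        } catch (IOException e) {
            // connection dropped; nothing else to clean up in this sketch
        }
    }

    /** Starts a thread-per-connection server on an ephemeral loopback port. */
    static ServerSocket startServer() throws IOException {
        ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread acceptor = new Thread(() -> {
            try {
                while (true) {
                    Socket s = server.accept();
                    Thread t = new Thread(() -> serve(s));
                    t.setDaemon(true);
                    t.start();
                }
            } catch (IOException e) {
                // server socket closed; stop accepting
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
        return server;
    }

    /** Sends one request as a single 8-byte write, then reads buf.length bytes back. */
    static void request(Socket sock, int offset, byte[] buf) throws IOException {
        byte[] header = ByteBuffer.allocate(8).putInt(offset).putInt(buf.length).array();
        OutputStream out = sock.getOutputStream();
        out.write(header);
        out.flush();
        new DataInputStream(sock.getInputStream()).readFully(buf);
    }

    /** Issues `reads` requests, over one shared socket if `reuse`; returns bytes received. */
    static long run(int port, int reads, int size, boolean reuse) throws IOException {
        byte[] buf = new byte[size];
        long total = 0;
        Socket shared = reuse ? new Socket(InetAddress.getLoopbackAddress(), port) : null;
        try {
            for (int i = 0; i < reads; i++) {
                int offset = (i * 7919) % (DATA.length - size); // scattered offsets
                if (reuse) {
                    request(shared, offset, buf);
                } else {
                    try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
                        request(s, offset, buf);
                    }
                }
                if (buf[0] != DATA[offset]) throw new IOException("unexpected payload");
                total += buf.length;
            }
        } finally {
            if (shared != null) shared.close();
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = startServer()) {
            for (boolean reuse : new boolean[] {false, true}) {
                long start = System.nanoTime();
                long bytes = run(server.getLocalPort(), 2000, 4096, reuse);
                System.out.printf("reuse=%b: %d bytes in %d ms%n",
                        reuse, bytes, (System.nanoTime() - start) / 1_000_000);
            }
        }
    }
}
```

With reuse=false every read pays for a connect, an accept and a new handler thread; with
reuse=true those costs are paid once. Timings from a loopback toy like this depend heavily on
the machine and JVM warm-up, so they should be read as a rough indication only.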

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
