hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3672) support for persistent connections to improve random read performance.
Date Mon, 14 Jul 2008 22:46:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613477#action_12613477 ]

Doug Cutting commented on HADOOP-3672:
--------------------------------------

> Why are asynchronous RPCs even relevant here?

Currently, once you ask a datanode to start sending you data, it keeps sending that block
until you close the connection or the entire block has been sent.  TCP's flow control is async,
so there are no round-trip delays once a block starts streaming to the client.  However, with
sufficiently large buffers, round-trip delays introduced by RPC might not be significant.
For example, round-trip delays might be significant for 8k buffers but not for 128k buffers.
But we probably don't want to make buffers too large, so if the round-trip overhead proves
to be significant even with 128k buffers, then we should consider using async RPC.  Make sense?
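
To make the buffer-size argument concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes each buffer is fetched with one synchronous RPC, so every buffer pays one round trip on top of its transfer time. The function name, the 0.2 ms round-trip time, and the 100 MB/s link rate are illustrative assumptions, not figures from this issue.

```python
def rpc_efficiency(buffer_bytes, rtt_seconds, bandwidth_bytes_per_s):
    """Fraction of wall-clock time spent moving data when every buffer
    costs one synchronous round trip (hypothetical model)."""
    transfer = buffer_bytes / bandwidth_bytes_per_s
    return transfer / (transfer + rtt_seconds)


# Assumed numbers: 0.2 ms round trip, 100 MB/s link.
RTT = 200e-6
BANDWIDTH = 100e6

for size_kib in (8, 128):
    eff = rpc_efficiency(size_kib * 1024, RTT, BANDWIDTH)
    print(f"{size_kib:>4} KiB buffers: {eff:.0%} of time spent transferring")
```

Under these assumed numbers the 8k case spends roughly 29% of its time transferring data and the 128k case roughly 87%; a streaming or async design would avoid the per-buffer wait altogether.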


> support for persistent connections to improve random read performance.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-3672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3672
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>         Environment: Linux 2.6.9-55  , Dual Core Opteron 280 2.4Ghz , 4GB memory
>            Reporter: George Wu
>         Attachments: pread_test.java
>
>
> pread() calls establish a new connection per request. YourKit Java profiles show that this
> connection overhead is pretty significant on the DataNode.
> I wrote a simple microbenchmark program which does many iterations of pread() from different
> offsets of a large file. I hacked DFSClient/DataNode code to re-use the same connection/DataNode
> request handler thread. The performance improvement was 7% when the data is served from disk
> and 80% when the data is served from the OS page cache.
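
The idea in the description can be sketched with a toy positional-read server on localhost. This is not the attached pread_test.java and does not use DFSClient or the DataNode protocol; the wire format (8-byte offset, 4-byte length), the in-memory DATA buffer, and every name below are assumptions made purely for illustration.

```python
import random
import socket
import socketserver
import struct
import threading
import time

DATA = bytes(range(256)) * 4096        # 1 MiB in-memory stand-in for a block
HEADER = struct.Struct("!QI")          # hypothetical request: offset, length


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf.extend(chunk)
    return bytes(buf)


class ReadHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Keep serving requests until the client closes the connection.
        while True:
            try:
                header = recv_exact(self.request, HEADER.size)
            except ConnectionError:
                return
            offset, length = HEADER.unpack(header)
            self.request.sendall(DATA[offset:offset + length])


def request(sock, offset, length):
    sock.sendall(HEADER.pack(offset, length))
    return recv_exact(sock, length)


def pread_new_connection(addr, offset, length):
    """One connection per read, like the behaviour the issue describes."""
    with socket.create_connection(addr) as sock:
        return request(sock, offset, length)


class PersistentReader:
    """Reuses a single connection for every read."""

    def __init__(self, addr):
        self.sock = socket.create_connection(addr)

    def pread(self, offset, length):
        return request(self.sock, offset, length)

    def close(self):
        self.sock.close()


def start_server():
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), ReadHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def compare(iterations=200, length=4096, seed=0):
    """Return (seconds with a connection per read, seconds with one connection)."""
    rng = random.Random(seed)
    offsets = [rng.randrange(len(DATA) - length) for _ in range(iterations)]
    server = start_server()
    addr = server.server_address
    try:
        start = time.perf_counter()
        for off in offsets:
            pread_new_connection(addr, off, length)
        per_request = time.perf_counter() - start

        reader = PersistentReader(addr)
        start = time.perf_counter()
        for off in offsets:
            reader.pread(off, length)
        persistent = time.perf_counter() - start
        reader.close()
    finally:
        server.shutdown()
        server.server_close()
    return per_request, persistent
```

Calling compare() returns the two elapsed times for the same set of random offsets. Because the data here always comes from memory, it corresponds most closely to the page-cache case above; the absolute numbers depend on the machine and say nothing about HDFS itself.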

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

