hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3672) support for persistent connections to improve random read performance.
Date Mon, 14 Jul 2008 19:36:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613422#action_12613422 ]

Doug Cutting commented on HADOOP-3672:

> The RPC code is synchronous by definition.

Perhaps.  It would not be hard to add async calls and responses to RPC.  Maybe you'd no longer
call it  "RPC" then?  In any case, one could, e.g., add RPC methods like:

/** return a call id */
public static int send(Method method, Object[] params,  InetSocketAddress addr);

/** poll a set of call ids, returning count that are complete */
public static int select(int[] calls, boolean[] areDone, Object[] values);

The DFSClient code could then, each time it receives a buffer and before returning it to the
client, request the next buffer as a read-ahead.
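
A minimal, self-contained sketch of that idea follows. It is not Hadoop code: the class
AsyncRpcSketch, the ReadAheadReader helper, and the in-memory fetch() that stands in for a
DataNode read are all hypothetical, and send() takes a Callable instead of the
(Method, Object[], InetSocketAddress) arguments proposed above so the example runs without a
network. Only the shape of send()/select() and the read-ahead step are meant to carry over.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an RPC layer with async calls; not Hadoop's API. */
public class AsyncRpcSketch {
    // Daemon threads so the JVM can exit while calls are still outstanding.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });
    private static final AtomicInteger NEXT_ID = new AtomicInteger();
    private static final Map<Integer, Future<Object>> PENDING = new ConcurrentHashMap<>();

    /** Start a call in the background and return a call id. */
    public static int send(Callable<Object> call) {
        int id = NEXT_ID.incrementAndGet();
        PENDING.put(id, POOL.submit(call));
        return id;
    }

    /** Poll a set of call ids without blocking, returning the count that are complete. */
    public static int select(int[] calls, boolean[] areDone, Object[] values) throws Exception {
        int done = 0;
        for (int i = 0; i < calls.length; i++) {
            Future<Object> f = PENDING.get(calls[i]);
            areDone[i] = f != null && f.isDone();
            if (areDone[i]) {
                values[i] = f.get(); // already complete, so this does not block
                done++;
            }
        }
        return done;
    }

    /** Requests buffer n+1 before handing buffer n back to the caller. */
    public static class ReadAheadReader {
        private final byte[] data;   // simulated remote block contents
        private final int bufSize;
        private int nextOffset = 0;
        private int pendingCall;

        public ReadAheadReader(byte[] data, int bufSize) {
            this.data = data;
            this.bufSize = bufSize;
            this.pendingCall = request(); // prime the first buffer
        }

        private int request() {
            final int off = nextOffset;
            nextOffset += bufSize;
            return send(() -> fetch(off));
        }

        // Simulated remote read; returns an empty array past end of data.
        private byte[] fetch(int off) {
            if (off >= data.length) {
                return new byte[0];
            }
            return Arrays.copyOfRange(data, off, Math.min(off + bufSize, data.length));
        }

        public byte[] next() throws Exception {
            int[] calls = {pendingCall};
            boolean[] done = new boolean[1];
            Object[] values = new Object[1];
            while (select(calls, done, values) == 0) {
                Thread.sleep(1); // simple polling; a real client would block or use callbacks
            }
            pendingCall = request(); // read-ahead: ask for the next buffer before returning
            return (byte[]) values[0];
        }
    }
}
```

Each next() call returns a buffer that was requested during the previous call, so the fetch
overlaps with whatever the caller does in between. This sketch never evicts finished calls from
its pending map; a real implementation would need to.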

> there will be a degradation of read performance compared to what we have now

Perhaps, but would it be significant?  Would the costs outweigh the benefits of sharing code,
connection pools, security models, etc. with RPC code?

> support for persistent connections to improve random read performance.
> ----------------------------------------------------------------------
>                 Key: HADOOP-3672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3672
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>         Environment: Linux 2.6.9-55  , Dual Core Opteron 280 2.4Ghz , 4GB memory
>            Reporter: George Wu
>         Attachments: pread_test.java
> pread() calls establish a new connection per request. YourKit Java profiles show that this
> connection overhead is pretty significant on the DataNode.
> I wrote a simple microbenchmark program which does many iterations of pread() from different
> offsets of a large file. I hacked the DFSClient/DataNode code to re-use the same connection and
> DataNode request handler thread. The performance improvement was 7% when the data is served
> from disk and 80% when the data is served from the OS page cache.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
