hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3672) support for persistent connections to improve random read performance.
Date Mon, 14 Jul 2008 21:44:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613465#action_12613465 ]

Raghu Angadi commented on HADOOP-3672:

Why are asynchronous RPCs even relevant here? FileSystem does *not* have an async read API.

Also, I don't follow the claimed degradation due to round trips. The current datanode protocol
has round-trip costs too (the client sends a request for data and the DN sends it, which is
not fundamentally different from an RPC).

The main problem I see with RPCs is that they involve extra copies of data at both the server
and the client (with the current implementation), and they would need a few features for
muxing/demuxing data for different blocks so that one read does not block another. This is
quite a bit of change, and we would be hard pressed to show any improvement over what we have
now. Avoiding buffer copies is an absolute must to even consider the approach, I think.

I think we had better quantify the overhead of connection establishment, and I think very
simple connection sharing at the DFSClient.BlockReader() level for preads will do the job. The
attached example reads 4k in reverse order. Is 4k long enough? Is reading in reverse order the
same as random access?
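
One way to start quantifying connection establishment overhead in isolation is a loopback
microbenchmark that issues the same 4k requests either over a new socket per request or over
one reused socket. The sketch below is only an illustration under stated assumptions: it uses
plain java.net sockets rather than any DFSClient or DataNode code, and ToyServer, the 8-byte
offset request format, and all method names are hypothetical. It does not model disk reads,
the real datanode protocol, or handler thread reuse.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectionOverheadSketch {
    static final int CHUNK = 4096; // 4k per request, as in the attached example

    /** Hypothetical stand-in for a DataNode: replies to each 8-byte offset with CHUNK bytes. */
    static final class ToyServer implements AutoCloseable {
        final ServerSocket listener;
        final AtomicInteger accepted = new AtomicInteger(); // connections accepted so far

        ToyServer() throws IOException {
            listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
            Thread t = new Thread(this::acceptLoop);
            t.setDaemon(true);
            t.start();
        }

        int port() { return listener.getLocalPort(); }

        private void acceptLoop() {
            try {
                while (true) {
                    Socket s = listener.accept();
                    accepted.incrementAndGet();
                    Thread h = new Thread(() -> serve(s)); // one handler thread per connection
                    h.setDaemon(true);
                    h.start();
                }
            } catch (IOException closed) {
                // listener was closed: stop accepting
            }
        }

        private static void serve(Socket s) {
            byte[] payload = new byte[CHUNK];
            try (Socket sock = s;
                 DataInputStream in = new DataInputStream(sock.getInputStream());
                 DataOutputStream out = new DataOutputStream(sock.getOutputStream())) {
                while (true) {
                    in.readLong();      // requested offset (ignored by this toy server)
                    out.write(payload); // always answers with CHUNK zero bytes
                    out.flush();
                }
            } catch (IOException eofOrReset) {
                // client closed the connection
            }
        }

        @Override public void close() throws IOException { listener.close(); }
    }

    /** Sends one offset and reads one CHUNK-sized reply; returns the bytes read. */
    private static int request(Socket s, long offset, byte[] buf) throws IOException {
        DataOutputStream out = new DataOutputStream(s.getOutputStream());
        out.writeLong(offset);
        out.flush();
        new DataInputStream(s.getInputStream()).readFully(buf);
        return buf.length;
    }

    /** Mimics the current pread behaviour: a new connection for every request. */
    static long readWithNewConnections(int port, int requests) throws IOException {
        long total = 0;
        byte[] buf = new byte[CHUNK];
        for (int i = 0; i < requests; i++) {
            try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
                total += request(s, (long) i * CHUNK, buf);
            }
        }
        return total;
    }

    /** Mimics connection sharing: every request goes over the same socket. */
    static long readWithOneConnection(int port, int requests) throws IOException {
        long total = 0;
        byte[] buf = new byte[CHUNK];
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
            for (int i = 0; i < requests; i++) {
                total += request(s, (long) i * CHUNK, buf);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        int requests = 2000;
        try (ToyServer server = new ToyServer()) {
            long t0 = System.nanoTime();
            readWithNewConnections(server.port(), requests);
            long t1 = System.nanoTime();
            readWithOneConnection(server.port(), requests);
            long t2 = System.nanoTime();
            System.out.printf("new connection per request: %d ms%n", (t1 - t0) / 1_000_000);
            System.out.printf("one shared connection:      %d ms%n", (t2 - t1) / 1_000_000);
            System.out.printf("connections accepted:       %d%n", server.accepted.get());
        }
    }
}
```

The timings depend heavily on the machine and on loopback behaviour, so they only bound the
socket setup cost; they say nothing about how that cost compares with disk or page cache
reads on a real DataNode. Changing CHUNK is a quick way to see how the answer shifts with
read size.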

> support for persistent connections to improve random read performance.
> ----------------------------------------------------------------------
>                 Key: HADOOP-3672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3672
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>         Environment: Linux 2.6.9-55, Dual Core Opteron 280 2.4GHz, 4GB memory
>            Reporter: George Wu
>         Attachments: pread_test.java
> pread() establishes a new connection per request. YourKit Java profiles show that this
> connection overhead is pretty significant on the DataNode.
> I wrote a simple microbenchmark program which does many iterations of pread() from different
> offsets of a large file. I hacked the DFSClient/DataNode code to re-use the same
> connection/DataNode request handler thread. The performance improvement was 7% when the data
> is served from disk and 80% when the data is served from the OS page cache.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.