hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3672) support for persistent connections to improve random read performance.
Date Mon, 14 Jul 2008 23:26:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613486#action_12613486 ]

Doug Cutting commented on HADOOP-3672:

Applications that perform random access don't always know how much they'll read.  For example,
Lucene uses read(), not pread(), to retrieve posting lists.  Lucene could perhaps be modified
so that it could provide lengths whenever it reads data.  So we'd ideally like random access
performance to be good for both read() and pread().  Most filesystems optimize both cases,
and consequently most applications are written assuming that a random read() will be reasonably
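
To make the read()/pread() distinction above concrete, here is a minimal sketch on a
local file. It is not HDFS code: java.nio's FileChannel stands in for a DFS input stream,
and the class and method names (ReadStyles, seekThenRead, positionalRead) are made up
for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadStyles {
    // read()-style: move the stream position, then read from wherever it now points.
    // The caller does not have to say up front how much it will eventually consume.
    static int seekThenRead(FileChannel ch, long offset, ByteBuffer dst) throws IOException {
        ch.position(offset);
        return ch.read(dst);
    }

    // pread()-style: offset and buffer size are given per call, and the
    // channel's own position is left untouched.
    static int positionalRead(FileChannel ch, long offset, ByteBuffer dst) throws IOException {
        return ch.read(dst, offset);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("readstyles", ".bin");
        try {
            Files.write(tmp, "0123456789abcdef".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                ByteBuffer a = ByteBuffer.allocate(4);
                int n = positionalRead(ch, 10, a);
                System.out.println("positional: " + n + " bytes, position now " + ch.position());

                ByteBuffer b = ByteBuffer.allocate(4);
                n = seekThenRead(ch, 4, b);
                System.out.println("seek+read:  " + n + " bytes, position now " + ch.position());
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The point of the comparison is only that the positional call carries its own offset and
length, while the seek-then-read call relies on stream state that a filesystem has to
handle efficiently on its own.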

> are you proposing RPCs for all datanode transfers?

We need to understand whether there are hard reasons why we cannot use RPC for all network
communications.  Right now, HDFS uses both RPC and raw TCP, and mapred uses RPC and HTTP.
Security, authentication and authorization would all be simpler if we used fewer communication
mechanisms, plus we'd have a unified connection cache, etc.  But we obviously don't want to
go that way if it will kill performance.

> support for persistent connections to improve random read performance.
> ----------------------------------------------------------------------
>                 Key: HADOOP-3672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3672
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>         Environment: Linux 2.6.9-55, Dual Core Opteron 280 2.4 GHz, 4 GB memory
>            Reporter: George Wu
>         Attachments: pread_test.java
> pread() calls establish a new connection per request. YourKit Java profiles show that this
> connection overhead is pretty significant on the DataNode.
> I wrote a simple microbenchmark program which does many iterations of pread() from different
> offsets of a large file. I hacked DFSClient/DataNode code to re-use the same connection/DataNode
> request handler thread. The performance improvement was 7% when the data is served from disk
> and 80% when the data is served from the OS page cache.
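
The kind of comparison described above can be sketched with plain sockets. This is not
the attached pread_test.java and not the DFSClient/DataNode protocol: the toy
(offset, length) wire format, the in-memory DATA array, and all class and method names
here are assumptions made for illustration, and the timings it prints say nothing about
the 7% and 80% figures reported in the issue.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.ByteBuffer;

public class ConnReuseSketch {
    static final byte[] DATA = new byte[1 << 16]; // stand-in for cached block data
    static { for (int i = 0; i < DATA.length; i++) DATA[i] = (byte) i; }

    // Toy server: one daemon thread per connection, serving requests until the client closes.
    static ServerSocket startServer() throws IOException {
        ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread acceptor = new Thread(() -> {
            try {
                while (true) {
                    Socket s = server.accept();
                    Thread t = new Thread(() -> serve(s));
                    t.setDaemon(true);
                    t.start();
                }
            } catch (IOException closed) {
                // server socket closed: stop accepting
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
        return server;
    }

    static void serve(Socket s) {
        try (Socket sock = s) {
            DataInputStream in = new DataInputStream(sock.getInputStream());
            OutputStream out = sock.getOutputStream();
            while (true) {
                int offset = in.readInt();
                int length = in.readInt();
                out.write(DATA, offset, length);
                out.flush();
            }
        } catch (IOException done) {
            // EOF or reset: the client is finished with this connection
        }
    }

    // One request/response exchange on an already-open socket.
    static byte[] request(Socket s, int offset, int length) throws IOException {
        byte[] header = ByteBuffer.allocate(8).putInt(offset).putInt(length).array();
        s.getOutputStream().write(header); // single write keeps the request in one segment
        s.getOutputStream().flush();
        byte[] buf = new byte[length];
        new DataInputStream(s.getInputStream()).readFully(buf);
        return buf;
    }

    static int offsetFor(int i, int length) {
        return Math.floorMod(i * 4099, DATA.length - length); // scattered, deterministic offsets
    }

    // Current behaviour described in the issue: a new connection for every read.
    static long timeConnectionPerRead(int port, int iterations, int length) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
                request(s, offsetFor(i, length), length);
            }
        }
        return System.nanoTime() - start;
    }

    // Proposed behaviour: one connection reused for all reads.
    static long timePersistent(int port, int iterations, int length) throws IOException {
        long start = System.nanoTime();
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
            for (int i = 0; i < iterations; i++) {
                request(s, offsetFor(i, length), length);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = startServer()) {
            int port = server.getLocalPort();
            long perRead = timeConnectionPerRead(port, 1000, 4096);
            long persistent = timePersistent(port, 1000, 4096);
            System.out.printf("connection per read: %d ms%n", perRead / 1_000_000);
            System.out.printf("persistent:          %d ms%n", persistent / 1_000_000);
        }
    }
}
```

Because the data here always comes from memory over loopback, the sketch only isolates
connection setup and handler-thread startup cost; it does not model disk reads.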

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
