hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3672) support for persistent connections to improve random read performance.
Date Mon, 14 Jul 2008 23:43:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613489#action_12613489 ]

Doug Cutting commented on HADOOP-3672:

> I think current implementation serializes/deserializes all the arguments and returned objects, doesn't it imply extra copies?

I don't think so.  It calls their write(OutputStream) methods.  If an object points to a stream's buffer, and that buffer is at least as large as the OutputStream's buffer, then that OutputStream should cascade the write directly to its underlying OutputStream, and so on, until the data is written directly from the client's buffer to the socket.
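The cascading behavior described above can be illustrated with plain java.io: BufferedOutputStream forwards any write that is at least as large as its internal buffer straight to the underlying stream, skipping the intermediate copy. A minimal sketch (class name and buffer sizes are illustrative, not from Hadoop):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class CascadeWriteDemo {
    // Returns how many bytes reached the underlying sink *before* flush().
    static int cascadedWriteSize() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Deliberately tiny 8-byte buffer in front of the sink.
        BufferedOutputStream out = new BufferedOutputStream(sink, 8);
        byte[] payload = new byte[32];          // larger than the buffer
        out.write(payload, 0, payload.length);  // cascades straight to sink, no copy into buf
        return sink.size();                     // data bypassed the intermediate buffer
    }

    public static void main(String[] args) throws IOException {
        System.out.println(cascadedWriteSize()); // prints 32: nothing was held back
    }
}
```

A chain of such streams, each large enough or pass-through, would let the client's buffer reach the socket without extra copies.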

Supporting kernel transfer would be a little trickier, but not impossible, I think.  The write()
method for the response object would need to transfer data directly from the block to the
socket, right?  So we'd need a generic way to get the socket's channel from the OutputStream
passed to write(), right?
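If the write() method could get at the socket's channel, the kernel-transfer path might look like java.nio's FileChannel.transferTo, which on Linux maps to sendfile(2) when the target is a socket channel. A hedged sketch (the transferBlock name is hypothetical; for the demo the "socket" is a channel over an in-memory sink, so the JDK falls back to a copy loop, but the call shape is the same):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransferDemo {
    // Ship a block file to the (presumed) socket channel without staging it
    // through a user-space buffer. transferTo may move fewer bytes than asked,
    // so loop until the whole block is sent.
    static long transferBlock(Path block, WritableByteChannel sock) throws IOException {
        try (FileChannel in = FileChannel.open(block, StandardOpenOption.READ)) {
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, sock);
            }
            return pos;
        }
    }

    public static void main(String[] args) throws IOException {
        Path block = Files.createTempFile("block", ".dat");
        Files.write(block, new byte[1024]);
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        long n = transferBlock(block, Channels.newChannel(captured));
        System.out.println(n); // prints 1024
    }
}
```

The generic hook Doug asks about would amount to the response object's write() receiving something it can unwrap to a WritableByteChannel like `sock` above.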

> I should probably just wait for a design for RPC transfers.

The design can't happen without knowing the requirements.  So, yes, we should start a separate issue about this, but the first step is understanding what would be required of RPC before it would be usable by HDFS.  Perhaps a separate issue should be filed to explore what would be needed for RPC to be usable by the mapred shuffle?

> support for persistent connections to improve random read performance.
> ----------------------------------------------------------------------
>                 Key: HADOOP-3672
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3672
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.17.0
>         Environment: Linux 2.6.9-55  , Dual Core Opteron 280 2.4Ghz , 4GB memory
>            Reporter: George Wu
>         Attachments: pread_test.java
> preads() establish new connections per request. yourkit java profiles show that this connection overhead is pretty significant on the DataNode.
> I wrote a simple microbenchmark program which does many iterations of pread() from different offsets of a large file. I hacked DFSClient/DataNode code to re-use the same connection/DataNode request handler thread. The performance improvement was 7% when the data is served from disk and 80% when the data is served from the OS page cache.
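The shape of the microbenchmark the reporter describes can be sketched with java.nio positional reads standing in for HDFS preads (file size, read size, and iteration count below are illustrative; the attached pread_test.java presumably drives DFSClient instead):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

public class PreadBench {
    static final int FILE_SIZE = 1 << 20;  // 1 MB demo file (the real test used a large file)
    static final int READ_SIZE = 4096;
    static final int ITERATIONS = 10_000;

    // Issue many random-offset positional reads over one open channel;
    // returns total bytes read.
    static long randomPreads(Path file) throws Exception {
        Random rnd = new Random(42);
        ByteBuffer buf = ByteBuffer.allocate(READ_SIZE);
        long total = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            for (int i = 0; i < ITERATIONS; i++) {
                buf.clear();
                long off = rnd.nextInt(FILE_SIZE - READ_SIZE);
                // Positional read: like pread(2), it leaves the channel position
                // untouched, so one connection/handle can serve many random reads.
                total += ch.read(buf, off);
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("bench", ".dat");
        Files.write(file, new byte[FILE_SIZE]);
        long t0 = System.nanoTime();
        long bytes = randomPreads(file);
        System.out.printf("read %d bytes in %.1f ms%n", bytes, (System.nanoTime() - t0) / 1e6);
    }
}
```

In the HDFS case each iteration also paid for a fresh DataNode connection, which is the overhead the reported 7%/80% improvement removes.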

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
