hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-941) Datanode xceiver protocol should allow reuse of a connection
Date Wed, 08 Jun 2011 21:07:59 GMT

     [ https://issues.apache.org/jira/browse/HDFS-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-941:

    Attachment: hdfs-941.txt

New patch with response to stack's review:

bq. ... google-collections ...

I think you were looking at a previous patch. The new patch uses Guava, which has already
been included in ivy.xml for a month or two.

bq. If you make another version of this patch, change the mentions of getEOS in comments to
be 'eos' to match the change of variable name.

Fixed (I only found one such mention).

bq. 469 +        sock.setTcpNoDelay(true);
Sharp eyes. The reason we need TCP_NODELAY now, whereas we didn't before, is that without it
we hit a nasty interaction between Nagle's algorithm and delayed ACKs. Our interaction
pattern now looks like:

C -> DN: read block X
DN -> C: data
C -> DN: status OK
C -> DN: read block Y

The fact that we send two packets in a row triggers Nagle's algorithm: the client won't send
the read request for block Y until it has received an ACK for its "status OK" packet, or
until a lengthy delay elapses (40ms, I think, on Linux). The DN isn't sending the ACK for
"status OK" because it's gone into delayed-ACK mode. So throughput drops by 20x or more. I
confirmed this with TestParallelRead: I saw 12MB/sec vs. 274MB/sec with NODELAY on.

NODELAY fixes this by forcing the "read block Y" request to be sent immediately even though
the ACK for the prior packet is still outstanding.

To be sure that the performance gains in this patch aren't just from NODELAY, I added NODELAY
to unpatched trunk and ran TestParallelRead. It made no performance difference when it wasn't
combined with connection reuse.

I'll add a comment to the code explaining why the nodelay is crucial for performance.
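For reference, disabling Nagle's algorithm from Java is a one-line socket option. This is a
minimal illustrative sketch; the class and helper names are made up for illustration and are
not from the actual patch:

```java
import java.net.Socket;
import java.net.SocketException;

public class XceiverSocketSetup {
    // Illustrative helper (not the patch's code): with connection reuse,
    // the client sends "status OK" followed immediately by the next read
    // request. TCP_NODELAY makes the kernel send that second packet right
    // away instead of waiting for an ACK of the first.
    public static Socket configureForReuse(Socket sock) throws SocketException {
        sock.setTcpNoDelay(true); // disable Nagle's algorithm
        return sock;
    }

    public static void main(String[] args) throws Exception {
        Socket sock = configureForReuse(new Socket());
        System.out.println(sock.getTcpNoDelay()); // prints "true"
    }
}
```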

bq. Also, old code used set timer after making connection
Fixed; I don't think the old ordering was for any particular reason.

bq. You think 16 a good number for the socket cache (doesn't seem easily changeable)?

16 seems like a decent default, but I made it configurable just in case it bites anyone in
production. Added dfs.client.socketcache.capacity as an undocumented parameter, since we don't
expect it to be tuned.
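For illustration, the idea behind a capacity-bounded, per-datanode socket cache can be
sketched as below. This is not the patch's implementation: cached sockets are stood in for by
plain Objects, the address key is just a String, and the capacity would come from the
dfs.client.socketcache.capacity setting mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a capacity-bounded socket cache (illustrative only,
// not the HDFS-941 code). Entries are keyed by datanode address; the
// total number of cached connections never exceeds the capacity.
public class SocketCacheSketch {
    private final int capacity;
    private final Map<String, Deque<Object>> cache = new HashMap<>();
    private int size = 0;

    public SocketCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    // Return a cached connection for this address, or null if none.
    public synchronized Object get(String addr) {
        Deque<Object> q = cache.get(addr);
        if (q == null || q.isEmpty()) {
            return null;
        }
        size--;
        return q.pollFirst();
    }

    // Cache a connection for later reuse. Returns false when the cache
    // is full; the caller would then just close the socket.
    public synchronized boolean put(String addr, Object sock) {
        if (size >= capacity) {
            return false;
        }
        cache.computeIfAbsent(addr, k -> new ArrayDeque<>()).addLast(sock);
        size++;
        return true;
    }
}
```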

> Datanode xceiver protocol should allow reuse of a connection
> ------------------------------------------------------------
>                 Key: HDFS-941
>                 URL: https://issues.apache.org/jira/browse/HDFS-941
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: bc Wong
>         Attachments: HDFS-941-1.patch, HDFS-941-2.patch, HDFS-941-3.patch, HDFS-941-3.patch,
> HDFS-941-4.patch, HDFS-941-5.patch, HDFS-941-6.22.patch, HDFS-941-6.patch, HDFS-941-6.patch,
> HDFS-941-6.patch, hdfs-941.txt, hdfs-941.txt, hdfs-941.txt, hdfs941-1.png
> Right now each connection into the datanode xceiver only processes one operation.
> In the case that an operation leaves the stream in a well-defined state (e.g. a client
> reads to the end of a block successfully), the same connection could be reused for a second
> operation. This should improve random read performance significantly.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
