hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-941) Datanode xceiver protocol should allow reuse of a connection
Date Mon, 20 Jun 2011 19:10:47 GMT

    [ https://issues.apache.org/jira/browse/HDFS-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052166#comment-13052166 ]

Konstantin Shvachko commented on HDFS-941:
------------------------------------------

Answers to some issues raised here:

Stack> RM says what's in a release and no one else.

We can still talk about the technical merits of the implementation, can't we?

Todd> nrFiles <= nrNodes means full locality, right?

No. In DFSIO there is no locality, because the files that DFSIO reads and writes are not
the input of the MR job; only their names are. The reason for nrFiles <= nrNodes is to make
sure the job completes in one wave of mappers, and to minimize contention on the drives
between tasks.

I was trying to avoid turning this issue into yet another discussion about DFSIO, because
the objective here is to verify that the patch does not introduce a performance regression
for sequential I/O. If the benchmark I proposed doesn't work for you guys, you can propose
a different one.

Dhruba, Todd, Nicholas> TestDFSIO exhibits very high variance, and its results are dependent
on mapreduce's scheduling.

DFSIO does not depend on MR scheduling. It depends on the OS memory cache.
Cluster nodes these days run with 16 or 32 GB of RAM, so a 10 GB file can be cached almost
entirely by the OS. When you run DFSIO repeatedly, you are not measuring cold IO but RAM
access and communication. The high variance is explained by the fact that some data is
cached and some is not.
For example, DFSIO -write is usually very stable, with std.dev < 1, because it deals
with cold writes.
For DFSIO -read you need to choose a file size larger than your RAM. With sequential reads
the OS cache works as LRU, so if your file is larger than RAM, the OS cache will "forget"
blocks from the head of the file by the time you get to reading the tail. When you start
reading the file again, the cache releases its oldest pages first, and those hold higher
offsets than the one you are currently reading, so they are gone before you reach them.
So every pass is going to be a cold read.
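
The eviction argument above can be checked with a small simulation. This is only a sketch
under a loud assumption: it models the page cache as a strict LRU over whole pages, which
is a simplification of what Linux actually does. The function name and the page counts are
made up for illustration.

```python
from collections import OrderedDict


def sequential_scan_hit_rate(file_pages, cache_pages, passes=2):
    """Simulate repeated sequential reads of one file through a strict LRU cache.

    Assumption: an idealized LRU page cache, not the real Linux implementation.
    Returns the cache hit rate of each pass as a list of floats.
    """
    cache = OrderedDict()  # page number -> None, ordered oldest to newest
    rates = []
    for _ in range(passes):
        hits = 0
        for page in range(file_pages):
            if page in cache:
                hits += 1
                cache.move_to_end(page)  # mark as most recently used
            else:
                cache[page] = None
                if len(cache) > cache_pages:
                    cache.popitem(last=False)  # evict the least recently used page
        rates.append(hits / file_pages)
    return rates


if __name__ == "__main__":
    # File smaller than the cache: the second pass is served from memory.
    print(sequential_scan_hit_rate(file_pages=10, cache_pages=16))
    # File larger than the cache: the second pass is as cold as the first.
    print(sequential_scan_hit_rate(file_pages=100, cache_pages=16))
```

In this model a file that fits in the cache is read entirely from memory on the second
pass, while a file larger than the cache gets no hits at all on any pass.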
I had to go to 100 GB files, which brought std.dev to < 2, with variation in throughput
around 3%.
Alternatively, you can drop the Linux page cache on all DataNodes.
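
One way to do this on a single node is through the kernel's drop_caches control file.
A minimal sketch, assuming a Linux host and root privileges; the function name is
hypothetical, and how you run it on every DataNode (ssh loop, pdsh, etc.) is left out.

```python
import os


def drop_page_cache(control_file="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the Linux kernel to drop its clean caches.

    Assumption: runs as root on Linux. The control_file parameter exists only
    so the function can be exercised against an ordinary file.
    """
    os.sync()  # write dirty pages out first; only clean pages can be dropped
    with open(control_file, "w") as f:
        f.write("3\n")  # 3 = free the page cache plus dentries and inodes


if __name__ == "__main__":
    drop_page_cache()
```

Run it between benchmark iterations so each DFSIO -read pass starts from a cold cache.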
 
Nicholas> it is hard to explain what the "Throughput" and "Average IO rate" really mean.

[This post|http://old.nabble.com/Re%3A-TestDFSIO-delivers-bad-values-of-%22throughput%22-and-%22average-IO-rate%22-p21322404.html]
has the definitions.
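
As a rough illustration of the difference between the two numbers, here is a sketch. It
rests on an assumption about how TestDFSIO aggregates per-task results (throughput as total
bytes over summed task time, average IO rate as the mean of per-task rates); it is not
taken from this message, so check it against the linked post. The function name and the
input shape are hypothetical.

```python
import math


def dfsio_summary(tasks):
    """Summarize per-task results given as (megabytes, seconds) pairs.

    Assumed definitions, not copied from the TestDFSIO source:
      throughput      = sum of sizes / sum of task times
      average IO rate = mean of the per-task rates (size_i / time_i)
      std deviation   = population standard deviation of the per-task rates
    """
    total_mb = sum(mb for mb, _ in tasks)
    total_sec = sum(sec for _, sec in tasks)
    rates = [mb / sec for mb, sec in tasks]
    mean_rate = sum(rates) / len(rates)
    variance = sum(r * r for r in rates) / len(rates) - mean_rate ** 2
    return {
        "throughput_mb_per_sec": total_mb / total_sec,
        "average_io_rate_mb_per_sec": mean_rate,
        "io_rate_std_dev": math.sqrt(abs(variance)),
    }


if __name__ == "__main__":
    # One fast task and one slow task reading the same amount of data.
    print(dfsio_summary([(100, 10), (100, 50)]))
```

Under these definitions a single slow task pulls throughput down much more than it pulls
down the average IO rate, which is one reason the two numbers can diverge.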

Nicholas, I agree with you that the results you are posting don't make sense.
The point, though, is not to throw out the benchmark, but to find the conditions under
which it reliably measures what you need.

> Datanode xceiver protocol should allow reuse of a connection
> ------------------------------------------------------------
>
>                 Key: HDFS-941
>                 URL: https://issues.apache.org/jira/browse/HDFS-941
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: bc Wong
>             Fix For: 0.22.0
>
>         Attachments: 941.22.txt, 941.22.txt, 941.22.v2.txt, 941.22.v3.txt, HDFS-941-1.patch,
>                      HDFS-941-2.patch, HDFS-941-3.patch, HDFS-941-3.patch, HDFS-941-4.patch,
>                      HDFS-941-5.patch, HDFS-941-6.22.patch, HDFS-941-6.patch, HDFS-941-6.patch,
>                      HDFS-941-6.patch, fix-close-delta.txt, hdfs-941.txt, hdfs-941.txt,
>                      hdfs-941.txt, hdfs-941.txt, hdfs941-1.png
>
>
> Right now each connection into the datanode xceiver only processes one operation.
> In the case that an operation leaves the stream in a well-defined state (e.g. a client
> reads to the end of a block successfully), the same connection could be reused for a second
> operation. This should improve random read performance significantly.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
