hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tsz Wo Nicholas Sze (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7122) Use of ThreadLocal<Random> results in poor block placement
Date Wed, 01 Oct 2014 18:25:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155248#comment-14155248

Tsz Wo Nicholas Sze commented on HDFS-7122:

The original patch was also reviewed for several weeks and given a +1 by Aaron, who has been
working full time on HDFS for several years. I also discussed the design with Andrew offline
(though I apologize for not commenting on the JIRA). So, we had around 10 man-years of experience
looking at the patch or its design and we still missed the bug. I wouldn't say that "domain
expertise" was the issue there. Rather, it was a simple mistake which was quickly addressed
after being discovered.
I am surprise that you are measuring the HDFS man-years and adding them up!  Do you think
if it makes sense for 3 four-grade students to claim that they all together is equivalent
to a twelve-grade student?

I agree that it is a simple mistake (i.e. not a complicated one).  However, I do not understand
why we did not revert the original patch but committed another patch for reverting the code.

Please do post on JIRA if you have comments.  Otherwise, it is hard for hard people to work
with you.  Thanks.

Nicholas – speaking of respect, could you please refrain from ad hominem attacks like this
in the JIRA?
I was responding to Andrew's earlier comments about he working on CDH with people in Cloudera
but not working on Apache trunk with Apache open source community.  I am not sure why you
have such interpretation.

The data seems to show that Andrew works quite a bit with open source. In fact, he is the
number one most prolific committer to Hadoop in the last 6 months, and the number two most
prolific in the past year (#1 if you only count HDFS). I don't think dragging unsubstantiated
attacks on the open source commitment level of different vendors is productive in the context
of a bug fix.

The man-years of experience counted earlier includes Andrew, Aaron and Todd.  The so called
"most prolific in last 6 months" measurement includes only Andrew.  Is it suggesting that,
for Andrew alone, the number of man-years is very small and, for Andrew, Aaron and Todd, the
"most prolific in last 6 months" measurement is not quite good?

> Use of ThreadLocal<Random> results in poor block placement
> ----------------------------------------------------------
>                 Key: HDFS-7122
>                 URL: https://issues.apache.org/jira/browse/HDFS-7122
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.3.0
>         Environment: medium-large environments with 100's to 1000's of DNs will be most
affected, but potentially all environments.
>            Reporter: Jeff Buell
>            Assignee: Andrew Wang
>            Priority: Blocker
>              Labels: performance
>             Fix For: 2.6.0
>         Attachments: copies_per_slave.jpg, hdfs-7122-cdh5.1.2-testing.001.patch, hdfs-7122.001.patch
> Summary:
> Since HDFS-6268, the distribution of replica block copies across the DataNodes (replicas
2,3,... as distinguished from the first "primary" replica) is extremely poor, to the point
that TeraGen slows down by as much as 3X for certain configurations.  This is almost certainly
due to the introduction of Thread Local Random in HDFS-6268.  The mechanism appears to be
that this change causes all the random numbers in the threads to be correlated, thus preventing
a truly random choice of DN for each replica copy.
> Testing details:
> 1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256MB block
size.  This results in 6 "primary" blocks on each DN.  With replication=3, there will be on
average 12 more copies on each DN that are copies of blocks from other DNs.  Because of the
random selection of DNs, exactly 12 copies are not expected, but I found that about 160 DNs
(1/4 of all DNs!) received absolutely no copies, while one DN received over 100 copies, and
the elapsed time increased by about 3X from a pre-HDFS-6268 distro.  There was no pattern
to which DNs didn't receive copies, nor was the set of such DNs repeatable run-to-run. In
addition to the performance problem, there could be capacity problems due to one or a few
DNs running out of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but
I don't see a significant difference from the Apache Hadoop source in this regard. The workaround
to recover the previous behavior is to set dfs.namenode.handler.count=1 but of course this
has scaling implications for large clusters.
> I recommend that the ThreadLocal Random part of HDFS-6268 be reverted until a better
algorithm can be implemented and tested.  Testing should include a case with many DNs and
a small number of blocks on each.
> It should also be noted that even pre-HDFS-6268, the random choice of DN algorithm produces
a rather non-uniform distribution of copies.  This is not due to any bug, but purely a case
of random distributions being much less uniform than one might intuitively expect. In the
above case, pre-HDFS-6268 yields something like a range of 3 to 25 block copies on each DN.
Surprisingly, the performance penalty of this non-uniformity is not as big as might be expected
(maybe only 10-20%), but HDFS should do better, and in any case the capacity issue remains.
 Round-robin choice of DN?  Better awareness of which DNs currently store fewer blocks? It's
not sufficient that the total number of blocks is similar on each DN at the end, but that
at each point in time no individual DN receives a disproportionate number of blocks at once
(which could be a danger of a RR algorithm).
> Probably should limit this jira to tracking the ThreadLocal issue, and track the random
choice issue in another one.

This message was sent by Atlassian JIRA

View raw message