Date: Fri, 26 Sep 2014 20:18:37 +0000 (UTC)
From: "Jeff Buell (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-7122) Very poor distribution of replication copies

     [ https://issues.apache.org/jira/browse/HDFS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Buell updated HDFS-7122:
-----------------------------
    Attachment: copies_per_slave.jpg

> Very poor distribution of replication copies
> --------------------------------------------
>
>                 Key: HDFS-7122
>                 URL: https://issues.apache.org/jira/browse/HDFS-7122
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.3.0
>        Environment: medium-large environments with 100's to 1000's of DNs will be most affected, but potentially all environments.
>            Reporter: Jeff Buell
>            Assignee: Andrew Wang
>            Priority: Blocker
>              Labels: performance
>         Attachments: copies_per_slave.jpg
>
>
> Summary:
> Since HDFS-6268, the distribution of replica block copies across the DataNodes (replicas 2,3,... as distinguished from the first "primary" replica) is extremely poor, to the point that TeraGen slows down by as much as 3X for certain configurations. This is almost certainly due to the introduction of ThreadLocalRandom in HDFS-6268. The mechanism appears to be that this change causes all the random numbers in the threads to be correlated, thus preventing a truly random choice of DN for each replica copy.
> Testing details:
> 1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256 MB block size. This results in 6 "primary" blocks on each DN. With replication=3, there will be on average 12 more copies on each DN that are copies of blocks from other DNs. Because of the random selection of DNs, exactly 12 copies are not expected, but I found that about 160 DNs (1/4 of all DNs!) received absolutely no copies, while one DN received over 100 copies, and the elapsed time increased by about 3X from a pre-HDFS-6268 distro. There was no pattern to which DNs didn't receive copies, nor was the set of such DNs repeatable run-to-run. In addition to the performance problem, there could be capacity problems due to one or a few DNs running out of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I don't see a significant difference from the Apache Hadoop source in this regard. The workaround to recover the previous behavior is to set dfs.namenode.handler.count=1, but of course this has scaling implications for large clusters.
> I recommend that the ThreadLocalRandom part of HDFS-6268 be reverted until a better algorithm can be implemented and tested. Testing should include a case with many DNs and a small number of blocks on each.
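The effect of correlated per-thread random choices described above can be illustrated with a small simulation. This is only a sketch of the suspected mechanism, not the actual JDK ThreadLocalRandom behavior: the thread count and copies-per-thread numbers are hypothetical, and the "correlated" case is modeled as the worst case of every handler thread starting from the same seed.

```python
import random
from collections import Counter

NUM_DNS = 638           # DataNodes in the report's test
COPIES_PER_THREAD = 200 # hypothetical replica copies placed per thread
NUM_THREADS = 64        # hypothetical NameNode handler-thread count

def place_copies(seed):
    """One 'handler thread' picking a DN uniformly for each replica copy."""
    rng = random.Random(seed)
    return [rng.randrange(NUM_DNS) for _ in range(COPIES_PER_THREAD)]

# Correlated case: every thread starts from the same seed, so every
# thread emits the identical DN sequence (worst-case correlation).
correlated = Counter()
for _ in range(NUM_THREADS):
    correlated.update(place_copies(seed=42))

# Independent case: each thread gets its own seed.
independent = Counter()
for t in range(NUM_THREADS):
    independent.update(place_copies(seed=t))

print("correlated:  DNs that received any copy =", len(correlated))
print("independent: DNs that received any copy =", len(independent))
print("correlated:  max copies on one DN =", max(correlated.values()))
print("independent: max copies on one DN =", max(independent.values()))
```

In the correlated case at most 200 of the 638 DNs can ever receive a copy and each chosen DN receives a multiple of 64 copies, while independent generators spread the same number of copies across essentially all DNs, which mirrors the "many DNs got zero copies, one got over 100" symptom above.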
> It should also be noted that even pre-HDFS-6268, the random choice of DN algorithm produces a rather non-uniform distribution of copies. This is not due to any bug, but purely a case of random distributions being much less uniform than one might intuitively expect. In the above case, pre-HDFS-6268 yields something like a range of 3 to 25 block copies on each DN. Surprisingly, the performance penalty of this non-uniformity is not as big as might be expected (maybe only 10-20%), but HDFS should do better, and in any case the capacity issue remains. Round-robin choice of DN? Better awareness of which DNs currently store fewer blocks? It's not sufficient that the total number of blocks be similar on each DN at the end; at each point in time, no individual DN should receive a disproportionate number of blocks at once (which could be a danger of a RR algorithm).
> Probably should limit this jira to tracking the ThreadLocal issue, and track the random choice issue in another one.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
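The spread produced by a genuinely uniform random choice is easy to check with a small simulation. The 638-DN and 12-copies-per-DN figures are taken from the test above; the fixed seed is arbitrary, so the exact min/max will vary, but the spread reliably resembles the roughly 3-to-25 range reported.

```python
import random
from collections import Counter

NUM_DNS = 638
MEAN_COPIES = 12                     # average replica copies per DN in the test
TOTAL_COPIES = NUM_DNS * MEAN_COPIES

rng = random.Random(0)               # arbitrary seed, for repeatability only
counts = Counter(rng.randrange(NUM_DNS) for _ in range(TOTAL_COPIES))

# DNs never drawn receive 0 copies, so include them explicitly.
per_dn = [counts.get(dn, 0) for dn in range(NUM_DNS)]
print("min copies on a DN:", min(per_dn))
print("max copies on a DN:", max(per_dn))
```

Per-DN counts here are approximately Poisson with mean 12, so a min/max spread of roughly 3 to 25 is the expected behavior of uniform random placement, not a defect, which is the point made above.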