From: "Jeff Buell (JIRA)"
To: hdfs-dev@hadoop.apache.org
Date: Mon, 22 Sep 2014 18:26:33 +0000 (UTC)
Subject: [jira] [Created] (HDFS-7122) Very poor distribution of replication copies

Jeff Buell created HDFS-7122:
--------------------------------

             Summary: Very poor distribution of replication copies
                 Key: HDFS-7122
                 URL: https://issues.apache.org/jira/browse/HDFS-7122
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.3.0
         Environment: medium-to-large environments with hundreds to thousands of DNs will be most affected, but potentially all environments
            Reporter: Jeff Buell

Summary:

Since HDFS-6268, the distribution of replica block copies across the DataNodes (replicas 2, 3, ... as distinguished from the first "primary" replica) is extremely poor, to the point that TeraGen slows down by as much as 3X for certain configurations. This is almost certainly due to the introduction of ThreadLocalRandom in HDFS-6268. The mechanism appears to be that this change causes the random numbers drawn in the different handler threads to be correlated, preventing a truly random choice of DN for each replica copy.

Testing details:

1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256 MB block size. This results in about 6 "primary" blocks on each DN (1 TB / 256 MB = 4096 blocks spread over 638 DNs). With replication=3, there will on average be about 12 more copies on each DN that are copies of blocks from other DNs. Because DNs are selected at random, exactly 12 copies per DN are not expected, but I found that about 160 DNs (1/4 of all DNs!) received absolutely no copies, while one DN received over 100 copies, and the elapsed time increased by about 3X relative to a pre-HDFS-6268 distro. There was no pattern to which DNs didn't receive copies, nor was the set of such DNs repeatable run-to-run. In addition to the performance problem, there could be capacity problems due to one or a few DNs running out of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I don't see a significant difference from the Apache Hadoop source in this regard. The workaround to recover the previous behavior is to set dfs.namenode.handler.count=1, but of course this has scaling implications for large clusters.

I recommend that the ThreadLocalRandom part of HDFS-6268 be reverted until a better algorithm can be implemented and tested. Testing should include a case with many DNs and a small number of blocks on each.
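As a rough starting point for that kind of test, below is a minimal stand-alone harness (plain Java, not Hadoop's block-placement code; the DN/thread/pick counts are made-up illustrative values). It mimics many handler threads each picking a random DN for a replica copy, using either one shared Random or ThreadLocalRandom, and reports how evenly the picks land:

{code:java}
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Toy harness: many threads each pick random DN indices, then we report
 * the min/max picks per DN and how many DNs got no picks at all -- the
 * same kind of skew measured in the TeraGen test above.
 */
public class ReplicaTargetSkew {
    static final int DATANODES = 638;        // cluster size from the test above
    static final int THREADS = 64;           // stand-in for NameNode handler threads
    static final int PICKS_PER_THREAD = 128; // replica targets chosen per thread

    public static void main(String[] args) throws InterruptedException {
        report("shared Random    ", run(false));
        report("ThreadLocalRandom", run(true));
    }

    static AtomicLongArray run(boolean useThreadLocal) throws InterruptedException {
        AtomicLongArray hits = new AtomicLongArray(DATANODES);
        Random sharedRng = new Random();     // java.util.Random is thread-safe (but contended)
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < PICKS_PER_THREAD; i++) {
                    int dn = useThreadLocal
                        ? ThreadLocalRandom.current().nextInt(DATANODES)
                        : sharedRng.nextInt(DATANODES);
                    hits.incrementAndGet(dn);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return hits;
    }

    static void report(String label, AtomicLongArray hits) {
        long min = Long.MAX_VALUE, max = 0, zeroDns = 0;
        for (int i = 0; i < hits.length(); i++) {
            long h = hits.get(i);
            min = Math.min(min, h);
            max = Math.max(max, h);
            if (h == 0) {
                zeroDns++;
            }
        }
        System.out.printf("%s  min=%d  max=%d  DNs-with-zero-picks=%d%n", label, min, max, zeroDns);
    }
}
{code}

On a recent JRE both sources should look similarly uniform; the harness is only meant to show how the skew (DNs with zero copies, maximum copies on any one DN) can be measured and compared between the two random sources, not to reproduce the correlation seen on the cluster.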
It should also be noted that even pre-HDFS-6268, the random choice of DN algorithm produces a rather non-uniform distribution of copies. This is not due to any bug, but purely a consequence of random distributions being much less uniform than one might intuitively expect. In the above case, pre-HDFS-6268 yields something like a range of 3 to 25 block copies on each DN (a simple balls-into-bins estimate is sketched at the end of this message). Surprisingly, the performance penalty of this non-uniformity is not as big as might be expected (maybe only 10-20%), but HDFS should do better, and in any case the capacity issue remains. Round-robin choice of DN? Better awareness of which DNs currently store fewer blocks? It's not sufficient that the total number of blocks be similar on each DN at the end; at each point in time, no individual DN should receive a disproportionate number of blocks at once (which could be a danger of an RR algorithm).

This JIRA should probably be limited to tracking the ThreadLocalRandom issue, with the random-choice non-uniformity tracked in a separate one.
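For reference, here is the balls-into-bins sketch mentioned above (plain Java, not Hadoop code; the block and DN counts simply mirror the TeraGen test). It drops the ~8192 non-primary replica copies uniformly at random onto 638 DNs and reports the spread that an ideal random placement would produce:

{code:java}
import java.util.Random;

/**
 * Balls-into-bins sketch: place the non-primary replica copies from the
 * TeraGen test (2 extra copies per block, 4096 blocks) uniformly at random
 * onto 638 DNs and report how uneven the result is even with a well-behaved
 * random source.
 */
public class RandomPlacementSpread {
    public static void main(String[] args) {
        final int datanodes = 638;
        final int copies = 2 * 4096;          // ~12.8 copies expected per DN
        int[] perDn = new int[datanodes];
        Random rng = new Random();
        for (int c = 0; c < copies; c++) {
            perDn[rng.nextInt(datanodes)]++;
        }
        int min = Integer.MAX_VALUE, max = 0;
        for (int h : perDn) {
            min = Math.min(min, h);
            max = Math.max(max, h);
        }
        System.out.printf("expected=%.1f  min=%d  max=%d%n",
            (double) copies / datanodes, min, max);
    }
}
{code}

Typical runs put the minimum in the low single digits and the maximum in the mid-to-high 20s, consistent with the 3-to-25 range observed pre-HDFS-6268: uneven, but nothing like the empty-DN / 100-plus-copy extremes described above.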