hadoop-hdfs-dev mailing list archives

From "Jeff Buell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-7122) Very poor distribution of replication copies
Date Mon, 22 Sep 2014 18:26:33 GMT
Jeff Buell created HDFS-7122:

             Summary: Very poor distribution of replication copies
                 Key: HDFS-7122
                 URL: https://issues.apache.org/jira/browse/HDFS-7122
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.3.0
         Environment: medium-large environments with hundreds to thousands of DNs will be most affected,
but potentially all environments.
            Reporter: Jeff Buell

Since HDFS-6268, the distribution of replica block copies across the DataNodes (replicas 2, 3, ...,
as distinguished from the first "primary" replica) is extremely poor, to the point that TeraGen
slows down by as much as 3X for certain configurations.  This is almost certainly due to the
introduction of the ThreadLocal Random in HDFS-6268.  The mechanism appears to be that this change
causes the random numbers generated in the different threads to be correlated, which prevents a
truly random choice of DN for each replica copy.
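
The suspected mechanism can be illustrated outside HDFS. The sketch below is a toy model in
Python, not the NameNode code: it assumes, purely for illustration, that every handler thread's
generator produces the same sequence (modeled by giving each one the same seed) and compares
that with independently seeded generators. Apart from the 638 DNs, the constants and the
helper name `distinct_targets` are hypothetical.

```python
import random

NUM_DNS = 638           # DataNode count from the report
NUM_HANDLERS = 10       # hypothetical number of handler threads
PICKS_PER_HANDLER = 50  # hypothetical number of target choices per handler


def distinct_targets(seeds):
    """Return the set of DN indices chosen by one generator per seed."""
    chosen = set()
    for seed in seeds:
        rng = random.Random(seed)  # stands in for a per-thread generator
        for _ in range(PICKS_PER_HANDLER):
            chosen.add(rng.randrange(NUM_DNS))
    return chosen


# Assumption: fully correlated generators, modeled as identical seeds.
correlated = distinct_targets([42] * NUM_HANDLERS)
# Independent generators, modeled as distinct seeds.
independent = distinct_targets(range(NUM_HANDLERS))

print("distinct DNs chosen, correlated :", len(correlated))
print("distinct DNs chosen, independent:", len(independent))
```

With identical sequences, the 500 choices can land on at most 50 distinct DNs, so most DNs are
never chosen, while independent generators spread the same number of choices over far more DNs.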

Testing details:
1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256MB block size.
This results in about 6 "primary" blocks on each DN.  With replication=3, each DN will receive
on average 12 more copies of blocks whose primary replica is on another DN.  Because DNs are
selected at random, exactly 12 copies per DN are not expected, but I found that about 160 DNs (1/4
of all DNs!) received absolutely no copies, while one DN received over 100 copies, and the
elapsed time increased by about 3X compared with a pre-HDFS-6268 distro.  There was no pattern to
which DNs didn't receive copies, nor was the set of such DNs repeatable from run to run.  In
addition to the performance problem, there could be capacity problems due to one or a few DNs
running out of space.

Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I don't see
a significant difference from the Apache Hadoop source in this regard.  The workaround to recover
the previous behavior is to set dfs.namenode.handler.count=1, but of course this has scaling
implications for large clusters.
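
For comparison, the numbers above can be checked against what independent, uniform random
selection would give. The following is a minimal sketch under simplifying assumptions that are
not in the report: 1 TB is taken as 10^12 bytes, every copy is placed on a DN chosen uniformly
at random, and rack awareness and the rule that replicas of one block go to different DNs are
ignored. The seed is arbitrary.

```python
import random

NUM_DNS = 638
BLOCK_SIZE = 256 * 1024**2   # 256MB block size
DATA_SIZE = 10**12           # assumption: 1 TB counted as 10^12 bytes
REPLICATION = 3

num_blocks = DATA_SIZE // BLOCK_SIZE          # about 3725 blocks, ~6 per DN
num_copies = num_blocks * (REPLICATION - 1)   # replicas 2 and 3 of every block
mean_copies = num_copies / NUM_DNS            # close to the 12 quoted above

# Expected number of DNs receiving no copy if each copy picks a DN
# independently and uniformly: n * (1 - 1/n)^m.
expected_empty = NUM_DNS * (1 - 1 / NUM_DNS) ** num_copies

# One simulated run of the same idealized placement.
rng = random.Random(7122)  # arbitrary fixed seed for repeatability
copies = [0] * NUM_DNS
for _ in range(num_copies):
    copies[rng.randrange(NUM_DNS)] += 1

print("blocks:", num_blocks, "mean copies per DN:", round(mean_copies, 2))
print("expected DNs with no copies:", expected_empty)
print("simulated: empty DNs =", copies.count(0), "max copies =", max(copies))
```

Under these assumptions the expected number of DNs with no copies is far below one, so roughly
160 empty DNs and a single DN with over 100 copies are not what independent random choices
would produce.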

I recommend that the ThreadLocal Random part of HDFS-6268 be reverted until a better algorithm
can be implemented and tested.  Testing should include a case with many DNs and a small number
of blocks on each.

It should also be noted that even pre-HDFS-6268, the random choice of DN algorithm produces
a rather non-uniform distribution of copies.  This is not due to any bug, but purely a case
of random distributions being much less uniform than one might intuitively expect.  In the
above case, pre-HDFS-6268 yields something like a range of 3 to 25 block copies on each DN.
Surprisingly, the performance penalty of this non-uniformity is not as big as might be expected
(maybe only 10-20%), but HDFS should do better, and in any case the capacity issue remains.
Possible alternatives are a round-robin choice of DN, or better awareness of which DNs currently
store fewer blocks.  It is not sufficient that the total number of blocks is similar on each DN
at the end; it is also necessary that at each point in time no individual DN receives a
disproportionate number of blocks at once (which could be a danger of a RR algorithm).
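
As a rough illustration of the second idea, the sketch below compares plain uniform random
placement with choosing the less loaded of two randomly drawn DNs. This is a swapped-in
technique used only as one hypothetical example of "awareness of which DNs store fewer blocks";
it is not HDFS's placement policy and not a proposal from this report. The copy count of 7450
(about 12 per DN) assumes 1 TB = 10^12 bytes, and rack awareness is ignored.

```python
import random

NUM_DNS = 638
NUM_COPIES = 7450  # assumption: ~2 extra copies for each of ~3725 blocks


def place_uniform(rng):
    """Place every copy on a DN chosen uniformly at random."""
    load = [0] * NUM_DNS
    for _ in range(NUM_COPIES):
        load[rng.randrange(NUM_DNS)] += 1
    return load


def place_less_loaded_of_two(rng):
    """Draw two random candidate DNs and place the copy on the emptier one."""
    load = [0] * NUM_DNS
    for _ in range(NUM_COPIES):
        a = rng.randrange(NUM_DNS)
        b = rng.randrange(NUM_DNS)
        target = a if load[a] <= load[b] else b
        load[target] += 1
    return load


uniform = place_uniform(random.Random(1))            # arbitrary seed
two_choice = place_less_loaded_of_two(random.Random(1))

print("uniform    : min", min(uniform), "max", max(uniform))
print("two-choice : min", min(two_choice), "max", max(two_choice))
```

In this toy model the two-candidate variant gives a visibly narrower min-to-max range, and since
candidates are still drawn at random, consecutive copies are not steered to one DN in sequence
the way a naive round-robin could.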

This JIRA should probably be limited to tracking the ThreadLocal issue, with the random choice
issue tracked in a separate one.
