From: "Jeff Buell (JIRA)"
To: hdfs-dev@hadoop.apache.org
Date: Mon, 22 Sep 2014 18:26:33 +0000 (UTC)
Subject: [jira] [Created] (HDFS-7122) Very poor distribution of replication copies

Jeff Buell created HDFS-7122:
--------------------------------

             Summary: Very poor distribution of replication copies
                 Key: HDFS-7122
                 URL: https://issues.apache.org/jira/browse/HDFS-7122
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.3.0
         Environment: medium-to-large environments with hundreds to thousands of DNs will be most affected, but potentially all environments
            Reporter: Jeff Buell

Summary:

Since HDFS-6268, the distribution of replica block copies across the DataNodes (replicas 2, 3, ... as distinguished from the first "primary" replica) is extremely poor, to the point that TeraGen slows down by as much as 3X for certain configurations. This is almost certainly due to the introduction of ThreadLocalRandom in HDFS-6268. The mechanism appears to be that this change causes the random numbers drawn in the different handler threads to be correlated, preventing a truly random choice of DN for each replica copy.

Testing details:

1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256 MB block size. This results in about 6 "primary" blocks on each DN (1 TB / 256 MB = 4096 blocks spread over 638 DNs). With replication=3, there will on average be about 12 more copies on each DN that are copies of blocks from other DNs. Because DNs are selected at random, exactly 12 copies per DN are not expected, but I found that about 160 DNs (1/4 of all DNs!) received absolutely no copies, while one DN received over 100 copies, and the elapsed time increased by about 3X relative to a pre-HDFS-6268 distro. There was no pattern to which DNs didn't receive copies, nor was the set of such DNs repeatable run-to-run. In addition to the performance problem, there could be capacity problems due to one or a few DNs running out of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I don't see a significant difference from the Apache Hadoop source in this regard. The workaround to recover the previous behavior is to set dfs.namenode.handler.count=1, but of course this has scaling implications for large clusters.

I recommend that the ThreadLocalRandom part of HDFS-6268 be reverted until a better algorithm can be implemented and tested. Testing should include a case with many DNs and a small number of blocks on each.
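As a rough starting point for that kind of test, below is a minimal stand-alone harness (plain Java, not Hadoop's block-placement code; the DN/thread/pick counts are made-up illustrative values). It mimics many handler threads each picking a random DN for a replica copy, using either one shared Random or ThreadLocalRandom, and reports how evenly the picks land:

{code:java}
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Toy harness: many threads each pick random DN indices, then we report
 * the min/max picks per DN and how many DNs got no picks at all -- the
 * same kind of skew measured in the TeraGen test above.
 */
public class ReplicaTargetSkew {
    static final int DATANODES = 638;        // cluster size from the test above
    static final int THREADS = 64;           // stand-in for NameNode handler threads
    static final int PICKS_PER_THREAD = 128; // replica targets chosen per thread

    public static void main(String[] args) throws InterruptedException {
        report("shared Random    ", run(false));
        report("ThreadLocalRandom", run(true));
    }

    static AtomicLongArray run(boolean useThreadLocal) throws InterruptedException {
        AtomicLongArray hits = new AtomicLongArray(DATANODES);
        Random sharedRng = new Random();     // java.util.Random is thread-safe (but contended)
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < PICKS_PER_THREAD; i++) {
                    int dn = useThreadLocal
                        ? ThreadLocalRandom.current().nextInt(DATANODES)
                        : sharedRng.nextInt(DATANODES);
                    hits.incrementAndGet(dn);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return hits;
    }

    static void report(String label, AtomicLongArray hits) {
        long min = Long.MAX_VALUE, max = 0, zeroDns = 0;
        for (int i = 0; i < hits.length(); i++) {
            long h = hits.get(i);
            min = Math.min(min, h);
            max = Math.max(max, h);
            if (h == 0) {
                zeroDns++;
            }
        }
        System.out.printf("%s  min=%d  max=%d  DNs-with-zero-picks=%d%n", label, min, max, zeroDns);
    }
}
{code}

On a recent JRE both sources should look similarly uniform; the harness is only meant to show how the skew (DNs with zero copies, maximum copies on any one DN) can be measured and compared between the two random sources, not to reproduce the correlation seen on the cluster.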
It should also be noted that even pre-HDFS-6268, the random choice of DN algorithm produces a rather non-uniform distribution of copies. This is not due to any bug, but purely a consequence of random distributions being much less uniform than one might intuitively expect. In the above case, pre-HDFS-6268 yields something like a range of 3 to 25 block copies on each DN (a simple balls-into-bins estimate is sketched at the end of this message). Surprisingly, the performance penalty of this non-uniformity is not as big as might be expected (maybe only 10-20%), but HDFS should do better, and in any case the capacity issue remains. Round-robin choice of DN? Better awareness of which DNs currently store fewer blocks? It's not sufficient that the total number of blocks be similar on each DN at the end; at each point in time, no individual DN should receive a disproportionate number of blocks at once (which could be a danger of an RR algorithm).

This JIRA should probably be limited to tracking the ThreadLocalRandom issue, with the random-choice non-uniformity tracked in a separate one.
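For reference, here is the balls-into-bins sketch mentioned above (plain Java, not Hadoop code; the block and DN counts simply mirror the TeraGen test). It drops the ~8192 non-primary replica copies uniformly at random onto 638 DNs and reports the spread that an ideal random placement would produce:

{code:java}
import java.util.Random;

/**
 * Balls-into-bins sketch: place the non-primary replica copies from the
 * TeraGen test (2 extra copies per block, 4096 blocks) uniformly at random
 * onto 638 DNs and report how uneven the result is even with a well-behaved
 * random source.
 */
public class RandomPlacementSpread {
    public static void main(String[] args) {
        final int datanodes = 638;
        final int copies = 2 * 4096;          // ~12.8 copies expected per DN
        int[] perDn = new int[datanodes];
        Random rng = new Random();
        for (int c = 0; c < copies; c++) {
            perDn[rng.nextInt(datanodes)]++;
        }
        int min = Integer.MAX_VALUE, max = 0;
        for (int h : perDn) {
            min = Math.min(min, h);
            max = Math.max(max, h);
        }
        System.out.printf("expected=%.1f  min=%d  max=%d%n",
            (double) copies / datanodes, min, max);
    }
}
{code}

Typical runs put the minimum in the low single digits and the maximum in the mid-to-high 20s, consistent with the 3-to-25 range observed pre-HDFS-6268: uneven, but nothing like the empty-DN / 100-plus-copy extremes described above.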