Date: Fri, 26 Sep 2014 20:18:37 +0000 (UTC)
From: "Jeff Buell (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-7122) Very poor distribution of replication copies

     [ https://issues.apache.org/jira/browse/HDFS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Buell updated HDFS-7122:
-----------------------------
    Attachment: copies_per_slave.jpg

> Very poor distribution of replication copies
> --------------------------------------------
>
>                 Key: HDFS-7122
>                 URL: https://issues.apache.org/jira/browse/HDFS-7122
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.3.0
>        Environment: medium-large environments with 100's to 1000's of DNs will be most affected, but potentially all environments.
>            Reporter: Jeff Buell
>            Assignee: Andrew Wang
>            Priority: Blocker
>              Labels: performance
>         Attachments: copies_per_slave.jpg
>
>
> Summary:
> Since HDFS-6268, the distribution of replica block copies across the DataNodes (replicas 2,3,... as distinguished from the first "primary" replica) is extremely poor, to the point that TeraGen slows down by as much as 3X for certain configurations. This is almost certainly due to the introduction of ThreadLocalRandom in HDFS-6268. The mechanism appears to be that this change causes all the random numbers in the threads to be correlated, thus preventing a truly random choice of DN for each replica copy.
> Testing details:
> 1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256 MB block size. This results in 6 "primary" blocks on each DN. With replication=3, there will be on average 12 more copies on each DN that are copies of blocks from other DNs. Because of the random selection of DNs, exactly 12 copies are not expected, but I found that about 160 DNs (1/4 of all DNs!) received absolutely no copies, while one DN received over 100 copies, and the elapsed time increased by about 3X from a pre-HDFS-6268 distro. There was no pattern to which DNs didn't receive copies, nor was the set of such DNs repeatable run-to-run. In addition to the performance problem, there could be capacity problems due to one or a few DNs running out of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I don't see a significant difference from the Apache Hadoop source in this regard. The workaround to recover the previous behavior is to set dfs.namenode.handler.count=1, but of course this has scaling implications for large clusters.
> I recommend that the ThreadLocalRandom part of HDFS-6268 be reverted until a better algorithm can be implemented and tested. Testing should include a case with many DNs and a small number of blocks on each.
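The effect of correlated per-thread random choices described above can be illustrated with a small simulation. This is only a sketch of the suspected mechanism, not the actual JDK ThreadLocalRandom behavior: the thread count and copies-per-thread numbers are hypothetical, and the "correlated" case is modeled as the worst case of every handler thread starting from the same seed.

```python
import random
from collections import Counter

NUM_DNS = 638           # DataNodes in the report's test
COPIES_PER_THREAD = 200 # hypothetical replica copies placed per thread
NUM_THREADS = 64        # hypothetical NameNode handler-thread count

def place_copies(seed):
    """One 'handler thread' picking a DN uniformly for each replica copy."""
    rng = random.Random(seed)
    return [rng.randrange(NUM_DNS) for _ in range(COPIES_PER_THREAD)]

# Correlated case: every thread starts from the same seed, so every
# thread emits the identical DN sequence (worst-case correlation).
correlated = Counter()
for _ in range(NUM_THREADS):
    correlated.update(place_copies(seed=42))

# Independent case: each thread gets its own seed.
independent = Counter()
for t in range(NUM_THREADS):
    independent.update(place_copies(seed=t))

print("correlated:  DNs that received any copy =", len(correlated))
print("independent: DNs that received any copy =", len(independent))
print("correlated:  max copies on one DN =", max(correlated.values()))
print("independent: max copies on one DN =", max(independent.values()))
```

In the correlated case at most 200 of the 638 DNs can ever receive a copy and each chosen DN receives a multiple of 64 copies, while independent generators spread the same number of copies across essentially all DNs, which mirrors the "many DNs got zero copies, one got over 100" symptom above.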
> It should also be noted that even pre-HDFS-6268, the random choice of DN algorithm produces a rather non-uniform distribution of copies. This is not due to any bug, but purely a case of random distributions being much less uniform than one might intuitively expect. In the above case, pre-HDFS-6268 yields something like a range of 3 to 25 block copies on each DN. Surprisingly, the performance penalty of this non-uniformity is not as big as might be expected (maybe only 10-20%), but HDFS should do better, and in any case the capacity issue remains. Round-robin choice of DN? Better awareness of which DNs currently store fewer blocks? It's not sufficient that the total number of blocks be similar on each DN at the end; at each point in time, no individual DN should receive a disproportionate number of blocks at once (which could be a danger of a RR algorithm).
> Probably should limit this jira to tracking the ThreadLocal issue, and track the random choice issue in another one.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
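The spread produced by a genuinely uniform random choice is easy to check with a small simulation. The 638-DN and 12-copies-per-DN figures are taken from the test above; the fixed seed is arbitrary, so the exact min/max will vary, but the spread reliably resembles the roughly 3-to-25 range reported.

```python
import random
from collections import Counter

NUM_DNS = 638
MEAN_COPIES = 12                     # average replica copies per DN in the test
TOTAL_COPIES = NUM_DNS * MEAN_COPIES

rng = random.Random(0)               # arbitrary seed, for repeatability only
counts = Counter(rng.randrange(NUM_DNS) for _ in range(TOTAL_COPIES))

# DNs never drawn receive 0 copies, so include them explicitly.
per_dn = [counts.get(dn, 0) for dn in range(NUM_DNS)]
print("min copies on a DN:", min(per_dn))
print("max copies on a DN:", max(per_dn))
```

Per-DN counts here are approximately Poisson with mean 12, so a min/max spread of roughly 3 to 25 is the expected behavior of uniform random placement, not a defect, which is the point made above.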