Subject: Re: How to fix Under-replication
From: Keith Wyss <wyssman@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 17 Apr 2013 17:45:26 -0700

In case anyone finds this tucked away on the internet in the future and is
in a situation like ours...

We only had 3 racks, with 4 machines on one rack, compared with almost 70
nodes in our virtually provisioned service.

I found that Hadoop calculates the maximum number of replicas per rack as

    int maxNodesPerRack = (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2;

The division is integer division, so with three racks this works out to at
most two replicas per rack, while with two racks you can store three
replicas on a rack. A quick worked example of the arithmetic follows below.
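To make that arithmetic concrete, here is a tiny standalone snippet (not
Hadoop code, just the same expression with the rack count swapped in; the
class name is made up) and what it prints for a replication factor of 3:

    public class MaxReplicasPerRackCheck {
        public static void main(String[] args) {
            int totalNumOfReplicas = 3;  // i.e. dfs.replication
            for (int numOfRacks = 1; numOfRacks <= 4; numOfRacks++) {
                // Same integer division as the line quoted above.
                int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;
                System.out.println(numOfRacks + " rack(s) -> at most "
                        + maxNodesPerRack + " replicas per rack");
            }
        }
    }

That prints 4, 3, 2 and 2 for one through four racks, i.e. exactly the
two-per-rack limit that was biting us once our small racks filled up.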
We are decommissioning the 4 nodes so that, even if we don't have rack
awareness, we'll at least get three copies.

The proper way to fix this is to adjust Hadoop's topology script, as
described here:
http://wiki.apache.org/hadoop/topology_rack_awareness_scripts
(If a shell script is awkward in your environment, there is a rough Java
sketch of the same idea at the very bottom of this message, below the
quoted thread.)

Alternatively, HDFS-385 might be included in your distribution, allowing
you to control the block placement policy:
https://issues.apache.org/jira/browse/HDFS-385

Cheers,
Keith


On Wed, Apr 17, 2013 at 1:27 PM, Keith Wyss <wyssman@gmail.com> wrote:
> Hello there.
>
> I am operating a cluster that is consistently unable to create three
> replicas for a large percentage of blocks.
>
> I think I have a good idea of why this is the case, but I would like
> suggestions about how to fix it.
>
> First of all, let's begin with the namenode logs. There are many
> instances of this statement:
>
> WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to
> place enough replicas, still in need of 1
>
> The cluster is only just over 50% full and has well over 3 nodes. This,
> and the absence of other widespread problems, rules out the possibility
> that there is simply not room for the blocks.
>
> That leaves the possibility that the namenode is unable to satisfy the
> block placement policy, which I believe is what is happening.
>
> I read in
> http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
> that if there are more than 2 racks, then a block must be present on at
> least two racks.
>
> This makes sense, but our network situation is a little bizarre. It
> consists of:
> - a small number of machines that have a dedicated datacenter/rack/host
>   configuration
>   -- These are spread across a few racks.
> - a large number of machines that are provisioned through an internal
>   hardware-as-a-service provider.
>   -- These are listed as one rack.
>
> The details of the rack allocation for the machines provisioned from the
> hardware-as-a-service provider are abstracted away and are not
> obtainable. The connection to that provider has a lot of bandwidth, so
> this is not as crazy as it sounds.
>
> Our problem is that the machines on all the smaller racks have now filled
> up the space left available by dfs.datanode.du.reserved. This means that
> all blocks written since those machines ran out of space are missing one
> replica.
>
> Is there a way to configure Hadoop to create a third replica anyway
> (aside from changing the topology script implementation)?
>
> What can I do to either confirm or deny my suspicions?
>
> Thanks for your help,
> Keith
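As promised above, here is a rough, untested sketch of the Java route for
the rack mapping. Instead of a shell topology script, Hadoop also lets you
point topology.node.switch.mapping.impl (net.topology.node.switch.mapping.impl
on newer releases) at a class implementing
org.apache.hadoop.net.DNSToSwitchMapping. The class name, the "hwaas-" host
prefix and the choice of four synthetic racks below are all made up for
illustration, and whether carving a virtual pool into synthetic racks makes
sense for your network is a separate question.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.net.DNSToSwitchMapping;
    import org.apache.hadoop.net.NetworkTopology;

    // Minimal sketch: map hosts to rack paths. Hadoop hands resolve() a list
    // of hostnames or IP addresses and expects one rack path per name, in order.
    public class ExampleRackMapping implements DNSToSwitchMapping {

        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>(names.size());
            for (String name : names) {
                if (name.startsWith("hwaas-")) {
                    // Spread the hardware-as-a-service pool over a few
                    // synthetic racks so the placement policy has more than
                    // one rack with free space to choose from.
                    int bucket = Math.abs(name.hashCode() % 4);
                    racks.add("/virtual/rack" + bucket);
                } else {
                    // Dedicated machines: fall back to the default rack here;
                    // a real mapping would return their actual rack.
                    racks.add(NetworkTopology.DEFAULT_RACK);
                }
            }
            return racks;
        }

        // Declared on the interface in newer Hadoop versions; a harmless
        // extra method on older ones. Nothing is cached here, so it's a no-op.
        public void reloadCachedMappings() {
        }
    }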