From: Michael Segel <michael_segel@hotmail.com>
To: common-user@hadoop.apache.org
Subject: RE: decommissioning node woes
Date: Fri, 18 Mar 2011 11:59:19 -0500

Uhmm...

If you use the default bandwidth allocation and you have a lot of data on the node you want to decommission, you can be waiting for weeks before you can safely take the node out.

If you wanted to, you could take the nodes down one by one, running an fsck between removals so the under-replicated blocks get identified and re-replicated. ("Normally the Namenode automatically corrects most of the recoverable failures.") A sketch of that check is below.

Once you see those blocks successfully replicated, you can take down the next.

Is it clean? No, not really.
Is it dangerous? No, not really.
Do I recommend it? No, but it's a quick and dirty way of doing things...

Or you can raise dfs.balance.bandwidthPerSec in the configuration files; the default is pretty low. The downside is that you have to bounce the cluster to pick up the new value, and it could hurt performance if set too high. An example setting is below.
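Roughly, the in-between check looks like this (a sketch from memory, so double-check the exact command against your release; the path is just an example):

    # run a full filesystem check and pull out the replication lines;
    # the pattern matches both the per-file detail ("Under replicated")
    # and the summary counter ("Under-replicated blocks:")
    hadoop fsck / | grep -i "under.replicated"

Wait for the under-replicated count to fall back to zero before you pull the next node.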
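For the bandwidth route, the property goes in hdfs-site.xml (hadoop-site.xml on older releases). As I recall the default is 1048576 (1 MB/sec); the value here is only an illustration, not a recommendation:

    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <!-- bytes per second per datanode; 10485760 = 10 MB/sec -->
      <value>10485760</value>
    </property>

Again, the datanodes only pick this value up on a restart.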
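And just to recap the supported route Rita describes below: it hinges on dfs.hosts.exclude already being set in the namenode's config before it starts. The hostname and file path here are placeholders:

    # hdfs-site.xml must name the exclude file, e.g.:
    #   <property>
    #     <name>dfs.hosts.exclude</name>
    #     <value>/etc/hadoop/excludes</value>
    #   </property>

    # add the datanode to the exclude file, then have the namenode re-read it
    echo "dn42.example.com" >> /etc/hadoop/excludes
    hadoop dfsadmin -refreshNodes

    # watch the node move from "Decommission in progress" to "Decommissioned"
    hadoop dfsadmin -report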
HTH

-Mike

> From: tdunning@maprtech.com
> Date: Fri, 18 Mar 2011 09:38:31 -0700
> Subject: Re: decommissioning node woes
> To: common-user@hadoop.apache.org
> CC: james@tynt.com
>
> Unless the last copy is on that node.
>
> Decommissioning is the only safe way to shut off 10 nodes at once. Doing
> them one at a time and waiting for replication to (asymptotically) recover
> is painful and error prone.
>
> On Fri, Mar 18, 2011 at 9:08 AM, James Seigel wrote:
>
> > Just a note. If you just shut the node off, the blocks will replicate
> > faster.
> >
> > James.
> >
> > On 2011-03-18, at 10:03 AM, Ted Dunning wrote:
> >
> > > If nobody else more qualified is willing to jump in, I can at least
> > > provide some pointers.
> > >
> > > What you describe is a bit surprising. I have zero experience with any
> > > 0.21 version, but decommissioning was working well in much older
> > > versions, so this would be a surprising regression.
> > >
> > > The observations you have aren't all inconsistent with how
> > > decommissioning should work. The fact that your nodes still show as up
> > > after starting the decommissioning isn't so strange. The idea is that
> > > no new data will be put on the node, nor should it be counted as a
> > > replica, but it will help in reading data.
> > >
> > > So that isn't such a big worry.
> > >
> > > The fact that it takes forever and a day, however, is a big worry. I
> > > cannot provide any help there just offhand.
> > >
> > > What happens when a datanode goes down? Do you see under-replicated
> > > files? Does the number of such files decrease over time?
> > >
> > > On Fri, Mar 18, 2011 at 4:23 AM, Rita wrote:
> > >
> > >> Any help?
> > >>
> > >> On Wed, Mar 16, 2011 at 9:36 PM, Rita wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I have been struggling with decommissioning data nodes. I have a 50+
> > >>> data node cluster (no MR), with each server holding about 2TB of
> > >>> storage. I split the nodes into 2 racks.
> > >>>
> > >>> I edit the 'exclude' file and then do a -refreshNodes. I see the node
> > >>> immediately under 'Decommissioned nodes', but I also see it as a
> > >>> 'live' node! Even though I wait 24+ hours, it's still like this. I
> > >>> suspect it's a bug in my version. The datanode process is still
> > >>> running on the node I am trying to decommission. So, sometimes I
> > >>> kill -9 the process and I see the 'under replicated' blocks... this
> > >>> can't be the normal procedure.
> > >>>
> > >>> There were even times that I had corrupt blocks because I was
> > >>> impatient -- waited 24-34 hours.
> > >>>
> > >>> I am using the 0.21.0 release (23 August 2010):
> > >>> http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available
> > >>>
> > >>> Is this a known bug? Is there anything else I need to do to
> > >>> decommission a node?
> > >>>
> > >>> --
> > >>> --- Get your facts first, then you can distort them as you please.--
> > >>
> > >>
> > >> --
> > >> --- Get your facts first, then you can distort them as you please.--