Date: Wed, 7 Jul 2010 21:57:31 -0400
Subject: Re: rebalancing replication help
From: Edward Capriolo
To: common-user@hadoop.apache.org
Cc: "general@hadoop.apache.org"

On Wed, Jul 7, 2010 at 9:18 PM, Arun Ramakrishnan wrote:
> Looks like there is not much activity in the hdfs-user list. So, am reposting it in the general list.
>
> Hi guys.
> I have a few related questions. I am going to lay out the steps I have taken. Please comment on what I can do better.
>
> I was trying to add 5 nodes to my existing 10 node cluster and also increase the replication factor from 2 to 3.
> I thought I don't have to run the balancer because it would most likely put the new replicas onto the new nodes.
>
> There are about 500k blocks.
> I wanted to get it all stabilized (replication and balancing) within 24 hours. It's more than 24 hours now and fsck reports 30% under replication. Is there a way to force hdfs to balance/replicate more aggressively?
>
> It would be great if someone explained what/when things happen to blocks in the context of
>
> 1) Rebalancing
>
> 2) -setrep
>
> 3) Restarting cluster with a higher/lower replication factor.
>
> A few questions and a few issues here.
>
> 1) When you restart the cluster with a higher than previous replication value, does it also apply to existing blocks or only to new blocks being created?
>
> 2) Does the balancer take into account under replication of blocks or does it blindly start moving existing blocks to reach threshold?
>
>
> A very specific problem. I am having this strange problem where the -setrep hangs on one particular block for hours. Is this because it's corrupt? But fsck said it's healthy.
>
>
> Thanks
> Arun
>

> 2) -setrep

This will change the replication factor of an existing file (in the background it should start replicating).

> 2) Does the balancer take into account under replication of blocks or does it blindly start moving existing blocks to reach threshold?

Files most under replication should be prioritized.

> 3) Restarting cluster with a higher/lower replication factor.

This only affects new files that are created where the client has not specified a value.

> A very specific problem. I am having this strange problem where the -setrep hangs on one particular block for hours. Is this because it's corrupt? But fsck said it's healthy.

Not sure.

> It's more than 24 hours now and fsck reports 30% under

There is a configuration setting for maximum replication bandwidth. You might have to tune that.
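
Before tuning anything, a rough estimate from the numbers in this thread may help decide whether it is worth it. The sketch below is only back-of-envelope arithmetic, not anything HDFS itself reports. It assumes that fsck's "30% under replication" means 30% of the roughly 500k blocks still lack their third replica, that every block needed exactly one new replica, and that the re-replication rate stays constant. The function name `remaining_hours` is made up for illustration.

```python
def remaining_hours(total_blocks, under_replicated_fraction, elapsed_hours):
    """Estimate hours left, assuming a constant re-replication rate.

    Assumption: each block needs exactly one additional replica
    (replication factor raised from 2 to 3).
    """
    done = total_blocks * (1.0 - under_replicated_fraction)
    remaining = total_blocks * under_replicated_fraction
    if done <= 0:
        raise ValueError("no progress yet; rate cannot be estimated")
    rate_per_hour = done / elapsed_hours  # replicas created per hour so far
    return remaining / rate_per_hour


# Figures from the thread: ~500k blocks, 30% still under-replicated after 24 hours.
hours = remaining_hours(500_000, 0.30, 24.0)
print(f"about {hours:.1f} more hours at the observed rate")  # about 10.3 more hours
```

If that estimate is acceptable, waiting may be enough; if not, the bandwidth setting mentioned above is the thing to look at.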