Date: Wed, 26 Sep 2007 17:11:11 -0400
From: "kate rhodes"
To: hadoop-user@lucene.apache.org
Subject: Re: a million log lines from one job tracker startup

Is it me, or is a million log lines a bit excessive if this is a sequence of events that is not only possible but intended? If the namenode is supposed to go into safe mode every time it starts, then things shouldn't be throwing exceptions about it doing exactly what it was supposed to do. Should I file a ticket about this? I've just spent the past few hours trying to figure out what I'd misconfigured in order to generate all this crap, only to find out it's just a horribly misleading and verbose set of errors.

-Kate

On 9/26/07, Devaraj Das wrote:
> Ted, to clarify my earlier mail... What I thought was that Kate was curious to
> know why thousands of lines of the delete-exception log messages appeared at
> the JobTracker startup.
> Now, what you pointed out regarding corrupted blocks, etc., makes sense - it
> might prevent the namenode from coming out of safe mode. But the namenode goes
> into safe mode every time it starts up, and even without any dfs corruption
> the JobTracker would get those exceptions about deleting mapred's system
> directory until the NameNode comes out of safe mode. From his log message
> "Starting RUNNING", it is clear that the namenode did come out of safe mode
> and entered a consistent state; otherwise the JobTracker wouldn't have entered
> the RUNNING state.
>
> > -----Original Message-----
> > From: Ted Dunning [mailto:tdunning@veoh.com]
> > Sent: Wednesday, September 26, 2007 9:31 PM
> > To: hadoop-user@lucene.apache.org
> > Subject: Re: a million log lines from one job tracker startup
> >
> > It looks like you have a problem with insufficient replication or a
> > corrupted file. This can happen if you are running with a low replication
> > count and have lost a datanode or few. I have also seen this happen in
> > association with somewhat aggressive nuking of hadoop jobs or processes,
> > or an overfull disk (I am not sure which). In that case, I wound up with
> > missing blocks for map-reduce intermediate output.
> >
> > The simplest, but almost always unsatisfactory, repair is to simply nuke
> > the contents of HDFS and reload cleanly.
> >
> > It is also possible that the namenode will eventually be able to repair
> > the situation on its own.
> >
> > You may also be able to repair the file system piecemeal if the
> > persistent problems that you are experiencing have to do with files that
> > you don't care about. To do this, you would use hadoop fsck / to find
> > what the problems really are, turn off safe mode by hand (warning, Will
> > Robinson, DANGER), and delete the files that are causing problems. This
> > is somewhat laborious. I think that there is a "force repair" option on
> > fsck, but I was unable to get that right.
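[Editor's sketch of the piecemeal repair described above. The file path is a placeholder, and the fsck `-move`/`-delete` options are my best guess at the "force repair" mentioned; verify against your Hadoop version's docs before running anything, since leaving safe mode by hand disables real safeguards.]

```
# 1. Find out what is actually broken (reports corrupt/missing blocks per file)
hadoop fsck /

# 2. Manually take the namenode out of safe mode (DANGER, Will Robinson)
hadoop dfsadmin -safemode leave

# 3. Delete the specific files fsck reported as corrupt (placeholder path)
hadoop dfs -rm /path/to/corrupt/file

# fsck also has -move (quarantine to /lost+found) and -delete options,
# which may be the "force repair" referred to above
hadoop fsck / -delete
```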
> > If you are a real cowboy, you can simply turn off safe mode and go
> > forward. If the goobered files are not important to you, this can let you
> > get some work done. This is a really bad idea, of course, since you are
> > circumventing some really important safeguards.
> >
> > My own impression, having experienced this as well as having watched
> > files slooowwly be replicated more widely after changing the replication
> > count for a bunch of files, is that I would love to be able to tell the
> > namenode to be very aggressive about repairing replication issues.
> > Normally, the slow pace that is used for fixing under-replication is a
> > good thing, since it allows you to continue with additional work while
> > replication goes on, but there are situations where you really want the
> > issues resolved sooner.
> >
> > On 9/26/07 7:25 AM, "kate rhodes" wrote:
> >
> > >> 2007-09-26 09:58:06,472 INFO org.apache.hadoop.mapred.JobTracker:
> > >> problem cleaning system directory:
> > >> /home/krhodes/hadoop_files/temp/krhodes/mapred/system
> > >> org.apache.hadoop.ipc.RemoteException:
> > >> org.apache.hadoop.dfs.SafeModeException: Cannot delete
> > >> /home/krhodes/hadoop_files/temp/krhodes/mapred/system. Name node is
> > >> in safe mode.
> > >> Safe mode will be turned off automatically.
> > >>     at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1222)
> > >>     at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1200)
> > >>     at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:399)
> > >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >>     at
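[Editor's note: the startup noise Kate complains about comes from the JobTracker retrying its system-directory delete while the namenode is still in safe mode. One way scripts of that era could sidestep it, assuming `dfsadmin`'s blocking safemode subcommand behaves as documented, is to wait for the namenode before starting the JobTracker:]

```
# Check whether the namenode is currently in safe mode
hadoop dfsadmin -safemode get

# Block until the namenode leaves safe mode on its own
hadoop dfsadmin -safemode wait

# ...only then start the JobTracker / submit jobs
```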