Date: Wed, 26 Sep 2007 17:11:11 -0400
From: "kate rhodes"
To: hadoop-user@lucene.apache.org
Subject: Re: a million log lines from one job tracker startup

Is it me, or is a million log lines a bit excessive if this is a sequence of events that is not only possible but intended? If the namenode is supposed to go into safe mode every time it starts, then things shouldn't be throwing exceptions about it doing exactly what it was supposed to do. Should I file a ticket about this? I've just spent the past few hours trying to figure out what I'd misconfigured in order to generate all this crap, only to find out it's just a horribly misleading and verbose set of errors.

-Kate

On 9/26/07, Devaraj Das wrote:
> Ted, to clarify my earlier mail... What I thought was that Kate was curious to
> know why thousands of lines of the delete-exception log messages appeared at
> the JobTracker startup.
> Now, what you pointed out regarding corrupted blocks, etc., makes sense - it
> might prevent the namenode from coming out of safe mode. But the namenode goes
> into safe mode every time it starts up, and even without any dfs corruption
> the JobTracker would get those exceptions about deleting mapred's system
> directory until the NameNode comes out of safe mode. From his log message
> "Starting RUNNING", it is clear that the namenode did come out of safe mode
> and entered a consistent state; otherwise the JobTracker wouldn't have entered
> the RUNNING state.
>
> > -----Original Message-----
> > From: Ted Dunning [mailto:tdunning@veoh.com]
> > Sent: Wednesday, September 26, 2007 9:31 PM
> > To: hadoop-user@lucene.apache.org
> > Subject: Re: a million log lines from one job tracker startup
> >
> > It looks like you have a problem with insufficient replication or a
> > corrupted file. This can happen if you are running with a low replication
> > count and have lost a datanode or few. I have also seen this happen in
> > association with somewhat aggressive nuking of hadoop jobs or processes,
> > or an overfull disk (I am not sure which). In that case, I wound up with
> > missing blocks for map-reduce intermediate output.
> >
> > The simplest, but almost always unsatisfactory, repair is to simply nuke
> > the contents of HDFS and reload cleanly.
> >
> > It is also possible that the namenode will eventually be able to repair
> > the situation on its own.
> >
> > You may also be able to repair the file system piecemeal if the
> > persistent problems that you are experiencing have to do with files that
> > you don't care about. To do this, you would use hadoop fsck / to find
> > what the problems really are, turn off safe mode by hand (warning, Will
> > Robinson, DANGER), and delete the files that are causing problems. This
> > is somewhat laborious. I think that there is a "force repair" option on
> > fsck, but I was unable to get that right.
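[Editor's sketch of the piecemeal repair described above. The file path is a placeholder, and the fsck `-move`/`-delete` options are my best guess at the "force repair" mentioned; verify against your Hadoop version's docs before running anything, since leaving safe mode by hand disables real safeguards.]

```
# 1. Find out what is actually broken (reports corrupt/missing blocks per file)
hadoop fsck /

# 2. Manually take the namenode out of safe mode (DANGER, Will Robinson)
hadoop dfsadmin -safemode leave

# 3. Delete the specific files fsck reported as corrupt (placeholder path)
hadoop dfs -rm /path/to/corrupt/file

# fsck also has -move (quarantine to /lost+found) and -delete options,
# which may be the "force repair" referred to above
hadoop fsck / -delete
```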
> > If you are a real cowboy, you can simply turn off safe mode and go
> > forward. If the goobered files are not important to you, this can let you
> > get some work done. This is a really bad idea, of course, since you are
> > circumventing some really important safeguards.
> >
> > My own impression, having experienced this as well as having watched
> > files slooowwly be replicated more widely after changing the replication
> > count for a bunch of files, is that I would love to be able to tell the
> > namenode to be very aggressive about repairing replication issues.
> > Normally, the slow pace that is used for fixing under-replication is a
> > good thing, since it allows you to continue with additional work while
> > replication goes on, but there are situations where you really want the
> > issues resolved sooner.
> >
> > On 9/26/07 7:25 AM, "kate rhodes" wrote:
> >
> > >> 2007-09-26 09:58:06,472 INFO org.apache.hadoop.mapred.JobTracker:
> > >> problem cleaning system directory:
> > >> /home/krhodes/hadoop_files/temp/krhodes/mapred/system
> > >> org.apache.hadoop.ipc.RemoteException:
> > >> org.apache.hadoop.dfs.SafeModeException: Cannot delete
> > >> /home/krhodes/hadoop_files/temp/krhodes/mapred/system. Name node is
> > >> in safe mode.
> > >> Safe mode will be turned off automatically.
> > >>     at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1222)
> > >>     at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1200)
> > >>     at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:399)
> > >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >>     at
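[Editor's note: the startup noise Kate complains about comes from the JobTracker retrying its system-directory delete while the namenode is still in safe mode. One way scripts of that era could sidestep it, assuming `dfsadmin`'s blocking safemode subcommand behaves as documented, is to wait for the namenode before starting the JobTracker:]

```
# Check whether the namenode is currently in safe mode
hadoop dfsadmin -safemode get

# Block until the namenode leaves safe mode on its own
hadoop dfsadmin -safemode wait

# ...only then start the JobTracker / submit jobs
```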