hadoop-common-user mailing list archives

From "Devaraj Das" <d...@yahoo-inc.com>
Subject RE: a million log lines from one job tracker startup
Date Wed, 26 Sep 2007 19:39:37 GMT
Ted, to clarify my earlier mail... So what I thought was Kate was curious to
know why thousands of lines of the delete exception log messages appeared at
the JobTracker startup.  Now, what you pointed out regarding corrupted
block, etc., makes sense - it might prevent the namenode from coming out of
the safe mode, but the namenode goes into safe mode every time it starts up,
and even without any dfs corruption, the JobTracker would get those
exceptions to do with deleting the mapred's system directory until the
NameNode comes out of safe mode. From the log message "Starting RUNNING",
it is clear that the namenode did come out of the Safe mode and entered a
consistent state, otherwise the JobTracker wouldn't have entered the RUNNING
state.
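
To illustrate the behaviour described above, here is a minimal simulation of a retry loop that keeps trying to delete a system directory while a namenode is in safe mode. It is a sketch only, not Hadoop's actual JobTracker code: FakeNameNode, SafeModeError, clean_system_dir, the log_every parameter and the path used are all hypothetical names invented for this example. It shows why one log line per failed attempt adds up quickly, and how logging only every Nth failure would shrink the output.

```python
class SafeModeError(Exception):
    """Stand-in for the SafeModeException seen in the quoted log (hypothetical)."""


class FakeNameNode:
    """Hypothetical namenode that leaves safe mode after a fixed number of calls."""

    def __init__(self, calls_until_ready):
        self.calls_left = calls_until_ready

    def delete(self, path):
        # While "in safe mode", every delete request is rejected.
        if self.calls_left > 0:
            self.calls_left -= 1
            raise SafeModeError(f"Cannot delete {path}. Name node is in safe mode.")
        return True


def clean_system_dir(namenode, path, log, log_every=1):
    """Retry the delete until it succeeds; log one line per `log_every` failures.

    Returns the total number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            namenode.delete(path)
            log.append("Starting RUNNING")
            return attempts
        except SafeModeError as err:
            # Log the 1st, (1 + log_every)th, ... failure and skip the rest.
            if (attempts - 1) % log_every == 0:
                log.append(f"problem cleaning system directory: {err}")
            # A real loop would sleep here before retrying.


noisy, quiet = [], []
clean_system_dir(FakeNameNode(1000), "/example/mapred/system", noisy)
clean_system_dir(FakeNameNode(1000), "/example/mapred/system", quiet, log_every=100)
print(len(noisy), len(quiet))
```

With log_every=1 the log grows by one line per rejected delete, which matches the pattern in the quoted output; whether the real JobTracker sleeps between attempts or throttles its logging is not something this sketch claims to know.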

> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com] 
> Sent: Wednesday, September 26, 2007 9:31 PM
> To: hadoop-user@lucene.apache.org
> Subject: Re: a million log lines from one job tracker startup
> It looks like you have a problem with insufficient 
> replication or a corrupted file.  This can happen if you are 
> running with a low replication count and have lost a datanode 
> or a few.  I have also seen this happen associated with 
> somewhat aggressive nuking of hadoop jobs or processes or 
> overfull disk (I am not sure which).  In that case, I wound 
> up with missing blocks for map reduce intermediate output.
> The simplest, but almost always unsatisfactory repair is to 
> simply nuke the contents of HDFS and reload cleanly.
> It is also possible that the namenode will eventually be able 
> to repair the situation.
> You may also be able to repair the file system piece-meal if 
> the persistent problems that you are experiencing have to do 
> with files that you don't care about.  To do this, you would 
> use hadoop fsck / to find what the problems really are, turn 
> off safe mode by hand (warning, Will Robinson, DANGER), and 
> delete the files that are causing problems.  This is somewhat 
> laborious.  I think that there is a "force repair" option on 
> fsck, but I was unable to get that right.
> If you are a real cowboy, you can simply turn off safe mode 
> and go forward.
> If the goobered files are not important to you, this can let 
> you get some work done.  This is a really bad idea, of course, 
> since you are circumventing some really important safe-guards.
> My own impression of having experienced this as well as 
> having watched files slooowwly be replicated more widely 
> after changing the replication count for a bunch of files is 
> that I would love to be able to tell the namenode to be very 
> aggressive about repairing replication issues.  Normally, the 
> slow pace that is used for fixing under-replication is a good 
> thing since it allows you to continue with additional work 
> while replication goes on, but there are situations where you 
> really want the issues resolved sooner.
> On 9/26/07 7:25 AM, "kate rhodes" <masukomi@gmail.com> wrote:
> >> 2007-09-26 09:58:06,472 INFO org.apache.hadoop.mapred.JobTracker:
> >> problem cleaning system directory:
> >> /home/krhodes/hadoop_files/temp/krhodes/mapred/system
> >> org.apache.hadoop.ipc.RemoteException:
> >> org.apache.hadoop.dfs.SafeModeException: Cannot delete
> >> /home/krhodes/hadoop_files/temp/krhodes/mapred/system. Name node is
> >> in safe mode.
> >> Safe mode will be turned off automatically.
> >>         at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1222)
> >>         at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1200)
> >>         at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:399)
> >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>         at
