hadoop-common-user mailing list archives

From Tom White <...@cloudera.com>
Subject Re: Recovery following disk full
Date Mon, 20 Jul 2009 20:39:08 GMT
Is this an area where the Offline Image Viewer might be able to help
in the future? It's not available for 0.18.3, but it seems like it could
be extended into a tool to help with c) in Todd's description.

Tom

On Mon, Jul 20, 2009 at 8:30 PM, Todd Lipcon <todd@cloudera.com> wrote:
> Hi Arv,
>
> It sounds like the edits log in your dfs.name.dir is corrupted because one of
> its records got cut off when the disk filled up. When the namenode tries to
> replay the edit log, it attempts to read the entirety of that record and hits
> end-of-file unexpectedly - hence the EOFException.
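>
> For reference, the EOFException itself just means DataInputStream.readFully()
> hit end-of-file partway through a length-prefixed record. A tiny standalone
> illustration (not Hadoop code - the file name and sizes are made up):
>
>     import java.io.*;
>
>     public class TruncatedRecordDemo {
>         public static void main(String[] args) throws IOException {
>             // Made-up file, standing in for the edits log.
>             File f = new File("edits-demo");
>
>             DataOutputStream out = new DataOutputStream(new FileOutputStream(f));
>             out.writeShort(40);        // the record claims 40 bytes of payload...
>             out.write(new byte[10]);   // ...but only 10 made it to disk before "disk full"
>             out.close();
>
>             DataInputStream in = new DataInputStream(new FileInputStream(f));
>             byte[] payload = new byte[in.readShort()];
>             in.readFully(payload);     // throws java.io.EOFException, same as in your stack trace
>             in.close();
>         }
>     }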
>
> Your options at this point are:
>
> a) If you have a second copy of dfs.name.dir, it should also have a second
> "edits" file. If that copy is longer, it's possible it is not corrupted.
> I'd back up both copies, then duplicate the longer edit log into both name
> dirs and try to start the namenode.
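>
> Something like this untested sketch is roughly what I mean (the two paths are
> just placeholders - substitute whatever your dfs.name.dir entries actually are):
>
>     import java.io.*;
>
>     public class CompareAndSyncEdits {
>         // Placeholder paths for the two dfs.name.dir copies.
>         static final File EDITS_A = new File("/data/1/dfs/name/current/edits");
>         static final File EDITS_B = new File("/data/2/dfs/name/current/edits");
>
>         public static void main(String[] args) throws IOException {
>             System.out.println("copy A = " + EDITS_A.length() + " bytes");
>             System.out.println("copy B = " + EDITS_B.length() + " bytes");
>
>             // Back up both copies before touching anything.
>             copy(EDITS_A, new File(EDITS_A.getPath() + ".bak"));
>             copy(EDITS_B, new File(EDITS_B.getPath() + ".bak"));
>
>             // Overwrite the shorter copy with the longer (hopefully intact) one.
>             if (EDITS_A.length() >= EDITS_B.length()) {
>                 copy(EDITS_A, EDITS_B);
>             } else {
>                 copy(EDITS_B, EDITS_A);
>             }
>         }
>
>         static void copy(File src, File dst) throws IOException {
>             InputStream in = new FileInputStream(src);
>             OutputStream out = new FileOutputStream(dst);
>             byte[] buf = new byte[64 * 1024];
>             int n;
>             while ((n = in.read(buf)) != -1) {
>                 out.write(buf, 0, n);
>             }
>             in.close();
>             out.close();
>         }
>     }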
>
> b) If you were running a secondary namenode, you should have a checkpoint of
> the fsimage from a few hours before the failure. You can recover the fsimage
> from there. You'll lose some time period's worth of metadata edits, but you
> should be able to get the FS running again.
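>
> Before copying anything around, it's worth checking how recent that checkpoint
> actually is, e.g. with a throwaway check like this (the path is a placeholder -
> look at whatever fs.checkpoint.dir points to on the secondary):
>
>     import java.io.File;
>     import java.util.Date;
>
>     public class InspectCheckpoint {
>         public static void main(String[] args) {
>             // Placeholder path for the secondary's checkpoint directory.
>             File checkpointDir = new File("/data/1/dfs/namesecondary/current");
>
>             File[] files = checkpointDir.listFiles();
>             if (files == null) {
>                 System.out.println("nothing found at " + checkpointDir);
>                 return;
>             }
>             for (int i = 0; i < files.length; i++) {
>                 System.out.println(files[i].getName() + "  " + files[i].length()
>                     + " bytes  modified " + new Date(files[i].lastModified()));
>             }
>         }
>     }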
>
> c) A last-ditch option is to truncate the edit log at the correct offset so
> that you avoid the EOFException. Doing this would probably involve adding
> some logging statements to the FSEditLog replay so you can see the byte
> offset of the last record it is trying to read, and then truncating the edit
> log right before that offset. This is somewhat complicated and I wouldn't
> attempt it unless you (a) really need the data and (b) don't have any other
> option.
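>
> If you do go down that road, the truncation itself is trivial once you know the
> offset - something like this on a *copy* of the edits file (both the path and
> the offset below are placeholders; the offset has to come from the logging you
> add):
>
>     import java.io.IOException;
>     import java.io.RandomAccessFile;
>
>     public class TruncateEdits {
>         public static void main(String[] args) throws IOException {
>             // Placeholders: a copy of the edits file, and the byte offset where
>             // the last complete record ends (taken from your added log output).
>             String editsCopy = "/tmp/edits.truncated";
>             long lastGoodOffset = 123456L;
>
>             RandomAccessFile raf = new RandomAccessFile(editsCopy, "rw");
>             raf.setLength(lastGoodOffset);   // drop the partial record at the tail
>             raf.close();
>         }
>     }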
>
> -Todd
>
> On Mon, Jul 20, 2009 at 12:27 PM, Arv Mistry <arv@kindsight.net> wrote:
>
>> Hi,
>>
>> I'm getting the following error when starting up the namenode.
>>
>> What happened was that one of our disks filled up; we reclaimed the
>> disk space and tried to restart the hadoop daemons, but the namenode
>> is now not starting up.
>>
>> Does anybody have any clues on how to recover from this? I've tried
>> searching through the Jira reports but found nothing obvious.
>>
>> Appreciate any input, thanks.
>>
>> Cheers Arv
>>
>> 2009-07-20 14:57:41,712 INFO org.apache.hadoop.dfs.NameNode:
>> STARTUP_MSG:
>> /************************************************************
>> STARTUP_MSG: Starting NameNode
>> STARTUP_MSG:   host = qa-cs1/192.168.0.54
>> STARTUP_MSG:   args = []
>> STARTUP_MSG:   version = 0.18.3-dev
>> STARTUP_MSG:   build =  -r ; compiled by 'bamboo' on Mon Nov 10 15:58:40
>> PST 2008
>> ************************************************************/
>> 2009-07-20 14:57:41,801 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
>> Initializing RPC Metrics with hostName=NameNode, port=9000
>> 2009-07-20 14:57:41,805 INFO org.apache.hadoop.dfs.NameNode: Namenode up
>> at: 192.168.0.54/192.168.0.54:9000
>> 2009-07-20 14:57:41,808 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>> Initializing JVM Metrics with processName=NameNode, sessionId=null
>> 2009-07-20 14:57:41,816 INFO org.apache.hadoop.dfs.NameNodeMetrics:
>> Initializing NameNodeMeterics using context
>> object:org.apache.hadoop.metrics.spi.NullContext
>> 2009-07-20 14:57:41,869 INFO org.apache.hadoop.fs.FSNamesystem:
>> fsOwner=hadoopadmin,hadoopadmin
>> 2009-07-20 14:57:41,869 INFO org.apache.hadoop.fs.FSNamesystem:
>> supergroup=supergroup
>> 2009-07-20 14:57:41,869 INFO org.apache.hadoop.fs.FSNamesystem:
>> isPermissionEnabled=true
>> 2009-07-20 14:57:41,877 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
>> Initializing FSNamesystemMeterics using context
>> object:org.apache.hadoop.metrics.spi.NullContext
>> 2009-07-20 14:57:41,878 INFO org.apache.hadoop.fs.FSNamesystem:
>> Registered FSNamesystemStatusMBean
>> 2009-07-20 14:57:41,908 INFO org.apache.hadoop.dfs.Storage: Number of
>> files = 1808
>> 2009-07-20 14:57:42,153 INFO org.apache.hadoop.dfs.Storage: Number of
>> files under construction = 1
>> 2009-07-20 14:57:42,157 INFO org.apache.hadoop.dfs.Storage: Image file
>> of size 256399 loaded in 0 seconds.
>> 2009-07-20 14:57:42,167 ERROR
>> org.apache.hadoop.dfs.LeaseManager:
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605290.data
>> not found in lease.paths
>> (=[/opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605294.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605298.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605303.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605328.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605335.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605337.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605340.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605346.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605401.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605432.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605451.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605464.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605487.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605499.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605539.data])
>> 2009-07-20 14:57:42,167 ERROR
>> org.apache.hadoop.dfs.LeaseManager:
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605294.data
>> not found in lease.paths
>> (=[/opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605298.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605303.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605328.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605335.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605337.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605340.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605346.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605401.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605432.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605451.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605464.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605487.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605499.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605539.data])
>> 2009-07-20 14:57:42,169 ERROR
>> org.apache.hadoop.dfs.LeaseManager:
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605298.data
>> not found in lease.paths
>> (=[/opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605303.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605328.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605335.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605337.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605340.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605346.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605401.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605290.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605294.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605432.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605451.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605464.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605487.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605499.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605539.data])
>> 2009-07-20 14:57:42,169 ERROR
>> org.apache.hadoop.dfs.LeaseManager:
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605303.data
>> not found in lease.paths
>> (=[/opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605328.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605335.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605337.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605340.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605346.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605401.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605290.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605294.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605432.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605451.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605464.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605487.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605499.data,
>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605539.data])
>> 2009-07-20 14:57:42,171 ERROR org.apache.hadoop.fs.FSNamesystem:
>> FSNamesystem initialization failed.
>> java.io.EOFException
>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>        at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>        at org.apache.hadoop.dfs.FSImage.readString(FSImage.java:1368)
>>        at
>> org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:447)
>>        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:846)
>>        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:675)
>>        at
>> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:289)
>>        at
>> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
>>        at
>> org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:296)
>>        at
>> org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:275)
>>
>>
>
