hadoop-hdfs-issues mailing list archives

From "Bach Bui (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling
Date Tue, 21 Aug 2012 17:39:37 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438898#comment-13438898 ]

Bach Bui commented on HDFS-3771:

I reproduced this case by simulating the described NN shutdown situation with an exit(0) right
after jas.startLogSegment(segmentTxId) in org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(long,

This in effect created an edits_inprogress file with no transactions in it. The NN now
fails to restart, because the error handling code cannot handle this case.

An easy workaround is to delete the edits_inprogress file. As Todd mentioned, no data
is lost when we do this. Am I right, Todd?
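
A more cautious variant of this workaround is to move the file out of the storage directory
instead of deleting it, so that it can be put back if needed. The sketch below only
illustrates that idea. InProgressQuarantine, moveAside and the backup directory are
hypothetical names, not part of Hadoop. The edits_inprogress_ prefix is taken from the NN
log quoted in the issue. It assumes the NN is stopped while the file is moved.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of Hadoop: moves one in-progress edit log aside. */
final class InProgressQuarantine {
    /**
     * Moves the given in-progress edit log into backupDir and returns its new path.
     * Refuses any file whose name does not start with "edits_inprogress_".
     */
    static Path moveAside(Path inProgressFile, Path backupDir) throws IOException {
        String name = inProgressFile.getFileName().toString();
        if (!name.startsWith("edits_inprogress_")) {
            throw new IllegalArgumentException("Not an in-progress edit log: " + name);
        }
        Files.createDirectories(backupDir);
        // Without REPLACE_EXISTING, the move fails if a backup with this name already exists.
        return Files.move(inProgressFile, backupDir.resolve(name));
    }
}
```

Keeping the file in a backup directory costs nothing and leaves a way back if the
assumption that it holds no transactions turns out to be wrong.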

Ultimately, we need to fix the error handling code in org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.LogGroup.planAllInProgressRecovery()
so that it can detect this situation. This does not seem very complicated, as it is only
a corner case. Please correct me if I am wrong.
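
To illustrate the kind of check I have in mind, here is a minimal standalone sketch. It
does not use the real Hadoop classes: LogSketch and RecoveryPlanSketch are hypothetical
stand-ins, and the number of transactions in each file is assumed to be known already.
The idea is that when every in-progress copy of a segment is empty, the plan is simply
"nothing to replay" rather than an IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one edit log file found in a storage directory. */
final class LogSketch {
    final String path;
    final long startTxId;
    final boolean inProgress;
    final long numTransactions; // assumed to be known; 0 means the segment was opened but never written

    LogSketch(String path, long startTxId, boolean inProgress, long numTransactions) {
        this.path = path;
        this.startTxId = startTxId;
        this.inProgress = inProgress;
        this.numTransactions = numTransactions;
    }
}

/** Hypothetical planner for a group of logs that all start at the same txid. */
final class RecoveryPlanSketch {
    /**
     * Returns the in-progress logs that hold transactions and need recovery.
     * An empty result means every copy was empty, so the segment can be skipped
     * instead of aborting startup.
     */
    static List<LogSketch> planAllInProgress(List<LogSketch> group) {
        List<LogSketch> usable = new ArrayList<>();
        for (LogSketch log : group) {
            if (!log.inProgress) {
                throw new IllegalArgumentException("Finalized log in in-progress group: " + log.path);
            }
            if (log.numTransactions > 0) {
                usable.add(log);
            }
        }
        return usable;
    }
}
```

A real fix would still have to tell an empty segment apart from one that is unreadable,
which this sketch does not attempt.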

Could someone also tell me how the NN is shut down? It seems to me this situation only occurs
if the NN threads are killed without waiting for them to clean up after themselves.

> Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit
> log rolling
> ------------------------------------------------------------------------------------------------
>                 Key: HDFS-3771
>                 URL: https://issues.apache.org/jira/browse/HDFS-3771
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.3, 2.0.0-alpha
>         Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, using Kerberos
> based security
>            Reporter: patrick white
>            Priority: Critical
> Our 0.23.3 nightly HDFS regression suite encountered a particularly nasty issue recently,
> which resulted in the cluster's default Namenode being unable to restart. This was on a 20
> node Federated cluster with security. The cause appears to be that the NN was just starting
> to roll its edit log when a shutdown occurred. The shutdown was intentional, to restart the
> cluster as part of an automated test.
> The tests that were running do not appear to be the issue in themselves. The cluster
> was just wrapping up an adminReport subset, and this failure case has not reproduced so far,
> nor was it failing previously. It looks like a chance occurrence of sending the shutdown just
> as the edit log roll was begun.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 4. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 5. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were are all in-progress
> 6. FSImageTransactionalStorageInspector: Marking log at /grid/[PATH]/edits_inprogress_0000000000000023967
> as corrupt since it has no transactions in it.
> 7. NameNode: Exception in namenode join [main]java.lang.IllegalStateException: No non-corrupt
> logs for txid 23967
> => NN start attempts continue to cycle trying to restart but can't, failing on the
> same exception due to lack of non-corrupt edit logs
> If observations are correct and issue is from shutdown happening as edit logs are rolling,
> does the NN have an equivalent to the conventional fs 'sync' blocking action that should be
> called, or perhaps has a timing hole?
