hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4128) 2NN gets stuck in inconsistent state if edit log replay fails in the middle
Date Wed, 31 Oct 2012 00:10:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487392#comment-13487392 ]

Colin Patrick McCabe commented on HDFS-4128:
--------------------------------------------

Aborting definitely seems like the safest thing to do. Do we know whether all transactions are
applied atomically? That is, if one fails and throws an exception in the middle, is there a
rollback of whatever it did to the FSImage? I'm not clear on that point.
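
To make the question concrete, here is a toy sketch of the failure mode. It is not Hadoop code: ReplaySketch, ToyImage, replay, and replayAllOrNothing are hypothetical names, and the txids are made up. It only assumes that replay advances an in-memory "last applied txid" as it goes, and that the next segment must start exactly one past that value.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model for discussion only; none of these names exist in Hadoop. */
class ReplaySketch {
    /** Stand-in for the in-memory image: remembers which txids were applied. */
    static final class ToyImage {
        final List<Long> applied = new ArrayList<>();
        long lastAppliedTxId;

        ToyImage(long lastAppliedTxId) {
            this.lastAppliedTxId = lastAppliedTxId;
        }
    }

    /**
     * Applies txids first..last in order, mutating the image as it goes.
     * Throws at failAt to simulate a failure in the middle of the segment.
     */
    static void replay(ToyImage image, long first, long last, long failAt) {
        long expected = image.lastAppliedTxId + 1;
        if (first != expected) {
            throw new IllegalStateException(
                "gap in toy edit log: expected txid " + expected + ", but got txid " + first);
        }
        for (long txid = first; txid <= last; txid++) {
            if (txid == failAt) {
                // Everything before failAt has already been applied and stays applied.
                throw new RuntimeException("simulated failure at txid " + txid);
            }
            image.applied.add(txid);
            image.lastAppliedTxId = txid;
        }
    }

    /**
     * One hypothetical way to discard partial work: replay into a copy and
     * hand the copy back only if the whole segment succeeded.
     */
    static ToyImage replayAllOrNothing(ToyImage image, long first, long last, long failAt) {
        ToyImage copy = new ToyImage(image.lastAppliedTxId);
        copy.applied.addAll(image.applied);
        replay(copy, first, last, failAt); // on failure the original is untouched
        return copy;
    }
}
```

With replay, a failure at txid 105 in segment 100-109 leaves the image at txid 104, so a later attempt to load segment 110-119 is rejected as a gap. With replayAllOrNothing the caller's image stays at txid 99, which is roughly what aborting the process achieves by throwing the in-memory state away. Copying a real namespace would be far too expensive, so treat this purely as a model of the two outcomes.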
                
> 2NN gets stuck in inconsistent state if edit log replay fails in the middle
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-4128
>                 URL: https://issues.apache.org/jira/browse/HDFS-4128
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>
> We saw the following issue in a cluster:
> - The 2NN downloads an edit log segment:
> {code}
> 2012-10-29 12:30:57,433 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /xxxxxxx/current/edits_0000000000049136809-0000000000049176162 expecting start txid #49136809
> {code}
> - It fails in the middle of replay due to an OOME:
> {code}
> 2012-10-29 12:31:21,021 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation AddOp [length=0, path=/xxxxxxxx
> java.lang.OutOfMemoryError: Java heap space
> {code}
> - Future checkpoints then fail because the prior edit log replay only got halfway through the stream:
> {code}
> 2012-10-29 12:32:21,214 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading /xxxxx/current/edits_0000000000049176163-0000000000049177224 expecting start txid #49144432
> 2012-10-29 12:32:21,216 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
> java.io.IOException: There appears to be a gap in the edit log.  We expected txid 49144432, but got txid 49176163.
> {code}
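
The txids in the quoted logs can be checked with a little arithmetic. The sketch below is not Hadoop code: TxidGapSketch and its method names are hypothetical, and it assumes that the "expected" txid in the error message is one past the last transaction the 2NN counted as applied.

```java
/** Arithmetic on the txids quoted in the report; hypothetical helper, not Hadoop code. */
class TxidGapSketch {
    // Values copied from the log lines in the issue description.
    static final long SEGMENT_FIRST = 49136809L;      // first txid of the segment that failed
    static final long SEGMENT_LAST = 49176162L;       // last txid of that segment
    static final long EXPECTED_NEXT = 49144432L;      // txid the 2NN expected afterwards
    static final long NEXT_SEGMENT_FIRST = 49176163L; // first txid of the following segment

    /** Transactions counted as applied before the failure (assumption: EXPECTED_NEXT - 1 was the last). */
    static long appliedBeforeFailure() {
        return EXPECTED_NEXT - SEGMENT_FIRST;
    }

    /** Transactions of the failed segment that were never applied. */
    static long neverApplied() {
        return SEGMENT_LAST - EXPECTED_NEXT + 1;
    }

    /** The next segment is only loadable if it starts at the expected txid. */
    static boolean hasGap() {
        return NEXT_SEGMENT_FIRST != EXPECTED_NEXT;
    }
}
```

Under that assumption, 7623 of the segment's 39354 transactions were applied and 31731 were skipped, which is why the next segment, starting at 49176163, is reported as a gap.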

