hbase-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2933) Always Skip Errors during Log Recovery
Date Wed, 25 Aug 2010 02:21:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902279#action_12902279

Todd Lipcon commented on HBASE-2933:

I can't remember the particular JIRA either, but it seems to me that the regionserver shouldn't
even get to the point of replaying recovered edits if the logs haven't been completely split.
I.e., the phases should be:

1) Original RS is writing logs and dies.
2) Master A notices the failure and starts splitting logs. It gets halfway through writing region_1/oldlog.
3) Master A dies.
4) Master B takes over, and knows from ZK that the RS's recovery is incomplete.
5) Master B should remove the half-written log split done by Master A and retry from
the start.

I.e., no region server should attempt to open region 1 until the logs have been properly split.
Thus, the RS should never see an EOFException on log recovery, since an EOFException indicates
that log splitting is incomplete.
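The rule in step 5 can be sketched as follows. This is a hypothetical illustration, not actual HBase code: `SplitState`, `outputsToDiscard`, and the in-memory "ZK" flag are stand-ins for the real master/ZooKeeper bookkeeping. The point is only that a new master treats any split not marked complete as garbage to delete before redoing the split from the original WAL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the cleanup rule a new master would apply on
// takeover: if ZK does not record the split as complete, every output file
// written so far is suspect and must be discarded before re-splitting.
public class LogSplitRecovery {

    // Simulated recovery state for one dead regionserver.
    public static class SplitState {
        public final List<String> partialOutputs = new ArrayList<>();
        public boolean splitCompleteInZk = false; // would live in ZooKeeper
    }

    // Returns the split outputs the new master must delete before retrying.
    public static List<String> outputsToDiscard(SplitState state) {
        if (state.splitCompleteInZk) {
            return new ArrayList<>();              // split finished; keep everything
        }
        return new ArrayList<>(state.partialOutputs); // redo from the start
    }

    public static void main(String[] args) {
        SplitState s = new SplitState();
        s.partialOutputs.add("region_1/oldlog");   // Master A died halfway through
        System.out.println(outputsToDiscard(s));   // prints [region_1/oldlog]
    }
}
```

With this invariant, the half-written region_1/oldlog from step 2 is removed in step 5 before any region server is asked to open region 1.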

> Always Skip Errors during Log Recovery
> --------------------------------------
>                 Key: HBASE-2933
>                 URL: https://issues.apache.org/jira/browse/HBASE-2933
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Nicolas Spiegelberg
>            Assignee: Nicolas Spiegelberg
> While testing a cluster, we hit the following error during region assignment.  We
were killing the master during a long run of splits.  We think what happened is that the HMaster
was killed while splitting, woke up & split again.  If this happens, we will have 2 files:
1 partially written and 1 complete one.  Since encountering partial log splits upon Master
failure is considered normal behavior, we should continue at the RS level if we encounter
an EOFException & not a filesystem-level exception, even with skip.errors == false.
> 2010-08-20 16:59:07,718 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error
opening MailBox_dsanduleac,57db45276ece7ce03ef7e8d9969eb189:99900000000008@facebook.com,1280960828959.7c542d24d4496e273b739231b01885e6.
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1902)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1932)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:121)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:113)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:1981)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:1956)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:1915)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:344)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1490)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1437)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1345)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-08-20 16:59:07,719 ERROR org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater:
Aborting open of region 7c542d24d4496e273b739231b01885e6
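The behavior the description proposes can be sketched like this. This is an illustration only, not the real HBase replay path: `EditReader` and the edits are stand-ins for `SequenceFileLogReader` and WAL entries. An EOFException mid-read means the writer died inside a record, so it is treated as end-of-log; any other IOException still propagates and fails the open, even with skip.errors == false.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical sketch (stand-in types, not actual HBase classes) of
// tolerating a truncated trailing record while replaying recovered edits.
public class RecoveredEditsReplay {

    public interface EditReader {
        String next() throws IOException; // null at a clean end-of-log
    }

    // Returns the number of edits replayed before end-of-log or EOF.
    public static int replay(EditReader reader) throws IOException {
        int applied = 0;
        while (true) {
            String edit;
            try {
                edit = reader.next();
            } catch (EOFException eof) {
                // Partial record left by an interrupted split: stop
                // replaying, but do not abort the region open.
                break;
            }
            if (edit == null) {
                break;                    // clean end of file
            }
            applied++;                    // the edit would be applied here
        }
        return applied;
    }
}
```

Note that only EOFException is swallowed; a ChecksumException or any other IOException would still abort the open, which matches the distinction the description draws between partial-split artifacts and genuine filesystem errors.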

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
