Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Message-ID: <3546490.388471285351593005.JavaMail.jira@thor>
Date: Fri, 24 Sep 2010 14:06:33 -0400 (EDT)
From: "Nicolas Spiegelberg (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Subject: [jira] Updated: (HBASE-2933) Skip EOF Errors during Log Recovery
In-Reply-To: <18489878.482591282352837722.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HBASE-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Spiegelberg updated HBASE-2933:
---------------------------------------

    Attachment: HBASE-2933.patch

Handles EOFException (EOFE) and the IOException (IOE) that stack mentioned.  SequenceFile.Reader can throw a few more IOEs, so this isn't 100% fail-proof.  The general problem seems to be that we need to differentiate between a Network IOE and a File Format IOE.  A File Format IOE is idempotent (retrying produces the same failure), whereas a Network IOE may not be.

Network = we need to fail and let another server try to take over.
FileFormat = our file was written or parsed incorrectly. Retrying won't fix anything. We need to just open what we have and store the original file away for later analysis.
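
A minimal sketch of that distinction, in Java. This is not the attached patch and it does not use SequenceFile.Reader or any HBase class: the class RecoverySketch, the enum FailureKind, the helpers classify/replay/sampleLog/demo and the length-prefixed record layout are all hypothetical, chosen only to illustrate the idea. It assumes that EOFException is the only File Format IOE we recognize and that every other IOE is treated as a Network IOE and rethrown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and record layout are assumptions.
class RecoverySketch {
    enum FailureKind { FILE_FORMAT, NETWORK }

    // Assumption: a truncated file surfaces as EOFException; anything else
    // is treated as a possibly transient (network/filesystem) failure.
    static FailureKind classify(IOException e) {
        return (e instanceof EOFException) ? FailureKind.FILE_FORMAT : FailureKind.NETWORK;
    }

    // Reads [int length][payload] records until the stream ends. A clean end
    // and a truncated tail both show up as EOFException here, and in both
    // cases we keep the edits read so far. Other IOEs propagate so that the
    // caller can fail and let another server retry.
    static List<byte[]> replay(InputStream in) throws IOException {
        List<byte[]> edits = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        while (true) {
            try {
                int length = data.readInt();
                byte[] record = new byte[length];
                data.readFully(record);
                edits.add(record);
            } catch (IOException e) {
                if (classify(e) == FailureKind.FILE_FORMAT) {
                    // Real code would also move the original file aside
                    // for later analysis; omitted in this sketch.
                    return edits;
                }
                throw e;
            }
        }
    }

    // Builds an in-memory log of `records` entries, then cuts off the last
    // `bytesToDrop` bytes to imitate a partially written file.
    static byte[] sampleLog(int records, int bytesToDrop) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = 0; i < records; i++) {
            byte[] payload = ("edit-" + i).getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);
            out.write(payload);
        }
        out.flush();
        return Arrays.copyOf(buf.toByteArray(), buf.size() - bytesToDrop);
    }

    // Convenience wrapper: number of edits recovered from a sample log.
    static int demo(int records, int bytesToDrop) {
        try {
            return replay(new ByteArrayInputStream(sampleLog(records, bytesToDrop))).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("complete file:  " + demo(3, 0) + " edits");
        System.out.println("truncated file: " + demo(3, 3) + " edits");
    }
}
```

With the truncated input the reader stops at the partial third record and returns the two complete edits instead of aborting the region open, which is the behavior argued for above. A garbled (rather than truncated) length prefix is not handled by this sketch.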

> Skip EOF Errors during Log Recovery
> -----------------------------------
>
>                 Key: HBASE-2933
>                 URL: https://issues.apache.org/jira/browse/HBASE-2933
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Nicolas Spiegelberg
>            Assignee: Nicolas Spiegelberg
>            Priority: Critical
>             Fix For: 0.90.0
>
>         Attachments: HBASE-2933.patch
>
>
> While testing a cluster, we hit the following error during region assignment.  We were killing the master during a long run of splits.  We think what happened is that the HMaster was killed while splitting, woke up & split again.  If this happens, we will have 2 files: 1 partially written and 1 complete.  Since encountering partial log splits upon Master failure is considered normal behavior, we should continue at the RS level if we encounter an EOFException & not a filesystem-level exception, even with skip.errors == false.
> 2010-08-20 16:59:07,718 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening MailBox_dsanduleac,57db45276ece7ce03ef7e8d9969eb189:99900000000008@facebook.com,1280960828959.7c542d24d4496e273b739231b01885e6.
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:1902)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1932)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:121)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:113)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:1981)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:1956)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:1915)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:344)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1490)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1437)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1345)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-08-20 16:59:07,719 ERROR org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Aborting open of region 7c542d24d4496e273b739231b01885e6
