hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
Date Sun, 11 Jun 2017 03:24:18 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045790#comment-16045790 ]

stack commented on HBASE-18152:
-------------------------------

Looking at the corruption in 36.log, we are indeed missing stuff off the end of the WAL. The
missing entries look like they would have been at the end of the WAL. There is at least a
second between their event and the master crash. I presume that in this second a thread is
messing around trying to persist to the WAL store. We are writing edits out-of-order in some
circumstances (see experience w/ previous log and the attached workaround patch, which helps...).
This makes for the possibility of there being holes if we expect events in-order, as the smart
verification check does. Need to dig in on the way we log.

> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> ---------------------------------------------------------------------
>
>                 Key: HBASE-18152
>                 URL: https://issues.apache.org/jira/browse/HBASE-18152
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-17537.master.002.patch, HBASE-18152.master.001.patch, pv2-00000000000000000036.log,
> pv2-00000000000000000047.log, reading_bad_wal.patch
>
>
> I've seen corruption from time to time while testing. It's rare enough. Often we can get over
> it but sometimes we can't. It took me a while to capture an instance of corruption. Turns
> out we are writing to the WAL out-of-order, which undoes a basic tenet: that WAL content is
> ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we could write
> out of order. Will try and get more insight. Meantime parking this issue here to fill data
> into.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
