hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14028) DistributedLogReplay drops edits when ITBLL 125M
Date Thu, 09 Jul 2015 00:04:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619615#comment-14619615
] 

stack commented on HBASE-14028:
-------------------------------

I added logging and reran. Found another failure type beyond the replay over a coincident
flush described above.

At a high level: the region opens and we start to replay edits, but well before the replay
can finish, the server hosting the newly opened region crashes. Edits in the WAL we were
replaying get skipped on the second attempt.

Here is open before crash:

2015-07-08 12:45:38,317 DEBUG [RS_OPEN_REGION-c2023:16020-0] wal.WALSplitter: Wrote region
seqId=hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/467eaf13c7ce1f2e1afb1c567322c9e7/recovered.edits/760185051.seqid
to file, newSeqId=760185051, maxSeqId=720162792

Here is open after crash:

2015-07-08 12:45:49,920 DEBUG [RS_OPEN_REGION-c2025:16020-1] wal.WALSplitter: Wrote region
seqId=hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/467eaf13c7ce1f2e1afb1c567322c9e7/recovered.edits/800185051.seqid
to file, newSeqId=800185051, maxSeqId=760185051

See how the newSeqId from the first open becomes the maxSeqId the second time we open. This
is broken: newSeqId is the well-padded sequence id, set well in advance of any edits that
could come in during replay. See how on the subsequent replay we end up skipping most of the
edits:

2015-07-08 12:46:25,103 INFO  [RS_LOG_REPLAY_OPS-c2025:16020-1] wal.WALSplitter: Processed
80 edits across 0 regions; edits skipped=1583; log file=hdfs://c2020.halxg.cloudera.com:8020/hbase/WALs/c2021.halxg.cloudera.com,16020,1436383987497-splitting/c2021.halxg.cloudera.com%2C16020%2C1436383987497.default.1436384632799,
length=72993715, corrupted=false, progress failed=false

(It says 80 edits across ZERO regions...)

The maximum sequence id in the WAL to replay is 720185601, which is below the maxSeqId of
760185051 recorded on the second open, even though we did not replay all edits.
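
To make the skipping concrete, here is a minimal sketch, not the actual WALSplitter code. It
assumes replay drops any edit whose sequence id is less than or equal to the region's recorded
maxSeqId. The class and method names are hypothetical, and the first two edit ids are made up
for illustration; only 720185601, 720162792 and 760185051 come from the logs above.

```java
public class SeqIdSkipSketch {

    /**
     * Hypothetical helper: counts the edits a replay would skip, assuming it
     * drops every edit whose sequence id is <= the region's recorded maxSeqId.
     */
    public static long countSkipped(long[] editSeqIds, long regionMaxSeqId) {
        long skipped = 0;
        for (long seqId : editSeqIds) {
            if (seqId <= regionMaxSeqId) {
                skipped++;
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Illustrative WAL edits; the last id is the WAL maximum from the log.
        long[] walEdits = {720162793L, 720170000L, 720185601L};

        // First open: maxSeqId=720162792, so no edit is skipped.
        System.out.println(countSkipped(walEdits, 720162792L)); // prints 0

        // Second open: the padded newSeqId (760185051) is now read back as
        // maxSeqId, so every edit in the WAL falls below it and is skipped.
        System.out.println(countSkipped(walEdits, 760185051L)); // prints 3
    }
}
```

Under that assumption, any edit not yet replayed when the first server crashed is dropped on
the second attempt, because the whole WAL sits below the padded id.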

So, at least two issues.

Let me put this aside since it looks like it won't make hbase-1.2.0 at this late stage.






> DistributedLogReplay drops edits when ITBLL 125M
> ------------------------------------------------
>
>                 Key: HBASE-14028
>                 URL: https://issues.apache.org/jira/browse/HBASE-14028
>             Project: HBase
>          Issue Type: Bug
>          Components: Recovery
>    Affects Versions: 1.2.0
>            Reporter: stack
>
> Testing DLR before 1.2.0RC gets cut, we are dropping edits.
> Issue seems to be around replay into a deployed region that is on a server that dies
> before all edits have finished replaying. Logging is sparse on sequenceid accounting so can't
> tell for sure how it is happening (and if our now accounting by Store is messing up DLR).
> Digging.
> I notice also that DLR does not refresh its cache of region location on error -- it just
> keeps trying till whole WAL fails.... 8 retries...about 30 seconds. We could do a bit of refactor
> and have the replay find region in new location if moved during DLR replay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
