hbase-issues mailing list archives

From "Jeffrey Zhong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7006) [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
Date Wed, 05 Jun 2013 23:13:22 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676458#comment-13676458 ]

Jeffrey Zhong commented on HBASE-7006:
--------------------------------------

I've thought about this issue all morning and discussed it with other folks. Basically, the
root problem is maintaining the receiving order during recovery for puts with the exact same key
+ version (timestamp). Since the log recovery process can work on multiple WAL files at the same
time, the replay order isn't guaranteed to match the receiving order. I'm listing several
options below to see what others think.
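To make the root issue concrete, here is a toy simulation (plain Python, not HBase code, with made-up names): when two puts share the exact same key and timestamp, whichever is applied last wins, so replaying WALs concurrently, out of receiving order, can flip which value a reader sees.

```python
# Toy model of a memstore: a read of (row, timestamp) returns whatever
# value was inserted last -- i.e. the last write wins on an exact tie.

def replay(edits):
    """Apply edits in the given order; return the winning value per (row, ts)."""
    store = {}
    for row, ts, value in edits:
        store[(row, ts)] = value  # a later insert overwrites an earlier one
    return store

# Two puts to the same row with the exact same timestamp, received in
# this order by the original region server (so "v2" should win):
wal1 = [("row1", 100, "v1")]
wal2 = [("row1", 100, "v2")]

in_order     = replay(wal1 + wal2)  # receiving order preserved: "v2" wins
out_of_order = replay(wal2 + wal1)  # concurrent replay reversed it: "v1" wins
```

Any scheme that replays WAL files in parallel without extra ordering information can hit the `out_of_order` case above.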

h5. Option one (the simplest one)
Document this limitation in the release notes, on the assumption that updating the same version
is a rare usage pattern.

h5. Option two (still simple but hacky)
a) disallow writes during recovery
b) hold flushes until all WALs of a recovering region are replayed. The memstore should be able
to hold them, because we only recover unflushed WAL edits.
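Option two's flush gate can be sketched as a simple counter (hypothetical names, not HBase API): a recovering region tracks how many of its WALs are still being replayed and refuses to flush until that count reaches zero.

```python
# Toy sketch of option two: gate flushes on replay completion.

class RecoveringRegion:
    def __init__(self, wal_count):
        self.pending_wals = wal_count  # WALs not yet fully replayed

    def wal_replayed(self):
        self.pending_wals -= 1

    def can_flush(self):
        # Hold all recovered edits in the memstore until every WAL is
        # replayed; feasible because only unflushed edits are recovered.
        return self.pending_wals == 0

region = RecoveringRegion(wal_count=3)
region.wal_replayed()
region.wal_replayed()
blocked = region.can_flush()   # False: one WAL still outstanding
region.wal_replayed()
allowed = region.can_flush()   # True: safe to flush now
```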

h5. Option three (multiple memstores)
a) Have splitlogworker pick the WALs of a failed RS in order instead of at random. Say a failed RS
has WAL1, WAL2, WAL3, ... WALk; a worker will only pick WAL2 once WAL1 is done (or has errored), etc.
b) During replay, pass the original WAL sequence ids of edits to the receiving RS.
c) On the receiving RS, bucket WAL files into separate memstores during replay and use the
original sequence ids. Say wal1-wal4 go to memstore1, wal5-wal10 to memstore2, etc. We only flush
a bucket's memstore once all WALs inside the bucket are replayed; all WALs can still be replayed
concurrently.
d) writes from normal traffic (writes are allowed during recovery) go into a different memstore,
as they do today, and flush normally with new sequence ids.
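The bucketing in steps a)-c) might look roughly like this toy sketch (hypothetical names, not HBase code): WALs are grouped into fixed-size buckets, each bucket maps to its own memstore, and a bucket may flush only once every WAL in it has been replayed.

```python
# Toy sketch of option three: bucket WALs into memstores, flush per bucket.

def bucket_of(wal_index, bucket_size=4):
    """WAL1..WAL4 -> bucket 0, WAL5..WAL8 -> bucket 1, and so on."""
    return (wal_index - 1) // bucket_size

class BucketedRecovery:
    def __init__(self, wal_count, bucket_size=4):
        self.bucket_size = bucket_size
        self.remaining = {}  # bucket -> set of WAL indexes not yet replayed
        for i in range(1, wal_count + 1):
            self.remaining.setdefault(bucket_of(i, bucket_size), set()).add(i)

    def wal_replayed(self, wal_index):
        # WALs may finish in any order; replay itself stays concurrent.
        self.remaining[bucket_of(wal_index, self.bucket_size)].discard(wal_index)

    def can_flush(self, bucket):
        # A bucket's memstore flushes only when all of its WALs are replayed,
        # so edits within the bucket keep their original sequence-id order.
        return not self.remaining[bucket]

recovery = BucketedRecovery(wal_count=6)
for wal in (1, 2, 3, 4):
    recovery.wal_replayed(wal)
bucket0_ready = recovery.can_flush(0)  # True: WAL1-WAL4 all replayed
bucket1_ready = recovery.can_flush(1)  # False: WAL5-WAL6 still pending
```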

h5. Option four
a) During replay, pass the original WAL sequence ids.
b) For each WAL edit, store the edit's original sequence id along with its key.
c) During scanning, use the original sequence id if it's present; otherwise fall back to the store
file sequence id.
d) Compaction can then keep only the put with the max sequence id.
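The comparison rule in steps c) and d) can be sketched as follows (a toy model with hypothetical field names, not the HBase KeyValue format): each recovered edit carries its original WAL sequence id, and both scans and compaction compare on that id when present, falling back to the store file's sequence id otherwise.

```python
# Toy sketch of option four: resolve same key + timestamp by sequence id.

def effective_seq_id(edit):
    """Prefer the original WAL sequence id stored with the edit, if any."""
    if edit["orig_seq_id"] is not None:
        return edit["orig_seq_id"]
    return edit["store_seq_id"]

def winning_put(edits):
    """Among puts with the same key + timestamp, the max sequence id wins;
    this is also the single put compaction would retain."""
    return max(edits, key=effective_seq_id)

edits = [
    {"value": "v1", "orig_seq_id": 17, "store_seq_id": 2},    # replayed edit
    {"value": "v2", "orig_seq_id": 21, "store_seq_id": 1},    # replayed, later
    {"value": "v0", "orig_seq_id": None, "store_seq_id": 3},  # pre-crash file
]

winner = winning_put(edits)  # "v2": original seq id 21 beats 17 and 3
```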





     
                
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
>                 Key: HBASE-7006
>                 URL: https://issues.apache.org/jira/browse/HBASE-7006
>             Project: HBase
>          Issue Type: New Feature
>          Components: MTTR
>            Reporter: stack
>            Assignee: Jeffrey Zhong
>            Priority: Critical
>             Fix For: 0.98.0, 0.95.1
>
>         Attachments: 7006-addendum-3.txt, hbase-7006-addendum.patch, hbase-7006-combined.patch,
hbase-7006-combined-v1.patch, hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch, hbase-7006-combined-v8.patch,
hbase-7006-combined-v9.patch, LogSplitting Comparison.pdf, ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had 1700 WALs
to replay. Replay took almost an hour. It looks like it could run faster, given that much of the
time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
