hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8701) distributedLogReplay need to apply wal edits in the receiving order of those edits
Date Mon, 17 Jun 2013 22:10:20 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686081#comment-13686081

stack commented on HBASE-8701:

Sometimes the mvcc number is a sequence number (a negative one!) and other times it is an
mvcc.  This hack is spread about the code base.

On the below:

+  // in distributedLogReplay mode, we haven't read all wals so we don't know the last exact
+  // sequence number used by previous failed RS. Hence we introduce SEQNUM_SAFETY_BUMPER to add a
+  // large enough number to be sure that the new sequence number of the just opened region won't
+  // overlap with old sequence numbers.
+  // Using 200 million:
+  // 1) it'd take 300+ years to overflow long integer assuming the same region recovers every
+  // 2) it'd take 2+ days for a RS receives a change every millisecond and without a single
+  static final long SEQNUM_SAFETY_BUMPER = 200 * 1024 * 1024; // 200 millions

What does a flush do to the above?  It does not affect the sequence number, right?  It does not
reset it.

If a RS does 1k hits a second for two days, we are almost at 200 million.
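
As a rough check on that estimate, here is a minimal sketch of the arithmetic. It assumes one sequence number is consumed per edit at a steady rate with no flush in between; the class and method names are hypothetical, and only the constant's value is taken from the patch.

```java
public class SeqNumBumperMath {
    // Value as written in the patch under review: 200 * 1024 * 1024.
    public static final long SEQNUM_SAFETY_BUMPER = 200L * 1024 * 1024;

    // Edits accumulated at a steady rate over a number of whole days
    // (assumption: one sequence number per edit).
    public static long editsOver(long editsPerSecond, long days) {
        return editsPerSecond * 60 * 60 * 24 * days;
    }

    // Whole seconds until the bumper is used up at the given steady rate.
    public static long secondsToExhaust(long editsPerSecond) {
        return SEQNUM_SAFETY_BUMPER / editsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println("bumper               = " + SEQNUM_SAFETY_BUMPER);
        System.out.println("edits, 1k/s, 2 days  = " + editsOver(1000, 2));
        System.out.println("hours to exhaust     = " + secondsToExhaust(1000) / 3600.0);
    }
}
```

Under those assumptions, two days at 1k edits a second is 172,800,000 edits against a bumper of 209,715,200, which leaves a little over 58 hours of headroom in total.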

The 200M here is meant to span all edits out in WAL logs?

No explanation for why we set a -seqid into mvcc:

-          kv.setMemstoreTS(localizedWriteEntry.getWriteNumber());
+          kv.setMemstoreTS(seqId == NO_SEQ_ID ? localizedWriteEntry.getWriteNumber() :
+            -seqId);

KeyValueHeap has this pollution.  It goes negating "seqid" w/o explanation.  Yeah, this
hack is spread all over the code base.
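
To make the overloading concrete, here is a minimal sketch of what storing a negated sequence id in the mvcc field implies for readers of that field. It is not the patch's code: the class, the helper names, and the value chosen for NO_SEQ_ID are assumptions, and it only orders replayed edits against each other (how a replayed edit compares with a live write is left out).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NegativeMvccSketch {
    // Hypothetical sentinel meaning "this edit did not come from replay".
    public static final long NO_SEQ_ID = -1;

    // Mirrors the ternary in the patch: replayed edits store -seqId,
    // all other edits store the mvcc write number.
    public static long encode(long seqId, long writeNumber) {
        return seqId == NO_SEQ_ID ? writeNumber : -seqId;
    }

    // A negative value in the field marks a replayed edit.
    public static boolean isReplayed(long memstoreTS) {
        return memstoreTS < 0;
    }

    // Recover the original WAL sequence id from a replayed edit's field.
    public static long originalSeqId(long memstoreTS) {
        return -memstoreTS;
    }

    // Among replayed edits with the same key and timestamp, sort the one
    // with the larger original sequence id first.
    public static final Comparator<Long> REPLAYED_NEWEST_FIRST =
        (a, b) -> Long.compare(originalSeqId(b), originalSeqId(a));

    // Original sequence id of the edit that wins, whatever the arrival order.
    public static long winner(List<Long> memstoreTSValues) {
        List<Long> sorted = new ArrayList<>(memstoreTSValues);
        sorted.sort(REPLAYED_NEWEST_FIRST);
        return originalSeqId(sorted.get(0));
    }
}
```

Every reader of the field has to know about the sign convention, which is the spread being complained about here.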

Why move recovering state setting from openregionhandler to HRegion?

An HLogEdit doesn't have sequence number already?  What is logSeqNum?  What is relation to

+  // used in distributedLogReplay to store original log sequence number of an edit
+  private long origLogSeqNum;

Chatting w/ Himanshu, he wondered if it is possible that a memstore get flushed w/ a negative

> distributedLogReplay need to apply wal edits in the receiving order of those edits
> ----------------------------------------------------------------------------------
>                 Key: HBASE-8701
>                 URL: https://issues.apache.org/jira/browse/HBASE-8701
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>         Attachments: 8701-v3.txt, hbase-8701-v4.patch
> This issue happens in distributedLogReplay mode when recovering multiple puts of the same key + version (timestamp). After replay, the value of the key is nondeterministic.
> h5. The original concern raised by [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will also happen in a similar situation, such as Put(key, t=T) in WAL1 and Put(key, t=T) in WAL2
> h5. Below is the option(proposed by Ted) I'd like to use:
> a) During replay, we pass original wal sequence number of each edit to the receiving
> b) In receiving RS, we store negative original sequence number of wal edits into mvcc field of KVs of wal edits
> c) Add handling of negative MVCC in KVScannerComparator and KVComparator
> d) In receiving RS, write original sequence number into an optional field of wal file for chained RS failure situation
> e) When opening a region, we add a safety bumper (a large number) in order for the new sequence number of a newly opened region not to collide with old sequence numbers.
> In the future, when we store sequence numbers along with KVs, we can adjust the above solution a little bit by no longer overloading the MVCC field.
> h5. The other alternative options are listed below for references:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flush till all wals of a recovering region are replayed. Memstore should hold because we only recover unflushed wal edits. For edits with the same key + version, whichever has the larger sequence id wins.
> Option two
> a) During replay, we pass original wal sequence ids
> b) for each wal edit, we store each edit's original sequence id along with its key. 
> c) during scanning, we use the original sequence id if it's present, otherwise its store file sequence id
> d) compaction can just leave put with max sequence id
> Please let me know if you have better ideas.

