Date: Tue, 18 Jun 2013 00:41:21 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-8701) distributedLogReplay need to apply wal edits in the receiving order of those edits

    [ https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686243#comment-13686243 ]

stack commented on HBASE-8701:
------------------------------

[~jeffreyz] When you say '...but it allows us without modifying hfile format.', what are you thinking? I'm not sure why we'd need to modify hfile format to accommodate replay.

On the 200M, I like Enis's suggestion that we set an upper bound on KVs per file. It is a good idea for figuring a safe step in the sequenceids.

bq. It's the original log sequence number when firstly replay a wal.
Storing in waledit so that we can persistent the number into hlogkey of a wal Entry to handle the case when receiving RS fails again during a replay.

This seems like a hack on the hack. What do we do if the replay fails a second time? (Once for the original server crash, then a crash during replay, then again when replaying the replay?)

bq. It's possible due to we have sequence number along with the KV.

So it IS possible to have an hfile w/ a negative sequence number? (We don't sort storefiles by sequenceid any more?) If so, that'll mess us up? Or will the scan merge accommodate it?

How do the replays work in memstore? If a negative mvcc is added first, then a positive (because the region is open for writes), then another negative comes in, what happens? Does the negative overwrite the positive at the same coordinates? Will we flush w/ a negative sequenceid though the file has positives in it?

> distributedLogReplay need to apply wal edits in the receiving order of those edits
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8701
>                 URL: https://issues.apache.org/jira/browse/HBASE-8701
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: 8701-v3.txt, hbase-8701-v4.patch, hbase-8701-v5.patch
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts of the same key + version (timestamp). After replay, the value of the key is nondeterministic.
> h5. The original concern situation raised from [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will happen in similar situations, such as Put(key, t=T) in WAL1 and Put(key, t=T) in WAL2.
> h5. Below is the option (proposed by Ted) I'd like to use:
> a) During replay, we pass the original wal sequence number of each edit to the receiving RS
> b) In the receiving RS, we store the negative original sequence number of wal edits into the mvcc field of the KVs of wal edits
> c) Add handling of negative MVCC in KVScannerComparator and KVComparator
> d) In the receiving RS, write the original sequence number into an optional field of the wal file for the chained RS failure situation
> e) When opening a region, we add a safety bumper (a large number) so that the new sequence number of a newly opened region does not collide with old sequence numbers.
> In the future, when we store sequence numbers along with KVs, we can adjust the above solution a little bit to avoid overloading the MVCC field.
> h5. The other alternative options are listed below for reference:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flush till all wals of a recovering region are replayed. Memstore should hold because we only recover unflushed wal edits. For edits with the same key + version, whichever has the larger sequence Id wins.
> Option two
> a) During replay, we pass original wal sequence ids
> b) for each wal edit, we store each edit's original sequence id along with its key.
> c) during scanning, we use the original sequence id if it's present, otherwise its store file sequence Id
> d) compaction can just leave the put with max sequence id
> Please let me know if you have better ideas.
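
For illustration only, steps b) and c) of the proposed option might be sketched roughly as below. This is not the code in the attached patches and it does not use HBase's KVComparator or KVScannerComparator; the class ReplayOrderSketch, the Edit type, and the rule that any live write outranks any replayed edit are all assumptions made for the example. It only shows how a negative mvcc carrying the original wal sequence number could make the winner among edits with the same key + timestamp independent of arrival order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReplayOrderSketch {

    /** Hypothetical stand-in for a KV; every edit is assumed to share one row key and timestamp. */
    static final class Edit {
        final String value;
        // Assumption: mvcc < 0 marks a replayed edit and holds -(original wal sequence number);
        // mvcc > 0 marks a live write accepted while the region is recovering.
        final long mvcc;

        Edit(String value, long mvcc) {
            this.value = value;
            this.mvcc = mvcc;
        }
    }

    /** Orders edits newest first under the assumptions stated above. */
    static final Comparator<Edit> NEWEST_FIRST = (a, b) -> {
        boolean aReplayed = a.mvcc < 0;
        boolean bReplayed = b.mvcc < 0;
        if (aReplayed != bReplayed) {
            // Assumed rule: a live write happened after the crash, so it outranks replayed edits.
            return aReplayed ? 1 : -1;
        }
        if (aReplayed) {
            // Both replayed: the more negative mvcc has the larger original sequence number.
            return Long.compare(a.mvcc, b.mvcc);
        }
        // Both live: the larger mvcc is newer.
        return Long.compare(b.mvcc, a.mvcc);
    };

    /** Returns the value a read would see, whatever order the edits arrived in. */
    static String winner(List<Edit> arrivalOrder) {
        List<Edit> sorted = new ArrayList<>(arrivalOrder);
        sorted.sort(NEWEST_FIRST);
        return sorted.get(0).value;
    }

    public static void main(String[] args) {
        // The scenario from the description: A is replayed, the user puts C, then B is replayed.
        // The sequence numbers 10 and 11 and the mvcc 5 are made-up values.
        List<Edit> arrival = new ArrayList<>();
        arrival.add(new Edit("A", -10L));
        arrival.add(new Edit("C", 5L));
        arrival.add(new Edit("B", -11L));
        System.out.println(winner(arrival)); // prints C
    }
}
```

Whether a live write should always beat a replayed edit, and how such KVs behave across memstore flushes and storefile ordering, are exactly the open questions raised in the comment above; the sketch picks one answer purely to have something concrete to discuss.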