Date: Mon, 10 Jun 2013 23:21:20 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-8701) distributedLogReplay need to apply wal edits in the receiving order of those edits

[ https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680039#comment-13680039 ]

stack commented on HBASE-8701:
------------------------------

Thanks. I see what you are referring to now. The description sounds like a MapReduce job (as Elliott suggested), but in essence it is what Enis proposes (sort edits, taking seqid into account, and when done, spill hfiles which we then load on region open).

bq.
This thing can either have references (one per store), with multiple splitkey support; or, for a more involved solution without references, have (in the beginning, or tail) an index that points to precise locations to where each file's data starts. However in the latter case it's not clear where to store the file, as you said.

Yeah, that sounds like a lot of file writing and then fragile links that we'd need to keep in order. Let's whiteboard it on Wednesday.

bq. I might be missing something... can you elaborate why?

I thought I had? On 0.96 restart, we'd read current hfiles and convert all KVs to KVversion2 as we read them (KVversion2 would include the sequenceid). New files would be written with KVversion2. Memstores would have to change to factor in the seqid, and we would need to figure out what to do on upsert, the overwrite of a memstore value. Currently we let memstore values get overwritten when they have the same coordinates. That is fine given our current semantic that two KVs at the same coordinates do NOT count as different VERSIONS, but we would have to change this. There are probably a bunch of places where we presume the old KVversion1 serialization format that we would have to hunt out. This seems like a bunch of work to me. If we were going to go this route, we might as well get 'labels' into the KV too. Might as well move over to using the Cell interface.

bq. The thing where memstore overwrites the value appears to be plain incorrect to me when VERSIONS is more than 1 (whether you will get a version or not depends on when the memstore flush happens)...

No. KVs at the same VERSION are considered the same; you will only ever get the last one written.
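A toy model can contrast the current memstore semantic described above with the seqid-aware behavior that a KVversion2 carrying the sequenceid would allow. Under the current semantic, the same coordinates mean the same version, so the last write applied wins. This is a sketch under stated assumptions, not HBase's memstore: cells are keyed by (row, timestamp) only, and `PlainMemstore` and `SeqidMemstore` are hypothetical names.

```python
from typing import Dict, Tuple

Coord = Tuple[str, int]  # (row, timestamp): the cell coordinates


class PlainMemstore:
    """Current semantic: same coordinates = same version, last applied put wins."""

    def __init__(self) -> None:
        self.cells: Dict[Coord, str] = {}

    def put(self, row: str, ts: int, seqid: int, value: str) -> None:
        # The seqid is ignored; an upsert simply overwrites the cell.
        self.cells[(row, ts)] = value

    def get(self, row: str, ts: int) -> str:
        return self.cells[(row, ts)]


class SeqidMemstore:
    """Hypothetical seqid-aware variant: keep the put with the highest seqid."""

    def __init__(self) -> None:
        self.cells: Dict[Coord, Tuple[int, str]] = {}

    def put(self, row: str, ts: int, seqid: int, value: str) -> None:
        current = self.cells.get((row, ts))
        if current is None or seqid > current[0]:
            self.cells[(row, ts)] = (seqid, value)

    def get(self, row: str, ts: int) -> str:
        return self.cells[(row, ts)][1]


if __name__ == "__main__":
    puts = [("r", 0, 1, "A"), ("r", 0, 2, "B")]  # same coordinates, seqid 1 then 2
    for order in (puts, list(reversed(puts))):
        plain, seq = PlainMemstore(), SeqidMemstore()
        for p in order:
            plain.put(*p)
            seq.put(*p)
        print(plain.get("r", 0), seq.get("r", 0))  # "B B", then "A B"
```

When the same two puts are applied in both orders, the plain store's answer depends on arrival order. The seqid-aware store always returns B, which is the order independence that carrying the seqid is meant to provide.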
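The sort-then-spill approach attributed to Enis at the top of this comment (sort edits, taking seqid into account, then spill hfiles) can also be sketched in a few lines. This is a toy model, not HBase code. It assumes each edit carries its row, timestamp, original WAL seqid and value, and it uses a Python list as a stand-in for an hfile. `Edit`, `sort_and_spill`, `read_latest` and `max_per_file` are hypothetical names. The key point is the sort order: for the same row and timestamp, the higher seqid sorts first, so a reader that takes the first match sees the later write no matter the order in which the edits arrived.

```python
from typing import List, NamedTuple, Tuple


class Edit(NamedTuple):
    """A replayed WAL edit (toy model, not an HBase class)."""
    row: str
    ts: int      # cell timestamp, i.e. the user-visible version
    seqid: int   # original WAL sequence id
    value: str


def sort_key(e: Edit) -> Tuple[str, int, int]:
    # Rows ascending; within a row, newer timestamps first; for the same
    # row and timestamp, the higher seqid (the later write) first.
    return (e.row, -e.ts, -e.seqid)


def sort_and_spill(edits: List[Edit], max_per_file: int) -> List[List[Edit]]:
    """Sort all edits once, then cut the sorted run into bounded 'files'."""
    ordered = sorted(edits, key=sort_key)
    return [ordered[i:i + max_per_file] for i in range(0, len(ordered), max_per_file)]


def read_latest(files: List[List[Edit]], row: str, ts: int) -> str:
    """Return the first match in sorted order, i.e. the highest-seqid edit."""
    for f in files:  # files are consecutive slices of one sorted run
        for e in f:
            if e.row == row and e.ts == ts:
                return e.value
    raise KeyError((row, ts))


if __name__ == "__main__":
    edits = [Edit("r1", 0, 7, "A"), Edit("r1", 0, 9, "B"), Edit("r0", 5, 3, "X")]
    for batch in (edits, list(reversed(edits))):
        files = sort_and_spill(batch, max_per_file=2)
        print(read_latest(files, "r1", 0))  # prints B both times
```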
> distributedLogReplay need to apply wal edits in the receiving order of those edits
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8701
> URL: https://issues.apache.org/jira/browse/HBASE-8701
> Project: HBase
> Issue Type: Bug
> Components: MTTR
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Fix For: 0.98.0, 0.95.2
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts of the same key + version (timestamp). After replay, the value of the key is nondeterministic.
> h5. The original scenario raised by [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will also happen in similar situations, such as Put(key, t=T) in WAL1 and Put(key, t=T) in WAL2.
> h5. Below is the option I'd like to use:
> a) During replay, we pass the wal file name hash in each replay batch and the original wal sequence id of each edit to the receiving RS.
> b) Once a wal is recovered, the replaying RS sends a signal to the receiving RS so the receiving RS can flush.
> c) In the receiving RS, edits from different WAL files of a region go to different memstores. (At a high level, we can visualize this as sending changes to a new region object named origin region name + wal name hash, using the original sequence ids.)
> d) Writes from normal traffic (writes are allowed during recovery) go into the normal memstores as today and flush normally with new sequence ids.
> h5. The other alternative options are listed below for reference:
> Option one
> a) Disallow writes during recovery.
> b) During replay, we pass the original wal sequence ids.
> c) Hold flush until all wals of a recovering region are replayed. The memstore should be able to hold the edits because we only recover unflushed wal edits. For edits with the same key + version, whichever has the larger sequence id wins.
> Option two
> a) During replay, we pass the original wal sequence ids.
> b) For each wal edit, we store the edit's original sequence id along with its key.
> c) During scanning, we use the original sequence id if it's present, otherwise the store file's sequence id.
> d) Compaction can just keep the put with the max sequence id.
> Please let me know if you have better ideas.
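The routing in the preferred option (steps a to d in the issue description) can be sketched as a toy receiving-side region. The sketch assumes each replay batch carries the WAL file name and the original seqids. Replayed edits go to a separate memstore per WAL name hash and keep their original seqids. Live writes go to the normal memstore with fresh seqids. A WAL's memstore is flushed on its own when the replaying RS signals that the WAL is done. `RecoveringRegion`, `replay`, `wal_recovered`, the CRC32 hash and the `(store id, cells)` flush records are all hypothetical, chosen only for illustration.

```python
import zlib
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int, str]  # (row, ts, seqid, value)


def wal_hash(wal_name: str) -> int:
    # Stable hash of the WAL file name; CRC32 is an arbitrary choice here.
    return zlib.crc32(wal_name.encode("utf-8"))


class RecoveringRegion:
    """Toy receiving-side region during distributedLogReplay (not HBase code)."""

    def __init__(self, name: str, next_seqid: int) -> None:
        self.name = name
        self.next_seqid = next_seqid                      # for live writes
        self.normal: List[Cell] = []                      # normal memstore
        self.replay_stores: Dict[int, List[Cell]] = {}    # one per WAL hash
        self.flushed: List[Tuple[str, List[Cell]]] = []   # (store id, cells)

    def put(self, row: str, ts: int, value: str) -> None:
        """Live write during recovery (step d): gets a fresh seqid."""
        self.normal.append((row, ts, self.next_seqid, value))
        self.next_seqid += 1

    def replay(self, wal_name: str, batch: List[Cell]) -> None:
        """Replayed edits (steps a, c) keep their original seqids."""
        self.replay_stores.setdefault(wal_hash(wal_name), []).extend(batch)

    def wal_recovered(self, wal_name: str) -> None:
        """Signal from the replaying RS (step b): flush only that WAL's memstore."""
        h = wal_hash(wal_name)
        cells = self.replay_stores.pop(h, [])
        # Store id mimics "origin region name + wal name hash"; cells are
        # sorted as tuples, i.e. by row, then ts, then seqid ascending.
        self.flushed.append((f"{self.name}.{h:08x}", sorted(cells)))
```

In this model, a live put such as C never shares a flush with replayed A and B edits, so no single file mixes replayed edits with a newer max sequence id. That appears to be the point of the design in [~eclark]'s scenario.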
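Option two's read and compaction rule can be sketched in the same way. The sketch assumes a cell may carry its original WAL seqid and each store file records one max seqid. `Cell`, `StoreFile`, `effective_seqid`, `resolve` and `compact` are hypothetical names. The seqids in the usage (A=1, B=2, C=10, second flush=11) are made-up numbers that mirror [~eclark]'s scenario, and they assume that seqids assigned on the receiving RS exceed the replayed ones.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Cell:
    row: str
    ts: int
    value: str
    orig_seqid: Optional[int] = None  # set only for replayed WAL edits


@dataclass
class StoreFile:
    max_seqid: int
    cells: List[Cell] = field(default_factory=list)


def effective_seqid(cell: Cell, sf: StoreFile) -> int:
    # Original seqid if present, otherwise the store file's seqid (step c).
    return cell.orig_seqid if cell.orig_seqid is not None else sf.max_seqid


def resolve(files: List[StoreFile]) -> Dict[Tuple[str, int], Tuple[int, str]]:
    """For each (row, ts), pick the value with the highest effective seqid."""
    best: Dict[Tuple[str, int], Tuple[int, str]] = {}
    for sf in files:
        for c in sf.cells:
            s = effective_seqid(c, sf)
            key = (c.row, c.ts)
            if key not in best or s > best[key][0]:
                best[key] = (s, c.value)
    return best


def compact(files: List[StoreFile]) -> StoreFile:
    """Keep only the max-seqid put per (row, ts) (step d); files must be non-empty."""
    # Recording the winner's effective seqid as orig_seqid keeps later reads
    # ranking it the same way (a choice made for this sketch).
    cells = [Cell(row, ts, value, orig_seqid=s)
             for (row, ts), (s, value) in sorted(resolve(files).items())]
    return StoreFile(max_seqid=max(sf.max_seqid for sf in files), cells=cells)


if __name__ == "__main__":
    f1 = StoreFile(10, [Cell("k", 0, "C"), Cell("k", 0, "A", orig_seqid=1)])
    f2 = StoreFile(11, [Cell("k", 0, "B", orig_seqid=2)])
    print(resolve([f1, f2])[("k", 0)])  # (10, 'C')
```

Here C wins with effective seqid 10 over A (1) and B (2). If cells were ranked only by their file's max seqid, B from the later flush (11) would win instead. That is the kind of flush-timing dependence the issue describes.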