Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Wed, 19 Jun 2013 04:35:24 +0000 (UTC)
From: "Jeffrey Zhong (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12651415.1370543927825.138204.1371616524881@arcas>
In-Reply-To: <JIRA.12651415.1370543927825@arcas>
References: <JIRA.12651415.1370543927825@arcas>
Subject: [jira] [Updated] (HBASE-8701) distributedLogReplay need to apply
 wal edits in the receiving order of those edits
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeffrey Zhong updated HBASE-8701:
---------------------------------

    Attachment: hbase-8701-v7.patch

Thanks [~saint.ack@gmail.com] and [~himanshu@cloudera.com] for the good comments. In the v7 patch, we actually write the negative mvcc into the hfile. The patch passes all unit tests cleanly with and without distributedLogReplay.

{quote}
I do not get why we have to have two sequenceids in an edit; the actual and the 'original'.
{quote}
Oh, I see your question now. We don't have two sequence numbers (there is no actual sequence number stored in WALEdit), only this 'original' one, which is introduced in the patch.

{quote}
Doesn't mvcc make it out to hfile so that when we merge it w/ the memstore (because there was say, an ongoing scan at the time of the flush), that we still respect whatever the mvcc was at the time and not show clients' edits that are not meant for them?
{quote}
Good point. Since recovery does NOT allow reads, the situation above won't happen: a read request gets an exception before the actual read logic runs. After recovery, all writes are committed, so there should be no issue reading them all. Because of the negative mvcc value (logically equal to 0), they will be fetched. The semantics are the same as in recovered-edits recovery, where MVCC values are 0.
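
To make the "logically equal to 0" point concrete, here is a minimal sketch of the read-visibility rule. The class and method names are hypothetical and are not taken from the patch or from HBase; the sketch only assumes that a cell is visible when its mvcc does not exceed the scanner's read point.

```java
// Hypothetical illustration only; this is not HBase's scanner code.
final class MvccVisibilitySketch {

    // Assumption: a negative mvcc (a negated original WAL sequence number)
    // is treated as 0 when deciding read visibility.
    static long effectiveMvcc(long mvcc) {
        return Math.max(mvcc, 0L);
    }

    // Assumption: a cell is visible when its effective mvcc is at or below
    // the scanner's read point.
    static boolean isVisible(long mvcc, long readPoint) {
        return effectiveMvcc(mvcc) <= readPoint;
    }

    public static void main(String[] args) {
        // A replayed edit carrying a negative mvcc, read at read point 0.
        System.out.println(isVisible(-42L, 0L));
        // A live edit whose mvcc is ahead of the read point.
        System.out.println(isVisible(7L, 5L));
    }
}
```

Under these assumptions a replayed edit is visible to any scanner opened after recovery, matching the behavior of edits recovered with an MVCC value of 0.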

{quote}
Do you mean 'replace' in the above?
{quote}
No, there is no replace; both versions of the KV, with different MVCC values, exist in the memstore.

{quote}
The negative mvcc is out in the hfile occupying the sequenceid-for-the-hfile location?
{quote}
No. The mvcc doesn't affect the sequence id of an hfile because mvcc and sequence number are independent of each other.

{quote}
b) Recovery is completed, and region is available for read. There might be some scanners open and we would now have some legit min readpoint.
{quote}
Read requests will be rejected until the region is recovered, by which time all writes are committed.

{quote}
Do we need to remove that optimization now?
{quote}
Indeed, we need this. Thanks for the good point.

{quote}
ii) How we handle the Deletes now.
{quote}
Deletes won't be affected because a delete always wins over puts of the same version, so no ordering guarantee is needed. I also checked the code, and the negative mvcc doesn't affect the scenario you mentioned above.

Thanks.
                
> distributedLogReplay need to apply wal edits in the receiving order of those edits
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8701
>                 URL: https://issues.apache.org/jira/browse/HBASE-8701
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: 8701-v3.txt, hbase-8701-v4.patch, hbase-8701-v5.patch, hbase-8701-v6.patch, hbase-8701-v7.patch
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts of the same key + version (timestamp). After replay, the value of the key is nondeterministic.
> h5. The original concern situation raised from [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will happen in similar situations, such as Put(key, t=T) in WAL1 and Put(key, t=T) in WAL2.
> h5. Below is the option(proposed by Ted) I'd like to use:
> a) During replay, we pass original wal sequence number of each edit to the receiving RS
> b) In receiving RS, we store negative original sequence number of wal edits into mvcc field of KVs of wal edits
> c) Add handling of negative MVCC in KVScannerComparator and KVComparator   
> d) In the receiving RS, write the original sequence number into an optional field of the wal file to handle chained RS failures
> e) When opening a region, we add a safety bumper (a large number) so that the new sequence numbers of a newly opened region do not collide with old sequence numbers.
> In the future, when we store sequence numbers along with KVs, we can adjust the above solution slightly to avoid overloading the MVCC field.
> h5. The other alternative options are listed below for references:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flush till all wals of a recovering region are replayed. Memstore should hold because we only recover unflushed wal edits. For edits with the same key + version, whichever has the larger sequence id wins.
> Option two
> a) During replay, we pass original wal sequence ids
> b) for each wal edit, we store each edit's original sequence id along with its key. 
> c) during scanning, we use the original sequence id if it's present otherwise its store file sequence Id
> d) compaction can just leave put with max sequence id
> Please let me know if you have better ideas.
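
Steps (b) and (c) of the proposed option can be sketched as follows. This is a simplified illustration, not code from the patch: the class, method, and field names are hypothetical, and it only covers ordering between two replayed edits that share the same key and timestamp, assuming the edit with the larger original sequence number should win.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; not the KVComparator change in the patch.
final class ReplayOrderSketch {

    // Step (b): store the negated original WAL sequence number in the mvcc field.
    static long encode(long originalSeqId) {
        return -originalSeqId;
    }

    // Recover the original WAL sequence number from a negative mvcc value.
    static long originalSeqId(long mvcc) {
        if (mvcc >= 0) {
            throw new IllegalArgumentException("not a replayed edit: " + mvcc);
        }
        return -mvcc;
    }

    // Step (c), simplified: among replayed edits with the same key and timestamp,
    // the edit with the larger original sequence number sorts first.
    static final Comparator<Long> REPLAYED_NEWEST_FIRST =
        (a, b) -> Long.compare(originalSeqId(b), originalSeqId(a));

    public static void main(String[] args) {
        // B (original seq 7) happens to arrive before A (original seq 5).
        Long[] arrival = { encode(7L), encode(5L) };
        Arrays.sort(arrival, REPLAYED_NEWEST_FIRST);
        // The first element is the winner regardless of arrival order.
        System.out.println(originalSeqId(arrival[0]));
    }
}
```

Because the ordering depends only on the encoded original sequence number, the winner is the same whichever edit reaches the receiving RS first.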
