hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits -> DATALOSS
Date Fri, 05 Jun 2015 07:51:01 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574087#comment-14574087 ]

stack commented on HBASE-13811:
-------------------------------

[~Apache9] JIRA was down so it took a while to respond....


bq. Fine, I think it will work. But I still feel a little nervous to have two methods which
have same name but different behaviors...

Makes sense. In this v7 patch, I made the two overloaded methods work the same and changed
what happens in HRegion when we prepare to flush.

bq. And I remember that, when implementing HBASE-10201 and HBASE-12405, actually I wanted to
return the flushedSeqId when calling startCacheFlush first. But there are two problems. First,
the getNextSequenceId method is in HRegion, not in FSHLog, so a simple solution is to return NO_SEQ_NUM
when flushing all stores and let HRegion call getNextSequenceId.

Yes. That is how it 'works' in patch v6, but it is hard to read. We can actually tell, when
we are flushing, whether we should do all of the region, right? If the passed-in families are null
or equal in number to the region stores, we are doing a full region flush, so we should use the
flush sequence id, the result of the getNextSequenceId call. Otherwise, we want the getEarliest
for the region because we are doing a column-family-only flush...
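
The decision described above can be sketched as follows. This is a minimal illustration, not the code in the patch: the class FlushSeqIdSketch, its method names, and the use of plain strings for families are all hypothetical, and the two sequence ids are passed in as already-computed values rather than obtained from HRegion or the WAL.

```java
import java.util.Arrays;
import java.util.Collection;

/** Hypothetical sketch of the full-vs-partial flush decision; not HBase code. */
class FlushSeqIdSketch {

    /** A flush is treated as full when no families are named or all stores are covered. */
    static boolean isFullRegionFlush(Collection<String> families, int storeCount) {
        return families == null || families.size() == storeCount;
    }

    /**
     * Picks the sequence id to record for the flush.
     * nextSequenceId stands in for the result of a getNextSequenceId call;
     * earliestUnflushedSeqId stands in for the region's earliest unflushed id.
     */
    static long chooseFlushSeqId(Collection<String> families, int storeCount,
                                 long nextSequenceId, long earliestUnflushedSeqId) {
        return isFullRegionFlush(families, storeCount)
            ? nextSequenceId
            : earliestUnflushedSeqId;
    }

    public static void main(String[] args) {
        // Full flush: families is null, so the next sequence id is used.
        System.out.println(chooseFlushSeqId(null, 3, 42L, 17L));
        // Partial flush: one family out of three, so the earliest id is used.
        System.out.println(chooseFlushSeqId(Arrays.asList("cf1"), 3, 42L, 17L));
    }
}
```

Keeping the choice in one place means the caller does not need a sentinel such as NO_SEQ_NUM to signal "pick the id yourself".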


bq. But here comes the second problem, startCacheFlush may fail which means we can not start
a flush, so there are three types of return values, 'sequenceId', 'choose a sequenceId by
yourself', 'give up flushing!'. I think it is ugly to have a '-2' or a null java.lang.Long
to indicate a 'give up flushing' at that time so I gave up...

Pardon me, I don't see the problem here? Your nice TestSplitWalDataLoss test was failing for
me earlier because I was not doing the abort accounting properly; the 'restore' of old sequenceids.
Abort of the flush will 'restore' the old sequenceids. The region flush id won't be updated.
This is ok?
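
A rough sketch of the abort accounting described above, under stated assumptions: the class FlushAccountingSketch and all of its methods are hypothetical and only model the idea with two in-memory maps. It is not the FSHLog or HRegion implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of flush start/complete/abort bookkeeping; not HBase code. */
class FlushAccountingSketch {
    // Oldest sequence id per family that has not been flushed yet.
    private final Map<String, Long> unflushed = new HashMap<>();
    // Ids set aside for families that are currently being flushed.
    private final Map<String, Long> flushing = new HashMap<>();
    // Region-level flush id; only a completed flush moves it.
    private long lastFlushedSeqId = -1L;

    /** Records an edit; only the first (oldest) id per family is kept. */
    void append(String family, long seqId) {
        unflushed.putIfAbsent(family, seqId);
    }

    /** Moves the oldest ids of the given families into the flushing set. */
    void startFlush(Set<String> families) {
        for (String family : families) {
            Long id = unflushed.remove(family);
            if (id != null) {
                flushing.put(family, id);
            }
        }
    }

    /** A successful flush drops the set-aside ids and advances the flush id. */
    void completeFlush(long flushSeqId) {
        flushing.clear();
        lastFlushedSeqId = flushSeqId;
    }

    /** An abort restores the old ids and leaves the region flush id untouched. */
    void abortFlush() {
        for (Map.Entry<String, Long> e : flushing.entrySet()) {
            // Edits may have arrived during the flush; keep the smaller id.
            unflushed.merge(e.getKey(), e.getValue(), Long::min);
        }
        flushing.clear();
    }

    Long oldestUnflushedSeqId(String family) {
        return unflushed.get(family);
    }

    long lastFlushedSeqId() {
        return lastFlushedSeqId;
    }
}
```

In this model, if an edit with id 5 is pending, a flush starts, an edit with id 9 arrives, and the flush aborts, the family's oldest unflushed id goes back to 5 rather than 9, so a later WAL split would not skip the edits between them.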

bq. Maybe we could consider this solution again? getEarliestMemstoreSeqNum can be used everywhere
but startCacheFlush is restricted in the flushing scope I think.

I'd like to purge getEarliestMemstoreSeqNum or narrow its usage if possible. What do you
mean by 'startCacheFlush is restricted'?

Thanks Duo

> Splitting WALs, we are filtering out too many edits -> DATALOSS
> ---------------------------------------------------------------
>
>                 Key: HBASE-13811
>                 URL: https://issues.apache.org/jira/browse/HBASE-13811
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 2.0.0, 1.2.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0, 1.2.0
>
>         Attachments: 13811.branch-1.txt, 13811.branch-1.txt, 13811.txt, 13811.v2.branch-1.txt,
13811.v3.branch-1.txt, 13811.v3.branch-1.txt, 13811.v4.branch-1.txt, 13811.v5.branch-1.txt,
13811.v6.branch-1.txt, 13811.v6.branch-1.txt, 13811.v7.branch-1.txt, HBASE-13811-v1.testcase.patch,
HBASE-13811.testcase.patch
>
>
> I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler
to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place,
so I can only think it is the cause (but cannot see how). When we split the logs, we are skipping
legit edits. Digging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
