hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jerry He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18099) FlushSnapshotSubprocedure should check the return value from Region#flush()
Date Thu, 25 May 2017 03:53:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024157#comment-16024157

Jerry He commented on HBASE-18099:

bq. This check acts as safety guard. Normally waitForFlushes() would get into loop and return
non-zero value.
There is no really a need to have this safety guard. What is the concrete case which it guards
It is also a little strange to use a duration/time-lapse as indicator if an op is successful
or not.  It can be zero but still valid.
For example, in the patch:
99	    FlushResult res = region.flush(true);
100	    if (res.getResult() == FlushResult.Result.CANNOT_FLUSH) {
101	      long duration = region.waitForFlushes();
Between line 100 and line 101, the third-party flush can be over and writestate.flushing set
to false.  Then the waitForFlushes is a no-op with duration zero. We don't want to fail the
snapshot in this case.

We can just let region.waitForFlushes() be void and call it in line 101 without checking anything.
Please add a comment above line 100 to tell why we need to do line 101 -- there may be another
ongoing flush for the same region we want to wait for.

> FlushSnapshotSubprocedure should check the return value from Region#flush()
> ---------------------------------------------------------------------------
>                 Key: HBASE-18099
>                 URL: https://issues.apache.org/jira/browse/HBASE-18099
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>         Attachments: 18099.v1.txt, 18099.v2.txt, 18099.v3.txt
> In the following thread:
> http://search-hadoop.com/m/HBase/YGbbMXkeHlI9zo
> Jacob described the scenario where data from certain region were missing in the snapshot.
> Here was related region server log:
> https://pastebin.com/1ECXjhRp
> He pointed out that concurrent flush from MemStoreFlusher.1 thread was not initiated
from the thread pool for snapshot.
> In RegionSnapshotTask#call() method there is this:
> {code}
>           region.flush(true);
> {code}
> The return value is not checked.
> In HRegion#flushcache(), Result.CANNOT_FLUSH may be returned due to:
> {code}
>           String msg = "Not flushing since "
>               + (writestate.flushing ? "already flushing"
>               : "writes not enabled");
> {code}
> This implies that FlushSnapshotSubprocedure may incorrectly skip waiting for the concurrent
flush to complete.

This message was sent by Atlassian JIRA

View raw message