hbase-issues mailing list archives

From "Yu Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16960) RegionServer hang when aborting
Date Fri, 28 Oct 2016 13:31:58 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615418#comment-15615418 ]

Yu Li commented on HBASE-16960:
-------------------------------

Thanks for chiming in [~ram_krish]

Note that at this point we have already hit an exception, so {{exception!=null}} holds, and
according to the code below all subsequent appends will just return:
{code}
            if (this.exception != null) {
              // We got an exception on an earlier attempt at append. Do not let this append
              // go through. Fail it but stamp the sequenceid into this append though failed.
              // We need to do this to close the latch held down deep in WALKey...that is waiting
              // on sequenceid assignment otherwise it will just hang out (The #append method
              // called below does this also internally).
              entry.stampRegionSequenceId();
              // Return to keep processing events coming off the ringbuffer
              return;
            }
{code}

So no real append will happen until a new sync truck is handled by the {{RingBufferEventHandler}},
and when that new sync arrives, it will also reach the lines below and *also* clean up all
{{syncFutures}} that haven't been offered to a {{SyncRunner}}:
{code}
        // We may have picked up an exception above trying to offer sync
        if (this.exception != null) {
          cleanupOutstandingSyncsOnException(sequence,
            this.exception instanceof DamagedWALException?
              this.exception:
              new DamagedWALException("On sync", this.exception));
        }
{code}
The only difference is that this cleanup also includes the new sync itself.

In my understanding, when an append fails we just return and wait for the next sync to clean up
the syncs, because we must make sure the failed append won't be synced and reported as a success.
But the problem in this JIRA is the case where no further sync arrives after the append fails,
which leaves an isolated sync waiting forever. The proposal tries to clean up the previous
non-synced syncFutures so that no isolated one is left, and it doesn't break any existing logic.
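
To make the scenario concrete, here is a minimal toy sketch of the two behaviours. It is not the
HBase code: {{ToyHandler}}, {{onSync}}, {{onAppend}} and the {{cleanupOnAppendFailure}} flag are
hypothetical names made up for illustration, and the sync runner is left out entirely, so a sync
only ever completes through cleanup here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy model only: ToyHandler and its methods are hypothetical, not HBase classes.
class IsolatedSyncSketch {

    static final class ToyHandler {
        // false models the current behaviour, true models the proposed cleanup.
        private final boolean cleanupOnAppendFailure;
        // Syncs that have arrived but were not yet handed to a sync runner.
        private final List<CompletableFuture<Void>> pendingSyncs = new ArrayList<>();
        // Set once an append fails; never cleared in this sketch.
        private Exception exception;

        ToyHandler(boolean cleanupOnAppendFailure) {
            this.cleanupOnAppendFailure = cleanupOnAppendFailure;
        }

        // A sync event arrives. If an append failed earlier, every pending
        // sync (including this new one) is failed right away.
        CompletableFuture<Void> onSync() {
            CompletableFuture<Void> future = new CompletableFuture<>();
            pendingSyncs.add(future);
            if (exception != null) {
                failPendingSyncs();
            }
            return future;
        }

        // An append event arrives. After a first failure, later appends just return.
        void onAppend(boolean fails) {
            if (exception != null) {
                return;
            }
            if (fails) {
                exception = new IOException("simulated append failure");
                if (cleanupOnAppendFailure) {
                    // Proposed behaviour: do not wait for a later sync to clean up.
                    failPendingSyncs();
                }
            }
        }

        private void failPendingSyncs() {
            for (CompletableFuture<Void> future : pendingSyncs) {
                future.completeExceptionally(exception);
            }
            pendingSyncs.clear();
        }
    }

    public static void main(String[] args) {
        // Current behaviour: the sync stays pending until another sync arrives.
        ToyHandler current = new ToyHandler(false);
        CompletableFuture<Void> isolated = current.onSync();
        current.onAppend(true);
        System.out.println("current, after failed append: done=" + isolated.isDone());
        current.onSync();
        System.out.println("current, after a later sync: done=" + isolated.isDone());

        // Proposed behaviour: the failed append itself fails the pending sync.
        ToyHandler proposed = new ToyHandler(true);
        CompletableFuture<Void> cleaned = proposed.onSync();
        proposed.onAppend(true);
        System.out.println("proposed, after failed append: done=" + cleaned.isDone());
    }
}
```

In this toy model the first line prints {{done=false}} (the isolated sync that would wait forever
if no later sync came), while the other two print {{done=true}}. In both modes the pending sync
ends up failed rather than reported as a success, so the existing semantic is kept.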

Actually [~aoxiang] and I also have further questions about whether the current implementation
can guarantee the semantic that "failed appends won't get synced successfully", and we're still
digging into it. We will open another JIRA if we come up with a solution.

> RegionServer hang when aborting
> -------------------------------
>
>                 Key: HBASE-16960
>                 URL: https://issues.apache.org/jira/browse/HBASE-16960
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-16960.patch, RingBufferEventHandler.png, RingBufferEventHandler_exception.png, SyncFuture.png, SyncFuture_exception.png, rs1081.jstack
>
>
> We have seen the regionserver hang while aborting several times; this takes all regions on this
> regionserver out of service, and all affected applications then stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
