hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16960) RegionServer hang when aborting
Date Sun, 30 Oct 2016 06:03:58 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619357#comment-15619357 ]

stack commented on HBASE-16960:

I can't say for sure that the patch will fix the problem, so I think we should also add the
long wait on sync and call abort if we time out, just in case.
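
To make the "long wait on sync, abort on timeout" suggestion concrete, here is a minimal sketch. It is not the HBase code: SyncTimeoutSketch, Abortable and waitForSync are hypothetical names, and a plain CompletableFuture stands in for a SyncFuture. It only illustrates bounding the wait and calling abort when the bound is exceeded.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SyncTimeoutSketch {
    /** Hypothetical stand-in for the server's abort hook. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /**
     * Waits up to timeoutMs for the sync to be acknowledged.
     * Returns true if the sync completed in time; otherwise calls abort and
     * returns false, so the caller never blocks forever on a lost sync.
     */
    static boolean waitForSync(CompletableFuture<Long> syncFuture, long timeoutMs,
                               Abortable server)
            throws InterruptedException, ExecutionException {
        try {
            syncFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Assumption: a timed-out sync is treated as fatal for the server.
            server.abort("sync timed out after " + timeoutMs + " ms", e);
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that nobody ever completes models a sync that was never offered.
        CompletableFuture<Long> lost = new CompletableFuture<>();
        boolean ok = waitForSync(lost, 100, (why, cause) -> System.out.println("ABORT: " + why));
        System.out.println("sync completed: " + ok);
    }
}
```

The timeout would need to be long enough that a slow but healthy sync does not trigger an abort; the value used in main is only for demonstration.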

An append throws an exception (this should almost never happen). We set the exception in onEvent
so all subsequent appends will get this exception, but we keep pulling on the ringbuffer to clear
it out.

We schedule a roll of the log. The roll fails because many (8k? Is that possible?) appends
have gone into the log and they have not been ACK'd with a sync so we will fail the roll and
call for an ABORT of the server to replay logs.

Now, I can't tell for sure what state we are in. Batching in the RingBuffer is basic. It is
just whatever is there since the last time we went to pull from the ringbuffer. A batch would
have to have been something like append, append, sync, sync, append... i.e. an append came
in after some syncs... which is possible of course. In this case, I think your patch will
help clearing out unoffered syncrunners ... the syncs that came in before the append that
failed. If no new sync comes around the ringbuffer, these are just going to hang out. It looks
like we are so busy trying to ABORT, we neglect to schedule these SyncFutures.
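
The batch scenario above (append, append, sync, sync, failing append) can be modelled with a small sketch. This is an illustration only, not the RingBufferEventHandler code or the attached patch: BatchDrainSketch, Event and onBatch are hypothetical names, and CompletableFuture stands in for SyncFuture. The point it shows is that syncs collected before a failed append have to be released with the failure, otherwise their waiters hang unless another sync comes around.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class BatchDrainSketch {
    /** Hypothetical ring buffer event: an append, or a sync carrying a future. */
    static final class Event {
        final CompletableFuture<Long> syncFuture; // null for an append
        final boolean appendFails;                // only meaningful for an append

        private Event(CompletableFuture<Long> syncFuture, boolean appendFails) {
            this.syncFuture = syncFuture;
            this.appendFails = appendFails;
        }

        static Event append(boolean fails) { return new Event(null, fails); }
        static Event sync(CompletableFuture<Long> f) { return new Event(f, false); }
    }

    private IOException failure; // sticky once an append has failed
    private final List<CompletableFuture<Long>> pending = new ArrayList<>();
    private long sequence;       // count of successful appends

    /** Processes whatever was in the buffer since the last pull. */
    void onBatch(List<Event> batch) {
        for (Event e : batch) {
            if (e.syncFuture != null) {
                if (failure != null) {
                    e.syncFuture.completeExceptionally(failure); // fail fast
                } else {
                    pending.add(e.syncFuture); // collected, not yet offered
                }
            } else if (failure == null) {
                if (e.appendFails) {
                    failure = new IOException("append failed");
                    // Release syncs that arrived before the failed append.
                    for (CompletableFuture<Long> f : pending) {
                        f.completeExceptionally(failure);
                    }
                    pending.clear();
                } else {
                    sequence++;
                }
            }
        }
        if (failure == null) {
            // Healthy batch: pretend the sync ran and acknowledge the waiters.
            for (CompletableFuture<Long> f : pending) {
                f.complete(sequence);
            }
            pending.clear();
        }
    }
}
```

Without the loop that fails the pending futures, the two syncs ahead of the failed append would stay in the pending list indefinitely, which is the hang described above.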

Can you reproduce?

Thanks for digging in on this one [~carp84] and [~aoxiang]

> RegionServer hang when aborting
> -------------------------------
>                 Key: HBASE-16960
>                 URL: https://issues.apache.org/jira/browse/HBASE-16960
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-16960.patch, RingBufferEventHandler.png, RingBufferEventHandler_exception.png,
> SyncFuture.png, SyncFuture_exception.png, rs1081.jstack
> We have seen the regionserver hang when aborting several times, which takes all regions on this
> regionserver out of service, and then all affected applications stop working.
