hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16960) RegionServer hang when aborting
Date Sat, 29 Oct 2016 05:35:58 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617495#comment-15617495 ]

stack commented on HBASE-16960:

I looked at the patch.

I was concerned that it was cancelling already-running syncs, but it does not seem to do that.
We do not want to stop currently running syncs. They were started before this failed append.
If they succeed, there is no data loss: a few handlers are going to get IOExceptions, but
everything up to the failed append will have been synced. If they do not succeed, then there
could be data loss, but the SyncRunner should be screaming to kill the RegionServer so that it
will replay logs.

bq. But the problem in this JIRA is some case that there's no further syncs after append fails,
and causing an isolated sync then infinite wait. 

Yes. We seem to keep turning up corner cases that can bring about this stuck state. It is
a weakness of the implementation that every append must be followed by a sync, else the machinery
gets stuck. [~aoxiang] suggests a timeout. A long timeout that takes a look around
to see what the state of things is, and rethrows an abort if appropriate, is something
I wanted to avoid, but it seems sensible now that this is the second or third lockup
caught out in the wild.
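
To make the timeout idea concrete, here is a rough sketch of a handler that waits on its sync in bounded slices and checks server state each time it wakes. It is not the patch and not FSHLog code: TimedSyncWait, PendingSync, blockOnSync, the aborting flag and the check interval are all hypothetical names made up for illustration.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch only; none of these names are real HBase classes. */
final class TimedSyncWait {

  /** Stand-in for the future a handler thread blocks on until its sync finishes. */
  static final class PendingSync {
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile IOException failure; // non-null if the sync failed

    void complete() {
      done.countDown();
    }

    void fail(IOException e) {
      failure = e;
      done.countDown();
    }

    /** Returns true if the sync finished within the timeout, false if it timed out. */
    boolean await(long timeout, TimeUnit unit) throws InterruptedException, IOException {
      if (!done.await(timeout, unit)) {
        return false;
      }
      if (failure != null) {
        throw failure;
      }
      return true;
    }
  }

  /**
   * Waits for the sync in slices of checkIntervalMs. After each timed-out slice it
   * looks at the (assumed) abort flag and gives up with an IOException if the server
   * is aborting, instead of waiting forever on a sync that will never be completed.
   */
  static void blockOnSync(PendingSync sync, AtomicBoolean aborting, long checkIntervalMs)
      throws IOException, InterruptedException {
    while (!sync.await(checkIntervalMs, TimeUnit.MILLISECONDS)) {
      if (aborting.get()) {
        throw new IOException("Server is aborting; giving up on pending sync");
      }
    }
  }
}
```

A sync that completes normally returns as before; only an isolated sync that nobody will ever complete gets kicked out, and only once the abort flag is set. What "the state of things" should cover beyond a single flag is left open here.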

Thanks lads for digging in on this tough one.

> RegionServer hang when aborting
> -------------------------------
>                 Key: HBASE-16960
>                 URL: https://issues.apache.org/jira/browse/HBASE-16960
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-16960.patch, RingBufferEventHandler.png, RingBufferEventHandler_exception.png,
SyncFuture.png, SyncFuture_exception.png, rs1081.jstack
> We have seen a regionserver hang while aborting several times. This takes all regions on that
> regionserver out of service, and all affected applications then stop working.

This message was sent by Atlassian JIRA
