hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16960) RegionServer hang when aborting
Date Mon, 31 Oct 2016 05:34:58 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15621313#comment-15621313 ]

ramkrishna.s.vasudevan commented on HBASE-16960:
------------------------------------------------

bq. But the problem in this JIRA is some case that there's no further syncs after append fails,
and causing an isolated sync then infinite wait. The proposal will try to clean previous non-synced
syncFutures so it won't leave any isolated one, and don't break any existing logic.
This is true. In fact, I am also looking into this possibility, but only for the AsyncWAL case.
bq. It is a weakness of the implementation that every append must be followed by a sync else
the machinery gets stuck.
This is what I ran into when I tried to use the ring buffer with AsyncWAL. Reading the FSHLog
code, though, I found things are much better there: each time, the head of the queue was removed
and highestSyncID was set to that sync's id.
Every other sync in the syncFuture was then checked, and any whose txid was greater than this
was skipped rather than marked done. I am not very sure about the failure case, though. Either
way, this append-followed-by-sync mechanism is what causes such bugs.
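
For illustration only, here is a minimal Java sketch of the bookkeeping described above. It is
not the actual FSHLog or AsyncWAL code; the names PendingSync, releaseUpTo and failAll are
hypothetical, and the single-threaded list stands in for the real ring buffer machinery. It
shows syncs with a txid above the highest synced id being skipped, and one possible way to fail
the remaining ones after an append failure so that none is left waiting forever.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class SyncFutureSketch {
    /** Hypothetical stand-in for a sync future: a txid plus a completion handle. */
    static final class PendingSync {
        final long txid;
        final CompletableFuture<Long> done = new CompletableFuture<>();
        PendingSync(long txid) { this.txid = txid; }
    }

    private final List<PendingSync> pending = new ArrayList<>();
    private long highestSyncedTxid = -1L;

    /** Register a waiter that wants everything up to txid to be synced. */
    PendingSync requestSync(long txid) {
        PendingSync s = new PendingSync(txid);
        pending.add(s);
        return s;
    }

    /** A sync covering syncedTxid succeeded: release covered waiters, skip the rest. */
    int releaseUpTo(long syncedTxid) {
        highestSyncedTxid = Math.max(highestSyncedTxid, syncedTxid);
        int released = 0;
        Iterator<PendingSync> it = pending.iterator();
        while (it.hasNext()) {
            PendingSync s = it.next();
            if (s.txid > highestSyncedTxid) {
                continue; // not covered yet: stays pending until a later sync arrives
            }
            s.done.complete(highestSyncedTxid);
            it.remove();
            released++;
        }
        return released;
    }

    /** Assumed cleanup on append failure: fail every waiter that is still pending. */
    int failAll(Throwable cause) {
        int failed = pending.size();
        for (PendingSync s : pending) {
            s.done.completeExceptionally(cause);
        }
        pending.clear();
        return failed;
    }

    public static void main(String[] args) {
        SyncFutureSketch wal = new SyncFutureSketch();
        PendingSync a = wal.requestSync(1);
        PendingSync b = wal.requestSync(5);

        wal.releaseUpTo(3);
        System.out.println("a done: " + a.done.isDone()); // a done: true
        System.out.println("b done: " + b.done.isDone()); // b done: false

        // Without this step, b would wait forever if no further sync ever arrives.
        wal.failAll(new IOException("append failed"));
        System.out.println("b failed: " + b.done.isCompletedExceptionally()); // b failed: true
    }
}
```

In this sketch, b is the "isolated" sync: if the append fails and no later sync comes in, only
an explicit cleanup such as failAll releases it.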

> RegionServer hang when aborting
> -------------------------------
>
>                 Key: HBASE-16960
>                 URL: https://issues.apache.org/jira/browse/HBASE-16960
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-16960.patch, HBASE-16960_master_v2.patch, RingBufferEventHandler.png,
RingBufferEventHandler_exception.png, SyncFuture.png, SyncFuture_exception.png, rs1081.jstack
>
>
> We have seen the regionserver hang while aborting several times, which takes all regions on this regionserver out of service, and then all affected applications stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
