hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12028) Abort the RegionServer, when one of it's handler threads die
Date Fri, 19 Sep 2014 17:33:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140931#comment-14140931 ]

stack commented on HBASE-12028:

bq. I'm still unclear about the root cause for HBASE-11813

A query that returned thousands of empty results turned what was thought to be a harmless recursion
(triggered by empty result) pathological... it recursed so much it overflowed the allocated

You raise a couple of points: that we are mortally wounded if we lose a handler, and that
killing the RS could end up taking down the whole cluster.
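
To make the HBASE-11813 failure mode concrete, here is a minimal sketch of the general pattern: a "skip empty results" step written recursively next to the same step written as a loop. This is not the actual HBase code. The class and method names are hypothetical, and a "batch" here is just a list of strings standing in for a result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names come from HBase.
class EmptyResultSkipSketch {
    /** Recursive form: one extra stack frame per consecutive empty batch. */
    static List<String> nextNonEmptyRecursive(Iterator<List<String>> batches) {
        if (!batches.hasNext()) {
            return Collections.emptyList();
        }
        List<String> batch = batches.next();
        if (batch.isEmpty()) {
            // Java does not eliminate tail calls, so each empty batch deepens the stack.
            return nextNonEmptyRecursive(batches);
        }
        return batch;
    }

    /** Iterative form: constant stack depth however many batches are empty. */
    static List<String> nextNonEmptyIterative(Iterator<List<String>> batches) {
        while (batches.hasNext()) {
            List<String> batch = batches.next();
            if (!batch.isEmpty()) {
                return batch;
            }
        }
        return Collections.emptyList();
    }

    /** Builds the given number of empty batches followed by one batch holding "row". */
    static List<List<String>> batchesWithLeadingEmpties(int empties) {
        List<List<String>> all =
            new ArrayList<>(Collections.nCopies(empties, Collections.<String>emptyList()));
        all.add(Arrays.asList("row"));
        return all;
    }
}
```

With a handful of empty batches the two forms behave the same. With thousands of consecutive empties the recursive form's stack depth grows with the input, which is the kind of growth that can end in a StackOverflowError, while the loop's depth stays fixed.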

> Abort the RegionServer, when one of it's handler threads die
> ------------------------------------------------------------
>                 Key: HBASE-12028
>                 URL: https://issues.apache.org/jira/browse/HBASE-12028
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Sudarshan Kadambi
> Over in HBASE-11813, a user identified an issue wherein all the RPC handler threads
would exit with StackOverflow errors due to an unchecked recursion-terminating condition.
Our clusters demonstrated the same trace. While the patch posted for HBASE-11813 got our clusters
to be merry again, the breakdown surfaced some larger issues.
> When the RegionServer had all its RPC handler threads dead, it continued to have regions
assigned to it. Clearly, it wouldn't be able to serve reads and writes on those regions. A second
issue was that when a user tried to disable or drop a table, the master would try to communicate
to the regionserver for region unassignment. Since the same handler threads seem to be used
for master <-> RS communication as well, the master ended up hanging on the RS indefinitely.
Eventually, the master stopped responding to all table meta-operations.
> A handler thread should never exit, and if it does, it seems like the more prudent thing
to do would be for the RS to abort. This way, at least recovery can be undertaken and the regions
could be reassigned elsewhere. I also think that the master<->RS communication should
get its own exclusive threadpool, but I'll wait until this issue has been sufficiently discussed
before opening an issue ticket for that.
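
One way the "abort when a handler dies" proposal could be sketched in plain Java is with a per-thread uncaught exception handler. This is an illustration under stated assumptions, not the HBase implementation: `ServerAbortHook`, `HandlerDeathSketch`, and `startHandler` are hypothetical names, and the unbounded recursion exists only to provoke a StackOverflowError in the demo.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for whatever the RegionServer uses to abort itself. */
interface ServerAbortHook {
    void abort(String why, Throwable cause);
}

class HandlerDeathSketch {
    /** Starts a handler thread whose unexpected death triggers server.abort(...). */
    static Thread startHandler(String name, Runnable body, ServerAbortHook server) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        // Invoked by the JVM when the thread terminates because of an uncaught
        // Throwable, which includes Errors such as StackOverflowError.
        t.setUncaughtExceptionHandler((thread, error) ->
            server.abort("handler thread " + thread.getName() + " died", error));
        t.start();
        return t;
    }

    /** Deliberately unbounded recursion, used only to kill the demo handler. */
    static int recurseForever(int depth) {
        return recurseForever(depth + 1) + 1;
    }

    /** Returns the recorded abort reason, or null if no abort was seen within 10 s. */
    static String demo() throws InterruptedException {
        AtomicReference<String> reason = new AtomicReference<>();
        CountDownLatch aborted = new CountDownLatch(1);
        // A real server would begin shutdown here; the sketch only records the call.
        ServerAbortHook server = (why, cause) -> {
            reason.set(why + ": " + cause.getClass().getSimpleName());
            aborted.countDown();
        };
        startHandler("handler-0", () -> recurseForever(0), server);
        aborted.await(10, TimeUnit.SECONDS);
        return reason.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```

The point of the design is that the handler's death becomes visible to the process instead of passing silently, so the server can stop accepting region assignments and the regions can be recovered elsewhere. Whether aborting is the right reaction is exactly what this issue is discussing.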

