hbase-issues mailing list archives

From "Heng Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
Date Fri, 17 Jul 2015 06:29:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630851#comment-14630851 ]

Heng Chen commented on HBASE-14059:

I don't think just adding this RS to the dead servers list solves the problem, because the
bad region could be moved to another RS and max out that RS's call queue as well.

I think a better solution is to add the bad region to a blacklist and skip requests to
that region, so that the RS does not get blocked.
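
A minimal sketch of the blacklist idea, to make the proposal concrete. Everything here is hypothetical: the class name RegionBlacklist, its methods, and the failure threshold are illustrative assumptions, not existing HBase APIs, and the sketch does not show where in the RPC path the check would be made.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of HBase: tracks regions whose requests keep
// failing so that callers can skip them instead of filling the RS call queue.
class RegionBlacklist {
    private final int failureThreshold;
    private final ConcurrentMap<String, AtomicInteger> failures = new ConcurrentHashMap<>();
    private final Set<String> blacklisted = ConcurrentHashMap.newKeySet();

    RegionBlacklist(int failureThreshold) {
        if (failureThreshold <= 0) {
            throw new IllegalArgumentException("failureThreshold must be positive");
        }
        this.failureThreshold = failureThreshold;
    }

    // Count a failed request; blacklist the region once the threshold is reached.
    void recordFailure(String regionName) {
        int count = failures
            .computeIfAbsent(regionName, k -> new AtomicInteger())
            .incrementAndGet();
        if (count >= failureThreshold) {
            blacklisted.add(regionName);
        }
    }

    // True if requests for this region should be rejected up front.
    boolean shouldSkip(String regionName) {
        return blacklisted.contains(regionName);
    }

    // Remove the region from the blacklist, e.g. after an operator has repaired it.
    void clear(String regionName) {
        failures.remove(regionName);
        blacklisted.remove(regionName);
    }
}
```

A caller would check shouldSkip(regionName) before queuing a request and fail fast when it returns true. Because the blacklist is keyed by region rather than by server, it keeps applying after the region is reassigned to another RS.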

> We should add a RS to the dead servers list if admin calls fail more than a threshold
> -------------------------------------------------------------------------------------
>                 Key: HBASE-14059
>                 URL: https://issues.apache.org/jira/browse/HBASE-14059
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver, rpc
>    Affects Versions: 0.98.13
>            Reporter: Esteban Gutierrez
>            Assignee: Esteban Gutierrez
>            Priority: Critical
> I ran into this problem twice this week: calls from the HBase master to a RS can time out
> because the RS call queue has been maxed out. However, since the RS is not dead (its
> ephemeral znode is still present), the master keeps attempting admin tasks such as opening
> or closing a region, and those operations eventually fail once we run out of retries or the
> assignment manager attempts to re-assign to other RSs. As side effects, I've noticed master
> operations being fully blocked, or regions stuck in transition (RITs), because we cannot
> close the region and open it in a new location while the RS is not dead.
> A potential solution for this is to add the RS to the list of dead RSs after a certain
> number of calls from the master to the RS fail.
> I've only noticed the problem in 0.98.x, but it should be present in all versions.
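
For illustration, the threshold proposal in the issue description could be sketched as follows. This is an assumption-laden outline, not the actual patch: AdminCallFailureTracker and its methods are hypothetical names, and the sketch counts consecutive failures per server without showing how the master would then expire the server.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of HBase: counts consecutive failed admin
// calls per region server and reports when a server reaches the threshold.
class AdminCallFailureTracker {
    private final int maxConsecutiveFailures;
    private final Map<String, Integer> consecutiveFailures = new ConcurrentHashMap<>();

    AdminCallFailureTracker(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures <= 0) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be positive");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Record a failed call. Returns true once the server has reached the
    // threshold, i.e. the caller may want to treat it as dead.
    boolean recordFailure(String serverName) {
        int count = consecutiveFailures.merge(serverName, 1, Integer::sum);
        return count >= maxConsecutiveFailures;
    }

    // A successful call resets the counter for that server.
    void recordSuccess(String serverName) {
        consecutiveFailures.remove(serverName);
    }
}
```

Resetting the counter on success is a design choice made for this sketch, so that only an unbroken run of failures counts toward the threshold. The issue text itself only says "after a certain number of calls fail".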
