hbase-dev mailing list archives

From "Esteban Gutierrez (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
Date Fri, 10 Jul 2015 22:43:04 GMT
Esteban Gutierrez created HBASE-14059:
-----------------------------------------

             Summary: We should add a RS to the dead servers list if admin calls fail more than a threshold
                 Key: HBASE-14059
                 URL: https://issues.apache.org/jira/browse/HBASE-14059
             Project: HBase
          Issue Type: Bug
          Components: master, regionserver, rpc
    Affects Versions: 0.98.13
            Reporter: Esteban Gutierrez
            Assignee: Esteban Gutierrez
            Priority: Critical


I ran into this problem twice this week: calls from the HBase master to a RS can time out
because the RS call queue is full. However, since the RS is not dead (its ephemeral znode is
still present), the master keeps attempting admin tasks such as opening or closing a region.
Those operations eventually fail once we run out of retries or the assignment manager attempts
to re-assign the region to other RSs. As side effects, I've seen master operations fully
blocked, or regions stuck in transition (RITs), since we cannot close the region and open it
in a new location while the RS is not considered dead.

A potential solution is to add the RS to the list of dead RSs after a certain number of calls
from the master to the RS fail.
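
To make the proposal concrete, here is a minimal sketch of the bookkeeping it implies. It is
not HBase code: the class AdminCallFailureTracker, its methods, and the choice to count only
consecutive failures (a success resets the count) are all assumptions for illustration. A real
fix would hook into the master's existing dead server handling.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of HBase): counts failed admin calls per
 * region server and flags a server as dead once the count reaches a threshold.
 */
final class AdminCallFailureTracker {
    // Assumed threshold: number of consecutive failures tolerated before marking dead.
    private final int maxFailures;
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();
    private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

    AdminCallFailureTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    /**
     * Records one failed admin call (for example a timed-out open or close region).
     * Returns true only for the call that newly moves the server to the dead set.
     */
    boolean recordFailure(String serverName) {
        int count = failures
            .computeIfAbsent(serverName, k -> new AtomicInteger())
            .incrementAndGet();
        if (count >= maxFailures) {
            return deadServers.add(serverName);
        }
        return false;
    }

    /** Assumption: a successful call means the RS is responsive again, so reset its count. */
    void recordSuccess(String serverName) {
        failures.remove(serverName);
    }

    boolean isDead(String serverName) {
        return deadServers.contains(serverName);
    }
}
```

With a threshold of 3, the third consecutive failed call to the same RS would return true from
recordFailure, at which point the master could treat the RS as dead and re-assign its regions
even though the ephemeral znode is still present.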

I've only noticed the problem in 0.98.x, but it should be present in all versions.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
