hbase-issues mailing list archives

From "zhiyuan.dai (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5075) regionserver crashed and failover
Date Mon, 20 Feb 2012 06:39:36 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211691#comment-13211691 ]

zhiyuan.dai commented on HBASE-5075:
------------------------------------

@stack @Lars Hofhansl
First, the RPC method getRSPidAndRsZknode fetches the regionserver's PID and its znode, which
includes the domain and service port. This approach is reliable; if we relied on the process
list instead, there could be some misjudgments.

Second, there is a supervisor called RegionServerFailureDetection. We start the regionserver
first and then start RegionServerFailureDetection, which acts as a watchdog for the
RegionServer.

The supervisor (RegionServerFailureDetection) then fetches the regionserver's PID and znode by
calling getRSPidAndRsZknode.

RegionServerFailureDetection has nothing to do with long GC pauses.

RegionServerFailureDetection first checks whether the PID is alive and then checks whether the
service port is alive.
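
The two liveness checks described above can be sketched roughly as follows. This is only an illustration and not code from the attached patch: the function names (pid_alive, port_alive, region_server_alive) are hypothetical, the PID check assumes a POSIX system, and it assumes that both checks must pass for the regionserver to count as alive.

```python
import os
import socket


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs an existence check and sends nothing
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def port_alive(host: str, port: int, timeout: float = 0.1) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, or timed out


def region_server_alive(pid: int, host: str, port: int, timeout: float = 0.1) -> bool:
    """Assumption: the server counts as alive only if both checks pass."""
    # The PID is checked first, then the service port, as in the comment above.
    return pid_alive(pid) and port_alive(host, port, timeout)
```

In such a sketch the PID, host, and port would come from whatever getRSPidAndRsZknode returns; how that result is parsed is not shown here.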
                
> regionserver crashed and failover
> ---------------------------------
>
>                 Key: HBASE-5075
>                 URL: https://issues.apache.org/jira/browse/HBASE-5075
>             Project: HBase
>          Issue Type: Improvement
>          Components: monitoring, regionserver, replication, zookeeper
>    Affects Versions: 0.92.1
>            Reporter: zhiyuan.dai
>             Fix For: 0.90.5
>
>         Attachments: Degion of Failure Detection.pdf, HBase-5075-src.patch
>
>
> When a regionserver crashes, it takes too long to notify the HMaster. Once the HMaster knows
> about the regionserver's shutdown, it takes a long time to fetch the hlog's lease.
> HBase is an online DB, so availability is very important.
> I have an idea to improve availability: a monitor node checks the regionserver's PID. If the
> PID does not exist, I consider the RS down, delete its znode, and force-close the hlog file.
> The check period could be 100ms.
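
The polling idea in the description could look roughly like the loop below. It is a minimal sketch under stated assumptions, not the patch's implementation: watch, is_alive, and on_failure are hypothetical names, and the cleanup steps (deleting the znode, force-closing the hlog file) are left to the caller-supplied on_failure callback rather than implemented against ZooKeeper or HDFS.

```python
import time
from typing import Callable


def watch(
    is_alive: Callable[[], bool],
    on_failure: Callable[[], None],
    period_s: float = 0.1,  # the 100ms period suggested in the description
) -> int:
    """Poll is_alive every period_s seconds until it returns False.

    Calls on_failure exactly once when the check first fails and returns
    the number of polls performed.
    """
    polls = 0
    while True:
        polls += 1
        if not is_alive():
            # Hypothetical hook: delete the znode and force-close the hlog here.
            on_failure()
            return polls
        time.sleep(period_s)
```

With a 100ms period, a crashed process would be noticed within roughly one polling interval plus the time the liveness check itself takes.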

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
