hbase-issues mailing list archives

From "Pankaj Kumar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-14000) Region server failed to report Master and stuck in reportForDuty retry loop
Date Fri, 10 Jul 2015 14:03:04 GMT

     [ https://issues.apache.org/jira/browse/HBASE-14000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankaj Kumar updated HBASE-14000:
---------------------------------
    Attachment: HM_RS-Log_snippet.txt

In this HBase HA cluster, 2 HMasters and 3 RegionServers are deployed across Machine-1 (vm-M1), Machine-2 (vm-M2) and Machine-3 (vm-M3).
{noformat}
Machine-1/(vm-M1) -> HM1, RS1
Machine-2/(vm-M2) -> HM2, RS2
Machine-3/(vm-M3) -> RS3

HM1 -> Active Master 
HM2 -> Standby Master
{noformat}

There was a power failure and all machines went down. When they came back up, the cluster was started automatically by the operations management system (the processes can come up in any order).

In our scenario all region servers started before the masters. When the region servers came up they tried to contact HM1, and "error telling master we are up (Connection refused)" was reported 1-2 times. Only in this case is rssStub reset (see the illustrative sketch below).
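
To make the failure mode concrete, here is a minimal self-contained Java model of the behaviour described above. The class, the lookupActiveMasterFromZk() helper and the ServerNotRunningYet stand-in are illustrative only, not the actual HRegionServer code; only the field name rssStub and the two log branches mirror what this report describes.

{code}
import java.io.IOException;
import java.net.ConnectException;

// Illustrative model only (not the HBase source): shows why the cached master
// stub is dropped on "Connection refused" but kept on "Master is not running yet".
public class ReportForDutyModel {

    // Stand-in for rssStub: the master the region server reports to.
    private String rssStub;

    // Hypothetical helper; in HBase the active master is resolved via ZooKeeper.
    private String lookupActiveMasterFromZk() {
        return "vm-M2,21300,1435284238757";
    }

    // Stand-in for ServerNotRunningYetException (master up, still initializing).
    static class ServerNotRunningYet extends IOException {}

    public void reportForDuty() throws IOException {
        if (rssStub == null) {
            rssStub = lookupActiveMasterFromZk();  // stub created once
        }
        try {
            regionServerStartup(rssStub);          // the startup RPC
        } catch (ConnectException e) {
            // "error telling master we are up": only here is the stub reset,
            // so the next retry re-resolves the active master.
            rssStub = null;
        } catch (ServerNotRunningYet e) {
            // "Master is not running yet": the stub is kept, so even after a
            // master switch every retry still targets the old master; the bug.
        }
    }

    // Placeholder for the real RPC to the master.
    private void regionServerStartup(String master) throws IOException {
    }
}
{code}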

{noformat}
RS1:
	2015-06-26 10:04:08,683 | WARN  | regionserver/vm-M1/Machine-1-IP:21302 | error telling master we are up | org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2277)
	com.google.protobuf.ServiceException: java.net.ConnectException: Connection refused
		at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
{noformat}

The same exception was thrown 1-2 times; after that the region server was stuck in the retry loop:

{noformat}
RS1:
	2015-06-26 10:04:08,688 | WARN  | regionserver/vm-M1/Machine-1-IP:21302 | reportForDuty failed; sleeping and then retrying.
	2015-06-26 10:04:11,689 | INFO  | regionserver/vm-M1/Machine-1-IP:21302 | reportForDuty to master=vm-M1,21300,1435049929004
	2015-06-26 10:04:13,202 | WARN  | regionserver/vm-M1/Machine-1-IP:21302 | reportForDuty failed; sleeping and then retrying.
	2015-06-26 10:04:16,203 | INFO  | regionserver/vm-M1/Machine-1-IP:21302 | reportForDuty to master=vm-M2,21300,1435284238757
	2015-06-26 10:04:16,208 | WARN  | regionserver/vm-M1/Machine-1-IP:21302 | reportForDuty failed; sleeping and then retrying.
	2015-06-26 10:04:19,208 | INFO  | regionserver/vm-M1/Machine-1-IP:21302 | reportForDuty to master=vm-M2,21300,1435284238757
{noformat}

{noformat}
RS2:
	2015-06-26 10:04:08,140 | WARN  | regionserver/vm-M2/Machine-2-IP:21302 | reportForDuty failed; sleeping and then retrying.
	2015-06-26 10:04:11,140 | INFO  | regionserver/vm-M2/Machine-2-IP:21302 | reportForDuty to master=vm-M1,21300,1435049929004
	2015-06-26 10:04:12,964 | WARN  | regionserver/vm-M2/Machine-2-IP:21302 | reportForDuty failed; sleeping and then retrying.
	2015-06-26 10:04:15,964 | INFO  | regionserver/vm-M2/Machine-2-IP:21302 | reportForDuty to master=vm-M2,21300,1435284238757
	2015-06-26 10:04:15,969 | WARN  | regionserver/vm-M2/Machine-2-IP:21302 | reportForDuty failed; sleeping and then retrying.
	2015-06-26 10:04:18,970 | INFO  | regionserver/vm-M2/Machine-2-IP:21302 | reportForDuty to master=vm-M2,21300,1435284238757
{noformat}

You can see that initially the region servers tried to connect to vm-M1, and they continued to do so even after "error telling master we are up".

During this time HM2 became active,
{noformat}
	2015-06-26 10:04:14,735 | INFO  | vm-M2:21300.activeMasterManager | Deleting ZNode for /hbase/backup-masters/vm-M2,21300
	2015-06-26 10:04:14,749 | INFO  | vm-M2:21300.activeMasterManager | Registered Active Master=vm-M2,21300,1435284238757 |
{noformat}

And when HM1 finished initialization it became the standby master,
{noformat}
	2015-06-26 10:04:14,963 | INFO  | vm-M1:21300.activeMasterManager | Current master has this master's address, vm-M1,21300,1435049929004; master was restarted? Deleting node.
	2015-06-26 10:04:14,973 | INFO  | vm-M1:21300.activeMasterManager | Another master is the active master, vm-M2,21300,1435284238757; waiting to become the next active master
{noformat}


1) Since rssStub was not reset, the region servers were actually still trying to report to HM1 (the standby master), while at the same time HM2 (the active master) was waiting for the region server count to settle.

2) The log message is also misleading: it says the region server is trying to connect to HM2, because on every retry the active master address is retrieved from ZK, but that address is used only for logging and rssStub is not reset (see the sketch of a possible fix below).
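
One direction a fix could take, sketched in terms of the illustrative model above (this is an assumption, not a description of the attached HBASE-14000.patch): before each attempt, re-resolve the active master from ZooKeeper and rebuild the stub whenever it no longer matches, so the logged master and the actual RPC target stay in sync.

{code}
// Sketch only, using the names from the illustrative model above; the real
// patch may take a different approach.
public void reportForDutyWithRefresh() throws IOException {
    String activeMaster = lookupActiveMasterFromZk();    // what the log prints
    if (rssStub == null || !rssStub.equals(activeMaster)) {
        // The cached stub points at a stale master: rebuild it so the RPC
        // actually goes to the master we are about to log about.
        rssStub = activeMaster;
    }
    reportForDuty();
}
{code}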

> Region server failed to report Master and stuck in reportForDuty retry loop
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-14000
>                 URL: https://issues.apache.org/jira/browse/HBASE-14000
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Pankaj Kumar
>            Assignee: Pankaj Kumar
>         Attachments: HBASE-14000.patch, HM_RS-Log_snippet.txt
>
>
> In an HA cluster, the region server gets stuck in the reportForDuty retry loop if the active master is restarting and a master switch happens before the region server reports successfully.
> Root cause is the same as HBASE-13317, but here the region server tried to connect to the master while the master was still starting, so the rssStub reset did not happen, as the failure only hits this branch:
> {code}
>   if (ioe instanceof ServerNotRunningYetException) {
> 	LOG.debug("Master is not running yet");
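> 	// Note (comment added here, not in the HBase source): rssStub is not
> 	// reset in this branch, so the stale master stub is kept across retries.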
>   }
> {code}
> While the master was starting, the master switch happened, so the RS kept trying to connect to the standby master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
