hbase-issues mailing list archives

From "Ted Yu (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4749) TestMasterFailover case occasional fails
Date Fri, 04 Nov 2011 14:05:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144036#comment-13144036 ]

Ted Yu commented on HBASE-4749:
-------------------------------

Thanks for the finding Jinchao.

From the log of build 105:
{code}
Killing RS juno.apache.org,60001,1320357166142

2011-11-03 21:52:56,007 FATAL [Thread-986] regionserver.HRegionServer(1523): ABORTING region server juno.apache.org,60001,1320357166142: Killing for unit test
...
2011-11-03 21:52:56,011 WARN  [Thread-986] regionserver.HRegionServer(1545): Unable to report fatal error to master
java.lang.reflect.UndeclaredThrowableException
	at $Proxy16.reportRSFatalError(Unknown Source)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1541)
...
2011-11-03 21:52:57,356 INFO [Master:0;juno.apache.org,51313,1320357176029] master.HMaster(464): Registering server found up in zk: juno.apache.org,60001,1320357166142
2011-11-03 21:52:57,357 INFO [Master:0;juno.apache.org,51313,1320357176029] master.ServerManager(239): Registering server=juno.apache.org,60001,1320357166142
...
2011-11-03 21:52:57,586 INFO  [Thread-986-EventThread] zookeeper.RegionServerTracker(93): RegionServer ephemeral node deleted, processing expiration [juno.apache.org,60001,1320357166142]
2011-11-03 21:52:57,588 INFO  [RegionServer:1;juno.apache.org,60001,1320357166142] regionserver.HRegionServer(744): stopping server juno.apache.org,60001,1320357166142; zookeeper connection closed.
{code}
We can see that there was a 570 ms delay before the region server shutdown handler completed.
That is why the dead region server was re-registered by the new master.
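
For anyone who wants to check the ordering of events, here is a small sketch that computes the gaps between the timestamps visible in the excerpt above. It is only an illustration and not part of HBase; the class and method names (LogDelta, millisBetween) are made up for this example. Parts of the log are elided, so the sketch covers only the lines shown and does not try to reproduce the 570 ms figure.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not HBase code: measures gaps between log timestamps.
class LogDelta {
    // Timestamp layout used in the excerpt, e.g. "2011-11-03 21:52:56,007".
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Milliseconds elapsed from the earlier timestamp to the later one.
    static long millisBetween(String earlier, String later) {
        return Duration.between(LocalDateTime.parse(earlier, FMT),
                                LocalDateTime.parse(later, FMT)).toMillis();
    }

    public static void main(String[] args) {
        String abort       = "2011-11-03 21:52:56,007"; // ABORTING region server
        String registered  = "2011-11-03 21:52:57,356"; // new master registers the server
        String nodeDeleted = "2011-11-03 21:52:57,586"; // ephemeral node deleted

        System.out.println("abort -> registered:        "
            + millisBetween(abort, registered) + " ms");
        System.out.println("registered -> node deleted: "
            + millisBetween(registered, nodeDeleted) + " ms");
        System.out.println("abort -> node deleted:      "
            + millisBetween(abort, nodeDeleted) + " ms");
    }
}
```

In the visible lines, the new master registers the server 230 ms before the ephemeral node is deleted, which is the window in which the dead server gets picked up again.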

Since reportRSFatalError() threw an exception, we cannot rely on this call to reach the
master.

We have two options:
1. Devise a mechanism to tell the new master the identity of the dead region server.
2. Insert a sleep of, say, 1 second before starting the new master.

Option 1 introduces extra complexity into the Master. I am not sure it is worth it just for
test purposes.
Many people wouldn't like option 2.
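
As a rough sketch of a variant of option 2 (this is a swapped-in technique, not what the test does today): instead of a fixed sleep, the test could poll for a condition with an upper bound on the wait. Everything named here (BoundedWait, waitFor, nodeGone) is hypothetical, and the condition is a stand-in; how the test would actually detect that the dead server's ephemeral node is gone is left open.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper, not HBase code: bounded polling instead of a fixed sleep.
class BoundedWait {
    /**
     * Polls the condition every intervalMs until it holds or timeoutMs elapses.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the dead region server's ephemeral node is gone":
        // here the condition simply becomes true about 200 ms after start.
        long start = System.nanoTime();
        BooleanSupplier nodeGone = () -> System.nanoTime() - start > 200_000_000L;

        // Wait at most 1 second, checking every 10 ms, before starting the new master.
        boolean ok = waitFor(1000, 10, nodeGone);
        System.out.println("condition met before timeout: " + ok);
    }
}
```

The worst case is still a 1 second wait, but the common case returns as soon as the condition holds, which may make the delay easier to accept.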

More discussion is welcome.
                
> TestMasterFailover case occasional fails
> ----------------------------------------
>
>                 Key: HBASE-4749
>                 URL: https://issues.apache.org/jira/browse/HBASE-4749
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Priority: Minor
>             Fix For: 0.92.0
>
>
> See these logs:
> https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
