hbase-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10210) during master startup, RS can be you-are-dead-ed by master in error
Date Sat, 04 Jan 2014 03:35:53 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862189#comment-13862189 ]

Hudson commented on HBASE-10210:
--------------------------------

SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #51 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/51/])
HBASE-10210 during master startup, RS can be you-are-dead-ed by master in error (sershe: rev 1555302)
* /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerOnCluster.java
* /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterNoCluster.java


> during master startup, RS can be you-are-dead-ed by master in error
> -------------------------------------------------------------------
>
>                 Key: HBASE-10210
>                 URL: https://issues.apache.org/jira/browse/HBASE-10210
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0, 0.96.1, 0.99.0, 0.96.1.1
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>             Fix For: 0.98.0, 0.99.0
>
>         Attachments: HBASE-10210.01.patch, HBASE-10210.02.patch, HBASE-10210.03.patch, HBASE-10210.04.patch, HBASE-10210.05.patch, HBASE-10210.patch
>
>
> Not sure of the root cause yet; I am at the "how did this ever work" stage.
> We see this problem in 0.96.1, but didn't in 0.96.0 + some patches.
> It looks like RS information arriving from two sources, ZK and the server itself, can conflict. The master doesn't handle such cases (timestamp match), and anyway, technically, timestamps can collide for two separate servers.
> So the master YouAreDead-s the already-recorded reporting RS, and adds it too. Then it discovers that the new server has died with a fatal error!
> Note the threads.
> Addition is called from master initialization and from RPC.
> {noformat}
> 2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.ServerManager: Finished waiting for region servers count to settle; checked in 2, slept for 18262 ms, expecting minimum of 1, maximum of 2147483647, master is running.
> 2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.ServerManager: Registering server=h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.HMaster: Registered server found up in zk but who has not yet reported in: h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,380 INFO  [RpcServer.handler=4,port=60000] master.ServerManager: Triggering server recovery; existingServer h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 looks stale, new server:h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,380 INFO  [RpcServer.handler=4,port=60000] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> ...
> 2013-12-19 11:16:46,925 ERROR [RpcServer.handler=7,port=60000] master.HMaster: Region server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 reported a fatal error:
> ABORTING region server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 as dead server
> {noformat}
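
The race in the log above can be illustrated with a small, self-contained sketch. This is not the HBase code and not the attached patch: `RegistrationSketch`, `ToyServerManager`, `registerNaive`, `registerIdempotent`, and the `Outcome` enum are all hypothetical names invented for illustration. The sketch assumes the master keys online servers by host and port and uses the start code (the third field of the server name in the log) to tell instances apart. It contrasts a registration that treats any existing entry as stale with one that recognizes an identical start code as the same server arriving through a second path.

```java
import java.util.HashMap;
import java.util.Map;

public class RegistrationSketch {

    /** Hypothetical result of one registration attempt. */
    enum Outcome { REGISTERED, ALREADY_KNOWN, REPLACED_STALE, REJECTED_OLDER }

    /** Hypothetical stand-in for a master's record of online region servers. */
    static final class ToyServerManager {
        // "host,port" -> start code of the instance currently considered online
        private final Map<String, Long> online = new HashMap<>();

        /** Naive check: any existing entry for the same host and port is treated as stale. */
        synchronized Outcome registerNaive(String hostPort, long startCode) {
            Long existing = online.put(hostPort, startCode);
            return existing == null ? Outcome.REGISTERED : Outcome.REPLACED_STALE;
        }

        /** Idempotent check: an identical start code means the same server, seen twice. */
        synchronized Outcome registerIdempotent(String hostPort, long startCode) {
            Long existing = online.get(hostPort);
            if (existing == null) {
                online.put(hostPort, startCode);
                return Outcome.REGISTERED;
            }
            if (existing.longValue() == startCode) {
                return Outcome.ALREADY_KNOWN;   // same instance; nothing to expire
            }
            if (existing.longValue() > startCode) {
                return Outcome.REJECTED_OLDER;  // the caller is the older instance
            }
            online.put(hostPort, startCode);    // a newer instance replaces the old one
            return Outcome.REPLACED_STALE;
        }
    }

    public static void main(String[] args) {
        // Host, port, and start code taken from the log excerpt.
        String hostPort = "h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020";
        long startCode = 1387451803800L;

        ToyServerManager naive = new ToyServerManager();
        System.out.println("naive, from ZK:       " + naive.registerNaive(hostPort, startCode));
        System.out.println("naive, from RPC:      " + naive.registerNaive(hostPort, startCode));

        ToyServerManager fixed = new ToyServerManager();
        System.out.println("idempotent, from ZK:  " + fixed.registerIdempotent(hostPort, startCode));
        System.out.println("idempotent, from RPC: " + fixed.registerIdempotent(hostPort, startCode));
    }
}
```

In the naive variant the second registration of the same server is reported as a stale replacement, which mirrors the "looks stale" message where the existing and new server names are identical. Both methods are synchronized so that the check and the record happen as one step even when the calls come from different threads, as they do in the log (the master initialization thread and an RPC handler).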
> Presumably some of the recent ZK listener related changes b



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
