hbase-issues mailing list archives

From "Jimmy Xiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10210) during master startup, RS can be you-are-dead-ed by master in error
Date Mon, 30 Dec 2013 00:58:50 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858502#comment-13858502 ]

Jimmy Xiang commented on HBASE-10210:
-------------------------------------

bq. it is not 100% safe, timestamps can collide (I am assuming you mean don't restart on equals)

Yes, I meant don't restart on equals. Why isn't it safe? If the master gets there, it means
the RS has already started. For a given host and port pair, only one instance can have
started at a given timestamp, right? Do you mean that it should sometimes restart?
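
To make the "don't restart on equals" rule concrete, here is a minimal sketch. It is not the ServerManager code or the attached patch; the class, enum, and method names are hypothetical, and it assumes a server is identified by host, port, and start timestamp, as in the server names in the log below.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a registration check; not the actual HBase implementation. */
final class ServerRegistrationSketch {

    /** Assumed identity of a region server: host, port, and start timestamp. */
    static final class ServerId {
        final String host;
        final int port;
        final long startCode;

        ServerId(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        String hostAndPort() {
            return host + ":" + port;
        }
    }

    /** Possible results of a registration attempt (names are illustrative). */
    enum Outcome {
        REGISTERED,
        ALREADY_REGISTERED,
        EXPIRED_OLDER_INSTANCE,
        REJECTED_STALE_REPORT
    }

    // Servers currently considered online, keyed by "host:port".
    private final Map<String, ServerId> online = new HashMap<>();

    /** Registers a server reported by either source (ZK scan or RPC report). */
    Outcome register(ServerId incoming) {
        ServerId existing = online.get(incoming.hostAndPort());
        if (existing == null) {
            online.put(incoming.hostAndPort(), incoming);
            return Outcome.REGISTERED;
        }
        if (existing.startCode == incoming.startCode) {
            // Same host, port, and start timestamp: treat it as the same
            // instance seen through a second source, and do not expire it.
            return Outcome.ALREADY_REGISTERED;
        }
        if (existing.startCode < incoming.startCode) {
            // A newer instance on the same host and port replaces the old one.
            online.put(incoming.hostAndPort(), incoming);
            return Outcome.EXPIRED_OLDER_INSTANCE;
        }
        // The report comes from an instance older than the one already recorded.
        return Outcome.REJECTED_STALE_REPORT;
    }
}
```

With a rule like this, the second registration of the same name would be a no-op rather than the "looks stale" recovery seen in the log. It does not settle the objection that timestamps could collide for two separate servers.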

> during master startup, RS can be you-are-dead-ed by master in error
> -------------------------------------------------------------------
>
>                 Key: HBASE-10210
>                 URL: https://issues.apache.org/jira/browse/HBASE-10210
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0, 0.96.1, 0.99.0, 0.96.1.1
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HBASE-10210.patch
>
>
> Not sure of the root cause yet, I am at "how did this ever work" stage.
> We see this problem in 0.96.1, but didn't in 0.96.0 + some patches.
> It looks like RS information arriving from 2 sources - ZK and server itself, can conflict. Master doesn't handle such cases (timestamp match), and anyway technically timestamps can collide for two separate servers.
> So, master YouAreDead-s the already-recorded reporting RS, and adds it too. Then it discovers that the new server has died with fatal error!
> Note the threads.
> Addition is called from master initialization and from RPC.
> {noformat}
> 2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.ServerManager: Finished waiting for region servers count to settle; checked in 2, slept for 18262 ms, expecting minimum of 1, maximum of 2147483647, master is running.
> 2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.ServerManager: Registering server=h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.HMaster: Registered server found up in zk but who has not yet reported in: h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,380 INFO  [RpcServer.handler=4,port=60000] master.ServerManager: Triggering server recovery; existingServer h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 looks stale, new server:h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> 2013-12-19 11:16:45,380 INFO  [RpcServer.handler=4,port=60000] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
> ...
> 2013-12-19 11:16:46,925 ERROR [RpcServer.handler=7,port=60000] master.HMaster: Region server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 reported a fatal error:
> ABORTING region server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 as dead server
> {noformat}
> Presumably some of the recent ZK listener related changes b



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
