hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8537) Dead region server pulled in from ZK
Date Mon, 13 May 2013 18:55:17 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656256#comment-13656256 ]

Jean-Daniel Cryans commented on HBASE-8537:
-------------------------------------------

bq. In your test, the new region server instance is rejected actually, right? It should be fixed.

No, it does the right thing. The master figures out that the old region server is dead because
a new instance is coming back on the same host and port (from the dead!), so, as you can see, it
triggers an SSH (ServerShutdownHandler) for the old instance (ServerManager: Added=172.21.3.117,60020,1368469063154
to dead servers). This is the rest of the log:

{noformat}
2013-05-13 11:18:36,474 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for 172.21.3.117,60020,1368469063154
2013-05-13 11:18:36,477 DEBUG org.apache.hadoop.hbase.master.MasterFileSystem: Renamed region directory: file:/tmp/hbase-jdcryans/hbase/.logs/172.21.3.117,60020,1368469063154-splitting
2013-05-13 11:18:36,477 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [172.21.3.117,60020,1368469063154]
2013-05-13 11:18:36,479 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
2013-05-13 11:18:36,480 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs in [file:/tmp/hbase-jdcryans/hbase/.logs/172.21.3.117,60020,1368469063154-splitting]
2013-05-13 11:18:36,485 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/file%3A%2Ftmp%2Fhbase-jdcryans%2Fhbase%2F.logs%2F172.21.3.117%2C60020%2C1368469063154-splitting%2F172.21.3.117%252C60020%252C1368469063154.1368469068703
2013-05-13 11:18:36,486 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/file%3A%2Ftmp%2Fhbase-jdcryans%2Fhbase%2F.logs%2F172.21.3.117%2C60020%2C1368469063154-splitting%2F172.21.3.117%252C60020%252C1368469063154.1368469068703 ver = 0
2013-05-13 11:18:37,419 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 1
2013-05-13 11:18:38,420 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 1
2013-05-13 11:18:39,421 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 1
2013-05-13 11:18:39,480 DEBUG org.apache.hadoop.hbase.master.ServerManager: STARTUP: Server 172.21.3.117,60020,1368469116206 came back up, removed it from the dead servers list
2013-05-13 11:18:39,480 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=172.21.3.117,60020,1368469116206
{noformat}

In this case I just ran kill -9 on the region server; I did not take down the whole cluster.
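
To make the comparison in the log above concrete, here is a minimal Python sketch (not HBase code) that splits the two server names from the log into host, port and start code. The helper name `parse_server_name` is hypothetical, and the sketch assumes the `host,port,startcode` layout seen in the log, with the start code being a millisecond timestamp.

```python
def parse_server_name(name):
    """Split 'host,port,startcode' into (host, port, startcode).

    Hypothetical helper for illustration; assumes the layout seen in the log.
    """
    host, port, start_code = name.rsplit(",", 2)
    return host, int(port), int(start_code)


# Both names are copied from the master log above.
old = parse_server_name("172.21.3.117,60020,1368469063154")  # added to dead servers
new = parse_server_name("172.21.3.117,60020,1368469116206")  # registered afterwards

same_endpoint = old[:2] == new[:2]   # same host and port
newer_is_second = new[2] > old[2]    # later start code means newer instance
gap_ms = new[2] - old[2]             # difference between the two start codes

print(same_endpoint, newer_is_second, gap_ms)  # True True 53052
```

The two names differ only in the start code, which is what lets the master treat them as an old and a new instance of the same region server.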
                
> Dead region server pulled in from ZK
> ------------------------------------
>
>                 Key: HBASE-8537
>                 URL: https://issues.apache.org/jira/browse/HBASE-8537
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.98.0
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>            Priority: Minor
>         Attachments: trunk-8537.patch
>
>
> When a cluster restarts quickly after a crash, the master still pulls in the dead region
server from zk, even though a new region server instance has already reported in.
> {noformat}
> 2013-05-12 18:32:52,996 INFO  [IPC Server handler 6 on 36000] org.apache.hadoop.hbase.master.ServerManager: Registering server=a1217.halxg.cloudera.com,36020,1368408767773
> ....
> 2013-05-12 18:32:54,653 INFO  [master-a1220.halxg.cloudera.com,36000,1368408767520] org.apache.hadoop.hbase.master.HMaster: Registering server found up in zk but who has not yet reported in: a1217.halxg.cloudera.com,36020,1368378273768
> 2013-05-12 18:32:54,653 INFO  [master-a1220.halxg.cloudera.com,36000,1368408767520] org.apache.hadoop.hbase.master.ServerManager: Registering server=a1217.halxg.cloudera.com,36020,1368378273768
> {noformat}
> We should not pull in the second region server instance from zk.  It is actually dead.
 We can figure this out from the hostname and the port, since we can assume that no two region
server instances can be alive on the same host and port.  To be more cautious, we can check
the timestamp as well.  The live one should be the one with the later timestamp, not the one
pulled in from zk.
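
The check proposed above can be sketched roughly as follows. This is an illustrative Python sketch only, not the attached patch and not HBase's API: `ServerName`, `parse` and `servers_to_pull_from_zk` are hypothetical names, and the second zk entry (`other-host.example.com`) is made-up data added to show a server that would still be pulled in.

```python
from typing import Iterable, List, NamedTuple


class ServerName(NamedTuple):
    """Hypothetical stand-in for a 'host,port,startcode' server name."""
    host: str
    port: int
    start_code: int

    @classmethod
    def parse(cls, text: str) -> "ServerName":
        host, port, start_code = text.rsplit(",", 2)
        return cls(host, int(port), int(start_code))


def servers_to_pull_from_zk(online: Iterable[ServerName],
                            in_zk: Iterable[ServerName]) -> List[ServerName]:
    """Return zk entries worth registering.

    An entry is skipped when a server on the same host and port has already
    reported in with an equal or later start code (the zk entry is stale).
    """
    newest = {}
    for s in online:
        key = (s.host, s.port)
        newest[key] = max(newest.get(key, -1), s.start_code)
    return [s for s in in_zk
            if s.start_code > newest.get((s.host, s.port), -1)]


# Reported in (from the log in the description).
online = [ServerName.parse("a1217.halxg.cloudera.com,36020,1368408767773")]
in_zk = [
    # Stale entry from the log in the description.
    ServerName.parse("a1217.halxg.cloudera.com,36020,1368378273768"),
    # Hypothetical server that has not reported in yet.
    ServerName.parse("other-host.example.com,36020,1368408700000"),
]

result = servers_to_pull_from_zk(online, in_zk)
print([s.host for s in result])  # ['other-host.example.com']
```

Comparing start codes as well as host and port keeps the sketch on the cautious side: a zk entry is dropped only when a same-endpoint server with an equal or later start code is already registered.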

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
