hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9563) Autorestart doesn't work if zkcleaner fails
Date Thu, 10 Oct 2013 22:30:43 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792053#comment-13792053 ]

Enis Soztutar commented on HBASE-9563:
--------------------------------------

Today, we also saw a related issue in one of the tests.

ChaosMonkey (CM) killed the active master, but was not able to start it back again:
{code}
2013-10-09 11:27:11,911|beaver.machine|INFO|2013-10-09 11:27:11,910 INFO  [Thread-6] actions.Action:
Performing action: Restart active master
2013-10-09 11:27:11,918|beaver.machine|INFO|2013-10-09 11:27:11,918 INFO  [Thread-6] actions.Action:
Killing master:ambari-unsec-1381298460-hbase-1.cs1cloud.internal,60000,1381340537392
2013-10-09 11:27:11,919|beaver.machine|INFO|2013-10-09 11:27:11,918 INFO  [Thread-6] hbase.HBaseCluster:
Aborting Master: ambari-unsec-1381298460-hbase-1.cs1cloud.internal,60000,1381340537392
2013-10-09 11:27:11,919|beaver.machine|INFO|2013-10-09 11:27:11,918 INFO  [Thread-6] hbase.ClusterManager:
Executing remote command: ps aux | grep proc_master | grep -v grep | tr -s ' ' | cut -d '
' -f2 | xargs kill -s SIGKILL , hostname:ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:11,919|beaver.machine|INFO|2013-10-09 11:27:11,918 INFO  [Thread-6] util.Shell:
Executing full command [/usr/bin/ssh  -o StrictHostKeyChecking=no ambari-unsec-1381298460-hbase-1.cs1cloud.internal
"sudo su - hbase -c \"ps aux | grep proc_master | grep -v grep | tr -s ' ' | cut -d ' ' -f2
| xargs kill -s SIGKILL\""]
2013-10-09 11:27:16,684|beaver.machine|INFO|2013-10-09 11:27:16,683 INFO  [Thread-6] hbase.ClusterManager:
Executed remote command, exit code:0 , output:
2013-10-09 11:27:16,685|beaver.machine|INFO|2013-10-09 11:27:16,683 INFO  [Thread-6] hbase.HBaseCluster:
Waiting service:master to stop: ambari-unsec-1381298460-hbase-1.cs1cloud.internal,60000,1381340537392
2013-10-09 11:27:16,685|beaver.machine|INFO|2013-10-09 11:27:16,683 INFO  [Thread-6] hbase.ClusterManager:
Executing remote command: ps aux | grep proc_master | grep -v grep | tr -s ' ' | cut -d '
' -f2 , hostname:ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:16,685|beaver.machine|INFO|2013-10-09 11:27:16,684 INFO  [Thread-6] util.Shell:
Executing full command [/usr/bin/ssh  -o StrictHostKeyChecking=no ambari-unsec-1381298460-hbase-1.cs1cloud.internal
"sudo su - hbase -c \"ps aux | grep proc_master | grep -v grep | tr -s ' ' | cut -d ' ' -f2\""]
2013-10-09 11:27:17,183|beaver.machine|INFO|2013-10-09 11:27:17,183 INFO  [Thread-6] hbase.ClusterManager:
Executed remote command, exit code:0 , output:
2013-10-09 11:27:17,184|beaver.machine|INFO|2013-10-09 11:27:17,183 INFO  [Thread-6] actions.Action:
Killed master server:ambari-unsec-1381298460-hbase-1.cs1cloud.internal,60000,1381340537392
2013-10-09 11:27:17,184|beaver.machine|INFO|2013-10-09 11:27:17,183 INFO  [Thread-6] actions.Action:
Sleeping for:5000

2013-10-09 11:27:22,184|beaver.machine|INFO|2013-10-09 11:27:22,183 INFO  [Thread-6] actions.Action:
Starting master:ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:22,184|beaver.machine|INFO|2013-10-09 11:27:22,183 INFO  [Thread-6] hbase.HBaseCluster:
Starting Master on: ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:22,185|beaver.machine|INFO|2013-10-09 11:27:22,184 INFO  [Thread-6] hbase.ClusterManager:
Executing remote command: /usr/lib/hbase/bin/../bin/hbase-daemon.sh --config /tmp/hbaseConf
start master , hostname:ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:22,185|beaver.machine|INFO|2013-10-09 11:27:22,184 INFO  [Thread-6] util.Shell:
Executing full command [/usr/bin/ssh  -o StrictHostKeyChecking=no ambari-unsec-1381298460-hbase-1.cs1cloud.internal
"sudo su - hbase -c \"/usr/lib/hbase/bin/../bin/hbase-daemon.sh --config /tmp/hbaseConf start
master\""]
2013-10-09 11:27:23,006|beaver.machine|INFO|2013-10-09 11:27:23,006 WARN  [Thread-6] policies.Policy:
Exception occured during performing action: org.apache.hadoop.util.Shell$ExitCodeException:
2013-10-09 11:27:23,007|beaver.machine|INFO|at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
2013-10-09 11:27:23,007|beaver.machine|INFO|at org.apache.hadoop.util.Shell.run(Shell.java:379)
2013-10-09 11:27:23,007|beaver.machine|INFO|at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
2013-10-09 11:27:23,007|beaver.machine|INFO|at org.apache.hadoop.hbase.HBaseClusterManager$RemoteShell.execute(HBaseClusterManager.java:111)
2013-10-09 11:27:23,007|beaver.machine|INFO|at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:187)
2013-10-09 11:27:23,008|beaver.machine|INFO|at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:196)
2013-10-09 11:27:23,008|beaver.machine|INFO|at org.apache.hadoop.hbase.HBaseClusterManager.start(HBaseClusterManager.java:201)
2013-10-09 11:27:23,008|beaver.machine|INFO|at org.apache.hadoop.hbase.DistributedHBaseCluster.startMaster(DistributedHBaseCluster.java:148)
2013-10-09 11:27:23,008|beaver.machine|INFO|at org.apache.hadoop.hbase.chaos.actions.Action.startMaster(Action.java:77)
2013-10-09 11:27:23,008|beaver.machine|INFO|at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartMaster(RestartActionBaseAction.java:45)
2013-10-09 11:27:23,009|beaver.machine|INFO|at org.apache.hadoop.hbase.chaos.actions.RestartActiveMasterAction.perform(RestartActiveMasterAction.java:35)
2013-10-09 11:27:23,009|beaver.machine|INFO|at org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy.runOneIteration(PeriodicRandomActionPolicy.java:59)
2013-10-09 11:27:23,009|beaver.machine|INFO|at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
2013-10-09 11:27:23,009|beaver.machine|INFO|at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
2013-10-09 11:27:23,009|beaver.machine|INFO|at java.lang.Thread.run(Thread.java:722)
{code}
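As an aside, the remote kill command above is just a ps-parsing pipeline. Its PID-extraction step can be reproduced locally; the sample ps line below is fabricated for illustration:

```shell
# The parsing step of CM's remote kill pipeline, run on a fabricated
# ps(1) output line (fields: USER PID %CPU %MEM ... COMMAND).
line="hbase  12345  0.3  2.1 123456 7890 ?  Sl  11:20  0:42 java proc_master"
# tr -s ' ' squeezes runs of spaces so cut can split on a single space;
# field 2 is the PID, which xargs then hands to kill -s SIGKILL.
pid=$(echo "$line" | tr -s ' ' | cut -d ' ' -f2)
echo "$pid"
```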

From the master logs you can see that the master is not started at all:
{code}
2013-10-09 11:26:33,458 INFO  [AM.ZK.Worker-pool2-t365] master.RegionStates: Onlined 5ec605643a410b2f38dfb5e9806b0f87
on ambari-unsec-1381298460-hbase-7.cs1cloud.internal,60020,1381342526428
2013-10-09 11:26:33,458 INFO  [AM.ZK.Worker-pool2-t365] master.RegionStates: Offlined 5ec605643a410b2f38dfb5e9806b0f87
from ambari-unsec-1381298460-hbase-3.cs1cloud.internal,60020,1381342212884
2013-10-09 11:26:51,906 DEBUG [RpcServer.handler=4,port=60000] master.HMaster: Skipping move
of region IntegrationTestBigLinkedList,\xEF\xF0\xB0\xED\xBC\xA4\xB8\x88\xB6\x18\x8C\xD7nd%\x9B,1381339413686.61ee6a62bd598c8d3846b2348a48bee2.
because region already assigned to the same server ambari-unsec-1381298460-hbase-4.cs1cloud.internal,60020,1381342368480.
2013-10-09 11:27:41,642 INFO  [main] util.VersionInfo: HBase 0.96.0.2.0.6.0-67-hadoop2
<removed>
2013-10-09 11:27:47,950 INFO  [main] zookeeper.RecoverableZooKeeper: Process identifier=clean
znode for master connecting to ZooKeeper ensemble=ambari-unsec-1381298460-hbase-3.cs1cloud.internal:2181,ambari-unsec-1381298460-hbase-1.cs1cloud.internal:2181,ambari-unsec-1381298460-hbase-4.cs1cloud.internal:2181
2013-10-09 11:27:48,117 INFO  [main-SendThread(ambari-unsec-1381298460-hbase-1.cs1cloud.internal:2181)]
zookeeper.ClientCnxn: Opening socket connection to server ambari-unsec-1381298460-hbase-1.cs1cloud.internal/172.18.145.170:2181.
Will not attempt to authenticate using SASL (unknown error)
2013-10-09 11:27:48,241 INFO  [main-SendThread(ambari-unsec-1381298460-hbase-1.cs1cloud.internal:2181)]
zookeeper.ClientCnxn: Socket connection established to ambari-unsec-1381298460-hbase-1.cs1cloud.internal/172.18.145.170:2181,
initiating session
2013-10-09 11:27:48,408 INFO  [main-SendThread(ambari-unsec-1381298460-hbase-1.cs1cloud.internal:2181)]
zookeeper.ClientCnxn: Session establishment complete on server ambari-unsec-1381298460-hbase-1.cs1cloud.internal/172.18.145.170:2181,
sessionid = 0x1419d4bb0cb0138, negotiated timeout = 30000
{code}

In hbase-daemon.sh, we wait for the process to exit and then clean the master znode:
{code}
(internal_start)
    # Add to the command log file vital stats on our environment.
    echo "`date` Starting $command on `hostname`" >> $loglog
    echo "`ulimit -a`" >> $loglog 2>&1
    nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase \
        --config "${HBASE_CONF_DIR}" \
        $command "$@" start >> "$logout" 2>&1 &
    echo $! > $pid
    wait
    cleanZNode
{code}
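Per the issue summary, one way to keep a failed znode cleanup from blocking a later restart could look roughly like this. This is a sketch with a stubbed cleanZNode and a hypothetical start_after_wait wrapper, not the committed patch:

```shell
# Sketch: tolerate a failing znode cleanup so it cannot prevent the
# daemon from being started again. cleanZNode here is a stub that
# always fails, standing in for the real cleanup in hbase-daemon.sh.
cleanZNode() {
    return 1
}

start_after_wait() {
    wait                 # block until the killed daemon process exits
    cleanZNode || true   # ignore cleanup failure instead of aborting
    echo "restarting"    # the autorestart/start step would go here
}

start_after_wait
```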

If you compare the timestamp of the last log line from the old master with the first log line of the master znode-cleanup task, you will see that the former is at 11:26:51,906 and the latter is at 11:27:41,642, a gap of about 50 seconds. In the CM logs, you can see that CM tried to start the master at 11:27:22,184. It seems that we kill -9 the master process, and ps no longer shows it when we connect for waitForMasterToStop(), but the pid in fact lingers on, so the master start on the next run fails. We also do not print out the process stdout in case of an error.
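If the problem is a race between SIGKILL and the ps-based liveness check, polling the pid directly would sidestep ps parsing entirely. A rough sketch; wait_for_pid_gone is a hypothetical helper, not an HBase function:

```shell
# Hypothetical helper: poll the pid with kill -0 instead of grepping
# ps output, so the liveness check does not depend on ps formatting.
wait_for_pid_gone() {
    local pid=$1 tries=${2:-50}
    while kill -0 "$pid" 2>/dev/null; do
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            return 1     # still alive after all retries
        fi
        sleep 0.1
    done
    return 0             # pid no longer exists
}

# Example: wait for a short-lived background process to disappear.
sleep 0.2 &
wait_for_pid_gone $! && echo "gone"
```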

> Autorestart doesn't work if zkcleaner fails
> -------------------------------------------
>
>                 Key: HBASE-9563
>                 URL: https://issues.apache.org/jira/browse/HBASE-9563
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Elliott Clark
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.98.0, 0.96.0
>
>         Attachments: 9563.txt, 9563v2.txt, categorization.txt, categorization.txt
>
>
> I've seen this several times where a master didn't autorestart because zk cleaner failed.
> We should still restart the daemon even if it's not possible to clean the zk nodes.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
