hbase-user mailing list archives

From Stuti Awasthi <stutiawas...@hcl.com>
Subject Unexpected shutdown of Zookeeper
Date Mon, 19 Sep 2011 05:15:02 GMT
Hi All,

I was running a 2-node cluster with 1 ZooKeeper node and 2 region server nodes. I had also
set up cluster replication to another single-node HBase-Hadoop cluster. Replication was working,
and I left the cluster running over the weekend with no data to replicate.

Today I can see that ZooKeeper on the master cluster is dead. The region server that was running
on the slave machine is also dead. The cluster to which I was replicating is running fine.

My questions are:

1. Can ZooKeeper die because there was no replication traffic over the network for a long time?

2. How do I guard against such situations? Will running 3-4 ZooKeeper nodes help?

3. If I run multiple ZooKeeper nodes, will the cluster keep running normally
even if 2-3 of them are dead?

4. In my case, 1 of the 2 region servers is dead but 1 is still working. If my ZooKeeper
node were running, would I be able to access HBase properly?
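
For questions 2 and 3, it may help to write down the majority rule that a ZooKeeper ensemble generally follows: it stays available only while more than half of its servers are up. The sketch below is plain arithmetic under that assumption. The function names are illustrative and are not part of any ZooKeeper or HBase API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare the single-node setup described above with larger ensembles.
    for n in (1, 3, 4, 5):
        print(f"{n} server(s): quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this rule a 3-node or 4-node ensemble tolerates only one failed server, and surviving two failures takes at least five, which is why odd ensemble sizes are usually preferred.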

Logs:
hbase-root-zookeeper-master.log:

2011-09-19 10:07:55,753 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection
from /10.33.64.235:44706
2011-09-19 10:07:55,758 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting
to establish new session at /10.33.64.235:44706
2011-09-19 10:07:55,761 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session
0x13271b6c4f1000c with negotiated timeout 180000 for client /10.33.64.235:44706
2011-09-19 10:10:48,318 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException:
Unable to read additional data from client sessionid 0x13271b6c4f1000c, likely client has
closed socket
2011-09-19 10:10:48,319 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection
for client /10.33.64.235:44706 which had sessionid 0x13271b6c4f1000c
2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session
0x13271b6c4f1000c, timeout of 180000ms exceeded
2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x13271b6c4f1000c

hbase-root-regionserver-slave.log:

2011-09-16 16:00:50,354 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020:
readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes
read: 0
java.io.IOException: Connection reset by peer
       at sun.nio.ch.FileDispatcher.read0(Native Method)
       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
       at sun.nio.ch.IOUtil.read(IOUtil.java:175)
       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
2011-09-16 16:00:51,058 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Opening log for replication slave%3A60020.1316168146136 at 663246
2011-09-16 16:00:51,064 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
currentNbOperations:5003 and seenEntries:0 and size: 0
2011-09-16 16:00:51,064 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Going to report log #slave%3A60020.1316168146136 for position 663246 in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427/slave%3A60020.1316168146136
2011-09-16 16:00:51,066 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Removing 0 logs in the list: []
2011-09-16 16:00:51,066 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Nothing to replicate, sleeping 1000 times 2
2011-09-16 16:00:53,068 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Opening log for replication slave%3A60020.1316168146136 at 663246
..................................
2011-09-16 17:14:49,440 WARN org.apache.zookeeper.ClientCnxn: Session 0x13271b5395c0007 for
server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection timed out
       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2011-09-16 17:14:51,039 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
/hbase/rs/master,60020,1316167798366 znode expired, trying to lock it
2011-09-16 17:14:51,088 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server slave1/172.28.96.239:2181
2011-09-16 17:14:51,089 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to slave1/172.28.96.239:2181, initiating session
2011-09-16 17:14:51,093 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper
service, session 0x13271b5395c0007 has expired, closing socket connection
2011-09-16 17:14:51,094 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
region server serverName=slave,60020,1316168145427, load=(requests=0, regions=6, usedHeap=29,
maxHeap=996): connection to cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
       at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
       at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
       at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2011-09-16 17:14:51,094 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
requests=0, regions=6, stores=6, storefiles=5, storefileIndexSize=0, memstoreSize=0, compactionQueueSize=0,
flushQueueSize=0, usedHeap=29, maxHeap=996, blockCacheSize=982352, blockCacheFree=208064384,
blockCacheCount=2, blockCacheHitCount=31, blockCacheMissCount=2, blockCacheEvictedCount=0,
blockCacheHitRatio=93, blockCacheHitCachingRatio=93
2011-09-16 17:14:51,094 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
connection to cluster: 1-0x13271b5395c0007 connection to cluster: 1-0x13271b5395c0007 received
expired from ZooKeeper, aborting
2011-09-16 17:14:51,094 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2011-09-16 17:14:51,114 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Source exiting 1
2011-09-16 17:14:52,476 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020:
exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on
60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020:
exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on
60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020:
exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on
60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020:
exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on
60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on
60020: exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020:
exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020:
exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60020:
exiting
2011-09-16 17:14:52,477 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60020:
exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020:
exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020:
exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on
60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on
60020: exiting
2011-09-16 17:14:52,478 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping
infoServer
2011-09-16 17:14:52,478 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener
on 60020
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on
60020: exiting
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on
60020: exiting
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2011-09-16 17:14:52,479 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on
60020: exiting
2011-09-16 17:14:52,481 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor
exiting
2011-09-16 17:14:52,585 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher
exiting
2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2011-09-16 17:14:52,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
regionserver60020.majorCompactionChecker exiting
2011-09-16 17:14:52,587 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Processing close of backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer
interrupted while waiting for sync requests
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing backup,,1315992791196.e5ff1d9eb66e1157d0ca8bfaaf493480.:
disabling compactions & flushes
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Processing close of testArchiveBackup,,1315915407547.e05ec3159a022f28aa92e1a01ca50fec.
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Processing close of replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.
2011-09-16 17:14:52,589 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer
exiting
2011-09-16 17:14:52,588 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
Processing close of -ROOT-,,0.70236052
2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog
writer in hdfs://master:54310/hbase/.logs/slave,60020,1316168145427
2011-09-16 17:14:52,589 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing replication,,1316166014290.5937efd76493915556d3641aa9c0b6df.:
disabling compactions & flushes
............................
2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2011-09-16 17:14:52,602 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10003 closed
2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2011-09-16 17:14:52,605 INFO org.apache.zookeeper.ZooKeeper: Session: 0x13271b6c4f10005 closed
2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Closing source 1 because: Region server is closing
2011-09-16 17:14:52,605 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
exiting
2011-09-16 17:14:53,040 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Not transferring queue since we are shutting down
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main]
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
Shutdown hook
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
shutdown hook thread.
2011-09-16 17:14:53,042 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
finished.
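
To line up the session events across the two logs above, a small script can help. This is a minimal sketch that assumes the line layout seen in the excerpts (timestamp, level, logger, message) and that records may be wrapped over several lines. The name `session_events` and the regular expressions are illustrative, not part of any HBase or ZooKeeper tooling.

```python
import re
from datetime import datetime

# Assumed record start: "YYYY-MM-DD HH:MM:SS,mmm LEVEL ..."
RECORD_START = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) ")
# ZooKeeper session ids appear in the logs as hex values such as 0x13271b6c4f1000c.
SESSION_ID = re.compile(r"0x[0-9a-f]+")


def session_events(lines):
    """Yield (timestamp, level, session_id, text) for records mentioning a session id."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line):
            records.append(line)
        elif records:
            # Wrapped continuation or stack-trace line: attach to the previous record.
            records[-1] += " " + line.strip()
    for rec in records:
        start = RECORD_START.match(rec)
        text = rec[start.end():]
        sid = SESSION_ID.search(text)
        if sid:
            ts = datetime.strptime(start.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, start.group(2), sid.group(0), text


if __name__ == "__main__":
    # Two records copied from the ZooKeeper log excerpt above.
    sample = [
        "2011-09-19 10:07:55,761 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session",
        "0x13271b6c4f1000c with negotiated timeout 180000 for client /10.33.64.235:44706",
        "2011-09-19 10:12:57,002 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session",
        "0x13271b6c4f1000c, timeout of 180000ms exceeded",
    ]
    for ts, level, sid, text in session_events(sample):
        print(ts.isoformat(), level, sid, text)
```

Running it over both log files and sorting by session id would show when each session was established, lost its connection, and expired.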

Please suggest.

Thanks
