hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: hbase region server shutdown after datanode connection exception
Date Thu, 23 May 2013 16:52:32 GMT
You are looking at it the wrong way. Per
http://hbase.apache.org/book.html#trouble.general, always walk up the
log to the first exception. In this case it's a session timeout.
Whatever happens next is most probably a side effect of that.

To help debug your issue, I would suggest reading this section of the
reference guide: http://hbase.apache.org/book.html#trouble.rs.runtime
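
If the expirations turn out to come from long GC pauses or a saturated machine (the usual suspects that section walks through), the real fix is on the GC/load side, but some people also give the session a bit more slack. A minimal hbase-site.xml sketch, assuming HBase manages your ZooKeeper quorum; the property names are the standard ones, the values below are only illustrative:

<!-- How long ZooKeeper waits before expiring the region server's session. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>
<!-- When HBase manages ZK, the ZK server must also be willing to grant sessions this long. -->
<property>
  <name>hbase.zookeeper.property.maxSessionTimeout</name>
  <value>180000</value>
</property>

Keep in mind that a longer timeout only delays the abort; it does not address whatever is stalling the process.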

J-D

On Tue, May 21, 2013 at 7:17 PM, Cheng Su <scarcer.cn@gmail.com> wrote:
> Hi all.
>
> I have a small HBase cluster with 3 physical machines.
>
> On 192.168.1.80 there are the HMaster and a region server. On 81 and 82
> there is one region server each.
>
> The region server on 80 could not sync its HLog after a datanode access
> exception and started to shut down.
>
> The datanode itself was not shut down and responds to other requests
> normally. I'll paste the logs below.
>
> My questions are:
>
> 1. Why does this exception cause the region server to shut down? Can I
> prevent it?
>
> 2. Is there any tool (a shell command is best, something like hadoop
> dfsadmin -report) to monitor an HBase region server, i.e. to check whether
> it is alive or dead?
>
> I have done some research and Nagios/Ganglia can do such things, but I
> really just want to know whether the region server is alive or dead, so
> they are a bit overqualified. And since I'm not using CDH, I don't think
> I can use Cloudera Manager.
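
For the second question: if all you need is a yes/no liveness answer without standing up Nagios or Ganglia, you can ask the master which region servers it currently considers live and which it has declared dead. Below is a minimal sketch against the 0.94-era Java client API (untested, the class name is mine; adjust imports and methods if you are on a different version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ListLiveRegionServers {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath,
    // e.g. run with: java -cp "$(hbase classpath):." ListLiveRegionServers
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      ClusterStatus status = admin.getClusterStatus();
      // Region servers the master currently considers live.
      for (ServerName sn : status.getServers()) {
        System.out.println("LIVE " + sn.getHostname() + ":" + sn.getPort());
      }
      // Region servers the master has declared dead since it started.
      for (ServerName sn : status.getDeadServerNames()) {
        System.out.println("DEAD " + sn.getHostname() + ":" + sn.getPort());
      }
    } finally {
      admin.close();
    }
  }
}

If you would rather not compile anything, piping "status 'simple'" into the hbase shell prints roughly the same live/dead server summary from the command line.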
>
> Here are the logs.
>
> HBase master:
>
> 2013-05-21 17:03:32,675 ERROR org.apache.hadoop.hbase.master.HMaster: Region server hadoop01,60020,1368774173179 reported a fatal error:
> ABORTING region server hadoop01,60020,1368774173179: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
> Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
>
>
>
> Region Server:
>
> 2013-05-21 17:00:16,895 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x3eb14c67540002, closing socket connection and attempting reconnect
> 2013-05-21 17:00:35,896 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x13eb14ca4bb0000, closing socket connection and attempting reconnect
> 2013-05-21 17:03:31,498 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_9188414668950016309_4925046 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.80:57020 remote=/192.168.1.82:50010]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2784)
> 2013-05-21 17:03:31,520 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9188414668950016309_4925046 bad datanode[0] 192.168.1.82:50010
> 2013-05-21 17:03:32,315 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /192.168.1.82:2100
> 2013-05-21 17:03:32,316 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop03/192.168.1.82:2100, initiating session
> 2013-05-21 17:03:32,317 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop03/192.168.1.82:2100, sessionid = 0x13eb14ca4bb0000, negotiated timeout = 180000
> 2013-05-21 17:03:32,497 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> java.io.IOException: Reflection
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:230)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1091)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1195)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1057)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:228)
>         ... 4 more
> Caused by: java.io.IOException: DFSOutputStream is closed
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3483)
>         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>         ... 8 more
> 2013-05-21 17:03:32,497 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
> 2013-05-21 17:03:32,498 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: new createWriter -- HADOOP-6840 -- not available
> 2013-05-21 17:03:32,548 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop03/192.168.1.82:2100
> 2013-05-21 17:03:32,548 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop03/192.168.1.82:2100, initiating session
> 2013-05-21 17:03:32,549 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x3eb14c67540002 has expired, closing socket connection
> 2013-05-21 17:03:32,549 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hadoop01,60020,1368774173179: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
> 2013-05-21 17:03:32,549 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> 2013-05-21 17:03:32,675 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requestsPerSecond=0, numberOfOnlineRegions=111, numberOfStores=111, numberOfStorefiles=109, storefileIndexSizeMB=1, rootIndexSizeKB=1141, totalStaticIndexSizeKB=50806, totalStaticBloomSizeKB=0, memstoreSizeMB=412, readRequestsCount=302086613, writeRequestsCount=1802402, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=1631, maxHeapMB=2991, blockCacheSizeMB=988.97, blockCacheFreeMB=207.71, blockCacheCount=15710, blockCacheHitCount=2579547598, blockCacheMissCount=4348105, blockCacheEvictedCount=2734969, blockCacheHitRatio=99%, blockCacheHitCachingRatio=99%, hdfsBlocksLocalityIndex=100
> 2013-05-21 17:03:33,014 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://hadoop01:51001/hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369127012498, syncFs=true, hflush=false
> 2013-05-21 17:03:33,014 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> java.io.IOException: All datanodes 192.168.1.82:50010 are bad. Aborting...
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2913)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:2406)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2610)
> 2013-05-21 17:03:33,015 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
> 2013-05-21 17:03:33,015 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369075567280, entries=0, filesize=124.  for /hbase/.logs/hadoop01,60020,1368774173179/hadoop01%2C60020%2C1368774173179.1369127012498
> 2013-05-21 17:03:37,586 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002 received expired from ZooKeeper, aborting
> 2013-05-21 17:03:38,386 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>
>
>
> Datanode:
>
> 2013-05-21 17:03:36,431 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_9188414668950016309_4925046 java.io.IOException: Connection reset by peer
> 2013-05-21 17:03:36,411 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1430, bytes: 6662742, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000234_0, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_4644468068464468915_1594795, duration: 1103797113
> 2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_9188414668950016309_4925046 0 : Thread is interrupted.
> 2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_9188414668950016309_4925046 terminating
> 2013-05-21 17:03:36,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_9188414668950016309_4925046 received exception java.io.IOException: Connection reset by peer
> 2013-05-21 17:03:36,671 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.82:50010, storageID=DS-1178868297-192.168.1.82-50010-1350976945636, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(Unknown Source)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
>         at sun.nio.ch.IOUtil.read(Unknown Source)
>         at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read1(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.read(Unknown Source)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>         at java.lang.Thread.run(Unknown Source)
> 2013-05-21 17:03:36,693 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1428, bytes: 14134272, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000230_1, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_8190430407487842768_2648292, duration: 2088034034
> 2013-05-21 17:03:36,784 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.82:50010, dest: /192.168.1.82:1432, bytes: 8982528, op: HDFS_READ, cliID: DFSClient_attempt_201305161127_1389_m_000228_1, offset: 0, srvID: DS-1178868297-192.168.1.82-50010-1350976945636, blockid: blk_4425180518832382171_1731508, duration: 1284097425
>
