hbase-user mailing list archives

From Vladimir Rodionov <vladrodio...@gmail.com>
Subject Re: region server crashed several times in a week
Date Fri, 22 Apr 2016 16:50:25 GMT
>> I have two questions:
>> 1. Why does the JVM pause when there are NO GCs?
>> 2016-04-22 14:32:54,330 WARN  [JvmPauseMonitor] util.JvmPauseMonitor:
>> Detected pause in JVM or host machine (eg GC): pause of approximately
>> 37360ms
>> No GCs detected

Your system is probably swapping.
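
That warning comes from HBase's JvmPauseMonitor, which simply sleeps for a
short fixed interval, measures how much longer the sleep actually took, and
checks whether the GC counters moved. A long overrun with "No GCs detected"
means the whole process was stalled by the host (pages swapped out, CPU
starvation), not by the JVM. Here is a minimal sketch of that detection idea,
not the actual HBase implementation (class name, interval and threshold below
are made up):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch of the JvmPauseMonitor idea: sleep, measure the overrun, and see
// whether garbage collections can account for it.
public class PauseSketch {
    public static void main(String[] args) throws InterruptedException {
        final long sleepMs = 500;            // assumed check interval
        final long warnThresholdMs = 10_000; // assumed warning threshold
        while (true) {
            long gcBefore = totalGcCount();
            long start = System.nanoTime();
            Thread.sleep(sleepMs);
            long extraMs = (System.nanoTime() - start) / 1_000_000 - sleepMs;
            long gcDelta = totalGcCount() - gcBefore;
            if (extraMs > warnThresholdMs) {
                System.err.printf("Detected pause of approximately %dms, %s%n",
                        extraMs,
                        gcDelta == 0 ? "No GCs detected" : gcDelta + " GC(s) observed");
            }
        }
    }

    // Sum collection counts across all collectors exposed by the JVM.
    private static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount());
        }
        return count;
    }
}

So a ~37s pause with unchanged GC counters is exactly what heavy swapping (or
an otherwise frozen host) looks like from inside the JVM.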

>> 2. We have MapReduce jobs running on the cluster. Is that related to the
>> region server crashes? How?

Check swap-in/swap-out events while the M/R jobs are running.
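
One way to do that, assuming the region servers run Linux, is to watch the
pswpin/pswpout counters in /proc/vmstat on the region server host while the
job runs (the same numbers that vmstat 5 or sar -W report). A small
illustrative watcher, not part of HBase or Hadoop:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Poll /proc/vmstat every 5 seconds and print swap-in/swap-out deltas.
// Non-zero deltas that line up with the JvmPauseMonitor warnings would
// support the swapping theory.
public class SwapWatch {
    public static void main(String[] args) throws Exception {
        Map<String, Long> prev = readVmstat();
        while (true) {
            Thread.sleep(5_000);
            Map<String, Long> cur = readVmstat();
            System.out.printf("pswpin/5s=%d pswpout/5s=%d%n",
                    cur.getOrDefault("pswpin", 0L) - prev.getOrDefault("pswpin", 0L),
                    cur.getOrDefault("pswpout", 0L) - prev.getOrDefault("pswpout", 0L));
            prev = cur;
        }
    }

    // /proc/vmstat is one "name value" pair per line; parse it into a map.
    private static Map<String, Long> readVmstat() throws Exception {
        Map<String, Long> m = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("/proc/vmstat"))) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) m.put(parts[0], Long.parseLong(parts[1]));
        }
        return m;
    }
}

If the counters do jump while the M/R job runs, the usual remedies are to
lower vm.swappiness on the HBase nodes and to make sure the combined
RegionServer heap, bucket cache, and MapReduce task memory fits in physical
RAM.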

-Vlad


On Fri, Apr 22, 2016 at 1:55 AM, jiantao <yujiantao@star-v.com.cn> wrote:

> Hi Guys,
> I need your kind help. Region servers in my cluster have crashed several
> times in a week. I have been tracking this problem for a long time but have
> not found the root cause.
> From the region server log, we can see that a DFSOutputStream
> ResponseProcessor exception for block
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 happened in
> the DFSClient.
> From the datanode log, the datanode did not receive the block
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 for 30
> minutes, and recovery of the block failed because the thread was
> interrupted while waiting for IO on channel java.nio.channels.SocketChannel.
>
> I have two questions:
> 1. Why does the JVM pause when there are NO GCs?
> 2016-04-22 14:32:54,330 WARN  [JvmPauseMonitor] util.JvmPauseMonitor:
> Detected pause in JVM or host machine (eg GC): pause of approximately
> 37360ms
> No GCs detected
>
> 2. We have MapReduce jobs running on the cluster. Is that related to the
> region server crashes? How?
>
> Thanks!
>
> Hadoop version: 2.6.0         HBase version: 1.0.1.1
> ###########region server log###########
> 2016-04-22 14:28:46,409 WARN [regionserver/slave3/192.168.68.154:16020]
> util.Sleeper: We slept 15033ms instead of 3000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-04-22 14:28:46,409 WARN  [JvmPauseMonitor] util.JvmPauseMonitor:
> Detected pause in JVM or host machine (eg GC): pause of approximately
> 12600ms
> No GCs detected
> 2016-04-22 14:30:47,076 INFO  [LruBlockCacheStatsExecutor]
> hfile.LruBlockCache: totalSize=11.75 MB, freeSize=3.95 GB, max=3.96 GB,
> blockCount=144, accesses=152144, hits=152144, hitRatio=100.00%, ,
> cachingAccesses=152086, cachingHits=152086, cachingHitsRatio=100.00%,
> evictions=1644, evicted=62, evictedPerRun=0.03771289438009262
> 2016-04-22 14:30:56,218 INFO  [BucketCacheStatsExecutor]
> bucket.BucketCache: failedBlockAdditions=0, totalSize=20.00 GB,
> freeSize=19.95 GB, usedSize=47.56 MB, cacheSize=33.54 MB, accesses=426765,
> hits=12205, IOhitsPerSecond=0, IOTimePerHit=0.00, hitRatio=2.86%,
> cachingAccesses=13317, cachingHits=12063, cachingHitsRatio=90.58%,
> evictions=0, evicted=142, evictedPerRun=Infinity
> 2016-04-22 14:32:54,329 WARN [regionserver/slave3/192.168.68.154:16020.periodicFlusher]
> util.Sleeper: We slept 37919ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-04-22 14:32:54,330 WARN [regionserver/slave3/192.168.68.154:16020.compactionChecker]
> util.Sleeper: We slept 37922ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-04-22 14:32:54,330 WARN  [JvmPauseMonitor] util.JvmPauseMonitor:
> Detected pause in JVM or host machine (eg GC): pause of approximately
> 37360ms
> No GCs detected
> 2016-04-22 14:32:54,330 INFO [regionserver/slave3/192.168.68.154:16020-SendThread(slave5:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server
> in 47741ms for sessionid 0x554360adead0239, closing socket connection and
> attempting reconnect
> 2016-04-22 14:32:54,330 INFO  [main-SendThread(slave4:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server
> in 47733ms for sessionid 0x453d16e69e530c0, closing socket connection and
> attempting reconnect
> 2016-04-22 14:32:54,330 INFO [regionserver/slave3/192.168.68.154:16020-SendThread(slave2:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server
> in 47741ms for sessionid 0x2542aa736c111cd, closing socket connection and
> attempting reconnect
> 2016-04-22 14:32:54,330 WARN [regionserver/slave3/192.168.68.154:16020]
> util.Sleeper: We slept 40647ms instead of 3000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2016-04-22 14:32:54,337 WARN  [ResponseProcessor for block
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910
> java.io.EOFException: Premature EOF: no length prefix available
>         at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2000)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>         at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:798)
> 2016-04-22 14:32:54,337 FATAL [regionserver/slave3/192.168.68.154:16020]
> regionserver.HRegionServer: ABORTING region server
> slave3,16020,1461290143957: org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing slave3,16020,1461290143957 as
> dead server
>         at
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:384)
>         at
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:289)
>         at
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:278)
>         at
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>         at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>
>
> ############data node log#############
> 2016-04-22 13:57:40,942 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910
> src: /192.168.68.154:52251 dest: /192.168.68.154:50010
> ...30mins...
> 2016-04-22 14:32:48,221 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at
> master2/192.168.68.151:8020 calls
> recoverBlock(BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910,
> targets=[192.168.68.154:50010, 192.168.68.157:50010],
> newGenerationStamp=804049)
> 2016-04-22 14:32:48,221 INFO
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
> initReplicaRecovery: blk_1074544298_803910, recoveryId=804049,
> replica=ReplicaBeingWritten, blk_1074544298_803910, RBW
>   getNumBytes()     = 83
>   getBytesOnDisk()  = 83
>   getVisibleLength()= 83
>   getVolume()       = /data6/current
>   getBlockFile()    =
> /data6/current/BP-454889425-192.168.68.150-1452620137989/current/rbw/blk_1074544298
>   bytesAcked=83
>   bytesOnDisk=83
> 2016-04-22 14:32:48,221 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910
> java.io.InterruptedIOException: Interruped while waiting for IO on
> channel java.nio.channels.SocketChannel[connected local=/192.168.68.154:50010
> remote=/192.168.68.154:52251]. 142739 millis timeout
> left.
>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-04-22 14:32:48,222 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910,
> type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
> 2016-04-22 14:32:48,222 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910,
> type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 2016-04-22 14:32:48,222 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock
> BP-454889425-192.168.68.150-1452620137989:blk_1074544298_803910 received
> exception java.io.InterruptedIOException: Interruped while waiting for IO
> on channel java.nio.channels.SocketChannel[connected local=/192.168.68.154:50010
> remote=/192.168.68.154:52251]. 142739 millis timeout
> left.
> 2016-04-22 14:32:48,222 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: slave3:50010:DataXceiver
> error processing WRITE_BLOCK operation src: /192.168.68.154:52251
> dst: /192.168.68.154:50010
> java.io.InterruptedIOException: Interruped while waiting for IO on
> channel java.nio.channels.SocketChannel[connected local=/192.168.68.154:50010
> remote=/192.168.68.154:52251]. 142739 millis timeout
> left.
>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
