hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re[2]: region servers stuck
Date Fri, 24 Jul 2015 11:49:24 GMT

Is it possible for you to upgrade to 0.98.10+?

I will take a look at your logs later. 

Thanks

Friday, July 24, 2015, 7:15 PM +0800 from Konstantin Chudinov <kchudinov@griddynamics.com>:
>Hello Ted,
>Thank you for your answer!
>Hadoop and HBase versions are:
>2.3.0-cdh5.1.0 - Hadoop (and HDFS) version
>hbase-0.98.1
>About HDFS: I don't see anything special in the logs. I've attached them to this message. By the way, it's another server that also crashed (I've lost the HDFS logs of the previous server), so the HBase logs are in the archive as well.
>
>Best regards,
>
>Konstantin Chudinov
>
>On 23 Jul 2015, at 20:44, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>>What release of HBase do you use?
>>
>>I looked at the two log files but didn't find such information. 
>>In the log for node 118, I saw something such as the following:
>>Failed to connect to /10.0.229.16:50010 for block, add to deadNodes and continue 
>>
>>Was HDFS healthy around the time the region server got stuck?
>>
>>Cheers
>>
>>
>>Friday, July 24, 2015, 12:21 AM +0800 from Konstantin Chudinov <kchudinov@griddynamics.com>:
>>>Hi all,
>>>Our team faced a cascading hang of region servers. The RS logs are similar to those in HBASE-10499 (https://issues.apache.org/jira/browse/HBASE-10499), except there is no RegionTooBusyException before the flush loop:
>>>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed major compaction of 2 file(s) in s of table4,\xC7 ,1390920313296.9f554d5828cfa9689de27c1a42d844e3. into 65dae45c82264b4d80fc7ed0818a4094(size=1.2 M), total size for store is 1.2 M. This selection was in queue for 0sec, and took 0sec to execute.
>>>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed compaction: Request = regionName=table4,\xC7 ,1390920313296.9f554d5828cfa9689de27c1a42d844e3., storeName=s, fileCount=2, fileSize=1.2 M, priority=998, time=24425664829680753; duration=0sec
>>>2015-07-19 07:32:41,962 INFO org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy: Default compaction algorithm has selected 1 files from 1 candidates
>>>2015-07-19 07:32:44,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 18943
>>>2015-07-19 07:32:54,765 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 4851
>>>2015-07-19 07:33:04,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 7466
>>>2015-07-19 07:33:14,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 4940
>>>2015-07-19 07:33:24,765 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 12909
>>>2015-07-19 07:33:34,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 5897
>>>2015-07-19 07:33:44,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 9110
>>>2015-07-19 07:33:54,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 7109
>>>....
>>>until we rebooted the RS at 10:08.
>>>8 servers got stuck at the same time.
>>>I haven't found anything in the HMaster's logs. Thread dumps show that many threads (including the flush thread) are waiting for a read lock while accessing HDFS:
>>>"RpcServer.handler=19,port=60020" - Thread t@90
>>>  java.lang.Thread.State: WAITING
>>>at java.lang.Object.wait(Native Method)
>>>- waiting on <77770184> (a org.apache.hadoop.hbase.util.IdLock$Entry)
>>>at java.lang.Object.wait(Object.java:503)
>>>at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
>>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:319)
>>>at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
>>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
>>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
>>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
>>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
>>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>>>at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
>>>at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>>>at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.requestSeek(NonLazyKeyValueScanner.java:39)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:311)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3987)
>>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
>>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
>>>at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
>>>- locked <1623a240> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>>>at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>>>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>at java.lang.Thread.run(Thread.java:745)
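>>>
>>>(For context: the WAITING frames above come from the per-block IdLock in HFileReaderV2.readBlock - only one thread loads a given HFile block, keyed by its offset, while every other reader of that block waits on the same IdLock$Entry. A simplified, hypothetical sketch of that pattern, not the actual HBase source:
>>>
>>>import java.util.concurrent.ConcurrentHashMap;
>>>import java.util.concurrent.ConcurrentMap;
>>>
>>>// Simplified per-id lock: one owner per id, everyone else Object.wait()s
>>>// on the owner's Entry - the WAITING state seen in the dump above.
>>>final class PerIdLock {
>>>  static final class Entry {
>>>    boolean released = false;
>>>  }
>>>  private final ConcurrentMap<Long, Entry> entries = new ConcurrentHashMap<>();
>>>
>>>  Entry lock(long id) throws InterruptedException {
>>>    for (;;) {
>>>      Entry mine = new Entry();
>>>      Entry owner = entries.putIfAbsent(id, mine);
>>>      if (owner == null) {
>>>        return mine;                     // we now own this id
>>>      }
>>>      synchronized (owner) {
>>>        while (!owner.released && entries.get(id) == owner) {
>>>          owner.wait();                  // queued behind the current owner
>>>        }
>>>      }                                  // owner released; retry
>>>    }
>>>  }
>>>
>>>  void unlock(long id, Entry mine) {
>>>    entries.remove(id, mine);
>>>    synchronized (mine) {
>>>      mine.released = true;
>>>      mine.notifyAll();                  // wake waiters; they re-race for the id
>>>    }
>>>  }
>>>}
>>>
>>>So if the one thread that owns the entry for a block is itself stuck reading from HDFS, all of these handlers stay parked indefinitely.)
>>>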
>>>"RpcServer.handler=29,port=60020" - Thread t@100
>>>  java.lang.Thread.State: BLOCKED
>>>at org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:354)
>>>- waiting to lock <399a6ff3> (a org.apache.hadoop.hdfs.DFSInputStream) owned by "RpcServer.handler=21,port=60020" t@92
>>>at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1270)
>>>at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
>>>at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1224)
>>>at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
>>>at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
>>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
>>>at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
>>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
>>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
>>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
>>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
>>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>>>at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
>>>at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
>>>at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
>>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
>>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3866)
>>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateFromJoinedHeap(HRegion.java:3840)
>>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3995)
>>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
>>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
>>>at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
>>>- locked <3af54140> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>>>at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>>>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>at java.lang.Thread.run(Thread.java:745)
>>>  Locked ownable synchronizers:
>>>- locked <5320bfc4> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
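>>>
>>>(The BLOCKED frames, in turn, come from sharing one open DFSInputStream per store file: the stateful read path - and even getFileLength() - is synchronized on the stream instance, so a single handler stuck reading from a bad datanode holds the monitor and every other reader of that file queues behind it. A hypothetical illustration of that shape, not the actual HDFS code:
>>>
>>>// One shared, stateful stream: seek-then-read must be atomic, hence the
>>>// instance-level lock. A reader retrying a dead datanode can hold this
>>>// monitor for a long time, and the rest of the handlers then show up as
>>>// BLOCKED "waiting to lock <...> (a DFSInputStream)" in the thread dump.
>>>class SharedInputStream {
>>>  private long pos = 0;
>>>  private final long length = 1024;
>>>
>>>  synchronized int read(byte[] buf, int off, int len) {
>>>    // ...choose a datanode and read from the current position;
>>>    // retries against dead nodes happen while holding this monitor...
>>>    pos += len;
>>>    return len;
>>>  }
>>>
>>>  // Even a cheap accessor blocks behind a stuck read.
>>>  synchronized long getFileLength() {
>>>    return length;
>>>  }
>>>}
>>>
>>>That matches the dump: handler 29 is blocked in getFileLength() waiting for the stream monitor owned by handler 21.)
>>>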
>>>I have zipped all the logs and dumps and attached them to this mail.
>>>This problem occurs once a month on our cluster.
>>>Does anybody know the reason for this cascading server failure?
>>>Thank you in advance!
>>>
>>>Konstantin Chudinov