hbase-issues mailing list archives

From "Vikas Vishwakarma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13592) RegionServer sometimes gets stuck during shutdown in case of cache flush failures
Date Wed, 29 Apr 2015 07:43:05 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518913#comment-14518913 ]

Vikas Vishwakarma commented on HBASE-13592:
-------------------------------------------

Adding the relevant stack trace and log trace:

{noformat}
=====================
STACK TRACE:
=====================
"regionserver60020" prio=10 tid=0x00007f70d153a000 nid=0xa105 in Object.wait() [0x00007f68cba79000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps(DrainBarrier.java:115)
<---------
        - locked <0x00007f69137f0270> (a org.apache.hadoop.hbase.util.DrainBarrier)
        at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps(DrainBarrier.java:85)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.close(FSHLog.java:926)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.closeWAL(HRegionServer.java:1250)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1041)
        at java.lang.Thread.run(Thread.java:745)

===================
LOG Trace
====================
2015-04-28 09:00:48,939 DEBUG [7,queue=2,port=60020] regionserver.HRegion - Flush requested
on AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
2015-04-28 09:00:48,939 INFO  [MemStoreFlusher.0] regionserver.HRegion - Started memstore
flush for AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.,
current region memstore size 999.2 M
2015-04-28 09:00:48,939 DEBUG [MemStoreFlusher.0] wal.FSHLog - vikas: beginOp startCacheFlush
for 0f6b76ceb0efaf2d52c211021f4ebde3
2015-04-28 09:00:48,939 WARN  [MemStoreFlusher.0] regionserver.MemStore - Snapshot called
again without clearing previous. Doing nothing. Another ongoing flush or did we fail last
attempt?
2015-04-28 09:00:48,964 ERROR [MemStoreFlusher.0] regionserver.MemStoreFlusher - Cache flush
failed for region AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready
for read. ch : java.nio.channels.SocketChannel[connected local=/10.230.229.13:45046 remote=/10.230.229.28:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1985)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1063)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1031)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1174)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
....
(multiple flush attempts and failures follow)
.....
Finally, the region is closed by the CloseRegionHandler:
.....
2015-04-28 09:02:39,035 DEBUG [-dnds1-2-crd:60020-0] handler.CloseRegionHandler - Processing
close of AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
2015-04-28 09:02:39,035 DEBUG [-dnds1-2-crd:60020-0] regionserver.HRegion - Closing AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.:
disabling compactions & flushes
2015-04-28 09:02:39,035 DEBUG [-dnds1-2-crd:60020-0] regionserver.HRegion - Updates disabled
for region AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
2015-04-28 09:02:39,039 INFO  [-dnds1-2-crd:60020-0] regionserver.HRegion - Closed AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
2015-04-28 09:02:39,039 DEBUG [-dnds1-2-crd:60020-0] handler.CloseRegionHandler - Closed AA_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1430210541446.0f6b76ceb0efaf2d52c211021f4ebde3.
.....
The shutdown sequence, however, hangs after the following lines in the log. In FSHLog.close()
the AsyncNotifier thread is stopped first, then the AsyncSyncer threads, then the AsyncWriter
thread, after which closeBarrier.stopAndDrainOps() is called; this is where it gets stuck
(see the sketch after the log).
.....
2015-04-28 10:56:12,489 INFO  [regionserver60020] regionserver.HRegionServer - stopping server
blitzhbase01-dnds1-2-crd.eng.sfdc.net,60020,1430210003779; all regions closed.
2015-04-28 10:56:12,489 DEBUG [20-WAL.AsyncNotifier] wal.FSHLog - regionserver60020-WAL.AsyncNotifier
interrupted while waiting for  notification from AsyncSyncer thread
2015-04-28 10:56:12,489 INFO  [20-WAL.AsyncNotifier] wal.FSHLog - regionserver60020-WAL.AsyncNotifier
exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer0] wal.FSHLog - regionserver60020-WAL.AsyncSyncer0
interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO  [020-WAL.AsyncSyncer0] wal.FSHLog - regionserver60020-WAL.AsyncSyncer0
exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer1] wal.FSHLog - regionserver60020-WAL.AsyncSyncer1
interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO  [020-WAL.AsyncSyncer1] wal.FSHLog - regionserver60020-WAL.AsyncSyncer1
exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer2] wal.FSHLog - regionserver60020-WAL.AsyncSyncer2
interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO  [020-WAL.AsyncSyncer2] wal.FSHLog - regionserver60020-WAL.AsyncSyncer2
exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer3] wal.FSHLog - regionserver60020-WAL.AsyncSyncer3
interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO  [020-WAL.AsyncSyncer3] wal.FSHLog - regionserver60020-WAL.AsyncSyncer3
exiting
2015-04-28 10:56:12,490 DEBUG [020-WAL.AsyncSyncer4] wal.FSHLog - regionserver60020-WAL.AsyncSyncer4
interrupted while waiting for notification from AsyncWriter thread
2015-04-28 10:56:12,490 INFO  [020-WAL.AsyncSyncer4] wal.FSHLog - regionserver60020-WAL.AsyncSyncer4
exiting
2015-04-28 10:56:12,491 DEBUG [0020-WAL.AsyncWriter] wal.FSHLog - regionserver60020-WAL.AsyncWriter
interrupted while waiting for newer writes added to local buffer
2015-04-28 10:56:12,491 INFO  [0020-WAL.AsyncWriter] wal.FSHLog - regionserver60020-WAL.AsyncWriter
exiting
2015-04-28 10:56:12,491 INFO  [regionserver60020] wal.FSHLog - closeBarrier.stopAndDrainOps
 <----- (this log added for testing)
2015-04-28 10:56:13,925 DEBUG [ver60020-EventThread] regionserver.SplitLogWorker - tasks arrived
or departed
{noformat}
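
The hang pattern matches an unbalanced begin/end on the close barrier: each flush attempt calls
closeBarrier.beginOp() via startCacheFlush (the "beginOp startCacheFlush" debug line above), but
when the flush fails in the WAL sync the matching endOp() is never reached, so
stopAndDrainOps() keeps waiting for the outstanding-op count to drain. A minimal sketch of that
counting logic (simplified, not the actual org.apache.hadoop.hbase.util.DrainBarrier code) shows
why the shutdown thread blocks forever:

{noformat}
// Minimal sketch of the drain-barrier counting logic (simplified; not the
// actual org.apache.hadoop.hbase.util.DrainBarrier implementation), showing
// why an unmatched beginOp() leaves stopAndDrainOps() waiting forever.
public class DrainBarrierSketch {
  private int outstandingOps = 0;   // incremented by beginOp(), decremented by endOp()
  private boolean draining = false; // set once stopAndDrainOps() starts

  public synchronized boolean beginOp() {
    if (draining) {
      return false;                 // no new ops once draining has started
    }
    outstandingOps++;
    return true;
  }

  public synchronized void endOp() {
    outstandingOps--;
    if (outstandingOps == 0) {
      notifyAll();                  // wake up stopAndDrainOps()
    }
  }

  public synchronized void stopAndDrainOps() throws InterruptedException {
    draining = true;
    while (outstandingOps > 0) {
      wait();                       // the WAITING frame in the stack trace above;
    }                               // never returns if a failed flush skipped endOp()
  }

  public static void main(String[] args) throws InterruptedException {
    DrainBarrierSketch closeBarrier = new DrainBarrierSketch();
    closeBarrier.beginOp();         // startCacheFlush: flush begins, outstandingOps = 1
    // ... flush fails in WAL sync, the matching endOp() is never called ...
    closeBarrier.stopAndDrainOps(); // shutdown: blocks here forever, like regionserver60020
  }
}
{noformat}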

> RegionServer sometimes gets stuck during shutdown in case of cache flush failures
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-13592
>                 URL: https://issues.apache.org/jira/browse/HBASE-13592
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.10
>            Reporter: Vikas Vishwakarma
>            Assignee: Vikas Vishwakarma
>
> Observed that the RegionServer sometimes gets stuck during shutdown in case of cache flush failures. After adding a few debug logs and looking through the stack trace, the RegionServer process appears to be stuck in closeWAL -> hlog.close -> closeBarrier.stopAndDrainOps() during the shutdown sequence in the run method.
> From the RegionServer logs we see multiple attempts to flush the cache for a particular region, each of which increments the beginOp count in the DrainBarrier, but all the flush attempts fail somewhere in the WAL sync, so the matching DrainBarrier endOp decrement never happens. Later, when shutdown is initiated, the RegionServer process is permanently stuck here.
> In this case hbase stop also does not work and the RegionServer process has to be killed explicitly with kill -9.
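
To illustrate the pairing the flush path needs (a sketch only, reusing the DrainBarrierSketch
from the earlier comment; the helper names here are hypothetical and the actual HRegion/FSHLog
flush code differs): every successful beginOp() has to be matched by an endOp() even when the
WAL sync throws, otherwise closeBarrier.stopAndDrainOps() can never drain at shutdown.

{noformat}
import java.io.IOException;

// Sketch of the begin/end pairing the flush path needs (hypothetical helper
// names; the actual HRegion/FSHLog flush code differs). The point is that a
// failed flush must still release the barrier op it acquired, otherwise the
// shutdown-time stopAndDrainOps() hangs as described above.
public class FlushPathSketch {
  private final DrainBarrierSketch closeBarrier = new DrainBarrierSketch();

  void flushRegionMemstore() throws IOException {
    if (!closeBarrier.beginOp()) {  // startCacheFlush: acquire the close barrier
      return;                       // WAL is already closing; skip this flush
    }
    try {
      syncWalAndWriteSnapshot();    // hypothetical stand-in for the WAL sync that
                                    // throws the SocketTimeoutException seen above
    } finally {
      closeBarrier.endOp();         // must run on the failure path too, so that
                                    // stopAndDrainOps() can drain at shutdown
    }
  }

  private void syncWalAndWriteSnapshot() throws IOException {
    throw new IOException("simulated WAL sync failure");
  }

  public static void main(String[] args) throws InterruptedException {
    FlushPathSketch sketch = new FlushPathSketch();
    try {
      sketch.flushRegionMemstore();
    } catch (IOException e) {
      System.out.println("flush failed: " + e.getMessage());
    }
    sketch.closeBarrier.stopAndDrainOps(); // returns immediately: the op was released
  }
}
{noformat}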



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
