hbase-issues mailing list archives

From "Victor Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
Date Fri, 05 Sep 2014 08:46:28 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122682#comment-14122682 ]

Victor Xu commented on HBASE-11902:
-----------------------------------

Yes, stack. The regionserver main thread is waiting at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps,
but the root cause of the abort is DataNode failures. You can find the details in the log below
(a simplified sketch of the blocking wait follows the excerpt):
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncSyncer sync, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,801 ERROR org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException while writing trailer
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,802 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,802 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Riding over HLog close failure! error count=1
2014-09-03 13:38:03,804 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Rolled WAL /hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409722420708 with entries=32565, filesize=118.6 M; new WAL /hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409722683780
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707475254
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707722202
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707946159
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409708155788
2014-09-03 13:38:03,839 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:03,839 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c., current region memstore size 218.5 M
2014-09-03 13:38:03,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:03,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c., current region memstore size 218.5 M
2014-09-03 13:38:03,897 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:04,699 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: One or more threads are no longer alive -- stop
2014-09-03 13:38:04,699 INFO org.apache.hadoop.ipc.RpcServer: Stopping server on 60020
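
For context, here is a minimal sketch of the drain-barrier pattern that the regionserver main thread is parked in. This is an illustration only, not the actual org.apache.hadoop.hbase.util.DrainBarrier source; the class name SimpleDrainBarrier and its fields are made up for the example. It shows why stopAndDrainOps() can block indefinitely during abort: the wait only returns once every in-flight WAL operation reports completion, and an operation stuck on a dead HDFS pipeline ("All datanodes ... are bad") may never do so.

    // Simplified illustration only, not the real org.apache.hadoop.hbase.util.DrainBarrier.
    public class SimpleDrainBarrier {
      private boolean stopped = false;
      private int outstandingOps = 0;

      // Called before starting an operation that must finish before shutdown.
      public synchronized boolean beginOp() {
        if (stopped) return false;
        outstandingOps++;
        return true;
      }

      // Called when the operation completes, normally or with an error.
      public synchronized void endOp() {
        outstandingOps--;
        if (outstandingOps == 0) notifyAll();
      }

      // Blocks until all outstanding operations have ended. If an op never calls
      // endOp(), for example a WAL sync blocked on a broken HDFS pipeline, this
      // wait never returns and the aborting regionserver thread hangs here.
      public synchronized void stopAndDrainOps() throws InterruptedException {
        stopped = true;
        while (outstandingOps > 0) {
          wait();
        }
      }
    }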

> RegionServer was blocked while aborting
> ---------------------------------------
>
>                 Key: HBASE-11902
>                 URL: https://issues.apache.org/jira/browse/HBASE-11902
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 0.98.4
>         Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
>            Reporter: Victor Xu
>         Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, jstack_hadoop461.cm6.log
>
>
> Generally, a regionserver automatically aborts when isHealthy() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs and found out that it was caused by DataNode failures: the "regionserver60020" thread was blocked while closing the WAL.
> This issue doesn't happen very often, but when it does, it always leads to a huge number of request failures. The only way to recover is kill -9.
> I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
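
One possible direction, offered only as an assumption and not as the fix that was eventually committed for this issue: bound the drain wait with a timeout so that an abort can proceed even when a WAL operation is stuck on a dead pipeline. The method name stopAndDrainOpsWithTimeout and its millisecond parameter below are hypothetical; the sketch extends the SimpleDrainBarrier illustration shown after the log excerpt above.

    // Hypothetical variant of the drain wait: give up after a deadline so an
    // aborting regionserver does not hang forever on an op that can never finish.
    public synchronized boolean stopAndDrainOpsWithTimeout(long timeoutMs)
        throws InterruptedException {
      stopped = true;
      long deadline = System.currentTimeMillis() + timeoutMs;
      while (outstandingOps > 0) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return false; // timed out; the caller decides whether to continue the abort anyway
        }
        wait(remaining);
      }
      return true; // all in-flight ops drained cleanly
    }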



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
