hbase-issues mailing list archives

From "Vikas Vishwakarma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13418) Regions getting stuck in PENDING_CLOSE state infinitely in high load HA scenarios
Date Wed, 22 Apr 2015 05:41:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506464#comment-14506464 ]

Vikas Vishwakarma commented on HBASE-13418:
-------------------------------------------

[~esteban], also, in the disruptive case the issue is that the regions remain stuck
indefinitely even after all the DataNodes are back up and the HDFS layer has recovered completely.
I am testing the DFS timeout fix provided by [~apurtell], which is a clone of HDFS-7005,
against the reproducible case first, and will then check whether it also fixes other similar scenarios.


> Regions getting stuck in PENDING_CLOSE state infinitely in high load HA scenarios
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-13418
>                 URL: https://issues.apache.org/jira/browse/HBASE-13418
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.10
>            Reporter: Vikas Vishwakarma
>
> In some heavy data load cases, when multiple RegionServers are going up and down (HA)
> or when we try to shut down or restart the entire HBase cluster, we observe that some regions
> get stuck in the PENDING_CLOSE state indefinitely.
> Going through the logs for one region stuck in PENDING_CLOSE, it looks as though two
> memstore flushes were triggered for this region within a few milliseconds, as shown below,
> and some time later the RegionServer aborts with "Unrecoverable exception while closing region".
> I suspect this is some kind of race condition, but need to check further.
> Logs:
> ================
> ......
> 2015-04-06 11:47:33,309 INFO  [2,queue=0,port=60020] regionserver.HRegionServer - Close
884fd5819112370d9a9834895b0ec19c, via zk=yes, znode version=0, on blitzhbase01-dnds1-4-crd.eng.sfdc.net,60020,1428318111711
> 2015-04-06 11:47:33,309 DEBUG [-dnds3-4-crd:60020-0] handler.CloseRegionHandler - Processing
close of RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion - Closing RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.:
disabling compactions & flushes
> 2015-04-06 11:47:33,319 INFO  [-dnds3-4-crd:60020-0] regionserver.HRegion - Running close
preflush of RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,319 INFO  [-dnds3-4-crd:60020-0] regionserver.HRegion - Started memstore
flush for RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
current region memstore size 70.0 M
> 2015-04-06 11:47:33,327 DEBUG [-dnds3-4-crd:60020-0] regionserver.HRegion - Updates disabled
for region RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.
> 2015-04-06 11:47:33,328 INFO  [-dnds3-4-crd:60020-0] regionserver.HRegion - Started memstore
flush for RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
current region memstore size 70.0 M
> 2015-04-06 11:47:33,328 WARN  [-dnds3-4-crd:60020-0] wal.FSHLog - Couldn't find oldest
seqNum for the region we are about to flush: [884fd5819112370d9a9834895b0ec19c]
> 2015-04-06 11:47:33,328 WARN  [-dnds3-4-crd:60020-0] regionserver.MemStore - Snapshot
called again without clearing previous. Doing nothing. Another ongoing flush or did we fail
last attempt?
> 2015-04-06 11:47:33,334 FATAL [-dnds3-4-crd:60020-0] regionserver.HRegionServer - ABORTING
region server blitzhbase01-dnds3-4-crd.eng.sfdc.net,60020,1428318082860: Unrecoverable exception
while closing region RMHA_1,\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1428318937003.884fd5819112370d9a9834895b0ec19c.,
still finishing close
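
The pattern described above (two "Started memstore flush" records for the same region a few milliseconds apart) can be searched for in a RegionServer log with a small script. The following is only a rough sketch and not part of HBase: the function name `find_repeated_flush_starts`, the 100 ms default window, and the assumption that each log record sits on a single unwrapped line are all illustrative choices, and the regular expression is based solely on the record layout visible in the excerpt above.

```python
import re
from datetime import datetime, timedelta

# Matches a "Started memstore flush for <region name>" record and captures
# the timestamp and the 32-hex-digit encoded region name. The layout is
# inferred from the log excerpt in this report (an assumption).
RECORD = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\w+\s+.*?"
    r"Started memstore flush for .*?\.(?P<region>[0-9a-f]{32})\."
)


def find_repeated_flush_starts(lines, window_ms=100):
    """Return (region, first_ts, second_ts) for every pair of consecutive
    flush-start records of the same region that are at most window_ms apart.

    Hypothetical helper: assumes one log record per line (unwrapped).
    """
    window = timedelta(milliseconds=window_ms)
    last_start = {}  # encoded region name -> timestamp of latest flush start
    hits = []
    for line in lines:
        # Tolerate a leading "> " quote prefix as used in mail archives.
        match = RECORD.search(line.lstrip("> "))
        if not match:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        region = match.group("region")
        previous = last_start.get(region)
        if previous is not None and ts - previous <= window:
            hits.append((region, previous, ts))
        last_start[region] = ts
    return hits
```

Calling `find_repeated_flush_starts(open(path))` on an unwrapped log file would list the affected regions. A hit only shows that two flush starts were logged close together; it does not by itself confirm the suspected race condition.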



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
