hbase-issues mailing list archives

From "Jean-Marc Spaggiari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8912) [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to OFFLINE
Date Sun, 29 Dec 2013 18:37:50 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858392#comment-13858392
] 

Jean-Marc Spaggiari commented on HBASE-8912:
--------------------------------------------

I tried the patch, and I think it just moved the issue further along :(

First, I restored the default balancer to get normal behaviour.
{code}
2013-12-29 13:20:24,408 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
region server node1.domain.com,60020,1388341141398: Exception refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00,
context=post_region_open
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for
/hbase/unassigned/87dc596f763bd1b43a63c4afd93e4f00
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:848)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:811)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:747)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:674)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2013-12-29 13:20:24,413 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
abort: loaded coprocessors are: []
2013-12-29 13:20:24,420 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Failed refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00, context=post_region_open
2013-12-29 13:20:24,421 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1427652a35a108f
Attempt to transition the unassigned node for 404a7ac95dc8ce89826206453c501e2a from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the node existed and was in the expected state but then when
setting data we got a version mismatch
2013-12-29 13:20:24,423 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
2013-12-29 13:20:24,434 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1427652a35a108f
Attempt to transition the unassigned node for 87dc596f763bd1b43a63c4afd93e4f00 from RS_ZK_REGION_OPENING
to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE
set by the server node1.domain.com,60020,1388341141398
2013-12-29 13:20:24,435 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Unable to mark region {NAME => 'page,moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull,1379303806726.87dc596f763bd1b43a63c4afd93e4f00.',
STARTKEY => 'moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull', ENDKEY
=> 'moc.nuhc9.iahgnahs\x1Fhttp\x1F-1\x1F/travels/23865/\x1Fnull', ENCODED => 87dc596f763bd1b43a63c4afd93e4f00,}
as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus
another RS already has the region.
2013-12-29 13:20:24,435 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable
while processing event M_RS_OPEN_REGION
java.io.IOException: Aborting flush because server is abortted...
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1556)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1539)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1034)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:982)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:947)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.cleanupFailedOpen(OpenRegionHandler.java:365)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:115)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{code}

It crashed on the region server.

I stopped the cluster, restarted it, and then I got one region pending transition for more
than 5 minutes.

{code}
2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x34335c5090e04bb
Attempt to transition the unassigned node for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING
to RS_ZK_REGION_OPENING failed, the node existed but was version 7 not the expected version
6
2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Failed refreshing OPENING; region=75c96fb5c15793e04fb71d553a51619b, context=post_region_open
2013-12-29 13:22:37,749 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x34335c5090e04bb
Attempt to transition the unassigned node for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING
to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE
set by the server node1.domain.com,60020,1388341328265
2013-12-29 13:22:37,751 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Unable to mark region {NAME => 'page,ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011,1384385444837.75c96fb5c15793e04fb71d553a51619b.',
STARTKEY => 'ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011',
ENDKEY => 'ac.efilthgin\x1Fhttp\x1F-1\x1F/directory/all/all/all-virtuelle+four-bois+sport+piano+ecrans-geants+europeen+sandwichs+bar-etudiant+desserts+bluegrass+open-bar+jam\x1Fnull',
ENCODED => 75c96fb5c15793e04fb71d553a51619b,} as FAILED_OPEN. It's likely that the master
already timed out this open attempt, and thus another RS already has the region.
{code}

Then I stopped the master again, and this time it went well.

So, just to test with the default balancer, I ran the balancer again and again, about every 3
minutes to give it a breather between two balancing runs, and I again got a region stuck in transition.
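
For anyone reading the logs above: the BadVersion error and the "was version 7 not the expected version 6" warning both come from ZooKeeper's conditional write, where setData is given an expected version and is rejected if the znode has been rewritten in the meantime. Below is a minimal in-memory sketch of that check. It does not use the ZooKeeper or HBase API; the class and method names (VersionedNodeSketch, VersionedNode, setData) are hypothetical, and the "other writer" is only assumed to be the master re-offlining the node, as the log message itself suggests.

```java
/**
 * In-memory sketch of a versioned conditional write, in the style of
 * ZooKeeper's setData(path, data, expectedVersion). All names here are
 * hypothetical; this is not the ZooKeeper or HBase API.
 */
public class VersionedNodeSketch {

    /** Thrown when the caller's expected version is stale. */
    static class BadVersionException extends Exception {
        BadVersionException(int expected, int actual) {
            super("node was version " + actual + " not the expected version " + expected);
        }
    }

    /** A node holding a state string and a version that increases on every write. */
    static class VersionedNode {
        private String state;
        private int version;

        VersionedNode(String state) {
            this.state = state;
            this.version = 0;
        }

        synchronized int getVersion() {
            return version;
        }

        synchronized String getState() {
            return state;
        }

        /** Writes only if expectedVersion matches the current version; returns the new version. */
        synchronized int setData(String newState, int expectedVersion) throws BadVersionException {
            if (expectedVersion != version) {
                throw new BadVersionException(expectedVersion, version);
            }
            state = newState;
            version++;
            return version;
        }
    }

    public static void main(String[] args) throws Exception {
        VersionedNode node = new VersionedNode("M_ZK_REGION_OFFLINE");

        // The region server moves the node to OPENING and remembers the version it wrote.
        int seenByRegionServer = node.setData("RS_ZK_REGION_OPENING", node.getVersion());

        // Assumption: another writer (for example the master, after timing out the open)
        // rewrites the node, which bumps its version.
        node.setData("M_ZK_REGION_OFFLINE", seenByRegionServer);

        // The region server now tries to refresh OPENING with its stale version.
        try {
            node.setData("RS_ZK_REGION_OPENING", seenByRegionServer);
        } catch (BadVersionException e) {
            System.out.println("Refresh rejected: " + e.getMessage());
        }
        System.out.println("Node left in state " + node.getState());
    }
}
```

In this sketch the stale writer loses and the node stays in the state set by the other writer, which matches the shape of the warnings above: the refresh of OPENING fails, and the follow-up transition to FAILED_OPEN finds the node already in M_ZK_REGION_OFFLINE.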

> [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to OFFLINE
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8912
>                 URL: https://issues.apache.org/jira/browse/HBASE-8912
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Priority: Critical
>             Fix For: 0.94.16
>
>         Attachments: 8912-0.94-alt2.txt, 8912-0.94.txt, HBase-0.94 #1036 test - testRetrying [Jenkins].html, log.txt, org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt
>
>
> AM throws this exception which subsequently causes the master to abort: 
> {code}
> java.lang.IllegalStateException: Unexpected state : testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b.
state=PENDING_OPEN, ts=1372891751912, server=hemera.apache.org,39064,1372891746132 .. Cannot
transit it to OFFLINE.
> 	at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
> 	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
> 	at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> 	at java.lang.Thread.run(Thread.java:662)
> {code}
> This exception trace is from the failing test TestMetaReaderEditor which is failing pretty
> frequently, but looking at the test code, I think this is not a test-only issue, but affects
> the main code path.
> https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
