hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jean-Daniel Cryans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5119) Set the TimeoutMonitor's timeout back down
Date Wed, 04 Jan 2012 00:46:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179176#comment-13179176
] 

Jean-Daniel Cryans commented on HBASE-5119:
-------------------------------------------

I think I will retract from my statement that it "used to be extremely racy and caused more
troubles than it fixed", on my first test I got a stuck region in transition instead of being
able to recover. The timeout was set to 2 minutes to be sure I hit it.

First the region gets closed
{quote}
2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE
to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
{quote}

2 minutes later it times out:

{quote}
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in
transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE,
ts=1325636185810, server=null
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has
been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment
of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
{quote}

100ms later the master finally gets the event:

{quote}
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED,
server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more
than 15 seconds late
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler:
Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being
disabled so deleting ZK node and removing from regions in transition, skipping assignment
of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587
Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected
state RS_ZK_REGION_CLOSED
2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587
Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected
state RS_ZK_REGION_CLOSED
{quote}

At this point everything is fine, the region was processed as closed. But wait, remember that
line where it said it was going to force an unassign?

{quote}
2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587
Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null
returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
{quote}

Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore.
It even tries to shut it down but is blocked by NPEs. Now this is what's going on.

The late ZK notification that the znode was deleted (but it got recreated after):

{quote}
2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode
of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
{quote}

Then it prints this, and much later tries to unassign it again:

{quote}
2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting
on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
state=PENDING_CLOSE, ts=1325636310328, server=null
...
2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting
on  region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
state=PENDING_CLOSE, ts=1325636310328, server=null
2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in
transition timed out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE,
ts=1325636310328, server=null
2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has
been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment
of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted
to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is
not currently assigned anywhere
{quote}

And this is still ongoing.
                
> Set the TimeoutMonitor's timeout back down
> ------------------------------------------
>
>                 Key: HBASE-5119
>                 URL: https://issues.apache.org/jira/browse/HBASE-5119
>             Project: HBase
>          Issue Type: Task
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.92.0, 0.94.0
>
>
> The TimeoutMonitor used to be extremely racy and caused more troubles than it fixed,
but most of this has been fixed I believe in the context of 0.92 so I think we should set
it down back to a useful level. Currently it's 30 minutes, what should the new value be?
> I think 5 minutes should be good, will do some testing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message