hbase-issues mailing list archives

From "stack (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5200) AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent
Date Mon, 13 Feb 2012 23:33:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207340#comment-13207340 ]

stack commented on HBASE-5200:
------------------------------

The attached unit test stands up an AssignmentManager and then manufactures the condition that
Ram describes.  The test gets stuck and times out after five seconds because the znode is not
cleared on master failover (as per Ram's description).

Ram, your patch no longer seems to apply to TRUNK.

Why do you make a hash with a preset size of 1?

{code}
+  private Set<String> regionsProcessed = new HashSet<String>(1);
{code}
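
For context on the question above: the constructor argument to HashSet is only an initial
capacity, not an upper bound, so the set still grows as entries are added.  A minimal sketch
using plain java.util (the class name and the region names are made up for illustration and
are not from the patch):

```java
import java.util.HashSet;
import java.util.Set;

class InitialCapacityDemo {
  // Adds n made-up encoded region names to a set created with initial capacity 1.
  static Set<String> fill(int n) {
    Set<String> regionsProcessed = new HashSet<String>(1);
    for (int i = 0; i < n; i++) {
      regionsProcessed.add("region-" + i);
    }
    return regionsProcessed;
  }

  public static void main(String[] args) {
    // The set resizes itself as needed; the capacity only affects early rehashing.
    System.out.println(fill(100).size());
  }
}
```

So a capacity of 1 is not wrong, but it only makes sense if the set is expected to stay tiny.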

Is this the right name for this hash?  Should it be regionsProcessedJoiningCluster or some
such?

The regionsProcessed hash is of Strings.  I see in handleRegionWhileFailOverInProgress that
we always get the regioninfo from meta.  Isn't it possible that in processRegionInTransition
we may have done this already, so that it may be non-null?  If so, shouldn't we keep it around
so that we go to .META. only for those cases where the regioninfo is indeed null?  Would that
mean changing regionsProcessed to be a Map of String to HRI?
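
A rough sketch of what the Map-based variant could look like.  This is not code from the patch:
RegionInfoCache, record() and the metaLookup function are hypothetical stand-ins, and the
generic type R takes the place of HRI so the example runs without HBase on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: keeps region info keyed by encoded region name so the
// meta lookup only runs when nothing was recorded earlier.
class RegionInfoCache<R> {
  private final Map<String, R> regionsProcessed = new ConcurrentHashMap<String, R>();
  private final Function<String, R> metaLookup; // stand-in for the read from .META.
  private int metaReads = 0;

  RegionInfoCache(Function<String, R> metaLookup) {
    this.metaLookup = metaLookup;
  }

  // Called where the region info is already in hand (e.g. during failover processing).
  void record(String encodedRegionName, R regionInfo) {
    regionsProcessed.put(encodedRegionName, regionInfo);
  }

  // Returns the recorded region info, going to meta only when none was recorded.
  synchronized R getRegionInfo(String encodedRegionName) {
    R cached = regionsProcessed.get(encodedRegionName);
    if (cached != null) {
      return cached;
    }
    metaReads++;
    R fromMeta = metaLookup.apply(encodedRegionName);
    if (fromMeta != null) {
      regionsProcessed.put(encodedRegionName, fromMeta);
    }
    return fromMeta;
  }

  // Number of times the meta stand-in was consulted.
  synchronized int metaReads() {
    return metaReads;
  }
}
```

With something like this, a membership check on the map's keys still answers "was this region
processed?", while the value saves the extra trip to meta.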

Isn't getHRegionInfo repeating code from earlier up in processRegionInTransition?

If so, change it so that there is only one place where we go to meta... have both places call
your new getRegionInfo method.

Why do this:

{code}
+      hri = p.getFirst();
+      return hri;
{code}

Why not just do return p.getFirst();?

Is everything shifted right because of this test?

{code}
+      if (regionState == null
+          && !regionsProcessed.contains(encodedRegionName)) {

{code}

If so, shouldn't we just take the opposite of the above and return immediately if regionState
is non-null or the region is already in regionsProcessed, as in:

{code}
if (regionState != null || regionsProcessed.contains(encodedRegionName)) return;
{code}

This would make your change less substantial.
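
As a sanity check on the early-return idea, the exact negation of the nested test
(regionState == null && !regionsProcessed.contains(...)) joins the two conditions with ||.
The small sketch below compares both forms over every combination; the class and method names
are made up for illustration, and a plain Object stands in for the region state.

```java
class GuardClauseDemo {
  // Nested form from the patch: the indented body runs only when this is true.
  static boolean nestedRunsBody(Object regionState, boolean alreadyProcessed) {
    return regionState == null && !alreadyProcessed;
  }

  // Early-return form: the method returns before the body when this is true.
  static boolean guardReturnsEarly(Object regionState, boolean alreadyProcessed) {
    return regionState != null || alreadyProcessed;
  }

  public static void main(String[] args) {
    Object[] states = { null, new Object() };
    boolean[] flags = { false, true };
    for (Object state : states) {
      for (boolean processed : flags) {
        // Exactly one of the two must hold for every combination.
        if (nestedRunsBody(state, processed) == guardReturnsEarly(state, processed)) {
          throw new AssertionError("forms disagree");
        }
      }
    }
    System.out.println("guard is the exact negation of the nested test");
  }
}
```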

It seems wrong that we are putting stuff into RIT in two places: in processRegionsInTransition
and in handleRegion if we happen to be fielding a callback before failover has had a chance
to run.

Would the fb trick of NOT processing callbacks during master failover help here?  At least
for the scope of the AM.joinCluster?
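
One possible shape for holding callbacks back until failover processing is done, as a sketch
only.  This is not taken from the fb branch or from the patch; DeferringHandler and its methods
are hypothetical, and a String stands in for whatever the callback carries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: queues callbacks that arrive while failover is in
// progress and replays them, in arrival order, once failover has completed.
class DeferringHandler {
  private final Queue<String> pending = new ArrayDeque<String>();
  private final List<String> handled = new ArrayList<String>();
  private boolean failoverInProgress = true;

  // Stand-in for a callback arriving for a region.
  synchronized void onCallback(String encodedRegionName) {
    if (failoverInProgress) {
      pending.add(encodedRegionName); // defer; RIT may not be populated yet
      return;
    }
    handled.add(encodedRegionName);
  }

  // Called once the joinCluster-style processing has finished populating RIT.
  synchronized void failoverCompleted() {
    failoverInProgress = false;
    String name;
    while ((name = pending.poll()) != null) {
      handled.add(name);
    }
  }

  // Snapshot of the callbacks handled so far, in order.
  synchronized List<String> handled() {
    return new ArrayList<String>(handled);
  }
}
```

The point of the sketch is that RIT would then be populated from one place only, with the
callbacks applied afterwards against a complete view.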

Is this a good name for this method: handleRegionWhileFailOverInProgress?  Should it be
checkFailover or some such?

The test I attached only checks the CLOSING state.  Should we extend it to cover the other
states (OPENING, etc.)?

I can help with this.

Also, how did you figure out this bug?  It must have taken a bunch of head banging to figure
out that this was indeed what was going on.  Good stuff, Ram.

> AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5200
>                 URL: https://issues.apache.org/jira/browse/HBASE-5200
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.5
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.94.0, 0.90.7, 0.92.1
>
>         Attachments: 5200-test.txt, 5200-v2.txt, HBASE-5200.patch, HBASE-5200_1.patch, TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml, hbase-5200_90_latest.patch
>
>
> This is the scenario
> Consider a case where the balancer is going on thus trying to close regions in a RS.
> Before we could close a master switch happens.  
> On Master switch the set of nodes that are in RIT is collected and we first get Data and start watching the node
> After that the node data is added into RIT.
> Now by this time (before adding to RIT) if the RS to which close was called does a transition in AM.handleRegion() we miss the handling saying RIT state was null.
> {code}
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region a66d281d231dfcaea97c270698b26b6f from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region c12e53bfd48ddc5eec507d66821c4d23 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 59ae13de8c1eb325a0dd51f4902d2052 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region f45bc9614d7575f35244849af85aa078 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region cc3ecd7054fe6cd4a1159ed92fd62641 from server HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 3af40478a17fee96b4a192b22c90d5a2 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region e6096a8466e730463e10d3d61f809b92 from server HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4806781a1a23066f7baed22b4d237e24 from server HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region d69e104131accaefe21dcc01fddc7629 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> {code}
> In branch the CLOSING node is created by RS thus leading to more inconsistency.
