Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Mon, 13 Feb 2012 23:33:00 +0000 (UTC)
From: "stack (Commented) (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: 
 <2084777657.34307.1329175980251.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <680121196.40756.1326535719505.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HBASE-5200) AM.ProcessRegionInTransition() and
 AM.handleRegion() race thus leaving the region assignment inconsistent
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207340#comment-13207340 ] 

stack commented on HBASE-5200:
------------------------------

The attached unit test stands up an AssignmentManager and then manufactures the condition that Ram describes.  The test gets stuck and times out after five seconds because the znode is not cleared on master failover (as per Ram's description).

Ram, your patch no longer seems to apply to TRUNK.

Why do you make a hash with a preset size of 1?

{code}
+  private Set<String> regionsProcessed = new HashSet<String>(1);
{code}
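
For context on the question above: the int argument to java.util.HashSet is an initial capacity, not a size limit, so a set created this way still grows as entries are added. A minimal sketch (the class name CapacityDemo and the region names are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

class CapacityDemo {
  // Builds a set with an initial capacity of 1 and adds n distinct names to it.
  static Set<String> fill(int n) {
    Set<String> regionsProcessed = new HashSet<String>(1);
    for (int i = 0; i < n; i++) {
      regionsProcessed.add("region-" + i); // the set resizes itself as needed
    }
    return regionsProcessed;
  }
}
```

So the preset size is harmless for correctness; it only affects how soon the first resize happens.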

Is this the right name for this hash?  Should it be regionsProcessedJoiningCluster or some such?

The regionsProcessed hash holds Strings.  I see in handleRegionWhileFailOverInProgress that we always get the regioninfo from meta.  Isn't it possible that in processRegionInTransition we may have done this already?  That it may be non-null?  If so, shouldn't we keep it around so we don't have to go to .META. every time, but only for those cases where the regioninfo is indeed null?  Would that mean changing regionsProcessed to be a Map of String to HRI?
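
A rough sketch of what such a map could look like. Everything here is hypothetical: RegionInfoCache is not a class in the patch, the type parameter V stands in for HRegionInfo, and metaLookup stands in for whatever call actually reads .META.:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: V stands in for HRegionInfo, metaLookup for the .META. read.
class RegionInfoCache<V> {
  private final Map<String, V> regionsProcessed = new HashMap<String, V>();
  private final Function<String, V> metaLookup;
  int metaReads = 0; // counts how often we had to fall back to the lookup

  RegionInfoCache(Function<String, V> metaLookup) {
    this.metaLookup = metaLookup;
  }

  // Record a region as processed, keeping its info if the caller already has it (may be null).
  void markProcessed(String encodedRegionName, V infoOrNull) {
    regionsProcessed.put(encodedRegionName, infoOrNull);
  }

  boolean isProcessed(String encodedRegionName) {
    return regionsProcessed.containsKey(encodedRegionName);
  }

  // Return the cached info; go to the lookup only when nothing is cached.
  V getRegionInfo(String encodedRegionName) {
    V info = regionsProcessed.get(encodedRegionName);
    if (info == null) {
      metaReads++;
      info = metaLookup.apply(encodedRegionName);
      // Remember the result only for regions already marked as processed.
      if (info != null && regionsProcessed.containsKey(encodedRegionName)) {
        regionsProcessed.put(encodedRegionName, info);
      }
    }
    return info;
  }
}
```

With something like this, both call sites would go through the single getRegionInfo method, and a region whose info was already known would never trigger a second read.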

Isn't getHRegionInfo repeating code from earlier up in processRegionInTransition?

If so, change it so that there is only one place where we go to meta... have both places call your new getRegionInfo method.

Why do this:

{code}
+      hri = p.getFirst();
+      return hri;
{code}

Why not just do return p.getFirst();?

Is everything shifted right because of this test?

{code}
+      if (regionState == null
+          && !regionsProcessed.contains(encodedRegionName)) {

{code}

If so, shouldn't we just take the opposite of the above and return immediately if regionState is non-null or the region is already in regionsProcessed, as in:

{code}
if (regionState != null || regionsProcessed.contains(encodedRegionName)) return;
{code}
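
To spell out the guard-clause idea: by De Morgan's law the negation of "regionState == null && !regionsProcessed.contains(name)" is "regionState != null || regionsProcessed.contains(name)". The sketch below (GuardDemo and its method names are invented for illustration, and the boolean return value merely stands in for "the body ran") shows that the nested form and the early-return form take the same branch for every input:

```java
import java.util.Set;

class GuardDemo {
  // Nested form, as in the patch: the work happens only inside the if.
  static boolean nested(Object regionState, Set<String> processed, String name) {
    boolean handled = false;
    if (regionState == null && !processed.contains(name)) {
      handled = true; // stands in for the indented body
    }
    return handled;
  }

  // Guard form: bail out early on the negated condition, body stays unindented.
  static boolean guarded(Object regionState, Set<String> processed, String name) {
    if (regionState != null || processed.contains(name)) return false;
    return true; // stands in for the same body, one level shallower
  }
}
```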

This would make your change less substantial.

It seems wrong that we are putting stuff into RIT in two places: in processRegionsInTransition, and in handleRegion if we happen to be fielding a callback before failover has had a chance to run.

Would the fb trick of NOT processing callbacks during master failover help here?  At least for the scope of the AM.joinCluster?
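
One way to picture "not processing callbacks during failover" is to buffer them and replay them once the failover work is done. This is only a sketch of that general idea under my own assumptions; DeferredCallbacks, onEvent and failoverDone are invented names, not AssignmentManager or ZooKeeper APIs, and events are plain Strings for brevity:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; none of these names exist in AssignmentManager.
class DeferredCallbacks {
  private final Queue<String> pending = new ArrayDeque<String>();
  private final List<String> handled = new ArrayList<String>();
  private boolean failoverInProgress = true;

  // Called for every callback; buffers the event while failover processing is running.
  synchronized void onEvent(String event) {
    if (failoverInProgress) {
      pending.add(event);
    } else {
      handled.add(event);
    }
  }

  // Called once failover processing is done; replays buffered events in arrival order.
  synchronized void failoverDone() {
    failoverInProgress = false;
    String e;
    while ((e = pending.poll()) != null) {
      handled.add(e);
    }
  }

  // Snapshot of the events handled so far, in order.
  synchronized List<String> handled() {
    return new ArrayList<String>(handled);
  }
}
```

Under this scheme RIT would be populated from one place only, since callbacks could not be handled until the failover pass had finished.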

Is handleRegionWhileFailOverInProgress a good name for this method?  Should it be checkFailover or some such?

The test I attached only checks the CLOSING state.  Should we extend it to cover the other states (OPENING, etc.)?

I can help with this.

Also, how did you figure out this bug?  It must have taken a bunch of head banging to figure out that this was indeed what was going on.  Good stuff Ram.


> AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5200
>                 URL: https://issues.apache.org/jira/browse/HBASE-5200
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.5
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.94.0, 0.90.7, 0.92.1
>
>         Attachments: 5200-test.txt, 5200-v2.txt, HBASE-5200.patch, HBASE-5200_1.patch, TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml, hbase-5200_90_latest.patch
>
>
> This is the scenario:
> Consider a case where the balancer is running and is trying to close regions on a RS.
> Before the close completes, a master switch happens.
> On master switch, the set of nodes that are in RIT is collected; we first get the node data and start watching the node.
> After that the node data is added into RIT.
> If, in that window (before the node is added to RIT), the RS that was asked to close the region makes a transition, AM.handleRegion() skips handling it, saying the RIT state was null.
> {code}
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region a66d281d231dfcaea97c270698b26b6f from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region c12e53bfd48ddc5eec507d66821c4d23 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 59ae13de8c1eb325a0dd51f4902d2052 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region f45bc9614d7575f35244849af85aa078 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region cc3ecd7054fe6cd4a1159ed92fd62641 from server HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 3af40478a17fee96b4a192b22c90d5a2 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region e6096a8466e730463e10d3d61f809b92 from server HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4806781a1a23066f7baed22b4d237e24 from server HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region d69e104131accaefe21dcc01fddc7629 from server HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
> {code}
> In branch the CLOSING node is created by RS thus leading to more inconsistency.
