hbase-issues mailing list archives

From "Stephen Yuan Jiang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-12464) meta table region assignment stuck in the FAILED_OPEN state due to region server not fully ready to serve
Date Wed, 19 Nov 2014 21:43:34 GMT

     [ https://issues.apache.org/jira/browse/HBASE-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Yuan Jiang updated HBASE-12464:
---------------------------------------
    Description: 
A meta table region assignment can end up in the 'FAILED_OPEN' state, which leaves the region
unavailable until the target region server shuts down or an operator resolves it manually.  This
is an undesirable state for a meta table region.



Here is the sequence by which this can happen (the code is in AssignmentManager#assign()):

Step 1: The master detects that a region server (RS1) hosting a meta table region is down and
changes the meta region state from 'online' to 'offline'.

Step 2: In a loop (with a configurable maximumAttempts count; the default is 10 and the minimum
is 1), AssignmentManager tries to find an RS to host the meta table region.  If no RS is available,
it is meant to loop forever by resetting the loop count (BUG#1 in this logic, a small bug):

{code}
           if (region.isMetaRegion()) {
              try {
                Thread.sleep(this.sleepTimeBeforeRetryingMetaAssignment);
                if (i == maximumAttempts) i = 1; // ==> BUG: if maximumAttempts is 1, then the loop will end.
                continue;
              } catch (InterruptedException e) {
              ...
           }
{code}
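
To make BUG#1 concrete, here is a minimal stand-alone sketch, not the HBase code itself. It
assumes the surrounding loop has the shape "for (int i = 1; i <= maximumAttempts; i++)", which the
excerpt above does not show.  The class name, the method iterationsWhileNoServer, the resetValue
parameter and the maxIterations cap are all hypothetical and exist only for this illustration.

```java
// Hypothetical stand-alone model of the retry loop; not the actual HBase code.
final class MetaRetryLoopDemo {

  /**
   * Counts loop iterations while no region server is ever available.
   * resetValue is what the counter is set to on the last attempt;
   * maxIterations is a demo-only cap so the "loop forever" case terminates.
   */
  static int iterationsWhileNoServer(int maximumAttempts, int resetValue, int maxIterations) {
    int iterations = 0;
    for (int i = 1; i <= maximumAttempts; i++) {
      iterations++;
      if (iterations >= maxIterations) {
        break;
      }
      if (i == maximumAttempts) {
        i = resetValue; // the loop's i++ still runs before the next check
      }
    }
    return iterations;
  }

  public static void main(String[] args) {
    System.out.println(iterationsWhileNoServer(10, 1, 1000)); // 1000: keeps looping, as intended
    System.out.println(iterationsWhileNoServer(1, 1, 1000));  // 1: loop exits after one pass (the bug)
    System.out.println(iterationsWhileNoServer(1, 0, 1000));  // 1000: resetting to 0 keeps looping
  }
}
```

Under that assumption, resetting the counter to 0 instead of 1 would keep the loop alive even
when maximumAttempts is 1.  This is only one possible adjustment and not necessarily what the
attached patches do.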

Step 3: Once a new RS is found (RS2), inside the same loop as Step 2, AssignmentManager tries
to assign the meta region to RS2 (OFFLINE, RS1 => PENDING_OPEN, RS2).  If opening the region on
RS2 fails for some reason (e.g. the target RS2 is not ready to serve: ServerNotRunningYetException),
AssignmentManager changes the state from (PENDING_OPEN, RS2) to (FAILED_OPEN, RS2).  It then
retries (and may even pick a different RS), up to maximumAttempts times.  Once maximumAttempts
is reached, the meta region stays in the 'FAILED_OPEN' state until either (1) RS2 shuts down,
which triggers region assignment again, or (2) an operator reassigns it via the HBase Shell.

According to the documentation ( http://hbase.apache.org/book/regions.arch.html ), this is by
design: "17. For regions in FAILED_OPEN or FAILED_CLOSE states, the master tries to close them
again when they are reassigned by an operator via HBase Shell."

However, this is a bad design, especially for the meta table region (arguably the design is fine
for regular tables; in this ticket I am focusing on fixing the meta region availability issue).



I propose 2 possible fixes:

Fix#1 (band-aid change): in Step 3, just like Step 2, if the region is a meta table region,
reset the loop count so that the loop is never left with the meta table region in the FAILED_OPEN
state.
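
A rough sketch of what Fix#1 could look like, under stated assumptions: the State enum, the
assign method and the openSucceedsOnTry predicate are hypothetical stand-ins for the real
AssignmentManager logic, the state model is simplified, and the sleep between retries is omitted.
It is meant only to show the counter reset in the failure path, not the content of the attached
patches.

```java
import java.util.function.IntPredicate;

// Hypothetical simplified model of Step 3 with the Fix#1 idea applied.
final class MetaAssignFix1Sketch {
  enum State { OFFLINE, PENDING_OPEN, FAILED_OPEN, OPEN }

  /**
   * Tries to open a region; openSucceedsOnTry says whether the n-th overall
   * try (1-based) succeeds. A meta region never leaves the loop in FAILED_OPEN.
   */
  static State assign(boolean isMetaRegion, int maximumAttempts, IntPredicate openSucceedsOnTry) {
    State state = State.OFFLINE;
    int tries = 0;
    for (int i = 1; i <= maximumAttempts; i++) {
      tries++;
      state = State.PENDING_OPEN;
      if (openSucceedsOnTry.test(tries)) {
        return State.OPEN;
      }
      state = State.FAILED_OPEN;
      if (isMetaRegion && i == maximumAttempts) {
        i = 0; // the loop's i++ makes this 1, so this also works when maximumAttempts is 1
      }
    }
    return state; // only reachable for non-meta regions (or maximumAttempts < 1)
  }
}
```

With this shape, a regular region whose target server is not ready within maximumAttempts still
ends in FAILED_OPEN, while a meta region keeps retrying until an open succeeds.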

Fix#2 (more involved): if a region is in the FAILED_OPEN state, we should provide a way to
automatically trigger AssignmentManager#assign() after a short period of time (leaving any region
in FAILED_OPEN or similar states such as 'FAILED_CLOSE' is undesirable; there should be some way
to retry and auto-heal the region).

I think Fix#1 is good enough, at least for 1.0.0.  We can open a task-type JIRA for Fix#2
in a future release.
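
One way the Fix#2 idea might be sketched, purely as an illustration: a scheduler re-triggers the
assignment after a delay whenever an attempt leaves the region unopened.  The class
FailedOpenAutoRetrySketch, the tryAssign callback (standing in for a call into
AssignmentManager#assign()) and the retryDelayMillis parameter are assumptions made for this
example; nothing here is taken from HBase.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical auto-heal helper: retries an assignment after a fixed delay.
final class FailedOpenAutoRetrySketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Completes with the number of attempts it took for tryAssign to return true. */
  CompletableFuture<Integer> assignWithAutoRetry(BooleanSupplier tryAssign, long retryDelayMillis) {
    CompletableFuture<Integer> done = new CompletableFuture<>();
    attempt(tryAssign, retryDelayMillis, 1, done);
    return done;
  }

  private void attempt(BooleanSupplier tryAssign, long delayMillis, int attemptNo,
                       CompletableFuture<Integer> done) {
    try {
      if (tryAssign.getAsBoolean()) {
        done.complete(attemptNo);
      } else {
        // The region would be FAILED_OPEN here; schedule another attempt instead of giving up.
        scheduler.schedule(() -> attempt(tryAssign, delayMillis, attemptNo + 1, done),
            delayMillis, TimeUnit.MILLISECONDS);
      }
    } catch (RuntimeException e) {
      done.completeExceptionally(e); // includes rejection after shutdown
    }
  }

  void shutdown() {
    scheduler.shutdownNow();
  }
}
```

The first attempt runs on the caller's thread and later ones on the scheduler thread.  A real
implementation would also need to decide when to stop retrying and how to handle regions in
other failed states.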


> meta table region assignment stuck in the FAILED_OPEN state due to region server not fully ready to serve
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-12464
>                 URL: https://issues.apache.org/jira/browse/HBASE-12464
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.0.0, 2.0.0, 0.99.1
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>             Fix For: 1.0.0, 2.0.0, 0.99.2
>
>         Attachments: HBASE-12464.v1-1.0.patch, HBASE-12464.v1-2.0.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
