hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] Reopened: (HBASE-921) region close and open processed out of order; makes for disagreement between master and regionserver on region state
Date Tue, 25 Nov 2008 20:57:44 GMT

     [ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell reopened HBASE-921:
----------------------------------

      Assignee:     (was: Jim Kellerman)

I believe I saw an instance of this issue again today. 

What about Jim Firby's original suggestion? Clients should be able to say "I've asked for
a region's location 10 times now and Mr. Master, you've given me the same response ten times
in a row and each time, the answer was wrong. Revisit any notion that said region is at said
location". Mr. Master would then go off and do something drastic like close and reassign the
region.
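
To make the client-side idea concrete, here is a minimal sketch of the bookkeeping a client
would need. Everything in it is hypothetical: StaleLocationTracker, recordWrongAnswer and
recordSuccess are names made up for illustration, not part of the HBase client, and the
threshold of 10 is just the number from the suggestion above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client-side helper; not part of the HBase client API. */
class StaleLocationTracker {
    /** Last wrong location seen for a region and how often it repeated. */
    private static final class Entry {
        String location;
        int failures;
    }

    private final int threshold;
    private final Map<String, Entry> byRegion = new HashMap<>();

    StaleLocationTracker(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Record that the master returned `location` for `region` and that the
     * region was not actually being served there. Returns true once the same
     * wrong answer has been seen `threshold` times in a row, which is the
     * point where the client would ask the master to revisit the assignment.
     */
    boolean recordWrongAnswer(String region, String location) {
        Entry e = byRegion.computeIfAbsent(region, r -> new Entry());
        if (!location.equals(e.location)) {
            // A different answer than last time: start counting again.
            e.location = location;
            e.failures = 0;
        }
        e.failures++;
        return e.failures >= threshold;
    }

    /** Forget the region once a lookup leads to a server that really hosts it. */
    void recordSuccess(String region) {
        byRegion.remove(region);
    }
}
```

A client would construct this with a threshold of 10 and, when recordWrongAnswer returns true,
send the master whatever "please revisit this region" request we end up defining. That request
does not exist today.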

Or the master can sanity-check on its own. If my latest patch to HBASE-1018 goes in, the master
can look at the HServerLoad a regionserver reports and notice when not all of the regions it
expects are found there.
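
The master-side check amounts to a set difference between what the master believes is assigned
to a server and what that server reports. The sketch below uses plain sets of region names and
deliberately does not touch HServerLoad, since how the reported regions would be read from it
depends on the HBASE-1018 patch. AssignmentSanityCheck and missingRegions are hypothetical
names.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; the real check would read region names from the server's load report. */
class AssignmentSanityCheck {
    /**
     * Regions the master expects on a server that the server did not report.
     * A non-empty result means master and regionserver disagree on region state.
     */
    static Set<String> missingRegions(Set<String> expected, Set<String> reported) {
        Set<String> missing = new TreeSet<>(expected); // copy, so the inputs are left untouched
        missing.removeAll(reported);
        return missing;
    }
}
```

What the master does with a non-empty result (for example, close and reassign) is a separate
decision and is not sketched here.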

> region close and open processed out of order; makes for disagreement between master and
regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0
>
>         Attachments: 921-0.18.0.patch
>
>
> Master assigns region X successfully.  It then decides to close it because it wants it
opened elsewhere as part of region rebalancing.  Both the open and close operations are reported
back to the master.  Both have operation processing components that are added to the todo
list to be processed in another thread outside of the master's main loop.
> The close operation does the bulk of its work inline with the master main processing
loop.  Its todo component does some work if the region is offlined, but otherwise nothing of
consequence, whereas the open's todo component does the important meta catalog table update
with the new location information.
> It has been fairly common here on our cluster for the master todo queue to be occupied processing
the shutdown of a regionserver.  It takes a long time to process the shutdown of a regionserver
when thousands of regions are involved.  This delays the processing of the open and close todos.
In effect the open runs after the close.  The region goes into limbo.  Only a restart
of the 'hosting' regionserver 'fixes' this state.
> This is a particular case of the general HBASE-543 issue.  It's happening a lot here on
our cluster so will hack up a fix for this and get it into TRUNK and backport it to 0.18.1.
> Jim Firby here had a good idea for conditions like this.  Clients should be able to say
"I've asked for a region's location 10 times now and Mr. Master, you've given me the same response
ten times in a row and each time, the answer was wrong.  Revisit any notion that said region
is at said location".  Mr. Master would then go off and do something drastic like close and
reassign the region.
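
The ordering problem in the quoted description can be reproduced with a toy model. This is only
an illustration under assumptions: TodoOrderingModel and its methods are invented names, the
map stands in for the meta catalog table, and having the close clear the meta entry inline is
a simplification of "the bulk of its work". It is not the master's actual code and not the
attached patch.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the race; none of these names exist in HBase. */
class TodoOrderingModel {
    /** Stand-in for the meta catalog table: region name -> hosting server. */
    final Map<String, String> meta = new HashMap<>();
    /** Deferred work that a separate thread processes in the real master. */
    final Queue<Runnable> todo = new ArrayDeque<>();

    /** Open report: the meta update is deferred to the todo queue. */
    void reportOpen(String region, String server) {
        todo.add(() -> meta.put(region, server));
    }

    /** Close report: the bulk of the work happens inline, right away (simplified). */
    void reportClose(String region) {
        meta.remove(region);
        todo.add(() -> { /* nothing of consequence unless the region is offlined */ });
    }

    /** Drain the queue, as the todo thread would once it is no longer busy. */
    void drainTodo() {
        Runnable r;
        while ((r = todo.poll()) != null) {
            r.run();
        }
    }
}
```

Calling reportOpen("X", "rs1"), then reportClose("X"), then drainTodo() leaves the map saying
X is on rs1 even though the close was reported last: the deferred open ran after the inline
close.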

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

