hbase-issues mailing list archives

From "Kevin Peterson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2486) Add simple "anti-entropy" for region assignment
Date Mon, 10 May 2010 23:47:31 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865982#action_12865982 ]

Kevin Peterson commented on HBASE-2486:
---------------------------------------

I want to document killing and restarting the master as a workaround. It resolves at least
some cases where the master and .META. disagree with a regionserver about which regions that
regionserver is hosting. I did the following, and it worked for me:

1. Tail the log on the master to confirm that the master is not doing anything.
2. kill -9 the master so that it exits without setting the shutdown node in ZK.
3. Restart the master.

This stopped the repeated NotServingRegionExceptions I had been seeing, without needing to
bring down the cluster. YMMV.

> Add simple "anti-entropy" for region assignment
> -----------------------------------------------
>
>                 Key: HBASE-2486
>                 URL: https://issues.apache.org/jira/browse/HBASE-2486
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.20.5
>            Reporter: Todd Lipcon
>            Assignee: Eugene Koontz
>             Fix For: 0.20.5
>
>
> We've seen a number of bugs where a region server thinks it should not be serving a region, but the master and META think it should be. I'd like to propose a very simple way of fixing this issue:
> 1) whenever a regionserver throws a NotServingRegionException, it also marks that region id in an RS-wide Set
> 2) when a regionserver sends a heartbeat, include a message for each of these regions, MSG_REPORT_NSRE or somesuch, and then clear the set
> 3) when the master receives MSG_REPORT_NSRE, it does the following checks:
> a) if the region is assigned elsewhere according to META, the NSRE was due to a stale client, ignore
> b) if the region is in transition, ignore
> c) otherwise, we have an inconsistency, and we should take some steps to resolve (e.g. mark the region unassigned, or exit the master if we are in "paranoid mode")
> Whatever we do, we need to make sure that this is loudly logged, and causes unit tests to fail, when it's detected. This should *not* happen, but when it does, it would be good to recover without addtable.rb, etc.
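
The three steps proposed in the quoted description could be sketched in Java roughly as
follows. This is only an illustration and not HBase code: the class, enum, and method names
(NsreAntiEntropySketch, RegionServerNsreTracker, MasterNsreChecker, onReportNsre, and so on)
are hypothetical, META is modelled as a plain in-memory map from region id to server name,
and the "steps to resolve" in case c) are reduced to a loud log line and a return value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposal; none of these types exist in HBase.
class NsreAntiEntropySketch {

    // Possible outcomes of the master-side check (step 3 a, b, c).
    enum Action { IGNORE_STALE_CLIENT, IGNORE_IN_TRANSITION, INCONSISTENT }

    // Regionserver side: steps 1 and 2.
    static class RegionServerNsreTracker {
        private final Set<String> nsreRegions = new HashSet<>();

        // Step 1: remember every region for which an NSRE was thrown.
        synchronized void recordNsre(String regionId) {
            nsreRegions.add(regionId);
        }

        // Step 2: hand the regions to the heartbeat, then clear the set.
        synchronized List<String> drainForHeartbeat() {
            List<String> reported = new ArrayList<>(nsreRegions);
            nsreRegions.clear();
            return reported;
        }
    }

    // Master side: step 3.
    static class MasterNsreChecker {
        private final Map<String, String> metaAssignments; // region id -> server (assumed model of META)
        private final Set<String> regionsInTransition;

        MasterNsreChecker(Map<String, String> metaAssignments, Set<String> regionsInTransition) {
            this.metaAssignments = metaAssignments;
            this.regionsInTransition = regionsInTransition;
        }

        Action onReportNsre(String reportingServer, String regionId) {
            String assignedTo = metaAssignments.get(regionId);
            // a) META says the region lives elsewhere: the client was stale.
            if (assignedTo != null && !assignedTo.equals(reportingServer)) {
                return Action.IGNORE_STALE_CLIENT;
            }
            // b) The region is currently being moved or opened.
            if (regionsInTransition.contains(regionId)) {
                return Action.IGNORE_IN_TRANSITION;
            }
            // c) Otherwise master/META and the regionserver disagree: log loudly.
            System.err.println("INCONSISTENCY: " + reportingServer
                    + " reported NSRE for " + regionId + ", META says " + assignedTo);
            return Action.INCONSISTENT;
        }
    }

    public static void main(String[] args) {
        RegionServerNsreTracker tracker = new RegionServerNsreTracker();
        tracker.recordNsre("region-1");
        tracker.recordNsre("region-2");

        Map<String, String> meta = new HashMap<>();
        meta.put("region-1", "rs-B"); // assigned elsewhere
        meta.put("region-2", "rs-A"); // META thinks rs-A serves it
        MasterNsreChecker master = new MasterNsreChecker(meta, new HashSet<String>());

        for (String regionId : tracker.drainForHeartbeat()) {
            System.out.println(regionId + " -> " + master.onReportNsre("rs-A", regionId));
        }
    }
}
```

In this sketch the caller decides what to do with INCONSISTENT (for example, marking the
region unassigned or exiting the master in "paranoid mode"), which keeps the check itself
easy to exercise from a unit test.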

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

