hbase-issues mailing list archives

From "Ming Ma (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5094) The META can hold an entry for a region with a different server name from the one actually in the AssignmentManager thus making the region inaccessible.
Date Wed, 04 Jan 2012 04:08:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179258#comment-13179258 ]

Ming Ma commented on HBASE-5094:
--------------------------------

It is a tricky bug. I tend to agree with Stack here. Perhaps we can enforce synchronization
for region assignment. Here is some additional background info.

1. How the bug was found: I ran a rolling restart of the RSs using regular shutdown (not kill -9).
After running for a couple of hours, one user region was missing. I then identified the event
sequence from the logs across machines.

2. A couple of weeks ago I put a quick fix into AssignmentManager and ServerShutdownHandler (not
submitted to open source). It reduces the chance of this error, but doesn't completely address
the synchronization issue.

3. How we can verify that the fix works: besides code review and unit tests, I think it is better
to run the rolling restart RS script for a long period of time, say a couple of days.
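
To illustrate the synchronization idea, here is a minimal Java sketch. It is a toy model, not HBase code: the class ToyAssignmentState and all of its methods are hypothetical names, it ignores the CLOSING and PENDING_CLOSE states, and it assumes a single lock guards both the regions-in-transition set and the online-region map. The point is only that if "add to the online list" and "remove from RIT" happen under the same lock that the shutdown check takes, the check cannot observe a region that has left RIT but has no server address yet.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only; this is NOT HBase's AssignmentManager. All names are hypothetical.
class ToyAssignmentState {
    // One lock guards both collections so they always change together.
    private final Object lock = new Object();
    private final Set<String> regionsInTransition = new HashSet<>();
    private final Map<String, String> onlineRegions = new HashMap<>();

    // A region starts opening somewhere: it is in transition and has no server yet.
    void startOpening(String region) {
        synchronized (lock) {
            onlineRegions.remove(region);
            regionsInTransition.add(region);
        }
    }

    // The region finished opening: record the server and leave RIT in one atomic step.
    void regionOpened(String region, String server) {
        synchronized (lock) {
            onlineRegions.put(region, server);
            regionsInTransition.remove(region);
        }
    }

    // Shutdown-handler style check: should a region last seen on deadServer be reassigned?
    boolean shouldReassign(String region, String deadServer) {
        synchronized (lock) {
            if (regionsInTransition.contains(region)) {
                return false; // someone else is already moving it
            }
            String current = onlineRegions.get(region);
            // Reassign only if nobody hosts it, or the dead server is still the recorded host.
            return current == null || current.equals(deadServer);
        }
    }
}
```

With this arrangement, a check made while the region is opening sees it in RIT and skips it, and a check made after regionOpened sees the new server name and also skips it. There is no intermediate state in which the check sees a null address for a region that has just opened.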
                
> The META can hold an entry for a region with a different server name from the one actually
> in the AssignmentManager thus making the region inaccessible.
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5094
>                 URL: https://issues.apache.org/jira/browse/HBASE-5094
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>         Attachments: HBASE-5094_1.patch
>
>
> {code}
> RegionState rit = this.services.getAssignmentManager().isRegionInTransition(e.getKey());
> ServerName addressFromAM = this.services.getAssignmentManager()
>     .getRegionServerOfRegion(e.getKey());
> if (rit != null && !rit.isClosing() && !rit.isPendingClose()) {
>   // Skip regions that were in transition unless CLOSING or
>   // PENDING_CLOSE
>   LOG.info("Skip assigning region " + rit.toString());
> } else if (addressFromAM != null
>     && !addressFromAM.equals(this.serverName)) {
>   LOG.debug("Skip assigning region "
>       + e.getKey().getRegionNameAsString()
>       + " because it has been opened in "
>       + addressFromAM.getServerName());
> }
> {code}
> In ServerShutdownHandler we try to get the address from the AM. This address is initially
> null because it has not yet been updated after the region was opened, i.e. the callback after
> node deletion has not yet run on the master side.
> But removal from RIT has already completed on the master side, so this triggers a new assignment.
> So there is a small window between the region actually being added to the online list
> and the ServerShutdownHandler check of the existing address in the AM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
