hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7327) Assignment Timeouts: Remove the code from the master
Date Wed, 12 Dec 2012 20:49:21 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530313#comment-13530313 ]

nkeywal commented on HBASE-7327:

I have some doubts about TestMasterFailover.
On a master failover, the code looks at what is in ZK: if the regionserver is down, it forces
a reassign; if not, it puts the region in the RIT (regions in transition) list.
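
To make that branch concrete, here is a minimal sketch of the decision as I read it. It is not the actual master code: the class, enum, and method names are all hypothetical, and the ZK contents are reduced to a plain region-to-server map.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; these names do not exist in HBase.
final class FailoverSketch {
    enum Action { FORCE_REASSIGN, ADD_TO_RIT }

    // One region found in ZK: reassign if its server is down,
    // otherwise just track it as a region in transition.
    static Action decide(String hostingServer, Set<String> liveServers) {
        return liveServers.contains(hostingServer)
                ? Action.ADD_TO_RIT
                : Action.FORCE_REASSIGN;
    }

    // Apply the decision to every region -> server entry read from ZK.
    static Map<String, Action> processZkState(Map<String, String> regionToServer,
                                              Set<String> liveServers) {
        Map<String, Action> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : regionToServer.entrySet()) {
            result.put(e.getKey(), decide(e.getValue(), liveServers));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> zk = new LinkedHashMap<>();
        zk.put("region-a", "rs1"); // rs1 is still alive
        zk.put("region-b", "rs2"); // rs2 is down
        System.out.println(processZkState(zk, Collections.singleton("rs1")));
    }
}
```

Note that in the ADD_TO_RIT branch nothing drives the region forward except whatever later acts on the RIT list, which is where the timeout comes in.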

Many tests in TestMasterFailover put a given state in ZK but keep the regionserver up. In that
setup, it is actually the timeout that manages the region status. The tests are fast only because
the timeout is set to a few seconds. We should instead have a test with a real failover, covering
the standard cases, and it should be fast without setting the timeout to 2 seconds or so.

- This test shows one specific use of the timeout: acting as a garbage collector when we put
ourselves in an unexpected situation.
- It doesn't prove that we actually recover quickly after a master failover, because the very
short timeout hides the problem.

As an example, it seems that if the master fails just after creating an offline znode (before
contacting the region server), we need the timeout to recover the region (i.e. 10 minutes).
If confirmed (I will recheck tomorrow), this would be a bug (and not a simple one to fix),
but we don't see it because of the short timeout.

So I'm thinking about:
- refactoring the tests to express the cases that can occur during a master failover (including
a region server crash, though maybe that one exists already)
- keeping the timeout, but only as a safety net, without doing anything if the region is assigned
to a live region server. We may need extra cases here; I need to study the code more.
- maybe adding extra code for when we identify a region that has been opening for too long on a
live server: calling the server to check the region's status, releasing it, or something similar.
To be discussed :-)
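
The last two points could be sketched roughly as follows. This is only one possible reading of the proposal, not a patch: every name here is hypothetical, and the `probeLiveServers` flag is an assumption added to separate "do nothing on a live server" from the optional "call the server to check" idea.

```java
// Hypothetical sketch only; these names do not exist in HBase.
final class TimeoutSafetyNetSketch {
    enum Decision { NONE, PROBE_SERVER, REASSIGN }

    // Called when the timeout monitor looks at one region in transition.
    static Decision onCheck(long elapsedMs, long timeoutMs,
                            boolean serverAlive, boolean probeLiveServers) {
        if (elapsedMs < timeoutMs) {
            return Decision.NONE; // timeout not reached: leave the region alone
        }
        if (serverAlive) {
            // Live server: the timeout itself takes no action. Optionally
            // ask the server for the region's status instead of reassigning.
            return probeLiveServers ? Decision.PROBE_SERVER : Decision.NONE;
        }
        // Timed out and nobody alive owns the region: safety-net reassign.
        return Decision.REASSIGN;
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        System.out.println(onCheck(tenMinutes + 1, tenMinutes, true, false));
        System.out.println(onCheck(tenMinutes + 1, tenMinutes, true, true));
        System.out.println(onCheck(tenMinutes + 1, tenMinutes, false, false));
    }
}
```

With this shape, the tests would no longer depend on a very short timeout to make progress when the regionserver is up, since the timeout never reassigns in that case.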

> Assignment Timeouts: Remove the code from the master
> ----------------------------------------------------
>                 Key: HBASE-7327
>                 URL: https://issues.apache.org/jira/browse/HBASE-7327
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 7327.v1.uncomplete.patch
> As per HBASE-7247...

