hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7407) TestMasterFailover under tests some cases and over tests some others
Date Mon, 07 Jan 2013 19:30:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546185#comment-13546185 ]

nkeywal commented on HBASE-7407:

bq. We have a DoNotRetryIOException. Does this mean all other exceptions are retriable? 
Unfortunately no. For example, ServerNotRunningYetException is retriable but RegionServerStoppedException
is not, yet they both extend IOException.

bq. Do we need a PleaseRetryException?
If we want to distinguish between these cases:
- retriable
- not retriable
- don't know

then yes :-). And 'don't know' can mean either 'not coded' or 'I really don't know'.
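
A minimal sketch of the three-way split above, assuming a hypothetical PleaseRetryException marker next to DoNotRetryIOException. The two exception classes here are local stand-ins so the example compiles without HBase on the classpath, and Retriability, RetryClassifier and RetrySketch are illustrative names only, not existing HBase code.

```java
import java.io.IOException;

// Local stand-in for HBase's DoNotRetryIOException (assumption: the real class
// is not on the classpath for this sketch).
class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String msg) { super(msg); }
}

// Hypothetical marker for errors the caller is explicitly asked to retry.
class PleaseRetryException extends IOException {
    PleaseRetryException(String msg) { super(msg); }
}

// The three cases listed above.
enum Retriability { RETRIABLE, NOT_RETRIABLE, UNKNOWN }

final class RetryClassifier {
    private RetryClassifier() {}

    // Classifies by marker type only. Any other IOException is UNKNOWN,
    // which covers both 'not coded' and 'I really don't know'.
    static Retriability classify(IOException e) {
        if (e instanceof DoNotRetryIOException) {
            return Retriability.NOT_RETRIABLE;
        }
        if (e instanceof PleaseRetryException) {
            return Retriability.RETRIABLE;
        }
        return Retriability.UNKNOWN;
    }
}

class RetrySketch {
    public static void main(String[] args) {
        System.out.println(RetryClassifier.classify(new PleaseRetryException("server not running yet")));
        System.out.println(RetryClassifier.classify(new DoNotRetryIOException("region server stopped")));
        System.out.println(RetryClassifier.classify(new IOException("unclassified")));
    }
}
```

With only DoNotRetryIOException available, the caller cannot tell the first and third cases apart; the second marker is what makes 'don't know' explicit.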

bq. Why do we need another lock? The caller of processRegionsInTransition should already have
the lock, right?
Yes you're right. I will fix this.

bq. One enhancement to the original logic we can do, is that we can time out those region
transitions earlier so that timeout monitor can reassign them earlier, if needed.
I'm not a big fan: it adds an extra case. Resending seems much better. What issue
are you seeing?

> TestMasterFailover under tests some cases and over tests some others
> --------------------------------------------------------------------
>                 Key: HBASE-7407
>                 URL: https://issues.apache.org/jira/browse/HBASE-7407
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment, test
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 7407.v1.patch, 7407.v2.patch, 7407.v3.patch
> The tests are done with these settings:
>     conf.setInt("hbase.master.assignment.timeoutmonitor.period", 2000);
>     conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 4000);
> As a result:
> 1) some tests seem to work, but in real life, the recovery would take 5 minutes or more,
> as in production the timeouts are always higher. So we don't see the real issues.
> 2) The tests include specific cases that should not happen in production. It works because
> the timeout catches everything, but these scenarios do not need to be optimized, as they cannot

