hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7407) TestMasterFailover under tests some cases and over tests some others
Date Mon, 07 Jan 2013 20:08:14 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546216#comment-13546216 ]

nkeywal commented on HBASE-7407:

bq. I see. I think by retriable, you mean it is retriable on the same region server, right?
Some exception may not be retriable on the same region server, but retriable on another region
server. Is that possible?

Yes. That's what we do for an open: we don't retry on the same region server.

bq. My concern is that we are going to have some duplicated logic, and it will make the code
hard to understand/maintain. For the scenarios targeted by the patch, I think timeout monitor
can handle it too. If so, then why not let TM do the work? If TM doesn't handle it fast enough,
we can time it out quickly.

If the master fails, we should reopen the regions in progress as soon as possible. Relying
on a timeout seems dangerous to me:
- in such scenarios, the system is likely overloaded, so reassigning regions aggressively
would add load on top of an already complex situation
- not being aggressive means waiting for a long time: time to detect that the master is dead
+ time for the workaround. It means we're over a minute in the best case.

The TM is an expensive feature; we should not need it in a standard workflow (and failures
are standard :-) ). Its function of 'last chance checker' is useful: it's a safety net,
but it should not be anything else.
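
To make the timing argument concrete, here is a rough sketch of how long a periodic checker
like the TM could take to notice a stuck region. It assumes a simple model that is not taken
from the HBase source: a region is flagged once it has been in transition longer than the
timeout, and the check only runs once per period, so the worst case is timeout + period. The
class and method names are hypothetical. The first pair of values is the test settings quoted
in the issue description; the second pair is purely illustrative and not a real default.

```java
class TimeoutMonitorLatency {

    // Assumed model: the region must first exceed the timeout, and may then
    // wait up to one full period for the next periodic check to run.
    static long worstCaseDetectionMs(long periodMs, long timeoutMs) {
        return timeoutMs + periodMs;
    }

    public static void main(String[] args) {
        // Values used by the tests (from the issue description).
        long testPeriodMs = 2000;   // hbase.master.assignment.timeoutmonitor.period
        long testTimeoutMs = 4000;  // hbase.master.assignment.timeoutmonitor.timeout
        System.out.println("test settings: up to "
                + worstCaseDetectionMs(testPeriodMs, testTimeoutMs) + " ms");

        // Illustrative larger values only (hypothetical, chosen to match the
        // "5 minutes or more" order of magnitude mentioned in the issue).
        long largePeriodMs = 10_000;
        long largeTimeoutMs = 300_000;
        System.out.println("larger settings: up to "
                + worstCaseDetectionMs(largePeriodMs, largeTimeoutMs) + " ms");
    }
}
```

Under this model the tests would see a stuck region handled within about 6 seconds, while
larger timeouts would hide the same problem for minutes, which is why the test results say
little about the real recovery time.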

> TestMasterFailover under tests some cases and over tests some others
> --------------------------------------------------------------------
>                 Key: HBASE-7407
>                 URL: https://issues.apache.org/jira/browse/HBASE-7407
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment, test
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 7407.v1.patch, 7407.v2.patch, 7407.v3.patch
> The tests are done with these settings:
>     conf.setInt("hbase.master.assignment.timeoutmonitor.period", 2000);
>     conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 4000);
> As a result:
> 1) some tests seem to work, but in real life the recovery would take 5 minutes or more,
as in production these values are always higher. So we don't see the real issues.
> 2) The tests include specific cases that should not happen in production. It works because
the timeout catches everything, but these scenarios do not need to be optimized, as they cannot

