hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover
Date Fri, 20 Jul 2012 04:46:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418919#comment-13418919 ]

nkeywal commented on HBASE-5843:

bq. I'm confused as to what the 180s gap refers to. I see 980 (test 2) - 800 (test1) = 180,
but that is against 0.92, which doesn't have HBASE-5970, right? Could you clarify?
Yes, it's because with a clean stop, the RS unregisters itself in ZK, so the recovery starts
immediately. With a kill -9, the RS remains registered in ZK. So if you don't have HBASE-5844
or HBASE-5926, you wait for the ZK timeout.
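
As a rough illustration of the point above, here is a toy model of when recovery can start after
a region server stops. It is only a sketch: the names `RegionServerStop`, `znode_cleanup_fix` and
`recovery_start_delay_s` are hypothetical, and the 180 s value assumes the gap discussed in this
thread corresponds to the ZK session timeout.

```python
from dataclasses import dataclass


@dataclass
class RegionServerStop:
    """How a region server went away (toy model, not HBase code)."""
    clean: bool              # True: clean stop, the RS unregisters itself in ZK
    znode_cleanup_fix: bool  # hypothetical flag standing in for HBASE-5844 / HBASE-5926


def recovery_start_delay_s(stop: RegionServerStop, zk_timeout_s: float) -> float:
    """Seconds before recovery can start, under the assumptions stated above."""
    if stop.clean or stop.znode_cleanup_fix:
        # The RS is no longer registered in ZK, so recovery starts immediately.
        return 0.0
    # kill -9 without the fix: the RS stays registered until the ZK timeout.
    return float(zk_timeout_s)


if __name__ == "__main__":
    assumed_timeout_s = 180  # assumption: matches the 180s gap mentioned in the thread
    for label, stop in [
        ("clean stop", RegionServerStop(clean=True, znode_cleanup_fix=False)),
        ("kill -9, no fix", RegionServerStop(clean=False, znode_cleanup_fix=False)),
        ("kill -9, with fix", RegionServerStop(clean=False, znode_cleanup_fix=True)),
    ]:
        print(label, recovery_start_delay_s(stop, assumed_timeout_s))
```

The model ignores everything after detection (log splitting, region reassignment), so it only
accounts for the difference between the two stop modes, not for the total recovery time.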

bq. Awesome.. We think this is also due to HBASE-5970 and HBASE-6109? 
bq. Has a JIRA been filed?
Not yet. I'm writing specific unit tests for this. I found issues that I have not yet fully
analyzed, and I need to create the JIRAs. Also, maybe my test was not good for this part:
as I was running the test without a datanode, it could be that the recovery was not working
for this reason (I wonder whether the sync works with the local file system, for example).

bq. Test to be changed to get a real difference when we need to replay the wal.
bq. Could you clarify what you mean here?
The test does not last long enough, so I won't be able to see much difference even if there is
one. So I need to redo the work with a real datanode, check that it recovers, then check that
I measure something meaningful.
I will also redo the first tests with a DN to see if there is still a gap.

> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
> A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - a failure impacts client applications only by an added delay to execute a query, whatever
> the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.


