hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover
Date Wed, 23 Jan 2013 08:22:17 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560472#comment-13560472 ]

nkeywal commented on HBASE-5843:
--------------------------------

bq. What is the application bug (AB) mentioned in your design doc? Do you mean an HBase bug,
or a bug in HBase client application code?
Mainly an HBase bug, but it could also be a coprocessor issue. HBase can be configured to stop
the regionserver if a coprocessor throws an unexpected exception, but it's quite easy to write
buggy code, such as a coprocessor that takes resources without freeing them. In that case you
may need to stop the regionserver.
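
To illustrate the "takes resources without freeing them" case, here is a minimal plain-Java sketch. It does not use any HBase classes: the class name, the hook methods, and the Semaphore standing in for a shared regionserver resource are all hypothetical, chosen only to show why such a bug cannot be caught by exception handling and may end up requiring a restart.

```java
import java.util.concurrent.Semaphore;

/**
 * Plain-Java sketch, not HBase code. All names are hypothetical.
 * A Semaphore stands in for a limited resource shared inside one process.
 */
public class CoprocessorLeakSketch {

    /** Hypothetical hook that takes a resource and never gives it back. */
    static boolean leakyHook(Semaphore pool) {
        if (!pool.tryAcquire()) {
            return false; // pool exhausted: the call cannot be served
        }
        // ... do some work with the resource ...
        return true; // bug: no release(), and no exception is ever thrown
    }

    /** Same hypothetical hook, releasing the resource in a finally block. */
    static boolean safeHook(Semaphore pool) {
        if (!pool.tryAcquire()) {
            return false;
        }
        try {
            // ... do some work with the resource ...
            return true;
        } finally {
            pool.release(); // always runs, even if the work throws
        }
    }

    /** Counts how many of 'calls' invocations the leaky hook can serve. */
    static int servedByLeakyHook(int permits, int calls) {
        Semaphore pool = new Semaphore(permits);
        int served = 0;
        for (int i = 0; i < calls; i++) {
            if (leakyHook(pool)) {
                served++;
            }
        }
        return served;
    }

    /** Counts how many of 'calls' invocations the safe hook can serve. */
    static int servedBySafeHook(int permits, int calls) {
        Semaphore pool = new Semaphore(permits);
        int served = 0;
        for (int i = 0; i < calls; i++) {
            if (safeHook(pool)) {
                served++;
            }
        }
        return served;
    }

    public static void main(String[] args) {
        System.out.println("leaky hook served: " + servedByLeakyHook(3, 10)); // 3
        System.out.println("safe hook served:  " + servedBySafeHook(3, 10));  // 10
    }
}
```

In this sketch the leaky hook never raises anything, so a stop-on-exception policy would not trigger. The pool simply runs dry, and within a single process the only way to get the permits back is to restart it.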


bq. If it is an HBase client application code bug, does fixing the issue require a stop/start
of the region server?
For a pure client (i.e. a user of the hbase.client package), it would be an HBase bug imho:
HBase/a regionserver should be resistant to any client behavior.
For a coprocessor, it's client code executed within the regionserver process. Thanks to Java,
many coprocessor bugs will have a limited effect, but as said above there are some cases
that cannot be handled simply.

bq. If it is an HBase code bug, do you mean an HBase bug that causes the region server to enter
some bad state, such as a deadlock? I think that could benefit from restarting the region server
to fix the problem.
Yes.
                
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>
> A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by an added delay to execute a query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.

