hbase-issues mailing list archives

From "Nicolas Liochon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover
Date Tue, 15 Oct 2013 09:02:07 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795019#comment-13795019
] 

Nicolas Liochon commented on HBASE-5843:
----------------------------------------

Marking solved in 0.96.
- MTTR has decreased from 10 minutes to less than one minute, ~30 seconds in many cases.
- when a machine fails, the other machines in the cluster are still available.
- detection time is zero when there is a crash.
- log replay scales well, allowing a minimal replay time.
- thanks to the new distributed WAL replay, "puts" are not impacted by the recovery. Client
applications can continue to stream their writes when there is a machine failure.
- Some of the improvements were backported to 0.94 / HDFS 1.x; they are now used in production and
work as expected.

The remaining ideas mentioned in this umbrella ticket will be tracked independently. There
is still room for improvement.

> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.95.2
>            Reporter: Nicolas Liochon
>            Assignee: Nicolas Liochon
>
> A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - a failure impacts client applications only by an added delay to execute a query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
