jackrabbit-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Update of "Resilience" by MichaelDürig
Date Wed, 21 May 2014 09:43:09 GMT

The "Resilience" page has been changed by MichaelDürig:
https://wiki.apache.org/jackrabbit/Resilience

Comment:
Initial draft of resilience goals for Oak

New page:
== Resilience goals for Oak ==

This page is an effort to clarify the concept of resilience and related terms, and to 
define goals for Oak in that respect. 

=== Resilience ===

''Resilience'' refers to the ability to ''withstand'', ''contain'' and ''recover'' from 
''failures''. 

A ''single failure'' refers to a single component failing at any given time while 
''multiple failures'' means that more than one component may fail at the same time. 

To ''withstand'' a failure means to stay operational and sufficiently responsive 
while the failure is occurring.

To ''contain'' a failure means that its adverse effects do not spread beyond its 
initial scope, i.e. there is no collateral damage. 

To ''recover'' from a failure means undoing (automatically or by manual 
intervention) the ''impact'' caused by the failure and returning to 
normal operation.

The ''impact'' of a failure roughly falls into one of six levels where each level 
is worse than its predecessor:

 * (0) no impact at all,
 * (1) temporary degradation with automatic recovery,
 * (2) temporary degradation that needs manual intervention for recovery,
 * (3) temporary outage with automatic recovery,
 * (4) temporary outage that needs manual intervention for recovery,
 * (5) complete outage that needs rebuilding from scratch.
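
The ordering of these levels can be made explicit in code. The following is only a minimal sketch: the class name `Impact`, its member names and the helper `needs_manual_intervention` are hypothetical labels chosen for illustration and are not part of Oak.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Impact levels 0 to 5 from the list above; a higher value is worse."""
    NONE = 0                 # no impact at all
    DEGRADATION_AUTO = 1     # temporary degradation, automatic recovery
    DEGRADATION_MANUAL = 2   # temporary degradation, manual intervention
    OUTAGE_AUTO = 3          # temporary outage, automatic recovery
    OUTAGE_MANUAL = 4        # temporary outage, manual intervention
    COMPLETE_OUTAGE = 5      # complete outage, rebuild from scratch


def needs_manual_intervention(impact: Impact) -> bool:
    # Levels 2 and 4 are described as needing manual intervention.
    # Counting level 5 (rebuilding from scratch) as manual is an assumption.
    return impact in (
        Impact.DEGRADATION_MANUAL,
        Impact.OUTAGE_MANUAL,
        Impact.COMPLETE_OUTAGE,
    )
```

Because the levels are integers, a statement such as "impact <= 3" translates directly into a comparison like `impact <= Impact.OUTAGE_AUTO`.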

=== Goals for Oak ===

Oak should be resilient against single failures such that complete outages 
(level 5) do not occur. Oak is not resilient against multiple failures, though, 
so sufficient redundancy needs to be built into the system to cope with them. 

==== Failures and their impact ====

 * Temporary outage of database connection
  * For no more than a few seconds: impact <= 1. Automatic recovery once database connection is back.
  * For more than a few seconds: impact <= 3. Automatic recovery once database connection is back.

 * Resource drainage
  * Out of disk space: impact <= 4. Providing more disk space should be sufficient for recovery.
  * Out of memory: impact <= 4. Providing more memory should be sufficient for recovery.
  * Network / disk bandwidth saturated: impact <= 2. Automatic recovery once sufficient bandwidth is provided.

 * Small scale data corruption (e.g. bit flip on disk, network, memory) 
  * On primary data (e.g. document, segment, data store, ...): impact <= 2. Repairing the corrupted data should be sufficient for recovery.
  * On secondary (derived) data (e.g. index, ...): impact <= 1. Secondary data should automatically be recreated once corruption has been detected.

 * Large scale data corruption (e.g. corrupt data unit like file, document, index, ...)
  * On primary data: impact <= 4. Repairing the corrupted data should be sufficient for recovery.
  * On secondary data: impact <= 3. Secondary data should automatically be recreated once corruption has been detected.

 * Hardware failure (e.g. breakdown of disk, CPU, memory, ...): impact <= 4. Repairing the hardware, and restoring from backup in the case of a disk loss, should be sufficient for recovery.
  
 * Software failure (database, Oak, JVM, OS, ... process crash): impact <= 4. Restarting the crashed process should be sufficient for recovery.
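
The bounds above can be captured as a small lookup table, for example to compare the impact observed in a failure-injection test against the stated goal. This is a rough sketch only: the dictionary keys and the `violations` helper are hypothetical names, and only the numeric bounds are taken from the list.

```python
from typing import Dict, List

# Upper bounds on impact per single failure, copied from the list above.
# The keys are hypothetical labels, not identifiers used by Oak.
MAX_IMPACT: Dict[str, int] = {
    "db_connection_outage_short": 1,
    "db_connection_outage_long": 3,
    "out_of_disk_space": 4,
    "out_of_memory": 4,
    "bandwidth_saturated": 2,
    "small_corruption_primary": 2,
    "small_corruption_secondary": 1,
    "large_corruption_primary": 4,
    "large_corruption_secondary": 3,
    "hardware_failure": 4,
    "software_failure": 4,
}

COMPLETE_OUTAGE = 5  # level 5: must not result from a single failure


def violations(observed: Dict[str, int]) -> List[str]:
    """Return the failure labels whose observed impact exceeds its bound."""
    return sorted(
        name for name, level in observed.items() if level > MAX_IMPACT[name]
    )
```

For instance, `violations({"out_of_memory": 4, "bandwidth_saturated": 3})` reports only `bandwidth_saturated`, since an out-of-memory impact of 4 is still within its bound. Every bound in the table is below `COMPLETE_OUTAGE`, which matches the goal that single failures never lead to a level 5 outage.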
  
