hbase-issues mailing list archives

From "chendihao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9802) A new failover test framework for HBase
Date Mon, 21 Oct 2013 08:50:43 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800475#comment-13800475
] 

chendihao commented on HBASE-9802:
----------------------------------

We don't use the integration tests (IT) much, and we think they are less aggressive. As [~eclark] said, the class
IntegrationTestBigLinkedListWithChaosMonkey may already verify data, but we treat this framework as an external
tool. We implemented a DataValidateTool that randomly reads, puts, and deletes data (simulating a real client),
then reads each value back from HBase and compares it with the expected value, which is kept in memory and
treated as reliable. This gives us an easy way to validate data whenever we want (before, during, or after a
failover test) and to check both availability and data correctness.
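
To make the validation idea concrete, here is a minimal sketch. It is not the actual DataValidateTool, whose code is not part of this thread. The KeyValueStore interface, the InMemoryStore stand-in, and every method name below are hypothetical; a real tool would wrap an HBase client behind that interface. The sketch applies random puts and deletes to both the store and an in-memory expected map, then reads every row back and counts mismatches.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Random;

public class DataValidateSketch {

    /** Hypothetical minimal store interface; a real tool would wrap an HBase client here. */
    public interface KeyValueStore {
        void put(String key, String value);
        void delete(String key);
        String get(String key); // returns null when the key is absent
    }

    /** In-memory stand-in for the cluster so the sketch runs without HBase. */
    public static class InMemoryStore implements KeyValueStore {
        private final Map<String, String> data = new HashMap<>();
        public void put(String key, String value) { data.put(key, value); }
        public void delete(String key) { data.remove(key); }
        public String get(String key) { return data.get(key); }
    }

    private final KeyValueStore store;
    private final Map<String, String> expected = new HashMap<>(); // trusted copy
    private final Random random;
    private final int keySpace;

    public DataValidateSketch(KeyValueStore store, long seed, int keySpace) {
        this.store = store;
        this.random = new Random(seed);
        this.keySpace = keySpace;
    }

    /** Applies one random put or delete to both the store and the expected map. */
    public void mutateOnce() {
        String key = "row-" + random.nextInt(keySpace);
        if (random.nextBoolean()) {
            String value = "v" + random.nextLong();
            store.put(key, value);
            expected.put(key, value);
        } else {
            store.delete(key);
            expected.remove(key);
        }
    }

    /** Reads every key in the key space and counts values that differ from the expected map. */
    public int countMismatches() {
        int mismatches = 0;
        for (int i = 0; i < keySpace; i++) {
            String key = "row-" + i;
            if (!Objects.equals(store.get(key), expected.get(key))) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        DataValidateSketch validator = new DataValidateSketch(store, 42L, 100);
        for (int i = 0; i < 1000; i++) {
            validator.mutateOnce();
        }
        System.out.println("mismatches before fault: " + validator.countMismatches());

        // Simulate corruption that bypasses the validator, as a failover bug might.
        store.put("row-0", "corrupted");
        System.out.println("mismatches after fault: " + validator.countMismatches());
    }
}
```

Because the expected map is updated in the same step as the store, countMismatches() can be called at any point (before, during, or after fault injection) without stopping the workload generator.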

> A new failover test framework for HBase
> ---------------------------------------
>
>                 Key: HBASE-9802
>                 URL: https://issues.apache.org/jira/browse/HBASE-9802
>             Project: HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.94.3
>            Reporter: chendihao
>            Priority: Minor
>
> Currently HBase uses ChaosMonkey for integration tests and fault injection. It restarts regionservers,
forces the balancer to run, and performs other actions randomly and periodically. However, we need a more
extensible and full-featured framework for our failover tests, and we find that ChaosMonkey can't
meet our needs because it has the following drawbacks.
> 1) Only process-level actions can be simulated; machine-level, hardware-level, and network-level
actions are not supported.
> 2) There is no data validation before and after the test, so fatal bugs, such as those that cause
data inconsistency, may be overlooked.
> 3) When a failure occurs, we can't reproduce the problem, and it is hard to figure out the cause.
> Therefore, we have developed a new framework to meet the needs of failover testing. We
extended ChaosMonkey and implemented functions to validate data and to replay failed actions.
Here are the features we added.
> 1) Policy/Task/Action abstraction. Separating Task from Policy and Action makes it easier
to manage and replay a set of actions.
> 2) Configurable actions. We have implemented some actions that cause machine failure
and gave them the same interface as the original actions.
> 3) Data consistency is validated before and after the failover test to ensure
availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of the table.
> 5) The set of actions that caused a test failure can be replayed, and this reproducibility
helps in fixing the exposed bugs.
> Our team has developed this framework and has been running it for a while. Some bugs were exposed and
fixed by running this test framework. Moreover, we have a monitor program that shows the
progress of the failover test and makes sure our cluster is as stable as we want. Now we are trying
to make it more general and will open-source it later.
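
The Policy/Task/Action split and the replay feature described in the quoted issue can be illustrated with a small sketch. The framework itself has not been published in this thread, so every class, method, and action name below is hypothetical. The assumption is that a Policy chooses actions, a Task records the chosen sequence, and replaying means running the same recorded Task again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class ReplaySketch {

    /** A single fault-injection step; here it only records its name in a log. */
    public interface Action {
        String name();
        void perform(List<String> log);
    }

    /** A Task is a fixed, ordered list of actions, so running it twice repeats the same steps. */
    public static class Task {
        private final List<Action> actions;

        public Task(List<Action> actions) {
            this.actions = new ArrayList<>(actions);
        }

        public List<String> run() {
            List<String> log = new ArrayList<>();
            for (Action action : actions) {
                action.perform(log);
            }
            return log;
        }
    }

    /** A Policy decides which actions go into a Task; this one picks at random from a seed. */
    public static class RandomPolicy {
        private final List<Action> candidates;
        private final Random random;

        public RandomPolicy(List<Action> candidates, long seed) {
            this.candidates = candidates;
            this.random = new Random(seed);
        }

        public Task nextTask(int length) {
            List<Action> picked = new ArrayList<>();
            for (int i = 0; i < length; i++) {
                picked.add(candidates.get(random.nextInt(candidates.size())));
            }
            return new Task(picked);
        }
    }

    /** Builds a placeholder action; a real action would touch a process, machine, or network. */
    public static Action named(final String name) {
        return new Action() {
            public String name() { return name; }
            public void perform(List<String> log) { log.add(name); }
        };
    }

    public static void main(String[] args) {
        List<Action> candidates = Arrays.asList(
                named("restart-regionserver"),
                named("force-balancer"),
                named("simulate-machine-failure")); // hypothetical machine-level action

        Task task = new RandomPolicy(candidates, 2013L).nextTask(5);
        List<String> firstRun = task.run();
        List<String> replay = task.run(); // replay the exact same sequence

        System.out.println("first run: " + firstRun);
        System.out.println("replay matches: " + firstRun.equals(replay));
    }
}
```

Keeping the Task separate from the Policy is what makes replay cheap in this sketch: the random choice happens once, and the recorded sequence can be rerun after a failed validation without involving the Policy again.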



--
This message was sent by Atlassian JIRA
(v6.1#6144)
