hbase-issues mailing list archives

From "Thomas Pan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster
Date Thu, 01 Dec 2011 04:01:40 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160635#comment-13160635 ]

Thomas Pan commented on HBASE-4925:
-----------------------------------

Here is the list of fault injection test cases that we've collected:
1. Kill -9 one region server, and kill -9 the region server that serves the .META. table.
2. While BES is writing data to an HBase table, kill -9 the region server that holds the .META. table.
3. Kill -9 the region server that serves the .META. table. Then, kill -9 the region server that serves the -ROOT- table. [Thomas: Is it the case in our environment?]
4. A large number of region servers get killed. After restoration, there is no data loss.
5. No job impact while shifting from the primary HBase master to the secondary HBase master.
6. Shift from the primary HBase master to the secondary HBase master after multiple region servers fail.
7. Shift from the primary HBase master to the secondary HBase master after new region servers are added.
8. Repeatedly stop and restart the primary HBase master. There should be no major impact as the secondary HBase master kicks in automatically.
9. Shift from the primary HBase master to the secondary HBase master while a table with 3600 regions is being created.
10. Disable network access for the node running the region server that serves the .META. table.
11. Disable network access for the node running the primary HBase master.
12. Disable network access for the node running the secondary HBase master.
13. Trigger a short-lived network interruption for the node running the region server that serves the .META. table.
14. Trigger a short-lived network interruption for the node running the primary HBase master.
15. Trigger a short-lived network interruption for the node running the secondary HBase master.
16. Under high CPU usage in the cluster, BES writes to a table heavily.
17. Under high CPU usage in the cluster, restart one RS.
18. Under high CPU usage in the cluster, offline data nodes.
19. Under high memory usage in the cluster, BES writes to a table heavily.
20. Under high memory usage in the cluster, restart one RS.
21. Under high memory usage in the cluster, offline data nodes.
22. With no load in the cluster, test failover of the primary NN to the secondary NN.
23. With jobs running in the cluster, test failover of the primary NN to the secondary NN.
24. Repeatedly stop and restart the primary NN to make sure that NN failover works fine.
25. Kill -9 the primary ZooKeeper. Failover to the secondary NN should happen in time, with no job failure.
26. Kill -9 the primary ZooKeeper and the primary NN. The cluster should quickly fail over to the secondary ZK and NN.
27. Restart the node that holds the primary NN.
28. Disable network access for the node running the primary NN.
29. Trigger a short-lived network interruption for the node running the primary NN.
30. Disable network access for the node running the primary ZK.
31. Trigger a short-lived network interruption for the node running the primary ZK.
32. Disable network access for the node running a ZK in follower state.
33. Trigger a short-lived network interruption for the node running a ZK in follower state.
34. Offline multiple data nodes at once. Keep them offline for a while.
35. Offline multiple data nodes at once. Keep them offline for a while. Put them back at once.
36. Offline multiple data nodes at once. Put them back at once, instantly.
37. Offline a data node. Keep it offline for a while.
38. Offline a data node. Keep it offline for a while. Put it back at once.
39. Offline a data node. Put it back at once, instantly.
40. A hard disk failure on the primary NN triggers NN failover.
41. The dfs.data.dir directory on a data node gets corrupted.
42. A corrupted dfs.name.dir on the primary NN gets detected and triggers NN failover.
43. A corrupted dfs.name.dir on the secondary NN gets detected.
44. A data node runs out of disk space.
45. Under heavy IO on data nodes, BES writes to a table heavily.
46. Under heavy IO on data nodes, offline multiple data nodes.
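
As a rough illustration of the kill -9 cases near the top of the list (1-3), a shell sketch for locating and killing the region server that serves .META. might look like the following. It assumes a 0.90-era catalog layout (where -ROOT- records the .META. location), passwordless ssh to cluster nodes, and a default pid file path; the parsing is based on typical `hbase shell` scan output, so treat it as a starting point rather than a tested procedure.

```shell
#!/usr/bin/env bash
# Sketch: find and kill -9 the region server serving .META. (cases 1-3).
# Assumptions: hbase on PATH, -ROOT- holds the .META. location (0.90-era
# layout), passwordless ssh, and the default regionserver pid file path.

# Pure parser: extract the host name from an info:server cell in a -ROOT- scan.
meta_host_from_scan() {
  grep 'info:server' | sed -e 's/.*value=//' -e 's/:.*//' | head -n 1
}

kill_meta_rs() {
  local host
  host="$(echo "scan '-ROOT-'" | hbase shell 2>/dev/null | meta_host_from_scan)"
  echo "killing region server on ${host}"
  ssh "${host}" 'kill -9 "$(cat /tmp/hbase-*-regionserver.pid)"'  # destructive!
}

# Guarded so that sourcing this file runs nothing destructive.
if [ "${RUN_FAULT_INJECTION:-0}" = "1" ]; then
  kill_meta_rs
fi
```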
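
For the network fault cases (10-15, 28-33), one common approach (an assumption on our part, not something stated in this issue) is to isolate a node with iptables DROP rules, either indefinitely or for a short window. In this sketch the command builders are kept pure so the remote commands can be inspected before use; the 60-second default blip length and the example host name are made up.

```shell
#!/usr/bin/env bash
# Sketch: network-isolation faults (cases 10-15, 28-33) via iptables.
# Assumptions: root access over passwordless ssh on the target node.

partition_cmd() { echo 'iptables -I INPUT -j DROP; iptables -I OUTPUT -j DROP'; }
heal_cmd()      { echo 'iptables -D INPUT -j DROP; iptables -D OUTPUT -j DROP'; }

# "Disable network access": cut the node off until heal() is called.
partition() { ssh "$1" "$(partition_cmd)"; }
heal()      { ssh "$1" "$(heal_cmd)"; }

# "Short-lived network interruption": partition, wait, heal.
blip() {
  local node="$1" secs="${2:-60}"
  partition "${node}"
  sleep "${secs}"
  heal "${node}"
}

# Example (hypothetical host): blip hmaster1.example.com 60
```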
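
The data-node offline/online cases could be driven with the stock hadoop-daemon.sh scripts, as in the sketch below. The HADOOP_HOME default, host names, and wait time are assumptions; how long "a while" should be for a given cluster depends on its block replication settings.

```shell
#!/usr/bin/env bash
# Sketch: take data nodes offline and bring them back.
# Assumptions: stock hadoop-daemon.sh scripts, passwordless ssh,
# HADOOP_HOME set (or the default below).

dn_cmd() {  # build the remote command for "start" or "stop"
  echo "${HADOOP_HOME:-/usr/lib/hadoop}/bin/hadoop-daemon.sh $1 datanode"
}

offline_dns() { for h in "$@"; do ssh "$h" "$(dn_cmd stop)"; done; }
online_dns()  { for h in "$@"; do ssh "$h" "$(dn_cmd start)"; done; }

# Example (hypothetical hosts): offline several data nodes at once,
# keep them offline for a while, then put them all back at once.
# offline_dns dn1 dn2 dn3
# sleep 600
# online_dns dn1 dn2 dn3
```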
                
> Collect test cases for hadoop/hbase cluster
> -------------------------------------------
>
>                 Key: HBASE-4925
>                 URL: https://issues.apache.org/jira/browse/HBASE-4925
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: test
>            Reporter: Thomas Pan
>
> This entry is used to collect useful test cases for verifying a hadoop/hbase cluster.
> This follows up on yesterday's hack day at Salesforce. Hopefully the information will
> be useful for the whole community.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
