hadoop-common-dev mailing list archives

From "Sharad Agarwal (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2483) Large-scale reliability tests
Date Thu, 13 Nov 2008 13:05:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647279#action_12647279 ]

Sharad Agarwal commented on HADOOP-2483:
----------------------------------------

Perhaps we can install the error injection code with each daemon (DataNode/TaskTracker). This
code gets triggered by a random function driven by a cluster error injection ratio
(number of nodes to inject errors on / total nodes in the cluster). The default package would
inject system-level errors; if required, each daemon can extend it to inject its own, more
granular errors. This way error generation is decentralized and can be controlled via config
params, avoiding the need to fetch the slaves list for a cluster and inject errors from a
single client. The question is whether we need that kind of extensibility, or whether a few
error types suffice. Thoughts?
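To make the idea concrete, here is a minimal sketch of such an injector; the class name and
the config key "test.error.injection.ratio" are hypothetical illustrations, not existing
Hadoop APIs:

import java.util.Random;
import org.apache.hadoop.conf.Configuration;

// Minimal sketch only: the class and the config key
// "test.error.injection.ratio" are hypothetical, not Hadoop APIs.
public class ErrorInjector {
  private final Random random = new Random();
  private final float ratio; // nodes-to-inject-errors / total nodes in cluster

  public ErrorInjector(Configuration conf) {
    // Each daemon reads the cluster-wide injection ratio from its own
    // config, so no central client has to walk the slaves list.
    this.ratio = conf.getFloat("test.error.injection.ratio", 0.0f);
  }

  // Called periodically by the hosting daemon; with N nodes, roughly
  // ratio * N of them fire independently, giving decentralized error
  // generation with no coordinator.
  public void maybeInject() {
    if (random.nextFloat() < ratio) {
      inject();
    }
  }

  // Default package behavior: a system-level error (here, halting the
  // daemon process). DataNode/TaskTracker subclasses could override
  // this to inject their own, more granular errors.
  protected void inject() {
    Runtime.getRuntime().halt(1);
  }
}

A daemon would construct the injector from its own Configuration and call maybeInject() from
a periodic thread; since every node evaluates the same ratio independently, the expected
number of injecting nodes is ratio times the cluster size.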


> Large-scale reliability tests
> -----------------------------
>
>                 Key: HADOOP-2483
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2483
>             Project: Hadoop Core
>          Issue Type: Test
>          Components: mapred
>            Reporter: Arun C Murthy
>            Assignee: Devaraj Das
>             Fix For: 0.20.0
>
>
> The fact that we do not have any large-scale reliability tests bothers me. I'll be the first
> to admit that it isn't the easiest of tasks, but I'd like to start a discussion around this...
> especially given that the code-base is growing to an extent that interactions due to small
> changes are very hard to predict.
> One of the simple scripts I run for every patch I work on does something very simple: it
> runs sort500 (or greater), randomly picks n tasktrackers from ${HADOOP_CONF_DIR}/conf/slaves,
> and then kills them; a similar script kills and restarts the tasktrackers (a sketch of such
> a script follows below).
> This helps in checking a fair number of reliability stories: lost tasktrackers, task failures,
> etc. Clearly this isn't good enough to cover everything, but it's a start.
> Let's discuss - What do we do for HDFS? We need more for Map-Reduce!
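For reference, a rough sketch of the kill-n-random-tasktrackers script described above. The
original is a shell script; this standalone Java rendering, the passwordless-ssh assumption,
and the hadoop-daemon.sh invocation are illustrative assumptions, not taken from the issue:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

// Sketch of the kill-n-random-tasktrackers script described in the issue.
// Assumes passwordless ssh to each slave and that TaskTrackers are stopped
// via hadoop-daemon.sh; the class name is illustrative.
public class KillRandomTaskTrackers {
  public static void main(String[] args) throws IOException, InterruptedException {
    String slavesFile = args[0];        // e.g. ${HADOOP_CONF_DIR}/conf/slaves
    int n = Integer.parseInt(args[1]);  // how many tasktrackers to kill

    // Read the slaves list and pick n hosts at random.
    List<String> slaves = Files.readAllLines(Paths.get(slavesFile));
    Collections.shuffle(slaves);

    for (String host : slaves.subList(0, Math.min(n, slaves.size()))) {
      // Stop the TaskTracker on the chosen host over ssh.
      new ProcessBuilder("ssh", host, "hadoop-daemon.sh", "stop", "tasktracker")
          .inheritIO().start().waitFor();
    }
  }
}

Running it as "java KillRandomTaskTrackers ${HADOOP_CONF_DIR}/conf/slaves 5" while a sort job
is in flight exercises the lost-tasktracker and task-failure paths the issue mentions.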

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

