hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2483) Large-scale reliability tests
Date Wed, 12 Nov 2008 16:55:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646950#action_12646950 ]

Steve Loughran commented on HADOOP-2483:
----------------------------------------

I'm assuming the goal here is to see how the system handles network, host and disk
failures. A good first step would be to ask: which problems happen most often, and which
are traumatic enough to set everyone's pagers off? Those are the ones to care about.

* Disk failures could be mocked with a filesystem wrapper that simulates problems: bad data,
missing data, even hanging reads and writes (a rough sketch of such a wrapper follows this list).
* Network failures are harder to simulate because there are so many kinds; DNS failures and
exceptions at every stage of an IO operation are all candidates. Perhaps we could have a special
mock IPC client that raises these exceptions during test runs.
* This is the kind of thing virtualized clusters are good for, though they have odd timing
quirks that leave you worrying about what is really going on.
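
To make the first bullet concrete, here is a rough sketch -- not existing Hadoop test code, the
class name and failure rate are made up -- of a FilterFileSystem wrapper that fails a configurable
fraction of opens, which a test job could be pointed at:

{code}
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FilterFileSystem;
import org.apache.hadoop.fs.Path;

public class FaultyFileSystem extends FilterFileSystem {
  private final Random random = new Random();
  private final float failureProbability;   // e.g. 0.01f to fail 1% of opens

  public FaultyFileSystem(FileSystem fs, float failureProbability) {
    super(fs);
    this.failureProbability = failureProbability;
  }

  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    // Refuse a fraction of reads to simulate a bad disk; a variant could
    // sleep here instead to simulate a hung read, or hand back corrupt bytes.
    if (random.nextFloat() < failureProbability) {
      throw new IOException("Simulated disk failure opening " + f);
    }
    return super.open(f, bufferSize);
  }
}
{code}

The same idea extends to the other failure modes: wrap the returned stream to flip bytes for
bad data, or block indefinitely for hanging IO.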

> Large-scale reliability tests
> -----------------------------
>
>                 Key: HADOOP-2483
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2483
>             Project: Hadoop Core
>          Issue Type: Test
>          Components: mapred
>            Reporter: Arun C Murthy
>            Assignee: Devaraj Das
>             Fix For: 0.20.0
>
>
> The fact that we do not have any large-scale reliability tests bothers me. I'll be the first
> to admit that it isn't the easiest of tasks, but I'd like to start a discussion around this...
> especially given that the code-base is growing to the point where the interactions caused by
> small changes are very hard to predict.
> One of the simple scripts I run for every patch I work on does something very simple: it runs
> sort500 (or larger), randomly picks n tasktrackers from ${HADOOP_CONF_DIR}/conf/slaves
> and kills them; a similar script kills and then restarts the tasktrackers.
> This helps check a fair number of reliability stories: lost tasktrackers, task failures,
> etc. Clearly this isn't enough to cover everything, but it is a start.
> Let's discuss - what do we do for HDFS? We need more for Map-Reduce!
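
As an illustration of the kind of chaos script Arun describes, a rough Java equivalent might
look like the following (a hypothetical class, assuming passwordless ssh to the slave nodes and
a slaves file under ${HADOOP_CONF_DIR}; the pkill pattern is likewise an assumption):

{code}
import java.io.File;
import java.nio.file.Files;
import java.util.Collections;
import java.util.List;

public class KillRandomTaskTrackers {
  public static void main(String[] args) throws Exception {
    int n = Integer.parseInt(args[0]);   // number of tasktrackers to kill
    File slavesFile = new File(System.getenv("HADOOP_CONF_DIR"), "slaves");
    List<String> hosts = Files.readAllLines(slavesFile.toPath());
    Collections.shuffle(hosts);

    for (String host : hosts.subList(0, Math.min(n, hosts.size()))) {
      // Kill the TaskTracker JVM on the chosen host over ssh; a companion
      // variant would ssh back in afterwards and restart the daemon.
      new ProcessBuilder("ssh", host, "pkill -f TaskTracker")
          .inheritIO().start().waitFor();
    }
  }
}
{code}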

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

