reef-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1674) Random Failures in Broadcast and Reduce Fault Tolerance tests
Date Sat, 19 Nov 2016 01:27:58 GMT

    [ https://issues.apache.org/jira/browse/REEF-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15678249#comment-15678249 ]

Shravan Matthur Narayanamurthy edited comment on REEF-1674 at 11/19/16 1:27 AM:
--------------------------------------------------------------------------------

I will be adding a map function whose constructor launches a background thread that calls
Environment.Exit() after a random timeout. The test will accept three parameters in addition
to the generic ones:
# Failure probability,
# A minimum timeout in seconds &
# The expected throughput in MBps.

The background thread is launched in only a fraction of the map tasks, controlled by the failure
probability. With every retry attempt a different task can be chosen to fail.
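
As a rough illustration of the selection step described above (a Python sketch, not the
REEF.NET code; the function name and the use of a seeded generator are assumptions made
for the example), each map task could decide independently whether to arm the failure thread:

```python
import random


def should_inject_failure(failure_probability, rng):
    """Decide whether this task arms the failure thread.

    Hypothetical helper: each task draws independently, so roughly
    `failure_probability` of the tasks are chosen, and a retry can
    pick a different set of tasks.
    """
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be in [0, 1]")
    # rng.random() is uniform on [0.0, 1.0), so 0.0 never fires and 1.0 always does.
    return rng.random() < failure_probability


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so each attempt draws afresh
    chosen = [task for task in range(10) if should_inject_failure(0.3, rng)]
    print("tasks that would fail in this attempt:", chosen)
```

Because the draw is repeated on every attempt, nothing ties the failure to a particular task.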

The minimum timeout ensures that failure does not happen before the specified timeout has
elapsed.

The expected throughput is a parameter that controls the maximum timeout. It is the throughput
we expect to observe per iteration of IMRU. A rough estimate is fine; the default is 1 MBps,
which is quite low and leads to generous max timeouts. Values of 5 to 10 are also good.
The random timeout is picked uniformly between the min timeout & the max timeout.
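
A small Python sketch of how such a timeout could be drawn. The comment does not say how
the max timeout is derived from the throughput; the formula below (data size divided by
expected throughput) and the data_size_mb parameter are assumptions for illustration only,
chosen because they match the remark that a low throughput gives a generous max timeout.

```python
import random


def max_timeout_seconds(data_size_mb, expected_throughput_mbps, min_timeout_s):
    """Assumed rule: time to move `data_size_mb` at the expected throughput.

    Not taken from the REEF code. The result is never below the minimum
    timeout, so the sampling range stays valid.
    """
    if expected_throughput_mbps <= 0:
        raise ValueError("expected_throughput_mbps must be positive")
    return max(min_timeout_s, data_size_mb / expected_throughput_mbps)


def pick_timeout(min_timeout_s, max_timeout_s, rng):
    """Pick the failure time uniformly between the min and max timeouts."""
    return rng.uniform(min_timeout_s, max_timeout_s)


if __name__ == "__main__":
    rng = random.Random()
    low = 5.0  # minimum timeout in seconds
    for throughput in (1.0, 5.0, 10.0):
        high = max_timeout_seconds(100.0, throughput, low)
        print(throughput, "MBps -> window", (low, high), "picked", pick_timeout(low, high, rng))
```

Under this assumption, raising the expected throughput from 1 to 10 MBps shrinks the window
in which the failure can fire, so failures land earlier on average.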

This seems to me like a good model to simulate real failure.


was (Author: shravanmn):
I will be adding a map function that launches a thread in the background in the constructor
that will call Environment.Exit() after a random timeout. This test will accept three additional
parameters apart from the generic ones: 
# Failure Probability, 
# A minimum timeout in seconds & 
# The expected throughput in MBps. 

The background thread is launched only in a fraction of the map tasks, controlled by failure
probability.

The minimum timeout ensures that failure does not happen before the specified timeout has
elapsed.

The expected throughput is a parameter that controls the maximum timeout. This is our expected
throughput we observe per iteration of IMRU. A rough estimate is fine and default is set to
1 MBps which is quite low and leads to generous max timeouts. Values of 5 to 10 are also good.
The random timeout is picked uniformly between min timeout & max timeout.

This seems to me like a good model to simulate real failure.

> Random Failures in Broadcast and Reduce Fault Tolerance tests
> -------------------------------------------------------------
>
>                 Key: REEF-1674
>                 URL: https://issues.apache.org/jira/browse/REEF-1674
>             Project: REEF
>          Issue Type: Improvement
>          Components: REEF.NET IO
>    Affects Versions: 0.16
>            Reporter: Shravan Matthur Narayanamurthy
>            Assignee: Shravan Matthur Narayanamurthy
>            Priority: Minor
>             Fix For: 0.16
>
>
> The current fault tolerance tests inject simulated failure in a controlled manner and
> hence are not the right failure model to test our fault tolerance work. It would be good to
> have failures injected randomly rather than only at specific points as is done in the current code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
