[ https://issues.apache.org/jira/browse/MAPREDUCE-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965936#action_12965936
]
Joydeep Sen Sarma commented on MAPREDUCE-2162:

here's the reasoning behind capping stddev at mean/3. we speculate if:
* rate < mean - stddev
implies
* 1/rate > 1/(mean - stddev)
implies
* 1/rate > 1/mean + (1/(mean - stddev) - 1/mean)
implies
# projectedTime > meanTime + Delta
where
* Delta = (1/(mean - stddev) - 1/mean)
if
* stddev <= mean/3 // for example
then
* Delta <= (1/(mean - mean/3) - 1/mean) ==>
* Delta <= 0.5/mean = 0.5 * MeanTime
so in the worst case our equation _1_ becomes:
# projectedTime > MeanTime + 0.5*MeanTime
two observations:
* by capping stddev - we have converted the rate check into a meaningful check on the running
time of a task - tasks that run longer than a certain time (relative to the mean) will be
guaranteed to be speculated.
* the MeanTime + 0.5*MeanTime slack over the mean is the same as the heuristic discussed in the
jira where two rules were discussed:
** don't speculate if runningTime <= MeanTime * 0.5
** don't speculate if remainingTime < MeanTime
* if we add these two together - since runningTime + remainingTime == projectedTime - this
becomes (roughly):
** speculate only if projectedTime > MeanTime + MeanTime*0.5
so the heuristics in the jira are structurally similar to capping the stddev at mean/3.
as explained earlier - the percentile stuff is actually (approximately) being done by speculativeCap
(no more than 10% of the tasks can be speculated and tasks are sorted (by latest finish time)
before speculating).
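
the capping argument above can be sketched roughly as follows. this is only an illustration, not the actual Hadoop code: the function names and the cap_fraction parameter are hypothetical, and it assumes projected time is proportional to 1/rate, as in the derivation.

```python
def speculation_threshold(mean_rate, stddev_rate, cap_fraction=1.0 / 3.0):
    """Progress rate below which a task would be speculated.

    Hypothetical helper: stddev is capped at mean_rate * cap_fraction
    before it is subtracted from the mean.
    """
    capped_stddev = min(stddev_rate, mean_rate * cap_fraction)
    return mean_rate - capped_stddev


def should_speculate(rate, mean_rate, stddev_rate, cap_fraction=1.0 / 3.0):
    """True if the task's rate falls below the capped threshold."""
    return rate < speculation_threshold(mean_rate, stddev_rate, cap_fraction)


def guaranteed_time_ratio(cap_fraction=1.0 / 3.0):
    """projectedTime / MeanTime beyond which speculation is guaranteed.

    Assumes projected time is proportional to 1/rate, so the worst-case
    threshold rate mean * (1 - cap_fraction) maps to
    MeanTime / (1 - cap_fraction).
    """
    return 1.0 / (1.0 - cap_fraction)


# stddev (5.0) is much larger than the mean (1.0): without the cap the
# threshold would be negative and nothing could ever be speculated.
mean_rate, stddev_rate = 1.0, 5.0
uncapped_threshold = mean_rate - stddev_rate
capped_threshold = speculation_threshold(mean_rate, stddev_rate)
```

with the default cap of mean/3, guaranteed_time_ratio() comes out at about 1.5, which is the MeanTime + 0.5*MeanTime bound above. a task with zero progress rate is always below the capped threshold.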
> speculative execution does not handle cases where stddev > mean well
> 
>
> Key: MAPREDUCE-2162
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2162
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Joydeep Sen Sarma
> Assignee: Joydeep Sen Sarma
>
> the new speculation code only speculates tasks whose progress rate deviates from the
mean progress rate of a job by more than some multiple (typically 1.0) of stddev. stddev can
be larger than mean. which means that if we ever get into a situation where this condition
holds true - then a task with even 0 progress rate will not be speculated.
> it's not clear that this condition is self-correcting. if a job has thousands of tasks
 - then one laggard task, in spite of not being speculated for a long time, may not be able
to fix the condition of stddev > mean.
> we have seen jobs where tasks have not been speculated for hours and this seems to be one
explanation for why this may have happened. here's an example job with stddev > mean:
> DataStatistics: count is 6, sum is 1.7141054797775723E8, sumSquares is 2.9381575958035014E16
mean is 2.8568424662959537E7 std() is 6.388093955645905E7
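
a minimal sketch of why this job would never speculate, recomputing the statistics from the count, sum and sumSquares quoted above. the variance formula (sumSquares/count - mean^2) and the helper name are assumptions made for illustration, not taken from the Hadoop source.

```python
import math


def mean_and_std(count, total, sum_squares):
    """Mean and standard deviation from running sums.

    Assumes the population form: variance = sumSquares/count - mean^2.
    """
    mean = total / count
    variance = sum_squares / count - mean * mean
    return mean, math.sqrt(max(variance, 0.0))


# Values from the DataStatistics line quoted in the issue description.
count = 6
total = 1.7141054797775723e8
sum_squares = 2.9381575958035014e16

mean, std = mean_and_std(count, total, sum_squares)

# Uncapped rule from the description: speculate if rate < mean - 1.0 * stddev.
threshold = mean - 1.0 * std

# Even a task making no progress at all is not below a negative threshold.
zero_rate_speculated = 0.0 < threshold
```

since std comes out larger than mean, the threshold is negative and zero_rate_speculated is False, which matches the behaviour described in the issue.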

