hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly
Date Wed, 31 Oct 2007 13:50:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539081 ]

Amar Kamat commented on HADOOP-1984:

Ant test passed on my system. This error is due to the hbase test {{TestRegionServerExit}}.

The change here is to the backoff function used for retrying when a map output fetch fails.
Currently we use {{60 + random(0,300)}} sec as the backoff interval. With exponential
backoff, the penalty for the first few retries is small, but for later retries it becomes
very large. The proposed function is

backoff(n) = init_value * base^(n-1)

where
n = number of retries
base = 2
init_value = 2 sec

so the initial backoff is 2 sec.
Any suggestions on the formulation of the backoff algorithm and the initial values?
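
To make the numbers concrete, here is a minimal Java sketch of the formula above. It is not taken from the attached patch: the class name {{BackoffSketch}} and method name {{backoffSeconds}} are hypothetical, and only base = 2 and init_value = 2 sec come from this comment.

```java
public class BackoffSketch {
    // Values quoted in the comment: init_value = 2 sec, base = 2.
    static final long INIT_VALUE_SEC = 2;
    static final long BASE = 2;

    /**
     * Hypothetical helper: backoff(n) = init_value * base^(n-1),
     * where n >= 1 is the number of the retry.
     */
    public static long backoffSeconds(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("retry number must be >= 1");
        }
        long backoff = INIT_VALUE_SEC;
        for (int i = 1; i < n; i++) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            backoff = Math.multiplyExact(backoff, BASE);
        }
        return backoff;
    }

    public static void main(String[] args) {
        // Print the backoff interval for the first few retries.
        for (int n = 1; n <= 9; n++) {
            System.out.println("retry " + n + ": " + backoffSeconds(n) + " sec");
        }
    }
}
```

With these values the first five retries wait 2, 4, 8, 16 and 32 sec, all below the 60 sec minimum of the current scheme, while the ninth retry waits 512 sec, which exceeds the current maximum of 360 sec. A cap on the interval is not part of the formula as stated and would be a separate design decision.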

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Critical
>             Fix For: 0.16.0
>         Attachments: HADOOP-1984.patch
> In many cases, some reducers got stuck at the copy phase, progressing extremely slowly.
> The entire cluster seems to be doing nothing. This causes a very long tail for otherwise
> well-tuned map/red jobs.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
