hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted
Date Wed, 07 Nov 2012 19:02:15 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4772:

    Attachment: MR-4772-trunk.txt

This patch changes the AM to restart a map task when 50% of the shuffling reducers report
fetch errors for it, instead of 50% of the running reducers.
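The heuristic can be sketched as follows. This is an illustrative Python model, not the actual Java code in the AM; the function name and the minimum-reports constant are assumptions based on the description in this patch and the comment further down.

```python
# Illustrative model of the AM's restart heuristic (the real code is
# Java inside the MR AM; names and constants here are assumptions).

FAILURE_FRACTION = 0.5  # restart once half of the relevant reducers complain
MIN_REPORTS = 3         # the AM requires several reports before acting

def should_restart_map(reducers_reporting_failure, shuffling_reducers,
                       total_reports):
    """The old behavior compared reducers_reporting_failure against the
    number of *running* reducers; the patch compares it against the
    reducers still *shuffling*."""
    if total_reports < MIN_REPORTS:
        return False
    return (reducers_reporting_failure
            >= FAILURE_FRACTION * max(shuffling_reducers, 1))

# With a single reducer, three reports are now enough to restart the map:
print(should_restart_map(1, 1, 3))  # → True
```

Under the old running-reducer denominator, one failing reducer out of many running ones would never cross the 50% bar, which is exactly the scenario described in the issue below.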

It also changes how often a reducer reports fetch failures.  If a ConnectionException happens,
the error is reported immediately rather than after a wait.  A ConnectionException indicates
that no one is listening on the remote port.  This is very different from a timeout,
where the port is overrun and no one is able to get through.
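The reducer-side decision can be sketched like this; an illustrative Python model (the actual fetcher is Java), using Python's analogous exception types. A refused connection means nothing is listening on the remote port, so it is reported at once; a timeout may only mean the port is overrun, so the usual batched reporting with backoff still applies.

```python
# Sketch of the reducer-side reporting decision (illustrative Python;
# the real fetcher is Java and uses java.net.ConnectException).
import socket

def report_immediately(exc):
    """Return True if this fetch error should be reported to the AM
    right away instead of going through the normal batching/backoff."""
    # Connection refused: host is up but no server on the port.
    return isinstance(exc, ConnectionRefusedError)

print(report_immediately(ConnectionRefusedError()))  # → True
print(report_immediately(socket.timeout()))          # → False
```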

It also adds a maximum delay between fetch retries.  In the original code, every fetch
failure increased the delay by 30%, and the reducer only reported every 10th failure.
This means the first report would arrive after about 6 min, the second after 90 min, and
the third after 20 hours.  This is really bad when there is only one reducer, because the
AM requires at least three reports before the map task is restarted.

The default maximum delay is set to 1 min, which changes those numbers to 6 min, 15 min,
and 25 min respectively.  25 min still seems very long to wait, but it is much better than
20 hours.

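The effect of the cap can be simulated. The sketch below is illustrative Python, not the Hadoop code; the initial 10-second penalty and the 1.3 growth factor are assumptions chosen to roughly reproduce the figures above, so the exact minute counts differ slightly from the text.

```python
# Hypothetical simulation of the reducer's fetch-retry backoff.  The
# constants (10 s initial penalty, 1.3 growth, report every 10th
# failure) are assumptions for illustration, not exact Hadoop defaults.

def report_times(initial_penalty=10.0, growth=1.3,
                 failures_per_report=10, num_reports=3, max_delay=None):
    """Return the elapsed seconds at which each fetch-failure report
    reaches the AM, given exponential backoff between retries."""
    elapsed, delay, failures, times = 0.0, initial_penalty, 0, []
    while len(times) < num_reports:
        elapsed += delay
        failures += 1
        delay *= growth
        if max_delay is not None:
            delay = min(delay, max_delay)  # the cap this patch adds
        if failures % failures_per_report == 0:
            times.append(elapsed)
    return times

# Uncapped: reports arrive roughly minutes, then hours, apart.
print([round(t / 60, 1) for t in report_times()])
# Capped at 1 min: the later reports arrive within ~half an hour.
print([round(t / 60, 1) for t in report_times(max_delay=60.0)])
```

With these assumed constants the uncapped third report lands after roughly a day, while the capped schedule keeps all three reports under half an hour, matching the shape of the numbers quoted above.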
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
> In one particular case we saw an NM go down at just the right time such that most of the
> reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could
> not fetch from the NM, but because the other reducers were still running, the AM would not
> relaunch the map task: fewer than 50% of the running reducers had reported fetch failures.
> Then, because of the exponential back-off for fetches on the reducers, it took 1 hour
> 45 min for the reduce tasks to hit another 10 fetch failures and report in again.  At that
> point the other reducers had finished and the job relaunched the map task.  If the reducers
> had still been running at 1:45, I have no idea how long it would have taken for each of the
> tasks to get to 30 fetch failures.
> We need to trigger the map restart based on the percentage of reducers shuffling, not the
> percentage of reducers running.  We also need a maximum limit on the back-off, so that we
> don't ever have a reducer waiting for days to try and fetch map output.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
