hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1246) Ignored IOExceptions from MapOutputLocation.java:getFile lead to hung reduces
Date Wed, 11 Apr 2007 13:05:32 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Arun C Murthy updated HADOOP-1246:
----------------------------------

    Description: 
Ignoring IOExceptions while fetching map outputs in MapOutputLocation.java:getFile (e.g. the
content-length doesn't match the actual data received) leads to hung reduces, since the MapOutputCopier
puts the host in the penalty box and retries forever.

Possible steps:
a) Distinguish between failure to fetch output vs. lost maps (related to HADOOP-1158).
b) Ensure the reduce doesn't keep fetching from 'lost maps' (related to HADOOP-1183).
c) On detecting a 'failure to fetch', we should probably use exponential back-offs (versus
the fixed-interval back-offs used currently) for hosts in the 'penalty box'.
d) If fetches still fail, say, 4 times (after exponential back-offs), we should declare the
reduce as 'failed'.

This situation could also arise from a full disk on the reducer, which makes it impossible
to save the map output to local disk (say, for large map outputs).

Thoughts?
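To make steps (c) and (d) concrete, here is a minimal sketch of exponential back-off with a retry cap; the class and method names are purely illustrative, not actual MapOutputCopier code:

```java
// Hypothetical sketch of exponential back-off for fetch failures, with a cap
// after which the reduce would be declared failed. Names are illustrative and
// are not real Hadoop APIs.
public class FetchBackoff {
    private static final int MAX_ATTEMPTS = 4;      // step (d): give up after 4 failures
    private static final long BASE_DELAY_MS = 1000; // initial penalty-box delay

    private int failedAttempts = 0;

    /** Record a failed fetch; returns false once the reduce should be declared failed. */
    public boolean recordFailure() {
        failedAttempts++;
        return failedAttempts < MAX_ATTEMPTS;
    }

    /** Step (c): the penalty-box delay doubles with each consecutive failure. */
    public long nextDelayMs() {
        return BASE_DELAY_MS << (failedAttempts - 1);
    }

    public static void main(String[] args) {
        FetchBackoff b = new FetchBackoff();
        while (b.recordFailure()) {
            System.out.println("retry after " + b.nextDelayMs() + " ms");
        }
        System.out.println("declaring reduce failed");
    }
}
```

With these numbers the host would be retried after 1000, 2000, and 4000 ms before the reduce is declared failed on the fourth attempt; the base delay and cap would of course need tuning.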

  was:
Ignoring exceptions while fetching map outputs in MapOutputLocation.java:getFile (e.g. the
content-length doesn't match the actual data received) leads to hung reduces, since the MapOutputCopier
just ignores them, puts the host in the penalty box and retries forever.

Possible steps:
a) Distinguish between failure to fetch output vs. lost maps (related to HADOOP-1158).
b) Ensure the reduce doesn't keep fetching from 'lost maps' (related to HADOOP-1183).
c) On detecting a 'failure to fetch', we should probably use exponential back-offs (versus
the fixed-interval back-offs used currently) for hosts in the 'penalty box'.
d) If fetches still fail, say, 4 times (after exponential back-offs), we should declare the
reduce as 'failed'.

This situation could also arise from a full disk on the reducer, which makes it impossible
to save the map output to local disk (say, for large map outputs).

Thoughts?

        Summary: Ignored IOExceptions from MapOutputLocation.java:getFile lead to hung reduces
 (was: Ignored exceptions from MapOutputLocation.java:getFile lead to hung reduces)

> Ignored IOExceptions from MapOutputLocation.java:getFile lead to hung reduces
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-1246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1246
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Arun C Murthy
>
> Ignoring IOExceptions while fetching map outputs in MapOutputLocation.java:getFile
(e.g. the content-length doesn't match the actual data received) leads to hung reduces, since the
MapOutputCopier puts the host in the penalty box and retries forever.
> Possible steps:
> a) Distinguish between failure to fetch output vs. lost maps (related to HADOOP-1158).
> b) Ensure the reduce doesn't keep fetching from 'lost maps' (related to HADOOP-1183).
> c) On detecting a 'failure to fetch', we should probably use exponential back-offs (versus
the fixed-interval back-offs used currently) for hosts in the 'penalty box'.
> d) If fetches still fail, say, 4 times (after exponential back-offs), we should declare
the reduce as 'failed'.
> This situation could also arise from a full disk on the reducer, which makes it impossible
to save the map output to local disk (say, for large map outputs).
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

