hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5251) Reducer should not implicate map attempt if it has insufficient space to fetch map output
Date Tue, 18 Jun 2013 21:36:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687227#comment-13687227 ]

Jason Lowe commented on MAPREDUCE-5251:

Thanks for the patch, Ashwin.  Unfortunately I didn't get to it in time, and it's gone stale.  Could you please refresh it?  A couple of comments on the existing patch:

* reportLocalDiskError shouldn't assume that the disk error is due to lack of space.  If DiskErrorException is ever thrown for other reasons, or future code calls reportLocalDiskError for other kinds of errors, the log message could be very misleading to a user.  It's probably best to simply report a disk error and let the exception message/traceback do most of the talking about the specifics.
* Do we want to catch only DiskErrorException when trying to reserve space for a map output?  Other IOExceptions will also cause us to blame the map when the map is unlikely to be the problem.  It seems like we want to report a local error for any IOException and retry (i.e.: kill) the reducer in those cases.  In that sense maybe reportLocalDiskError should just be reportLocalError.
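To make the second point concrete, here is a minimal sketch of that classification logic. The class and method names (FetchSketch, reserve, classifyFetchFailure) are hypothetical illustrations, not the actual Fetcher/MergeManager code: the idea is simply that any IOException raised while reserving local space is charged to the reducer's node ("local") rather than to the map attempt.

```java
import java.io.IOException;

// Hypothetical sketch -- not the real Hadoop shuffle code.  It shows why
// catching IOException broadly (rather than only DiskErrorException)
// reports a local error instead of blaming the map attempt.
public class FetchSketch {
    // Stand-in for org.apache.hadoop.util.DiskChecker.DiskErrorException.
    static class DiskErrorException extends IOException {
        DiskErrorException(String msg) { super(msg); }
    }

    // Simulated local-space reservation that fails when space is short.
    static void reserve(long bytesNeeded, long bytesAvailable)
            throws IOException {
        if (bytesNeeded > bytesAvailable) {
            throw new DiskErrorException("insufficient local space: need "
                + bytesNeeded + ", have " + bytesAvailable);
        }
    }

    /**
     * Returns "local" if the failure should be charged to the reducer's
     * node (so the reducer is retried elsewhere), or "ok" if the reserve
     * succeeded.  Note the catch is on IOException, not just
     * DiskErrorException, so other local I/O failures are also not
     * blamed on the map.
     */
    static String classifyFetchFailure(long bytesNeeded,
                                       long bytesAvailable) {
        try {
            reserve(bytesNeeded, bytesAvailable);
            return "ok";
        } catch (IOException ioe) {
            // Report a generic local error; the exception message carries
            // the specifics (which may or may not be a space problem).
            return "local";
        }
    }

    public static void main(String[] args) {
        System.out.println(classifyFetchFailure(100, 10)); // local
        System.out.println(classifyFetchFailure(10, 100)); // ok
    }
}
```

Under this scheme the log message stays generic ("local error during shuffle") and the attached exception explains whether it was a space problem or something else.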
> Reducer should not implicate map attempt if it has insufficient space to fetch map output
> -----------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-5251
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5251
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.7, 2.0.4-alpha
>            Reporter: Jason Lowe
>            Assignee: Ashwin Shankar
>         Attachments: MAPREDUCE-5251-2.txt
> A job can fail if a reducer happens to run on a node with insufficient space to hold
> a map attempt's output.  The reducer keeps reporting the map attempt as bad, and if the
> map attempt ends up being re-launched too many times before the reducer decides maybe it
> is the real problem, the job can fail.
> In that scenario it would be better to re-launch the reduce attempt, which will hopefully
> run on another node that has sufficient space to complete the shuffle.  Reporting the map
> attempt as bad and relaunching the map task doesn't change the fact that the reducer can't
> hold the output.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
