hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5042) Reducer unable to fetch for a map task that was recovered
Date Mon, 04 Mar 2013 17:23:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592379#comment-13592379 ]

Jason Lowe commented on MAPREDUCE-5042:
---------------------------------------

Sorry, I was wrong.  It appears this happens without security as well.  The problem is that
the job token is regenerated from scratch each time the AM starts up, so a subsequent AM attempt
has no idea which job token the previous attempt used.  My non-secure cluster had only
one node, which is why I didn't see it there: any node that launches a container for the new
AM attempt overwrites the old shuffle token with the new one.  Any node that ran tasks only for
the old AM attempt will report shuffle verification failures for reduce tasks launched by the
new AM attempt.
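
The following toy model makes the failure mode above concrete. It is not Hadoop code. `NodeShuffleSecrets`, `registerJob`, `matches` and `newJobSecret` are hypothetical names, and comparing the raw secret directly stands in for the real shuffle hash check. The model assumes two things: each AM attempt generates a fresh random secret, and a node keeps one secret per job, which is replaced whenever a container for that job is launched there.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not Hadoop code: a node keeps one shuffle secret per job,
// and registering the job again replaces whatever secret was stored before.
class NodeShuffleSecrets {
    private final Map<String, byte[]> secretsByJob = new HashMap<>();

    // Called when a container for the job is launched on this node.
    void registerJob(String jobId, byte[] secret) {
        secretsByJob.put(jobId, secret.clone()); // overwrites any older secret
    }

    // Simplification: the real check compares hashes, not raw secrets.
    boolean matches(String jobId, byte[] presented) {
        byte[] stored = secretsByJob.get(jobId);
        return stored != null && Arrays.equals(stored, presented);
    }
}

public class JobTokenRollSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Assumption: every AM attempt rolls a brand-new random secret and knows
    // nothing about the secret used by the previous attempt.
    static byte[] newJobSecret() {
        byte[] secret = new byte[20];
        RANDOM.nextBytes(secret);
        return secret;
    }

    public static void main(String[] args) {
        String job = "job_1361569180491_21845";
        NodeShuffleSecrets nodeA = new NodeShuffleSecrets(); // ran old-attempt maps only
        NodeShuffleSecrets nodeB = new NodeShuffleSecrets(); // also gets new-attempt containers

        byte[] attempt1 = newJobSecret();
        nodeA.registerJob(job, attempt1);
        nodeB.registerJob(job, attempt1);

        byte[] attempt2 = newJobSecret(); // AM restart: token regenerated from scratch
        nodeB.registerJob(job, attempt2); // new container launch replaces the old secret

        // Reducers launched by the second AM attempt present attempt2's secret.
        System.out.println("nodeA accepts fetch: " + nodeA.matches(job, attempt2));
        System.out.println("nodeB accepts fetch: " + nodeB.matches(job, attempt2));
    }
}
```

In this model, nodeA (which ran only old-attempt maps) rejects the new attempt's fetch and nodeB accepts it. On a single-node cluster every node behaves like nodeB, which fits the problem not showing up there.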
                
> Reducer unable to fetch for a map task that was recovered
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-5042
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5042
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, security
>    Affects Versions: 0.23.7, 2.0.4-beta
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> If an application attempt fails and is relaunched, the AM will try to recover previously
> completed tasks.  If a reducer needs to fetch the output of a map task attempt that was recovered,
> then it will fail with a 401 error like this:
> {noformat}
> java.io.IOException: Server returned HTTP response code: 401 for URL: http://xx:xx/mapOutput?job=job_1361569180491_21845&reduce=0&map=attempt_1361569180491_21845_m_000016_0
> 	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1615)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:231)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:156)
> {noformat}
> Looking at the corresponding NM's logs, we see the shuffle failed due to "Verification
> of the hashReply failed".
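
The NM-side message suggests an HMAC-style check on the fetch request. The sketch below is a minimal illustration under several assumptions: the hash is HMAC-SHA1 keyed by the job token secret, it is Base64-encoded, and the signed message is the request path plus query. The class and method names are hypothetical, and this is not Hadoop's actual shuffle implementation. The sketch shows why a hash computed with the new AM attempt's secret cannot verify against a secret stored by the old attempt. The mismatch is mapped here to the 401 the fetcher sees.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch, not Hadoop's shuffle implementation.
public class ShuffleUrlHashSketch {
    private static final String ALGORITHM = "HmacSHA1"; // assumption

    // HMAC of the request message, keyed by a job token secret.
    static byte[] hmac(String msg, byte[] jobSecret) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(jobSecret, ALGORITHM));
            return mac.doFinal(msg.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(ALGORITHM + " unavailable", e);
        }
    }

    // Fetcher side: the hash sent along with the request.
    static String urlHash(String msg, byte[] reducerSecret) {
        return Base64.getEncoder().encodeToString(hmac(msg, reducerSecret));
    }

    // NM side: recompute with this node's stored secret; 401 on mismatch.
    static int verify(String msg, String presentedHash, byte[] nodeSecret) {
        byte[] presented = Base64.getDecoder().decode(presentedHash);
        byte[] expected = hmac(msg, nodeSecret);
        return MessageDigest.isEqual(expected, presented) ? 200 : 401;
    }

    public static void main(String[] args) {
        String msg = "/mapOutput?job=job_1361569180491_21845&reduce=0"
                + "&map=attempt_1361569180491_21845_m_000016_0";
        byte[] oldSecret = "am-attempt-1-secret".getBytes(StandardCharsets.UTF_8);
        byte[] newSecret = "am-attempt-2-secret".getBytes(StandardCharsets.UTF_8);

        // A reducer launched by the second AM attempt signs with the new secret.
        String hash = urlHash(msg, newSecret);
        System.out.println("NM holding old secret: " + verify(msg, hash, oldSecret)); // 401
        System.out.println("NM holding new secret: " + verify(msg, hash, newSecret)); // 200
    }
}
```

`MessageDigest.isEqual` performs the comparison so that the check does not short-circuit on the first differing byte. The sketch does not depend on this, but it is the usual choice when comparing MACs.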

