hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-5042) Reducer unable to fetch for a map task that was recovered
Date Wed, 06 Mar 2013 01:58:16 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-5042:
----------------------------------

    Attachment: MAPREDUCE-5042.patch

This is complicated by the fact that the job token currently serves a dual role: it authenticates
both the shuffle *and* the task umbilical.  The former is something that should persist across
app attempts, while the latter should not.  We don't want old task attempts authenticating
with the new app attempt, at least not at this point, since that would only serve to confuse
the new app attempt.

Therefore I propose the following:

* The current job token remains essentially as-is for authenticating the task umbilical,
and each AM attempt continues to generate its own job token.
* A new secret key, the shuffle secret, will be generated by the job client when the job is
submitted and stored as part of the job's credentials.  Each app attempt will extract the
shuffle secret from the job's credentials and use it as the shared secret to authenticate the shuffle.
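
To illustrate why a secret that outlives the app attempt helps, here is a minimal sketch of
HMAC-based request signing.  It is not the code in the attached patch: the class name, the
method names, and the request string are hypothetical, and HmacSHA1 is an assumed choice of
algorithm.  It only shows that a hash signed with one attempt's token fails verification
under another attempt's token, while a secret created once at submission verifies in both.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration only; not taken from MAPREDUCE-5042.patch.
public class ShuffleSecretSketch {
    // Assumed algorithm for this sketch.
    private static final String ALGORITHM = "HmacSHA1";

    // Generates a fresh random key, standing in for either a per-attempt
    // job token or the shuffle secret created once at job submission.
    public static byte[] generateShuffleSecret() throws GeneralSecurityException {
        return KeyGenerator.getInstance(ALGORITHM).generateKey().getEncoded();
    }

    // Signs a request string with the given secret, Base64-encoded.
    public static String hashRequest(byte[] secret, String message)
            throws GeneralSecurityException {
        Mac mac = Mac.getInstance(ALGORITHM);
        mac.init(new SecretKeySpec(secret, ALGORITHM));
        byte[] digest = mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }

    // Recomputes the hash and compares it with the presented one.
    public static boolean verify(byte[] secret, String message, String presentedHash)
            throws GeneralSecurityException {
        byte[] expected = hashRequest(secret, message).getBytes(StandardCharsets.UTF_8);
        byte[] presented = presentedHash.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, presented);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical request string, loosely modeled on a shuffle fetch URL.
        String request = "/mapOutput?job=job_1&reduce=0&map=attempt_1_m_000016_0";

        // Each app attempt generates its own job token.
        byte[] jobTokenAttempt1 = generateShuffleSecret();
        byte[] jobTokenAttempt2 = generateShuffleSecret();
        // The shuffle secret is generated once and shared by all attempts.
        byte[] shuffleSecret = generateShuffleSecret();

        // One side signs with attempt 1's token, the other checks with attempt 2's.
        String signedWithOldToken = hashRequest(jobTokenAttempt1, request);
        System.out.println("per-attempt token verifies: "
                + verify(jobTokenAttempt2, request, signedWithOldToken));

        // Both sides use the shuffle secret, regardless of app attempt.
        String signedWithShuffleSecret = hashRequest(shuffleSecret, request);
        System.out.println("shuffle secret verifies: "
                + verify(shuffleSecret, request, signedWithShuffleSecret));
    }
}
```

In this sketch the task umbilical would keep using the per-attempt token, so old task attempts
still cannot authenticate with a new app attempt.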

Attaching the first draft of a patch that implements that proposal.  It needs unit tests,
but I've manually tested that it can recover map tasks and successfully shuffle their data.
                
> Reducer unable to fetch for a map task that was recovered
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-5042
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5042
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, security
>    Affects Versions: 0.23.7, 2.0.4-beta
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-5042.patch
>
>
> If an application attempt fails and is relaunched, the AM will try to recover previously
> completed tasks.  If a reducer needs to fetch the output of a map task attempt that was
> recovered, it will fail with a 401 error like this:
> {noformat}
> java.io.IOException: Server returned HTTP response code: 401 for URL: http://xx:xx/mapOutput?job=job_1361569180491_21845&reduce=0&map=attempt_1361569180491_21845_m_000016_0
> 	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1615)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:231)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:156)
> {noformat}
> Looking at the corresponding NM's logs, we see the shuffle failed due to "Verification
> of the hashReply failed".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
