hadoop-yarn-dev mailing list archives

From "Victor Kim (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-2090) If Kerberos Authentication is enabled, MapReduce job is failing on reducer phase
Date Wed, 21 May 2014 21:12:39 GMT
Victor Kim created YARN-2090:

             Summary: If Kerberos Authentication is enabled, MapReduce job is failing on reducer phase
                 Key: YARN-2090
                 URL: https://issues.apache.org/jira/browse/YARN-2090
             Project: Hadoop YARN
          Issue Type: Bug
          Components: applications, nodemanager
    Affects Versions: 2.4.0
         Environment: hadoop:
            Reporter: Victor Kim
            Priority: Critical

I have a 3-node cluster configuration: 1 ResourceManager and 3 NodeManagers. Kerberos is enabled,
and there are hdfs, yarn, and mapred principals/keytabs. The ResourceManager and NodeManagers run
as the yarn user, using the yarn Kerberos principal.
Use case 1: WordCount, job submitted using the yarn UGI (i.e. the superuser, the one that has a
Kerberos principal on all boxes). Result: the job completes successfully.
Use case 2: WordCount, job submitted using LDAP user impersonation via the yarn UGI. Result: Map
tasks complete successfully, but the Reduce task fails with ShuffleError Caused by: java.io.IOException:
Exceeded MAX_FAILED_UNIQUE_FETCHES (see the stack trace below).
The use case with user impersonation used to work on earlier versions, without YARN (with

I found a similar issue involving Kerberos authentication here: https://groups.google.com/forum/#!topic/nosql-databases/tGDqs75ACqQ
A related issue, https://issues.apache.org/jira/browse/MAPREDUCE-4030, is marked as resolved, but
the problem still occurs when Kerberos authentication is enabled.

The exception trace from YarnChild JVM:
2014-05-21 12:49:35,687 FATAL [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl:
Shuffle failed with too many fetch failures and insufficient progress!
2014-05-21 12:49:35,688 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running
child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:323)
        at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:245)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
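
For readers unfamiliar with the "Exceeded MAX_FAILED_UNIQUE_FETCHES" message, the following is a minimal, self-contained sketch of the kind of health check a reducer might apply before giving up on the shuffle. It is a simplified model, not the actual ShuffleSchedulerImpl code: the class name, method names, and threshold values are assumptions chosen for illustration, and a real check may also consider how long the reducer has been stalled. It says nothing about why the fetches fail under impersonation; it only shows how repeated fetch failures with little progress can lead to the bail-out seen in the trace.

```java
import java.io.IOException;

/** Hypothetical, simplified model of a reducer-side shuffle health check. */
final class ShuffleHealthSketch {
    // Assumed thresholds for illustration only; not taken from the report.
    static final int MAX_FAILED_UNIQUE_FETCHES = 5;
    static final float MAX_FAILED_FETCH_FRACTION = 0.5f;
    static final float MIN_PROGRESS_FRACTION = 0.5f;

    /**
     * Returns true when the reducer should abort the shuffle.
     *
     * @param totalMaps        number of map outputs the reducer must fetch
     * @param doneMaps         number of map outputs fetched so far
     * @param uniqueFailedMaps number of distinct map outputs with at least one failed fetch
     * @param totalFailures    total number of failed fetch attempts
     */
    static boolean shouldBailOut(int totalMaps, int doneMaps,
                                 int uniqueFailedMaps, long totalFailures) {
        if (totalMaps <= 0) {
            throw new IllegalArgumentException("totalMaps must be positive");
        }
        int remainingMaps = totalMaps - doneMaps;
        int limit = Math.min(totalMaps, MAX_FAILED_UNIQUE_FETCHES);

        // Either many distinct map outputs failed, or every remaining one did.
        boolean tooManyUniqueFailures =
                uniqueFailedMaps >= limit || uniqueFailedMaps == remainingMaps;

        // Healthy: failed attempts are a minority of all attempts so far.
        boolean healthy = totalFailures == 0
                || ((float) totalFailures / (totalFailures + doneMaps)) < MAX_FAILED_FETCH_FRACTION;

        // Progressed enough: at least half of the map outputs were fetched.
        boolean progressedEnough =
                ((float) doneMaps / totalMaps) >= MIN_PROGRESS_FRACTION;

        return tooManyUniqueFailures && !healthy && !progressedEnough;
    }

    /** Throws an IOException with the message from the trace when the check fails. */
    static void checkOrThrow(int totalMaps, int doneMaps,
                             int uniqueFailedMaps, long totalFailures) throws IOException {
        if (shouldBailOut(totalMaps, doneMaps, uniqueFailedMaps, totalFailures)) {
            throw new IOException("Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.");
        }
    }

    public static void main(String[] args) {
        // A reducer that has fetched nothing and failed on 5 distinct map outputs.
        System.out.println(shouldBailOut(10, 0, 5, 20));
        // A reducer that is mostly done, with only a few failures.
        System.out.println(shouldBailOut(10, 8, 2, 3));
    }
}
```

Under this model, a reducer that cannot fetch any map output at all (as in use case 2, where every fetch appears to fail) reaches the bail-out condition quickly, while a reducer with occasional failures and steady progress does not.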
