hadoop-yarn-issues mailing list archives

From "Xuan Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1229) Shell$ExitCodeException could happen if AM fails to start
Date Mon, 23 Sep 2013 22:09:07 GMT

    [ https://issues.apache.org/jira/browse/YARN-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13775711#comment-13775711
] 

Xuan Gong commented on YARN-1229:
---------------------------------

[~vinodkv], [~bikassaha], [~hitesh], [~sseth], [~jlowe], [~cnauroth]

The bug shows up as an error in launch_container.sh when it tries to export NM_AUX_SERVICE_mapreduce.shuffle.
The problem is that '.' is not a valid character in an environment variable name. To
fix this, we might need to rename the service.
Three places need to be renamed (using mapreduce_shuffle instead of mapreduce.shuffle):
{code}
  public static final String MAPREDUCE_SHUFFLE_SERVICEID =
      "mapreduce.shuffle";
{code}
in ShuffleHandler.java.

The other two places are in yarn-site.xml:
{code}
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce.shuffle</value>
        <description>shuffle service that needs to be set for Map Reduce to run </description>
    </property>
    
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
{code}
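
To illustrate why the dotted name is rejected, here is a minimal sketch that checks a name against the usual shell identifier rule (letters, digits and underscores, not starting with a digit). EnvNameCheck and isValidEnvName are hypothetical names used only for this example; they are not part of Hadoop.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Hadoop class): tests a name against the shell identifier rule. */
final class EnvNameCheck {
    // Letters, digits and underscores only; the first character must not be a digit.
    private static final Pattern SHELL_IDENTIFIER =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Returns true if the shell's export builtin would accept this variable name. */
    static boolean isValidEnvName(String name) {
        return name != null && SHELL_IDENTIFIER.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The name from the failing launch_container.sh contains a '.'.
        System.out.println(isValidEnvName("NM_AUX_SERVICE_mapreduce.shuffle")); // false
        // The proposed rename uses '_' instead.
        System.out.println(isValidEnvName("NM_AUX_SERVICE_mapreduce_shuffle")); // true
    }
}
```

A check like this could also be applied to the configured aux-service names at startup, so that a bad name is reported before any container launch fails.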

We can simply replace the name with mapreduce_shuffle in all three places, or we can split the shuffle
service out of the aux_services, say, by creating a new property called mapreduce_shuffle_service.
The ShuffleHandler could then read this property instead of defining MAPREDUCE_SHUFFLE_SERVICEID
itself, and AuxService#init() would need to read both mapreduce_shuffle_service and yarn.nodemanager.aux-services
to do the initialization.

An alternative is to convert all special characters to "_", with AuxServiceHelpers becoming the
public API to access this data.
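
A rough sketch of that conversion, assuming "special" means anything other than letters, digits and "_". The NM_AUX_SERVICE_ prefix is taken from the error message in the issue description; AuxServiceEnvNames and toEnvName are hypothetical names for this example and do not reflect the actual AuxServiceHelpers API.

```java
/** Hypothetical sketch (not the real AuxServiceHelpers): builds a shell-safe variable name. */
final class AuxServiceEnvNames {
    // Prefix as seen in the failing export line of launch_container.sh.
    private static final String PREFIX = "NM_AUX_SERVICE_";

    /** Replaces every character outside [A-Za-z0-9_] with '_' and prepends the prefix. */
    static String toEnvName(String serviceName) {
        return PREFIX + serviceName.replaceAll("[^A-Za-z0-9_]", "_");
    }

    public static void main(String[] args) {
        System.out.println(toEnvName("mapreduce.shuffle")); // NM_AUX_SERVICE_mapreduce_shuffle
    }
}
```

One caveat with this approach: two different service names (for example a.b and a_b) would map to the same variable, so whichever side reads the variable back has to apply the same conversion and such collisions would need to be ruled out.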

Since we're renaming variables, this can be considered a backward-incompatible change. I would
like to get in touch with folks who are already using it.
                
> Shell$ExitCodeException could happen if AM fails to start
> ---------------------------------------------------------
>
>                 Key: YARN-1229
>                 URL: https://issues.apache.org/jira/browse/YARN-1229
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.1.1-beta
>            Reporter: Tassapol Athiapinya
>            Assignee: Xuan Gong
>            Priority: Critical
>             Fix For: 2.1.1-beta
>
>
> I ran a sleep job. If the AM fails to start, this exception can occur:
> 13/09/20 11:00:23 INFO mapreduce.Job: Job job_1379673267098_0020 failed with state FAILED
due to: Application application_1379673267098_0020 failed 1 times due to AM Container for
appattempt_1379673267098_0020_000001 exited with  exitCode: 1 due to: Exception from container-launch:
> org.apache.hadoop.util.Shell$ExitCodeException: /myappcache/application_1379673267098_0020/container_1379673267098_0020_01_000001/launch_container.sh:
line 12: export: `NM_AUX_SERVICE_mapreduce.shuffle=AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
> ': not a valid identifier
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:270)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:78)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> .Failing this attempt.. Failing the application.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
