hadoop-mapreduce-issues mailing list archives

From "Travis Thompson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5792) When mapreduce.jobhistory.intermediate-done-dir isn't writable, application fails with generic error
Date Thu, 13 Mar 2014 21:51:50 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934153#comment-13934153 ]

Travis Thompson commented on MAPREDUCE-5792:
--------------------------------------------

I agree, the wrong directory permissions are not really the core problem here.  It is very
hard to find the AM logs for a failed AM: the "logs" link on the RM page takes you to the NM
the AM ran on, but with log aggregation enabled the logs get pushed to HDFS very quickly, and
the NM then just throws an error that the container doesn't exist.  The only way to find your
logs is to browse HDFS and locate them manually.  So maybe the better fix here is to have the
RM pull the logs from HDFS instead of linking to the NM?  I'm not sure which component is
supposed to handle log viewing besides the JHS, which is specific to M/R jobs.
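
To make "browse HDFS and find them manually" concrete, here is a rough sketch of how the aggregated-log directory for an application is usually laid out.  It assumes the stock Hadoop 2.x values for yarn.nodemanager.remote-app-log-dir (/tmp/logs) and yarn.nodemanager.remote-app-log-dir-suffix (logs); a cluster may override either one.  The class and method names are made up for illustration and are not part of Hadoop.

```java
public class AggregatedLogPath {
    // Assumed defaults; check yarn-site.xml on the cluster in question.
    static final String DEFAULT_REMOTE_LOG_DIR = "/tmp/logs";
    static final String DEFAULT_SUFFIX = "logs";

    // Hypothetical helper: builds <remote-log-dir>/<user>/<suffix>/<applicationId>.
    static String forApp(String remoteLogDir, String user, String suffix, String appId) {
        return remoteLogDir + "/" + user + "/" + suffix + "/" + appId;
    }

    public static void main(String[] args) {
        // User and application id taken from the error output quoted in the issue.
        System.out.println(forApp(DEFAULT_REMOTE_LOG_DIR, "tthompso", DEFAULT_SUFFIX,
                "application_1394227890066_0004"));
    }
}
```

The resulting HDFS directory would hold one aggregated file per NM that ran a container for the application.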

Also, I got into this situation after setting up a new cluster from scratch and missing the
permissions on a dir that didn't have world r/x.  The only reason I noticed was that I knew
from past experience to check those as a possible reason why an AM wouldn't launch.  The AM
did properly throw the error, but it never made it back to the user because stderr is
redirected to a file that is pushed to HDFS after the AM exits.

> When mapreduce.jobhistory.intermediate-done-dir isn't writable, application fails with generic error
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5792
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5792
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 2.3.0
>            Reporter: Travis Thompson
>            Assignee: Mohammad Kamrul Islam
>
> When an application is run while the permissions on {{mapreduce.jobhistory.intermediate-done-dir}}
> are wrong, the MapReduce AM fails with a non-descriptive error message:
> {noformat}
> Application application_1394227890066_0004 failed 2 times due to AM Container for appattempt_1394227890066_0004_000002 exited with exitCode: 1 due to: Exception from container-launch:
> org.apache.hadoop.util.Shell$ExitCodeException:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
> at org.apache.hadoop.util.Shell.run(Shell.java:418)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
> at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:279)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> main : command provided 1
> main : user is tthompso
> main : requested yarn user is tthompso
> Container exited with a non-zero exit code 1
> .Failing this attempt.. Failing the application. 
> {noformat}
> When permissions are corrected on this dir, applications are able to run.  There should
> probably be some sort of check on this dir before launching the AM so a more meaningful error
> message can be thrown.
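
As an illustration of the kind of pre-launch check suggested in the description, the sketch below validates a directory and returns a descriptive message naming the configuration key.  It is only a stand-in: it uses java.nio.file against a local path, whereas the real directory normally lives on HDFS and would have to be checked through Hadoop's FileSystem API.  The class name, method name, and message wording are hypothetical and do not come from any patch on this issue.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class DoneDirPreflight {
    // Hypothetical helper: returns a descriptive problem, or empty if the dir looks usable.
    static Optional<String> check(String confKey, Path dir) {
        if (!Files.exists(dir)) {
            return Optional.of(confKey + " (" + dir + ") does not exist");
        }
        if (!Files.isDirectory(dir)) {
            return Optional.of(confKey + " (" + dir + ") is not a directory");
        }
        // Creating files in a directory needs both write and execute (traverse) access.
        if (!Files.isWritable(dir) || !Files.isExecutable(dir)) {
            return Optional.of(confKey + " (" + dir + ") is not writable by user "
                    + System.getProperty("user.name"));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Local path used purely for demonstration; defaults to the JVM temp dir.
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        Optional<String> problem = check("mapreduce.jobhistory.intermediate-done-dir", dir);
        System.out.println(problem.orElse("OK: " + dir));
    }
}
```

Failing fast with a message like this at submission time would put the cause in front of the user, instead of in a stderr file that is only reachable after log aggregation.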



--
This message was sent by Atlassian JIRA
(v6.2#6252)
