hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5792) When mapreduce.jobhistory.intermediate-done-dir isn't writable, application fails with generic error
Date Wed, 12 Mar 2014 14:25:45 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931805#comment-13931805 ]

Jason Lowe commented on MAPREDUCE-5792:
---------------------------------------

The main issue with a pre-launch check is that it adds yet more work for the job client to do
before submitting the job when the AM is already doing the work in this area (i.e., trying to
create the directory in question).  This should be a relatively rare occurrence, as the
intermediate base directory not being writable indicates the cluster wasn't set up properly.
Actually, I'm a bit curious as to how this even occurred in the first place.  The history
server should have set up the proper permissions when it started.  [~tthompso], can you
elaborate on how the intermediate directory happened to have the wrong permissions?  I'm
wondering if there's a related bug in the history server that needs to be fixed.

Arguably this wouldn't be a big deal if we solved the larger issue of diagnostics from the
AM crash not making it back to the job client.  The AM logs should already be stating what's
going wrong (e.g., the "Error creating user intermediate history done directory" exception and
its cause).  If the user saw that error message in the client when the job crashed, it would
be clearer why the job failed.  YARN-675 was supposed to help with this at least somewhat,
and providing proper diagnostics would also help with similar issues, such as the AM crashing
when the metainfo split size is exceeded (see MAPREDUCE-4937).

> When mapreduce.jobhistory.intermediate-done-dir isn't writable, application fails with generic error
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5792
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5792
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 2.3.0
>            Reporter: Travis Thompson
>            Assignee: Mohammad Kamrul Islam
>
> When running an application while the permissions on {{mapreduce.jobhistory.intermediate-done-dir}}
> are wrong, the MapReduce AM fails with a non-descriptive error message:
> {noformat}
> Application application_1394227890066_0004 failed 2 times due to AM Container for appattempt_1394227890066_0004_000002 exited with exitCode: 1 due to: Exception from container-launch:
> org.apache.hadoop.util.Shell$ExitCodeException:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
> at org.apache.hadoop.util.Shell.run(Shell.java:418)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
> at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:279)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> main : command provided 1
> main : user is tthompso
> main : requested yarn user is tthompso
> Container exited with a non-zero exit code 1
> .Failing this attempt.. Failing the application. 
> {noformat}
> When permissions are corrected on this dir, applications are able to run.  There should
> probably be some sort of check on this dir before launching the AM so a more meaningful error
> message can be thrown.
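
For illustration only, here is a minimal sketch of the kind of pre-launch check the
description above asks for.  It is not Hadoop code: it uses java.nio on the local filesystem
as a stand-in, and the class name DoneDirPreflight and the method describeProblem are
hypothetical.  An actual check would presumably have to go through Hadoop's FileSystem API
against the configured intermediate done directory, which is an assumption not covered here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Hadoop: a local-filesystem stand-in for
// the kind of pre-launch check the issue description asks for.
final class DoneDirPreflight {

    // Returns an empty Optional when the directory looks usable, otherwise a
    // human-readable reason that could be shown to the submitting user.
    static Optional<String> describeProblem(Path dir) {
        if (!Files.exists(dir)) {
            return Optional.of("does not exist: " + dir);
        }
        if (!Files.isDirectory(dir)) {
            return Optional.of("not a directory: " + dir);
        }
        if (!Files.isWritable(dir)) {
            return Optional.of("not writable by user "
                    + System.getProperty("user.name") + ": " + dir);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Defaults to the JVM temp dir; pass a path to check something else.
        Path dir = Paths.get(args.length > 0 ? args[0]
                : System.getProperty("java.io.tmpdir"));
        Optional<String> problem = describeProblem(dir);
        if (problem.isPresent()) {
            System.err.println("intermediate done dir check failed: " + problem.get());
            System.exit(1);
        }
        System.out.println("intermediate done dir looks usable: " + dir);
    }
}
```

The point of the sketch is the shape of the message: it names the directory and the reason
instead of a bare exit code.  As the comment above notes, running such a check in the job
client duplicates work the AM already does, so the same wording could just as well come from
AM diagnostics reaching the client.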



--
This message was sent by Atlassian JIRA
(v6.2#6252)
