hadoop-mapreduce-issues mailing list archives

From "Ahmed Radwan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis
Date Fri, 25 May 2012 23:42:23 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283821#comment-13283821 ]

Ahmed Radwan commented on MAPREDUCE-4284:
-----------------------------------------

Thanks Arun, 

Let me add more details. I think it's not just about the tasklogs, and that is why this property
exists. We have seen cases where inspecting the contents of the containers' localized file
directories and log directories was extremely useful in troubleshooting problems (e.g. the AM
failing to start).

I think easily controlling this property is equally important in production clusters. Consider
the following scenario:

* A job is failing on a production cluster.
* Tasklogs are not showing much, so we need to inspect the containers' files for any
clues.
* We now have to change this configuration property (e.g. set it to 1 day) and restart
every NM in the cluster, which is expensive.
* Once the problem for this job is solved, these directories are still kept for every submitted
job, which wastes storage unnecessarily. To fix that, we need to change the property
back and restart the NMs on all nodes again.

Also, thinking about this issue more: YARN is a general framework, and applications other than
MapReduce need to be considered, along with their ability to hint to YARN to keep these files. So we
can't rely on assumptions about information available through application-specific services
(e.g. the MapReduce JobHistoryServer). I think the new proposed property above can be generalized
across applications (or the Application interface could be extended).

bq. Your proposal doesn't work because the NodeManager doesn't load jobConf of the container...
this would require changes to ContainerManager protocol.

Yes, I only described how the new delay would be calculated; communicating this new jobConf
property to the DeletionService will require more changes, as you highlighted. The question
here is whether the added benefit outweighs the effort of these extra changes. Thoughts?
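
To make the discussion concrete, here is a minimal sketch of one way a per-job value could be
combined with the NM-wide yarn.nodemanager.delete.debug-delay-sec setting. Everything in it is
an assumption for illustration: the class and method names are hypothetical, the rule that a job
may only lengthen the delay (never shorten it) is just one possible policy, and it says nothing
about how the value would actually reach the DeletionService.

```java
import java.util.OptionalLong;

/** Hypothetical sketch only; this is not NodeManager code. */
public final class DebugDelaySketch {

    private DebugDelaySketch() {}

    /**
     * Resolves the deletion delay to apply to one application's container directories.
     *
     * @param nmDelaySec value of yarn.nodemanager.delete.debug-delay-sec on this NM
     * @param jobHintSec per-job request in seconds (assumed); empty if the job set none
     * @return the delay in seconds before the directories are deleted
     */
    public static long effectiveDelaySec(long nmDelaySec, OptionalLong jobHintSec) {
        if (!jobHintSec.isPresent()) {
            // No per-job hint: fall back to the NM-wide setting.
            return nmDelaySec;
        }
        // Treat a negative hint as zero (assumed policy).
        long hint = Math.max(jobHintSec.getAsLong(), 0L);
        // Assumed policy: a job can extend the NM-wide delay but not reduce it.
        return Math.max(nmDelaySec, hint);
    }

    public static void main(String[] args) {
        // NM default of 0 (delete immediately), job asks for 1 day.
        System.out.println(effectiveDelaySec(0L, OptionalLong.of(86_400L)));
        // NM default of 0, job sets nothing.
        System.out.println(effectiveDelaySec(0L, OptionalLong.empty()));
        // NM keeps files for 1 hour, job asks for only 10 minutes.
        System.out.println(effectiveDelaySec(3_600L, OptionalLong.of(600L)));
    }
}
```

With a rule like this, the failing job in the scenario above could request a 1 day delay for
itself while every other job keeps the NM-wide default, with no NM restart.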
                
> Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4284
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4284
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Ahmed Radwan
>            Assignee: Ahmed Radwan
>
> The yarn.nodemanager.delete.debug-delay-sec property is helpful in debugging jobs (inspecting
container logs/local dirs after the job finishes). Currently it is a nodemanager property
and changing it requires restarting the nodemanager. In a production cluster this can be a
real problem. It is better to have this property set on a per-job basis and not requiring
the restart of nodemanagers. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
