hadoop-yarn-issues mailing list archives

From "Devaraj K (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-592) Container logs lost for the application when NM gets restarted
Date Tue, 09 Jul 2013 11:21:49 GMT

    [ https://issues.apache.org/jira/browse/YARN-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703172#comment-13703172 ]

Devaraj K commented on YARN-592:
--------------------------------

Thanks Omkar for looking into the patch and trying to understand it.

This JIRA tries to address the following two cases, in which the NM goes down while containers
for an application are running, comes back up, and then launches containers for the same application:

1. Graceful shutdown of NM and start again 
2. NM crash (or abrupt kill) and start again


bq. Are you assuming that after the NM restarts, an application whose containers were running
on that node manager will again get a new container on the same node manager? At present the NM
doesn't remember the applications which were running on it across a restart. Also, the RM doesn't
inform the NM about all the running applications in the cluster.
Yes, this JIRA mainly addresses the case where containers run for the same application both
before and after the NM restart. It is the important case because the NM gets the application
completed event and deletes all the container logs for that application (including the logs of
containers which ran before the crash), and those logs (not aggregated) will not be available
in HDFS, as explained in the previous comment. If the NM doesn't get the application completed
event from the RM, the logs will at least be available in the local logs dir.
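
The failure described above can be illustrated with a small toy model. This is only a sketch
under stated assumptions, not NodeManager code: the class NodeManagerModel, its fields and its
method names are all hypothetical, "HDFS" and the local logs dir are modelled as plain dicts,
and the NM is assumed to keep its list of running containers only in memory.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManagerModel:
    """Toy stand-in for one NM. All names are hypothetical, not YARN APIs."""
    # app_id -> container ids with logs in the local logs dir (survives a crash)
    local_logs: dict = field(default_factory=dict)
    # app_id -> container ids whose logs were aggregated (stand-in for HDFS)
    hdfs_logs: dict = field(default_factory=dict)
    # app_id -> container ids the NM remembers (assumed in-memory only)
    known_containers: dict = field(default_factory=dict)

    def launch(self, app_id, container_id):
        # Launching a container registers it in memory and creates local logs.
        self.known_containers.setdefault(app_id, set()).add(container_id)
        self.local_logs.setdefault(app_id, set()).add(container_id)

    def crash_and_restart(self):
        # Abrupt kill: in-memory state is lost, local log dirs stay on disk.
        self.known_containers = {}

    def on_application_finished(self, app_id):
        # Aggregate only the containers the NM still remembers ...
        remembered = self.known_containers.pop(app_id, set())
        self.hdfs_logs.setdefault(app_id, set()).update(remembered)
        # ... then delete the whole local log dir for the application.
        self.local_logs.pop(app_id, None)


def lost_containers(nm, app_id, launched):
    """Containers whose logs are neither aggregated nor left on local disk."""
    kept = nm.hdfs_logs.get(app_id, set()) | nm.local_logs.get(app_id, set())
    return set(launched) - kept


nm = NodeManagerModel()
nm.launch("app_1", "container_1")   # ran before the crash
nm.crash_and_restart()
nm.launch("app_1", "container_2")   # same application, after the restart
nm.on_application_finished("app_1")
print(lost_containers(nm, "app_1", ["container_1", "container_2"]))
```

In this model, if on_application_finished is never called after the restart, both containers'
logs stay in local_logs, which corresponds to the "at least available in the local logs dir" case.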
 
bq. Now across an NM restart, applications might still be running or they might have just finished
before the restart. Do you want to upload the logs for both scenarios? At present we upload logs
only when the application finishes...
This patch tries to upload logs for applications which run both before and after the NM restart.
If the application completes after the NM crash and before the NM starts again, at least the logs
for the containers that ran on that node can be obtained from the NM local logs dirs.

If the NM is stopped properly, it currently uploads logs for all the running containers
before going down. We may not need to handle anything for this case.

                
> Container logs lost for the application when NM gets restarted
> --------------------------------------------------------------
>
>                 Key: YARN-592
>                 URL: https://issues.apache.org/jira/browse/YARN-592
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.0.1-alpha, 2.0.3-alpha
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>            Priority: Critical
>         Attachments: YARN-592.patch
>
>
> While running a big job if the NM goes down due to some reason and comes back, it will
> do the log aggregation for the newly launched containers and deletes all the containers for
> the application. This case we don't get the container logs from HDFS or local for the containers
> which are launched before restart and completed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
