hadoop-yarn-issues mailing list archives

From "Yesha Vora (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-7175) Log collection fails when a container is acquired but not launched on NM
Date Fri, 08 Sep 2017 01:15:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-7175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yesha Vora updated YARN-7175:
-----------------------------
    Description: 
Scenario:
* Run Spark App
* As soon as the Spark application finishes, run the "yarn application -status <appID>" CLI in a loop for 2-3 minutes to check the log aggregation status.

I'm noticing that the log aggregation status remains "RUNNING" and eventually ends up with a "TIMED_OUT" status.

This situation happens when an application has acquired a container that is never launched on the NM.

This scenario should be handled better; it should not delay retrieving the application logs.

Example: application_1502070770869_0012
application_1502070770869_0012 finished at 2017-08-07 03:06:39. The logs were not available until 2017-08-07 03:08:36.
{code}
RUNNING: /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status application_1502070770869_0012
17/08/07 03:08:37 INFO client.AHSProxy: Connecting to Application History server at host5/xxx.xx.xx.xx:10200
17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
17/08/07 03:08:37 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
Application Report :
Application-Id : application_1502070770869_0012
Application-Name : ml.R
Application-Type : SPARK
User : hrt_qa
Queue : default
Application Priority : null
Start-Time : 1502075166506
Finish-Time : 1502075198997
Progress : 100%
State : FINISHED
2017-08-07 03:08:37,770|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Final-State : SUCCEEDED
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Tracking-URL : ctr-e134-1499953498516-83705-01-000005.hwx.site:18080/history/application_1502070770869_0012/1
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|RPC Port : 0
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|AM Host : 172.27.21.204
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Aggregate Resource Allocation : 174680 MB-seconds, 84 vcore-seconds
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Log Aggregation Status : RUNNING
2017-08-07 03:08:37,771|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Diagnostics :
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Unmanaged Application : false
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Application Node Label Expression : <Not set>
2017-08-07 03:08:37,772|INFO|MainThread|machine.py:159 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|AM container Node Label Expression : <DEFAULT_PARTITION>
2017-08-07 03:08:37,808|INFO|MainThread|machine.py:184 - run()||GUID=94721f7c-d414-4936-b8a1-a387eae8d6c6|Exit Code: 0{code}
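The reproduction loop above can be sketched as a small helper. This is a hypothetical illustration, not part of the report: the function names, the 5-second interval, and the 3-minute timeout are assumptions; only the "Log Aggregation Status : ..." field format comes from the CLI output shown above.

```python
import re
import subprocess
import time
from typing import Optional

def parse_log_aggregation_status(report: str) -> Optional[str]:
    """Pull the log aggregation state out of `yarn application -status` output.

    The field format ("Log Aggregation Status : RUNNING") matches the report
    above; \\s* also tolerates the line wrapping seen in captured logs.
    """
    m = re.search(r"Log\s*Aggregation Status\s*:\s*(\S+)", report)
    return m.group(1) if m else None

def poll_log_aggregation(app_id: str, interval_s: int = 5, timeout_s: int = 180) -> str:
    """Re-run the status CLI until the state leaves RUNNING or we give up."""
    deadline = time.time() + timeout_s
    status: Optional[str] = "RUNNING"
    while time.time() < deadline and status in ("RUNNING", None):
        out = subprocess.run(
            ["yarn", "application", "-status", app_id],
            capture_output=True, text=True,
        ).stdout
        status = parse_log_aggregation_status(out)
        time.sleep(interval_s)
    return status or "UNKNOWN"
```

With the bug described here, such a loop never sees the status leave "RUNNING", so it exhausts the timeout instead of returning an aggregated state.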

  was:
Scenario:
* Run Spark App
* As soon as the Spark application finishes, run the "yarn application -status <appID>" CLI in a loop for 2-3 minutes to check the log aggregation status.

I'm noticing that the log aggregation status remains "RUNNING" and eventually ends up with a "TIMED_OUT" status.

This situation happens when an application has acquired a container that is never launched on the NM.

This scenario should be handled better; it should not delay retrieving the application logs.


> Log collection fails when a container is acquired but not launched on NM
> ------------------------------------------------------------------------
>
>                 Key: YARN-7175
>                 URL: https://issues.apache.org/jira/browse/YARN-7175
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Yesha Vora
>         Attachments: SparkApp.log
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

