hadoop-yarn-issues mailing list archives

From "JackZhou (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool
Date Sat, 27 May 2017 14:43:04 GMT

     [ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

JackZhou updated YARN-6661:
---------------------------
    Issue Type: Bug  (was: Improvement)

> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> -----------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2 
>            Reporter: JackZhou
>             Fix For: 2.9.0
>
>
> Someone else has already reported a similar problem and fixed it;
> see YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details.
> However, I think that fix does not solve the problem completely. Below is the problem I encountered:
> There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800 apps.
> I failed over my active RM, and the RM failed over all 1800 apps.
> When an application fails over, the RM waits for its AM container to register itself.
> But there is a bug in my AM (which I introduced intentionally), so it never registers itself.
> The RM therefore waits about 10 minutes for the AM to expire and then sends a CLEANUP event to the
> ApplicationMasterLauncher thread pool. Because there are about 1800 apps, this hangs the
> ApplicationMasterLauncher thread pool for a long time. I have already applied the patch
> (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so
> a CLEANUP event hangs a thread for 10 * 20 = 200s. But I have 1800 apps, so each of my 50 threads
> hangs for 1800 / 50 * 200s = 7200s = 120min.
> Because the AM has not registered itself within 10 minutes, the RM retries and creates a new
> application attempt.
> The new application attempt accepts a container from the RM and sends a LAUNCH event to the
> ApplicationMasterLauncher thread pool.
> Because the 1800 CLEANUP events keep all 50 threads busy for far longer than 10 minutes, the new
> application attempt cannot start its AM container within 10 minutes.
> So it expires as well, and sends another CLEANUP event to the ApplicationMasterLauncher thread pool.
> As you can see, none of my applications can actually run.
> Each of them has 5 application attempts, as follows, and each of them keeps retrying:
> appattempt_1495786030132_4000_000005
> appattempt_1495786030132_4000_000004
> appattempt_1495786030132_4000_000003
> appattempt_1495786030132_4000_000002
> appattempt_1495786030132_4000_000001
> So all of my apps have been hanging for several hours, and none of them can actually run.
> I think this is a bug. We could treat CLEANUP and LAUNCH as different kinds of events
> and use separate threads to handle LAUNCH events, or find some other way.
> Sorry, my English is poor; I am not sure I have described this clearly.
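
The proposal above (handling LAUNCH and CLEANUP on separate threads) can be sketched
roughly as follows. This is only an illustration under stated assumptions, not the
actual ApplicationMasterLauncher code and not the fix that was committed: the class
LauncherSketch, its methods, and the pool sizes are hypothetical names chosen for this
example. The helper cleanupBacklogSeconds just reproduces the arithmetic from the
report (1800 apps, 50 threads, 200 s per CLEANUP event).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: one pool per event type, so slow CLEANUPs cannot starve LAUNCHes. */
class LauncherSketch {
    enum EventType { LAUNCH, CLEANUP }

    /** Worst-case seconds of CLEANUP work queued on each thread of a shared pool. */
    static long cleanupBacklogSeconds(int apps, int threads, int secondsPerEvent) {
        long eventsPerThread = (apps + threads - 1) / threads; // ceiling division
        return eventsPerThread * secondsPerEvent;
    }

    private final ExecutorService launchPool;
    private final ExecutorService cleanupPool;
    final AtomicInteger launched = new AtomicInteger();
    final AtomicInteger cleaned = new AtomicInteger();

    LauncherSketch(int launchThreads, int cleanupThreads) {
        this.launchPool = Executors.newFixedThreadPool(launchThreads);
        this.cleanupPool = Executors.newFixedThreadPool(cleanupThreads);
    }

    /** Route each event to the pool dedicated to its type. */
    void handle(EventType type, Runnable work) {
        switch (type) {
            case LAUNCH:
                launchPool.execute(() -> { work.run(); launched.incrementAndGet(); });
                break;
            case CLEANUP:
                cleanupPool.execute(() -> { work.run(); cleaned.incrementAndGet(); });
                break;
            default:
                throw new IllegalArgumentException("Unknown event type: " + type);
        }
    }

    /** Stop both pools, interrupting any task that is still blocked. */
    void shutdown() {
        launchPool.shutdownNow();
        cleanupPool.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        long backlog = cleanupBacklogSeconds(1800, 50, 200);
        System.out.println("Shared-pool CLEANUP backlog per thread: "
            + backlog + " s (" + backlog / 60 + " min)"); // 7200 s (120 min)

        LauncherSketch sketch = new LauncherSketch(2, 2);
        CountDownLatch stuck = new CountDownLatch(1);   // never released: simulates hung CLEANUPs
        CountDownLatch amStarted = new CountDownLatch(1);
        for (int i = 0; i < 10; i++) {
            sketch.handle(EventType.CLEANUP, () -> {
                try {
                    stuck.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        sketch.handle(EventType.LAUNCH, amStarted::countDown);
        // The LAUNCH runs even though every CLEANUP thread is blocked.
        System.out.println("LAUNCH ran: " + amStarted.await(5, TimeUnit.SECONDS));
        sketch.shutdown();
    }
}
```

With a single shared pool, the same LAUNCH event would sit in the queue behind the
blocked CLEANUP events. Whether two pools, a priority queue, or a shorter CLEANUP
timeout is the better remedy is a design question this sketch does not settle.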



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

