hadoop-yarn-issues mailing list archives

From "wuchang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7474) Yarn resourcemanager stop allocating container when cluster resource is sufficient
Date Tue, 14 Nov 2017 08:28:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251082#comment-16251082
] 

wuchang commented on YARN-7474:
-------------------------------

[~yufeigu] [~templedf] Many thanks for your reply.
I noticed that the bug mentioned in [YARN-4477|https://issues.apache.org/jira/browse/YARN-4477] only applies to Hadoop 2.8.0 or later, but my Hadoop version is 2.7.2. I have already checked my 2.7.2 source code, and the method *reservationExceedsThreshold()* mentioned there does not exist in it.
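For what it's worth, my rough understanding of what the 2.8.0 check does is something like the sketch below (this is only my own approximation with my own names, not the real FSAppAttempt code, which, as noted, is not present in 2.7.2):
{code:java}
// Illustrative sketch of a reservation-threshold check (my own approximation,
// not the actual FSAppAttempt.reservationExceedsThreshold() implementation).
class ReservationThresholdSketch {
  static boolean reservationExceedsThreshold(int reservationsHeldByApp,
                                             int numNodes,
                                             double reservableNodesFraction) {
    // Once an application already holds reservations on "enough" of the
    // cluster's nodes, refuse to create further reservations for it.
    int allowed = (int) Math.ceil(numNodes * reservableNodesFraction);
    return reservationsHeldByApp >= allowed;
  }
}
{code}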
Would you please give me more suggestions?

> Yarn resourcemanager stop allocating container when cluster resource is sufficient 
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-7474
>                 URL: https://issues.apache.org/jira/browse/YARN-7474
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: wuchang
>            Priority: Critical
>
> Hadoop Version: *2.7.2*
> My YARN cluster has *(1,100 GB, 368 vCores)* in total, with 15 NodeManagers.
> My cluster uses the Fair Scheduler, and I have 4 queues for different kinds of jobs:
>  
> {quote}
> <allocations>
>     <queue name="queue1">
>        <minResources>100000 mb, 30 vcores</minResources>
>        <maxResources>422280 mb, 132 vcores</maxResources>
>        <maxAMShare>0.5f</maxAMShare>
>        <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>        <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>        <maxRunningApps>50</maxRunningApps>
>     </queue>
>     <queue name="queue2">
>        <minResources>25000 mb, 20 vcores</minResources>
>        <maxResources>600280 mb, 150 vcores</maxResources>
>        <maxAMShare>0.6f</maxAMShare>
>        <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>        <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>        <maxRunningApps>50</maxRunningApps>
>     </queue>
>     <queue name="queue3">
>        <minResources>100000 mb, 30 vcores</minResources>
>        <maxResources>647280 mb, 132 vcores</maxResources>
>        <maxAMShare>0.8f</maxAMShare>
>        <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>        <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>        <maxRunningApps>50</maxRunningApps>
>     </queue>
>   
>     <queue name="queue4">
>        <minResources>80000 mb, 20 vcores</minResources>
>        <maxResources>120000 mb, 30 vcores</maxResources>
>        <maxAMShare>0.5f</maxAMShare>
>        <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>        <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>        <maxRunningApps>50</maxRunningApps>
>      </queue>
> </allocations>
>  {quote}
> From about 9:00 AM, all newly submitted applications got stuck for nearly 5 hours, while the cluster resource usage was only about *(600 GB, 120 vCores)*; in other words, cluster resources were still *sufficient*.
> *The resource usage of the whole YARN cluster AND of each single queue stayed unchanged for 5 hours*, which is really strange. Obviously, if this were a resource-insufficiency problem, it would be impossible for the resource usage of every queue to stay completely flat for 5 hours. So I think the problem is in the ResourceManager.
> Since my cluster is not large, only 15 nodes with 1,100 GB of memory, I exclude the possibility described in [YARN-4618].
>  
> Besides that, none of the running applications ever seemed to finish, and the YARN RM seemed static: the RM log had no more state-change entries for the running applications, only entries showing that more and more applications were submitted and became ACCEPTED, but never went from ACCEPTED to RUNNING.
> The cluster looked like a zombie.
>  
> I have checked the ApplicationMaster log of one of the running but stuck applications:
> 
> {quote}
> 2017-11-11 09:04:55,896 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for MAP job_1507795051888_183385. Report-size will be 4
> 2017-11-11 09:04:55,957 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for REDUCE job_1507795051888_183385. Report-size will be 0
> 2017-11-11 09:04:56,037 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
> 2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1507795051888_183385: ask=6 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15
> 2017-11-11 13:58:56,736 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job job_1507795051888_183385 received from appuser (auth:SIMPLE) at 10.120.207.11
> {quote}
>  
> You can see that at *2017-11-11 09:04:56,061* the AM sent a resource request to the ResourceManager, but the RM allocated zero containers. Then there were no more logs for 5 hours. At 13:58 I had to kill the job manually.
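> For context, the AM-RM exchange in that log corresponds roughly to the allocate loop below (an illustrative sketch using the public AMRMClient API, not the actual MapReduce RMContainerAllocator code): "newContainers=0" means the allocated-container list in the response stays empty, and "resourcelimit" is the headroom the RM reports back.
> {code:java}
> import java.util.List;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
> import org.apache.hadoop.yarn.api.records.Container;
> import org.apache.hadoop.yarn.client.api.AMRMClient;
> 
> public class AllocateLoopSketch {
>   public static void main(String[] args) throws Exception {
>     AMRMClient<AMRMClient.ContainerRequest> rm = AMRMClient.createAMRMClient();
>     rm.init(new Configuration());
>     rm.start();
>     rm.registerApplicationMaster("", 0, "");
>     // addContainerRequest() calls here would correspond to "ask=6" in the log.
>     while (true) {
>       AllocateResponse response = rm.allocate(0.1f);
>       List<Container> allocated = response.getAllocatedContainers();
>       // In my case this stayed at 0 for 5 hours ("newContainers=0"),
>       // even though the reported headroom ("resourcelimit") was large.
>       System.out.println("allocated=" + allocated.size()
>           + ", headroom=" + response.getAvailableResources());
>       Thread.sleep(1000L);
>     }
>   }
> }
> {code}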
>  
> After 5 hours, I killed some pending applications and then everything recovered: the remaining cluster resources could be allocated again, and the ResourceManager seemed to be alive again.
>  
> I have excluded the possibility that the maxRunningApps or maxAMShare settings are what is blocking things, because those limits only affect a single queue, while my problem is that applications across the whole YARN cluster get stuck.
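> Just to make that reasoning explicit: as far as I understand it, the maxAMShare limit is evaluated per leaf queue, roughly like this simplified sketch (plain numbers and my own names, not the real FSLeafQueue code):
> {code:java}
> // Per-queue AM limit, as I understand the FairScheduler behaviour (sketch only).
> class AmShareCheckSketch {
>   double queueFairShareMb;   // fair share of ONE leaf queue, in MB
>   double queueAmUsageMb;     // memory already used by AMs in that queue
>   double maxAMShare;         // <maxAMShare> from fair-scheduler.xml
> 
>   boolean canRunAppAM(double newAmMb) {
>     double amLimitMb = queueFairShareMb * maxAMShare;
>     // Only this queue's fair share and this queue's running AMs are involved,
>     // so hitting the limit should only stall applications inside this queue.
>     return queueAmUsageMb + newAmMb <= amLimitMb;
>   }
> }
> {code}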
> 
> Also, I have excluded the possibility of a ResourceManager full-GC problem, because I checked it with gcutil: no full GC happened, and the ResourceManager's memory usage is OK.
>  
> So, could anyone give me some suggestions?
>  


