hadoop-yarn-dev mailing list archives

From "Tsuyoshi OZAWA (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (YARN-2476) Apps are scheduled in random order after RM failover
Date Fri, 03 Oct 2014 04:33:33 GMT

     [ https://issues.apache.org/jira/browse/YARN-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi OZAWA resolved YARN-2476.
----------------------------------
    Resolution: Duplicate

> Apps are scheduled in random order after RM failover
> ----------------------------------------------------
>
>                 Key: YARN-2476
>                 URL: https://issues.apache.org/jira/browse/YARN-2476
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>         Environment: Linux
>            Reporter: Santosh Marella
>              Labels: ha, high-availability, resourcemanager
>
> RM HA is configured with 2 RMs, using FileSystemRMStateStore. The FairScheduler
> allocation file is configured in yarn-site.xml:
> <property>
>   <name>yarn.scheduler.fair.allocation.file</name>
>   <value>/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop/allocation-pools.xml</value>
> </property>
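> For reference, the HA ids in this setup are rm1 and rm2 (rm2 appears in the client
> failover log further below); each RM's active/standby role can be checked with
> yarn rmadmin. A sketch only:
>     # Each RM should report either "active" or "standby"
>     yarn rmadmin -getServiceState rm1
>     yarn rmadmin -getServiceState rm2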
> FS allocation-pools.xml:
> <?xml version="1.0"?>
> <allocations>
>   <queue name="dev">
>     <minResources>10000 mb,10vcores</minResources>
>     <maxResources>19000 mb,100vcores</maxResources>
>     <maxRunningApps>5525</maxRunningApps>
>     <weight>4.5</weight>
>     <schedulingPolicy>fair</schedulingPolicy>
>     <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
>   </queue>
>   <queue name="default">
>     <minResources>10000 mb,10vcores</minResources>
>     <maxResources>19000 mb,100vcores</maxResources>
>     <maxRunningApps>5525</maxRunningApps>
>     <weight>1.5</weight>
>     <schedulingPolicy>fair</schedulingPolicy>
>     <fairSharePreemptionTimeout>3600</fairSharePreemptionTimeout>
>   </queue>
>   <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
>   <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
> </allocations>
>     Submitted 10 sleep jobs to an FS queue using the command:
>     hadoop jar hadoop-mapreduce-examples-2.4.1-mapr-4.0.1-SNAPSHOT.jar sleep
>     -Dmapreduce.job.queuename=root.dev  -m 10 -r 10 -mt 10000 -rt 10000
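> To submit all ten jobs in a fixed order, the command above can be looped; a sketch
> (jar name as in this report; the short pause keeps the submission order deterministic):
>     # Submit 10 sleep jobs in a fixed order; background each client so the
>     # jobs queue up in the scheduler instead of running one at a time.
>     for i in $(seq 1 10); do
>       hadoop jar hadoop-mapreduce-examples-2.4.1-mapr-4.0.1-SNAPSHOT.jar sleep \
>           -Dmapreduce.job.queuename=root.dev -m 10 -r 10 -mt 10000 -rt 10000 &
>       sleep 2
>     done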
>     All the jobs were submitted by the same user, with the same priority and to the
>     same queue. No other jobs were running in the cluster. Jobs started executing
>     in the order in which they were submitted (jobs 6 to 10 were active, while 11
>     to 15 were waiting):
>     root@perfnode131:/opt/mapr/hadoop/hadoop-2.4.1/logs# yarn application -list
>     Total number of applications (application-types: [] and states: [SUBMITTED,ACCEPTED,RUNNING]):10
>     Application-Id                  Application-Name    Application-Type    User     Queue       State       Final-State    Progress    Tracking-URL
>     application_1408572781346_0012  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0014  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0011  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0010  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode132:52799
>     application_1408572781346_0008  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode131:33766
>     application_1408572781346_0009  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode132:50964
>     application_1408572781346_0007  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode134:52966
>     application_1408572781346_0015  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0006  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      9.5%        http://perfnode134:34094
>     application_1408572781346_0013  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     Stopped RM1. A failover occurred and RM2 became active (the stop/verify
>     commands are sketched after the listing below). The jobs, however, appear
>     to have started in a different order:
>     root@perfnode131:~/scratch/raw_rm_logs_fs_hang# yarn application -list
>     14/08/21 07:26:13 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
>     Total number of applications (application-types: [] and states: [SUBMITTED,ACCEPTED,RUNNING]):10
>     Application-Id                  Application-Name    Application-Type    User     Queue       State       Final-State    Progress    Tracking-URL
>     application_1408572781346_0012  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode134:59351
>     application_1408572781346_0014  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode132:37866
>     application_1408572781346_0011  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode131:59744
>     application_1408572781346_0010  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0008  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0009  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0007  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0015  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode134:39754
>     application_1408572781346_0006  Sleep job           MAPREDUCE           userA    root.dev    ACCEPTED    UNDEFINED      0%          N/A
>     application_1408572781346_0013  Sleep job           MAPREDUCE           userA    root.dev    RUNNING     UNDEFINED      5%          http://perfnode132:34714
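> The failover itself was induced by stopping the active RM. A minimal sketch, assuming
> it is run on the RM1 host with the stock sbin scripts:
>     # Stop the active ResourceManager to force a failover to RM2
>     /opt/mapr/hadoop/hadoop-2.4.1/sbin/yarn-daemon.sh stop resourcemanager
>     # Confirm RM2 has taken over as active
>     yarn rmadmin -getServiceState rm2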
> The problem is this:
> - The jobs that were in RUNNING state before the failover moved back to ACCEPTED afterwards.
> - The jobs that were in ACCEPTED state before the failover moved to RUNNING afterwards.
> In other words, the scheduler did not preserve the original submission order when
> re-admitting the recovered applications, so apps are effectively scheduled in random
> order after an RM failover.
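> The swap can be confirmed per application with yarn application -status, comparing
> the reported state before and after the failover (app id taken from the listings above):
>     # Prints the application report, including State and Final-State
>     yarn application -status application_1408572781346_0006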



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
