hadoop-yarn-dev mailing list archives

From "Sandy Ryza (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (YARN-484) Fair Scheduler preemption fails if the other queue has a mapreduce job with some tasks in excess of cluster capacity
Date Fri, 29 Mar 2013 21:33:15 GMT

     [ https://issues.apache.org/jira/browse/YARN-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza resolved YARN-484.
-----------------------------

    Resolution: Cannot Reproduce
    
> Fair Scheduler preemption fails if the other queue has a mapreduce job with some tasks in excess of cluster capacity
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-484
>                 URL: https://issues.apache.org/jira/browse/YARN-484
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>         Environment: Mac OS X; CDH4.1.2; CDH4.2.0
>            Reporter: Vitaly Kruglikov
>            Assignee: Sandy Ryza
>              Labels: hadoop
>
> This is reliably reproduced while running CDH4.1.2 or CDH4.2.0 on a single Mac OS X machine.
> # Two queues are configured: cjmQ and slotsQ. Both queues are configured with tiny minResources. The intention is for the task(s) of the job in cjmQ to be able to preempt tasks of the job in slotsQ.
> # yarn.nodemanager.resource.memory-mb = 24576
> # First, a long-running 6-map-task (0 reducers) mapreduce job is started in slotsQ with mapreduce.map.memory.mb=4096. Because MRAppMaster's container consumes some memory, only 5 of its 6 map tasks are able to start; the 6th is pending and will never run.
> # Then, a short-running 1-map-task (0 reducers) mapreduce job is submitted via cjmQ with mapreduce.map.memory.mb=2048.
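The capacity arithmetic behind the pending sixth task can be sketched as follows (container sizes are taken from the report; the MRAppMaster container size is an assumption, since the report does not state it):

```python
# Node capacity and map container size, in MB, from the report.
node_capacity_mb = 24576
map_container_mb = 4096
am_container_mb = 2048  # assumed MRAppMaster container size; not given in the report

requested_maps = 6

# Memory left for map containers after the AM container is placed.
available_mb = node_capacity_mb - am_container_mb

running_maps = min(requested_maps, available_mb // map_container_mb)
pending_maps = requested_maps - running_maps

print(running_maps, pending_maps)  # 5 maps fit; the 6th stays pending forever
```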
> Expected behavior:
> At this point, because the minimum share of cjmQ had not been met, I expected Fair Scheduler to preempt one of the executing map tasks of the single slotsQ mapreduce job to make room for the single map task of the cjmQ mapreduce job. However, Fair Scheduler didn't preempt any of the running map tasks of the slotsQ job; instead, the cjmQ job was starved perpetually. Since slotsQ had far more than its minimum share allocated and running, while cjmQ was far below its minimum share (0, actually), Fair Scheduler should have started preempting, regardless of there being one task container from the slotsQ job (the 6th map container) that was never allocated.
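The min-share preemption check the reporter expected can be sketched like this (a simplification for illustration, not the actual FairScheduler implementation; field names are hypothetical):

```python
import time

def should_preempt_for(queue, now=None):
    """Return True if `queue` has been below its configured min share for
    longer than its minSharePreemptionTimeout (simplified sketch)."""
    now = now if now is not None else time.time()
    if queue["usage_mb"] >= queue["min_share_mb"]:
        return False  # queue is at or above its min share; no preemption
    return now - queue["below_min_since"] > queue["timeout_s"]

# cjmQ as described in the report: min share 2048 MB, usage 0,
# minSharePreemptionTimeout of 5 seconds, below min share since t=0.
cjmQ = {"usage_mb": 0, "min_share_mb": 2048,
        "below_min_since": 0.0, "timeout_s": 5}

print(should_preempt_for(cjmQ, now=10.0))  # True: preemption should fire
```

By this reasoning the scheduler should preempt for cjmQ after 5 seconds, independent of whether slotsQ has an unallocated pending container.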
> Additional useful info:
> # If I submit a second 1-map-task mapreduce job via cjmQ, the first cjmQ mapreduce job gets scheduled and its state changes to RUNNING; once that first job completes, the second job submitted via cjmQ is starved until a third job is submitted into cjmQ, and so on. This happens regardless of the values of maxRunningApps in the queue configurations.
> # If, instead of requesting 6 map tasks for the slotsQ job, I only request 5 so that everything fits nicely into yarn.nodemanager.resource.memory-mb - without that 6th pending but never-running task - then preemption works as I would have expected. However, I cannot rely on this arrangement: in a production cluster running at full capacity, if a machine dies, the mapreduce job from slotsQ will request new containers for the failed tasks, and because the cluster was already at capacity, those containers will end up pending and never run, recreating my original scenario of the starving cjmQ job.
> # I initially wrote this up on https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/0zv62pkN5lM, so it would be good to update that group with the resolution.
> Configuration:
> In yarn-site.xml:
> {code}
>   <property>
>     <description>Scheduler plug-in class to use instead of the default scheduler.</description>
>     <name>yarn.resourcemanager.scheduler.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>   </property>
> {code}
> fair-scheduler.xml:
> {code}
> <configuration>
> <!-- Site specific FairScheduler configuration properties -->
>   <property>
>     <description>Absolute path to allocation file. An allocation file is an XML
>     manifest describing queues and their properties, in addition to certain
>     policy defaults. This file must be in XML format as described in
>     http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
>     </description>
>     <name>yarn.scheduler.fair.allocation.file</name>
>     <value>[obfuscated]/current/conf/site/default/hadoop/fair-scheduler-allocations.xml</value>
>   </property>
>   <property>
>     <description>Whether to use preemption. Note that preemption is experimental
>     in the current version. Defaults to false.</description>
>     <name>yarn.scheduler.fair.preemption</name>
>     <value>true</value>
>   </property>
>   <property>
>     <description>Whether to allow multiple container assignments in one
>     heartbeat. Defaults to false.</description>
>     <name>yarn.scheduler.fair.assignmultiple</name>
>     <value>true</value>
>   </property>
>   
> </configuration>
> {code}
> My fair-scheduler-allocations.xml:
> {code}
> <allocations>
>   <queue name="cjmQ">
>     <!-- minimum amount of aggregate memory; TODO which units??? -->
>     <minResources>2048</minResources>
>     <!-- limit the number of apps from the queue to run at once -->
>     <maxRunningApps>1</maxRunningApps>
>     
>     <!-- either "fifo" or "fair" depending on the in-queue scheduling policy
>     desired -->
>     <schedulingMode>fifo</schedulingMode>
>     <!-- Number of seconds after which the pool can preempt other pools'
>       tasks to achieve its min share. Requires preemption to be enabled in
>       mapred-site.xml by setting mapred.fairscheduler.preemption to true.
>       Defaults to infinity (no preemption). -->
>     <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
>     <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
>     <weight>1.0</weight>
>   </queue>
>   <queue name="slotsQ">
>     <!-- minimum amount of aggregate memory; TODO which units??? -->
>     <minResources>1</minResources>
>     <!-- limit the number of apps from the queue to run at once -->
>     <maxRunningApps>1</maxRunningApps>
>     <!-- Number of seconds after which the pool can preempt other pools'
>       tasks to achieve its min share. Requires preemption to be enabled in
>       mapred-site.xml by setting mapred.fairscheduler.preemption to true.
>       Defaults to infinity (no preemption). -->
>     <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
>     <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
>     <weight>1.0</weight>
>   </queue>
>   
>   <!-- number of seconds a queue is under its fair share before it will try to
>   preempt containers to take resources from other queues. -->
>   <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>
> </allocations>
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
