hadoop-yarn-dev mailing list archives

From "Ashwin Shankar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-2959) Fair Scheduler "fifo" option can violate FIFO behavior and cause deadlock among jobs
Date Sat, 13 Dec 2014 00:19:14 GMT
Ashwin Shankar created YARN-2959:
------------------------------------

             Summary: Fair Scheduler "fifo" option can violate FIFO behavior and cause deadlock among jobs
                 Key: YARN-2959
                 URL: https://issues.apache.org/jira/browse/YARN-2959
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
            Reporter: Ashwin Shankar


We have a cluster that runs jobs in FIFO order (due to the nature of those jobs) using the Fair
Scheduler's "fifo" option.
Recently we found jobs deadlocked in the cluster. Here is what happened:
There were two jobs, say A and B. A was submitted before B.
Both were in the PENDING state since the cluster was busy.
When containers freed up, the two pending jobs got their AM containers at about the same time.

However, Job B's AM (appattempt1) registered with the RM a little earlier than Job A's and grabbed
the containers available at that time, which satisfied only a fraction of its requirement. Note that
Job B can't make progress until its entire requirement is satisfied.
Next, Job A's appattempt1 registered with the RM. Since Job A was submitted earlier, the RM stopped
allocating containers to Job B and started allocating to Job A, satisfying only a fraction of its
requirement as well.
Now Job A and Job B together hold the entire cluster, but neither can make progress: they are
deadlocked because their resource requests are only partially satisfied.

Note: The above is an example with 2 jobs; however, the deadlock can happen with n jobs J1..Jn
(submitted in that order) if the AMs register in the order Jn, J(n-1), ..., J1.
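
The two-job sequence above can be reproduced with a toy model. This is only a sketch under stated assumptions, not Fair Scheduler code: the class and function names, the tick-based timing, and the container counts (4 containers free up first, then 6, and each job needs 8 before it can progress) are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int    # when the app was submitted
    register_time: int  # tick at which its AM registers with the RM
    need: int           # containers required before the job can make progress
    held: int = 0
    done: bool = False


def simulate(jobs, freed_per_tick, sort_key, max_ticks=20):
    """Run the toy scheduler; return the names of jobs that finished, in order."""
    free = 0
    finished = []
    for tick in range(max_ticks):
        # Containers released by unrelated work already running in the cluster.
        if tick < len(freed_per_tick):
            free += freed_per_tick[tick]
        # A fully satisfied job runs to completion and returns its containers.
        for job in jobs:
            if not job.done and job.held == job.need:
                job.done = True
                free += job.held
                job.held = 0
                finished.append(job.name)
        # Only apps whose AM has registered can be allocated containers.
        runnable = [j for j in jobs if not j.done and j.register_time <= tick]
        for job in sorted(runnable, key=sort_key):
            grant = min(free, job.need - job.held)
            job.held += grant
            free -= grant
    return finished


def make_jobs():
    # A is submitted first, but B's AM registers first (assumed timings).
    return [Job("A", submit_time=0, register_time=1, need=8),
            Job("B", submit_time=1, register_time=0, need=8)]


jobs = make_jobs()
stuck = simulate(jobs, freed_per_tick=[4, 6], sort_key=lambda j: j.submit_time)
print(stuck, [(j.name, j.held) for j in jobs])
# prints: [] [('A', 6), ('B', 4)]
```

With submit-time ordering, B holds 4 containers and A holds 6, so the 10-container toy cluster is fully occupied and neither job ever finishes. In this same model, passing `sort_key=lambda j: j.register_time` lets B complete first and then A.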
 
Solution: One proposed solution is to order the fifo queue by appattempt start/register time
instead of app submit time.
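
The difference between the two orderings can be sketched as follows. This is an illustration only, not the Fair Scheduler's actual comparator: the `App` type, its field names, and the timestamps are hypothetical, and breaking ties on submit time is an assumption the proposal does not specify.

```python
from typing import NamedTuple


class App(NamedTuple):
    name: str
    submit_time: int       # when the app was submitted (hypothetical units)
    am_register_time: int  # when its appattempt registered with the RM


def fifo_by_submit(apps):
    # Current behavior as described in this report: order by app submit time.
    return sorted(apps, key=lambda a: a.submit_time)


def fifo_by_register(apps):
    # Proposed behavior: order by AM register time.
    # Tie-breaking on submit time is an assumption made for this sketch.
    return sorted(apps, key=lambda a: (a.am_register_time, a.submit_time))


# Three apps submitted as J1, J2, J3 whose AMs register in reverse order.
apps = [App("J1", 1, 30), App("J2", 2, 20), App("J3", 3, 10)]

print([a.name for a in fifo_by_submit(apps)])    # prints: ['J1', 'J2', 'J3']
print([a.name for a in fifo_by_register(apps)])  # prints: ['J3', 'J2', 'J1']
```

Under the register-time ordering, the app that started receiving containers first stays at the head of the queue, so a later-registering app does not cut in front of it while it is partially satisfied.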





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
