Mailing-List: contact yarn-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-dev@hadoop.apache.org
Date: Sat, 13 Dec 2014 00:19:14 +0000 (UTC)
From: "Ashwin Shankar (JIRA)" <jira@apache.org>
To: yarn-dev@hadoop.apache.org
Message-ID: <JIRA.12761504.1418429951000.14450.1418429954664@Atlassian.JIRA>
In-Reply-To: <JIRA.12761504.1418429951000@Atlassian.JIRA>
References: <JIRA.12761504.1418429951000@Atlassian.JIRA>
 <JIRA.12761504.1418429951753@arcas>
Subject: [jira] [Created] (YARN-2959) Fair Scheduler "fifo" option can
 violate FIFO behavior and cause deadlock among jobs
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Ashwin Shankar created YARN-2959:
------------------------------------

             Summary: Fair Scheduler "fifo" option can violate FIFO behavior and cause deadlock among jobs
                 Key: YARN-2959
                 URL: https://issues.apache.org/jira/browse/YARN-2959
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
            Reporter: Ashwin Shankar


We have a cluster that runs jobs in FIFO order (due to the nature of those jobs) using the Fair Scheduler's "fifo" option.
Recently we found jobs deadlocked in the cluster. Here is what happened:
There were two jobs, say A and B. A was submitted before B.
Both were in PENDING state since the cluster was busy.
When containers freed up, the two pending jobs got their AM containers at about the same time. 
However, Job B's AM (appattempt1) registered with the RM a little earlier than Job A's and grabbed the containers available at that time, satisfying a fraction of its requirement. Note that Job B can't make progress until its entire requirement is satisfied.
Next, Job A's appattempt1 registered with the RM, and since Job A was submitted earlier, the RM stopped allocating containers to Job B and started allocating to Job A, satisfying a fraction of its requirement as well.
Now Job A and Job B together hold the entire cluster, but neither can make progress: they are deadlocked because their resource requests are only partially satisfied.

Note: The above is an example with two jobs; however, the deadlock can happen with n jobs J1..Jn if the AMs register in the order Jn, J(n-1), .., J1.
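
The sequence described above can be reproduced with a small toy model. This is only a sketch, not Fair Scheduler code: the Job class, the one-container-per-tick allocation loop, and the capacities and demands are all assumptions chosen for illustration. It assumes every job needs its full demand before it can make progress, and that only jobs whose AM has registered can receive containers.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int    # when the application was submitted
    register_time: int  # tick at which its AM registers with the RM
    demand: int         # containers needed before the job can progress
    held: int = 0       # containers currently allocated to the job


def simulate_fifo_by_submit_time(jobs, capacity):
    """Give one container per tick to the earliest-submitted registered job
    that still has unmet demand. Returns the number of free containers left."""
    free = capacity
    tick = 0
    while free > 0:
        # Only jobs whose AM has registered can ask for containers.
        candidates = [j for j in jobs
                      if j.register_time <= tick and j.held < j.demand]
        if not candidates and all(j.register_time <= tick for j in jobs):
            break  # every job has registered and is fully satisfied
        if candidates:
            first = min(candidates, key=lambda j: j.submit_time)
            first.held += 1
            free -= 1
        tick += 1
    return free


def is_deadlocked(jobs, free):
    # Deadlock in this model: nothing is free and no job holds its full demand.
    return free == 0 and all(j.held < j.demand for j in jobs)


# Two jobs: A is submitted first, but B's AM registers first.
two_jobs = [Job("A", submit_time=0, register_time=4, demand=8),
            Job("B", submit_time=1, register_time=0, demand=8)]
free_two = simulate_fifo_by_submit_time(two_jobs, capacity=10)

# n jobs J1..Jn whose AMs register in the order Jn, J(n-1), .., J1.
n = 4
n_jobs = [Job(f"J{i}", submit_time=i, register_time=(n - i) * 2, demand=5)
          for i in range(1, n + 1)]
free_n = simulate_fifo_by_submit_time(n_jobs, capacity=10)

print({j.name: j.held for j in two_jobs}, is_deadlocked(two_jobs, free_two))
print({j.name: j.held for j in n_jobs}, is_deadlocked(n_jobs, free_n))
```

In both runs the later-submitted jobs pick up a few containers before the earlier-submitted ones register, and the cluster fills up with no job fully satisfied.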
 
Proposed solution: order the fifo queue by appattempt start/register time instead of app submit time.
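
To illustrate the effect of the proposed ordering, here is a minimal sketch that runs the same two-job scenario under both sort keys. It is a toy model under stated assumptions, not the Fair Scheduler's implementation: the App class, the allocate function, and the order_key parameter are hypothetical names, and allocation is simplified to one container per tick.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    submit_time: int    # app submit time (current fifo ordering key)
    register_time: int  # appattempt register time (proposed ordering key)
    demand: int         # containers needed before the app can progress
    held: int = 0


def allocate(apps, capacity, order_key):
    """Give one container per tick to the registered app with unmet demand
    that comes first under order_key. Returns the free containers left."""
    free = capacity
    tick = 0
    while free > 0:
        candidates = [a for a in apps
                      if a.register_time <= tick and a.held < a.demand]
        if not candidates and all(a.register_time <= tick for a in apps):
            break  # every app has registered and is fully satisfied
        if candidates:
            first = min(candidates, key=order_key)
            first.held += 1
            free -= 1
        tick += 1
    return free


def scenario():
    # A is submitted before B, but B's AM registers before A's.
    return [App("A", submit_time=0, register_time=4, demand=8),
            App("B", submit_time=1, register_time=0, demand=8)]


by_submit = scenario()
allocate(by_submit, capacity=10, order_key=lambda a: a.submit_time)

by_register = scenario()
allocate(by_register, capacity=10, order_key=lambda a: a.register_time)

print("submit-time order:  ", {a.name: a.held for a in by_submit})
print("register-time order:", {a.name: a.held for a in by_register})
```

Under submit-time ordering neither app reaches its demand of 8. Under register-time ordering B keeps its place at the head of the queue once A registers, so B reaches its full demand and can run; in this model A would then be served as B's containers are released.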


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)