Date: Sat, 13 Dec 2014 00:19:14 +0000 (UTC)
From: "Ashwin Shankar (JIRA)"
To: yarn-dev@hadoop.apache.org
Reply-To: yarn-dev@hadoop.apache.org
Subject: [jira] [Created] (YARN-2959) Fair Scheduler "fifo" option can violate FIFO behavior and cause deadlock among jobs

Ashwin Shankar created YARN-2959:
------------------------------------

             Summary: Fair Scheduler "fifo" option can violate FIFO behavior and cause deadlock among jobs
                 Key: YARN-2959
                 URL: https://issues.apache.org/jira/browse/YARN-2959
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
            Reporter: Ashwin Shankar

We have a cluster that runs jobs in FIFO order (due to the nature of those jobs) using the Fair Scheduler's "fifo" option. Recently we found jobs deadlocked in the cluster. Here is what happened:

There were two jobs, say A and B. A was submitted before B. Both were in the PENDING state because the cluster was busy.
When containers freed up, the two pending jobs got their AM containers at about the same time. However, Job B's AM (appattempt1) registered with the RM a little earlier than Job A's and grabbed the containers available at that time, which satisfied only a fraction of its requirement. Note that Job B can't make progress until its entire requirement is satisfied.

Next, Job A's appattempt1 registered with the RM. Since Job A was submitted earlier, the RM stopped allocating containers to Job B and started allocating to Job A, satisfying a fraction of its requirement as well.

Now Job A and Job B together hold the entire cluster, but neither can make progress: they are deadlocked because their resource requests are only partially satisfied.

Note: The above is an example with two jobs; however, the deadlock can happen with n jobs J1..Jn if the sequence of AM registration is Jn, J(n-1), ..., J1.

Solution: one proposed solution is to order the fifo queue by appattempt start/register time instead of app submit time.
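
The scenario and the proposed ordering change can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the Fair Scheduler's code: the `Job` class, the `simulate` function, and all numbers (8 containers freeing up in two batches of 4, each job needing 6) are hypothetical, AM containers are ignored, and a job is assumed to run and release its containers only once its whole request is met.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int    # when the app was submitted
    register_time: int  # tick at which its AM (appattempt) registers with the RM
    demand: int         # containers needed before the job can make progress
    held: int = 0
    done: bool = False


def simulate(jobs, arrivals, order_key, max_ticks=10):
    """Toy strict-FIFO allocator. Returns job names in completion order.

    arrivals[t] is the number of containers freed by other work at tick t
    (an illustrative assumption, not something taken from the report).
    """
    free = 0
    finished = []
    for tick in range(max_ticks):
        if tick < len(arrivals):
            free += arrivals[tick]
        # Only jobs whose AM has registered can be given containers.
        active = [j for j in jobs if not j.done and j.register_time <= tick]
        # Serve the head of the queue first, then the rest, in queue order.
        for job in sorted(active, key=order_key):
            grant = min(free, job.demand - job.held)
            job.held += grant
            free -= grant
        # A job runs only with its full request; it then finishes and
        # releases everything it holds.
        for job in active:
            if job.held == job.demand:
                job.done = True
                free += job.held
                job.held = 0
                finished.append(job.name)
    return finished


def make_jobs():
    # A is submitted first, but B's AM registers first.
    return [
        Job("A", submit_time=0, register_time=1, demand=6),
        Job("B", submit_time=1, register_time=0, demand=6),
    ]


by_submit = simulate(make_jobs(), [4, 4], lambda j: j.submit_time)
by_register = simulate(make_jobs(), [4, 4], lambda j: j.register_time)
print(by_submit)    # [] : A and B each hold 4 of 8 containers, neither finishes
print(by_register)  # ['B', 'A']
```

In this model, ordering by submit time leaves both jobs stuck at 4 containers each, while ordering by register time lets B complete and release its containers to A. The second ordering does run B ahead of the earlier-submitted A, which is the trade-off implied by the proposal.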