Return-Path:
Delivered-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org
Received: (qmail 47831 invoked from network); 11 Feb 2010 18:50:53 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
    by minotaur.apache.org with SMTP; 11 Feb 2010 18:50:53 -0000
Received: (qmail 6501 invoked by uid 500); 11 Feb 2010 18:50:53 -0000
Delivered-To: apmail-hadoop-mapreduce-issues-archive@hadoop.apache.org
Received: (qmail 6466 invoked by uid 500); 11 Feb 2010 18:50:53 -0000
Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: mapreduce-issues@hadoop.apache.org
Delivered-To: mailing list mapreduce-issues@hadoop.apache.org
Received: (qmail 6456 invoked by uid 99); 11 Feb 2010 18:50:53 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 11 Feb 2010 18:50:53 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0
    tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 11 Feb 2010 18:50:49 +0000
Received: from brutus.apache.org (localhost [127.0.0.1])
    by brutus.apache.org (Postfix) with ESMTP id F294C29A0017
    for ; Thu, 11 Feb 2010 10:50:27 -0800 (PST)
Message-ID: <107424314.212791265914227992.JavaMail.jira@brutus.apache.org>
Date: Thu, 11 Feb 2010 18:50:27 +0000 (UTC)
From: "Matei Zaharia (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler
In-Reply-To: <356063493.4261265054298884.JavaMail.jira@brutus.apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
X-Virus-Checked: Checked by ClamAV on apache.org

[
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832623#action_12832623 ]

Matei Zaharia commented on MAPREDUCE-1436:
------------------------------------------

Are you suggesting that I add a JobTracker lock in update() or in the JobListener methods? I think it's best to add it in update() because it also gets called from a separate thread. This actually happens quite rarely now (it used to be every few seconds, but it's every 15 seconds after MAPREDUCE-706, and can be set higher pretty safely).

BTW, I found another deadlock that seems to be much rarer (it happened when I was submitting about 50 jobs simultaneously) but is not related to preemption:

Found one Java-level deadlock:
=============================
"IPC Server handler 24 on 9001":
  waiting to lock monitor 0x0000000040c91750 (object 0x00007fc0243e2c20, a org.apache.hadoop.mapred.JobTracker),
  which is held by "IPC Server handler 0 on 9001"
"IPC Server handler 0 on 9001":
  waiting to lock monitor 0x0000000040bc0770 (object 0x00007fc0243e3080, a org.apache.hadoop.mapred.FairScheduler),
  which is held by "FairScheduler update thread"
"FairScheduler update thread":
  waiting to lock monitor 0x000000004095dd98 (object 0x00007fc0258bc0d0, a org.apache.hadoop.mapred.JobInProgress),
  which is held by "IPC Server handler 0 on 9001"

Java stack information for the threads listed above:
===================================================
"IPC Server handler 24 on 9001":
    at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2487)
    - waiting to lock <0x00007fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"IPC Server handler 0 on 9001":
    at org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:2115)
    - waiting to lock <0x00007fc0243e3080> (a org.apache.hadoop.mapred.FairScheduler)
    - locked <0x00007fc0243e3420> (a java.util.TreeMap)
    - locked <0x00007fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
    at org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:2510)
    - locked <0x00007fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
    at org.apache.hadoop.mapred.JobInProgress.jobComplete(JobInProgress.java:2146)
    at org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2084)
    - locked <0x00007fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
    at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:883)
    - locked <0x00007fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
    at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3564)
    at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:2758)
    - locked <0x00007fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
    at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2553)
    - locked <0x00007fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"FairScheduler update thread":
    at org.apache.hadoop.mapred.JobInProgress.scheduleReduces(JobInProgress.java:1203)
    - waiting to lock <0x00007fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
    at org.apache.hadoop.mapred.JobSchedulable.updateDemand(JobSchedulable.java:53)
    at org.apache.hadoop.mapred.PoolSchedulable.updateDemand(PoolSchedulable.java:81)
    at org.apache.hadoop.mapred.FairScheduler.update(FairScheduler.java:577)
    - locked <0x00007fc0243e3080> (a org.apache.hadoop.mapred.FairScheduler)
    at org.apache.hadoop.mapred.FairScheduler$UpdateThread.run(FairScheduler.java:277)

The problem in this one is that updateDemand() has to lock the jobs (briefly). That could be factored out above the other code in update(), but it seems safer to just lock the JT in all of update().

> Deadlock in preemption code in fair scheduler
> ---------------------------------------------
>
>                 Key: MAPREDUCE-1436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1436
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/fair-share
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>            Priority: Blocker
>         Attachments: deadlock.png, mapreduce-1436.patch
>
>
> In testing the fair scheduler with preemption, I found a deadlock between updatePreemptionVariables and some code in the JobTracker. This was found while testing a backport of the fair scheduler to Hadoop 0.20, but it looks like it could also happen in trunk and 0.21. Details are in a comment below.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
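
For readers following the thread, the idea of locking the JT in all of update() can be illustrated with a small standalone sketch. This is not the attached patch and not Hadoop code: the class LockOrderSketch, its three plain Object monitors, and the methods update(), heartbeat() and runBoth() are hypothetical stand-ins for the JobTracker, FairScheduler and JobInProgress locks seen in the trace. It only shows why taking the same outermost lock on both paths removes the cycle, even though the two inner locks are still taken in opposite orders.

```java
// Hypothetical illustration only; none of these names come from Hadoop.
class LockOrderSketch {
    // Stand-ins for the three monitors that appear in the deadlock report.
    private final Object jobTracker = new Object();
    private final Object scheduler = new Object();
    private final Object job = new Object();
    private int updates = 0;
    private int completions = 0;

    // Models the scheduler update path with the proposed change:
    // take the JobTracker lock first, then the scheduler, then the job.
    void update() {
        synchronized (jobTracker) {
            synchronized (scheduler) {
                synchronized (job) {
                    updates++;
                }
            }
        }
    }

    // Models the heartbeat path from the trace:
    // JobTracker first, then the job, then the scheduler.
    void heartbeat() {
        synchronized (jobTracker) {
            synchronized (job) {
                synchronized (scheduler) {
                    completions++;
                }
            }
        }
    }

    // Runs both paths concurrently and reports whether both threads finished.
    static boolean runBoth(int iterations) {
        LockOrderSketch s = new LockOrderSketch();
        Thread updater = new Thread(() -> {
            for (int i = 0; i < iterations; i++) s.update();
        }, "update-thread");
        Thread handler = new Thread(() -> {
            for (int i = 0; i < iterations; i++) s.heartbeat();
        }, "ipc-handler");
        // Daemon threads so a stuck run cannot keep the JVM alive.
        updater.setDaemon(true);
        handler.setDaemon(true);
        updater.start();
        handler.start();
        try {
            updater.join(10_000);
            handler.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !updater.isAlive() && !handler.isAlive()
                && s.updates == iterations && s.completions == iterations;
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + runBoth(100_000));
    }
}
```

Because both paths hold the outer monitor before touching the inner two, only one thread at a time can be inside the scheduler/job section, so the opposite inner ordering cannot produce a cycle. Removing the synchronized (jobTracker) block from update() in this sketch recreates the scheduler-then-job versus job-then-scheduler ordering shown in the trace. The cost is coarser locking, which the comment argues is acceptable because update() runs only every 15 seconds.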