hadoop-mapreduce-issues mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler
Date Thu, 11 Feb 2010 18:50:27 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832623#action_12832623 ]

Matei Zaharia commented on MAPREDUCE-1436:
------------------------------------------

Are you suggesting that I add a JobTracker lock in update() or in the JobListener methods?
I think it's best to add it in update(), because update() is also called from a separate thread.
Those calls are fairly infrequent now (they used to happen every few seconds, but since
MAPREDUCE-706 they happen every 15 seconds, and the interval can be set higher pretty safely).

BTW, I found another deadlock that seems to be much rarer (it happened when I was submitting
about 50 jobs simultaneously) but is not related to preemption:

<code>

Found one Java-level deadlock:
=============================
"IPC Server handler 24 on 9001":
  waiting to lock monitor 0x0000000040c91750 (object 0x00007fc0243e2c20, a org.apache.hadoop.mapred.JobTracker),
  which is held by "IPC Server handler 0 on 9001"
"IPC Server handler 0 on 9001":
  waiting to lock monitor 0x0000000040bc0770 (object 0x00007fc0243e3080, a org.apache.hadoop.mapred.FairScheduler),
  which is held by "FairScheduler update thread"
"FairScheduler update thread":
  waiting to lock monitor 0x000000004095dd98 (object 0x00007fc0258bc0d0, a org.apache.hadoop.mapred.JobInProgress),
  which is held by "IPC Server handler 0 on 9001"

Java stack information for the threads listed above:
===================================================
"IPC Server handler 24 on 9001":
	at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2487)
	- waiting to lock <0x00007fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"IPC Server handler 0 on 9001":
	at org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:2115)
	- waiting to lock <0x00007fc0243e3080> (a org.apache.hadoop.mapred.FairScheduler)
	- locked <0x00007fc0243e3420> (a java.util.TreeMap)
	- locked <0x00007fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
	at org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:2510)
	- locked <0x00007fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.JobInProgress.jobComplete(JobInProgress.java:2146)
	at org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2084)
	- locked <0x00007fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:883)
	- locked <0x00007fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3564)
	at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:2758)
	- locked <0x00007fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
	at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2553)
	- locked <0x00007fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"FairScheduler update thread":
	at org.apache.hadoop.mapred.JobInProgress.scheduleReduces(JobInProgress.java:1203)
	- waiting to lock <0x00007fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
	at org.apache.hadoop.mapred.JobSchedulable.updateDemand(JobSchedulable.java:53)
	at org.apache.hadoop.mapred.PoolSchedulable.updateDemand(PoolSchedulable.java:81)
	at org.apache.hadoop.mapred.FairScheduler.update(FairScheduler.java:577)
	- locked <0x00007fc0243e3080> (a org.apache.hadoop.mapred.FairScheduler)
	at org.apache.hadoop.mapred.FairScheduler$UpdateThread.run(FairScheduler.java:277)
</code>

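The problem in this one is that updateDemand() has to lock the jobs (briefly) while update()
already holds the FairScheduler lock, whereas the heartbeat path locks the JobTracker and the
JobInProgress first and then waits for the FairScheduler. The job locking could be factored out
above the other code in update(), but it seems safer to just lock the JobTracker in all of update().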
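
A minimal sketch of the lock-ordering idea behind that suggestion, not the actual patch.
The three monitor objects and the methods heartbeat(), update(), and runDemo() are
hypothetical stand-ins for the JobTracker, FairScheduler, and JobInProgress locks seen in
the dump. It assumes every path that needs the inner locks takes the JobTracker lock first,
so the opposite ordering of the two inner locks can no longer produce a cycle.

```java
public class LockOrderSketch {
    // Hypothetical stand-ins for the three monitors in the thread dump.
    static final Object jobTracker = new Object();
    static final Object fairScheduler = new Object();
    static final Object jobInProgress = new Object();

    // Guarded by jobTracker.
    static int completedSteps = 0;

    // Models the heartbeat path: JobTracker, then JobInProgress, then FairScheduler.
    static void heartbeat() {
        synchronized (jobTracker) {
            synchronized (jobInProgress) {
                synchronized (fairScheduler) {
                    completedSteps++;
                }
            }
        }
    }

    // Models update() with the suggested change: the JobTracker lock is taken
    // around the whole method, before the FairScheduler and JobInProgress locks.
    static void update() {
        synchronized (jobTracker) {
            synchronized (fairScheduler) {
                synchronized (jobInProgress) {
                    completedSteps++;
                }
            }
        }
    }

    // Runs both paths concurrently. Returns the number of completed steps,
    // or -1 if a thread is still stuck after the timeout or the caller is interrupted.
    static int runDemo() {
        completedSteps = 0;
        Thread hb = new Thread(() -> { for (int i = 0; i < 1000; i++) heartbeat(); }, "heartbeat");
        Thread up = new Thread(() -> { for (int i = 0; i < 1000; i++) update(); }, "update");
        hb.setDaemon(true);
        up.setDaemon(true);
        hb.start();
        up.start();
        try {
            hb.join(10_000);
            up.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        if (hb.isAlive() || up.isAlive()) {
            return -1;
        }
        return completedSteps;
    }

    public static void main(String[] args) {
        System.out.println("completed steps: " + runDemo());
    }
}
```

Because both paths serialize on the outermost lock, only one of them can be inside the inner
locks at a time. The cost is that update() holds the JobTracker lock for its whole duration,
which is more tolerable when update() runs infrequently.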

> Deadlock in preemption code in fair scheduler
> ---------------------------------------------
>
>                 Key: MAPREDUCE-1436
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1436
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/fair-share
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>            Priority: Blocker
>         Attachments: deadlock.png, mapreduce-1436.patch
>
>
> In testing the fair scheduler with preemption, I found a deadlock between updatePreemptionVariables
> and some code in the JobTracker. This was found while testing a backport of the fair scheduler
> to Hadoop 0.20, but it looks like it could also happen in trunk and 0.21. Details are in a
> comment below.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

