hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1970) tasktracker hang in reduce. Deadlock between main and comm thread
Date Mon, 01 Oct 2007 08:31:51 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Vivek Ratan updated HADOOP-1970:

    Attachment: 1970_patch01

Well, it's a little bigger than that. There are 2-3 methods that traverse the tree-like
structure of Progress objects, and at least two of them traverse (and obtain locks) in
opposite directions, hence the deadlock. One correct solution is to obtain locks in one
direction only, so we lock when going downwards from the root node. This happens in get(),
getInternal(), and toString(). If you need to traverse upwards towards the root (as in
complete()), you either release your own lock before getting your parent's (which is what
I've chosen to do, since we don't need transactional semantics), or you acquire locks in
the same direction as the other traversal methods.
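
To make the lock ordering concrete, here is a minimal sketch. It is not the code in
1970_patch01: the class ProgressNode, its fields, and addPhase()/set() are hypothetical
stand-ins, and the progress arithmetic is simplified. It only shows the two rules: nested
locking happens downwards only, and the upward step in complete() releases the node's own
lock before taking the parent's.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Progress node; not the code from the patch. */
class ProgressNode {
    private final ProgressNode parent;   // null for the root
    private final List<ProgressNode> children = new ArrayList<ProgressNode>();
    private int currentPhase = 0;
    private float progress = 0.0f;

    ProgressNode(ProgressNode parent) {
        this.parent = parent;
    }

    /** Adds a child phase. Locks only this node. */
    ProgressNode addPhase() {
        ProgressNode child = new ProgressNode(this);
        synchronized (this) {
            children.add(child);
        }
        return child;
    }

    /** Sets the progress of a leaf node. Locks only this node. */
    synchronized void set(float p) {
        progress = p;
    }

    /** Downward traversal: holds this node's lock, then takes a child's lock. */
    synchronized float getInternal() {
        int n = children.size();
        if (n == 0) {
            return progress;
        }
        float perPhase = 1.0f / n;
        float sub = currentPhase < n ? children.get(currentPhase).getInternal() : 0.0f;
        return perPhase * (currentPhase + sub);
    }

    /** Called on the parent by a completing child. Locks only this node. */
    private synchronized void startNextPhase() {
        currentPhase++;
    }

    /** Upward step: our own lock is released before the parent's lock is taken. */
    void complete() {
        synchronized (this) {
            progress = 1.0f;
        }
        // No lock is held here, so no thread ever holds a child's lock
        // while waiting for its parent's lock.
        if (parent != null) {
            parent.startNextPhase();
        }
    }
}
```

The trade-off is that complete() is no longer atomic across the node and its parent: a
reader can observe the child as finished before the parent has advanced its phase. That
is acceptable only because transactional semantics are not required here.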

Another somewhat related issue is that Progress::get(), which is called quite often, always
traverses upwards to find the root of the structure. Since a node's root never changes, it
should be cached at each node. This certainly improves performance for get(), but it also
offers a synchronization mechanism should we ever need to write code that locks multiple
nodes and traverses upwards towards the root. In such a case, the methods can lock the root
object to get exclusive access to the entire structure. We don't need this for now, but it's
a good mechanism to have for the future.
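
A rough sketch of the root-caching idea follows. Again, this is not the patch itself:
RootCachingNode, markDoneUpwards(), and the done flag are hypothetical names invented for
illustration. The sketch assumes the tree is built top-down, so a parent's root is already
known when a child is constructed. Note that the root lock gives exclusive access only if
every method that touches the shared state takes that same lock.

```java
/** Hypothetical node that caches its root; not the code from the patch. */
class RootCachingNode {
    private final RootCachingNode parent;   // null for the root
    private final RootCachingNode root;     // fixed at construction, never changes
    private boolean done = false;

    RootCachingNode(RootCachingNode parent) {
        this.parent = parent;
        this.root = (parent == null) ? this : parent.root;
    }

    /** Constant-time lookup of the cached root. */
    RootCachingNode getRoot() {
        return root;
    }

    /** For comparison: the uncached lookup walks parent links on every call. */
    RootCachingNode findRootByWalking() {
        RootCachingNode n = this;
        while (n.parent != null) {
            n = n.parent;
        }
        return n;
    }

    /** Upward traversal over several nodes, guarded by the single root lock. */
    void markDoneUpwards() {
        synchronized (root) {
            for (RootCachingNode n = this; n != null; n = n.parent) {
                n.done = true;
            }
        }
    }

    /** Readers take the same root lock, so they never see a half-finished update. */
    boolean isDone() {
        synchronized (root) {
            return done;
        }
    }
}
```

Because every thread contends for one lock object, there is no lock ordering to get wrong,
at the cost of serializing all access to the tree.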

I have attached a patch (1970_patch01) with these changes, along with lots of comments for

> tasktracker hang in reduce. Deadlock between main and comm thread
> -----------------------------------------------------------------
>                 Key: HADOOP-1970
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1970
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.1
>            Reporter: Koji Noguchi
>            Assignee: Vivek Ratan
>            Priority: Blocker
>             Fix For: 0.14.2
>         Attachments: 1970_patch01
> Saw one reduce task stuck on copy.
> jstack on the reduce task(task_200709272248_0001_r_000150_0)  process showed 
> {noformat} 
> Found one Java-level deadlock:
> =============================
> "Comm thread for task_200709272248_0001_r_000150_0":
>   waiting to lock monitor 0x08144020 (object 0xd4e30aa8, a org.apache.hadoop.util.Progress),
>   which is held by "main"
> "main":
>   waiting to lock monitor 0x08144084 (object 0xd4e30958, a org.apache.hadoop.util.Progress),
>   which is held by "Comm thread for task_200709272248_0001_r_000150_0"
> Java stack information for the threads listed above:
> ===================================================
> "Comm thread for task_200709272248_0001_r_000150_0":
>         at org.apache.hadoop.util.Progress.toString(Progress.java:113)
>         - waiting to lock <0xd4e30aa8> (a org.apache.hadoop.util.Progress)
>         at org.apache.hadoop.util.Progress.toString(Progress.java:116)
>         - locked <0xd4e30958> (a org.apache.hadoop.util.Progress)
>         at org.apache.hadoop.util.Progress.toString(Progress.java:108)
>         at org.apache.hadoop.mapred.Task$1.run(Task.java:268)
>         at java.lang.Thread.run(Thread.java:619)
> "main":
>         at org.apache.hadoop.util.Progress.startNextPhase(Progress.java:58)
>         - waiting to lock <0xd4e30958> (a org.apache.hadoop.util.Progress)
>         at org.apache.hadoop.util.Progress.complete(Progress.java:70)
>         - locked <0xd4e30aa8> (a org.apache.hadoop.util.Progress)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:253)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1777)
> {noformat} 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
