hadoop-mapreduce-user mailing list archives

From Arun C Murthy <...@hortonworks.com>
Subject Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)
Date Thu, 15 Dec 2011 19:03:27 GMT
Hi John,

 It's hard for folks on this list to diagnose CDH (you might have to ask their lists). However,
I haven't seen similar issues with hadoop-0.20.2xx in a while.

 One thing to check would be to grab a stack trace (jstack) on the tasks to see what they
are up to. Next, try to get a tcpdump to see if the tasks are indeed sending heartbeats to the
TT, which might be the reason the TTs aren't timing them out.
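
 A rough sketch of the jstack step, in case it helps. It assumes the task JVMs show up
in `jps -l` with the main class org.apache.hadoop.mapred.Child (my assumption for
0.20-based builds; adjust for your install), and that jps and jstack are on the PATH and
run as the same user as the tasks. The function names are only illustrative.

```python
import subprocess

# Assumption: MR1 task JVMs are launched with this main class.
CHILD_MAIN_CLASS = "org.apache.hadoop.mapred.Child"


def child_pids(jps_output, main_class=CHILD_MAIN_CLASS):
    """Return the PIDs of JVMs running main_class, given `jps -l` output text."""
    pids = []
    for line in jps_output.splitlines():
        # Each `jps -l` line looks like "<pid> <fully.qualified.MainClass>".
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and parts[1].strip() == main_class:
            pids.append(int(parts[0]))
    return pids


def dump_stacks(pids):
    """Run jstack against each PID and return {pid: thread dump text}."""
    dumps = {}
    for pid in pids:
        result = subprocess.run(["jstack", str(pid)], capture_output=True, text=True)
        # jstack writes the dump to stdout; keep stderr if it failed to attach.
        dumps[pid] = result.stdout or result.stderr
    return dumps


def dump_all_task_stacks():
    """List running JVMs with `jps -l` and dump the stacks of the task JVMs."""
    listing = subprocess.run(["jps", "-l"], capture_output=True, text=True).stdout
    return dump_stacks(child_pids(listing))
```

 Calling dump_all_task_stacks() on a node with stuck attempts, a couple of times a few
seconds apart, should show whether the task threads are blocked in the same place.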


On Dec 15, 2011, at 7:58 AM, John Miller wrote:

> I’ve recently come across some interesting things happening within a 50-node cluster
> regarding the tasktrackers and task attempts.  Essentially, tasks are being created but they
> are sticking at 0.0%, and it seems the ‘mapreduce.task.timeout’ isn’t taking effect: they
> just sit there (for days if we let them) and the jobs have to be killed.  It’s interesting
> to note that the HDFS datanode service and HBase regionserver running on these nodes work
> fine, and we’ve simply been shutting down the tasktracker service on them in order to get
> around jobs stalling forever.
> Some historical information… We’re running Cloudera’s cdh3u0 release, and this
> has so far only happened on a handful of random tasktracker nodes. It seems to have affected
> only those that were taken down for maintenance and then brought back into the cluster;
> alternatively, one node was brought into the cluster after it had been running for a while
> and we ran into the same issue.  After the nodes are re-added to the cluster, the tasktracker
> service starts getting these stalls.  Note also that this has not happened to every node that
> has been taken out of service for a time and then re-added… I would say about a third of
> them have run into this issue after maintenance.  The particular maintenance issues on
> the affected nodes were NOT the same, i.e. one was bad RAM, another was a bad sector on a disk,
> etc… never the same initial problem, only the same outcome after rejoining the cluster.
> It’s also never the same mapred job that sticks, nor is there any evidence relating
> the stalls to a specific time of day.  Rather, the node will run fine for many jobs
> and then all of a sudden some tasks will stall and stick at 0.0%.  There are no visible
> errors in the log output, but nothing moves forward and the mappers are not released
> for other jobs to use until the stalled job is killed.  It seems that the default
> ‘mapreduce.task.timeout’ just isn’t working for some reason.
> Has anyone come across anything similar to this?  I can provide more details/data as
> John Miller  |  Sr. Linux Systems Administrator
> 530 E. Liberty St.
> Ann Arbor, MI 48104
> Direct: 734.922.7007
> http://mybuys.com/