hadoop-mapreduce-user mailing list archives

From rajesh balamohan <rbalamoha...@gmail.com>
Subject Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout not working)
Date Tue, 20 Dec 2011 03:29:08 GMT
Hi John,

Which JVM version are you using (JDK 1.6.0.2xx?), and what JVM arguments
do you use when spawning the map/reduce slots?

Check whether the task JVM is stuck on the machine. I have sometimes seen
a task JVM launch, go into a spin, and occupy 100% CPU.

Can you check if this is the case?
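
A minimal sketch of one way to look for this, assuming the task JVMs show
up in `ps` with the main class org.apache.hadoop.mapred.Child (as in
Hadoop 0.20.x). The function names and the 90% threshold are illustrative
choices, not anything Hadoop provides. Note that the %CPU column in `ps` is
an average over the process lifetime, so this only flags JVMs that have
been busy since launch.

```python
import subprocess

# Assumption: task JVMs are launched with this main class (Hadoop 0.20.x).
CHILD_MAIN_CLASS = "org.apache.hadoop.mapred.Child"


def find_spinning_tasks(ps_output, cpu_threshold=90.0):
    """Return [(pid, cpu_percent)] for task JVMs at or above cpu_threshold.

    ps_output is the text printed by `ps -eo pid,pcpu,args`.
    """
    suspects = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)  # pid, %cpu, full command line
        if len(fields) < 3:
            continue
        pid, cpu, args = fields
        if CHILD_MAIN_CLASS not in args:
            continue
        try:
            entry = (int(pid), float(cpu))
        except ValueError:
            continue  # ignore rows that do not parse
        if entry[1] >= cpu_threshold:
            suspects.append(entry)
    return suspects


def report():
    """Run ps on this node and print any suspicious task JVMs."""
    out = subprocess.check_output(
        ["ps", "-eo", "pid,pcpu,args"], universal_newlines=True
    )
    for pid, cpu in find_spinning_tasks(out):
        print("task JVM pid=%d cpu=%.1f%% (try: jstack %d)" % (pid, cpu, pid))
```

Calling report() on a tasktracker node lists candidate PIDs; a jstack of
each one should then show what the task is actually doing.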

~Rajesh Balamohan



On Fri, Dec 16, 2011 at 2:26 AM, John Miller <jmiller@mybuys.com> wrote:

> Hello Arun,
>
> Thanks for the quick reply. I totally understand the CDH issue, but I
> figured I’d ask the broader community as well in case there was a known
> upstream issue, as I’ve noticed some patches relating to “somewhat
> similar” issues.
>
> The jstack was already on my radar, but I hadn’t even thought about using
> tcpdump to check whether the tasks were heartbeating, so thanks for the
> tip; I will make sure to check that out! We are also planning to update
> our release from CDH 3u0 to CDH 3u2, which will give us the updated hadoop
> 0.20.2+923.142 vs. our current 0.20.2+923.21. That may inadvertently fix
> the issue as well; if it does, I’ll let everyone here know.
>
> If anyone has further ideas or has experienced a similar issue, my ears
> are open. Thanks again Arun! :)
>
> John Miller | Sr. Linux Systems Administrator
> 530 E. Liberty St.
> Ann Arbor, MI 48104
> Direct: 734.922.7007
> http://mybuys.com/
>
> From: Arun C Murthy [mailto:acm@hortonworks.com]
> Sent: Thursday, December 15, 2011 2:03 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Tasktracker Task Attempts Stuck (mapreduce.task.timeout
> not working)
>
> Hi John,
>
> It's hard for folks on this list to diagnose CDH (you might have to ask
> their lists). However, I haven't seen similar issues with hadoop-0.20.2xx
> in a while.
>
> One thing to check would be to grab a stack trace (jstack) on the tasks
> to see what they are up to. Next, try to get a tcpdump to see if the tasks
> are indeed sending heartbeats to the TT, which might be the reason the TTs
> aren't timing them out.
>
> hth,
>
> Arun
>
> On Dec 15, 2011, at 7:58 AM, John Miller wrote:
>
> I’ve recently come across some interesting behavior in a 50-node cluster
> regarding the tasktrackers and task attempts. Essentially, tasks are
> being created but they stick at 0.0%, and it seems the
> ‘mapreduce.task.timeout’ isn’t taking effect: they just sit there (for
> days if we let them) and the jobs have to be killed. It’s interesting to
> note that the HDFS datanode service and HBase regionserver running on
> these nodes work fine, and we’ve simply been shutting down the
> tasktracker service on them to get around jobs stalling forever.
>
> Some historical information: we’re running Cloudera’s cdh3u0 release, and
> this has so far only happened on a handful of random tasktracker nodes.
> It seems to have affected only those that were taken down for maintenance
> and then brought back into the cluster; alternatively, one node was
> brought into the cluster after it had been running for a while and we ran
> into the same issue. After re-adding the nodes to the cluster, the
> tasktracker service starts getting these stalls. Note also that this has
> not happened to every node that has been taken out of service for a time
> and then re-added; I would say about a third of them have run into this
> issue after maintenance. The particular maintenance issues on the affected
> nodes were NOT the same, i.e. one was bad RAM, another was a bad sector on
> a disk, etc. It was never the same initial problem, only the same outcome
> after rejoining the cluster.
>
> It’s also never the same mapred job that sticks, nor is there any
> evidence tying the stalls to a specific time of day. Rather, the node
> will run fine for many jobs and then all of a sudden some tasks will
> stall and stick at 0.0%. There are no visible errors in the log output,
> but nothing moves forward, and the mappers are not released for other
> jobs to use until the stalled job is killed. It seems that the default
> ‘mapreduce.task.timeout’ just isn’t working for some reason.
>
> Has anyone come across anything similar to this? I can provide more
> details/data as needed.
>
> John Miller | Sr. Linux Systems Administrator
> 530 E. Liberty St.
> Ann Arbor, MI 48104
> Direct: 734.922.7007
> http://mybuys.com/
>
