hadoop-common-user mailing list archives

From Иван <...@mail.ru>
Subject Re[2]: Timeouts at reduce stage
Date Sat, 30 Aug 2008 02:24:21 GMT
Thank you, this suggestion seems very close to the real situation. The cluster had already
been left looping these (relatively) frequently failing MapReduce jobs for a long period of
time to produce a clearer picture of the problem, so when I read your suggestion I tried to
investigate it more closely. A look at the Ganglia monitoring system running on that same
cluster made it clear that the cluster's computing resources were exhausted. The next step
was simple and straightforward: log in to one random node and find out what was consuming
the server's resources. The answer came almost instantly, because top and jps produced a huge
list of orphaned TaskTracker$Child processes consuming most of the CPU time and RAM. Some
nodes had even run through all 16 GB of RAM plus a few GB of swap and stopped responding entirely.
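For anyone hitting the same thing, here is a rough sketch of the check I ran on a node (the
PPID-1 filter is my assumption about how the orphans look on Linux, since children of a dead
TaskTracker get re-parented to init; escaping the $ in TaskTracker$Child keeps the shell and
grep from mangling it):

```shell
# List candidate orphaned task JVMs: children whose TaskTracker parent
# has died are typically re-parented to PID 1 on Linux.
ps -eo pid,ppid,rss,args \
  | awk '$2 == 1 && /TaskTracker\$Child/ {printf "pid=%s rss=%sKB\n", $1, $3}'

# jps (ships with the JDK) shows the same processes simply as "Child":
jps | grep Child || true
```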

This situation clearly isn't normal. I am going to repeat the test with some simpler jobs
(probably something from the Hadoop distribution, to make sure the job code itself is fine)
to determine more definitely whether this orphaning of forked processes depends on the exact
MR job being run or not (theoretically it could still be something wrong with the Hadoop/HBase
configuration, or even with the operating system, some additionally installed software or,
as was suggested earlier, the hardware).
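One configuration knob that might matter here (an assumption on my part, not something
verified against this cluster) is the heap cap for the forked task JVMs in hadoop-site.xml;
if jobs were tuned to large heaps, even a few leaked children per node would add up quickly:

```xml
<!-- hadoop-site.xml sketch: cap each forked task JVM's heap.
     -Xmx512m is an illustrative value; the Hadoop 0.18 default is -Xmx200m. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```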

I would be glad if someone could help me along with some advice (googling this topic has
already proved hard, because the $ is treated as a separator and searches usually turn up
material about real child processes). Maybe this situation is quite common and there is
a known cause or solution?


Ivan Blinkov

-----Original Message-----
From: Karl Anderson <kra@monkey.org>
To: core-user@hadoop.apache.org
Date: Fri, 29 Aug 2008 13:17:18 -0700
Subject: Re: Timeouts at reduce stage

> On 29-Aug-08, at 3:53 AM, Иван wrote:
> > Thanks for the fast reply, but in fact it sometimes fails even on  
> > default MR jobs like, for example, the rowcounter job from the HBase  
> > 0.2.0 distribution. Hardware problems are theoretically possible, but  
> > they don't seem to be the case, because everything else is operating  
> > fine on the same set of servers. It seems that all major components  
> > of each server are fine; even the disk arrays are regularly checked  
> > by datacenter staff.
> It could be due to a resource problem, I've found these hard to debug  
> at times.  Tasks or parts of the framework can fail due to other tasks  
> using up resources, and sometimes the errors you see don't make the  
> cause easy to find.  I've had memory consumption in a mapper cause  
> errors in other mappers, reducers, and fetching HDFS blocks, as well  
> as job infrastructure failures that I don't really understand (for  
> example, one task unable to find a file that was put in a job jar and  
> found by other tasks).  I think all of my timeouts have been  
> straightforward, but I could imagine resource consumption causing that  
> in an otherwise unrelated task - IO blocking, swap, etc.
