hadoop-general mailing list archives

From Allen Wittenauer ...@apache.org>
Subject Re: Stability issue - dead DN's
Date Wed, 11 May 2011 16:45:44 GMT

On May 11, 2011, at 5:57 AM, Eric Fiala wrote:
> If we do the math, that means [ map.tasks.max * mapred.child.java.opts ] +
> [ reduce.tasks.max * mapred.child.java.opts ], i.e. [ 4 * 2.5G ] + [ 4 *
> 2.5G ] = 20G, is greater than the amount of physical RAM in the machine.
> This doesn't account for the base TaskTracker and DataNode processes + OS
> overhead and whatever else may be hoarding resources on the systems.

	+1 to what Eric said.

	You've exhausted memory and now the whole system is falling apart.  
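The arithmetic Eric quotes can be sketched as follows (slot counts and heap size are the numbers from this thread; the physical-RAM comparison is left to the reader since the box's RAM isn't stated here):

```python
# Worst-case heap demand from task slots alone, using the thread's numbers.
# The TaskTracker, DataNode, and OS overhead come on top of this figure.
map_slots = 4
reduce_slots = 4
child_heap_gb = 2.5  # from mapred.child.java.opts, e.g. -Xmx2500m

worst_case_gb = map_slots * child_heap_gb + reduce_slots * child_heap_gb
print(worst_case_gb)  # 20.0 GB before any daemon or OS overhead
```

If that 20 GB exceeds physical RAM, the kernel starts swapping or the OOM killer starts picking off processes, which matches the dead-DataNode symptom.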

> I would play with this ratio: either lower your max maps / reduces, or lower
> your child.java.opts, so that when you are fully subscribed you are not using
> more resources than the machine can offer.
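One way to act on that advice is a mapred-site.xml along these lines. The property names are the classic 0.20-era ones; the heap value here is purely illustrative, and should be sized so that (map slots + reduce slots) * child heap fits in physical RAM with room left for the TaskTracker, DataNode, and OS:

```xml
<!-- Illustrative values only: tune to your machine's RAM. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1536m</value>
</property>
```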


> Also, setting mapred.reduce.slowstart.completed.maps to 1.00 or some other
> value close to 1 would be one way to guarantee that only 4 tasks (all maps,
> or all reduces) run at once, and would address (albeit in a duct-tape-like
> way) the oversubscription problem you are seeing (this value is the fraction
> of maps that should complete before the reduce phase starts).

	slowstart isn't really going to help you much here.  All it takes is another job with
the same settings running at the same time, and processes will start dying again.  That said,
the default for slowstart is incredibly stupid for the vast majority of workloads.  Something
closer to .70 or .80 is more realistic.
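Moving slowstart to the suggested range is a one-property change in mapred-site.xml (0.80 here follows the suggestion above; pick what fits your workload):

```xml
<!-- Don't launch reducers until 80% of a job's maps have completed. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```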

>> * a 2x1GE bonded network interface for interconnects
>> * a 2x1GE bonded network interface for external access

	Multiple NICs on a box can sometimes cause big performance problems with Hadoop.  So watch
your traffic carefully.
