hadoop-common-user mailing list archives

From stephen mulcahy <stephen.mulc...@deri.org>
Subject Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel
Date Thu, 08 Apr 2010 16:37:51 GMT
Hi,

I'm commissioning a new Hadoop cluster with the following spec.

45 x data nodes:
- 2 x Quad-Core AMD Opteron(tm) Processor 2378
- 16GB ram
- 4 x WDC WD1002FBYS 1TB SATA drives (configured as separate ext4 
filesystems)

3 x name nodes:
- 2 x Quad-Core AMD Opteron(tm) Processor 2378
- 32GB ram
- 2 x WDC WD1002FBYS 1TB SATA drives (in software RAID1 config and ext4 
filesystem)

All nodes are running Debian testing/squeeze.

I'm doing my benchmarking with TeraSort running as follows

hadoop jar hadoop-0.20.2-examples.jar teragen -Dmapred.map.tasks=8000 10000000000 /terasort/in

hadoop jar hadoop-0.20.2-examples.jar terasort -Dmapred.reduce.tasks=530 /terasort/in /terasort/out
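
For scale, here is a back-of-envelope sketch of what those two commands
ask of the cluster. It assumes TeraGen's usual 100-byte rows, which is
not stated above, and the variable names are only illustrative.

```shell
#!/bin/sh
# Rough sizing of the TeraGen/TeraSort run.
# Assumption: each generated row is 100 bytes.
rows=10000000000      # row count passed to teragen
row_bytes=100         # assumed row size
map_tasks=8000        # -Dmapred.map.tasks
reduce_tasks=530      # -Dmapred.reduce.tasks

total_bytes=$((rows * row_bytes))
rows_per_map=$((rows / map_tasks))
bytes_per_map=$((rows_per_map * row_bytes))
# Integer division, so this is rounded down to whole bytes.
bytes_per_reduce=$((total_bytes / reduce_tasks))

echo "total input:     $total_bytes bytes"
echo "per map task:    $bytes_per_map bytes"
echo "per reduce task: $bytes_per_reduce bytes"
```

Under that assumption the job sorts about 1 TB, each teragen map task
writes about 125 MB, and each reducer receives roughly 1.9 GB if the
keys are spread evenly.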

When I run this on the Debian 2.6.30 kernel, it runs to completion in 
about 23 minutes (occasionally running into the CPU soft lockup 
problems described in [1]). I assume that is a reasonable time for this 
benchmark?

When I run this on the Debian 2.6.32 kernel - over the course of the 
run, 1 or 2 datanodes of the cluster enter a state whereby they are no 
longer responsive to network traffic.

Logging into these nodes via the console reveals no messages in the 
log files. Running ifdown eth0 followed by ifup eth0 brings these 
systems back online. The systems that become unresponsive vary from run 
to run, suggesting this is not a h/w problem specific to certain nodes.
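
The manual recovery described above could be automated on each node
with something like the following. This is only a sketch of a
workaround, not a fix, and not something that was run as part of these
tests. The probe address, the function names and the log tag are all
hypothetical; the probe host would need to be something that is always
reachable, such as the rack switch or gateway.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: bounce the interface when a probe host
# stops answering pings. Needs root for ifdown/ifup.
IFACE="${IFACE:-eth0}"
PROBE_HOST="${PROBE_HOST:-192.0.2.1}"   # placeholder address, replace it

# Succeeds if the probe host answers at least one of 3 pings.
link_ok() {
    ping -c 3 -W 2 "$PROBE_HOST" >/dev/null 2>&1
}

# Record the event in syslog, then do the same ifdown/ifup as by hand.
recover_iface() {
    logger -t net-watchdog "no reply from $PROBE_HOST, bouncing $IFACE"
    ifdown "$IFACE" && ifup "$IFACE"
}

# One check: do nothing while the link is healthy, recover otherwise.
watchdog_once() {
    if link_ok; then
        return 0
    fi
    recover_iface
}
```

Calling watchdog_once from cron every minute or so would at least keep
a benchmark run going, and the syslog entries would show which nodes
dropped off the network and when.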

I have raised this issue with the Debian kernel team[2] and have tested
various system and switch changes in an attempt to identify the cause -
but without success.

Has anyone run into similar problems in their environments? I noticed 
that when the nodes become unresponsive, it often happens when the 
TeraSort is at

map 100%, reduce 78%

Is there any significance to that?

Any feedback welcome (including comments on what distro/kernel 
combinations others are using).

Thanks,

-stephen

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556030
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=572201

-- 
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com
