hadoop-common-user mailing list archives

From stephen mulcahy <stephen.mulc...@deri.org>
Subject Re: Network problems Hadoop 0.20.2 and Terasort on Debian 2.6.32 kernel
Date Thu, 15 Apr 2010 15:46:48 GMT
Todd Lipcon wrote:
>> Yes, it looks like it is a kernel bug alright (see thread on kernel netdev
>> at http://marc.info/?t=127094288900001&r=1&w=2 if interested). To be fair,
>> I don't think these bugs are confined to Debian - I did some initial testing
>> with Scientific Linux and also ran into problems with forcedeth.
> 
> 
> Interesting, good find. I try to avoid forcedeth now and have heard the same
> from ops people at various large linux deployments. Not sure why, but it's
> traditionally had a lot of bugs/regressions.

FYI, the netdev guys have proposed a patch, and initial testing indicates
it fixes the problem and brings the TeraSort run down to about 18 minutes,
so win-win :)

I share similar feelings about forcedeth, particularly after this, but
then I'm also dubious about at least some Broadcom chipsets, and even
Intel has had its issues
(https://bugzilla.kernel.org/show_bug.cgi?id=11382), so maybe it's just
that all NICs suck.

>> Finally, I figured burning in our cluster was a good opportunity to give
>> back to the community and do some testing on their behalf.
> 
> Very admirable of you :) It is good to have some people running new kernels
> to suss these issues out before the rest of us check out modern technology
> ;-)

It also means there aren't problems lurking for us in the future when we
get forced to newer kernels for support/maintenance reasons. I also ran
into http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556030 while
testing a 2.6.30 kernel; that bug may be lurking in older kernels too
(and seems to have been fixed in 2.6.32), so there are perils both in
staying back and in going forward.

>> With regard to our TeraSort benchmark time of ~23 minutes - is that in the
>> right ballpark for a cluster of 45 data nodes and a nn and 2nn?
>>
>>
> Yep, sounds about the right ballpark.

Cool, thanks for the feedback. I'm surprised that others didn't comment
on the TeraSort result - perhaps others use something else for
smoke-testing/benchmarking their Hadoop clusters? If so, does anyone want
to suggest what they do use? It'd be nice to see a collection of TeraSort
results somewhere, both to get an idea of which cluster configs work well
and to help people who want to sanity-check a new cluster.

-stephen

-- 
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com
