hadoop-common-user mailing list archives

From David Howell <dehow...@gmail.com>
Subject losing network interfaces during long running map-reduce jobs
Date Sat, 03 Apr 2010 01:16:52 GMT
I'm encountering a completely bizarre failure mode in my Hadoop
cluster. A week ago, I switched from vanilla Apache Hadoop 0.20.1 to
CDH 2.

Ever since then, my tasktracker/datanode machines have been regularly
losing their networking during long (> 1 hour) jobs. Restarting the
network interface brings them back online immediately.

I'm mystified as to how this can be happening: anyone care to venture
a hypothesis? I'm running on CentOS 5.2.

