hadoop-common-user mailing list archives

From: Иван <...@mail.ru>
Subject: Instant death of all TaskTrackers
Date: Tue, 16 Sep 2008 15:35:33 GMT
Today I ran into a strange situation: while some MapReduce jobs were running, all the TaskTrackers
in the cluster simply disappeared without any apparent reason, but the JobTracker remained
alive as if nothing had happened. Even its web interface kept running, showing zero capacity
for maps and reduces and all the same jobs still in the running state (in fact, the TaskTracker$Child
processes also remained in memory). Examining the TaskTracker logs turned up (almost) the same
exception at the tail of each, like this one:

2008-09-16 06:27:11,244 WARN org.apache.hadoop.mapred.TaskTracker: Error initializing task_200809151253_1938_m_000003_0:
java.lang.InternalError: jzentry == 0,
 jzfile = 46912646564160,
 total = 148,
 name = /data/hadoop/root/mapred/local/taskTracker/jobcache/job_200809151253_1938/jars/job.jar,
 i = 3,
 message = invalid LOC header (bad signature)
        at java.util.zip.ZipFile$3.nextElement(ZipFile.java:429)
        at java.util.zip.ZipFile$3.nextElement(ZipFile.java:415)
        at java.util.jar.JarFile$1.nextElement(JarFile.java:221)
        at java.util.jar.JarFile$1.nextElement(JarFile.java:220)
        at org.apache.hadoop.util.RunJar.unJar(RunJar.java:40)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:708)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)

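For reference, here is a minimal sketch (my own, not part of Hadoop) that walks the same code path
as RunJar.unJar, i.e. enumerating the entries of the localized job.jar with plain java.util.jar.
It assumes the local copy of the jar was truncated or corrupted on disk; the default path is only
a placeholder:

    import java.io.File;
    import java.util.Enumeration;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;

    public class CheckJobJar {
        public static void main(String[] args) throws Exception {
            // Placeholder path; in practice point this at the localized job.jar
            // under the TaskTracker's mapred.local.dir jobcache.
            File jar = new File(args.length > 0 ? args[0] : "job.jar");
            JarFile jarFile = new JarFile(jar);
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                // On a damaged archive this enumeration is where the JVM throws,
                // e.g. java.lang.InternalError: ... invalid LOC header (bad signature)
                JarEntry entry = entries.nextElement();
                System.out.println(entry.getName());
            }
            jarFile.close();
        }
    }

If the jar is intact, the loop just prints every entry name; if the central directory and the local
file headers disagree (e.g. after a partial or corrupted copy), the enumeration fails the way the
log above shows.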
The exact exception messages were different, but the stack traces on all nodes were close to
each other. At first it looked as if I could just shrug it off and simply start the TaskTrackers
once more, but something went wrong. The fsck utility found some missing blocks,
and the HBase instance running on the same cluster simply became unavailable and later failed
to start up (the issues seem to be related). HBase reported SocketTimeoutExceptions
(in fact only about two servers at a time, but after a cluster restart the role
of "victim" moved to other nodes), while the HDFS logs occasionally contained messages
about being unable to find some old blocks or to create new ones. I've double-checked
the likely suspects: DNS problems, network collisions, iptables, possible disk corruption,
and so on, but even a complete cluster reboot hasn't changed the situation a bit.
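In case somebody wants to check the same thing, below is a rough helper one could run on each node
(again just a sketch I put together, not part of Hadoop) that scans a local jobcache directory and
tries to read every entry of each job.jar, printing the jars that fail; the root path is only an
assumption based on the path in the log above:

    import java.io.File;
    import java.io.InputStream;
    import java.util.Enumeration;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;

    public class ScanJobCache {
        public static void main(String[] args) throws Exception {
            // Assumed jobcache root; pass the real mapred.local.dir location instead.
            File root = new File(args.length > 0 ? args[0]
                    : "/data/hadoop/root/mapred/local/taskTracker/jobcache");
            scan(root);
        }

        // Recursively look for files named job.jar and report whether they are readable.
        static void scan(File dir) {
            File[] children = dir.listFiles();
            if (children == null) return;
            for (File f : children) {
                if (f.isDirectory()) {
                    scan(f);
                } else if (f.getName().equals("job.jar")) {
                    System.out.println(f + " -> " + (readable(f) ? "OK" : "CORRUPTED"));
                }
            }
        }

        // Returns false if the jar cannot be opened or any entry cannot be read fully.
        static boolean readable(File jar) {
            try {
                JarFile jarFile = new JarFile(jar);
                try {
                    byte[] buf = new byte[8192];
                    Enumeration<JarEntry> entries = jarFile.entries();
                    while (entries.hasMoreElements()) {
                        InputStream in = jarFile.getInputStream(entries.nextElement());
                        while (in.read(buf) != -1) { /* drain to force decompression */ }
                        in.close();
                    }
                } finally {
                    jarFile.close();
                }
                return true;
            } catch (Throwable t) {
                // InternalError (as in the log above) or ZipException on a damaged archive.
                return false;
            }
        }
    }

Running something like this on each node would at least tell whether the localized jars themselves
are damaged on local disk, or whether the problem is purely on the HDFS side.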

P.S.: Hadoop 0.17.1, HBase 0.2.0, Debian Etch
P.P.S.: In case it matters: the MR jobs running at that moment were manipulating data
in HBase, and all the blocks reporting problems are located under the HBase root
directory (at least it looks that way).

Ivan Blinkov
