hadoop-common-dev mailing list archives

From stack <st...@archive.org>
Subject Hung job
Date Fri, 10 Mar 2006 19:22:13 GMT
On hadoop-users, I've described two recent hangs.  I'm writing to dev 
because I'm looking for pointers on how best to fix them, and to any 
existing facility I might exploit (I do not know the codebase well).

In synopsis the problem goes as follows:

If a reduce cannot pick up map outputs -- for example, the output has 
been moved aside because of a ChecksumException (see the stack trace 
below) -- then the job gets stuck, with the reduce task trying and 
failing every ten seconds or so to pick up the nonexistent map output part.

Somehow the reduce needs to give up and the jobtracker needs to rerun 
the map just as it would if the tasktracker had died completely.
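
As a rough sketch of the "give up" half of that idea, something like the following could count failed fetches per map task and signal when the reduce should stop retrying. The class FetchFailureTracker, its methods, and the limit of 3 are all hypothetical names and values chosen for illustration; none of this is existing Hadoop code, and how the jobtracker would actually be told to rerun the map is left open.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: counts consecutive failed
// fetches of each map task's output on the reduce side.
class FetchFailureTracker {
    private final int maxFailures;
    private final Map<String, Integer> failures = new HashMap<>();

    FetchFailureTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    // Record one failed fetch. Returns true once the limit is reached,
    // meaning the caller should stop retrying and report the map output
    // as lost so that the map can be rescheduled.
    boolean recordFailure(String mapTaskId) {
        int count = failures.merge(mapTaskId, 1, Integer::sum);
        return count >= maxFailures;
    }

    // A successful fetch clears the failure count for that map task.
    void recordSuccess(String mapTaskId) {
        failures.remove(mapTaskId);
    }
}
```

With a limit of 3 and a retry every ten seconds or so, the reduce would give up on a missing part after roughly half a minute instead of looping forever.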

Thanks in advance for any pointers,
St.Ack


060309 014426 Moving bad file /0/hadoop/tmp/task_m_bq2g76/part-20.out to /0/bad_files/part-20.out.2002824050
060309 014426 Server handler 0 on 50040 caught: org.apache.hadoop.fs.ChecksumException: Checksum error: /0/hadoop/tmp/task_m_bq2g76/part-20.out at 2649600
org.apache.hadoop.fs.ChecksumException: Checksum error: /0/hadoop/tmp/task_m_bq2g76/part-20.out at 2649600
      at org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:122)
      at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
      at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:158)
      at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
      at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
      at java.io.DataInputStream.read(DataInputStream.java:80)
      at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:110)
      at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
      at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:215)


