hadoop-common-dev mailing list archives

From "stack@archive.org (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-86) If corrupted map outputs, reducers get stuck fetching forever
Date Mon, 20 Mar 2006 18:05:58 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-86?page=comments#action_12371117 ] 

stack@archive.org commented on HADOOP-86:

+1 on patch.  It works for me.

Here is the failure over on a tasktracker:

060318 190019 Moving bad file /0/hadoop/tmp/part-24.out/task_m_4eop89 to /0/bad_files/task_m_4eop89.-1553185447
060318 190019 Can't read map output:/0/hadoop/tmp/part-24.out/task_m_4eop89
org.apache.hadoop.fs.ChecksumException: Checksum error: /0/hadoop/tmp/part-24.out/task_m_4eop89
at 1598464
    at org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:122)
    at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
    at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:158)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
    at java.io.DataInputStream.read(DataInputStream.java:80)
    at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:129)
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:229)
060318 190019 Reporting output lost:task_m_4eop89

Reducers must be getting the message promptly, because after the above no one else comes looking
for the corrupted part (no FileNotFoundExceptions in the TT log).
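For illustration, the move-aside behaviour visible in the trace above can be sketched like this (a hedged sketch, not Hadoop's actual code: `BAD_DIR`, `serveMapOutput`, and the CRC32 check are made-up stand-ins for the real checksum machinery):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.zip.CRC32;

// Hypothetical sketch of the "Moving bad file ... to /0/bad_files/..." behaviour:
// on a checksum mismatch the part is moved aside so it is never re-served,
// and the fetch fails with a checksum error.
public class MoveAsideSketch {
    static final Path BAD_DIR = Paths.get(
        System.getProperty("java.io.tmpdir"), "bad_files_demo");

    /** Verify data against an expected CRC; on mismatch, move the file aside
     *  and signal a checksum error to the fetching reducer. */
    static void serveMapOutput(Path part, long expectedCrc) throws IOException {
        byte[] data = Files.readAllBytes(part);
        CRC32 crc = new CRC32();
        crc.update(data);
        if (crc.getValue() != expectedCrc) {
            Files.createDirectories(BAD_DIR);
            Path aside = BAD_DIR.resolve(
                part.getFileName() + "." + System.nanoTime());
            Files.move(part, aside);   // part is gone; later fetches get FileNotFound
            throw new IOException("Checksum error: " + part);
        }
        // ...otherwise stream the bytes back to the reducer...
    }

    public static void main(String[] args) throws IOException {
        Path part = Files.createTempFile("part-demo", ".out");
        Files.write(part, "corrupted bytes".getBytes());
        try {
            serveMapOutput(part, 0L);  // deliberately wrong CRC
        } catch (IOException expected) {
            System.out.println("first fetch: " + expected.getMessage());
        }
        System.out.println("part still present: " + Files.exists(part));
    }
}
```

The second println shows why subsequent fetches of the same part can only fail: the file has been moved aside.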

Meanwhile, over on the jobtracker I see the following sequence...

060318 184356 Adding task 'task_m_4eop89' to tip tip_ccbgb1, for tracker 'tracker_14754' on
060318 184716 Taskid 'task_m_4eop89' has finished successfully.
060318 184716 Task 'task_m_4eop89' has completed.
060318 190028 Task 'task_m_4eop89' has been lost.
060318 190029 Adding task 'task_m_7mdx9h' to tip tip_ccbgb1, for tracker 'tracker_61554' on
060318 190329 Taskid 'task_m_7mdx9h' has finished successfully.
060318 190329 Task 'task_m_7mdx9h' has completed.

i.e. the task completes successfully; subsequently a message comes in that it's been lost, and
a new task is scheduled which in turn completes.
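The sequence above can be sketched as a tiny state machine over the task-in-progress (an illustrative sketch only, not the real JobTracker; `TipSketch` and its method names are made up):

```java
import java.util.*;

// Illustrative model of the jobtracker log: a completed map attempt is later
// reported lost, so the TIP schedules a fresh attempt under a new task id.
public class TipSketch {
    enum State { RUNNING, COMPLETE, LOST }

    final Map<String, State> attempts = new LinkedHashMap<>();
    int next = 0;

    /** "Adding task '...' to tip ..." */
    String schedule() {
        String id = "task_m_" + (next++);
        attempts.put(id, State.RUNNING);
        return id;
    }

    /** "Task '...' has completed." */
    void complete(String id) { attempts.put(id, State.COMPLETE); }

    /** "Task '...' has been lost." -- mark it and schedule a replacement. */
    String lost(String id) {
        attempts.put(id, State.LOST);
        return schedule();
    }

    public static void main(String[] args) {
        TipSketch tip = new TipSketch();
        String first = tip.schedule();     // 184356 Adding task
        tip.complete(first);               // 184716 has completed
        String second = tip.lost(first);   // 190028 has been lost -> 190029 Adding task
        tip.complete(second);              // 190329 has completed
        System.out.println(tip.attempts);  // {task_m_0=LOST, task_m_1=COMPLETE}
    }
}
```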

> If corrupted map outputs, reducers get stuck fetching forever
> -------------------------------------------------------------
>          Key: HADOOP-86
>          URL: http://issues.apache.org/jira/browse/HADOOP-86
>      Project: Hadoop
>         Type: Bug
>     Reporter: stack@archive.org
>  Attachments: mapout.patch
> In our rack, there is a machine that reliably corrupts map output parts.  When reducers
> try to pick up the map output, Server#Handler checks the checksum, notices the corruption,
> moves the bad map output part aside, and throws a ChecksumException.  Undeterred, the
> reducer comes back again minutes later, only this time it gets a FileNotFoundException out
> of Server#Handler (because the part was moved aside).  And so it goes till the cows come home.
> Doug applied a patch so that when the map output file code notices a fatal exception, it
> logs a severe error on TaskTracker#LOG.  Then, if a severe error has been logged, the TT
> does a soft restart (the TT stays up but closes down all services and goes through init
> again).  This patch was committed (after I suggested it was working), only, later, I noticed
> the severe-log flag is not cleared across the TT restart, so the TT goes into a cycle of
> continuous restarts.
> A further patch that clears the severe flag was posted to the list.  This improves things
> but has issues too, in that on revival the TT continues to be plagued by reducers looking
> for parts no longer available, for a period of ten minutes or so, until the JobTracker gets
> around to updating them about where to go to get map outputs.  During this period the TT
> gets restarted 5-10 times, but eventually comes back online (there may have been too much
> damage done during this period of flux, making it so the job will fail).
> This issue covers implementing a better solution.  
> Suggestions include having the TT stay down for a period to avoid the incoming reducers, or
> somehow examining the incoming reducer request, checking the TT's list of tasks to see if it
> knows anything of the requested map output, and rejecting the request with a non-severe error
> if it is not a map of the currently running TT.  A little birdie (named DC) suggests a better
> solution is probably an addition to InterTrackerProtocol so that either the TT or the reducer
> updates the JT when a map output is corrupted.
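That last suggestion could look something like the following (purely a sketch under assumptions: the method name `mapOutputLost` and the in-memory `FakeJobTracker` are hypothetical, not the real InterTrackerProtocol):

```java
import java.util.*;

// Hedged sketch of extending the tracker<->jobtracker protocol so that lost
// map output is reported immediately, instead of reducers hammering the TT
// for ~10 minutes until the JobTracker refreshes their fetch locations.
public class ProtocolSketch {
    interface InterTrackerProtocolExt {
        /** TT (or a reducer) tells the JT a map output is gone/corrupted. */
        void mapOutputLost(String taskId, String trackerName);
    }

    static class FakeJobTracker implements InterTrackerProtocolExt {
        final Set<String> lost = new HashSet<>();
        public void mapOutputLost(String taskId, String trackerName) {
            lost.add(taskId);   // here the JT would reschedule the map attempt
        }
        boolean mustRefetchElsewhere(String taskId) { return lost.contains(taskId); }
    }

    public static void main(String[] args) {
        FakeJobTracker jt = new FakeJobTracker();
        // TT detects the checksum error and notifies the JT at once:
        jt.mapOutputLost("task_m_4eop89", "tracker_14754");
        // Reducers now learn immediately to stop fetching from the dead part:
        System.out.println(jt.mustRefetchElsewhere("task_m_4eop89"));  // true
    }
}
```

The point of the design is that the JT learns about the corruption from the first failed fetch, so no soft restart of the TT is needed at all.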

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:
