hadoop-common-user mailing list archives

From Gregory Lawrence <gr...@yahoo-inc.com>
Subject Re: Reduce gets struck at 99%
Date Thu, 08 Apr 2010 19:15:43 GMT

I have also experienced this problem. Have you tried speculative execution? Also, I have had
jobs that took a long time for one mapper / reducer because of a record that was significantly
larger than those contained in the other filesplits. Do you know if it always slows down for
the same filesplit?
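
For reference, a minimal sketch of the speculative-execution settings, assuming the Hadoop 0.20 property names mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution. A plain java.util.Properties object stands in for the job configuration so the sketch compiles without Hadoop on the classpath; the class and method names are hypothetical. In a real job the same keys would be set on the job's Configuration, and as far as I know both default to true in a stock 0.20 install, so this only matters if the cluster or job has turned them off.

```java
import java.util.Properties;

public class SpeculativeSettings {
    // Property names as used by Hadoop 0.20 (assumed to match your version).
    static final String MAP_KEY = "mapred.map.tasks.speculative.execution";
    static final String REDUCE_KEY = "mapred.reduce.tasks.speculative.execution";

    // Hypothetical helper: builds the key/value overrides for a job.
    static Properties speculativeOverrides(boolean enabled) {
        Properties overrides = new Properties();
        overrides.setProperty(MAP_KEY, Boolean.toString(enabled));
        overrides.setProperty(REDUCE_KEY, Boolean.toString(enabled));
        return overrides;
    }

    public static void main(String[] args) {
        Properties overrides = speculativeOverrides(true);
        // Print the overrides; in a real job these would be copied
        // into the job configuration before submission.
        for (String name : overrides.stringPropertyNames()) {
            System.out.println(name + "=" + overrides.getProperty(name));
        }
    }
}
```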

Greg Lawrence

On 4/8/10 10:30 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:

Hello all,

         I got the timeout error mentioned below -- after 600 seconds, the attempt was
killed and deemed a failure. I searched around about this error, and
one of the suggestions was to include "progress" statements in the reducer -- it might be taking
longer than 600 seconds without reporting status and so is timing out. I added calls to context.progress() and context.setStatus(str)
in the reducer. Now it works fine -- there are no timeout errors.
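
The pattern looks roughly like the sketch below. It is not the actual reducer: the Reporter interface is a hypothetical stand-in for the two context calls (context.progress() and context.setStatus(str)) so that it runs without a cluster, and the pair-counting loop and the REPORT_EVERY interval are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ProgressSketch {
    /** Hypothetical stand-in for the two Hadoop context calls. */
    interface Reporter {
        void progress();
        void setStatus(String status);
    }

    /** Records the calls so the sketch can be run and checked locally. */
    static class RecordingReporter implements Reporter {
        int progressCalls = 0;
        final List<String> statuses = new ArrayList<>();
        public void progress() { progressCalls++; }
        public void setStatus(String status) { statuses.add(status); }
    }

    // Assumed interval: report once per this many inner-loop steps.
    static final int REPORT_EVERY = 1000;

    /** Two-level loop that counts equal pairs and reports periodically. */
    static long reduceWithProgress(List<Integer> values, Reporter reporter) {
        long equalPairs = 0;
        long steps = 0;
        for (int i = 0; i < values.size(); i++) {
            for (int j = i + 1; j < values.size(); j++) {
                if (values.get(i).equals(values.get(j))) {
                    equalPairs++;
                }
                steps++;
                // Report from inside the inner loop so a long-running
                // key still reports status before the timeout expires.
                if (steps % REPORT_EVERY == 0) {
                    reporter.progress();
                    reporter.setStatus("compared " + steps + " pairs");
                }
            }
        }
        return equalPairs;
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            values.add(i % 10);
        }
        RecordingReporter reporter = new RecordingReporter();
        long equalPairs = reduceWithProgress(values, reporter);
        System.out.println("equal pairs: " + equalPairs);
        System.out.println("progress calls: " + reporter.progressCalls);
    }
}
```

Reporting every N steps rather than on every iteration keeps the overhead small while still resetting the timeout regularly.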

         But, for a few jobs, it takes an awfully long time to move from "Map 100%, Reduce 99%"
to Reduce 100%. For some jobs it's 15 minutes, and for some it was more than an hour. The reduce
code is not complex -- a 2-level loop and a couple of if-else blocks. The input size is also not
huge; for the job that gets stuck for an hour at reduce 99%, it would take in 130. Some of
them are 1-3 MB in size and a couple of them are 16 MB in size.

         Has anyone encountered this problem before? Any pointers? I use Hadoop 0.20.2 on
a Linux cluster of 16 nodes.

Thank you.


On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
Hi all,

       I am running a series of jobs one after another. While executing the 4th job, the job
fails. It fails in the reducer -- the progress percentage stops at map 100%, reduce 99%.
It gives the following message:

10/04/01 01:04:15 INFO mapred.JobClient: Task Id : attempt_201003240138_0110_r_000018_1, Status
Task attempt_201003240138_0110_r_000018_1 failed to report status for 602 seconds. Killing!

It makes several more attempts to execute it, but each fails with a similar message. I couldn't get
anything from this error message and wanted to look at the logs (located in the default dir,
${HADOOP_HOME}/logs). But I don't find any files that match the timestamp of the job. Also,
I did not find history and userlogs in the logs folder. Should I look in some other place
for the logs? What could be the possible causes of the above error?

       I am using Hadoop 0.20.2 and I am running it on a cluster with 16 nodes.

Thank you.

