hadoop-mapreduce-user mailing list archives

From Raghava Mutharaju <m.vijayaragh...@gmail.com>
Subject Re: Reduce gets stuck at 99%
Date Thu, 08 Apr 2010 20:14:04 GMT
Hi,

     Thank you, Eric, Prashant, and Greg. Although the timeout problem was
resolved, reduce is still getting stuck at 99%. As of now, it has been stuck
there for about 3 hours, which is too long a wait for my task. Do you see any
reason for this?

      Speculative execution is "on" by default, right? Or should I enable it
explicitly, e.g. as in the sketch below?
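
What I mean by enabling it explicitly -- a minimal sketch against the 0.20
API (the property names are the 0.20-era ones, and the class is only an
illustration, not my actual job code):

    // Sketch: explicitly turn speculative execution on for map and reduce
    // tasks; in Hadoop 0.20.x both properties already default to true.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            Job job = new Job(conf, "speculative-demo");
            // ... set mapper, reducer, and input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }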

Regards,
Raghava.

On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gregl@yahoo-inc.com> wrote:

>  Hi,
>
> I have also experienced this problem. Have you tried speculative execution?
> Also, I have had jobs that took a long time for one mapper / reducer because
> of a record that was significantly larger than those contained in the other
> filesplits. Do you know if it always slows down for the same filesplit?
>
> Regards,
> Greg Lawrence
>
>
> On 4/8/10 10:30 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:
>
> Hello all,
>
>          I got the timeout error mentioned below -- after 600 seconds, the
> attempt was killed and deemed a failure. I searched around about this error,
> and one of the suggestions was to include "progress" statements in the
> reducer -- it might be taking longer than 600 seconds and so timing out. I
> added calls to context.progress() and context.setStatus(str) in the reducer.
> Now it works fine -- there are no timeout errors.
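>
> For reference, the reducer now reports progress roughly as in the sketch
> below -- a trimmed-down illustration, not my real class (the Text types, the
> class name, and the 1000-value interval are all placeholders):
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Reducer;
>
>     public class ProgressReducer extends Reducer<Text, Text, Text, Text> {
>         @Override
>         protected void reduce(Text key, Iterable<Text> values, Context context)
>                 throws IOException, InterruptedException {
>             long seen = 0;
>             for (Text value : values) {
>                 // ... the actual per-value reduce work goes here ...
>                 if (++seen % 1000 == 0) {
>                     // Resets the 600-second timeout clock on the tasktracker.
>                     context.progress();
>                     context.setStatus("processed " + seen + " values for " + key);
>                 }
>             }
>             context.write(key, new Text(Long.toString(seen)));
>         }
>     }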
>
>          But for a few jobs, it takes an awfully long time to move from "Map
> 100%, Reduce 99%" to "Reduce 100%". For some jobs it takes 15 minutes, and
> for some it was more than an hour. The reduce code is not complex -- a
> 2-level loop and a couple of if-else blocks. The input size is also not
> huge: the job that gets stuck for an hour at reduce 99% takes in about 130
> files. Some of them are 1-3 MB in size and a couple of them are 16 MB.
>
>          Has anyone encountered this problem before? Any pointers? I use
> Hadoop 0.20.2 on a Linux cluster of 16 nodes.
>
> Thank you.
>
> Regards,
> Raghava.
>
> On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju
> <m.vijayaraghava@gmail.com> wrote:
>
> Hi all,
>
>        I am running a series of jobs one after another. While executing the
> 4th job, it fails in the reducer -- the progress percentage stays at map
> 100%, reduce 99%. It gives the following message:
>
> 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> attempt_201003240138_0110_r_000018_1, Status : FAILED
> Task attempt_201003240138_0110_r_000018_1 failed to report status for 602
> seconds. Killing!
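>
> The 602 seconds in that message come from the task timeout, which defaults
> to 600,000 ms. If a task legitimately needs longer between progress
> reports, the limit can be raised on the job configuration -- a minimal
> sketch, assuming the 0.20-era property name mapred.task.timeout (the class
> name is illustrative only):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.mapreduce.Job;
>
>     public class LongTimeoutJob {
>         public static void main(String[] args) throws Exception {
>             Configuration conf = new Configuration();
>             // Allow 30 minutes between progress reports instead of the
>             // default 10 minutes; the value is in milliseconds.
>             conf.setLong("mapred.task.timeout", 30L * 60L * 1000L);
>             Job job = new Job(conf, "long-timeout-job");
>             // ... set mapper, reducer, and input/output paths as usual ...
>             System.exit(job.waitForCompletion(true) ? 0 : 1);
>         }
>     }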
>
> It makes several more attempts to execute it, but they fail with a similar
> message. I couldn't get anything from this error message and wanted to look
> at the logs (located in the default directory, ${HADOOP_HOME}/logs), but I
> don't find any files that match the timestamp of the job. I also did not
> find the history and userlogs directories in the logs folder. Should I look
> at some other place for the logs? What could be the possible causes of the
> above error?
>
>        I am using Hadoop 0.20.2 and I am running it on a cluster with 16
> nodes.
>
> Thank you.
>
> Regards,
> Raghava.
