hadoop-common-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Reduce gets stuck at 99%
Date Thu, 08 Apr 2010 20:40:03 GMT
You need to turn it on yourself (in hadoop-site.xml):
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
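
The same thing can also be done from the job driver. A minimal sketch
using the old mapred API that ships with 0.20 (JobConf has dedicated
setters for these two properties; the class name here is just an
example, not code from this thread):

import org.apache.hadoop.mapred.JobConf;

public class DriverSnippet {
  static void enableSpeculation(JobConf conf) {
    // Equivalent to the two hadoop-site.xml properties above.
    conf.setMapSpeculativeExecution(true);
    conf.setReduceSpeculativeExecution(true);
  }
}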


On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:

> Hi,
>
>     Thank you Eric, Prashant and Greg. Although the timeout problem was
> resolved, reduce is getting stuck at 99%. As of now, it has been stuck
> there for about 3 hrs. That is too high a wait time for my task. Do you
> guys see any reason for this?
>
>      Speculative execution is "on" by default, right? Or should I enable it?
>
> Regards,
> Raghava.
>
> On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gregl@yahoo-inc.com> wrote:
>
> >  Hi,
> >
> > I have also experienced this problem. Have you tried speculative
> > execution? Also, I have had jobs that took a long time for one mapper /
> > reducer because of a record that was significantly larger than those
> > contained in the other filesplits. Do you know if it always slows down
> > for the same filesplit?
> >
> > Regards,
> > Greg Lawrence
> >
> >
> > On 4/8/10 10:30 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:
> >
> > Hello all,
> >
> >          I got the timeout error as mentioned below -- after 600
> > seconds, that attempt was killed and deemed a failure. I searched
> > around about this error, and one of the suggestions was to include
> > "progress" statements in the reducer -- it might be taking longer than
> > 600 seconds and so is timing out. I added calls to context.progress()
> > and context.setStatus(str) in the reducer. Now it works fine -- there
> > are no timeout errors.
> >
> >          But, for a few jobs, it takes an awfully long time to move
> > from "Map 100%, Reduce 99%" to "Reduce 100%". For some jobs it is 15
> > minutes and for some it was more than an hour. The reduce code is not
> > complex -- a 2-level loop and a couple of if-else blocks. The input
> > size is also not huge: the job that gets stuck for an hour at reduce
> > 99% takes in 130 files. Some of them are 1-3 MB in size and a couple
> > of them are 16 MB.
> >
> >          Has anyone encountered this problem before? Any pointers? I
> > use Hadoop 0.20.2 on a Linux cluster of 16 nodes.
> >
> > Thank you.
> >
> > Regards,
> > Raghava.
> >
> > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >
> > Hi all,
> >
> >        I am running a series of jobs one after another. While
> > executing the 4th job, the job fails. It fails in the reducer -- the
> > progress percentage would be map 100%, reduce 99%. It gives the
> > following message:
> >
> > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id : attempt_201003240138_0110_r_000018_1, Status : FAILED
> > Task attempt_201003240138_0110_r_000018_1 failed to report status for 602 seconds. Killing!
> >
> > It makes several attempts to execute it but fails with a similar
> > message. I couldn't get anything from this error message and wanted to
> > look at the logs (in the default dir, ${HADOOP_HOME}/logs). But I
> > don't find any files which match the timestamp of the job. Also, I did
> > not find the history and userlogs directories in the logs folder.
> > Should I look at some other place for the logs? What could be the
> > possible causes of the above error?
> >
> >        I am using Hadoop 0.20.2 and I am running it on a cluster with 16
> > nodes.
> >
> > Thank you.
> >
> > Regards,
> > Raghava.
> >
> >
> >
> >
>
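
Regarding the context.progress() / context.setStatus() calls mentioned
above: a minimal sketch of that pattern with the 0.20 mapreduce API
(the key/value types, the 1000-record interval, and the per-record work
are placeholders, not the actual job code from this thread):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ProgressReportingReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long count = 0;
    for (Text value : values) {
      // ... per-record work goes here ...
      if (++count % 1000 == 0) {
        context.progress();   // tells the framework the task is alive,
                              // resetting the mapred.task.timeout clock
        context.setStatus("processed " + count + " values for " + key);
      }
    }
    context.write(key, new Text(Long.toString(count)));
  }
}

Without such calls, a reduce() invocation that runs longer than
mapred.task.timeout (600 seconds by default) is killed, exactly as in
the log message quoted above.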
