hadoop-common-user mailing list archives

From Raghava Mutharaju <m.vijayaragh...@gmail.com>
Subject Re: Reduce gets stuck at 99%
Date Fri, 09 Apr 2010 02:40:07 GMT
Hi Ted,

        Thank you for all the suggestions. I went through the job tracker
logs and have attached the exceptions I found. There are two:

1) org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
complete write to file    (DFS Client)

2) org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014
File does not exist. Holder DFSClient_attempt_201004060646_0057_r_000014_0
does not have any open files.


The exception occurs at the point of writing out <K,V> pairs in the reducer,
and it occurs only in certain task attempts. I am not using any custom
output format or record writers, but I do use a custom input reader.
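
To make "writing out <K,V> pairs" concrete, it is just the usual
context.write() call in the reduce method, along these lines (a sketch only;
the Text types below are placeholders, not the job's actual key/value
classes):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only -- placeholder types, not the actual reducer.
    public class SampleReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // This write goes through the (default) output format into the
                // attempt's _temporary directory named in exception 2) above.
                context.write(key, value);
            }
        }
    }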

What could have gone wrong here?

Thank you.

Regards,
Raghava.


On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Raghava:
> Are you able to share the last segment of the reducer log?
> You can get it from the web UI:
>
> http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193
>
> Adding more logging in your reducer task would help pinpoint where the
> issue is.
> Also look in the job tracker log.
>
> Cheers
>
> On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
>
> > Hi Ted,
> >
> >      Thank you for the suggestion. I enabled it using the Configuration
> > class because I cannot change the hadoop-site.xml file (I am not an
> > admin). The situation is still the same --- it gets stuck at reduce 99%
> > and does not move further.
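> >
> > Roughly what I set in the driver before submitting the job (a sketch only;
> > the rest of the driver is unchanged):
> >
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.mapreduce.Job;
> >
> >     // Turn on speculative execution programmatically, since I cannot
> >     // edit hadoop-site.xml on the cluster.
> >     Configuration conf = new Configuration();
> >     conf.setBoolean("mapred.map.tasks.speculative.execution", true);
> >     conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
> >     Job job = new Job(conf, "...");   // job name and rest of the setup as before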
> >
> > Regards,
> > Raghava.
> >
> > On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > You need to turn it on yourself (in hadoop-site.xml):
> > > <property>
> > >  <name>mapred.reduce.tasks.speculative.execution</name>
> > >  <value>true</value>
> > > </property>
> > >
> > > <property>
> > >  <name>mapred.map.tasks.speculative.execution</name>
> > >  <value>true</value>
> > > </property>
> > >
> > >
> > > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > >     Thank you Eric, Prashant and Greg. Although the timeout problem was
> > > > resolved, reduce is getting stuck at 99%. As of now, it has been stuck
> > > > there for about 3 hrs. That is too high a wait time for my task. Do you
> > > > guys see any reason for this?
> > > >
> > > >      Speculative execution is "on" by default right? Or should I enable
> > > > it?
> > > >
> > > > Regards,
> > > > Raghava.
> > > >
> > > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gregl@yahoo-inc.com> wrote:
> > > >
> > > > >  Hi,
> > > > >
> > > > > I have also experienced this problem. Have you tried speculative
> > > > > execution? Also, I have had jobs that took a long time for one mapper /
> > > > > reducer because of a record that was significantly larger than those
> > > > > contained in the other filesplits. Do you know if it always slows down
> > > > > for the same filesplit?
> > > > >
> > > > > Regards,
> > > > > Greg Lawrence
> > > > >
> > > > >
> > > > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:
> > > > >
> > > > > Hello all,
> > > > >
> > > > >          I got the timeout error as mentioned below -- after 600
> > > > > seconds, that attempt was killed and the attempt would be deemed a
> > > > > failure. I searched around about this error, and one of the
> > > > > suggestions was to include "progress" statements in the reducer -- it
> > > > > might be taking longer than 600 seconds and so is timing out. I added
> > > > > calls to context.progress() and context.setStatus(str) in the reducer.
> > > > > Now, it works fine -- there are no timeout errors.
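> > > > >
> > > > > The calls sit inside the reduce loop, roughly like this (a sketch only;
> > > > > the key/value types and the reporting interval are placeholders):
> > > > >
> > > > >     @Override
> > > > >     protected void reduce(Text key, Iterable<Text> values, Context context)
> > > > >             throws IOException, InterruptedException {
> > > > >         long seen = 0;
> > > > >         for (Text value : values) {
> > > > >             // ... the existing loop / if-else logic ...
> > > > >             if (++seen % 1000 == 0) {
> > > > >                 context.progress();  // tell the framework the task is still alive
> > > > >                 context.setStatus("processed " + seen + " values for " + key);
> > > > >             }
> > > > >         }
> > > > >     }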
> > > > >
> > > > >          But, for a few jobs, it takes an awfully long time to move
> > > > > from "Map 100%, Reduce 99%" to Reduce 100%. For some jobs it is 15
> > > > > minutes and for some it was more than an hour. The reduce code is not
> > > > > complex -- a 2-level loop and a couple of if-else blocks. The input
> > > > > size is also not huge: the job that gets stuck for an hour at reduce
> > > > > 99% takes in about 130 input files. Some of them are 1-3 MB in size
> > > > > and a couple of them are 16 MB in size.
> > > > >
> > > > >          Has anyone encountered this problem before? Any pointers? I
> > > > > use Hadoop 0.20.2 on a Linux cluster of 16 nodes.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Raghava.
> > > > >
> > > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > >        I am running a series of jobs one after another. While
> > > > > executing the 4th job, the job fails. It fails in the reducer --- the
> > > > > progress percentage would be map 100%, reduce 99%. It gives out the
> > > > > following message:
> > > > >
> > > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> > > > > attempt_201003240138_0110_r_000018_1, Status : FAILED
> > > > > Task attempt_201003240138_0110_r_000018_1 failed to report status for
> > > > > 602 seconds. Killing!
> > > > >
> > > > > It makes several attempts again to execute it but fails with a
> > > > > similar message. I couldn't get anything from this error message and
> > > > > wanted to look at the logs (located in the default dir of
> > > > > ${HADOOP_HOME}/logs). But I don't find any files which match the
> > > > > timestamp of the job. Also, I did not find history and userlogs in the
> > > > > logs folder. Should I look at some other place for the logs? What
> > > > > could be the possible causes for the above error?
> > > > >
> > > > >        I am using Hadoop 0.20.2 and I am running it on a cluster with
> > > > > 16 nodes.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Raghava.
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
