hadoop-mapreduce-user mailing list archives

From Raghava Mutharaju <m.vijayaragh...@gmail.com>
Subject Re: Reduce gets stuck at 99%
Date Sun, 18 Apr 2010 08:24:43 GMT
Hi,

        Thank you, Ted. I will describe the problem again so that it is
easier for anyone reading this email chain.

I run a series of jobs one after another. Starting from the 4th job, the
reducer gets stuck at 99% (Map 100%, Reduce 99%). It stays stuck at 99% for
many hours and then the job fails. Earlier there were 2 exceptions in the logs
--- a DFSClient exception (could not completely write into a file <file name>)
and a Lease Expired Exception. Then I increased the ulimit -n (max number of
open files) from 1024 to 32768 on the advice of Ted. After this, there are no
exceptions in the logs but the reduce still gets stuck at 99%.
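
In case it helps anyone else reading: one way to confirm the new limit
actually applies to the running Hadoop daemons (and not just to fresh login
shells) is to check the live processes. A quick sketch, assuming Linux with
/proc/<pid>/limits available (<pid> is a placeholder):

$ jps                                   # pids of DataNode, TaskTracker, etc.
$ grep "open files" /proc/<pid>/limits  # limit the daemon actually runs with

Daemons started before the change keep the old limit until restarted.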

Do you have any suggestions?

Thank you.

Regards,
Raghava.


On Sat, Apr 17, 2010 at 9:36 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Hi,
> Putting this thread back in the pool to leverage collective intelligence.
>
> If you get the full command line of the java processes, it wouldn't be
> difficult to correlate reduce task(s) with a particular job.
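>
> For example (a sketch; exact arguments vary by setup): the child task JVMs
> carry the task attempt id on their command line, and the attempt id embeds
> the job id.
>
> $ jps -lvm | grep Child                    # task JVMs with full command lines
> $ ps -ef | grep attempt_201004060646_0057  # pids of that job's tasks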
>
> Cheers
>
> On Sat, Apr 17, 2010 at 2:20 PM, Raghava Mutharaju <
> m.vijayaraghava@gmail.com> wrote:
>
> > Hello Ted,
> >
> >        Thank you for the suggestions :). I haven't come across any other
> > serious issue before this one. In fact, the same MR job runs for a smaller
> > input size, although a lot slower than we expected.
> >
> > I will use jstack to get the stack trace. I had a question in this regard.
> > How would I know which MR job (job id) is related to which java process
> > (pid)? I can get a list of hadoop jobs with "hadoop job -list" and a list
> > of java processes with "jps", but I couldn't determine how to connect
> > these 2 lists.
> >
> >
> > Thank you again.
> >
> > Regards,
> > Raghava.
> >
> > On Fri, Apr 16, 2010 at 11:07 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> If you look at
> >> https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12408776,
> >> you can see that hdfs-127-branch20-redone-v2.txt
> >> <https://issues.apache.org/jira/secure/attachment/12431012/hdfs-127-branch20-redone-v2.txt>
> >> was the latest.
> >>
> >> You need to download the source code corresponding to your version of
> >> hadoop, apply the patch and rebuild.
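> >>
> >> Roughly (a sketch, assuming a 0.20 source tree and the stock Ant build;
> >> the patch may need -p0 or -p1 depending on how it was generated):
> >>
> >> $ cd hadoop-0.20.2                             # unpacked source release
> >> $ patch -p0 < hdfs-127-branch20-redone-v2.txt  # apply the JIRA patch
> >> $ ant jar                                      # rebuild the core jar
> >>
> >> Then deploy the rebuilt jar to all nodes and restart the daemons.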
> >>
> >> If you haven't experienced a serious issue with hadoop in other
> >> scenarios, we should try to find the root cause of the current problem
> >> without the 127 patch.
> >>
> >> My advice is to use jstack to find what each thread is waiting for after
> >> the reducers get stuck.
> >> I would expect a deadlock in either your code or HDFS; I would think it
> >> should be the former.
> >>
> >> You can replace sensitive names in the stack traces and paste them if you
> >> cannot determine the deadlock.
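> >>
> >> Something like the following once a reducer is stuck (pid taken from jps):
> >>
> >> $ jstack <reduce-task-pid> > stuck-reducer-1.txt
> >> $ jstack <reduce-task-pid> > stuck-reducer-2.txt   # again a minute later
> >>
> >> Threads sitting in the same frame in both dumps are the ones to look at;
> >> jstack also reports Java-level deadlocks directly.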
> >>
> >> Cheers
> >>
> >>
> >> On Fri, Apr 16, 2010 at 5:46 PM, Raghava Mutharaju <
> >> m.vijayaraghava@gmail.com> wrote:
> >>
> >>> Hello Ted,
> >>>
> >>>       Thank you for the reply. Will this change fix my issue? I ask
> >>> because I again need to convince my admin to make this change.
> >>>
> >>>       We have a gateway to the cluster head. We generally run our MR
> >>> jobs on the gateway. Should this change be made to the hadoop
> >>> installation on the gateway?
> >>>
> >>> 1) I am confused about which patch to apply. There are 4 patches listed
> >>> at https://issues.apache.org/jira/browse/HDFS-127
> >>>
> >>> 2) How do we apply the patch? Should we change the lines of code
> >>> specified and rebuild hadoop? Or is there some other way?
> >>>
> >>> Thank you again.
> >>>
> >>> Regards,
> >>> Raghava.
> >>>
> >>>
> >>> On Fri, Apr 16, 2010 at 6:42 PM, <yuzhihong@gmail.com> wrote:
> >>>
> >>>> That patch is very important.
> >>>>
> >>>> Please apply it.
> >>>>
> >>>> Sent from my Verizon Wireless BlackBerry
> >>>> ------------------------------
> >>>> *From: * Raghava Mutharaju <m.vijayaraghava@gmail.com>
> >>>> *Date: *Fri, 16 Apr 2010 17:27:11 -0400
> >>>> *To: *Ted Yu<yuzhihong@gmail.com>
> >>>> *Subject: *Re: Reduce gets stuck at 99%
> >>>>
> >>>> Hi Ted,
> >>>>
> >>>>         It took some time to contact my department's admin (he was on
> >>>> leave) and ask him to make the ulimit changes effective in the cluster
> >>>> (just adding an entry in /etc/security/limits.conf was not sufficient,
> >>>> so it took some time to figure out). Now the ulimit is 32768. I ran the
> >>>> set of MR jobs; the result is the same --- it gets stuck at Reduce 99%.
> >>>> But this time, there are no exceptions in the logs. I view the
> >>>> JobTracker logs through the Web UI. I checked "Running Jobs" as well as
> >>>> "Failed Jobs".
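> >>>>
> >>>> For the archive, in case someone hits the same thing -- a typical setup
> >>>> looks like the sketch below ("hadoop" stands for whichever user runs
> >>>> the daemons), and the daemons have to be restarted from a fresh login
> >>>> session before the new limit takes effect:
> >>>>
> >>>> # /etc/security/limits.conf
> >>>> hadoop  soft  nofile  32768
> >>>> hadoop  hard  nofile  32768
> >>>>
> >>>> # pam_limits must be enabled for the login path used, e.g. in
> >>>> # /etc/pam.d/sshd:
> >>>> session  required  pam_limits.so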
> >>>>
> >>>> I haven't asked the admin to apply the patch
> >>>> https://issues.apache.org/jira/browse/HDFS-127 that you mentioned
> >>>> earlier. Is this important?
> >>>>
> >>>> Do you have any suggestions?
> >>>>
> >>>> Thank you.
> >>>>
> >>>> Regards,
> >>>> Raghava.
> >>>>
> >>>> On Fri, Apr 9, 2010 at 3:35 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>
> >>>>> For the user under whom you launch MR jobs.
> >>>>>
> >>>>>
> >>>>> On Fri, Apr 9, 2010 at 12:02 PM, Raghava Mutharaju <
> >>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Ted,
> >>>>>>
> >>>>>>        Sorry to bug you again :) but I do not have an account on
> >>>>>> all the datanodes; I just have it on the machine on which I start
> >>>>>> the MR jobs. So is it required to increase the ulimit on all the
> >>>>>> nodes (in this case, would the admin have to increase it for all
> >>>>>> users)?
> >>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>> Raghava.
> >>>>>>
> >>>>>> On Fri, Apr 9, 2010 at 11:43 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>
> >>>>>>> ulimit should be increased on all nodes.
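> >>>>>>>
> >>>>>>> A quick way to check it everywhere (a sketch, assuming passwordless
> >>>>>>> ssh and the usual conf/slaves file listing the nodes):
> >>>>>>>
> >>>>>>> $ for h in $(cat conf/slaves); do echo -n "$h: "; ssh $h 'ulimit -n'; done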
> >>>>>>>
> >>>>>>> The link I gave you lists several actions to take. I think they're
> >>>>>>> not specific to HBase.
> >>>>>>> Also make sure the following is applied:
> >>>>>>> https://issues.apache.org/jira/browse/HDFS-127
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 8, 2010 at 10:13 PM, Raghava Mutharaju <
> >>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hello Ted,
> >>>>>>>>
> >>>>>>>>        Should the increase in ulimit to 32768 be applied on all
> >>>>>>>> the datanodes (it's a 16-node cluster)? Is this related to HBase?
> >>>>>>>> I ask because I am not using HBase.
> >>>>>>>>        Are the exceptions & delay (at Reduce 99%) due to this?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Raghava.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Apr 9, 2010 at 1:01 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Your ulimit is low.
> >>>>>>>>> Ask your admin to increase it to 32768
> >>>>>>>>>
> >>>>>>>>> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 8, 2010 at 9:46 PM, Raghava Mutharaju <
> >>>>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Ted,
> >>>>>>>>>>
> >>>>>>>>>> I am pasting below the timestamps from the log.
> >>>>>>>>>>
> >>>>>>>>>>        Lease-exception:
> >>>>>>>>>>
> >>>>>>>>>> Task Attempt:     attempt_201004060646_0057_r_000014_0
> >>>>>>>>>> Machine:          /default-rack/nimbus15
> >>>>>>>>>> Status:           FAILED    Progress: 0.00%
> >>>>>>>>>> Start Time:       8-Apr-2010 07:38:53
> >>>>>>>>>> Shuffle Finished: 8-Apr-2010 07:39:21 (27sec)
> >>>>>>>>>> Sort Finished:    8-Apr-2010 07:39:21 (0sec)
> >>>>>>>>>> Finish Time:      8-Apr-2010 09:54:33 (2hrs, 15mins, 39sec)
> >>>>>>>>>>
> >>>>>>>>>> -------------------------------------
> >>>>>>>>>>
> >>>>>>>>>>         DFS Client Exception:
> >>>>>>>>>>
> >>>>>>>>>> Task Attempt:     attempt_201004060646_0057_r_000006_0
> >>>>>>>>>> Machine:          /default-rack/nimbus3.cs.wright.edu
> >>>>>>>>>> Status:           FAILED    Progress: 0.00%
> >>>>>>>>>> Start Time:       8-Apr-2010 07:38:47
> >>>>>>>>>> Shuffle Finished: 8-Apr-2010 07:39:10 (23sec)
> >>>>>>>>>> Sort Finished:    8-Apr-2010 07:39:11 (0sec)
> >>>>>>>>>> Finish Time:      8-Apr-2010 08:51:33 (1hrs, 12mins, 46sec)
> >>>>>>>>>> ------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> The file limit is set to 1024. I checked a couple of datanodes.
> >>>>>>>>>> I haven't checked the headnode though.
> >>>>>>>>>>
> >>>>>>>>>> The number of currently open files under my username, on the
> >>>>>>>>>> system on which I started the MR jobs, is 346.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thank you for your help :)
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Raghava.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Apr 9, 2010 at 12:14 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Can you give me the timestamps of the two exceptions?
> >>>>>>>>>>> I want to see if they're related.
> >>>>>>>>>>>
> >>>>>>>>>>> I saw DFSClient$DFSOutputStream.close() in the first stack trace.
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 8, 2010 at 9:09 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Just to double check that it's not a file limits issue, could
> >>>>>>>>>>>> you run the following on each of the hosts:
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ ulimit -a
> >>>>>>>>>>>> $ lsof | wc -l
> >>>>>>>>>>>>
> >>>>>>>>>>>> The first command will show you (among other things) the file
> >>>>>>>>>>>> limits; it should be above the default 1024. The second will
> >>>>>>>>>>>> tell you how many files are currently open...
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 8, 2010 at 7:40 PM, Raghava Mutharaju <
> >>>>>>>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ted,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>         Thank you for all the suggestions. I went through the
> >>>>>>>>>>>>> job tracker logs and I have attached the exceptions found in
> >>>>>>>>>>>>> the logs. I found two exceptions:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> >>>>>>>>>>>>> Could not complete write to file    (DFS Client)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2) org.apache.hadoop.ipc.RemoteException:
> >>>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
> >>>>>>>>>>>>> No lease on
> >>>>>>>>>>>>> /user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014
> >>>>>>>>>>>>> File does not exist. Holder
> >>>>>>>>>>>>> DFSClient_attempt_201004060646_0057_r_000014_0 does not have
> >>>>>>>>>>>>> any open files.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The exception occurs at the point of writing out <K,V> pairs
> >>>>>>>>>>>>> in the reducer, and it occurs only in certain task attempts.
> >>>>>>>>>>>>> I am not using any custom output format or record writers,
> >>>>>>>>>>>>> but I do use a custom input reader.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What could have gone wrong here?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Raghava.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Raghava:
> >>>>>>>>>>>>>> Are you able to share the last segment of the reducer log?
> >>>>>>>>>>>>>> You can get it from the web UI:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Adding more logging in your reducer task would help pinpoint
> >>>>>>>>>>>>>> where the issue is.
> >>>>>>>>>>>>>> Also look in the job tracker log.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <
> >>>>>>>>>>>>>> m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> > Hi Ted,
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> >      Thank you for the suggestion. I enabled it using the
> >>>>>>>>>>>>>> > Configuration class because I cannot change the
> >>>>>>>>>>>>>> > hadoop-site.xml file (I am not an admin). The situation is
> >>>>>>>>>>>>>> > still the same --- it gets stuck at reduce 99% and does
> >>>>>>>>>>>>>> > not move further.
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > Regards,
> >>>>>>>>>>>>>> > Raghava.
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > > You need to turn it on yourself (hadoop-site.xml):
> >>>>>>>>>>>>>> > > <property>
> >>>>>>>>>>>>>> > >  <name>mapred.reduce.tasks.speculative.execution</name>
> >>>>>>>>>>>>>> > >  <value>true</value>
> >>>>>>>>>>>>>> > > </property>
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > <property>
> >>>>>>>>>>>>>> > >  <name>mapred.map.tasks.speculative.execution</name>
> >>>>>>>>>>>>>> > >  <value>true</value>
> >>>>>>>>>>>>>> > > </property>
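> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > If you cannot edit hadoop-site.xml, the same properties
> >>>>>>>>>>>>>> > > can usually be passed per job, provided the job driver
> >>>>>>>>>>>>>> > > goes through ToolRunner (myjob.jar and MyDriver below are
> >>>>>>>>>>>>>> > > placeholders):
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > $ hadoop jar myjob.jar MyDriver \
> >>>>>>>>>>>>>> > >     -D mapred.reduce.tasks.speculative.execution=true \
> >>>>>>>>>>>>>> > >     input output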
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <
> >>>>>>>>>>>>>> > > m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > > Hi,
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > >     Thank you Eric, Prashant and Greg. Although the
> >>>>>>>>>>>>>> > > > timeout problem was resolved, reduce is getting stuck
> >>>>>>>>>>>>>> > > > at 99%. As of now, it has been stuck there for about 3
> >>>>>>>>>>>>>> > > > hrs. That is too high a wait time for my task. Do you
> >>>>>>>>>>>>>> > > > guys see any reason for this?
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > >      Speculative execution is "on" by default, right?
> >>>>>>>>>>>>>> > > > Or should I enable it?
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > Regards,
> >>>>>>>>>>>>>> > > > Raghava.
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <
> >>>>>>>>>>>>>> > > > gregl@yahoo-inc.com> wrote:
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > >  Hi,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > I have also experienced this problem. Have you
> >>>>>>>>>>>>>> > > > > tried speculative execution? Also, I have had jobs
> >>>>>>>>>>>>>> > > > > that took a long time for one mapper / reducer
> >>>>>>>>>>>>>> > > > > because of a record that was significantly larger
> >>>>>>>>>>>>>> > > > > than those contained in the other filesplits. Do you
> >>>>>>>>>>>>>> > > > > know if it always slows down for the same filesplit?
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Greg Lawrence
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <
> >>>>>>>>>>>>>> > > > > m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Hello all,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >         I got the time out error as mentioned below
> >>>>>>>>>>>>>> > > > > -- after 600 seconds, that attempt was killed and
> >>>>>>>>>>>>>> > > > > the attempt would be deemed a failure. I searched
> >>>>>>>>>>>>>> > > > > around about this error, and one of the suggestions
> >>>>>>>>>>>>>> > > > > was to include "progress" statements in the reducer
> >>>>>>>>>>>>>> > > > > -- it might be taking longer than 600 seconds and so
> >>>>>>>>>>>>>> > > > > is timing out. I added calls to context.progress()
> >>>>>>>>>>>>>> > > > > and context.setStatus(str) in the reducer. Now, it
> >>>>>>>>>>>>>> > > > > works fine -- there are no timeout errors.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >         But, for a few jobs, it takes an awfully
> >>>>>>>>>>>>>> > > > > long time to move from "Map 100%, Reduce 99%" to
> >>>>>>>>>>>>>> > > > > Reduce 100%. For some jobs it is 15 mins and for
> >>>>>>>>>>>>>> > > > > some it was more than an hour. The reduce code is
> >>>>>>>>>>>>>> > > > > not complex -- a 2-level loop and a couple of
> >>>>>>>>>>>>>> > > > > if-else blocks. The input size is also not huge; the
> >>>>>>>>>>>>>> > > > > job that gets stuck for an hour at reduce 99% takes
> >>>>>>>>>>>>>> > > > > in 130 files. Some of them are 1-3 MB in size and a
> >>>>>>>>>>>>>> > > > > couple of them are 16 MB in size.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >         Has anyone encountered this problem before?
> >>>>>>>>>>>>>> > > > > Any pointers? I use Hadoop 0.20.2 on a Linux cluster
> >>>>>>>>>>>>>> > > > > of 16 nodes.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Thank you.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Raghava.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <
> >>>>>>>>>>>>>> > > > > m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Hi all,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >        I am running a series of jobs one after
> >>>>>>>>>>>>>> > > > > another. While executing the 4th job, the job fails.
> >>>>>>>>>>>>>> > > > > It fails in the reducer --- the progress percentage
> >>>>>>>>>>>>>> > > > > would be map 100%, reduce 99%. It gives out the
> >>>>>>>>>>>>>> > > > > following message:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> >>>>>>>>>>>>>> > > > > attempt_201003240138_0110_r_000018_1, Status : FAILED
> >>>>>>>>>>>>>> > > > > Task attempt_201003240138_0110_r_000018_1 failed to
> >>>>>>>>>>>>>> > > > > report status for 602 seconds. Killing!
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > It makes several attempts to execute it but fails
> >>>>>>>>>>>>>> > > > > with a similar message. I couldn't get anything from
> >>>>>>>>>>>>>> > > > > this error message and wanted to look at the logs
> >>>>>>>>>>>>>> > > > > (located in the default dir, ${HADOOP_HOME}/logs).
> >>>>>>>>>>>>>> > > > > But I don't find any files which match the timestamp
> >>>>>>>>>>>>>> > > > > of the job. Also, I did not find history and
> >>>>>>>>>>>>>> > > > > userlogs in the logs folder. Should I look at some
> >>>>>>>>>>>>>> > > > > other place for the logs? What could be the possible
> >>>>>>>>>>>>>> > > > > causes of the above error?
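> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > In case it helps: in 0.20 the per-task logs normally
> >>>>>>>>>>>>>> > > > > live on the node that ran the attempt, not on the
> >>>>>>>>>>>>>> > > > > submitting machine -- roughly as below, with the
> >>>>>>>>>>>>>> > > > > exact root depending on hadoop.log.dir; they are
> >>>>>>>>>>>>>> > > > > also reachable through the task tracker web UI on
> >>>>>>>>>>>>>> > > > > port 50060:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > $HADOOP_HOME/logs/userlogs/attempt_201003240138_0110_r_000018_1/
> >>>>>>>>>>>>>> > > > >     stdout  stderr  syslog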
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >        I am using Hadoop 0.20.2 and I am running it
> >>>>>>>>>>>>>> > > > > on a cluster with 16 nodes.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Thank you.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Raghava.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
