From: Raghava Mutharaju <m.vijayaraghava@gmail.com>
Date: Sun, 18 Apr 2010 04:24:43 -0400
Subject: Re: Reduce gets struck at 99%
To: common-user@hadoop.apache.org, Ted Yu
Cc: mapreduce-user@hadoop.apache.org

Hi,

        Thank you Ted. I would just describe the problem again, so that it is easier for anyone reading this email chain.

I run a series of jobs one after another. Starting from the 4th job, the Reducer gets stuck at 99% (Map 100% and Reduce 99%). It stays stuck at 99% for many hours and then the job fails. Earlier there were 2 exceptions in the logs --- a DFSClient exception (could not completely write into a file <file name>) and a LeaseExpiredException. Then I increased ulimit -n (max no. of open files) from 1024 to 32768 on the advice of Ted. After this, there are no exceptions in the logs, but the reduce still gets stuck at 99%.

Do you have any suggestions?

Thank you.

Regards,
Raghava.
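The workflow described above is a chain of MapReduce jobs run back to back, each stage reading the previous stage's output. A rough sketch of that pattern, assuming the Hadoop 0.20 mapreduce API (the class name, stage names and paths below are illustrative, not taken from the actual workload):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: run several MR jobs in sequence and stop the chain
// as soon as one stage fails instead of silently moving on.
public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        for (int stage = 1; stage <= 4; stage++) {
            Job job = new Job(conf, "stage-" + stage);
            job.setJarByClass(ChainedJobsDriver.class);
            // ... set the mapper, reducer and key/value classes for this stage ...
            Path output = new Path(args[1] + "/stage-" + stage);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) {
                System.err.println("stage-" + stage + " failed, stopping the chain");
                System.exit(1);
            }
            input = output;   // the next stage reads this stage's output
        }
    }
}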
On Sat, Apr 17, 2010 at 9:36 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Hi,
> Putting this thread back in the pool to leverage collective intelligence.
>
> If you get the full command line of the java processes, it wouldn't be
> difficult to correlate reduce task(s) with a particular job.
>
> Cheers
>
> On Sat, Apr 17, 2010 at 2:20 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
>
> > Hello Ted,
> >
> >        Thank you for the suggestions :). I haven't come across any other
> > serious issue before this one. In fact, the same MR job runs for a smaller
> > input size, although a lot slower than what we expected.
> >
> > I will use jstack to get a stack trace. I had a question in this regard. How
> > would I know which MR job (job id) is related to which java process (pid)? I
> > can get a list of hadoop jobs with "hadoop job -list" and a list of java
> > processes with "jps", but I couldn't determine how to get the connection
> > between these 2 lists.
> >
> > Thank you again.
> >
> > Regards,
> > Raghava.
> >
> > On Fri, Apr 16, 2010 at 11:07 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> If you look at
> >> https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12408776,
> >> you can see that hdfs-127-branch20-redone-v2.txt
> >> (https://issues.apache.org/jira/secure/attachment/12431012/hdfs-127-branch20-redone-v2.txt)
> >> was the latest.
> >>
> >> You need to download the source code corresponding to your version of
> >> hadoop, apply the patch and rebuild.
> >>
> >> If you haven't experienced serious issues with hadoop in other scenarios,
> >> we should try to find out the root cause of the current problem without
> >> the 127 patch.
> >>
> >> My advice is to use jstack to find out what each thread is waiting for
> >> after the reducers get stuck.
> >> I would expect a deadlock in either your code or HDFS; I would think it
> >> would be the former.
> >>
> >> You can replace sensitive names in the stack traces and paste them if you
> >> cannot determine the deadlock.
> >>
> >> Cheers
> >>
> >> On Fri, Apr 16, 2010 at 5:46 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >>
> >>> Hello Ted,
> >>>
> >>>        Thank you for the reply. Will this change fix my issue? I ask
> >>> because I again need to convince my admin to make this change.
> >>>
> >>>        We have a gateway to the cluster-head. We generally run our MR jobs
> >>> on the gateway. Should this change be made to the hadoop installation on
> >>> the gateway?
> >>>
> >>> 1) I am confused about which patch should be applied. There are 4 patches
> >>> listed at https://issues.apache.org/jira/browse/HDFS-127
> >>>
> >>> 2) How do I apply the patch? Should we change the lines of code specified
> >>> and rebuild hadoop? Or is there any other way?
> >>>
> >>> Thank you again.
> >>>
> >>> Regards,
> >>> Raghava.
> >>>
> >>> On Fri, Apr 16, 2010 at 6:42 PM, <yuzhihong@gmail.com> wrote:
> >>>
> >>>> That patch is very important.
> >>>>
> >>>> please apply it.
> >>>>
> >>>> Sent from my Verizon Wireless BlackBerry
> >>>> ------------------------------
> >>>> *From: * Raghava Mutharaju <m.vijayaraghava@gmail.com>
> >>>> *Date: *Fri, 16 Apr 2010 17:27:11 -0400
> >>>> *To: *Ted Yu <yuzhihong@gmail.com>
> >>>> *Subject: *Re: Reduce gets struck at 99%
> >>>>
> >>>> Hi Ted,
> >>>>
> >>>>         It took some time to contact my department's admin (he was on
> >>>> leave) and ask him to make the ulimit changes effective in the cluster
> >>>> (just adding an entry in /etc/security/limits.conf was not sufficient,
> >>>> so it took some time to figure out). Now the ulimit is 32768. I ran the
> >>>> set of MR jobs, and the result is the same --- it gets stuck at Reduce
> >>>> 99%. But this time, there are no exceptions in the logs. I view the
> >>>> JobTracker logs through the Web UI. I checked "Running Jobs" as well as
> >>>> "Failed Jobs".
> >>>>
> >>>> I haven't asked the admin to apply the patch
> >>>> https://issues.apache.org/jira/browse/HDFS-127 that you mentioned
> >>>> earlier. Is this important?
> >>>>
> >>>> Do you have any suggestions?
> >>>>
> >>>> Thank you.
> >>>>
> >>>> Regards,
> >>>> Raghava.
> >>>>
> >>>> On Fri, Apr 9, 2010 at 3:35 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>
> >>>>> For the user under whom you launch MR jobs.
> >>>>>
> >>>>> On Fri, Apr 9, 2010 at 12:02 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Ted,
> >>>>>>
> >>>>>>        Sorry to bug you again :) but I do not have an account on all
> >>>>>> the datanodes; I just have one on the machine on which I start the MR
> >>>>>> jobs. So is it required to increase the ulimit on all the nodes (in
> >>>>>> which case the admin may have to increase it for all the users)?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Raghava.
> >>>>>>
> >>>>>> On Fri, Apr 9, 2010 at 11:43 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>
> >>>>>>> ulimit should be increased on all nodes.
> >>>>>>>
> >>>>>>> The link I gave you lists several actions to take. I think they're
> >>>>>>> not specifically for hbase.
> >>>>>>> Also make sure the following is applied:
> >>>>>>> https://issues.apache.org/jira/browse/HDFS-127
> >>>>>>>
> >>>>>>> On Thu, Apr 8, 2010 at 10:13 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hello Ted,
> >>>>>>>>
> >>>>>>>>        Should the increase in ulimit to 32768 be applied on all the
> >>>>>>>> datanodes (it's a 16-node cluster)? Is this related to HBase? I ask
> >>>>>>>> because I am not using HBase.
> >>>>>>>>        Are the exceptions & delay (at Reduce 99%) due to this?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Raghava.
> >>>>>>>>
> >>>>>>>> On Fri, Apr 9, 2010 at 1:01 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Your ulimit is low.
> >>>>>>>>> Ask your admin to increase it to 32768.
> >>>>>>>>>
> >>>>>>>>> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 8, 2010 at 9:46 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Ted,
> >>>>>>>>>>
> >>>>>>>>>> I am pasting below the timestamps from the log.
> >>>>>>>>>>
> >>>>>>>>>> Lease exception:
> >>>>>>>>>>
> >>>>>>>>>> Task attempt: attempt_201004060646_0057_r_000014_0
> >>>>>>>>>> Machine: /default-rack/nimbus15
> >>>>>>>>>> Status: FAILED    Progress: 0.00%
> >>>>>>>>>> Start time: 8-Apr-2010 07:38:53
> >>>>>>>>>> Shuffle finished: 8-Apr-2010 07:39:21 (27sec)
> >>>>>>>>>> Sort finished: 8-Apr-2010 07:39:21 (0sec)
> >>>>>>>>>> Finish time: 8-Apr-2010 09:54:33 (2hrs, 15mins, 39sec)
> >>>>>>>>>>
> >>>>>>>>>> -------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> DFS Client Exception:
> >>>>>>>>>>
> >>>>>>>>>> Task attempt: attempt_201004060646_0057_r_000006_0
> >>>>>>>>>> Machine: /default-rack/nimbus3.cs.wright.edu
> >>>>>>>>>> Status: FAILED    Progress: 0.00%
> >>>>>>>>>> Start time: 8-Apr-2010 07:38:47
> >>>>>>>>>> Shuffle finished: 8-Apr-2010 07:39:10 (23sec)
> >>>>>>>>>> Sort finished: 8-Apr-2010 07:39:11 (0sec)
> >>>>>>>>>> Finish time: 8-Apr-2010 08:51:33 (1hrs, 12mins, 46sec)
> >>>>>>>>>>
> >>>>>>>>>> ------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> The file limit is set to 1024. I checked a couple of datanodes; I
> >>>>>>>>>> haven't checked the headnode though.
> >>>>>>>>>>
> >>>>>>>>>> The number of currently open files under my username, on the system
> >>>>>>>>>> on which I started the MR jobs, is 346.
> >>>>>>>>>>
> >>>>>>>>>> Thank you for your help :)
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Raghava.
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Apr 9, 2010 at 12:14 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Can you give me the timestamps of the two exceptions?
> >>>>>>>>>>> I want to see if they're related.
> >>>>>>>>>>>
> >>>>>>>>>>> I saw DFSClient$DFSOutputStream.close() in the first stack trace.
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 8, 2010 at 9:09 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Just to double check it's not a file limits issue, could you run
> >>>>>>>>>>>> the following on each of the hosts:
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ ulimit -a
> >>>>>>>>>>>> $ lsof | wc -l
> >>>>>>>>>>>>
> >>>>>>>>>>>> The first command will show you (among other things) the file
> >>>>>>>>>>>> limits; it should be above the default 1024. The second will tell
> >>>>>>>>>>>> you how many files are currently open...
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 8, 2010 at 7:40 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Ted,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>         Thank you for all the suggestions. I went through the
> >>>>>>>>>>>>> job tracker logs and I have attached the exceptions found in the
> >>>>>>>>>>>>> logs. I found two exceptions:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> >>>>>>>>>>>>> Could not complete write to file    (DFS Client)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2) org.apache.hadoop.ipc.RemoteException:
> >>>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
> >>>>>>>>>>>>> /user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014
> >>>>>>>>>>>>> File does not exist. Holder DFSClient_attempt_201004060646_0057_r_000014_0
> >>>>>>>>>>>>> does not have any open files.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The exception occurs at the point of writing out <K,V> pairs in
> >>>>>>>>>>>>> the reducer and it occurs only in certain task attempts.
> >>>>>>>>>>>>> I am not using any custom output format or record writers, but
> >>>>>>>>>>>>> I do use a custom input reader.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What could have gone wrong here?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Raghava.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Apr 8, 2010 at 5:51 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Raghava:
> >>>>>>>>>>>>>> Are you able to share the last segment of the reducer log?
> >>>>>>>>>>>>>> You can get it from the web UI:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Adding more logging in your reducer task would help pinpoint
> >>>>>>>>>>>>>> where the issue is.
> >>>>>>>>>>>>>> Also look in the job tracker log.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Apr 8, 2010 at 2:46 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> > Hi Ted,
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> >       Thank you for the suggestion. I enabled it using the
> >>>>>>>>>>>>>> > Configuration class because I cannot change the hadoop-site.xml
> >>>>>>>>>>>>>> > file (I am not an admin). The situation is still the same ---
> >>>>>>>>>>>>>> > it gets stuck at reduce 99% and does not move further.
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > Regards,
> >>>>>>>>>>>>>> > Raghava.
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > On Thu, Apr 8, 2010 at 4:40 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>>>>>>>>>>>> >
> >>>>>>>>>>>>>> > > You need to turn it on yourself (hadoop-site.xml):
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > <property>
> >>>>>>>>>>>>>> > >   <name>mapred.reduce.tasks.speculative.execution</name>
> >>>>>>>>>>>>>> > >   <value>true</value>
> >>>>>>>>>>>>>> > > </property>
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > <property>
> >>>>>>>>>>>>>> > >   <name>mapred.map.tasks.speculative.execution</name>
> >>>>>>>>>>>>>> > >   <value>true</value>
> >>>>>>>>>>>>>> > > </property>
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > On Thu, Apr 8, 2010 at 1:14 PM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>> > >
> >>>>>>>>>>>>>> > > > Hi,
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > >      Thank you Eric, Prashant and Greg. Although the
> >>>>>>>>>>>>>> > > > timeout problem was resolved, the reduce is getting stuck
> >>>>>>>>>>>>>> > > > at 99%. As of now, it has been stuck there for about 3 hrs.
> >>>>>>>>>>>>>> > > > That is too high a wait time for my task. Do you guys see
> >>>>>>>>>>>>>> > > > any reason for this?
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > >      Speculative execution is "on" by default, right? Or
> >>>>>>>>>>>>>> > > > should I enable it?
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > Regards,
> >>>>>>>>>>>>>> > > > Raghava.
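For reference, the two speculative-execution properties quoted above can also be set from the job driver when hadoop-site.xml cannot be edited, presumably along the lines of the Configuration-class approach mentioned in this thread. A minimal sketch, assuming the Hadoop 0.20 mapreduce API; the driver class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver fragment: enable speculative execution programmatically
// when hadoop-site.xml cannot be changed (e.g. no admin access).
public class SpeculativeExecutionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same keys as the hadoop-site.xml <property> blocks quoted above.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

        Job job = new Job(conf, "job with speculative execution enabled");
        // ... set the jar, mapper, reducer, and input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}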
> >>>>>>>>>>>>>> > > > On Thu, Apr 8, 2010 at 3:15 PM, Gregory Lawrence <gregl@yahoo-inc.com> wrote:
> >>>>>>>>>>>>>> > > >
> >>>>>>>>>>>>>> > > > > Hi,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > I have also experienced this problem. Have you tried
> >>>>>>>>>>>>>> > > > > speculative execution?
> >>>>>>>>>>>>>> > > > > Also, I have had jobs that took a long time for one
> >>>>>>>>>>>>>> > > > > mapper / reducer because of a record that was
> >>>>>>>>>>>>>> > > > > significantly larger than those contained in the other
> >>>>>>>>>>>>>> > > > > filesplits. Do you know if it always slows down for the
> >>>>>>>>>>>>>> > > > > same filesplit?
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Greg Lawrence
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > On 4/8/10 10:30 AM, "Raghava Mutharaju" <m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Hello all,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >      I got the time out error as mentioned below -- after
> >>>>>>>>>>>>>> > > > > 600 seconds, that attempt was killed and the attempt would
> >>>>>>>>>>>>>> > > > > be deemed a failure. I searched around about this error,
> >>>>>>>>>>>>>> > > > > and one of the suggestions was to include "progress"
> >>>>>>>>>>>>>> > > > > statements in the reducer -- it might be taking longer
> >>>>>>>>>>>>>> > > > > than 600 seconds and so is timing out. I added calls to
> >>>>>>>>>>>>>> > > > > context.progress() and context.setStatus(str) in the
> >>>>>>>>>>>>>> > > > > reducer. Now it works fine -- there are no timeout errors.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >      But, for a few jobs, it takes an awfully long time to
> >>>>>>>>>>>>>> > > > > move from "Map 100%, Reduce 99%" to Reduce 100%. For some
> >>>>>>>>>>>>>> > > > > jobs it's 15 mins and for some it was more than an hour.
> >>>>>>>>>>>>>> > > > > The reduce code is not complex -- a 2-level loop and a
> >>>>>>>>>>>>>> > > > > couple of if-else blocks. The input size is also not huge;
> >>>>>>>>>>>>>> > > > > the job that gets stuck for an hour at reduce 99% takes in
> >>>>>>>>>>>>>> > > > > 130 files. Some of them are 1-3 MB in size and a couple of
> >>>>>>>>>>>>>> > > > > them are 16 MB in size.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >      Has anyone encountered this problem before? Any
> >>>>>>>>>>>>>> > > > > pointers? I use Hadoop 0.20.2 on a linux cluster of 16
> >>>>>>>>>>>>>> > > > > nodes.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Thank you.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Raghava.
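A minimal sketch of the progress-reporting pattern described above (periodic context.progress() and context.setStatus() calls inside a long-running reduce), assuming the Hadoop 0.20 mapreduce API; the key/value types and the per-record work are illustrative, not taken from the actual reducer:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: report progress periodically so the task is not
// killed after mapred.task.timeout (600 seconds by default in 0.20).
public class ProgressReportingReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long seen = 0;
        for (LongWritable value : values) {
            sum += value.get();           // placeholder for the real per-record work
            if (++seen % 10000 == 0) {
                context.progress();       // tell the framework the task is still alive
                context.setStatus("key " + key + ": processed " + seen + " values");
            }
        }
        context.write(key, new LongWritable(sum));
    }
}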
> >>>>>>>>>>>>>> > > > > On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju <m.vijayaraghava@gmail.com> wrote:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Hi all,
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >       I am running a series of jobs one after another.
> >>>>>>>>>>>>>> > > > > While executing the 4th job, the job fails. It fails in
> >>>>>>>>>>>>>> > > > > the reducer --- the progress percentage would be map 100%,
> >>>>>>>>>>>>>> > > > > reduce 99%. It gives out the following message:
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > 10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
> >>>>>>>>>>>>>> > > > > attempt_201003240138_0110_r_000018_1, Status : FAILED
> >>>>>>>>>>>>>> > > > > Task attempt_201003240138_0110_r_000018_1 failed to report
> >>>>>>>>>>>>>> > > > > status for 602 seconds. Killing!
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > It makes several more attempts to execute it but fails
> >>>>>>>>>>>>>> > > > > with a similar message. I couldn't get anything from this
> >>>>>>>>>>>>>> > > > > error message and wanted to look at the logs (located in
> >>>>>>>>>>>>>> > > > > the default dir, ${HADOOP_HOME}/logs). But I don't find
> >>>>>>>>>>>>>> > > > > any files which match the timestamp of the job. Also, I
> >>>>>>>>>>>>>> > > > > did not find history and userlogs in the logs folder.
> >>>>>>>>>>>>>> > > > > Should I look at some other place for the logs? What could
> >>>>>>>>>>>>>> > > > > be the possible causes of the above error?
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > >       I am using Hadoop 0.20.2 and I am running it on a
> >>>>>>>>>>>>>> > > > > cluster with 16 nodes.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Thank you.
> >>>>>>>>>>>>>> > > > >
> >>>>>>>>>>>>>> > > > > Regards,
> >>>>>>>>>>>>>> > > > > Raghava.