Return-Path: Delivered-To: apmail-hadoop-mapreduce-user-archive@minotaur.apache.org Received: (qmail 61305 invoked from network); 9 Dec 2009 19:02:05 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 9 Dec 2009 19:02:05 -0000 Received: (qmail 14799 invoked by uid 500); 9 Dec 2009 19:02:05 -0000 Delivered-To: apmail-hadoop-mapreduce-user-archive@hadoop.apache.org Received: (qmail 14754 invoked by uid 500); 9 Dec 2009 19:02:04 -0000 Mailing-List: contact mapreduce-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mapreduce-user@hadoop.apache.org Delivered-To: mailing list mapreduce-user@hadoop.apache.org Received: (qmail 14745 invoked by uid 99); 9 Dec 2009 19:02:04 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Dec 2009 19:02:04 +0000 X-ASF-Spam-Status: No, hits=-2.6 required=5.0 tests=AWL,BAYES_00 X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [64.34.111.254] (HELO barmail1.idig.net) (64.34.111.254) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Dec 2009 19:02:02 +0000 Received: from cpweb10.idig.net (cpweb10.idig.net [65.39.182.10]) by barmail1.idig.net (Spam & Virus Firewall) with ESMTP id 2872C39CB914 for ; Wed, 9 Dec 2009 11:01:40 -0800 (PST) Received: from cpweb10.idig.net (cpweb10.idig.net [65.39.182.10]) by barmail1.idig.net with ESMTP id tME2BFmISQISe160 for ; Wed, 09 Dec 2009 11:01:40 -0800 (PST) Received: from efendi-140-187.cust.b2b2c.ca ([72.10.140.187] helo=FuadPC) by cpweb10.idig.net with esmtp (Exim 4.69) (envelope-from ) id 1NIRn4-0003cP-BO for mapreduce-user@hadoop.apache.org; Wed, 09 Dec 2009 11:01:39 -0800 From: "Fuad Efendi" To: References: <004101ca78fe$ea6a6cf0$bf3f46d0$@ca> In-Reply-To: Subject: RE: Copyphase dies on timeout Date: Wed, 9 Dec 2009 14:01:23 -0500 Message-ID: <004e01ca7902$022ba270$0682e750$@ca> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Office Outlook 12.0 Thread-Index: Acp4/+BikWR6mmerS3aL+peNDLsIdwAAhc6A Content-Language: en-ca X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - cpweb10.idig.net X-AntiAbuse: Original Domain - hadoop.apache.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - efendi.ca X-Source: X-Source-Args: X-Source-Dir: > >> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow = hosts > >> and39 dup hosts) - Reduce was started - you are probably using kind of web crawler (39 duplicate hosts, 0 slow = hosts) > >> Task attempt_200912081402_0042_r_000000_0 failed to report status = for > >> 902 seconds. Killing! - Try to increase mapred.task.timeout 10 times; you are = currently using 900 seconds. Another use case which I had before (with Reducer): TIKA hangs forever = on some weird HTML. And another use case (BIXO): HttpClient configured with 20 attempts, 30 = seconds timeout, which makes 20*30 of "non reporting delay" so that = default (mapred.task.timeout) 600 seconds setting is not good. Fuad Efendi +1 416-993-2060 http://www.tokenizer.ca Data Mining, Vertical Search > -----Original Message----- > From: Simon Willnauer [mailto:simon.willnauer@googlemail.com] > Sent: December-09-09 1:46 PM > To: Fuad Efendi > Cc: mapreduce-user@hadoop.apache.org > Subject: Re: Copyphase dies on timeout >=20 > Hmm it looks like the reducer hasn't even started. and I do not have > an external resource at all. >=20 > simon >=20 > On Wed, Dec 9, 2009 at 7:39 PM, Fuad Efendi wrote: > > I had similar issues before; Reduce can do some 'additional' = processing > and have dependency on external services. > > > > For instance, in my case it sends documents to SOLR (Lucene based). = So > that I have dependency on Lucene in a reducer. And, Lucene sometimes = needs > 30-50 minutes to perform index merge. If your timeout settings are not > enough, job will be killed. You may also execute reporter.keepAlive() = each > 60 seconds (in a separate thread spawned by your reducer). > > > > See at mapred.task.timeout, and use = reporter.progress(), > keepAlive(), etc. > > > > > > Fuad Efendi > > +1 416-993-2060 > > http://www.tokenizer.ca > > Data Mining, Vertical Search > > > > > >> -----Original Message----- > >> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com] > >> Sent: December-09-09 1:31 PM > >> To: mapreduce-user@hadoop.apache.org > >> Subject: Copyphase dies on timeout > >> > >> Hi folks, > >> > >> I do have a questions regarding a map-reduce job who gets killed > >> during the copy phase due to a timeout. I don't really understand = the > >> output completely so I first look for some help to understand what = is > >> going on. > >> There are about 50 mapper running where each mapper succeeds. the = job > >> is configured with a single reduces (I assume this could be closely > >> related to that?!) > >> So this is the output: > >> > >> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: = Merging > >> 5 sorted segments > >> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Down = to > >> the last merge-pass, with 5 segments left of total size: 216092233 > >> bytes > >> 2009-12-09 15:01:50,727 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Merge of the 5 files in-memory > >> complete. Local file is > >> /data/hadoop- > >> > = data/mapred/local/taskTracker/jobcache/job_200912081402_0042/attempt_2009= 1 > >> 2081402_0042_r_000000_0/output/map_275.out > >> of size 216092229 > >> 2009-12-09 15:01:50,728 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Thread waiting: Thread for > >> merging on-disk files > >> 2009-12-09 15:01:58,109 INFO org.apache.hadoop.mapred.ReduceTask: = Read > >> 53281055 bytes from map-output for > >> attempt_200912081402_0042_m_000253_0 > >> 2009-12-09 15:01:58,148 INFO org.apache.hadoop.mapred.ReduceTask: = Rec > >> #1 from attempt_200912081402_0042_m_000253_0 -> (71, 103) from > >> hadoop32 > >> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Need another 44 map output(s) > >> where 3 is already in progress > >> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow = hosts > >> and41 dup hosts) > >> 2009-12-09 15:02:40,174 INFO org.apache.hadoop.mapred.ReduceTask: = Read > >> 81297732 bytes from map-output for > >> attempt_200912081402_0042_m_000112_0 > >> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: = Rec > >> #1 from attempt_200912081402_0042_m_000112_0 -> (129, 153) from > >> hadoop31 > >> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow = hosts > >> and16 dup hosts) > >> 2009-12-09 15:02:40,223 INFO org.apache.hadoop.mapred.ReduceTask: > >> Shuffling 82364774 bytes (82364778 raw bytes) into RAM from > >> attempt_200912081402_0042_m_000121_0 > >> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: = Read > >> 82722679 bytes from map-output for > >> attempt_200912081402_0042_m_000196_0 > >> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: = Rec > >> #1 from attempt_200912081402_0042_m_000196_0 -> (114, 137) from > >> hadoop21 > >> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow = hosts > >> and32 dup hosts) > >> 2009-12-09 15:02:45,143 INFO org.apache.hadoop.mapred.ReduceTask: > >> Shuffling 82639206 bytes (82639210 raw bytes) into RAM from > >> attempt_200912081402_0042_m_000195_0 > >> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Need another 42 map output(s) > >> where 3 is already in progress > >> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow = hosts > >> and39 dup hosts) > >> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Need another 42 map output(s) > >> where 3 is already in progress > >> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: > >> attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow = hosts > >> and39 dup hosts) > >> ...... > >> > >> Task attempt_200912081402_0042_r_000000_0 failed to report status = for > >> 902 seconds. Killing! > >> > >> my first guess would be some IO problem but I don not have any > >> indicator that those logs. If somebody can clarify or give me a = hint > >> would be great. Even if it is something like: use more reducers > >> because of ... <-- this is the interesting part at the moment! > >> btw: this is hadoop-0.20 > >> > >> simon > > > > > >