From: "Fuad Efendi"
To: mapreduce-user@hadoop.apache.org
Subject: RE: Copyphase dies on timeout
Date: Wed, 9 Dec 2009 13:39:14 -0500

I had similar issues before. A reducer can do 'additional' processing
and depend on external services. For instance, in my case it sends
documents to SOLR (Lucene-based), so the reducer has a dependency on
Lucene, and Lucene sometimes needs 30-50 minutes to perform an index
merge. If your timeout setting is not large enough, the task attempt
will be killed. You can keep the attempt alive by reporting progress
every 60 seconds or so from a separate thread spawned by your reducer.
Look at mapred.task.timeout, and use reporter.progress(),
reporter.setStatus(), etc.

Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca
Data Mining, Vertical Search

> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
> Sent: December-09-09 1:31 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Copyphase dies on timeout
>
> Hi folks,
>
> I have a question regarding a map-reduce job that gets killed during
> the copy phase due to a timeout. I don't completely understand the
> output, so I am first looking for some help understanding what is
> going on.
> There are about 50 mappers running and each mapper succeeds. The job
> is configured with a single reducer (I assume this could be closely
> related to that?!)
> So this is the output:
>
> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Merging 5 sorted segments
> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 5 segments left of total size: 216092233 bytes
> 2009-12-09 15:01:50,727 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Merge of the 5 files in-memory complete. Local file is /data/hadoop-data/mapred/local/taskTracker/jobcache/job_200912081402_0042/attempt_200912081402_0042_r_000000_0/output/map_275.out of size 216092229
> 2009-12-09 15:01:50,728 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Thread waiting: Thread for merging on-disk files
> 2009-12-09 15:01:58,109 INFO org.apache.hadoop.mapred.ReduceTask: Read 53281055 bytes from map-output for attempt_200912081402_0042_m_000253_0
> 2009-12-09 15:01:58,148 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000253_0 -> (71, 103) from hadoop32
> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 44 map output(s) where 3 is already in progress
> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 41 dup hosts)
> 2009-12-09 15:02:40,174 INFO org.apache.hadoop.mapred.ReduceTask: Read 81297732 bytes from map-output for attempt_200912081402_0042_m_000112_0
> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000112_0 -> (129, 153) from hadoop31
> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow hosts and 16 dup hosts)
> 2009-12-09 15:02:40,223 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 82364774 bytes (82364778 raw bytes) into RAM from attempt_200912081402_0042_m_000121_0
> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: Read 82722679 bytes from map-output for attempt_200912081402_0042_m_000196_0
> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000196_0 -> (114, 137) from hadoop21
> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow hosts and 32 dup hosts)
> 2009-12-09 15:02:45,143 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 82639206 bytes (82639210 raw bytes) into RAM from attempt_200912081402_0042_m_000195_0
> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 42 map output(s) where 3 is already in progress
> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 39 dup hosts)
> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 42 map output(s) where 3 is already in progress
> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 39 dup hosts)
> ......
>
> Task attempt_200912081402_0042_r_000000_0 failed to report status for
> 902 seconds. Killing!
>
> My first guess would be some IO problem, but I don't have any
> indicator of that in these logs. It would be great if somebody could
> clarify or give me a hint. Even if it is something like: use more
> reducers because of ... <-- this is the interesting part at the moment!
> btw: this is hadoop-0.20
>
> simon
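
PS: the keep-alive thread suggested above can be sketched as follows.
This is a minimal, hypothetical example: the progress call is
abstracted as a Runnable so the sketch compiles and runs outside
Hadoop; inside a real 0.20 reducer the Runnable body would call
reporter.progress() (old org.apache.hadoop.mapred API) instead.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch: while the reducer blocks on a slow external call (e.g. a
 * Lucene index merge), a daemon thread keeps reporting progress so the
 * framework does not kill the attempt after mapred.task.timeout
 * (600000 ms by default). The progress call is passed in as a Runnable
 * so this runs without Hadoop on the classpath.
 */
public class ReduceHeartbeat {

    public static Thread start(final Runnable reportProgress, final long periodMillis) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        reportProgress.run();   // reporter.progress() in a real job
                        Thread.sleep(periodMillis);
                    }
                } catch (InterruptedException done) {
                    // the reducer finished its slow work; just exit
                }
            }
        }, "reduce-heartbeat");
        t.setDaemon(true);   // never keep the task JVM alive on its own
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        final AtomicInteger pings = new AtomicInteger();
        // Stand-in for reporter.progress(): just count the heartbeats.
        Thread hb = start(new Runnable() {
            public void run() { pings.incrementAndGet(); }
        }, 200);
        Thread.sleep(700);   // stands in for the slow external call
        hb.interrupt();      // stop the heartbeat once the work is done
        hb.join();
        System.out.println("pings=" + pings.get());
    }
}
```

If a heartbeat is impractical, the timeout itself can be raised on the
JobConf, e.g. conf.setLong("mapred.task.timeout", 1800000) for 30
minutes (the value is in milliseconds).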