Subject: RE: Program freezes at Map 99% Reduce 33%
Date: Thu, 24 Mar 2011 17:03:10 -0400
From: Kevin.Leach@thomsonreuters.com
To: common-user@hadoop.apache.org

Shi,

This page states "Of course, the framework discards the sub-directory of unsuccessful task-attempts.":
http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html

So yes, the missing directory is likely a failed task attempt. If you can, narrow the problem down by running sections of your data through a smaller number of mappers. Take just the tail of your input and a couple of mappers, and then compare that to your program run without Hadoop.
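Something like the quick standalone check below is what I mean -- just a sketch, assuming plain tab-delimited text input; the file name and expected field count are placeholders, not taken from your job:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Standalone check: scan a local slice of the input (e.g. the tail the stuck
// mapper was working on) and count lines that do not split into the expected
// number of tab-separated fields. Path and field count are placeholders.
public class TailRecordCheck {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "input-tail.txt"; // hypothetical sample file
        int expectedFields = 2;                                     // assumed record width
        long lineNo = 0, bad = 0;
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                // -1 keeps trailing empty fields, so truncated records stay visible
                String[] fields = line.split("\t", -1);
                if (fields.length != expectedFields) {
                    bad++;
                    System.err.println("Suspect record at line " + lineNo + ": " + line);
                }
            }
        } finally {
            in.close();
        }
        System.out.println(lineNo + " lines scanned, " + bad + " suspect records");
    }
}

Pull the tail off the big input first (for example with tail -n on one of the source files), run the class on it locally, and if it reports nothing suspect then a bad record at the end of the data becomes less likely as the cause.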
Kevin

-----Original Message-----
From: Shi Yu [mailto:shiyu@uchicago.edu]
Sent: Thursday, March 24, 2011 3:57 PM
To: common-user@hadoop.apache.org
Subject: Re: Program freezes at Map 99% Reduce 33%

Hi Kevin, thanks for the reply. I can hardly imagine an example of an incomplete record. The mapper is very simple: it reads line by line as Strings, splits each line by tab, and outputs a Text pair for sort and secondary sort. If there were an incomplete record, there would be an error, and the only place it could happen is the tab-splitting stage. I use the LZO codec to compress the mapper output and the reducer output.

There is one issue which I think might be the reason. The files in the log directory have the following structure. It seems I had 294 mappers. Notice that one directory is missing: "attempt_201103231501_0007_m_000292_0" -- where is the output of mapper Nr. 292? Was it a failed node or something? All the logs have just been sitting there for 2 hours (last modified around 12:35, while the current time is 14:50); the few folders with later timestamps were only visited by me, around 14:45. So no reducer or mapper has generated any logs in the past two hours.

drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36 attempt_201103231501_0007_m_000281_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000282_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000283_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35 attempt_201103231501_0007_m_000284_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000285_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59 attempt_201103231501_0007_m_000286_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000287_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44 attempt_201103231501_0007_m_000288_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000289_0
drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37 attempt_201103231501_0007_m_000289_1
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22 attempt_201103231501_0007_m_000290_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45 attempt_201103231501_0007_m_000291_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_m_000293_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000000_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000001_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000002_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000003_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000004_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000005_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000006_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000007_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47 attempt_201103231501_0007_r_000008_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000009_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000010_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000011_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000012_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000013_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000014_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000015_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000016_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000017_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42 attempt_201103231501_0007_r_000018_0

Shi
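P.S. A simplified sketch of a mapper of the shape described above (line in, tab split, Text pair out). The class, counter, and field names are placeholders, not the actual job code, and the malformed-line counter is only there to show where a truncated record would surface:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: read a line, split on tab, emit a (composite key, value)
// Text pair for sort and secondary sort. Field layout is an assumption.
public class TabSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        if (fields.length < 2) {
            // Count and skip malformed records instead of failing the task;
            // the counter shows up in the job UI, so truncated input is visible.
            context.getCounter("TabSplitMapper", "MALFORMED_LINES").increment(1);
            return;
        }
        // Composite key "primary\tsecondary" so a custom partitioner/comparator
        // can group on the first part and order on the second.
        outKey.set(fields[0] + "\t" + fields[1]);
        outValue.set(fields[1]);
        context.write(outKey, outValue);
    }
}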
On 3/24/2011 2:25 PM, Kevin.Leach@thomsonreuters.com wrote:
> Shi,
>
> The key here is the 99% done mapper. Nothing can move on until all
> mappers complete.
> Is it possible your data in the larger set has an incomplete record or
> some such at the end?
>
> Kevin
>
> -----Original Message-----
> From: Shi Yu [mailto:shiyu@uchicago.edu]
> Sent: Thursday, March 24, 2011 3:02 PM
> To: hadoop user
> Subject: Program freezes at Map 99% Reduce 33%
>
> I am running a Hadoop program processing terabyte-scale data. The code
> was tested successfully on a small sample (100G) and it worked. However,
> when trying it on the full problem, the program freezes forever at Map
> 99% Reduce 33%. There is no error, and the userlog folder stays small
> (<30M), since otherwise it would generate gigabytes of error logs.
>
> I checked the mapper and reducer logs; it seems the reducer is waiting
> for a map output that never arrives. What could be causing this? Most
> of the configurations are left at their defaults. I set
> "mapred.child.java.opts=-Xmx2000M
> hadoop.job.history.user.location=none". The problem occurs on both
> 0.19.2 and 0.20.2. Thanks!
>
> Example of Mapper logs:
>
> 2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging
> 3 sorted segments
> 2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to
> the last merge-pass, with 3 segments left of total size: 461743 bytes
> 2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index:
> (11015008, 10030254, 607594)
> 2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner:
> Task:attempt_201103231501_0007_m_000286_0 is done. And is in the
> process of commiting
> 2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_201103231501_0007_m_000286_0' done.
>
> Example of Reducer logs:
>
> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
> and0 dup hosts)
> 2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Need another 1 map output(s)
> where 0 is already in progress
> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
> and0 dup hosts)
>
>
> Shi
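For reference, a minimal 0.20-style driver sketch showing where the two properties quoted above would be set. The job name, paths, reducer count, and the TabSplitMapper class refer to the illustrative sketch earlier in this thread, not to the actual job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Placeholder driver: only the two properties mentioned in the thread are set
// explicitly; everything else stays at its defaults. LZO map-output compression
// (which the job reportedly uses) is omitted here.
public class FreezeJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.child.java.opts", "-Xmx2000M");       // from the thread
        conf.set("hadoop.job.history.user.location", "none");  // from the thread

        Job job = new Job(conf, "tab-split secondary sort");   // name is a placeholder
        job.setJarByClass(FreezeJobDriver.class);
        job.setMapperClass(TabSplitMapper.class);              // illustrative mapper above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(19);                             // matches r_000000..r_000018 in the listing

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}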