Subject: RE: Program freezes at Map 99% Reduce 33%
Date: Thu, 24 Mar 2011 17:03:10 -0400
From: Kevin.Leach@thomsonreuters.com
To: common-user@hadoop.apache.org

Shi,

This page states "Of course, the framework discards the sub-directory of unsuccessful task-attempts.":
http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html

So yes, the missing directory is likely a failed task attempt. If you can, narrow the problem down by running sections of your data through a smaller number of mappers. Take just the tail of your input and a couple of mappers, and then compare that to your program run without Hadoop.
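Something like the quick standalone check below is what I mean -- just a sketch, assuming plain tab-delimited text input; the file name and expected field count are placeholders, not taken from your job:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Standalone check: scan a local slice of the input (e.g. the tail the stuck
// mapper was working on) and count lines that do not split into the expected
// number of tab-separated fields. Path and field count are placeholders.
public class TailRecordCheck {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "input-tail.txt"; // hypothetical sample file
        int expectedFields = 2;                                     // assumed record width
        long lineNo = 0, bad = 0;
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                // -1 keeps trailing empty fields, so truncated records stay visible
                String[] fields = line.split("\t", -1);
                if (fields.length != expectedFields) {
                    bad++;
                    System.err.println("Suspect record at line " + lineNo + ": " + line);
                }
            }
        } finally {
            in.close();
        }
        System.out.println(lineNo + " lines scanned, " + bad + " suspect records");
    }
}

Pull the tail off the big input first (for example with tail -n on one of the source files), run the class on it locally, and if it reports nothing suspect then a bad record at the end of the data becomes less likely as the cause.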
Kevin

-----Original Message-----
From: Shi Yu [mailto:shiyu@uchicago.edu]
Sent: Thursday, March 24, 2011 3:57 PM
To: common-user@hadoop.apache.org
Subject: Re: Program freezes at Map 99% Reduce 33%

Hi Kevin, thanks for the reply. I can hardly imagine an example of an incomplete record. The mapper is very simple: it reads line by line as Strings, splits each line by tab, and outputs a Text pair for sort and secondary sort. If there were an incomplete record, there would be an error, and the only place it could happen is the tab-splitting stage. I use the LZO codec to compress the mapper output and the reducer output.

There is one issue which I think might be the reason. The files in the log directory have the following structure. It seems I had 294 mappers. Notice that one directory is missing: "attempt_201103231501_0007_m_000292_0" -- where is the output of mapper Nr. 292? Was it a failed node or something? All the logs have just been sitting there for 2 hours (last modified around 12:35, while the current time is 14:50); the few folders with later timestamps were only visited by me, around 14:45. So no reducer or mapper has generated any logs in the past two hours.

drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36 attempt_201103231501_0007_m_000281_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000282_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000283_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35 attempt_201103231501_0007_m_000284_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000285_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59 attempt_201103231501_0007_m_000286_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000287_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44 attempt_201103231501_0007_m_000288_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000289_0
drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37 attempt_201103231501_0007_m_000289_1
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22 attempt_201103231501_0007_m_000290_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45 attempt_201103231501_0007_m_000291_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_m_000293_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000000_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000001_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000002_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000003_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000004_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000005_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000006_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000007_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47 attempt_201103231501_0007_r_000008_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000009_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000010_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000011_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000012_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000013_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000014_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000015_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000016_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000017_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42 attempt_201103231501_0007_r_000018_0

Shi
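P.S. A simplified sketch of a mapper of the shape described above (line in, tab split, Text pair out). The class, counter, and field names are placeholders, not the actual job code, and the malformed-line counter is only there to show where a truncated record would surface:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: read a line, split on tab, emit a (composite key, value)
// Text pair for sort and secondary sort. Field layout is an assumption.
public class TabSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        if (fields.length < 2) {
            // Count and skip malformed records instead of failing the task;
            // the counter shows up in the job UI, so truncated input is visible.
            context.getCounter("TabSplitMapper", "MALFORMED_LINES").increment(1);
            return;
        }
        // Composite key "primary\tsecondary" so a custom partitioner/comparator
        // can group on the first part and order on the second.
        outKey.set(fields[0] + "\t" + fields[1]);
        outValue.set(fields[1]);
        context.write(outKey, outValue);
    }
}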
On 3/24/2011 2:25 PM, Kevin.Leach@thomsonreuters.com wrote:
> Shi,
>
> The key here is the 99% done mapper. Nothing can move on until all
> mappers complete.
> Is it possible your data in the larger set has an incomplete record or
> some such at the end?
>
> Kevin
>
> -----Original Message-----
> From: Shi Yu [mailto:shiyu@uchicago.edu]
> Sent: Thursday, March 24, 2011 3:02 PM
> To: hadoop user
> Subject: Program freezes at Map 99% Reduce 33%
>
> I am running a Hadoop program processing terabyte-scale data. The code
> was tested successfully on a small sample (100G) and it worked. However,
> when trying it on the full problem, the program freezes forever at Map
> 99% Reduce 33%. There is no error, and the userlog folder stays small
> (<30M), since otherwise it would generate gigabytes of error logs.
>
> I checked the mapper and reducer logs; it seems the reducer is waiting
> for a map output that never arrives. What could be causing this? Most
> of the configurations are left at their defaults. I set
> "mapred.child.java.opts=-Xmx2000M
> hadoop.job.history.user.location=none". The problem occurs on both
> 0.19.2 and 0.20.2. Thanks!
>
> Example of Mapper logs:
>
> 2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging
> 3 sorted segments
> 2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to
> the last merge-pass, with 3 segments left of total size: 461743 bytes
> 2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index:
> (11015008, 10030254, 607594)
> 2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner:
> Task:attempt_201103231501_0007_m_000286_0 is done. And is in the
> process of commiting
> 2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_201103231501_0007_m_000286_0' done.
>
> Example of Reducer logs:
>
> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
> and0 dup hosts)
> 2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Need another 1 map output(s)
> where 0 is already in progress
> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
> and0 dup hosts)
>
>
> Shi
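For reference, a minimal 0.20-style driver sketch showing where the two properties quoted above would be set. The job name, paths, reducer count, and the TabSplitMapper class refer to the illustrative sketch earlier in this thread, not to the actual job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Placeholder driver: only the two properties mentioned in the thread are set
// explicitly; everything else stays at its defaults. LZO map-output compression
// (which the job reportedly uses) is omitted here.
public class FreezeJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.child.java.opts", "-Xmx2000M");       // from the thread
        conf.set("hadoop.job.history.user.location", "none");  // from the thread

        Job job = new Job(conf, "tab-split secondary sort");   // name is a placeholder
        job.setJarByClass(FreezeJobDriver.class);
        job.setMapperClass(TabSplitMapper.class);              // illustrative mapper above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(19);                             // matches r_000000..r_000018 in the listing

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}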