hadoop-common-user mailing list archives

From: <Kevin.Le...@thomsonreuters.com>
Subject: RE: Program freezes at Map 99% Reduce 33%
Date: Thu, 24 Mar 2011 21:03:10 GMT
Shi,

This states, "Of course, the framework discards the sub-directory of
unsuccessful task-attempts.":
http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html

So yes, the missing directory most likely belonged to a failed task
attempt.
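
If you want to confirm that from the client side, something along
these lines might work (an untested sketch against the 0.20 mapred
client API; the job id is taken from your listing). It should print
any non-successful task attempts:

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskCompletionEvent;

public class FailedAttempts {
    public static void main(String[] args) throws IOException {
        // Cluster location comes from the usual client-side config files.
        JobClient client = new JobClient(new JobConf());
        RunningJob job =
            client.getJob(JobID.forName("job_201103231501_0007"));
        int start = 0;
        TaskCompletionEvent[] events;
        do {
            // Completion events come back in pages; keep asking until empty.
            events = job.getTaskCompletionEvents(start);
            for (TaskCompletionEvent e : events) {
                if (e.getTaskStatus() != TaskCompletionEvent.Status.SUCCEEDED) {
                    System.out.println(e.getTaskAttemptId()
                            + " -> " + e.getTaskStatus());
                }
            }
            start += events.length;
        } while (events.length > 0);
    }
}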

If you can, narrow the problem down by running sections of your data
through a smaller number of mappers. Try just the tail of your input
with a couple of mappers, then compare the result against a run of
your program outside Hadoop.
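
For the run outside Hadoop, a quick standalone check along these lines
(the file name and expected field count are placeholders, not from
your job) would replay the tab-split logic on a slice of the input,
e.g. the tail:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TailCheck {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "input-slice.txt";
        int expectedFields = 2;  // placeholder: match your record layout
        int lineNo = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                // limit -1 keeps trailing empty fields, so a truncated
                // record still shows up with the wrong field count
                String[] fields = line.split("\t", -1);
                if (fields.length != expectedFields) {
                    System.err.println("Suspect record at line " + lineNo
                            + " (" + fields.length + " fields): " + line);
                }
            }
        }
        System.out.println("Scanned " + lineNo + " lines.");
    }
}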

Kevin

-----Original Message-----
From: Shi Yu [mailto:shiyu@uchicago.edu] 
Sent: Thursday, March 24, 2011 3:57 PM
To: common-user@hadoop.apache.org
Subject: Re: Program freezes at Map 99% Reduce 33%

Hi Kevin,

Thanks for the reply. I can hardly imagine an incomplete record here.
The mapper is very simple: it reads the input line by line as Strings,
splits each line by tab, and outputs a Text pair for sort and
secondary sort. If there were an incomplete record, there would have
been an error, and the only place that could happen is the
tab-splitting stage. I use the LZO codec to compress the mapper output
and the reducer output.
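
For reference, the mapper is roughly of this shape (a simplified
sketch, not my exact code; the field positions and composite-key
layout are illustrative, and the custom partitioner/grouping
comparator for the secondary sort is omitted):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TabSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        if (fields.length < 2) {
            // an incomplete record would surface here as a counted
            // malformed line instead of a silent hang
            context.getCounter("TabSplit", "MALFORMED").increment(1);
            return;
        }
        // composite key: natural key plus secondary-sort field
        outKey.set(fields[0] + "\t" + fields[1]);
        outValue.set(fields[1]);
        context.write(outKey, outValue);
    }
}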

There is one issue which I think might be the reason. The task log
directories have the structure shown below. It seems I had 294
mappers. Notice that one directory is missing:
"attempt_201103231501_0007_m_000292_0". Where is the output of mapper
292? Was it a failed node or something?


All the logs have been sitting idle for about two hours (last written
around 12:35, while the current time is 14:50; a few folders show
times around 14:45 only because I visited them myself). So neither
mapper nor reducer has generated any logs in the past two hours.

drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36 attempt_201103231501_0007_m_000281_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000282_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000283_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35 attempt_201103231501_0007_m_000284_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000285_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59 attempt_201103231501_0007_m_000286_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000287_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44 attempt_201103231501_0007_m_000288_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000289_0
drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37 attempt_201103231501_0007_m_000289_1
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22 attempt_201103231501_0007_m_000290_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45 attempt_201103231501_0007_m_000291_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_m_000293_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000000_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000001_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000002_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000003_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000004_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000005_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000006_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000007_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47 attempt_201103231501_0007_r_000008_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000009_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000010_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000011_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000012_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000013_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000014_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000015_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000016_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000017_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42 attempt_201103231501_0007_r_000018_0

Shi

On 3/24/2011 2:25 PM, Kevin.Leach@thomsonreuters.com wrote:
> Shi,
>
> The key here is the 99% done mapper. Nothing can move on until all 
> mappers complete.
> Is it possible your data in the larger set has an incomplete record
> or some such at the end?
>
> Kevin
>
> -----Original Message-----
> From: Shi Yu [mailto:shiyu@uchicago.edu]
> Sent: Thursday, March 24, 2011 3:02 PM
> To: hadoop user
> Subject: Program freezes at Map 99% Reduce 33%
>
> I am running a Hadoop program processing terabyte-scale data. The
> code was tested successfully on a small sample (100G) and it worked.
> However, when trying it on the full problem, the program freezes
> forever at Map 99% Reduce 33%. There is no error, and the userlog
> folder stays small (<30M); otherwise it would generate gigabytes of
> error logs.
>
> I checked the mapper and reducer logs; it seems the reducer is
> waiting for a map output that never arrives. What could be causing
> this? Most of the configuration is left at the defaults. I set
> "mapred.child.java.opts=-Xmx2000M
> hadoop.job.history.user.location=none". The problem occurs on both
> 0.19.2 and 0.20.2. Thanks!
>
> Example of Mapper logs:
>
> 2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging 
> 3 sorted segments
> 2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to 
> the last merge-pass, with 3 segments left of total size: 461743 bytes
> 2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index:
> (11015008, 10030254, 607594)
> 2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner:
> Task:attempt_201103231501_0007_m_000286_0 is done. And is in the 
> process of commiting
> 2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_201103231501_0007_m_000286_0' done.
>
> Example of Reducer logs:
>
> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
> and0 dup hosts)
> 2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Need another 1 map output(s) 
> where 0 is already in progress
> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
> and0 dup hosts)
>
>
> Shi

