hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject Re: Program freezes at Map 99% Reduce 33%
Date Thu, 24 Mar 2011 21:55:33 GMT
Hi Kevin,

thanks for the suggestion. I think I found the problem: my code is a
chained map/reduce job, and in the previous iteration there is one
.lzo_deflate output that is 40 times larger than the other files. That
was caused by a special "key" value which occurs far more often than the
other keys. I used a self-defined partitioner:

  public int getPartition(TextPair key, Text value, int numPartitions) {
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
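As a quick sanity check outside Hadoop, the same hashing rule can be run on a plain String standing in for key.getFirst() (the "hotkey" string is a made-up example; 19 partitions matches the reducer directories r_000000_0 .. r_000018_0 in the listing below):

```java
// Standalone check that the hash partitioner is deterministic per key:
// every record carrying the same first key lands in the same partition,
// so one dominant key funnels all of its records to a single reducer.
public class HotKeyCheck {
    static int getPartition(String first, int numPartitions) {
        return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 19; // r_000000_0 .. r_000018_0 in the log listing
        int p = getPartition("hotkey", numPartitions);
        for (int i = 0; i < 100; i++) {
            if (getPartition("hotkey", numPartitions) != p) {
                throw new AssertionError("partitioner not deterministic");
            }
        }
        System.out.println("all occurrences of \"hotkey\" -> partition " + p);
    }
}
```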

Presumably all occurrences of this key are sent to a single reducer,
which yields such a huge compressed file. And on top of that the
.lzo_deflate file is not splittable without an index, is it? I removed
this huge file and reran the current task, and it worked smoothly. I
didn't notice the problem on the small test because the output was not
compressed in that run.

Now I am trying to append a random string to that special key: the
previous outputs are only intermediate input for the next task, so there
is no reason to force all of its occurrences onto a single reducer.
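That salting idea can be sketched in plain Java (outside Hadoop; the hot-key string and the "#" separator below are made-up illustrations, and 19 matches this job's reducer count):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch of key salting: append a random suffix to the one hot key so its
// records spread across reducers; the next job strips the suffix again.
public class SaltedKeyDemo {
    static final String HOT_KEY = "the";  // hypothetical hot key
    static final int NUM_PARTITIONS = 19; // reducers r_000000_0 .. r_000018_0

    // Same rule as the partitioner above, applied to a plain String.
    static int partition(String first) {
        return (first.hashCode() & Integer.MAX_VALUE) % NUM_PARTITIONS;
    }

    // Only the hot key gets salted; all other keys are left alone.
    static String salt(String key, Random rnd) {
        return key.equals(HOT_KEY) ? key + "#" + rnd.nextInt(NUM_PARTITIONS) : key;
    }

    // The next task strips the salt to recover the original key.
    static String unsalt(String key) {
        int i = key.indexOf('#');
        return i < 0 ? key : key.substring(0, i);
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        Set<Integer> hit = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            hit.add(partition(salt(HOT_KEY, rnd)));
        }
        // Unsalted, the hot key always maps to one partition; salted, to many.
        System.out.println("partitions used by hot key after salting: " + hit.size());
    }
}
```

The salt alphabet is sized to the reducer count so the hot key can reach every partition; any downstream job that groups by the real key must call unsalt() first.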

Shi


On 3/24/2011 4:03 PM, Kevin.Leach@thomsonreuters.com wrote:
> Shi,
>
> This states "Of course, the framework discards the sub-directory of
> unsuccessful task-attempts."
> http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html
>
> So yes, the missing directory is likely a failure.
>
> If you can, narrow the problem down by looking at sections of your data
> on a smaller number of mappers. Look at just the tail of your input and
> a couple of mappers and then compare that to your program run without
> hadoop.
>
> Kevin
>
> -----Original Message-----
> From: Shi Yu [mailto:shiyu@uchicago.edu]
> Sent: Thursday, March 24, 2011 3:57 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Program freezes at Map 99% Reduce 33%
>
> Hi Kevin,
>
> thanks for the reply. I can hardly imagine an example of an incomplete
> record. The mapper is very simple: it reads line by line as Strings,
> splits each line by tab, and outputs a TextPair for sort and secondary
> sort. If there were an incomplete record, there would be an error, and
> the only place it could happen is the tab-splitting stage. I use the
> LZO codec to compress the Mapper output and the Reducer output.
>
> There is one issue which I think might be the reason. The files in the
> log folder have the following structure. It seems I had 294 mappers.
> Notice that one directory is missing:
> "attempt_201103231501_0007_m_000292_0". Where is the output of mapper
> Nr. 292? Was it a failed node or something?
>
>
> All the logs have just been hanging there for 2 hours (most timestamps
> are ~12:35, while the current time is 14:50; some folders were visited
> by me around 14:45). So neither reducer nor mapper has generated any
> logs in the past two hours.
>
> Shi
>
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36 attempt_201103231501_0007_m_000281_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000282_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000283_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35 attempt_201103231501_0007_m_000284_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000285_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59 attempt_201103231501_0007_m_000286_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000287_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44 attempt_201103231501_0007_m_000288_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000289_0
> drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37 attempt_201103231501_0007_m_000289_1
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22 attempt_201103231501_0007_m_000290_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45 attempt_201103231501_0007_m_000291_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_m_000293_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000000_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000001_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000002_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000003_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000004_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000005_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000006_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000007_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47 attempt_201103231501_0007_r_000008_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000009_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000010_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000011_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000012_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000013_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000014_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000015_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000016_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000017_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42 attempt_201103231501_0007_r_000018_0
>
> Shi
>
> On 3/24/2011 2:25 PM, Kevin.Leach@thomsonreuters.com wrote:
>> Shi,
>>
>> The key here is the 99% done mapper. Nothing can move on until all
>> mappers complete.
>> Is it possible your data in the larger set has an incomplete record or
>> some such at the end?
>>
>> Kevin
>>
>> -----Original Message-----
>> From: Shi Yu [mailto:shiyu@uchicago.edu]
>> Sent: Thursday, March 24, 2011 3:02 PM
>> To: hadoop user
>> Subject: Program freezes at Map 99% Reduce 33%
>>
>> I am running a Hadoop program processing terabyte-size data. The code
>> was tested successfully on a small sample (100 GB) and it worked.
>> However, when trying it on the full problem, the program freezes
>> forever at Map 99% Reduce 33%. There is no error, and the userlog
>> folder is kept small (<30 MB), because otherwise it would generate
>> gigabytes of error logs. I checked the logs of the mappers and
>> reducers; it seems that the reducer is waiting for a map output that
>> never arrives. What could be the possible reason for this? Most of the
>> configuration is left at the defaults; I set
>> "mapred.child.java.opts=-Xmx2000M" and
>> "hadoop.job.history.user.location=none". The problem occurs on both
>> 0.19.2 and 0.20.2. Thanks!
>>
>> Example of Mapper logs:
>>
>> 2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
>> 2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 461743 bytes
>> 2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index: (11015008, 10030254, 607594)
>> 2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201103231501_0007_m_000286_0 is done. And is in the process of commiting
>> 2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201103231501_0007_m_000286_0' done.
>>
>> Example of Reducer logs:
>>
>> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
>> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
>> 2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0 Need another 1 map output(s) where 0 is already in progress
>> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
>> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
>>
>>
>> Shi


