hadoop-common-user mailing list archives

From Christian Kunz <ck...@yahoo-inc.com>
Subject Re: How are records with equal key sorted in hadoop-0.18?
Date Mon, 08 Dec 2008 16:02:01 GMT
Devaraj,

fs.inmemory.size.mb = 200

Intermediate compression turned on. At one time we checked whether turning
it off would help, but it did not.
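For reference, here is roughly what the relevant settings look like as hadoop-site.xml entries. Property names are the hadoop-0.18 defaults; values are the ones described in this thread (treat this as a sketch of our configuration, not a verbatim dump):

```xml
<!-- In-memory merge buffer for the shuffle. -->
<property>
  <name>fs.inmemory.size.mb</name>
  <value>200</value>
</property>
<!-- Intermediate (map output) compression is turned on. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<!-- Merge fan-in; was 500+ under hadoop-0.17, now 100. -->
<property>
  <name>io.sort.factor</name>
  <value>100</value>
</property>
<!-- Tasks run with 2GB of heap. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```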

I do not know the typical map output size, but the reduces that run into
trouble are typically merging 1-3GB (compressed) directly into the reduce
application with a fan-in of 50-100. That is when the task runs out of
memory, and it is not due to the application, because the application runs
as a separate process (pipes application).

Comparing the hadoop-default.xml of hadoop-0.18 with that of hadoop-0.17:
didn't map.sort.class change from
org.apache.hadoop.mapred.MergeSorter to
org.apache.hadoop.util.QuickSort?

What is the stability of hadoop's QuickSort? When comparing two records with
the same key, does the size of the value have an impact on sorting?
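To make the stability question concrete, here is a toy Lomuto-partition quicksort (a sketch only, not Hadoop's actual org.apache.hadoop.util.QuickSort) whose comparator looks at keys only. Like any plain quicksort it is unstable: records sharing a key can come out in a different relative order than they went in, even though value size never enters the comparison:

```java
import java.util.Arrays;

public class StabilityDemo {
    // A record: 'key' is what the sort compares, 'tag' marks the original position.
    static final class Rec {
        final int key;
        final int tag;
        Rec(int key, int tag) { this.key = key; this.tag = tag; }
        @Override public String toString() { return key + "/" + tag; }
    }

    // Textbook quicksort with Lomuto partitioning; compares keys only.
    static void quickSort(Rec[] a, int lo, int hi) {
        if (lo >= hi) return;
        Rec pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j].key < pivot.key) {
                Rec t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        }
        Rec t = a[i]; a[i] = a[hi]; a[hi] = t;
        quickSort(a, lo, i - 1);
        quickSort(a, i + 1, hi);
    }

    public static void main(String[] args) {
        Rec[] recs = { new Rec(1, 0), new Rec(1, 1), new Rec(0, 2), new Rec(1, 3) };
        quickSort(recs, 0, recs.length - 1);
        // Keys end up sorted, but the three key-1 records are no longer
        // in their original 0, 1, 3 order.
        System.out.println(Arrays.toString(recs));
    }
}
```

A merge sort (like the old MergeSorter), by contrast, preserves the input order of equal keys.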

Thanks
Christian


On 12/8/08 2:15 AM, "Devaraj Das" <ddas@yahoo-inc.com> wrote:

> Hi Christian, there is no notable change to the merge algorithm except that
> it uses IFile instead of SequenceFile for the input and output.
> Is your application running with intermediate compression on? What's the
> value configured for fs.inmemory.size.mb? What is the typical map output
> size (if you happen to know)?
> 
> Devaraj
> 
> 
> On 12/8/08 12:59 PM, "Christian Kunz" <ckunz@yahoo-inc.com> wrote:
> 
>> Since running with hadoop-0.18 we have many more problems with running out
>> of memory during the final merge process in the reduce phase, especially
>> when dealing with a lot of records with the same key.
>> 
>> Typical exception:
>> java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:278)
>>     at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:340)
>>     at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:134)
>>     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:225)
>>     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
>>     at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:720)
>>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:679)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:227)
>>     at org.apache.hadoop.mapred.pipes.PipesReducer.reduce(PipesReducer.java:60)
>>     at org.apache.hadoop.mapred.pipes.PipesReducer.reduce(PipesReducer.java:36)
>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>> 
>> This did not occur in earlier releases, even though we used a much larger
>> merge fan-in (io.sort.factor of 500+ versus just 100 now). Also, tasks run
>> with 2GB of heap space.
>> 
>> What changed in the merge algorithm between hadoop-0.17 and hadoop-0.18?
>> 
>> Are records with the same key getting sorted by value size for some reason?
>> That would cause the large values to be merged at the same time.
>> 
>> Thanks,
>> Christian
>> 
> 
> 

