hadoop-common-user mailing list archives

From "Eric Zhang" <ezh...@yahoo-inc.com>
Subject RE: how to deal with large amount of key value pair outputs in one run of map task
Date Tue, 21 Aug 2007 18:27:07 GMT
Thanks, Owen. By configuring mapred.child.java.opts to a larger value (it took a
little while to figure out the right way to set it: -Xmx300m), the
out-of-memory problem went away. It's good to know that the default value of
io.sort.mb is 100M and that my map task required about 300M of heap.
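
The fix above can be sketched as entries in hadoop-site.xml. This is a minimal illustration, not a recommendation: the property names are those used in the Hadoop versions of this era, the -Xmx300m and 100 values come from this thread, and the right sizes depend on your job.

```xml
<!-- hadoop-site.xml -->
<!-- heap size for each map/reduce child JVM -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m</value>
</property>
<!-- in-memory buffer (in megabytes) for map output before sort-and-spill -->
<property>
  <name>io.sort.mb</name>
  <value>100</value>
</property>
```

Note that the child heap (-Xmx) must comfortably exceed io.sort.mb, since the sort buffer lives inside that heap; otherwise the map task can hit OutOfMemoryError while buffering output.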

Eric Zhang
Vespa content @Yahoo!
Work: 408-349-2466

-----Original Message-----
From: Owen O'Malley [mailto:oom@yahoo-inc.com] 
Sent: Tuesday, August 21, 2007 10:32 AM
To: hadoop-user@lucene.apache.org
Subject: Re: how to deal with large amount of key value pair outputs in one
run of map task

On Aug 20, 2007, at 2:05 PM, Eric Zhang wrote:

> Thanks a lot for the response, Arun. Just curious how OutputCollector 
> flushes key/value pairs to disk: is the periodic flush based on time 
> (like every couple of minutes) or on volume (like every 100 
> key/value pair outputs)?
> The size of map output varies for each key/value input; it could be as 
> small as one key/value pair or as big as tens of millions of 
> key/value pairs. I could try to change the way my application works 
> to avoid this problem, but I am wondering whether Hadoop already 
> supports scalability in such cases besides increasing memory?

It uses io.sort.mb, which is the number of megabytes of map output to buffer
before sorting and spilling to disk. (The config variable was named back when
the sort was handled very differently, hence the unobvious name.) A major
point of map/reduce is to scale to very large data sets while making very few
assumptions about what will fit in memory at once.
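
The buffer-then-spill behavior Owen describes can be sketched as a toy class. This is an illustration of the idea only, not Hadoop's actual map-output buffer: the class name, the string-length byte accounting, and the spill counter are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the spill-on-threshold idea behind io.sort.mb:
// buffer map outputs in memory and flush once the buffered bytes
// exceed a limit, so memory use stays bounded no matter how many
// key/value pairs a single map task emits.
public class SpillBuffer {
    private final long limitBytes;            // analogous to io.sort.mb
    private long bufferedBytes = 0;
    private final List<String> buffer = new ArrayList<>();
    public int spills = 0;                    // how many times we flushed

    public SpillBuffer(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    public void collect(String key, String value) {
        String record = key + "\t" + value;
        buffer.add(record);
        bufferedBytes += record.length();     // simplified byte accounting
        if (bufferedBytes >= limitBytes) {
            spill();
        }
    }

    private void spill() {
        // A real implementation would sort the buffered records and
        // write a sorted run to disk here; we just drop the buffer.
        buffer.clear();
        bufferedBytes = 0;
        spills++;
    }

    public static void main(String[] args) {
        SpillBuffer b = new SpillBuffer(100); // tiny limit for demonstration
        for (int i = 0; i < 50; i++) {
            b.collect("key" + i, "value" + i);
        }
        System.out.println("spills=" + b.spills);
    }
}
```

Whether the map emits one pair or tens of millions, the resident memory never exceeds roughly the limit; only the number of spilled runs grows.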

-- Owen
