hadoop-common-user mailing list archives

From "Owen O'Malley" <...@yahoo-inc.com>
Subject Re: how to deal with large amount of key value pair outputs in one run of map task
Date Tue, 21 Aug 2007 17:31:52 GMT

On Aug 20, 2007, at 2:05 PM, Eric Zhang wrote:

> Thanks a lot for the response, Arun. Just curious how OutputCollector
> flushes key/value pairs to disk: is the periodic flush based on time
> (like every couple of minutes) or based on volume (like every 100
> key/value pair outputs)?
> The size of the map output varies for each key/value input: it could be
> as small as one key/value pair or as big as tens of millions of
> key/value pairs. I could try to change the way my application works to
> avoid this problem, but I am wondering if Hadoop already supports
> scalability in such cases besides increasing memory?

It uses io.sort.mb, which is the number of megabytes to buffer before
you sort and spill to disk. (The config variable was named back when
the sort was handled very differently, hence the unobvious name.) A
major point of map/reduce is to scale to very large data sets while
making very few assumptions about what will fit in memory at once.
-- Owen
