hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: large reduce group sizes
Date Thu, 11 Oct 2007 21:32:47 GMT


When dealing with this, I often do a second reduce to count the unique lines
output by a counting program such as you describe.

For the common case of users and content items, the first MR produces
<user, item, count> records.  Then similar counting programs give
<user, count>, <item, count>, <user, unique-item-count> and
<item, unique-user-count> records in four steps that are each much
faster than the original scan.  Generating all four kinds of output in
one step would be pretty easy as well, but that would require special
output code to separate them.  The four steps are fast enough compared
to the first step that I don't much care.  Combining uniques and totals
in one output tends to work better when you are dropping the data into
a statistics table in a database.

In detail:

MR1:
  map: <user, item> => <<user, item>, 1>
  combine and reduce: standard key counter

MR2: 
  map: <user, item, count> => <user, count>
  combine and reduce: standard key counter

MR3: 
  map: <user, item, count> => <user, 1>
  combine and reduce: standard key counter

MR4: 
  map: <user, item, count> => <item, count>
  combine and reduce: standard key counter

MR5: 
  map: <user, item, count> => <item, 1>
  combine and reduce: standard key counter
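
All five steps share the same "standard key counter" combine/reduce.
Here is a minimal sketch against the old org.apache.hadoop.mapred API
of that era (the class name is mine; treat it as illustrative):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // Sums the partial counts for each key; usable as both combiner
  // and reducer in every step above.
  public static class KeyCounter extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();   // accumulate this partial count
      }
      output.collect(key, new LongWritable(sum));
    }
  }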

Note the similarity between MR2 & MR3 and between MR4 & MR5.  These
four could be combined, as mentioned above, into two steps:

MR2: 
  map: <user, item, count> => <<TOTAL, user>, count>, <<UNIQUE, user>, 1>
  combine and reduce: standard key counter

MR3: 
  map: <user, item, count> => <<TOTAL, item>, count>, <<UNIQUE, item>, 1>
  combine and reduce: standard key counter
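
A sketch of the combined MR2 mapper, assuming MR1's output is read
back as tab-separated "user <tab> item <tab> count" lines and the
TOTAL/UNIQUE tag is folded into a Text key (both assumptions are
mine; imports as in the previous sketch):

  // One pass over MR1 output emits both the per-user total and a 1
  // toward the unique-item count; a tag prefix in the key keeps the
  // two record kinds separate through the same KeyCounter reduce.
  public static class TaggedUserMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      String[] f = line.toString().split("\t");   // user, item, count
      output.collect(new Text("TOTAL\t" + f[0]),
                     new LongWritable(Long.parseLong(f[2])));
      output.collect(new Text("UNIQUE\t" + f[0]), new LongWritable(1));
    }
  }

The special output code mentioned above would then just split the
TOTAL and UNIQUE rows into separate files after the reduce.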


On 10/11/07 1:17 PM, "Joydeep Sen Sarma" <jssarma@facebook.com> wrote:

> Hi all,
>
> I am facing a problem with aggregations where reduce groups are
> extremely large.
>
> It's a very common usage scenario - for example, someone might want
> the equivalent of 'count(distinct e.field2) from events e group by
> e.field1'.  The natural thing to do is emit e.field1 as the map key
> and do the distinct and count in the reduce.
>
> Unfortunately, the values in the reduce phase all have to be pulled
> into memory, and we end up running out of memory for large groups.
> It would be great if the values iterator were able to seamlessly pull
> in data from disk - especially since the data is coming from a
> persistent store.
>
> I was wondering whether other people have faced this problem and what
> they have done (some solutions have been suggested to me - like first
> doing a group by on field1_hash(field2) to reduce group size - but
> they are a pain to implement).  And how difficult would it be to have
> an iterator iterate over on-disk - rather than in-memory - values?
>
> Thx,
>
> Joydeep

