hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: large reduce group sizes
Date Thu, 11 Oct 2007 21:32:20 GMT
Great! I didn't realize that the iterator was disk-based.

The below sounds very doable; I will give it a shot. Do you see this as
an option in the mapred job (optionally sorting the values)?
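
A minimal sketch of what sorted values would buy, assuming the framework
(or a secondary-sort arrangement that puts field2 into the map output
key) hands the reducer each group's values in sorted order. The class
name is illustrative, and the old org.apache.hadoop.mapred signatures
shown here varied slightly across early releases:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Counts the distinct values of a group in one streaming pass.
    public class SortedDistinctCountReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, LongWritable> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, LongWritable> out,
                         Reporter reporter) throws IOException {
        long distinct = 0;
        String prev = null;
        while (values.hasNext()) {
          String cur = values.next().toString();
          // Sorted input puts duplicates next to each other, so a new
          // distinct value appears exactly when cur differs from prev.
          if (prev == null || !cur.equals(prev)) {
            distinct++;
            prev = cur;
          }
        }
        // Memory use is one value, independent of the group size.
        out.collect(key, new LongWritable(distinct));
      }
    }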

-----Original Message-----
From: Runping Qi [mailto:runping@yahoo-inc.com] 
Sent: Thursday, October 11, 2007 2:04 PM
To: hadoop-user@lucene.apache.org
Subject: RE: large reduce group sizes


The values passed to reduce come from a disk-backed iterator.
The problematic part is computing the distinct count: you either have
to keep the unique values in memory, or use some other trick.
One such trick is sampling (a hash-based variant is sketched below);
another is to write the values out to disk, do a merge sort, and then
read the sorted values back in sequentially.
It would be nice if somebody could contribute a patch.

Runping
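
The thread doesn't spell out the sampling trick; one common way to make
it concrete for distinct counts is hash-based sampling: keep only the
values whose hash lands in a fixed slice of the hash space, then scale
the sample's distinct count back up. A minimal standalone sketch (the
class name, the rate, and the reliance on String.hashCode are all
assumptions, not from the thread):

    import java.util.HashSet;
    import java.util.Set;

    // Estimates a distinct count while storing only ~1/RATE of the
    // distinct values.
    public class SampledDistinct {
      private static final int RATE = 64;  // assumed sampling rate
      private final Set<String> sample = new HashSet<String>();

      public void add(String value) {
        // A value is kept iff its hash lands in the chosen slice of
        // the hash space; a given value always hashes the same way,
        // so its duplicates collapse into one sample entry.
        if ((value.hashCode() & 0x7fffffff) % RATE == 0) {
          sample.add(value);
        }
      }

      public long estimate() {
        // Each distinct value had a 1/RATE chance of being kept, so
        // scaling the sample size back up estimates the true count.
        return (long) sample.size() * RATE;
      }
    }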
 

> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
> Sent: Thursday, October 11, 2007 1:17 PM
> To: hadoop-user@lucene.apache.org
> Subject: large reduce group sizes
> 
> Hi all,
> 
> I am facing a problem with aggregations where reduce groups are
> extremely large.
> 
> It's a very common usage scenario - for example, someone might want
> the equivalent of 'count(distinct e.field2) from events e group by
> e.field1'. The natural thing to do is emit e.field1 as the map key
> and do the distinct and count in the reduce.
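
A minimal sketch of that natural reduce, to make the memory problem
concrete (names are illustrative, and the old org.apache.hadoop.mapred
signatures varied across early releases):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // The map side emits (field1, field2); this reduce must see every
    // field2 in the group before it can count the distinct ones.
    public class NaiveDistinctCountReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, LongWritable> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, LongWritable> out,
                         Reporter reporter) throws IOException {
        Set<String> distinct = new HashSet<String>();
        while (values.hasNext()) {
          // The set holds one entry per distinct value, so memory
          // grows with the group - the failure mode described above.
          distinct.add(values.next().toString());
        }
        out.collect(key, new LongWritable(distinct.size()));
      }
    }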
> 
> Unfortunately, the values in the reduce phase all have to be pulled
> into memory, and we end up running out of memory for large groups.
> It would be great if the values iterator were able to seamlessly pull
> in data from disk - especially since the data is coming from a
> persistent store.
> 
> I was wondering if other people have faced this problem - and what
> they have done. There are some solutions that have been suggested to
> me - like first doing a group-by on (field1, hash(field2)) to reduce
> the group size, as sketched below - but they are a pain to implement.
> And how difficult would it be to have an iterator iterate over
> on-disk - rather than in-memory - values?
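
A minimal sketch of that bucketing idea as I read it: a first job groups
on a composite (field1, bucket) key so each reduce group holds only a
fraction of the values, and a second job sums the per-bucket distinct
counts per field1. Because a given field2 value always hashes to the
same bucket, the partial counts add up exactly. All names, the
delimiter, and the input layout are assumptions:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Job 1 mapper: spread each field1 group across BUCKETS smaller
    // groups keyed by (field1, hash(field2) % BUCKETS).
    public class BucketedDistinctMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private static final int BUCKETS = 256;  // assumed fan-out

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] fields = line.toString().split("\t");  // assumed layout
        if (fields.length < 2) return;
        String field1 = fields[0];
        String field2 = fields[1];
        int bucket = (field2.hashCode() & 0x7fffffff) % BUCKETS;
        // Each (field1, bucket) group is roughly BUCKETS times smaller
        // than the plain field1 group, so a HashSet fits in memory.
        out.collect(new Text(field1 + "|" + bucket), new Text(field2));
      }
    }

    // Job 1 reduce: distinct-count field2 within each (field1, bucket)
    // group and emit (field1, partialCount). Job 2: sum the partial
    // counts per field1 to get count(distinct field2).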
> 
> Thx,
> 
> Joydeep