hadoop-common-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: DeDuplication Techniques
Date Fri, 26 Mar 2010 16:05:56 GMT

On Mar 25, 2010, at 11:09 AM, Joseph Stein wrote:

> I have been researching ways to handle de-dupping data while running a
> map/reduce program (so as to not re-calculate/re-aggregate data that
> we have seen before[possibly months before]).

So roughly, your problem is that you have large amounts of historic  
data and you need to merge in the current month. The best solution  
that I've seen looks like:

Keep your historic data sorted by md5.

Run a MapReduce job to sort your new data into MD5 order. Note that
you want a total order, but because the MD5s are evenly distributed
across the key space this is easy. Basically, you pick a number of
reduces (e.g., 256) and then use the top N bits of the MD5 to pick
your reduce. Since this job is only processing your new data, it is
very fast.
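A minimal sketch of that reduce-selection step (the class name and the 256-way split are illustrative, not from the mail): with 256 reduces, the top 8 bits of the MD5 are just the first digest byte, which maps keys uniformly onto reduces.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Partition {
    // Number of reduces; a power of two so the top bits map evenly.
    static final int NUM_REDUCES = 256;

    // Pick a reduce from the top 8 bits of the key's MD5.
    // With 256 reduces, that is simply the first digest byte.
    static int partitionFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return digest[0] & 0xFF;  // 0..255, uniform over the key space
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("record-42"));
    }
}
```

In a real Hadoop job this logic would live in a custom Partitioner; the total order falls out because reduce i only ever sees MD5s whose top bits equal i.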

Next you do a map-side join where each input split consists of an md5  
range. The RecordReader reads from the historic and new datasets  
merging them as they go. (You can use the map-side join library for  
this.) Your map does the merge of the new and old. This is a map-only  
job, so it also is very fast.
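The merge inside each map can be sketched as a plain merge of two MD5-sorted runs (a stand-in for what the RecordReader does over one MD5 range; the collision policy of keeping the new record is an assumption, not from the mail):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeJoin {
    // Merge two md5-sorted runs of (md5, value) pairs.
    // On a duplicate md5, keep only the new record -- this is the de-dup.
    static List<String[]> merge(List<String[]> historic, List<String[]> fresh) {
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < historic.size() && j < fresh.size()) {
            int cmp = historic.get(i)[0].compareTo(fresh.get(j)[0]);
            if (cmp < 0) {
                out.add(historic.get(i++));
            } else if (cmp > 0) {
                out.add(fresh.get(j++));
            } else {               // same md5: drop the historic copy
                out.add(fresh.get(j++));
                i++;
            }
        }
        while (i < historic.size()) out.add(historic.get(i++));
        while (j < fresh.size())    out.add(fresh.get(j++));
        return out;
    }

    public static void main(String[] args) {
        List<String[]> old = Arrays.asList(
            new String[]{"0a", "x"}, new String[]{"3c", "y"});
        List<String[]> neu = Arrays.asList(
            new String[]{"3c", "y2"}, new String[]{"f1", "z"});
        for (String[] r : merge(old, neu)) {
            System.out.println(r[0] + "=" + r[1]);
        }
    }
}
```

Because both inputs arrive sorted by MD5, the merge is a single sequential pass with no buffering beyond one record per side, which is why the map-only job is so cheap.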

Of course, if the new data is small enough, you can read all of the
new input in each of the maps, keep (and sort in RAM) just the new
records that fall in that map's range, and do the merge from RAM. This
lets you avoid the step where you sort the new data. This kind of
merge optimization is where Pig and Hive hide lots of the details from
the developer.
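The filter-and-sort-in-RAM step above can be sketched like this (a hypothetical helper, assuming MD5s are compared as hex strings and each map owns a half-open range [lo, hi)):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RamFilterSort {
    // Keep only the new records whose md5 falls in this map's range,
    // sorted in RAM. A TreeMap keeps them in md5 order automatically,
    // so the result is ready to merge against the historic stream.
    static TreeMap<String, String> inRange(Map<String, String> fresh,
                                           String lo, String hi) {
        TreeMap<String, String> kept = new TreeMap<>();
        for (Map.Entry<String, String> e : fresh.entrySet()) {
            String md5 = e.getKey();
            if (md5.compareTo(lo) >= 0 && md5.compareTo(hi) < 0) {
                kept.put(md5, e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> fresh = new HashMap<>();
        fresh.put("1a", "a");
        fresh.put("9f", "b");
        fresh.put("c2", "c");
        System.out.println(inRange(fresh, "00", "80").keySet());  // [1a]
    }
}
```

Every map re-reads the whole (small) new dataset but only holds its own slice in memory, so the extra sort job disappears at the cost of redundant reads.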

-- Owen
