hadoop-common-user mailing list archives

From Martin Nilsson <nils...@pike.ida.liu.se>
Subject Binary tree reduction
Date Tue, 12 Feb 2008 20:53:08 GMT


I'm looking at a problem where we need to determine the number of unique 
entities for a specific period of time. As an example, consider that we 
log all outgoing URLs in a set of proxies. We can easily create a mapper 
that turns every log file or slice thereof into a sorted list of URLs. 
Unix sort scales very well, but we are only operating on output from at 
most one log file for a day anyway.

Now for reduction I would like to take two (or more?) of these files, 
each simply a line-based list of sorted URLs, and merge them into a 
single file, removing any duplicates. This is a fast operation and 
takes constant memory, but it requires that each complete file be 
processed by the same reducer. Also, the key-value paradigm doesn't apply.
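
A minimal sketch of the merge step described above, in plain Python rather than any Hadoop API. The name `merge_unique` is hypothetical, and both inputs are assumed to be sorted in the order Python uses to compare strings (for example, the output of `LC_ALL=C sort`). It holds only one line from each input at a time, so memory use stays constant.

```python
from typing import Iterable, Iterator


def merge_unique(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Merge two sorted streams of lines, yielding each distinct line once."""
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    last = None  # most recently emitted line, used to drop duplicates
    while x is not None or y is not None:
        # Take the smaller head; when one side is exhausted, drain the other.
        if y is None or (x is not None and x <= y):
            cur, x = x, next(it_a, None)
        else:
            cur, y = y, next(it_b, None)
        if cur != last:
            yield cur
            last = cur
```

Because open file objects are iterables of lines, two sorted files could be passed in directly and the result written out line by line.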

The end product would be one big file with the unique URLs for that 
day. When URLs for e.g. a week or a month are available, those files 
should be merged into aggregates. I'm really only interested in the 
final row count, but I need to keep all the URLs to combine the 
statistics correctly, since a URL seen on several days must only be 
counted once.
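
The aggregation over a week or a month could be sketched as a k-way merge over the daily files, counting distinct lines as they stream past. Again this is plain Python, not something Hadoop provides; `count_unique` is a hypothetical name, and every input is assumed to be sorted in the same order Python uses to compare strings.

```python
import heapq
from itertools import groupby
from typing import Iterable


def count_unique(sorted_inputs: Iterable[Iterable[str]]) -> int:
    """Count distinct lines across any number of individually sorted inputs."""
    # heapq.merge lazily interleaves the inputs into one sorted stream.
    merged = heapq.merge(*sorted_inputs)
    # groupby collapses each run of equal adjacent lines into one group.
    return sum(1 for _line, _run in groupby(merged))
```

Simply adding up the per-day counts would overstate the total whenever the same URL appears on more than one day, which is why the merge has to see the URLs themselves.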

Is what I've described readily available within Hadoop (I did some 
looking but didn't find anything)? If not, do you have any pointers for 
how to achieve this type of processing?

/Martin Nilsson
