hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <tdunn...@veoh.com>
Subject Re: hadoop: how to find top N frequently occurring words
Date Mon, 04 Feb 2008 23:07:53 GMT

Yes, you can do it in one pass, but the reducer will have to accumulate the
top N items.  If N is small, this is no problem.  If N is large, that is a
problem.  You also have the problem that the reducer has a close method
where it can output the accumulated data, but it can't go into the normal
output channel because you don't have access to the output collector at that
point.

In practical situations, your count reducer can eliminate items with very
small counts that you know cannot be in the top N.  This makes the total
output much smaller than the input.  This means that making a second pass
over the data costs very little.  Even without the threshold, the second
pass will likely be so much faster than the first that it doesn't matter.

IF you are counting things that satisfy Zipf's law, then the counts will be
proportional to 1/r where r is the rank of the item.  Using this, you can
show that the average count for your keywords will be

     E(k) = N H_m,2 / (H_m)^2

Where N is the total number of words counted, m is the possible vocabulary,
H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
second order harmonic number (approximately 1.6).

This means that you should have a compression of approximately

     1.6 / log(m)^2


On 2/4/08 2:20 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:

> On Feb 4, 2008 2:11 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>> This is exactly the same as word counting, except that you have a second
>> pass to find the top n per block of data (this can be done in a mapper) and
>> then a reducer can quite easily merge the results together.
>> 
> 
> This would mean I have to write a second program that reads the output
> of first and does the job. I was wondering if it could be done in one
> program.


Mime
View raw message