hadoop-common-user mailing list archives

From "Tarandeep Singh" <tarand...@gmail.com>
Subject Re: hadoop: how to find top N frequently occurring words
Date Mon, 04 Feb 2008 23:19:14 GMT
On Feb 4, 2008 3:07 PM, Ted Dunning <tdunning@veoh.com> wrote:
> Yes, you can do it in one pass, but the reducer will have to accumulate the
> top N items.  If N is small, this is no problem.  If N is large, that is a
> problem.  You also have the problem that the reducer has a close method
> where it can output the accumulated data, but it can't go into the normal
> output channel because you don't have access to the output collector at that
> point.

Could you elaborate on this a bit?
My log file looks like this:

keyword  source  dateID

Right now my mapper outputs the following key-value pair:
keyword_source_dateID - 1

The reducer counts the 1s and outputs:
keyword_source_dateID  frequency

So it is just the word count program so far. I have another program
that identifies the top N keywords. Please tell me how I can modify
the reducer to accumulate the top N items. N is small, e.g. 10.
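
A minimal sketch of what Ted describes, using the old org.apache.hadoop.mapred
API that was current at the time: the reducer still does ordinary word
counting, additionally keeps the N largest counts it has seen in memory, and
dumps them from close(). Because close() gets no OutputCollector, the sketch
writes the accumulated list to a side file in HDFS; the side-file path, the
"mapred.task.id" property, and the class name are illustrative assumptions,
and keywords with identical counts collapse for brevity.

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TopNCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private static final int N = 10;        // keep only the 10 largest counts

  private JobConf conf;
  // count -> keyword; firstKey() is the smallest count and is evicted first
  // (keywords with identical counts overwrite each other in this sketch)
  private final TreeMap<Integer, String> topN = new TreeMap<Integer, String>();

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));   // normal word-count output

    topN.put(sum, key.toString());               // remember this candidate
    if (topN.size() > N) {
      topN.remove(topN.firstKey());              // drop the current smallest
    }
  }

  public void close() throws IOException {
    // No OutputCollector is available here, so write the accumulated top N
    // to a side file in HDFS instead of the normal output channel.
    FileSystem fs = FileSystem.get(conf);
    Path side = new Path("topN/" + conf.get("mapred.task.id"));
    FSDataOutputStream out = fs.create(side);
    for (Map.Entry<Integer, String> e : topN.entrySet()) {
      out.writeBytes(e.getValue() + "\t" + e.getKey() + "\n");
    }
    out.close();
  }
}

If the job runs with more than one reduce task, each reducer leaves its own
side file, so a tiny final merge of (number of reducers) x N lines is still
needed.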


> In practical situations, your count reducer can eliminate items with very
> small counts that you know cannot be in the top N.  This makes the total
> output much smaller than the input.  This means that making a second pass
> over the data costs very little.  Even without the threshold, the second
> pass will likely be so much faster than the first that it doesn't matter.
> If you are counting things that satisfy Zipf's law, then the counts will be
> proportional to 1/r where r is the rank of the item.  Using this, you can
> show that the average count for your keywords will be
>      E(k) = N H_m,2 / (H_m)^2
> where N is the total number of words counted, m is the size of the possible vocabulary,
> H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
> second order harmonic number (approximately 1.6).
> This means that you should have a compression of approximately
>      1.6 / log(m)^2
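
Restated in LaTeX, with one worked number (the vocabulary size m = 10^6 is an
assumed figure for illustration):

% N is the total number of tokens counted, m the vocabulary size,
% H_m the m-th harmonic number (roughly \ln m), and H_{m,2} the m-th
% second-order harmonic number (roughly \pi^2/6 \approx 1.6).
\[
  E(k) = \frac{N \, H_{m,2}}{(H_m)^2},
  \qquad
  \text{compression} \approx \frac{1.6}{(\ln m)^2}.
\]
% Worked example with m = 10^6 distinct keys:
% (\ln 10^6)^2 \approx 13.8^2 \approx 191 and 1.6/191 \approx 0.008,
% so the counted output is on the order of 1% of the raw token stream,
% which is why the second pass is cheap.
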
> On 2/4/08 2:20 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
> > On Feb 4, 2008 2:11 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
> >> This is exactly the same as word counting, except that you have a second
> >> pass to find the top n per block of data (this can be done in a mapper) and
> >> then a reducer can quite easily merge the results together.
> >>
> >
> > This would mean I have to write a second program that reads the output
> > of the first and does the job. I was wondering if it could be done in one
> > program.
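
For the two-pass route Miles suggests above, one simple way to write the
second job is to let the shuffle do the sorting instead of accumulating a
per-mapper top N: emit the negated count as the key, run a single reduce
task, and stop after N records. The sketch below assumes the first job wrote
lines of the form "keyword_source_dateID<TAB>count"; the class names and
field layout are illustrative.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TopNSecondPass {

  // Emits (-count, keyword) so that the largest counts sort first.
  public static class SwapMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<IntWritable, Text> output,
                    Reporter reporter) throws IOException {
      String[] parts = line.toString().split("\t");
      int count = Integer.parseInt(parts[1]);
      output.collect(new IntWritable(-count), new Text(parts[0]));
    }
  }

  // Run with a single reduce task; keys arrive in ascending order of -count,
  // i.e. descending order of count, so the first N values are the answer.
  public static class TopNReducer extends MapReduceBase
      implements Reducer<IntWritable, Text, Text, IntWritable> {
    private static final int N = 10;
    private int emitted = 0;

    public void reduce(IntWritable negCount, Iterator<Text> keywords,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      while (keywords.hasNext() && emitted < N) {
        output.collect(new Text(keywords.next().toString()),
                       new IntWritable(-negCount.get()));
        emitted++;
      }
    }
  }
}

The driver would set conf.setNumReduceTasks(1). With the low-count pruning
Ted mentions, the data reaching this job is small enough that a single
reducer is not a bottleneck.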
