On Feb 4, 2008 3:07 PM, Ted Dunning <tdunning@veoh.com> wrote:
>
> Yes, you can do it in one pass, but the reducer will have to accumulate the
> top N items. If N is small, this is no problem. If N is large, that is a
> problem. You also have the problem that the reducer has a close method
> where it can output the accumulated data, but it can't go into the normal
> output channel because you don't have access to the output collector at that
> point.
>
Could you elaborate on this a bit?
My log file looks like this:
keyword source dateID
Right now my mapper outputs the following key-value pair:
keyword_source_dateID  1
and the reducer counts the 1s and outputs:
keyword_source_dateID frequency
So it is just the word count program so far. I have another program
that identifies the top N keywords. Please tell me how I can modify
the reducer to accumulate the top N items. N is small, e.g. 10.
thanks,
Taran
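A minimal sketch of the accumulation Ted describes, in plain Java. The Hadoop wiring is only described in comments (the old API's reduce()/close() structure is an assumption about your setup, not something shown in this thread); the class below is just the bounded top-N logic you would keep as a field in the reducer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: keep one of these as a field in your Reducer. Each call to
// reduce(key, values, output, reporter) sums the 1s and calls add(key, sum)
// instead of writing to the collector; close() then emits the surviving N
// entries. Note Ted's caveat: close() is not handed an OutputCollector, so
// you would have to stash the collector reference seen in reduce() (or
// write a side file) in order to emit there.
public class TopNAccumulator {
    static class Entry implements Comparable<Entry> {
        final String key;
        final long count;
        Entry(String key, long count) { this.key = key; this.count = count; }
        public int compareTo(Entry o) { return Long.compare(count, o.count); }
    }

    private final int n;
    // Min-heap: the smallest of the current top N sits on top, so each
    // add() is O(log N) and memory stays bounded at N entries.
    private final PriorityQueue<Entry> heap = new PriorityQueue<Entry>();

    public TopNAccumulator(int n) { this.n = n; }

    public void add(String key, long count) {
        if (heap.size() < n) {
            heap.add(new Entry(key, count));
        } else if (count > heap.peek().count) {
            heap.poll();                      // evict the current minimum
            heap.add(new Entry(key, count));
        }
    }

    // Called from close(); returns entries in descending count order.
    public List<Entry> top() {
        List<Entry> out = new ArrayList<Entry>(heap);
        Collections.sort(out, Collections.reverseOrder());
        return out;
    }
}
```

Since N is small (10 here), the heap costs almost nothing per key, which is exactly the "If N is small, this is no problem" case.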
> In practical situations, your count reducer can eliminate items with very
> small counts that you know cannot be in the top N. This makes the total
> output much smaller than the input. This means that making a second pass
> over the data costs very little. Even without the threshold, the second
> pass will likely be so much faster than the first that it doesn't matter.
>
> If you are counting things that satisfy Zipf's law, then the counts will be
> proportional to 1/r where r is the rank of the item. Using this, you can
> show that the average count for your keywords will be
>
> E(k) = N H_m,2 / (H_m)^2
>
> Where N is the total number of words counted, m is the possible vocabulary,
> H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
> second order harmonic number (approximately 1.6).
>
> This means that you should have a compression of approximately
>
> 1.6 / log(m)^2
>
>
>
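For anyone checking the numbers in Ted's estimate: the approximations come from H_m ~ log(m) + Euler's constant and H_{m,2} converging to pi^2/6 ~ 1.645 (the "approximately 1.6"). A small sketch that evaluates E(k) = N * H_{m,2} / (H_m)^2 directly (my own arithmetic, not from the original mail):

```java
// Evaluate Ted's estimate E(k) = N * H_{m,2} / (H_m)^2 by summing the
// harmonic series directly rather than using the log approximations.
public class ZipfEstimate {
    // totalWords = N (total words counted), vocab = m (possible vocabulary).
    static double expectedCount(long totalWords, int vocab) {
        double h = 0.0;   // H_m   = sum of 1/r
        double h2 = 0.0;  // H_m,2 = sum of 1/r^2, converges to pi^2/6
        for (int r = 1; r <= vocab; r++) {
            h  += 1.0 / r;
            h2 += 1.0 / ((double) r * r);
        }
        return totalWords * h2 / (h * h);
    }
}
```

For example, with N = 1,000,000 words and a vocabulary of m = 100,000, the average count per distinct keyword comes out to roughly 11,000, i.e. most of the mass sits on few keys, which is why the second pass over the counted output is so cheap.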
> On 2/4/08 2:20 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
>
> > On Feb 4, 2008 2:11 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
> >> This is exactly the same as word counting, except that you have a second
> >> pass to find the top n per block of data (this can be done in a mapper) and
> >> then a reducer can quite easily merge the results together.
> >>
> >
> > This would mean I have to write a second program that reads the output
> > of the first and does the job. I was wondering if it could be done in
> > one program.
>
>
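For reference, the two-pass approach Miles describes bottoms out in a very simple merge step: each second-pass mapper keeps only its local top n, and a single reducer merges the local lists. A hedged sketch of that merge in plain Java (names are illustrative, not Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Merge several per-block top-n lists (each a map of keyword -> count)
// into one global top-n, highest counts first. This is what the single
// reducer in Miles's second pass would do.
public class TopNMerge {
    static List<Map.Entry<String, Long>> mergeTopN(
            List<Map<String, Long>> localTops, int n) {
        List<Map.Entry<String, Long>> all = new ArrayList<>();
        for (Map<String, Long> local : localTops) {
            all.addAll(local.entrySet());
        }
        // The global top n is always a subset of the union of the local
        // top-n lists, so sorting the union and truncating is correct.
        all.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return all.subList(0, Math.min(n, all.size()));
    }
}
```

The union is at most (number of blocks) * n entries, so this merge is tiny compared with the first word-count pass.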
