Approximately this (old org.apache.hadoop.mapred API; KeyWordIntegerPair is
assumed to implement Comparable, ordered by count, so first() is the smallest):

public class TopNReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  private final SortedSet<KeyWordIntegerPair> top =
      new TreeSet<KeyWordIntegerPair>();
  private FileSystem fs;

  public void configure(JobConf conf) {
    try {
      fs = FileSystem.get(conf);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text keyword, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> out, Reporter reporter) {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    if (top.size() < 10 || sum > top.first().count()) {
      // copy the key -- Hadoop reuses the Text object between calls
      top.add(new KeyWordIntegerPair(new Text(keyword), sum));
    }
    while (top.size() > 10) {
      top.remove(top.first());   // evict the smallest count
    }
  }

  public void close() throws IOException {
    PrintWriter out = new PrintWriter(fs.create(new Path("topcounts")));
    for (KeyWordIntegerPair v : top) {
      out.printf("%s\t%d%n", v.keyword(), v.count());
    }
    out.close();
  }
}
You will have to fix the errors I made in typing this off the cuff, of
course.
On 2/4/08 3:19 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
> On Feb 4, 2008 3:07 PM, Ted Dunning <tdunning@veoh.com> wrote:
>>
>> Yes, you can do it in one pass, but the reducer will have to accumulate the
>> top N items. If N is small, this is no problem. If N is large, that is a
>> problem. You also have the problem that the reducer has a close method
>> where it can output the accumulated data, but it can't go into the normal
>> output channel because you don't have access to the output collector at that
>> point.
>>
>
> Could you elaborate on this a bit?
> My log file looks like this:
>
> keyword source dateID
>
> right now my mapper outputs the following key-value pair:
> keyword_source_dateID  1
>
> the reducer counts the 1s and outputs:
> keyword_source_dateID  frequency
>
> so it is just the word count program so far. I have another program
> that identifies the top N keywords. Please tell me how I can modify
> the reducer to accumulate the top N items. N is small, e.g. 10.
>
> thanks,
> Taran
>
>> In practical situations, your count reducer can eliminate items with very
>> small counts that you know cannot be in the top N. This makes the total
>> output much smaller than the input. This means that making a second pass
>> over the data costs very little. Even without the threshold, the second
>> pass will likely be so much faster than the first that it doesn't matter.
>>
>> If you are counting things that satisfy Zipf's law, then the counts will be
>> proportional to 1/r, where r is the rank of the item. Using this, you can
>> show that the average count for your keywords will be
>>
>> E(k) = N H_m,2 / (H_m)^2
>>
>> where N is the total number of words counted, m is the size of the possible
>> vocabulary, H_m is the m-th harmonic number (approximately log m), and
>> H_m,2 is the m-th second-order harmonic number (approximately 1.6).
>>
>> This means that you should have a compression of approximately
>>
>> 1.6 / log(m)^2
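
As a rough sanity check on the estimate above, here is a small sketch that
computes the exact harmonic sums H_m and H_m,2 (rather than the log and 1.6
approximations) and the resulting compression ratio; the class name and the
example vocabulary size of one million are illustrative, not from the thread:

```java
public class ZipfCompression {
    // Compression ratio H_m,2 / (H_m)^2 for a Zipf-distributed
    // vocabulary of m distinct words, using the exact sums.
    static double compression(int m) {
        double h = 0.0;   // H_m   = sum over r of 1/r
        double h2 = 0.0;  // H_m,2 = sum over r of 1/r^2
        for (int r = 1; r <= m; r++) {
            h += 1.0 / r;
            h2 += 1.0 / ((double) r * r);
        }
        return h2 / (h * h);
    }

    public static void main(String[] args) {
        // For a million-word vocabulary the second pass sees well
        // under 1% of the volume of the raw tokens.
        System.out.printf("m = 10^6: compression = %.4f%n",
                          compression(1000000));
    }
}
```

For m = 10^6, H_m is about 14.4 and H_m,2 about 1.645, so the ratio comes
out near 0.008, which matches the 1.6 / log(m)^2 figure quoted above.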
>>
>>
>>
>> On 2/4/08 2:20 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
>>
>>> On Feb 4, 2008 2:11 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>>>> This is exactly the same as word counting, except that you have a second
>>>> pass to find the top n per block of data (this can be done in a mapper) and
>>>> then a reducer can quite easily merge the results together.
>>>>
>>>
>>> This would mean I have to write a second program that reads the output
>>> of first and does the job. I was wondering if it could be done in one
>>> program.
>>
>>
