hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <tdunn...@veoh.com>
Subject Re: hadoop: how to find top N frequently occurring words
Date Mon, 04 Feb 2008 23:30:35 GMT


Approximately this:

   TopNReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
      OrderedSet<KeyWordIntegerPair> top =
            new TreeSet<KeyWordIntegerPair>();
      FileSystem fs;

      void configure(JobConf conf) {
         fs = FileSystem.get(conf);
      }

      void reduce(Text keyword, IntWritable counts,
            OutputCollector<Text, IntWritable> out, Reporter reporter) {
         int sum = 0;
         while (counts.hasNext) {
            sum += counts.next();
         }

         if (top.size() < 10 || sum > top.first().getCount()) {
              top.add(new KeyWordIntegerPair(keyword, sum);
         }

         while (top.size() > 10) {
             top.remove(0);
         }
     }

     void close() {
         PrintWriter out = new PW(fs.create(new Path("top-counts")));
         for (v : top) {
             out.printf("%s\t%d\n", v.keyword(), v.count());
         }
     }
   }

You will have to fix the errors I made in typing this off the cuff, of
course.



On 2/4/08 3:19 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:

> On Feb 4, 2008 3:07 PM, Ted Dunning <tdunning@veoh.com> wrote:
>> 
>> Yes, you can do it in one pass, but the reducer will have to accumulate the
>> top N items.  If N is small, this is no problem.  If N is large, that is a
>> problem.  You also have the problem that the reducer has a close method
>> where it can output the accumulated data, but it can't go into the normal
>> output channel because you don't have access to the output collector at that
>> point.
>> 
> 
> Could you elaborate this a bit.
> My log file looks like this -
> 
> keyword  source  dateID
> 
> right now my mapper output the following as key- value pair
> keyword_source_dateID - 1
> 
> reducer counts the 1s.. and output
> keyword_source_dateId  frequency
> 
> so it is just the word count program so far. I have another program
> that identifies the top N keywords. Please tell me how can I modify
> the reducer to accumulate top N items. N is small e.g 10
> 
> thanks,
> Taran
> 
>> In practical situations, your count reducer can eliminate items with very
>> small counts that you know cannot be in the top N.  This makes the total
>> output much smaller than the input.  This means that making a second pass
>> over the data costs very little.  Even without the threshold, the second
>> pass will likely be so much faster than the first that it doesn't matter.
>> 
>> IF you are counting things that satisfy Zipf's law, then the counts will be
>> proportional to 1/r where r is the rank of the item.  Using this, you can
>> show that the average count for your keywords will be
>> 
>>      E(k) = N H_m,2 / (H_m)^2
>> 
>> Where N is the total number of words counted, m is the possible vocabulary,
>> H_m is the mth harmonic number (approximately log m) and H_m,2 is the mth
>> second order harmonic number (approximately 1.6).
>> 
>> This means that you should have a compression of approximately
>> 
>>      1.6 / log(m)^2
>> 
>> 
>> 
>> On 2/4/08 2:20 PM, "Tarandeep Singh" <tarandeep@gmail.com> wrote:
>> 
>>> On Feb 4, 2008 2:11 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>>>> This is exactly the same as word counting, except that you have a second
>>>> pass to find the top n per block of data (this can be done in a mapper) and
>>>> then a reducer can quite easily merge the results together.
>>>> 
>>> 
>>> This would mean I have to write a second program that reads the output
>>> of first and does the job. I was wondering if it could be done in one
>>> program.
>> 
>> 


Mime
View raw message