hadoop-common-user mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Gets sum of all integers between map tasks
Date Tue, 07 Oct 2008 09:55:12 GMT
I would like to compute the spam probability P(word|category) for the
words in the files of one category (bad/good e-mails), as described
below. To compute it in the reduce step, I need the sum of "spamTotal"
across all map tasks. How can I get it?

Map:

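  // Class header was not shown in the original message; the name Map is
  // assumed. splitregex, wordregex, count and spamTotal are fields of this
  // class that are likewise not shown.
  public static class Map extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, FloatWritable> {

    /**
     * Counts word frequency and the per-task total of matched words
     */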
    public void map(LongWritable key, Text value,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      String[] tokens = line.split(splitregex);

      // For every word token
      for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
          spamTotal++;
          output.collect(new Text(word), count);
        }
      }
    }
  }

Reduce:

  /**
   * Computes bad count / total bad words
   */
  public static class Reduce extends MapReduceBase implements
      Reducer<Text, FloatWritable, Text, FloatWritable> {

    public void reduce(Text key, Iterator<FloatWritable> values,
        OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += (int) values.next().get();
      }

      FloatWritable badProb = new FloatWritable((float) sum / spamTotal);
      output.collect(key, badProb);
    }
  }
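
For illustration only, here is a minimal plain-Java sketch of the quantity
being asked for. It uses no Hadoop APIs; the class SpamTotalSketch, its
methods, and both regular expressions are hypothetical stand-ins, not the
code from the job above. Each simulated map task keeps its own spamTotal,
and the probability is only correct once the per-task totals are summed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpamTotalSketch {

  /** Result of one simulated map task: local word counts plus the local total. */
  static final class TaskResult {
    final Map<String, Integer> counts = new HashMap<>();
    int spamTotal = 0; // per-task total, analogous to the field in the map code
  }

  /** Simulates one map task working on its own input split. */
  static TaskResult mapTask(List<String> lines) {
    TaskResult r = new TaskResult();
    for (String line : lines) {
      for (String token : line.split("\\s+")) { // assumed split regex
        String word = token.toLowerCase();
        if (word.matches("[a-z]+")) {           // assumed word regex
          r.spamTotal++;
          r.counts.merge(word, 1, Integer::sum);
        }
      }
    }
    return r;
  }

  /** P(word|category) = count(word) / (sum of spamTotal over all tasks). */
  static Map<String, Float> probabilities(List<TaskResult> tasks) {
    int globalTotal = 0;
    Map<String, Integer> merged = new HashMap<>();
    for (TaskResult t : tasks) {
      globalTotal += t.spamTotal; // the cross-task sum the question is about
      t.counts.forEach((w, c) -> merged.merge(w, c, Integer::sum));
    }
    Map<String, Float> probs = new HashMap<>();
    for (Map.Entry<String, Integer> e : merged.entrySet()) {
      int count = e.getValue();
      probs.put(e.getKey(), (float) count / globalTotal);
    }
    return probs;
  }

  public static void main(String[] args) {
    TaskResult a = mapTask(Arrays.asList("buy cheap pills", "cheap offer"));
    TaskResult b = mapTask(Arrays.asList("cheap cheap deal"));
    Map<String, Float> p = probabilities(Arrays.asList(a, b));
    // "cheap" occurs 4 times among 8 words in total
    System.out.println("P(cheap|spam) = " + p.get("cheap")); // prints 0.5
  }
}
```

In this sketch, dividing by a single task's spamTotal (5 or 3) instead of
the combined total (8) would overstate the probability, which is the
problem with reading spamTotal directly in the reduce code above.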


-- 
Best regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org
