hadoop-common-user mailing list archives

From <jeremy.huylebro...@orange-ftgroup.com>
Subject RE: Grouping Values for Reducer Input
Date Mon, 13 Apr 2009 22:20:06 GMT
I'm not familiar with setOutputValueGroupingComparator.
 
What about adding the doc# to the key and using your own
hashing/Partitioner?
So the map output would look something like:
cat_doc5-> 1
cat_doc1-> 1
cat_doc5-> 3
 
The hashing method would use only the part of the key before "_" to compute the hash.
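
A minimal sketch of that hashing rule, written as plain Java so it runs without Hadoop on the classpath. The class and method names (PrefixPartitionSketch, prefixOf, partitionFor) are hypothetical. In a real job, the same logic would presumably sit in the getPartition method of a custom Partitioner registered with JobConf.setPartitionerClass, and the key would be a Text rather than a String.

```java
public class PrefixPartitionSketch {

    // Hypothetical helper: the part of the composite key before the first '_'.
    // Keys without '_' are returned unchanged.
    public static String prefixOf(String compositeKey) {
        int idx = compositeKey.indexOf('_');
        return idx < 0 ? compositeKey : compositeKey.substring(0, idx);
    }

    // Hash only the prefix, so "cat_doc1" and "cat_doc5" land in the same partition.
    // The sign bit is masked off so the result is never negative.
    public static int partitionFor(String compositeKey, int numPartitions) {
        return (prefixOf(compositeKey).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Both lines print the same partition number.
        System.out.println(partitionFor("cat_doc5", 4));
        System.out.println(partitionFor("cat_doc1", 4));
    }
}
```

This assumes the word itself never contains "_"; if it can, a different separator or a structured key type would be needed.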
 
 
The shuffle would still put the cat_xxx keys together using your
hashing, but sort them the way you need:
cat_doc1->1
cat_doc5->1
cat_doc5->3
 
Then the reduce task can count for each doc# in a "cat".
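
To make the idea concrete, here is a small simulation in plain Java (no Hadoop involved) of what the sort and the per-key summing would do to the three example records. CompositeKeySumSketch and sumPerCompositeKey are hypothetical names, and plain string ordering stands in for the framework's key sort; this is a sketch of the idea, not of Hadoop's actual shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CompositeKeySumSketch {

    // Sort the records by composite key, then sum each run of identical keys.
    public static List<String> sumPerCompositeKey(List<Map.Entry<String, Long>> mapOutput) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(mapOutput);
        sorted.sort(Map.Entry.comparingByKey()); // stands in for the shuffle's key sort

        List<String> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        for (Map.Entry<String, Long> e : sorted) {
            // A new key closes the previous run.
            if (currentKey != null && !e.getKey().equals(currentKey)) {
                out.add(currentKey + "->" + sum);
                sum = 0;
            }
            currentKey = e.getKey();
            sum += e.getValue();
        }
        if (currentKey != null) {
            out.add(currentKey + "->" + sum); // flush the last run
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = new ArrayList<>();
        mapOutput.add(new AbstractMap.SimpleEntry<>("cat_doc5", 1L));
        mapOutput.add(new AbstractMap.SimpleEntry<>("cat_doc1", 1L));
        mapOutput.add(new AbstractMap.SimpleEntry<>("cat_doc5", 3L));
        System.out.println(sumPerCompositeKey(mapOutput)); // [cat_doc1->1, cat_doc5->4]
    }
}
```

In the real job each distinct composite key would get its own reduce call, so the reducer would only need to sum the values it is handed.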

 
________________________________

From: Streckfus, William [USA] [mailto:streckfus_william@bah.com] 
Sent: Monday, April 13, 2009 2:53 PM
To: core-user@hadoop.apache.org
Subject: Grouping Values for Reducer Input


Hi Everyone,
 
I'm working on a relatively simple MapReduce job with a slight
complication with regard to the ordering of my key/values heading into
the reducer. The output from the mapper might be something like
 
cat -> doc5, 1
cat -> doc1, 1
cat -> doc5, 3
...
 
Here, 'cat' is my key and the value is the document ID and the count (my
own WritableComparable). Originally I was going to create a HashMap in
the reduce method and add an entry for each document ID and sum the
counts for each. I realized the method would be better if the values
were in order like so:
 
cat -> doc1, 1
cat -> doc5, 1
cat -> doc5, 3
...
 
Using this style, I can continue summing until I reach a new document ID
and collect the output at that point, thus avoiding data structures
and object creation costs. I tried setting
JobConf.setOutputValueGroupingComparator() but this didn't seem to do
anything. In fact, I threw an exception from the Comparator I supplied
but this never showed up when running the job. My map output value
consists of a UTF and a Long so perhaps the Comparator I'm using
(identical to Text.Comparator) is incorrect:
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)
{
    int n1 = WritableUtils.decodeVIntSize(b1[s1]);
    int n2 = WritableUtils.decodeVIntSize(b2[s2]);

    return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
}

In my final output, the same word -> documentID pair is being output
multiple times. So for the above example I have
multiple lines with cat -> doc5, X.
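
Repeated lines like these would be consistent with values for the same document ID not arriving next to each other. As a point of comparison, here is a minimal plain-Java sketch of the HashMap-style approach mentioned earlier, which does not depend on value order. HashMapSumSketch and sumByDocument are hypothetical names, parallel arrays stand in for the value iterator, and a TreeMap is used only so the output comes out sorted by document ID.

```java
import java.util.Map;
import java.util.TreeMap;

public class HashMapSumSketch {

    // Sum counts per document ID regardless of the order the values arrive in.
    // docIds[i] and counts[i] together represent one (documentID, count) value.
    public static Map<String, Long> sumByDocument(String[] docIds, long[] counts) {
        Map<String, Long> totals = new TreeMap<>();
        for (int i = 0; i < docIds.length; i++) {
            // merge() inserts the count, or adds it to the existing total.
            totals.merge(docIds[i], counts[i], Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] docIds = {"doc5", "doc1", "doc5"};
        long[] counts = {1L, 1L, 3L};
        System.out.println(sumByDocument(docIds, counts)); // {doc1=1, doc5=4}
    }
}
```

The trade-off is the one already noted: this holds one map entry per distinct document ID for the current word, whereas the run-based reducer needs only a running sum if the values are sorted.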
 
Reducer method just in case:
public void reduce(Text key, Iterator<TermFrequencyWritable> values,
OutputCollector<Text, TermFrequencyWritable> output, Reporter reporter)
throws IOException {
    long sum = 0;
    String lastDocID = null;

    // Iterate through all values
    while(values.hasNext()) {
        TermFrequencyWritable value = values.next();

        // Encountered new document ID = record and reset
        if(!value.getDocumentID().equals(lastDocID)) {
            // Skip the flush until a first document ID has been seen
            if(lastDocID != null) {
                termFrequency.setDocumentID(lastDocID);
                termFrequency.setFrequency(sum);
                output.collect(key, termFrequency);
            }

            sum = 0;
            lastDocID = value.getDocumentID();
        }

        sum += value.getFrequency();
    }

    // Record last one
    termFrequency.setDocumentID(lastDocID);
    termFrequency.setFrequency(sum);
    output.collect(key, termFrequency);
}

 
Any ideas? (Using Hadoop 0.19.1.)
 
Thanks,
- Bill
