hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Owen O'Malley" <...@yahoo-inc.com>
Subject Re: computing conditional probabilities with Hadoop?
Date Tue, 02 Oct 2007 04:39:07 GMT
On Oct 1, 2007, at 6:05 PM, Chris Dyer wrote:

> As of right now, I'm still having trouble determining how I can force
> the first element of the set that will be iterated over by a single
> reducer to be the marginal, and not some individual count.  Does
> anyone know if Hadoop guarantees (can be made to guarantee) that the
> relative order of keys that are equal will be left unchanged?  If so,
> this would be a fairly easy solution.

There is not a guarantee of the reduce sort being stable in any  
sense. (WIth the non-deterministic order of the map outputs being  
available to the reduce, it wouldn't make that much sense.)

There certainly isn't enough documentation about what is allowed for  
sorting. I've filed a bug HADOOP-1981 to expand the Reducer java doc  
to mention the JobConf methods that can control the sort order. In  
particular, the methods are:

setOutputKeyComparatorClass
setOutputValueGroupingComparator

The first comparator controls the sort order of the keys. The second  
controls which keys are grouped together into a single call to the  
reduce method. The combination of these two allows you to set up jobs  
that act like you've defined an order on the values.

For example, say that you want to find duplicate web pages and tag  
them all with the url of the "best" known example. You would set up  
the job like:

Map Input Key: url
Map Input Value: document
Map Output Key: document checksum, url pagerank
Map Output Value: url
Partitioner: by checksum
OutputKeyComparator: by checksum and then decreasing pagerank
OutputValueGroupingComparator: by checksum

with this setup, the reduce function will be called exactly once with  
each checksum, but the first value from the iterator will be the one  
with the highest pagerank, which can then be used to tag the other  
entries of the checksum family.

-- Owen

Mime
View raw message