hadoop-common-user mailing list archives

From "Chris Dyer" <redp...@umd.edu>
Subject Re: computing conditional probabilities with Hadoop?
Date Tue, 02 Oct 2007 04:45:08 GMT
Thank you very much-- this is exactly what I need!

On 10/2/07, Owen O'Malley <oom@yahoo-inc.com> wrote:
> On Oct 1, 2007, at 6:05 PM, Chris Dyer wrote:
> > As of right now, I'm still having trouble determining how I can force
> > the first element of the set that will be iterated over by a single
> > reducer to be the marginal, and not some individual count.  Does
> > anyone know if Hadoop guarantees (can be made to guarantee) that the
> > relative order of keys that are equal will be left unchanged?  If so,
> > this would be a fairly easy solution.
> There is no guarantee that the reduce sort is stable in any
> sense. (With the order in which map outputs become available to
> the reduce being non-deterministic, stability wouldn't mean much
> anyway.)
> There certainly isn't enough documentation about what is allowed
> for sorting. I've filed HADOOP-1981 to expand the Reducer javadoc
> to mention the JobConf methods that control the sort order. In
> particular, the methods are:
> setOutputKeyComparatorClass
> setOutputValueGroupingComparator
> The first comparator controls the sort order of the keys. The second
> controls which keys are grouped together into a single call to the
> reduce method. The combination of these two allows you to set up jobs
> that act like you've defined an order on the values.
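The interplay of the two comparators can be simulated outside Hadoop. Below is a minimal, framework-free Python sketch (not the Hadoop API; the function and parameter names here are illustrative): one comparator orders the records, the other decides which runs of "equal" keys share a single reduce call. Note the same constraint Hadoop imposes: the grouping comparator must be a coarsening of the sort comparator, so that records that group together are adjacent after sorting.

```python
from itertools import groupby

def shuffle_and_reduce(pairs, sort_key, group_key, reduce_fn):
    """Simulate the reduce-side shuffle.

    sort_key plays the role of setOutputKeyComparatorClass: it
    orders all (key, value) pairs. group_key plays the role of
    setOutputValueGroupingComparator: keys that compare equal
    under it share one reduce_fn call.
    """
    # Sort every map output with the "key comparator".
    ordered = sorted(pairs, key=lambda kv: sort_key(kv[0]))
    results = []
    # Group adjacent runs with the "grouping comparator"; within a
    # run, values arrive in sort-comparator order.
    for gkey, run in groupby(ordered, key=lambda kv: group_key(kv[0])):
        results.append(reduce_fn(gkey, [v for _, v in run]))
    return results

# Example: composite key (word, timestamp); sort by both, group by
# word only, so each reduce sees its values in timestamp order.
pairs = [(("b", 2), "b@2"), (("a", 3), "a@3"), (("a", 1), "a@1")]
grouped = shuffle_and_reduce(
    pairs,
    sort_key=lambda k: (k[0], k[1]),   # word, then timestamp
    group_key=lambda k: k[0],          # group by word only
    reduce_fn=lambda word, vals: (word, vals),
)
```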
> For example, say that you want to find duplicate web pages and tag
> them all with the url of the "best" known example. You would set up
> the job like:
> Map Input Key: url
> Map Input Value: document
> Map Output Key: document checksum, url pagerank
> Map Output Value: url
> Partitioner: by checksum
> OutputKeyComparator: by checksum and then decreasing pagerank
> OutputValueGroupingComparator: by checksum
> With this setup, the reduce function will be called exactly once
> per checksum, but the first value from the iterator will be the
> url with the highest pagerank, which can then be used to tag the
> other entries in that checksum group.
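The dedup job above can be traced end to end with a small simulation. This Python sketch stands in for the Hadoop pieces (the sample checksums, pageranks, and urls are made up for illustration): the sort mimics the OutputKeyComparator (checksum, then decreasing pagerank), and the grouping mimics the OutputValueGroupingComparator (checksum only).

```python
from itertools import groupby

# Map output records: key = (checksum, pagerank), value = url.
# Sample data is invented for the example.
records = [
    (("abc", 10), "http://mirror.example/page"),
    (("abc", 90), "http://canonical.example/page"),
    (("abc", 40), "http://other.example/page"),
    (("def", 70), "http://unique.example/"),
]

# OutputKeyComparator: by checksum, then decreasing pagerank.
records.sort(key=lambda kv: (kv[0][0], -kv[0][1]))

# OutputValueGroupingComparator: by checksum only, so every record
# with the same checksum lands in a single "reduce" call.
tagged = {}
for checksum, run in groupby(records, key=lambda kv: kv[0][0]):
    urls = [url for _, url in run]
    best = urls[0]  # first value has the highest pagerank
    for url in urls:
        tagged[url] = best
```

After this runs, every duplicate of the "abc" page is tagged with the highest-pagerank url in its group, exactly the behavior the job configuration above is meant to produce.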
> -- Owen
