hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Best Practice?
Date Sun, 10 Feb 2008 03:06:46 GMT


I think that computing centroids in the mapper may not be the best idea.

A different structure that would work well is to use the mapper to assign
data records to centroids, using the centroid number as the reduce key.
Then the reduce itself can compute the new centroids.  You can read the old
centroids from HDFS in the configure method of the mapper.  Lather, rinse,
repeat.
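As an illustrative sketch, one iteration of that structure looks like the
following (plain Python functions standing in for Hadoop's Mapper and
Reducer, not actual Hadoop API code; the function names are my own):

```python
import math

def nearest_centroid(point, centroids):
    # Index of the centroid closest to the point (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_mapper(point, centroids):
    # Assign the record to a cluster; the centroid number is the reduce key.
    # In Hadoop, `centroids` would be loaded from HDFS in configure().
    yield nearest_centroid(point, centroids), point

def kmeans_reducer(centroid_id, points):
    # Average all records that arrived under one key -> the new centroid.
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))
```

Each iteration writes the new centroids back to HDFS, and the next job
reads them in its configure method.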

This process also avoids moving large amounts of data through the
configuration.

This method can be extended to more advanced approaches such as Gaussian
mixtures by emitting each input record multiple times with multiple centroid
keys and a strength of association.
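A sketch of that soft-assignment variant, under the simplifying assumption
of a fixed bandwidth `sigma` (a full EM step would also update variances and
mixing weights; again these are illustrative functions, not Hadoop API):

```python
import math

def soft_mapper(point, centroids, sigma=1.0):
    # Emit the record once per centroid, keyed by centroid number,
    # with a normalized strength of association (Gaussian responsibility).
    weights = [math.exp(-math.dist(point, c) ** 2 / (2 * sigma ** 2))
               for c in centroids]
    total = sum(weights)
    for i, w in enumerate(weights):
        yield i, (w / total, point)

def soft_reducer(centroid_id, weighted_points):
    # Weighted mean of every record associated with this centroid.
    total_w = sum(w for w, _ in weighted_points)
    dim = len(weighted_points[0][1])
    return tuple(sum(w * p[d] for w, p in weighted_points) / total_w
                 for d in range(dim))
```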

Computing centroids in the mapper works well in that it minimizes the amount
of data that is passed to the reducers, but it critically depends on the
availability of sufficient statistics for computing cluster centroids.  This
works fine for Gaussian processes (aka k-means), but there are other mixture
models that require fancier updates than this.

Computing centroids in the reducer allows you to avoid your problem with the
output collector.  If sufficient statistics such as sums (for means) are
available, then you can use a combiner to do the reduction incrementally and
avoid moving too much data around.  The reducer will still have to
accumulate these partial updates for final output, but most of the work will
already have been done.
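Sketching the combiner idea with the same illustrative conventions: the
combiner ships the sufficient statistics (vector sum, count) rather than raw
points, and the reducer just merges partials and divides once at the end.

```python
def combiner(centroid_id, points):
    # Map-side partial reduction: emit (vector sum, count) per key
    # instead of shipping every raw point to the reducer.
    dim = len(points[0])
    sums = tuple(sum(p[d] for p in points) for d in range(dim))
    return centroid_id, (sums, len(points))

def reducer(centroid_id, partials):
    # Merge the partial (sum, count) pairs; divide only at the very end.
    total_n = sum(n for _, n in partials)
    dim = len(partials[0][0])
    return tuple(sum(s[d] for s, _ in partials) / total_n
                 for d in range(dim))
```

Note that this only works because sums and counts compose associatively;
that is exactly the sufficient-statistics requirement mentioned above.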

All of this is completely analogous to word-counting, actually.  You don't
accumulate counts in the mapper; you accumulate partial sums in the combiner
and final sums in the reducer.
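The word-count analogy, in the same sketch style:

```python
def wc_mapper(line):
    # Emit (word, 1) for every word; no counting happens in the mapper.
    for word in line.split():
        yield word, 1

def wc_combiner(word, counts):
    # Partial sum on the map side.
    return word, sum(counts)

def wc_reducer(word, partial_sums):
    # Final sum across all map tasks.
    return word, sum(partial_sums)
```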

On 2/9/08 4:21 PM, "Jeff Eastman" <jeastman@collab.net> wrote:

> Thanks Aaron, I missed that one. Now I have my configuration information
> in my mapper. In the mapper, I'm computing cluster centroids by reading
> all the input points and assigning them to clusters. I don't actually
> store the points in the mapper, just the evolving centroids.
> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available. Is there a way to do
> this, or do I need to backtrack?
> Jeff
