mahout-dev mailing list archives

From Ted Dunning <>
Subject Re: Helping out with the .7 release
Date Thu, 23 Feb 2012 04:23:26 GMT
All-reduce is a non-map-reduce primitive stolen from MPI. It is used, for example, in VW (Vowpal Wabbit) to
accumulate gradient information without additional map-reduce iterations.

The all-reduce operation works by building a tree of all tasks. A state is sent up the tree
from the leaves. Each internal node adds together the children's states and adds in its own.
At the root we have the combination of all states and that result is sent back down the tree.
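The up-and-down tree exchange Ted describes can be sketched as follows. This is purely illustrative (all names are hypothetical, and a real implementation runs across networked tasks rather than over one in-memory array); it just shows the shape of the operation: partial states are summed up a binary tree, and the root's combined result is broadcast back so every task ends up with the same total.

```java
// Hypothetical sketch of tree all-reduce, simulated sequentially:
// states flow up a binary tree of "tasks", are summed at each internal
// node, and the combined result flows back down to every task.
public class AllReduceSketch {
    // Combine all task states by walking up a binary tree; return the total.
    static double allReduce(double[] states) {
        int n = states.length;
        double[] partial = java.util.Arrays.copyOf(states, n);
        // Up-sweep: at each level, a node absorbs its sibling's partial sum.
        for (int stride = 1; stride < n; stride *= 2) {
            for (int i = 0; i + stride < n; i += 2 * stride) {
                partial[i] += partial[i + stride];
            }
        }
        double total = partial[0];   // the root now holds the sum of all states
        // Down-sweep: broadcast the combined result back to every task.
        for (int i = 0; i < n; i++) states[i] = total;
        return total;
    }

    public static void main(String[] args) {
        double[] gradients = {1.0, 2.0, 3.0, 4.0};  // per-task gradient contributions
        double sum = allReduce(gradients);
        System.out.println(sum);   // every task now sees the same combined state
    }
}
```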

In practice, all mappers iterate through their input slice and do an all-reduce. Then they
reset their input and repeat. Commonly the root node will include a termination flag to signal
when iteration should stop.

The effect is that iterations don't require spawning a new map-reduce job, and thus we save
considerable time at each step. Indeed, if the input can fit into memory, we can gain even
more speed: with in-memory operation we may get two orders of magnitude speed-up. With data
too large to fit in memory, gains will be more modest.
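The iterate/all-reduce/reset loop above can be sketched as a single driver routine. All names here are invented for illustration; the `AllReduce` interface stands in for the network primitive, and the "termination flag" is modeled as a simple convergence test applied to the combined state.

```java
// Minimal sketch (hypothetical names) of the map-only iteration loop:
// each pass reads the whole input slice, all-reduces the accumulated
// state, then resets the input and repeats until termination fires.
import java.util.List;

public class IterateUntilConverged {
    interface AllReduce { double combine(double localState); }  // assumed network primitive

    // Returns the number of passes taken before the termination criterion fired.
    static int run(List<Double> inputSlice, AllReduce net, double tolerance, int maxIters) {
        double previous = Double.MAX_VALUE;
        for (int iter = 1; iter <= maxIters; iter++) {
            double local = 0.0;
            for (double x : inputSlice) local += x * x;  // stand-in for gradient accumulation
            double global = net.combine(local);          // all-reduce across all mappers
            // The "termination flag" here is just a convergence test at the root.
            if (Math.abs(previous - global) < tolerance) return iter;
            previous = global;
            // Input is reset (re-opened) and the next pass begins -- no new MR job.
        }
        return maxIters;
    }

    public static void main(String[] args) {
        List<Double> slice = List.of(0.5, 1.5, 2.0);
        int passes = run(slice, local -> local, 1e-9, 10);  // single-node "network"
        System.out.println(passes);
    }
}
```

The point of the sketch is the control flow: no job-launch overhead appears between iterations, which is where the claimed time savings come from.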

Sent from my iPhone

On Feb 22, 2012, at 4:01 PM, Jeff Eastman <> wrote:

> Hey Ted,
> Could you elaborate on this approach? I don't grok how an "all reduce implementation"
can be done with a "map-only job", or how a mapper could do "all iteration[s] internally".
> I've just gotten the ClusterIterator working in MR mode and it does what I thought we'd
been talking about earlier: In each iteration, each mapper loads all the prior clusters and
then iterates through all its input points, training each of the prior clusters in the process.
Then, in the cleanup() method, all the trained clusters are sent to the reducers keyed by
their model indexes. This eliminates the need for a combiner and means each reducer only has
to merge n mappers' worth of trained clusters into a posterior trained cluster before it is
output. If numReducers == k then the current reduce-step overloads should disappear.
> The secret to this implementation is to allow clusters to observe other clusters in addition
to observing vectors, thereby accumulating all of those clusters' observation statistics before
recomputing posterior parameters.
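Jeff's mapper/cleanup/reducer pattern can be simulated Hadoop-free as below. The class and field names are illustrative, not Mahout's, and a scalar "centroid" with count/sum sufficient statistics stands in for real cluster models; the shape is what matters: each mapper trains every prior cluster on its slice, emits the trained stats keyed by model index, and each reducer merges the per-mapper stats for one key.

```java
// Hedged, Hadoop-free simulation of the train-then-merge pattern:
// mappers accumulate per-cluster sufficient statistics; reducers
// merge each mapper's stats for a given model index.
import java.util.*;

public class ClusterMergeSketch {
    // Sufficient statistics for one cluster: count and sum (enough for a mean).
    static class Stats {
        long n; double sum;
        void observe(double x) { n++; sum += x; }
        void merge(Stats other) { n += other.n; sum += other.sum; }  // "cluster observes cluster"
        double mean() { return sum / n; }
    }

    // "Map" phase: one Stats per prior cluster, trained on this mapper's points.
    static Map<Integer, Stats> mapPhase(double[] priors, double[] points) {
        Map<Integer, Stats> trained = new HashMap<>();
        for (int k = 0; k < priors.length; k++) trained.put(k, new Stats());
        for (double p : points) {
            int nearest = 0;                       // assign to closest prior centroid
            for (int k = 1; k < priors.length; k++)
                if (Math.abs(p - priors[k]) < Math.abs(p - priors[nearest])) nearest = k;
            trained.get(nearest).observe(p);
        }
        return trained;   // emitted in cleanup(), keyed by model index
    }

    // "Reduce" phase: merge each mapper's stats for one model index.
    static Stats reducePhase(List<Stats> perMapper) {
        Stats posterior = new Stats();
        for (Stats s : perMapper) posterior.merge(s);
        return posterior;
    }

    public static void main(String[] args) {
        double[] priors = {0.0, 10.0};
        Map<Integer, Stats> m1 = mapPhase(priors, new double[]{1, 2, 9});
        Map<Integer, Stats> m2 = mapPhase(priors, new double[]{0, 11, 12});
        Stats cluster0 = reducePhase(List.of(m1.get(0), m2.get(0)));
        System.out.println(cluster0.mean());  // posterior mean of cluster 0
    }
}
```

Because the stats are additive, merging clusters is exactly "observing" another cluster, which is the secret Jeff mentions above.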
> On 2/22/12 1:42 PM, Ted Dunning wrote:
>> I would also like to see if we can put an all reduce implementation into this effort.
The idea is that we can use a map only job that does all iteration internally. I think that
this could result in more than an order of magnitude speed up for our clustering codes.  It
could also provide similar benefits for the nascent parallel classifier training work.
>> This seems to be a cleanup of a long standing wart in our code but it is reasonable
that others may feel differently.
>> Sent from my iPhone
>> On Feb 22, 2012, at 10:32 AM, Jeff Eastman<>  wrote:
>>> This refactoring is focused on some of the iterative clustering algorithms which,
in each iteration, load a prior set of clusters ( e.g. clusters-0) and process each input
vector against them to produce a posterior set of clusters (e.g. clusters-1) for the next
iteration. This will result in k-Means, fuzzyK and Dirichlet being collapsed into a ClusterIterator
iterating over a ClusterClassifier using a ClusteringPolicy. You can see these classes in
o.a.m.clustering. They are a work in progress, but in-memory, sequential-from-SequenceFiles
and k-means MR modes work in tests and can be demonstrated in the DisplayXX examples, which employ them.
>>> Paritosh has also been building a ClusterClassificationDriver (o.a.m.clustering.classify)
which we want to use to factor all of the redundant cluster-data implementations (-cl option)
out of the respective cluster drivers. This will affect Canopy in addition to the above algorithms.
>>> An imagined benefit of this refactoring comes from the fact that ClusterClassifier
extends AbstractVectorClassifier and implements OnlineLearner. We think this means that a
posterior set of trained Clusters can be used as a component classifier in a semi-supervised
classifier implementation. I suppose we will need to demonstrate this before we go too much
further in the refactoring but Ted, at least, seems to approve of this integration approach
between supervised classification and clustering (unsupervised classification). I don't think
it has had a lot of other eyeballs on it.
>>> I don't think LDA fits into this subset of clustering algorithms, and neither do
Canopy and MeanShift. As you note, it does not produce Clusters, but I'd be interested in your
reactions to the above.
>>> Jeff
>>> On 2/22/12 9:55 AM, Jake Mannix wrote:
>>>> So I haven't looked super-carefully at the clustering refactoring work, can
>>>> someone give a little overview of what
>>>> the plan is?
>>>> The NewLDA stuff is technically in "clustering" and generally works by
>>>> taking in SeqFile<IW,VW> documents as the training corpus, and spits out
>>>> two things: a SeqFile<IW,VW> of a "model" (keyed on topicId, one vector per
>>>> topic) and a SeqFile<IW,VW> of "classifications" (keyed on docId, a
>>>> vector over the topic space for projection onto each topic dimension).
>>>> This is similar to how SVD clustering/decomposition works, but with
>>>> L1-normed outputs instead of L2.
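The L1-vs-L2 distinction Jake draws can be made concrete with a small, purely illustrative sketch (names are invented): an LDA per-document topic vector is L1-normalized (entries sum to 1, a probability distribution over topics), while an SVD projection is naturally treated with the L2 norm (unit Euclidean length).

```java
// Illustrative only: L1 normalization (probability-style, as in LDA
// classifications) versus L2 normalization (unit length, as in SVD
// projections).
public class NormSketch {
    static double[] l1Normalize(double[] v) {
        double sum = 0;
        for (double x : v) sum += Math.abs(x);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / sum;  // entries sum to 1
        return out;
    }

    static double[] l2Normalize(double[] v) {
        double sq = 0;
        for (double x : v) sq += x * x;
        double norm = Math.sqrt(sq);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;  // unit Euclidean length
        return out;
    }

    public static void main(String[] args) {
        double[] raw = {3.0, 4.0};   // unnormalized topic weights
        System.out.println(java.util.Arrays.toString(l1Normalize(raw)));
        System.out.println(java.util.Arrays.toString(l2Normalize(raw)));
    }
}
```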
>>>> But this seems very different from all of the structures in the rest of
>>>> clustering.
>>>>   -jake
>>>> On Wed, Feb 22, 2012 at 7:56 AM, Jeff Eastman<>wrote:
>>>>> Hi Saikat,
>>>>> I agree with Paritosh, that a great place to begin would be to write
>>>>> unit tests. This will familiarize you with the code base and help us
a lot
>>>>> with our 0.7 housekeeping release. The new clustering classification
>>>>> components are going to unify many - but not all - of the existing
>>>>> clustering algorithms to reduce their complexity by factoring out
>>>>> duplication and streamlining their integration into semi-supervised
>>>>> classification engines.
>>>>> Please feel free to post any questions you may have in reading through
>>>>> this code. This is a major refactoring effort and we will need all the help
>>>>> we can get. Thanks for the offer,
>>>>> Jeff
>>>>> On 2/21/12 10:46 PM, Saikat Kanjilal wrote:
>>>>>> Hi Paritosh, Yes, creating the test case would be a great first start;
>>>>>> however, are there other tasks you guys need help with before I begin
>>>>>> the test creation? I will sync trunk and start reading through the
>>>>>> code in the meantime. Regards
>>>>>>  Date: Wed, 22 Feb 2012 10:57:51 +0530
>>>>>>> From:
>>>>>>> To:
>>>>>>> Subject: Re: Helping out with the .7 release
>>>>>>> We are creating clustering as classification components which
will help
>>>>>>> in moving clustering out. Once the component is ready, then the
>>>>>>> clustering algorithms would need refactoring.
>>>>>>> The clustering-as-classification component and the outlier removal
>>>>>>> component have been created.
>>>>>>> Most of it is committed, and the rest is available as a patch. See
>>>>>>> If you apply the latest patch available on MAHOUT-929, you can see
>>>>>>> all that is available now.
>>>>>>> If you want, you can help with the test case of
>>>>>>> ClusterClassificationMapper available in the patch.
>>>>>>> On 22-02-2012 10:27, Saikat Kanjilal wrote:
>>>>>>>> Hi Guys, I was interested in helping out with the clustering components
>>>>>>>> of Mahout. I looked through the JIRA items below and was wondering if there
>>>>>>>> is a specific one that would be good to start with:
>>>>>>>> jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+
>>>>>>>> resolution+%3D+Unresolved+AND+component+%3D+Clustering+
>>>>>>>> ORDER+BY+priority+DESC&mode=hide
>>>>>>>> I was initially thinking of working on MAHOUT-930 or MAHOUT-931
but could
>>>>>>>> work on others if needed.
>>>>>>>> Best Regards
