mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Top Down Clustering
Date Thu, 08 Dec 2011 05:08:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Top Down Clustering (https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering)


Edited by Paritosh Ranjan:
---------------------------------------------------------------------
h2. Top Down Clustering

Top Down Clustering is a type of hierarchical clustering. It first finds bigger clusters
and then performs fine-grained clustering within each of them, hence the name Top Down.

Any clustering algorithm can be used for the Top Level Clustering (finding bigger
clusters) and for the Bottom Level Clustering (fine-grained clustering on each of the top level
clusters). So all clustering algorithms available in Mahout, other than the MinHash Clustering
algorithm (which is a "Bottom Up" clustering algorithm), are suitable for Top
Down Clustering, at both the top level and the bottom level.

The top level clustering output needs to be post processed to identify all top level
clusters and to group vectors into their respective top level clusters, so that the bottom
level clustering can run on each of them.

The first step in top down clustering is to run any clustering algorithm
of your choice, preferably with parameters that produce bigger clusters.
This is the top level clustering.

Then the output of this clustering should be post processed to group the vectors into their
respective top level clusters. This can be done using *ClusterOutputPostProcessorDriver*.
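The grouping step can be pictured with a small in-memory sketch. This is not the Mahout implementation: the {{GroupByCluster}} class, the {{ClusteredPoint}} type and the {{groupByCluster}} method are hypothetical names used only for illustration, and plain {{double[]}} arrays stand in for Mahout vectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByCluster {

    /** Hypothetical stand-in for one (clusterId, vector) entry of the top level output. */
    public static final class ClusteredPoint {
        final int clusterId;
        final double[] vector;

        public ClusteredPoint(int clusterId, double[] vector) {
            this.clusterId = clusterId;
            this.vector = vector;
        }
    }

    /** Groups vectors by the id of the top level cluster they were assigned to. */
    public static Map<Integer, List<double[]>> groupByCluster(List<ClusteredPoint> points) {
        // TreeMap keeps the cluster ids in ascending order.
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (ClusteredPoint p : points) {
            groups.computeIfAbsent(p.clusterId, id -> new ArrayList<>()).add(p.vector);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<ClusteredPoint> points = List.of(
            new ClusteredPoint(0, new double[] {1.0, 1.1}),
            new ClusteredPoint(1, new double[] {9.0, 8.7}),
            new ClusteredPoint(0, new double[] {0.9, 1.3}));

        // Each group would become the input of one bottom level clustering run.
        groupByCluster(points).forEach((id, vectors) ->
            System.out.println("cluster " + id + " holds " + vectors.size() + " vector(s)"));
    }
}
```

Each entry of the resulting map corresponds to one top level cluster, which is the unit the bottom level clustering works on.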

h2. Design of implementation

When any clustering algorithm runs, it stores data in two directories under the output path:

*clusteredPoints*

*clusters-0-final*

The clusteredPoints directory contains information in the form of _(clusterId, vector)_.

The clusters-*-final directory (such as clusters-0-final above) holds the cluster centroids.

Now, to run clustering again on the clusters found, the vectors belonging to different clusters
need to be stored in separate directories. This can be done using the *ClusterOutputPostProcessorDriver*
as explained in the _Running_ section.

*ClusterOutputPostProcessorDriver* takes the clustering output path as its input and segregates
the vectors into separate clusters.

So, after post processing, if you check the output path provided to the *ClusterOutputPostProcessorDriver*,
you will find directories named after the cluster ids, i.e. 0, 1, 2, ..., 20, 21, 22, 23, 24, 25, ...

Each of these directories stores files containing the vectors of that particular cluster.
These directories can now be provided as input to the bottom level clustering algorithm,
one by one. The bottom level clustering algorithm then clusters each top level cluster
according to the algorithm used.
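Walking over the per-cluster directories could look roughly like the sketch below. It makes several assumptions: the post processed output is read from the local file system through {{java.nio.file}} rather than through Hadoop, every cluster directory has a purely numeric name, and {{ClusterDirectories}} and {{listClusterDirectories}} are hypothetical names, not part of Mahout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

public class ClusterDirectories {

    /** Returns the sub-directories whose names are numeric cluster ids, sorted by id. */
    public static List<Path> listClusterDirectories(Path postProcessedOutput) throws IOException {
        List<Path> dirs = new ArrayList<>();
        try (Stream<Path> children = Files.list(postProcessedOutput)) {
            children.filter(Files::isDirectory)
                    .filter(p -> p.getFileName().toString().matches("\\d+"))
                    .forEach(dirs::add);
        }
        // Sort numerically so that 2 comes before 10.
        dirs.sort(Comparator.comparingLong(p -> Long.parseLong(p.getFileName().toString())));
        return dirs;
    }

    public static void main(String[] args) throws IOException {
        // Temporary local layout standing in for the post processed output path.
        Path output = Files.createTempDirectory("postprocessed");
        for (String id : new String[] {"0", "1", "10", "2"}) {
            Files.createDirectory(output.resolve(id));
        }
        for (Path clusterDir : listClusterDirectories(output)) {
            // A real job would pass clusterDir to the bottom level clustering algorithm here.
            System.out.println("bottom level input: " + clusterDir.getFileName());
        }
    }
}
```

Sorting by numeric id is only a convenience for reproducible ordering; the bottom level runs are independent of each other.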

h2. Running


*ClusterOutputPostProcessorDriver* has a run method:

*run(Path input, Path output, boolean runSequential)*

The _input_ parameter is the output path that was provided to the clustering
algorithm, which is to be post processed. It is the path of the directory containing clusters-*-final
and clusteredPoints.

The _output_ parameter is the path where the post processed data will be stored.

The _runSequential_ parameter, if set to true, post processes the data
sequentially; otherwise MapReduce is used. Hint: if the clustering was run sequentially,
post process sequentially as well; if it was run with MapReduce, use MapReduce.


!Top Down Clustering.jpg|align=left,border=1!
