mahout-user mailing list archives

From "Levy, Mark" <m...@last.fm>
Subject RE: Some basic introductory questions
Date Thu, 17 Sep 2009 13:24:37 GMT
Hi Aleksander,

I've also been learning how to run Mahout's clustering and LDA on our
cluster.

For k-means, the following series of steps has worked for me:

* build mahout from trunk

* write a program to convert your data to Mahout Vectors.  You can base
this on one of the Drivers in the mahout.utils.vectors package (which
seem to be designed to run locally).  For bigger datasets you'll
probably need to write a simple MapReduce job, more like
mahout.clustering.syntheticcontrol.canopy.InputDriver.  Either way,
your Vectors need to end up on the DFS.

* run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
something like:
   hadoop jar mahout-core-0.2-SNAPSHOT.job \
       org.apache.mahout.clustering.kmeans.KMeansDriver \
       -i /dfs/input/data/dir \
       -c /dfs/initial/rand/centroids/dir \
       -o /dfs/output/dir \
       -k <numClusters> \
       -x <maxIters>

* possibly fix the problem described here:
http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
(the solution is at the bottom of the page)

* copy all the output files from the DFS to your local machine

* convert the output to text format with
org.apache.mahout.utils.clustering.ClusterDumper.  It might be nicer to
do this on the cluster, but the code seems to expect local files.  If,
in the conversion step, you set the name field of your input Vectors to
a suitable ID, then the final output can be a set of cluster centroids,
each followed by the list of Vector IDs in the corresponding cluster.
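
To make the roles of the initial centroids, <maxIters> and the Vector
IDs a bit more concrete, here is a minimal in-memory sketch of k-means
in plain Python.  It is only an illustration and not Mahout's
implementation: it assumes Euclidean distance, and the names kmeans,
assign, nearest and the toy data are all made up for this example.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(point, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def assign(named_vectors, centroids):
    """Group vector IDs by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for name, vec in named_vectors.items():
        clusters[nearest(vec, centroids)].append(name)
    return clusters


def kmeans(named_vectors, initial_centroids, max_iters):
    """named_vectors: dict of ID -> list of floats (the named input Vectors).
    initial_centroids: k starting points (analogous to the -c directory).
    max_iters: upper bound on the number of iterations (analogous to -x)."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(max_iters):
        clusters = assign(named_vectors, centroids)
        updated = []
        for members, old in zip(clusters, centroids):
            if not members:
                updated.append(old)  # leave an empty cluster's centroid alone
                continue
            columns = zip(*(named_vectors[m] for m in members))
            updated.append([sum(col) / len(members) for col in columns])
        if updated == centroids:  # nothing moved, so we have converged
            break
        centroids = updated
    return centroids, assign(named_vectors, centroids)


# Hypothetical toy data: four named 2-d vectors and two initial centroids.
vectors = {"a": [0.0, 0.0], "b": [0.0, 2.0], "c": [10.0, 10.0], "d": [10.0, 12.0]}
centroids, clusters = kmeans(vectors, [[0.0, 0.0], [10.0, 10.0]], max_iters=10)

# Print each centroid followed by the IDs of the vectors in its cluster.
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The final loop mirrors the kind of output described in the last step:
one centroid per cluster, followed by the IDs of its members.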

Hope this is useful.  

More importantly, if anything here is very wrong then please can a
mahout person correct me!  

Many thanks,

Mark

> -----Original Message-----
> From: Aleksander Stensby [mailto:aleksander.stensby@integrasco.com]
> Sent: 17 September 2009 12:32
> To: mahout-user@lucene.apache.org
> Subject: Re: Some basic introductory questions
> 
> Okay, thanks Isabel!
> That was what I thought, I just wanted to check if I had missed
> something
> important here:)
> 
> Cheers,
>  Aleksander
> 
> On Thu, Sep 17, 2009 at 11:23 AM, Isabel Drost <isabel@apache.org>
> wrote:
> 
> > On Thu, 17 Sep 2009 09:36:50 +0200
> > Aleksander Stensby <aleksander.stensby@integrasco.com> wrote:
> >
> > > Or do I have to use the KMeansDriver.runJob and read input from
> > > serialized vectors files?
> >
> > I'd say this is the recommended way currently, though we are open to
> > changes to the API that would make your life easier.
> >
> > At least during experimentation phase, serializing the processed
> > vectors to disk has the advantage of being able to rerun clustering
> > with varied parameters (number of clusters, distance measure or even
> > try out one of the other algorithms).
> >
> > Isabel
> >
> 
> 
> 
> --
> Aleksander M. Stensby
> Lead Software Developer and System Architect
> Integrasco A/S
> www.integrasco.com
> http://twitter.com/Integrasco
> http://facebook.com/Integrasco
> 
> Please consider the environment before printing all or any of this
> e-mail
