mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout > k-means-commandline
Date Fri, 04 Jun 2010 15:30:03 GMT
Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: k-means-commandline (http://cwiki.apache.org/confluence/display/MAHOUT/k-means-commandline)


Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Introduction

This quick start page describes how to run the k-Means clustering algorithm on a Hadoop cluster.


h1. Steps

Mahout's k-Means clustering is launched with the same command line invocation whether
you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The
mode is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If
both are set and point to an operating Hadoop cluster on the target machine, the invocation
runs k-Means on that cluster. If either environment variable is missing, the stand-alone
Hadoop configuration is used instead.

{code}
./bin/mahout kmeans <OPTIONS>
{code}
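
Before launching, it can help to check which mode the rule above would select. The following is a minimal sketch: mahout_run_mode is a hypothetical helper and not part of Mahout, and it only tests whether both variables are set, not whether they point to a working cluster.

```shell
# Hypothetical helper, not part of Mahout: report which mode the rule
# described above would select, based only on whether both variables are set.
mahout_run_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

mahout_run_mode
```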

* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated
in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example,
when using the Mahout 0.3 release, the job will be mahout-core-0.3.job
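
To confirm that the build produced the job file, something like the following can be used. This is a sketch that assumes the core/target layout and the mahout-core-<version>.job naming described above; find_mahout_job is a hypothetical helper name.

```shell
# Hypothetical helper: list the job file(s) produced by "mvn install".
# Takes the Mahout home directory as its argument, defaulting to $MAHOUT_HOME.
find_mahout_job() {
  mahout_home="${1:-${MAHOUT_HOME:-.}}"
  # Prints nothing and returns non-zero if no job file has been built yet.
  ls "$mahout_home"/core/target/mahout-core-*.job 2>/dev/null
}
```

For example, find_mahout_job "$MAHOUT_HOME" should print the path of the job file once the build has completed.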


h2. Testing it on a single machine without a cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job: 
{code}
./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 5 -ow -cd 1 -k 25
{code}

h2. Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job: 
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 5 -ow -cd 1 -k 25
{code}
* Get the data out of HDFS and have a look. Use $HADOOP_HOME/bin/hadoop fs -lsr output to list all outputs.
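
The last step can be sketched as a small function that lists the output directory and then copies it to the local file system with hadoop fs -get. The function name fetch_kmeans_output, the HADOOP_BIN override, and the local directory name kmeans-output are assumptions made for illustration.

```shell
# Hypothetical helper: list the job output in HDFS, then copy it locally.
# HADOOP_BIN is an assumed override; by default the launcher under
# $HADOOP_HOME/bin is used.
fetch_kmeans_output() {
  hadoop_bin="${HADOOP_BIN:-${HADOOP_HOME:-}/bin/hadoop}"
  hdfs_dir="${1:-output}"
  local_dir="${2:-kmeans-output}"
  # Only copy if the listing succeeds, i.e. the output directory exists.
  "$hadoop_bin" fs -lsr "$hdfs_dir" &&
    "$hadoop_bin" fs -get "$hdfs_dir" "$local_dir"
}
```

Calling fetch_kmeans_output output ./kmeans-output after the job finishes would leave a local copy of the output directory to inspect.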

h1. Command line options
{code}
  --input (-i) input                           Path to job input directory.     
                                               Must be a SequenceFile of        
                                               VectorWritable                   
  --clusters (-c) clusters                     The input centroids, as Vectors. 
                                               Must be a SequenceFile of        
                                               Writable, Cluster/Canopy.  If k  
                                               is also specified, then a random 
                                               set of vectors will be selected  
                                               and written out to this path     
                                               first                            
  --output (-o) output                         The directory pathname for       
                                               output.                          
  --distanceMeasure (-dm) distanceMeasure      The classname of the             
                                               DistanceMeasure. Default is      
                                               SquaredEuclidean                 
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.     
                                               Default is 0.5                   
  --maxIter (-x) maxIter                       The maximum number of            
                                               iterations.                      
  --maxRed (-r) maxRed                         The number of reduce tasks.      
                                               Defaults to 2                    
  --k (-k) k                                   The k in k-Means.  If specified, 
                                               then a random selection of k     
                                               Vectors will be chosen as the    
                                               Centroid and written to the      
                                               clusters input path.             
  --overwrite (-ow)                            If present, overwrite the output 
                                               directory before running job     
  --help (-h)                                  Print out help                   
  --clustering (-cl)                           If present, run clustering after 
                                               the iterations have taken place  
{code}
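
For readability in scripts, the example invocation used earlier on this page can also be written with the long option names from the listing above. The sketch below adds --clustering and only prints the command so it can be inspected first; kmeans_long_form is a hypothetical name, and the paths and values are the same illustrative ones as before.

```shell
# Sketch: the earlier example rewritten with long option names, plus
# --clustering. The function echoes the command instead of running it.
kmeans_long_form() {
  echo ./bin/mahout kmeans \
    --input testdata \
    --output output \
    --clusters clusters \
    --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
    --maxIter 5 \
    --convergenceDelta 1 \
    --k 25 \
    --overwrite \
    --clustering
}

kmeans_long_form
```

Removing the echo runs the command directly from $MAHOUT_HOME.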
