From conflue...@apache.org
Subject [CONF] Apache Mahout > fuzzy-k-means-commandline
Date Thu, 21 Jul 2011 13:24:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: fuzzy-k-means-commandline (https://cwiki.apache.org/confluence/display/MAHOUT/fuzzy-k-means-commandline)


Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Running Fuzzy k-Means Clustering from the Command Line
Mahout's Fuzzy k-Means clustering can be launched from the same command line invocation whether
you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The
difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If
both point to an operating Hadoop cluster on the target machine, the invocation will
run Fuzzy k-Means on that cluster. If either of the environment variables is missing, the stand-alone
Hadoop configuration will be used instead.

{code}
./bin/mahout fkmeans <OPTIONS>
{code}
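For example, the same invocation can be steered toward either mode by setting or clearing the two environment variables before calling the driver. This is a minimal sketch; the Hadoop path shown is an illustrative assumption, and <OPTIONS> stands in for the arguments described below.
{code}
# Stand-alone (local) execution: neither variable points at a cluster
unset HADOOP_HOME
unset HADOOP_CONF_DIR
./bin/mahout fkmeans <OPTIONS>

# Cluster execution: both variables point at a running Hadoop installation
export HADOOP_HOME=/usr/local/hadoop        # illustrative path
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout fkmeans <OPTIONS>
{code}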

* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated
in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example,
when using the Mahout 0.3 release, the job will be mahout-core-0.3.job.
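
A minimal sketch of the build step follows; the exact name of the generated job file depends on the Mahout version you are building.
{code}
cd $MAHOUT_HOME
mvn install
# Confirm the job file was produced; the version number in the name will vary
ls core/target/mahout-core-*.job
{code}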


h2. Testing it on a single machine without a cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job: 
{code}
./bin/mahout fkmeans -i testdata <OPTIONS>
{code}
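
As a more concrete illustration, a stand-alone run might look like the following. This is a sketch only; the clusters path, k, fuzziness factor m, and iteration count are assumptions to adjust for your data.
{code}
./bin/mahout fkmeans \
  -i testdata \
  -c clusters-0 \
  -o output \
  -k 10 \
  -m 2.0 \
  -x 10 \
  -ow \
  -cl
{code}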

h2. Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job: 
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout fkmeans -i testdata <OPTIONS>
{code}
* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to list everything under the output directory; a sketch of retrieving the results follows below.
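
A minimal sketch of listing and retrieving the results from HDFS; this assumes the job wrote to an output directory named output, as in the listing command above.
{code}
# List everything the job wrote under the output directory
$HADOOP_HOME/bin/hadoop fs -lsr output

# Copy the results out of HDFS for local inspection
$HADOOP_HOME/bin/hadoop fs -get output ./fkmeans-output
{code}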

h1. Command line options
{code}
  --input (-i) input                           Path to job input directory.     
                                               Must be a SequenceFile of        
                                               VectorWritable                   
  --clusters (-c) clusters                     The input centroids, as Vectors. 
                                               Must be a SequenceFile of        
                                               Writable, Cluster/Canopy.  If k  
                                               is also specified, then a random 
                                               set of vectors will be selected  
                                               and written out to this path     
                                               first                            
  --output (-o) output                         The directory pathname for       
                                               output.                          
  --distanceMeasure (-dm) distanceMeasure      The classname of the             
                                               DistanceMeasure. Default is      
                                               SquaredEuclidean                 
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.     
                                               Default is 0.5                   
  --maxIter (-x) maxIter                       The maximum number of            
                                               iterations.                      
  --k (-k) k                                   The k in k-Means.  If specified, 
                                               then a random selection of k     
                                               Vectors will be chosen as the    
                                               Centroid and written to the      
                                               clusters input path.             
  --m (-m) m                                   coefficient normalization        
                                               factor, must be greater than 1   
  --overwrite (-ow)                            If present, overwrite the output 
                                               directory before running job     
  --help (-h)                                  Print out help                   
  --numMap (-u) numMap                         The number of map tasks.         
                                               Defaults to 10                   
  --maxRed (-r) maxRed                         The number of reduce tasks.      
                                               Defaults to 2                    
  --emitMostLikely (-e) emitMostLikely         True if clustering should emit   
                                               the most likely point only,      
                                               false for threshold clustering.  
                                               Default is true                  
  --threshold (-t) threshold                   The pdf threshold used for       
                                               cluster determination. Default   
                                               is 0 
  --clustering (-cl)                           If present, run clustering after 
                                               the iterations have taken place  
                            
{code}
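
Putting the options together, a typical cluster invocation might look like the example below. This is a sketch only; the fully qualified distance measure class name and the numeric values are assumptions to adapt for your data and Mahout version.
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf

./bin/mahout fkmeans \
  -i testdata \
  -c clusters-0 \
  -o output \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -cd 0.5 \
  -x 20 \
  -k 25 \
  -m 1.5 \
  -ow \
  -cl \
  -e true
{code}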
