mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout > canopy-commandline
Date Fri, 18 Sep 2009 11:46:00 GMT
Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: canopy-commandline (http://cwiki.apache.org/confluence/display/MAHOUT/canopy-commandline)


Edited by Isabel Drost:
---------------------------------------------------------------------
h1. Introduction

This quick start page describes how to run the canopy clustering algorithm, either on a
single machine or on a Hadoop cluster.

h1. Steps

h2. Testing it on a single machine without a cluster

In the examples directory type:
{code}
mvn -q exec:java -Dexec.mainClass="org.apache.mahout.clustering.canopy.CanopyClusteringJob" \
  -Dexec.args="<OPTIONS>"
{code}
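As a rough sketch of what <OPTIONS> might look like, the helper below assembles the Maven command from the flags listed under "Command line options" and prints it rather than running it. The function name local_canopy_cmd and the threshold values 3.0 and 1.5 are hypothetical, chosen only for illustration. The input and output names are borrowed from the cluster example.

```shell
# Print the Maven command for a local run (dry run, nothing is executed).
# Arguments: input output t1 t2
# The function name and the sample values below are illustrative assumptions.
local_canopy_cmd() {
  options="--input $1 --output $2 --t1 $3 --t2 $4"
  printf '%s\n' "mvn -q exec:java -Dexec.mainClass=\"org.apache.mahout.clustering.canopy.CanopyClusteringJob\" -Dexec.args=\"$options\""
}

# Example call with placeholder values; run the printed command from the examples directory.
local_canopy_cmd testdata output 3.0 1.5
```

Printing the command first makes it easy to check the quoting of -Dexec.args before starting a run.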

h2. Running it on the cluster

* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout version number. For example, when using the Mahout 0.1 release, the job will be mahout-core-0.1.job.
* (Optional) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the job: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<MAHOUT VERSION>.job org.apache.mahout.clustering.canopy.CanopyClusteringJob <OPTIONS>
* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.
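The steps above can be sketched as a small dry-run script that prints each command in order instead of executing it. The function name print_cluster_steps, the paths /opt/hadoop, /opt/mahout and /tmp/vectors, and the threshold values are all placeholders, not values taken from this page. Substitute your own installation paths and options.

```shell
# Dry run: print the cluster steps instead of executing them.
# Arguments: hadoop_home mahout_home mahout_version data_path options
# The function name and every sample value below are illustrative assumptions.
print_cluster_steps() {
  hadoop="$1/bin/hadoop"
  job="$2/core/target/mahout-core-$3.job"
  echo "(cd $2 && mvn install)"            # build the job file
  echo "$hadoop fs -put $4 testdata"       # copy the input data into HDFS
  echo "$hadoop jar $job org.apache.mahout.clustering.canopy.CanopyClusteringJob $5"
  echo "$hadoop fs -lsr output"            # list everything the job wrote
}

print_cluster_steps /opt/hadoop /opt/mahout 0.1 /tmp/vectors \
  "--input testdata --output output --t1 3.0 --t2 1.5"
```

Once the printed commands look right, they can be run one at a time. The optional start-all.sh step is left out here because it only applies when Hadoop is not already running.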

h1. Command line options
{code}
org.apache.mahout.clustering.canopy.CanopyClusteringJob
  --input (-i) input                The Path for input Vectors. Must be a SequenceFile of Writable, Vector.
  --output (-o) output              The Path to put the output in.
  --distance (-m) distance          The Distance Measure to use. Default is SquaredEuclidean.
  --vectorClass (-v) vectorClass    The Vector implementation class name. Default is SparseVector.class
  --t1 (-t1) t1                     t1
  --t2 (-t2) t2                     t2
  --help (-h)                       Print out help

{code}

{code}
org.apache.mahout.clustering.canopy.ClusteringDriver

  --input (-i) input                The Path for input Vectors. Must be a SequenceFile of Writable, Vector.
  --output (-o) output              The Path to put the output in.
  --distance (-m) distance          The Distance Measure to use. Default is SquaredEuclidean.
  --vectorClass (-v) vectorClass    The Vector implementation class name. Default is SparseVector.class
  --t1 (-t1) t1                     t1
  --t2 (-t2) t2                     t2
  --help (-h)                       Print out help.

{code}
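The help text does not explain t1 and t2. In canopy clustering as it is usually described, T1 is the loose distance threshold and T2 the tight one, with T1 greater than T2. Assuming that convention applies here, a pre-flight check like the sketch below can catch swapped values before a job is submitted. The function name check_thresholds is hypothetical, and this page does not say whether the job performs any such validation itself.

```shell
# Return 0 when t1 > t2 > 0, otherwise non-zero.
# awk does the comparison because plain sh cannot compare decimals.
# The function name and the T1 > T2 requirement are assumptions, not Mahout behaviour.
check_thresholds() {
  awk -v t1="$1" -v t2="$2" 'BEGIN { exit !((t1 + 0) > (t2 + 0) && (t2 + 0) > 0) }'
}

# Example with placeholder values.
if check_thresholds 3.0 1.5; then
  echo "thresholds ok"
else
  echo "expected t1 > t2 > 0" >&2
fi
```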
