mahout-commits mailing list archives

Subject [CONF] Apache Mahout > Synthetic Control Data
Date Fri, 13 Aug 2010 17:47:01 GMT
Space: Apache Mahout
Page: Synthetic Control Data

Edited by Joe Prasanna Kumar:
h1. Introduction

The goal of this example is to demonstrate clustering of control charts that exhibit time
series behavior. [Control charts |] are tools used to determine
whether or not a manufacturing or business process is in a state of statistical control. Such
control charts are generated/simulated over a time interval and made available in the UCI
machine learning repository. The data is described [here |].
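The UCI synthetic control chart dataset holds 600 simulated charts of 60 time steps each, one whitespace-separated row per chart. A minimal sketch of how to sanity-check that shape after downloading; a tiny fabricated three-row stand-in is used here (the file name _synthetic_control.sample_ is an assumption) so the commands are runnable as written:

```shell
# Fabricated stand-in: 3 rows of 3 values, in place of the real 600 x 60 file.
printf '28.7 24.9 31.0\n24.8 27.9 25.1\n31.3 29.4 26.2\n' > synthetic_control.sample

# Row count = number of control charts (600 for the real UCI file).
wc -l < synthetic_control.sample

# Field count of the first row = time-series length (60 for the real UCI file).
awk 'NR==1 {print NF}' synthetic_control.sample
```

The same two commands, pointed at the real downloaded file, confirm the 600-by-60 layout before loading it into HDFS.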

h1. Steps

* Download the data at [].
* In $MAHOUT_HOME/, build the Job file
** The same job file is used for all examples, so this only needs to be done once
** mvn install
** The job will be generated in $MAHOUT_HOME/examples/target/ and its name will contain the
$MAHOUT_VERSION number. For example, with the Mahout 0.3 release, the job file is mahout-examples-0.3.job
* (Optional){footnote}This step should be skipped when using standalone Hadoop{footnote} Start
up Hadoop: $HADOOP_HOME/bin/
* Put the data: $HADOOP_HOME/bin/hadoop fs \-put <PATH TO DATA> testdata
* Run the Job: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
 org.apache.mahout.clustering.syntheticcontrol.kmeans.Job {footnote}Substitute whichever
clustering Job you want here: KMeans, Canopy, etc. See the subdirectories of $MAHOUT_HOME/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/.{footnote}
** For [canopy |Canopy Clustering]:  $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
** For [kmeans |K-Means Clustering]:  $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
** For [fuzzykmeans |Fuzzy K-Means]:  $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
** For [dirichlet |Dirichlet Process Clustering]: $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
** For [meanshift |Mean Shift Clustering]: $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
* Get the data out of HDFS{footnote}See [HDFS Shell |]{footnote}{footnote}The
output directory is cleared when a new run starts, so the results must be retrieved before
starting a new run{footnote} and have a look{footnote}Dirichlet also prints data to the console{footnote}
** All example jobs read their input from the _testdata_ directory and write their results to the _output_ directory
** Use _bin/hadoop fs \-lsr output_ to view all outputs. Copy them to your local machine
and run the ClusterDumper on them.
*** Sequence files containing the original points in Vector form are in _output/data_
*** Computed clusters are contained in _output/clusters-i_
*** All final clustered points are placed into _output/clusteredPoints_
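The steps above can be stitched together into one script. The install paths and version below are assumptions (adjust them to your environment); the hadoop invocations themselves are left commented out, so the runnable part only composes and prints the job path, showing how $MAHOUT_VERSION lands in the job file name:

```shell
# Assumed install locations and version -- substitute your own.
MAHOUT_HOME=/opt/mahout
HADOOP_HOME=/opt/hadoop
MAHOUT_VERSION=0.3

# The job file built by `mvn install`, as described in the steps above.
JOB=$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job
echo "$JOB"   # -> /opt/mahout/examples/target/mahout-examples-0.3.job

# Load the data, run k-means, then pull the results back out before the
# next run clears the output directory (uncomment on a machine with Hadoop):
# $HADOOP_HOME/bin/hadoop fs -put synthetic_control.data testdata
# $HADOOP_HOME/bin/hadoop jar "$JOB" org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
# $HADOOP_HOME/bin/hadoop fs -lsr output
# $HADOOP_HOME/bin/hadoop fs -copyToLocal output ./output
```

Swapping the Job class (canopy, fuzzykmeans, dirichlet, meanshift) changes only the class argument on the `hadoop jar` line; everything else stays the same.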

