mahout-commits mailing list archives

From build...@apache.org
Subject svn commit: r901427 - in /websites/staging/mahout/trunk/content: ./ users/clustering/clustering-of-synthetic-control-data.html
Date Thu, 13 Mar 2014 11:55:26 GMT
Author: buildbot
Date: Thu Mar 13 11:55:26 2014
New Revision: 901427

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/clustering/clustering-of-synthetic-control-data.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Mar 13 11:55:26 2014
@@ -1 +1 @@
-1576601
+1577123

Modified: websites/staging/mahout/trunk/content/users/clustering/clustering-of-synthetic-control-data.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/clustering-of-synthetic-control-data.html
(original)
+++ websites/staging/mahout/trunk/content/users/clustering/clustering-of-synthetic-control-data.html
Thu Mar 13 11:55:26 2014
@@ -202,116 +202,47 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="example-synthetic-control-data">Example: Synthetic control data</h1>
-<ul>
-<li><a href="#Clusteringofsyntheticcontroldata-Introduction">Introduction</a></li>
-<li><a href="#Clusteringofsyntheticcontroldata-Problemdescription">Problem description</a></li>
-<li><a href="#Clusteringofsyntheticcontroldata-Pre-Prep">Pre-Prep</a></li>
-<li><a href="#Clusteringofsyntheticcontroldata-PerformClustering">Perform Clustering</a></li>
-<li><a href="#Clusteringofsyntheticcontroldata-Read/AnalyzeOutput">Read / Analyze
Output</a></li>
-</ul>
-<p><a name="Clusteringofsyntheticcontroldata-Introduction"></a></p>
-<h1 id="introduction">Introduction</h1>
-<p>The example will demonstrate clustering of control charts which exhibits a
-time series. <a href="http://en.wikipedia.org/wiki/Control_chart">Control charts </a>
- are tools used to determine whether or not a manufacturing or business
-process is in a state of statistical control. Such control charts are
-generated / simulated over equal time interval and available for use in UCI
-machine learning database. The data is described <a href="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html">here</a>
-.</p>
-<p><a name="Clusteringofsyntheticcontroldata-Problemdescription"></a></p>
-<h1 id="problem-description">Problem description</h1>
-<p>A time series of control charts needs to be clustered into their close knit
-groups. The data set we use is synthetic and so resembles real world
-information in an anonymized format. It contains six different classes
-(Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward
-shift). With these trends occurring on the input data set, the Mahout
-clustering algorithm will cluster the data into their corresponding class
-buckets. At the end of this example, you'll get to learn how to perform
-clustering using Mahout.</p>
-<p><a name="Clusteringofsyntheticcontroldata-Pre-Prep"></a></p>
-<h1 id="pre-prep">Pre-Prep</h1>
-<p>Make sure you have the following covered before you work out the example.</p>
+    <h1 id="clustering-synthetic-control-data">Clustering synthetic control data</h1>
+<h2 id="introduction">Introduction</h2>
+<p>This example demonstrates clustering of time series data, specifically control
charts. <a href="http://en.wikipedia.org/wiki/Control_chart">Control charts</a>
are tools used to determine whether a manufacturing or business process is in a state of statistical
control. Such control charts are generated or simulated repeatedly at equal time intervals.
A <a href="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html">simulated
dataset</a> is available from the UCI Machine Learning Repository.</p>
+<p>The control chart time series need to be clustered into their close-knit groups.
The data set we use is synthetic and is meant to resemble real-world information in an anonymized
format. It contains six different classes: Normal, Cyclic, Increasing trend, Decreasing trend,
Upward shift, Downward shift. In this example we will use Mahout to cluster the data into
their corresponding class buckets.</p>
+<p><em>For the sake of simplicity, we won't use a cluster in this example, but
instead show you the commands to run the clustering examples locally with Hadoop</em>.</p>
+<h2 id="setup">Setup</h2>
+<p>We need to do some initial setup before we are able to run the example. </p>
 <ol>
-<li>Input data set. Download it <a href="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data">here
</a>.</li>
-<li>Sample input data:
-Input consists of 600 rows and 60 columns. The rows from  1 - 100 contains
-Normal data. Rows from 101 - 200 contains cyclic data and so on.. More info <a href="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html">here
</a>. Sample of how the data looks is like below.</li>
-</ol>
-<table>
-<tr><th> \_time </th><th> \_time+x </th><th> \_time+2x
</th><th> .. </th><th> \_time+60x </th></tr>
-<tr><td> 28.7812 </td><td> 34.4632 </td><td> 31.3381
</td><td> .. </td><td> 31.2834 </td></tr>
-<tr><td> 24.8923 </td><td> 25.741 </td><td> 27.5532 </td><td>
.. </td><td> 32.8217 </td></tr>
-<tr><td> 35.5351 </td><td> 41.7067 </td><td> 39.1705
</td><td> 48.3964 </td><td> .. </td><td> 38.6103 </td></tr>
-<tr><td> 24.2104 </td><td> 41.7679 </td><td> 45.2228
</td><td> 43.7762 </td><td> .. </td><td> 48.8175 </td></tr>
-</table>
-
-<p>..
-..</p>
-<ol>
-<li>Setup Hadoop</li>
-<li>Assuming that you have installed the latest compatible Hadoop, start
-the daemons using {code}$HADOOP_HOME/bin/start-all.sh {code} If you have
-issues starting Hadoop, please reference the <a href="http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html">Hadoop
quick start guide</a></li>
-<li>
-<p>Copy the input to HDFS using </p>
-<p>$HADOOP_HOME/bin/hadoop fs -mkdir testdata
-$HADOOP_HOME/bin/hadoop fs -put <PATH TO synthetic_control.data> testdata</p>
+<li>
+<p>Start out by downloading the dataset to be clustered from the UCI Machine Learning
Repository: <a href="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data">http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data</a>.</p>
 </li>
-</ol>
-<p>(HDFS input directory name should be testdata)</p>
-<ol>
-<li>Mahout Example job
-Mahout's mahout-examples-$MAHOUT_VERSION.job does the actual clustering
-task and so it needs to be created. This can be done as</li>
-<li>cd $MAHOUT_HOME</li>
 <li>
-<p>mvn clean install          // full build including all unit tests
-mvn clean install -DskipTests=true // fast build without running unit tests</p>
+<p>Download the <a href="/general/downloads.html">latest release of Mahout</a>.</p>
 </li>
-</ol>
-<p>You will see BUILD SUCCESSFUL once all the corresponding tasks are through.
-The job will be generated in $MAHOUT_HOME/examples/target/ and it's name
-will contain the $MAHOUT_VERSION number. For example, when using Mahout 0.4
-release, the job will be mahout-examples-0.4.job.jar
-This completes the pre-requisites to perform clustering process using
-Mahout.</p>
-<p><a name="Clusteringofsyntheticcontroldata-PerformClustering"></a></p>
-<h1 id="perform-clustering">Perform Clustering</h1>
-<p>With all the pre-work done, clustering the control data gets real simple.</p>
-<ol>
-<li>Depending on which clustering technique to use, you can invoke the
-corresponding job as below</li>
-<li>For <a href="canopy-clustering.html">canopy </a></li>
-<li>For <a href="K-Means%20Clustering">kmeans</a></li>
-<li>For <a href="fuzzy-k-means.html">fuzzykmeans </a></li>
-<li>For <a href="Dirichlet%20Process%20Clustering">dirichlet</a></li>
-<li>
-<p>For <a href="mean-shift-clustering.html">meanshift</a> respectively:</p>
-<p>$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.${clustering.type}.Job</p>
-</li>
-<li>
-<p>Get the data out of HDFS (see <a href="http://hadoop.apache.org/core/docs/current/hdfs_shell.html.html">HDFS
Shell</a>
-The output directory is cleared when a new run starts
-so the results must be retrieved before a new run{footnote} and have a
-look. All jobs run ClusterDump after clustering with output data
-sent to the console by following the below steps.</p>
+<li>
+<p>Unpack the release binary and switch to the <em>mahout-distribution-0.x</em>
folder</p>
+</li>
+<li>
+<p>Make sure that the <em>JAVA_HOME</em> environment variable points to
your local Java installation.</p>
+</li>
+<li>
+<p>Create a folder called <em>testdata</em> in the current directory and
copy the dataset into this folder.</p>
 </li>
 </ol>
-<p><a name="Clusteringofsyntheticcontroldata-Read/AnalyzeOutput"></a></p>
-<h1 id="read-analyze-output">Read / Analyze Output</h1>
-<p>In order to read/analyze the output, you can use <a href="cluster-dumper.html">clusterdump</a>
- utility provided by Mahout. If you want to just read the output, follow
-the below steps. </p>
-<ol>
-<li>Use <code>$HADOOP_HOME/bin/hadoop fs -lsr output</code> to view all
-outputs.</li>
-<li>Use <code>$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples</code>
to copy them all to your local machine and the output data points
-are in vector format. This creates an output folder inside examples
-directory.</li>
-<li>Computed clusters are contained in <em>output/clusters-i</em></li>
-<li>All result clustered points are placed into <em>output/clusteredPoints</em></li>
-</ol>
+<h2 id="clustering-examples">Clustering Examples</h2>
+<p>Depending on the clustering algorithm you want to run, the following commands can
be used:</p>
+<ul>
+<li>
+<p><a href="/users/clustering/canopy-clustering.html">Canopy Clustering</a></p>
+<p>bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job</p>
+</li>
+<li>
+<p><a href="/users/clustering/k-means-clustering.html">k-Means Clustering</a></p>
+<p>bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job</p>
+</li>
+<li>
+<p><a href="/users/clustering/fuzzy-k-means.html">Fuzzy k-Means Clustering</a></p>
+<p>bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job</p>
+</li>
+</ul>
+<p>The clustering output will be produced in the <em>output</em> directory.
The output data points are in vector format. In order to read/analyze the output, you can
use the <a href="/users/clustering/cluster-dumper.html">clusterdump</a> utility
provided by Mahout.</p>
    </div>
   </div>     
 </div> 
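
The removed page text describes the input as 600 rows of 60 values, with rows 1 - 100 holding Normal data, rows 101 - 200 holding Cyclic data "and so on". A minimal Python sketch of that layout follows, useful for checking clustering output against the known classes. It assumes the remaining classes follow in the order the page lists them and that values are whitespace-separated; the names `label_for_row` and `parse_row` are hypothetical helpers, not part of Mahout.

```python
# Class order and the 100-rows-per-class layout are taken from the page text
# (rows 1-100 Normal, 101-200 Cyclic, "and so on"); the order of the last
# four classes is an assumption based on the order in which the page lists them.
CLASSES = [
    "Normal",
    "Cyclic",
    "Increasing trend",
    "Decreasing trend",
    "Upward shift",
    "Downward shift",
]
ROWS_PER_CLASS = 100


def label_for_row(row_number):
    """Return the class name for a 1-based row number (1..600)."""
    if not 1 <= row_number <= ROWS_PER_CLASS * len(CLASSES):
        raise ValueError("row_number out of range: %d" % row_number)
    return CLASSES[(row_number - 1) // ROWS_PER_CLASS]


def parse_row(line):
    """Split one line of the data file into floats.

    Assumes whitespace-separated values, as in the sample rows on the page.
    """
    return [float(value) for value in line.split()]
```

For example, `label_for_row(101)` gives the class of the first Cyclic row, and `parse_row` can be applied to each line of the file copied into the testdata folder.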


