mahout-commits mailing list archives

From build...@apache.org
Subject svn commit: r944380 [13/24] - in /websites/staging/mahout/trunk/content: ./ developers/ general/ users/basics/ users/classification/ users/clustering/ users/dim-reduction/ users/mapreduce/ users/mapreduce/classification/ users/mapreduce/clustering/ use...
Date Thu, 19 Mar 2015 21:21:47 GMT
Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/canopy-clustering.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/mapreduce/clustering/canopy-clustering.html (added)
+++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/canopy-clustering.html Thu Mar 19 21:21:45 2015
@@ -0,0 +1,433 @@
+<!DOCTYPE html>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <title>Apache Mahout: Scalable machine learning and data mining</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+  <meta name="Distribution" content="Global">
+  <meta name="Robots" content="index,follow">
+  <meta name="keywords" content="apache, apache hadoop, apache lucene,
+        business data mining, cluster analysis,
+        collaborative filtering, data extraction, data filtering, data framework, data integration,
+        data matching, data mining, data mining algorithms, data mining analysis, data mining data,
+        data mining introduction, data mining software,
+        data mining techniques, data representation, data set, datamining,
+        feature extraction, fuzzy k means, genetic algorithm, hadoop,
+        hierarchical clustering, high dimensional, introduction to data mining, kmeans,
+        knowledge discovery, learning approach, learning approaches, learning methods,
+        learning techniques, lucene, machine learning, machine translation, mahout apache,
+        mahout taste, map reduce hadoop, mining data, mining methods, naive bayes,
+        natural language processing,
+        supervised, text mining, time series data, unsupervised, web data mining">
+  <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico">
+  <script type="text/javascript" src="/js/prototype.js"></script>
+  <script type="text/javascript" src="/js/effects.js"></script>
+  <script type="text/javascript" src="/js/search.js"></script>
+  <script type="text/javascript" src="/js/slides.js"></script>
+
+  <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
+  <link href="/css/bootstrap-responsive.css" rel="stylesheet">
+  <link rel="stylesheet" href="/css/global.css" type="text/css">
+
+  <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown -->
+  <script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    tex2jax: {
+      skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+    }
+  });
+  MathJax.Hub.Queue(function() {
+    var all = MathJax.Hub.getAllJax(), i;
+    for(i = 0; i < all.length; i += 1) {
+      all[i].SourceElement().parentNode.className += ' has-jax';
+    }
+  });
+  </script>
+  <script type="text/javascript">
+    var mathjax = document.createElement('script'); 
+    mathjax.type = 'text/javascript'; 
+    mathjax.async = true;
+
+    mathjax.src = ('https:' == document.location.protocol) ?
+        'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : 
+        'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
+	
+	  var s = document.getElementsByTagName('script')[0]; 
+    s.parentNode.insertBefore(mathjax, s);
+  </script>
+</head>
+
+<body id="home" data-twttr-rendered="true">
+  <div id="wrap">
+   <div id="header">
+    <div id="logo"><a href="/overview.html"></a></div>
+  <div id="search">
+    <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right">    
+      <input value="http://mahout.apache.org" name="sitesearch" type="hidden">
+      <input class="search-query" name="q" id="query" type="text">
+      <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" />
+    </form>
+  </div>
+
+    <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;">
+      <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <!-- <a class="brand" href="#">Apache Community Development Project</a> -->
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="/">Home</a></li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/general/downloads.html">Downloads</a>
+                  <li><a href="/general/who-we-are.html">Who we are</a>
+                  <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a>
+                  <li><a href="/general/release-notes.html">Release Notes</a> 
+                  <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li>
+                  <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a>
+                  <li><a href="/general/professional-support.html">Professional Support</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Resources</li>
+                  <li><a href="/general/reference-reading.html">Reference Reading</a>
+                  <li><a href="/general/faq.html">FAQ</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Legal</li>
+                  <li><a href="http://www.apache.org/licenses/">License</a></li>
+                  <li><a href="http://www.apache.org/security/">Security</a></li>
+                  <li><a href="/general/privacy-policy.html">Privacy Policy</a>
+                </ul>
+              </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/developers/developer-resources.html">Developer resources</a></li>
+                  <li><a href="/developers/version-control.html">Version control</a></li>
+                  <li><a href="/developers/buildingmahout.html">Build from source</a></li>
+                  <li><a href="/developers/issue-tracker.html">Issue tracker</a></li>
+                  <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Contributions</li>
+                  <li><a href="/developers/how-to-contribute.html">How to contribute</a></li>
+                  <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li>
+                  <li><a href="/developers/gsoc.html">GSoC</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">For committers</li>
+                  <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li>
+                  <li><a href="/developers/patch-check-list.html">Patch check list</a></li>
+                  <li><a href="/developers/github.html">Handling Github PRs</a></li>
+                  <li><a href="/developers/how-to-release.html">How to release</a></li>
+                  <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li>
+                </ul>
+               </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a>
+                 <ul class="dropdown-menu">
+                  <li><a href="/users/basics/algorithms.html">List of algorithms</a>
+                  <li><a href="/users/basics/quickstart.html">Quickstart</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Working with text</li>
+                  <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a>
+                  <li><a href="/users/basics/collocations.html">Collocations</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Dimensionality reduction</li>
+                  <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li>
+                  <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Topic Models</li>      
+                  <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li>
+                </ul>
+                 </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/users/sparkbindings/home.html">Scala &amp; Spark Bindings Overview</a></li>
+                  <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li>
+			      <li class="divider"></li>
+                  <li><a href="/users/sparkbindings/faq.html">FAQ</a></li>
+                </ul>
+               </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li>
+                  <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li>
+                  <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li>
+                  <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li>
+
+                  <li class="divider"></li>
+                  <li class="nav-header">Examples</li>
+                  <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li>
+                  <li><a href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li>
+                </ul></li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li>
+                <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li>
+                <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Commandline usage</li>
+                <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li>
+                <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Examples</li>
+                <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Post processing</li>
+                <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li>
+                <li><a href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li>
+                </ul></li>
+                <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li>
+                <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li>
+                <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li>
+		<li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li>
+                <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Hadoop</li>
+                <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li>
+                <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li>
+                <li class="nav-header">Spark</li>
+                <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li>
+              </ul>
+            </li>
+           </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+</div>
+
+ <div id="sidebar">
+  <div id="sidebar-wrap">
+    <h2>Twitter</h2>
+	<ul class="sidemenu">
+		<li>
+<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a>
+<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
+</li>
+	</ul>
+    <h2>Apache Software Foundation</h2>
+    <ul class="sidemenu">
+      <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li>
+      <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li>
+      <li><a href="http://www.apache.org/dev/">Developer Resources</a></li>
+      <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
+      <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+    </ul>
+    <h2>Related Projects</h2>
+    <ul class="sidemenu">
+      <li><a href="http://lucene.apache.org/">Lucene</a></li>
+      <li><a href="http://hadoop.apache.org/">Hadoop</a></li>
+    </ul>
+  </div>
+</div>
+
+  <div id="content-wrap" class="clearfix">
+   <div id="main">
+    <p><a name="CanopyClustering-CanopyClustering"></a></p>
+<h1 id="canopy-clustering">Canopy Clustering</h1>
+<p><a href="http://www.kamalnigam.com/papers/canopy-kdd00.pdf">Canopy Clustering</a>
+ is a very simple, fast and surprisingly accurate method for grouping
+objects into clusters. Each object is represented as a point in a
+multidimensional feature space. The algorithm uses a fast approximate
+distance metric and two distance thresholds T1 &gt; T2 for processing. The
+basic algorithm is to begin with a set of points and remove one at random.
+Create a Canopy containing this point and iterate through the remainder of
+the point set. For each point, if its distance from the first point is &lt; T1,
+then add the point to the Canopy. If, in addition, the distance is &lt; T2,
+then remove the point from the set. This way points that are very close to
+the original will avoid all further processing. The algorithm loops until
+the initial set is empty, accumulating a set of Canopies, each containing
+one or more points. A given point may occur in more than one Canopy.</p>
+<p>Canopy Clustering is often used as an initial step in more rigorous
+clustering techniques, such as <a href="k-means-clustering.html">K-Means Clustering</a>.
+By starting with an initial clustering, the number of more expensive
+distance measurements can be significantly reduced by ignoring points
+outside of the initial canopies.</p>
+<p><strong>WARNING</strong>: Canopy is deprecated in the latest release and will be removed once streaming k-means becomes stable enough.</p>
+<p><a name="CanopyClustering-Strategyforparallelization"></a></p>
+<h2 id="strategy-for-parallelization">Strategy for parallelization</h2>
+<p>Looking at the sample Hadoop implementation in <a href="http://code.google.com/p/canopy-clustering/">http://code.google.com/p/canopy-clustering/</a>,
+the processing is done in the following M/R steps:</p>
+<ol><li>The data is massaged into a suitable input format.</li>
+<li>Each mapper performs canopy clustering on the points in its input set and
+outputs its canopies' centers.</li>
+<li>The reducer clusters the canopy centers to produce the final canopy
+centers.</li>
+<li>The points are then clustered into these final canopies.</li></ol>
+<p>Some ideas can be found in the <a href="https://www.youtube.com/watch?v=yjPBkvYh-ss&amp;list=PLEFAB97242917704A">Cluster computing and MapReduce</a>
+ lecture video series [by Google(r)]; Canopy Clustering is discussed in
+<a href="https://www.youtube.com/watch?v=1ZDybXl212Q">lecture #4</a>.
+Finally, here is the <a href="http://en.wikipedia.org/wiki/Canopy_clustering_algorithm">Wikipedia page</a>.</p>
+<p><a name="CanopyClustering-Designofimplementation"></a></p>
+<h2 id="design-of-implementation">Design of implementation</h2>
+<p>The implementation accepts as input Hadoop SequenceFiles containing
+multidimensional points (VectorWritable). Points may be expressed either as
+dense or sparse Vectors and processing is done in two phases: Canopy
+generation and, optionally, Clustering.</p>
+<p><a name="CanopyClustering-Canopygenerationphase"></a></p>
+<h3 id="canopy-generation-phase">Canopy generation phase</h3>
+<p>During the map step, each mapper processes a subset of the total points and
+applies the chosen distance measure and thresholds to generate canopies. In
+the mapper, each point which is found to be within an existing canopy will
+be added to an internal list of Canopies. After observing all its input
+vectors, the mapper updates all of its Canopies and normalizes their totals
+to produce canopy centroids which are output, using a constant key
+("centroid") to a single reducer. The reducer receives all of the initial
+centroids and again applies the canopy measure and thresholds to produce a
+final set of canopy centroids which is output (i.e. clustering the cluster
+centroids). The reducer output format is: SequenceFile(Text, Canopy) with
+the <em>key</em> encoding the canopy identifier. </p>
+<p><a name="CanopyClustering-Clusteringphase"></a></p>
+<h3 id="clustering-phase">Clustering phase</h3>
+<p>During the clustering phase, each mapper reads the Canopies produced by the
+first phase. Since all mappers have the same canopy definitions, their
+outputs will be combined during the shuffle so that each reducer (many are
+allowed here) will see all of the points assigned to one or more canopies.
+The output format will then be: SequenceFile(IntWritable,
+WeightedVectorWritable) with the <em>key</em> encoding the canopyId. The
+WeightedVectorWritable has two fields: a double weight and a VectorWritable
+vector. Together they encode the probability that each vector is a member
+of the given canopy.</p>
+<p><a name="CanopyClustering-RunningCanopyClustering"></a></p>
+<h2 id="running-canopy-clustering">Running Canopy Clustering</h2>
+<p>The canopy clustering algorithm may be run using a command-line invocation
+on CanopyDriver.main or by making a Java call to CanopyDriver.run(...).
+Both require several arguments:</p>
+<p>Invocation using the command line takes the form:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">canopy</span> <span class="o">\</span>
+    <span class="o">-</span><span class="nb">i</span> <span class="o">&lt;</span><span class="n">input</span> <span class="n">vectors</span> <span class="n">directory</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">o</span> <span class="o">&lt;</span><span class="n">output</span> <span class="n">working</span> <span class="n">directory</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">dm</span> <span class="o">&lt;</span><span class="n">DistanceMeasure</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">t1</span> <span class="o">&lt;</span><span class="n">T1</span> <span class="n">threshold</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">t2</span> <span class="o">&lt;</span><span class="n">T2</span> <span class="n">threshold</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">t3</span> <span class="o">&lt;</span><span class="n">optional</span> <span class="n">reducer</span> <span class="n">T1</span> <span class="n">threshold</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">t4</span> <span class="o">&lt;</span><span class="n">optional</span> <span class="n">reducer</span> <span class="n">T2</span> <span class="n">threshold</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">cf</span> <span class="o">&lt;</span><span class="n">optional</span> <span class="n">cluster</span> <span class="n">filter</span> <span class="nb">size</span> <span class="p">(</span><span class="n">default</span><span class="p">:</span> 0<span class="p">)</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">ow</span> <span class="o">&lt;</span><span class="n">overwrite</span> <span class="n">output</span> <span class="n">directory</span> <span class="k">if</span> <span class="n">present</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">cl</span> <span class="o">&lt;</span><span class="n">run</span> <span class="n">input</span> <span class="n">vector</span> <span class="n">clustering</span> <span class="n">after</span> <span class="n">computing</span> <span class="n">Canopies</span><span class="o">&gt;</span> <span class="o">\</span>
+    <span class="o">-</span><span class="n">xm</span> <span class="o">&lt;</span><span class="n">execution</span> <span class="n">method</span><span class="p">:</span> <span class="n">sequential</span> <span class="n">or</span> <span class="n">mapreduce</span><span class="o">&gt;</span>
+</pre></div>
+
+
+<p>Invocation using Java involves supplying the following arguments:</p>
+<ol>
+<li>input: a file path string to a directory containing the input data set as a
+SequenceFile(WritableComparable, VectorWritable). The sequence file <em>key</em>
+is not used.</li>
+<li>output: a file path string to an empty directory which is used for all
+output from the algorithm.</li>
+<li>measure: the fully-qualified class name of an instance of DistanceMeasure
+which will be used for the clustering.</li>
+<li>t1: the T1 distance threshold used for clustering.</li>
+<li>t2: the T2 distance threshold used for clustering.</li>
+<li>t3: the optional T1 distance threshold used by the reducer for
+clustering. If not specified, T1 is used by the reducer.</li>
+<li>t4: the optional T2 distance threshold used by the reducer for
+clustering. If not specified, T2 is used by the reducer.</li>
+<li>clusterFilter: the minimum size for canopies to be output by the
+algorithm. Affects both sequential and mapreduce execution modes, and
+mapper and reducer outputs.</li>
+<li>runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.</li>
+<li>runSequential: a boolean indicating, if true, that the computation is to
+be run in memory using the reference Canopy implementation. Note that the
+sequential implementation performs a single pass through the input vectors
+whereas the MapReduce implementation performs two passes (once in the
+mapper and again in the reducer). The MapReduce implementation will
+typically produce fewer clusters than the sequential implementation as a
+result.</li>
+</ol>
+<p>After running the algorithm, the output directory will contain:</p>
+<ol><li>clusters-0: a directory containing SequenceFiles(Text, Canopy) produced
+by the algorithm. The Text <em>key</em> contains the cluster identifier of the
+Canopy.</li>
+<li>clusteredPoints: (if runClustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable <em>key</em> is
+the canopyId. The WeightedVectorWritable <em>value</em> is a bean containing a
+double <em>weight</em> and a VectorWritable <em>vector</em> where the weight indicates
+the probability that the vector is a member of the canopy. For canopy
+clustering, the weights are computed as 1/(1+distance) where the distance
+is between the cluster center and the vector using the chosen
+DistanceMeasure.</li></ol>
+<p><a name="CanopyClustering-Examples"></a></p>
+<h1 id="examples">Examples</h1>
+<p>The following images illustrate Canopy clustering applied to a set of
+randomly-generated 2-d data points. The points are generated using a normal
+distribution centered at a mean location and with a constant standard
+deviation. See the README file at <a href="https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt">/examples/src/main/java/org/apache/mahout/clustering/display/README.txt</a>
+ for details on running similar examples.</p>
+<p>The points are generated as follows:</p>
+<ul>
+<li>500 samples m=[1.0, 1.0]
+ sd=3.0</li>
+<li>300 samples m=[1.0, 0.0]
+ sd=0.5</li>
+<li>300 samples m=[0.0, 2.0]
+ sd=0.1</li>
+</ul>
+<p>In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed. </p>
+<p><img alt="sample data" src="../../images/SampleData.png" /></p>
+<p>In the second image, the resulting canopies are shown superimposed upon the
+sample data. Each canopy is represented by two circles, with radius T1 and
+radius T2.</p>
+<p><img alt="canopy" src="../../images/Canopy.png" /></p>
+<p>The third image uses the same values of T1 and T2 but only superimposes
+canopies covering more than 10% of the population. This is a somewhat better
+representation of the data but it still has lots of room for improvement.
+The advantage of Canopy clustering is that it is single-pass and fast
+enough to iterate runs using different T1, T2 parameters and display
+thresholds.</p>
+<p><img alt="canopy" src="../../images/Canopy10.png" /></p>
+   </div>
+  </div>     
+</div> 
+  <footer class="footer" align="center">
+    <div class="container">
+      <p>
+        Copyright &copy; 2014 The Apache Software Foundation, Licensed under
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </footer>
+  
+  <script src="/js/jquery-1.9.1.min.js"></script>
+  <script src="/js/bootstrap.min.js"></script>
+  <script>
+    (function() {
+      var cx = '012254517474945470291:vhsfv7eokdc';
+      var gcse = document.createElement('script');
+      gcse.type = 'text/javascript';
+      gcse.async = true;
+      gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
+          '//www.google.com/cse/cse.js?cx=' + cx;
+      var s = document.getElementsByTagName('script')[0];
+      s.parentNode.insertBefore(gcse, s);
+    })();
+  </script>
+</body>
+</html>
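
Note on canopy-clustering.html above: the basic loop that page describes (take a point, add every remaining point closer than T1 to its canopy, drop every point closer than T2 from the working set, repeat until the set is empty) can be sketched in plain Java as below. This is a minimal illustration only, not Mahout's CanopyDriver or its MapReduce job. The class name CanopySketch, the method names, the use of Euclidean distance, and taking the first remaining point instead of a random one (so runs are repeatable) are all assumptions made for this sketch. The weight helper follows the 1/(1+distance) formula quoted on the page, but measures from the seed point rather than a computed centroid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration class; not part of Mahout.
class CanopySketch {

    // Euclidean distance, standing in for the "fast approximate distance metric".
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Builds canopies with thresholds t1 > t2. Each canopy is a list of points
    // whose first element is the seed point that started the canopy.
    public static List<List<double[]>> buildCanopies(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("T1 must be greater than T2");
        }
        List<double[]> remaining = new LinkedList<>(points);
        List<List<double[]>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Assumption: take the first remaining point rather than a random one.
            double[] seed = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(seed);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(seed, p);
                if (d < t1) {
                    canopy.add(p);   // within the loose threshold: member of this canopy
                }
                if (d < t2) {
                    it.remove();     // within the tight threshold: no further processing
                }
            }
            canopies.add(canopy);
        }
        return canopies;
    }

    // Membership weight 1/(1+distance), measured here from the seed point.
    public static double weight(double[] center, double[] point) {
        return 1.0 / (1.0 + distance(center, point));
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0, 0}, new double[] {1, 0},
            new double[] {3, 0}, new double[] {10, 0});
        for (List<double[]> canopy : buildCanopies(points, 4.0, 2.0)) {
            System.out.println("canopy of size " + canopy.size());
        }
    }
}
```

In the small example in main, the point (3, 0) lies within T1 but not within T2 of the first seed, so it joins the first canopy and later seeds a canopy of its own. That is the "a given point may occur in more than one Canopy" behaviour the page mentions.
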

Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/canopy-commandline.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/mapreduce/clustering/canopy-commandline.html (added)
+++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/canopy-commandline.html Thu Mar 19 21:21:45 2015
@@ -0,0 +1,340 @@
+<!DOCTYPE html>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <title>Apache Mahout: Scalable machine learning and data mining</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+  <meta name="Distribution" content="Global">
+  <meta name="Robots" content="index,follow">
+  <meta name="keywords" content="apache, apache hadoop, apache lucene,
+        business data mining, cluster analysis,
+        collaborative filtering, data extraction, data filtering, data framework, data integration,
+        data matching, data mining, data mining algorithms, data mining analysis, data mining data,
+        data mining introduction, data mining software,
+        data mining techniques, data representation, data set, datamining,
+        feature extraction, fuzzy k means, genetic algorithm, hadoop,
+        hierarchical clustering, high dimensional, introduction to data mining, kmeans,
+        knowledge discovery, learning approach, learning approaches, learning methods,
+        learning techniques, lucene, machine learning, machine translation, mahout apache,
+        mahout taste, map reduce hadoop, mining data, mining methods, naive bayes,
+        natural language processing,
+        supervised, text mining, time series data, unsupervised, web data mining">
+  <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico">
+  <script type="text/javascript" src="/js/prototype.js"></script>
+  <script type="text/javascript" src="/js/effects.js"></script>
+  <script type="text/javascript" src="/js/search.js"></script>
+  <script type="text/javascript" src="/js/slides.js"></script>
+
+  <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
+  <link href="/css/bootstrap-responsive.css" rel="stylesheet">
+  <link rel="stylesheet" href="/css/global.css" type="text/css">
+
+  <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown -->
+  <script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    tex2jax: {
+      skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+    }
+  });
+  MathJax.Hub.Queue(function() {
+    var all = MathJax.Hub.getAllJax(), i;
+    for(i = 0; i < all.length; i += 1) {
+      all[i].SourceElement().parentNode.className += ' has-jax';
+    }
+  });
+  </script>
+  <script type="text/javascript">
+    var mathjax = document.createElement('script'); 
+    mathjax.type = 'text/javascript'; 
+    mathjax.async = true;
+
+    mathjax.src = ('https:' == document.location.protocol) ?
+        'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : 
+        'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
+	
+	  var s = document.getElementsByTagName('script')[0]; 
+    s.parentNode.insertBefore(mathjax, s);
+  </script>
+</head>
+
+<body id="home" data-twttr-rendered="true">
+  <div id="wrap">
+   <div id="header">
+    <div id="logo"><a href="/overview.html"></a></div>
+  <div id="search">
+    <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right">    
+      <input value="http://mahout.apache.org" name="sitesearch" type="hidden">
+      <input class="search-query" name="q" id="query" type="text">
+      <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" />
+    </form>
+  </div>
+
+    <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;">
+      <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <!-- <a class="brand" href="#">Apache Community Development Project</a> -->
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="/">Home</a></li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/general/downloads.html">Downloads</a>
+                  <li><a href="/general/who-we-are.html">Who we are</a>
+                  <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a>
+                  <li><a href="/general/release-notes.html">Release Notes</a> 
+                  <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li>
+                  <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a>
+                  <li><a href="/general/professional-support.html">Professional Support</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Resources</li>
+                  <li><a href="/general/reference-reading.html">Reference Reading</a>
+                  <li><a href="/general/faq.html">FAQ</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Legal</li>
+                  <li><a href="http://www.apache.org/licenses/">License</a></li>
+                  <li><a href="http://www.apache.org/security/">Security</a></li>
+                  <li><a href="/general/privacy-policy.html">Privacy Policy</a>
+                </ul>
+              </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/developers/developer-resources.html">Developer resources</a></li>
+                  <li><a href="/developers/version-control.html">Version control</a></li>
+                  <li><a href="/developers/buildingmahout.html">Build from source</a></li>
+                  <li><a href="/developers/issue-tracker.html">Issue tracker</a></li>
+                  <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Contributions</li>
+                  <li><a href="/developers/how-to-contribute.html">How to contribute</a></li>
+                  <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li>
+                  <li><a href="/developers/gsoc.html">GSoC</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">For committers</li>
+                  <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li>
+                  <li><a href="/developers/patch-check-list.html">Patch check list</a></li>
+                  <li><a href="/developers/github.html">Handling Github PRs</a></li>
+                  <li><a href="/developers/how-to-release.html">How to release</a></li>
+                  <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li>
+                </ul>
+               </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a>
+                 <ul class="dropdown-menu">
+                  <li><a href="/users/basics/algorithms.html">List of algorithms</a>
+                  <li><a href="/users/basics/quickstart.html">Quickstart</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Working with text</li>
+                  <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a>
+                  <li><a href="/users/basics/collocations.html">Collocations</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Dimensionality reduction</li>
+                  <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li>
+                  <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Topic Models</li>      
+                  <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li>
+                </ul>
+                 </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/users/sparkbindings/home.html">Scala &amp; Spark Bindings Overview</a></li>
+                  <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li>
+			      <li class="divider"></li>
+                  <li><a href="/users/sparkbindings/faq.html">FAQ</a></li>
+                </ul>
+               </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li>
+                  <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li>
+                  <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li>
+                  <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li>
+
+                  <li class="divider"></li>
+                  <li class="nav-header">Examples</li>
+                  <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li>
+                  <li><a href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li>
+                </ul></li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li>
+                <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li>
+                <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Commandline usage</li>
+                <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li>
+                <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Examples</li>
+                <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Post processing</li>
+                <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li>
+                <li><a href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li>
+                </ul></li>
+                <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li>
+                <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li>
+                <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li>
+		<li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li>
+                <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Hadoop</li>
+                <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li>
+                <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li>
+                <li class="nav-header">Spark</li>
+                <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li>
+              </ul>
+            </li>
+           </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+</div>
+
+ <div id="sidebar">
+  <div id="sidebar-wrap">
+    <h2>Twitter</h2>
+	<ul class="sidemenu">
+		<li>
+<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a>
+<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
+</li>
+	</ul>
+    <h2>Apache Software Foundation</h2>
+    <ul class="sidemenu">
+      <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li>
+      <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li>
+      <li><a href="http://www.apache.org/dev/">Developer Resources</a></li>
+      <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
+      <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+    </ul>
+    <h2>Related Projects</h2>
+    <ul class="sidemenu">
+      <li><a href="http://lucene.apache.org/">Lucene</a></li>
+      <li><a href="http://hadoop.apache.org/">Hadoop</a></li>
+    </ul>
+  </div>
+</div>
+
+  <div id="content-wrap" class="clearfix">
+   <div id="main">
+    <p><a name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a></p>
+<h1 id="running-canopy-clustering-from-the-command-line">Running Canopy Clustering from the Command Line</h1>
+<p>Mahout's Canopy clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both point to
+an operating Hadoop cluster on the target machine, the invocation will
+run Canopy on that cluster. If either environment variable is
+missing, the stand-alone Hadoop configuration will be invoked instead.</p>
+<div class="codehilite"><pre><span class="o">./</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">canopy</span> <span class="o">&lt;</span><span class="n">OPTIONS</span><span class="o">&gt;</span>
+</pre></div>
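+<p>As a rough illustration of the rule described above, the following shell
+sketch reports which mode an invocation would use. The function name
+mahout_mode is hypothetical and not part of Mahout, and the check only tests
+whether both variables are non-empty; the real launcher script may apply
+additional checks.</p>

```shell
# Hypothetical helper, not part of Mahout: mirrors the rule stated in the
# text, using only whether both variables are set to a non-empty value.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

# Report the mode for the current environment.
mahout_mode
```

+<p>Running the function before launching a job is a quick way to confirm
+that the environment matches the mode you intend to use.</p>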
+
+
+<ul>
+<li>In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.</li>
+</ul>
+<p><a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a></p>
+<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on a single machine without a cluster</h2>
+<ul>
+<li>Put the data: cp &lt;PATH TO DATA&gt; testdata</li>
+<li>
+<p>Run the Job: </p>
+<p>./bin/mahout canopy -i testdata -o output -dm
+org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2</p>
+</li>
+</ul>
+<p><a name="canopy-commandline-Runningitonthecluster"></a></p>
+<h2 id="running-it-on-the-cluster">Running it on the cluster</h2>
+<ul>
+<li>(As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh</li>
+<li>Put the data: $HADOOP_HOME/bin/hadoop fs -put &lt;PATH TO DATA&gt; testdata</li>
+<li>
+<p>Run the Job: </p>
+<p>export HADOOP_HOME=&lt;Hadoop Home Directory&gt;<br/>
+export HADOOP_CONF_DIR=$HADOOP_HOME/conf<br/>
+./bin/mahout canopy -i testdata -o output -dm
+org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2</p>
+</li>
+<li>
+<p>Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.</p>
+</li>
+</ul>
+<p><a name="canopy-commandline-Commandlineoptions"></a></p>
+<h1 id="command-line-options">Command line options</h1>
+<div class="codehilite"><pre>  <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span>                 <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span> <span class="n">Must</span>  
+                         <span class="n">be</span> <span class="n">a</span> <span class="n">SequenceFile</span> <span class="n">of</span>       
+                         <span class="n">VectorWritable</span>         
+  <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span>               <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span> <span class="n">output</span><span class="p">.</span> 
+  <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span>              <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span> <span class="n">output</span>    
+                         <span class="n">directory</span> <span class="n">before</span> <span class="n">running</span> <span class="n">job</span>   
+  <span class="o">--</span><span class="n">distanceMeasure</span> <span class="p">(</span><span class="o">-</span><span class="n">dm</span><span class="p">)</span> <span class="n">distanceMeasure</span>    <span class="n">The</span> <span class="n">classname</span> <span class="n">of</span> <span class="n">the</span>       
+                         <span class="n">DistanceMeasure</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span>    
+                         <span class="n">SquaredEuclidean</span>           
+  <span class="o">--</span><span class="n">t1</span> <span class="p">(</span><span class="o">-</span><span class="n">t1</span><span class="p">)</span> <span class="n">t1</span>                  <span class="n">T1</span> <span class="n">threshold</span> <span class="n">value</span>         
+  <span class="o">--</span><span class="n">t2</span> <span class="p">(</span><span class="o">-</span><span class="n">t2</span><span class="p">)</span> <span class="n">t2</span>                  <span class="n">T2</span> <span class="n">threshold</span> <span class="n">value</span>         
+  <span class="o">--</span><span class="n">clustering</span> <span class="p">(</span><span class="o">-</span><span class="n">cl</span><span class="p">)</span>                 <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">run</span> <span class="n">clustering</span> <span class="n">after</span>   
+                         <span class="n">the</span> <span class="n">iterations</span> <span class="n">have</span> <span class="n">taken</span> <span class="n">place</span>     
+  <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span>                    <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span>
+</pre></div>
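+<p>The short flags used in the run examples map onto the long option names
+listed above. The sketch below assumes the same testdata and output paths and
+the same threshold values; canopy_cmd is a hypothetical helper that only
+prints the command line, so it can be inspected before being run against a
+real Mahout installation.</p>

```shell
# Hypothetical helper, not part of Mahout: prints a canopy invocation that
# uses the long option names from the table above.
# Arguments: input path, output path, T1 threshold, T2 threshold.
canopy_cmd() {
  echo ./bin/mahout canopy \
    --input "$1" --output "$2" \
    --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
    --overwrite --t1 "$3" --t2 "$4"
}

# Same settings as the short-flag examples: -t1 5 -t2 2.
canopy_cmd testdata output 5 2
```

+<p>Printing the command rather than executing it keeps the sketch usable on
+a machine without Hadoop or Mahout installed.</p>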
+   </div>
+  </div>     
+</div> 
+  <footer class="footer" align="center">
+    <div class="container">
+      <p>
+        Copyright &copy; 2014 The Apache Software Foundation, Licensed under
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </footer>
+  
+  <script src="/js/jquery-1.9.1.min.js"></script>
+  <script src="/js/bootstrap.min.js"></script>
+  <script>
+    (function() {
+      var cx = '012254517474945470291:vhsfv7eokdc';
+      var gcse = document.createElement('script');
+      gcse.type = 'text/javascript';
+      gcse.async = true;
+      gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
+          '//www.google.com/cse/cse.js?cx=' + cx;
+      var s = document.getElementsByTagName('script')[0];
+      s.parentNode.insertBefore(gcse, s);
+    })();
+  </script>
+</body>
+</html>

Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/cluster-dumper.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/mapreduce/clustering/cluster-dumper.html (added)
+++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/cluster-dumper.html Thu Mar 19 21:21:45 2015
@@ -0,0 +1,345 @@
+<!DOCTYPE html>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <title>Apache Mahout: Scalable machine learning and data mining</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+  <meta name="Distribution" content="Global">
+  <meta name="Robots" content="index,follow">
+  <meta name="keywords" content="apache, apache hadoop, apache lucene,
+        business data mining, cluster analysis,
+        collaborative filtering, data extraction, data filtering, data framework, data integration,
+        data matching, data mining, data mining algorithms, data mining analysis, data mining data,
+        data mining introduction, data mining software,
+        data mining techniques, data representation, data set, datamining,
+        feature extraction, fuzzy k means, genetic algorithm, hadoop,
+        hierarchical clustering, high dimensional, introduction to data mining, kmeans,
+        knowledge discovery, learning approach, learning approaches, learning methods,
+        learning techniques, lucene, machine learning, machine translation, mahout apache,
+        mahout taste, map reduce hadoop, mining data, mining methods, naive bayes,
+        natural language processing,
+        supervised, text mining, time series data, unsupervised, web data mining">
+  <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico">
+  <script type="text/javascript" src="/js/prototype.js"></script>
+  <script type="text/javascript" src="/js/effects.js"></script>
+  <script type="text/javascript" src="/js/search.js"></script>
+  <script type="text/javascript" src="/js/slides.js"></script>
+
+  <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
+  <link href="/css/bootstrap-responsive.css" rel="stylesheet">
+  <link rel="stylesheet" href="/css/global.css" type="text/css">
+
+  <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown -->
+  <script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    tex2jax: {
+      skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+    }
+  });
+  MathJax.Hub.Queue(function() {
+    var all = MathJax.Hub.getAllJax(), i;
+    for(i = 0; i < all.length; i += 1) {
+      all[i].SourceElement().parentNode.className += ' has-jax';
+    }
+  });
+  </script>
+  <script type="text/javascript">
+    var mathjax = document.createElement('script'); 
+    mathjax.type = 'text/javascript'; 
+    mathjax.async = true;
+
+    mathjax.src = ('https:' == document.location.protocol) ?
+        'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : 
+        'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
+	
+	  var s = document.getElementsByTagName('script')[0]; 
+    s.parentNode.insertBefore(mathjax, s);
+  </script>
+</head>
+
+<body id="home" data-twttr-rendered="true">
+  <div id="wrap">
+   <div id="header">
+    <div id="logo"><a href="/overview.html"></a></div>
+  <div id="search">
+    <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right">    
+      <input value="http://mahout.apache.org" name="sitesearch" type="hidden">
+      <input class="search-query" name="q" id="query" type="text">
+      <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" />
+    </form>
+  </div>
+
+    <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;">
+      <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <!-- <a class="brand" href="#">Apache Community Development Project</a> -->
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="/">Home</a></li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/general/downloads.html">Downloads</a>
+                  <li><a href="/general/who-we-are.html">Who we are</a>
+                  <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a>
+                  <li><a href="/general/release-notes.html">Release Notes</a> 
+                  <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li>
+                  <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a>
+                  <li><a href="/general/professional-support.html">Professional Support</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Resources</li>
+                  <li><a href="/general/reference-reading.html">Reference Reading</a>
+                  <li><a href="/general/faq.html">FAQ</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Legal</li>
+                  <li><a href="http://www.apache.org/licenses/">License</a></li>
+                  <li><a href="http://www.apache.org/security/">Security</a></li>
+                  <li><a href="/general/privacy-policy.html">Privacy Policy</a>
+                </ul>
+              </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/developers/developer-resources.html">Developer resources</a></li>
+                  <li><a href="/developers/version-control.html">Version control</a></li>
+                  <li><a href="/developers/buildingmahout.html">Build from source</a></li>
+                  <li><a href="/developers/issue-tracker.html">Issue tracker</a></li>
+                  <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Contributions</li>
+                  <li><a href="/developers/how-to-contribute.html">How to contribute</a></li>
+                  <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li>
+                  <li><a href="/developers/gsoc.html">GSoC</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">For committers</li>
+                  <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li>
+                  <li><a href="/developers/patch-check-list.html">Patch check list</a></li>
+                  <li><a href="/developers/github.html">Handling Github PRs</a></li>
+                  <li><a href="/developers/how-to-release.html">How to release</a></li>
+                  <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li>
+                </ul>
+               </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a>
+                 <ul class="dropdown-menu">
+                  <li><a href="/users/basics/algorithms.html">List of algorithms</a>
+                  <li><a href="/users/basics/quickstart.html">Quickstart</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Working with text</li>
+                  <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a>
+                  <li><a href="/users/basics/collocations.html">Collocations</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Dimensionality reduction</li>
+                  <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li>
+                  <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Topic Models</li>      
+                  <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li>
+                </ul>
+                 </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/users/sparkbindings/home.html">Scala &amp; Spark Bindings Overview</a></li>
+                  <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li>
+			      <li class="divider"></li>
+                  <li><a href="/users/sparkbindings/faq.html">FAQ</a></li>
+                </ul>
+               </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li>
+                  <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li>
+                  <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li>
+                  <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li>
+
+                  <li class="divider"></li>
+                  <li class="nav-header">Examples</li>
+                  <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li>
+                  <li><a href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li>
+                </ul></li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li>
+                <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li>
+                <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Commandline usage</li>
+                <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li>
+                <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li>
+                <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Examples</li>
+                <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Post processing</li>
+                <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li>
+                <li><a href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li>
+                </ul></li>
+                <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li>
+                <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li>
+                <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li>
+		<li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li>
+                <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Hadoop</li>
+                <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li>
+                <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li>
+                <li class="nav-header">Spark</li>
+                <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li>
+              </ul>
+            </li>
+           </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+</div>
+
+ <div id="sidebar">
+  <div id="sidebar-wrap">
+    <h2>Twitter</h2>
+	<ul class="sidemenu">
+		<li>
+<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a>
+<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
+</li>
+	</ul>
+    <h2>Apache Software Foundation</h2>
+    <ul class="sidemenu">
+      <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li>
+      <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li>
+      <li><a href="http://www.apache.org/dev/">Developer Resources</a></li>
+      <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
+      <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+    </ul>
+    <h2>Related Projects</h2>
+    <ul class="sidemenu">
+      <li><a href="http://lucene.apache.org/">Lucene</a></li>
+      <li><a href="http://hadoop.apache.org/">Hadoop</a></li>
+    </ul>
+  </div>
+</div>
+
+  <div id="content-wrap" class="clearfix">
+   <div id="main">
+    <p><a name="ClusterDumper-Introduction"></a></p>
+<h1 id="cluster-dumper-introduction">Cluster Dumper - Introduction</h1>
+<p>Clustering tasks in Mahout write their output as a SequenceFile
+(Text, Cluster), where the Text key is a cluster identifier string. To analyze
+this output, the sequence files must be converted to a human-readable
+format, which is what the clusterdump utility does.</p>
+<p><a name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a></p>
+<h1 id="steps-for-analyzing-cluster-output-using-clusterdump-utility">Steps for analyzing cluster output using clusterdump utility</h1>
+<p>After you've executed a clustering task (either an example or a real-world job),
+you can run clusterdumper in two modes:</p>
+<ol>
+<li><a href="#hadoop-environment">Hadoop Environment</a></li>
+<li><a href="#standalone-java-program-anchorstandalonejavaprogram">Standalone Java Program</a></li>
+</ol>
+<p><a name="ClusterDumper-HadoopEnvironment{anchor:HadoopEnvironment}"></a></p>
+<h3 id="hadoop-environment">Hadoop Environment</h3>
+<p>If you have set up your HADOOP_HOME environment variable, you can use the
+command line utility "mahout" to execute the ClusterDumper on Hadoop. In
+this case we won't need to copy the output clusters to our local machine.
+The utility will read the output clusters present in HDFS and write the
+human-readable cluster values to our local file system. Say you've just
+executed the <a href="clustering-of-synthetic-control-data.html">synthetic control example</a>
+ and want to analyze the output; you can then execute the clusterdump utility against it.</p>
+<h3 id="standalone-java-program-anchorstandalonejavaprogram">Standalone Java Program</h3>
+<p>ClusterDumper can also be run from the command line. If your HADOOP_HOME
+environment variable is not set, you can still execute ClusterDumper using the
+"mahout" command line utility.</p>
+<p>Get the output data from Hadoop onto your local machine. For example, if
+you've executed a clustering example, copy its output directory from HDFS into $MAHOUT_HOME/examples.</p>
+<p>This will create a folder called output inside your $MAHOUT_HOME/examples,
+with sub-folders for each cluster output and for clusteredPoints.</p>
+<p>Run the clusterdump utility as a standalone Java program through Eclipse. If you are using Eclipse, set up mahout-utils as a project as specified in <a href="../developers/buildingmahout.html">Working with Maven in Eclipse</a>.
+    To execute ClusterDumper.java:</p>
+<ul>
+<li>Under mahout-utils, right-click on ClusterDumper.java</li>
+<li>Choose Run As, Run Configurations</li>
+<li>On the left menu, click on Java Application</li>
+<li>On the top bar, click on "New Launch Configuration"</li>
+<li>
+<p>A new launch configuration should be created automatically, with the project set to
+"mahout-utils" and the Main Class set to "org.apache.mahout.utils.clustering.ClusterDumper"</p>
+</li>
+</ul>
+<p>In the Arguments tab, specify the following arguments:</p>
+<div class="codehilite"><pre><span class="o">--</span><span class="n">seqFileDir</span> <span class="o">&lt;</span><span class="n">MAHOUT_HOME</span><span class="o">&gt;/</span><span class="n">examples</span><span class="o">/</span><span class="n">output</span><span class="o">/</span><span class="n">clusters</span><span class="o">-</span>10 
+<span class="o">--</span><span class="n">pointsDir</span> <span class="o">&lt;</span><span class="n">MAHOUT_HOME</span><span class="o">&gt;/</span><span class="n">examples</span><span class="o">/</span><span class="n">output</span><span class="o">/</span><span class="n">clusteredPoints</span> 
+<span class="o">--</span><span class="n">output</span> <span class="o">&lt;</span><span class="n">MAHOUT_HOME</span><span class="o">&gt;/</span><span class="n">examples</span><span class="o">/</span><span class="n">output</span><span class="o">/</span><span class="n">clusteranalyze</span><span class="p">.</span><span class="n">txt</span>
+</pre></div>
+<p>Replace &lt;MAHOUT_HOME&gt; with the actual path of your $MAHOUT_HOME.</p>
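+<p>As a rough illustration of the placeholder substitution, the arguments listed above could be generated from a concrete Mahout directory. This is only a sketch: the helper name <code>clusterdump_args</code> and the example path are hypothetical, while the three flags and the directory layout are copied from the listing above.</p>

```python
def clusterdump_args(mahout_home, clusters_dir="clusters-10"):
    """Build the ClusterDumper argument list for a given Mahout directory.

    Hypothetical helper: the flags and the examples/output layout are taken
    from the argument listing in this page, nothing else is assumed.
    """
    output_root = mahout_home.rstrip("/") + "/examples/output"
    return [
        "--seqFileDir", output_root + "/" + clusters_dir,
        "--pointsDir", output_root + "/clusteredPoints",
        "--output", output_root + "/clusteranalyze.txt",
    ]


if __name__ == "__main__":
    # "/opt/mahout" is an assumed example location, not a required path.
    print(" ".join(clusterdump_args("/opt/mahout")))
```

+<p>The printed line can be pasted into the Arguments tab of the launch configuration.</p>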
+
+
+<ul>
+<li>Hit Run to execute the ClusterDumper using Eclipse. Setting breakpoints and other debugging features should work as usual.</li>
+</ul>
+<p>Reading the output file</p>
+<p>This will write the clusters to a file called clusteranalyze.txt inside $MAHOUT_HOME/examples/output.
+Sample data will look like:</p>
+<p>CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, 29.937, 29.751, 30.054, 30.039, 30.126, 29.764, 29.835, 30.503, 29.876, 29.990, 29.605, 29.379, 30.120, 29.882, 30.161, 29.825, 30.074, 30.001, 30.421, 29.867, 29.736, 29.760, 30.192, 30.134, 30.082, 29.962, 29.512, 29.736, 29.594, 29.493, 29.761, 29.183, 29.517, 29.273, 29.161, 29.215, 29.731, 29.154, 29.113, 29.348, 28.981, 29.543, 29.192, 29.479, 29.406, 29.715, 29.344, 29.628, 29.074, 29.347, 29.812, 29.058, 29.177, 29.063, 29.607]
+ r=[3.463, 3.351, 3.452, 3.438, 3.371, 3.569, 3.253, 3.531, 3.439, 3.472,
+3.402, 3.459, 3.320, 3.260, 3.430, 3.452, 3.320, 3.499, 3.302, 3.511,
+3.520, 3.447, 3.516, 3.485, 3.345, 3.178, 3.492, 3.434, 3.619, 3.483,
+3.651, 3.833, 3.812, 3.433, 4.133, 3.855, 4.123, 3.999, 4.467, 4.731,
+4.539, 4.956, 4.644, 4.382, 4.277, 4.918, 4.784, 4.582, 4.915, 4.607,
+4.672, 4.577, 5.035, 5.241, 4.731, 4.688, 4.685, 4.657, 4.912, 4.300] }</p>
+<p>and so on...</p>
+<p>where CL-0 is cluster 0, n=116 is the number of points observed by this cluster, c = [29.922 ...]
+ is the center of the cluster as a vector, and r = [3.463 ...] is
+the radius of the cluster as a vector.</p>
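+<p>For further analysis, a record in this format can be split into its fields with a small script. The following is a minimal sketch that assumes every record looks like the sample above (an identifier, then n, c and r inside braces); the function name <code>parse_cluster_record</code> is hypothetical and is not part of Mahout.</p>

```python
import re

# Pattern for one record such as "CL-0 { n=116 c=[...] r=[...] }".
# Assumes the layout of the sample shown above; records may span several lines.
RECORD = re.compile(
    r"(\S+)\s*\{\s*n=(\d+)\s*c=\[([^\]]*)\]\s*r=\[([^\]]*)\]\s*\}"
)


def _to_floats(text):
    """Turn a comma-separated list of numbers into a list of floats."""
    return [float(value) for value in text.split(",") if value.strip()]


def parse_cluster_record(text):
    """Return the id, point count, center and radius of one cluster record."""
    match = RECORD.search(text)
    if match is None:
        raise ValueError("text does not contain a cluster record")
    cluster_id, count, center, radius = match.groups()
    return {
        "id": cluster_id,
        "n": int(count),
        "center": _to_floats(center),
        "radius": _to_floats(radius),
    }


if __name__ == "__main__":
    # Shortened two-dimensional record, made up in the style of the sample.
    record = parse_cluster_record("CL-0 { n=116 c=[29.922, 30.407] r=[3.463,\n3.351] }")
    print(record["id"], record["n"], len(record["center"]))
```

+<p>The center and radius lists have one entry per dimension of the clustered vectors, so both should have the same length for a given record.</p>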
+   </div>
+  </div>     
+</div> 
+  <footer class="footer" align="center">
+    <div class="container">
+      <p>
+        Copyright &copy; 2014 The Apache Software Foundation, Licensed under
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </footer>
+  
+  <script src="/js/jquery-1.9.1.min.js"></script>
+  <script src="/js/bootstrap.min.js"></script>
+  <script>
+    (function() {
+      var cx = '012254517474945470291:vhsfv7eokdc';
+      var gcse = document.createElement('script');
+      gcse.type = 'text/javascript';
+      gcse.async = true;
+      gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
+          '//www.google.com/cse/cse.js?cx=' + cx;
+      var s = document.getElementsByTagName('script')[0];
+      s.parentNode.insertBefore(gcse, s);
+    })();
+  </script>
+</body>
+</html>


