mahout-commits mailing list archives

From build...@apache.org
Subject svn commit: r906046 [3/4] - in /websites/staging/mahout/trunk/content: ./ users/basics/ users/classification/ users/dim-reduction/ users/misc/ users/stuff/
Date Sun, 13 Apr 2014 19:20:27 GMT
Added: websites/staging/mahout/trunk/content/users/misc/parallel-frequent-pattern-mining.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/misc/parallel-frequent-pattern-mining.html (added)
+++ websites/staging/mahout/trunk/content/users/misc/parallel-frequent-pattern-mining.html Sun Apr 13 19:20:27 2014
@@ -0,0 +1,421 @@
+<!DOCTYPE html>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <title>Apache Mahout: Scalable machine learning and data mining</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+  <meta name="Distribution" content="Global">
+  <meta name="Robots" content="index,follow">
+  <meta name="keywords" content="apache, apache hadoop, apache lucene,
+        business data mining, cluster analysis,
+        collaborative filtering, data extraction, data filtering, data framework, data integration,
+        data matching, data mining, data mining algorithms, data mining analysis, data mining data,
+        data mining introduction, data mining software,
+        data mining techniques, data representation, data set, datamining,
+        feature extraction, fuzzy k means, genetic algorithm, hadoop,
+        hierarchical clustering, high dimensional, introduction to data mining, kmeans,
+        knowledge discovery, learning approach, learning approaches, learning methods,
+        learning techniques, lucene, machine learning, machine translation, mahout apache,
+        mahout taste, map reduce hadoop, mining data, mining methods, naive bayes,
+        natural language processing,
+        supervised, text mining, time series data, unsupervised, web data mining">
+  <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico">
+  <script type="text/javascript" src="/js/prototype.js"></script>
+  <script type="text/javascript" src="/js/effects.js"></script>
+  <script type="text/javascript" src="/js/search.js"></script>
+  <script type="text/javascript" src="/js/slides.js"></script>
+
+  <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
+  <link href="/css/bootstrap-responsive.css" rel="stylesheet">
+  <link rel="stylesheet" href="/css/global.css" type="text/css">
+	
+  <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown -->
+	<script type="text/x-mathjax-config">
+	MathJax.Hub.Config({
+		tex2jax: {
+			skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+		}
+	});
+	MathJax.Hub.Queue(function() {
+		var all = MathJax.Hub.getAllJax(), i;
+		for(i = 0; i < all.length; i += 1) {
+			all[i].SourceElement().parentNode.className += ' has-jax';
+		}
+	});
+	</script>
+	<script type="text/javascript"
+	  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
+	</script>	
+</head>
+
+<body id="home" data-twttr-rendered="true">
+  <div id="wrap">
+   <div id="header">
+    <div id="logo"><a href="/overview.html"></a></div>
+  <div id="search">
+    <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right">    
+      <input value="http://mahout.apache.org" name="sitesearch" type="hidden">
+      <input class="search-query" name="q" id="query" type="text">
+      <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" />
+    </form>
+  </div>
+
+    <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;">
+      <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <!-- <a class="brand" href="#">Apache Community Development Project</a> -->
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="/">Home</a></li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/general/downloads.html">Downloads</a>
+                  <li><a href="/general/who-we-are.html">Who we are</a>
+                  <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> 
+                  <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li>
+                  <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a>
+                  <li><a href="/general/professional-support.html">Professional Support</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Resources</li>
+                  <li><a href="/general/reference-reading.html">Reference Reading</a>
+		  <li><a href="/general/faq.html">FAQ</a>
+		  <li class="divider"></li>
+		  <li class="nav-header">Legal</li>
+		  <li><a href="http://www.apache.org/licenses/">License</a></li>
+		  <li><a href="http://www.apache.org/security/">Security</a></li>
+                  <li><a href="/general/privacy-policy.html">Privacy Policy</a>
+                </ul>
+              </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/developers/developer-resources.html">Developer resources</a></li>
+                  <li><a href="/developers/version-control.html">Version control</a></li>
+                  <li><a href="/developers/buildingmahout.html">Build from source</a></li>
+                  <li><a href="/developers/issue-tracker.html">Issue tracker</a></li>
+      		  <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Contributions</li>
+                  <li><a href="/developers/how-to-contribute.html">How to contribute</a></li>
+                  <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li>
+                  <li><a href="/developers/gsoc.html">GSoC</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">For committers</li>
+                  <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li>
+                  <li><a href="/developers/patch-check-list.html">Patch check list</a></li>
+                  <li><a href="/developers/how-to-release.html">How to release</a></li>
+                  <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li>
+                </ul>
+               </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a>
+                 <ul class="dropdown-menu">
+                  <li><a href="/users/basics/algorithms.html">List of algorithms</a>
+                  <li><a href="/users/basics/quickstart.html">Quickstart</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Working with text</li>
+                  <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a>
+                  <li><a href="/users/basics/collocations.html">Collocations</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Dimensionality reduction</li>
+                  <li><a href="/users/basics/dimensional-reduction.html">Singular Value Decomposition</a></li>
+                  <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Topic Models</li>      
+                  <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Mahout On Spark</li>      
+                  <li><a href="/users/sparkbindings/home.html">Scala &amp; Spark Bindings *new*</a></li>
+                </ul>
+                 </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+
+              		<li><a href="/users/classification/bayesian.html">Naive Bayes</a></li>
+                  <li><a href="/users/stuff/hidden-markov-models.html">Hidden Markov Models</a></li>
+                  <li><a href="/users/classification/logistic-regression.html">Logistic Regression</a></li>
+                  <li><a href="/users/stuff/partial-implementation.html">Random Forest</a></li>
+
+                  <li class="divider"></li>
+                  <li class="nav-header">Examples</li>
+                  <li><a href="/users/classification/wikipedia-bayes-example.html">Wikipedia example</a></li>
+                  <li><a href="/users/clustering/twenty-newsgroups.html">20 newsgroups example</a></li>
+                  <li><a href="/users/classification/breiman-example.html">Breiman example</a></li>
+                </ul></li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/clustering/k-means-clustering.html">k-Means</a></li>
+                <li><a href="/users/clustering/canopy-clustering.html">Canopy</a></li>
+                <li><a href="/users/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li>
+                <li><a href="/users/clustering/spectral-clustering.html">Spectral Clustering</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Commandline usage</li>
+                <li><a href="/users/clustering/k-means-commandline.html">Options for k-Means</a></li>
+                <li><a href="/users/clustering/canopy-commandline.html">Options for Canopy</a></li>
+            		<li><a href="/users/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Examples</li>
+                <li><a href="/users/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Post processing</li>
+                <li><a href="/users/clustering/cluster-dumper.html">Cluster Dumper tool</a></li>
+                <li><a href="/users/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li>
+                </ul></li>
+                <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/recommender/quickstart.html">Quickstart</a></li>
+                <li><a href="/users/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li>
+                <li><a href="/users/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li>
+	        <li><a href="/users/recommender/recommender-documentation.html">General</a></li>
+                </ul></li>
+           </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+</div>
+
+ <div id="sidebar">
+  <div id="sidebar-wrap">
+    <h2>Twitter</h2>
+	<ul class="sidemenu">
+		<li>
+<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a>
+<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
+</li>
+	</ul>
+    <h2>Apache Software Foundation</h2>
+    <ul class="sidemenu">
+      <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li>
+      <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li>
+      <li><a href="http://www.apache.org/dev/">Developer Resources</a></li>
+      <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
+      <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+    </ul>
+    <h2>Related Projects</h2>
+    <ul class="sidemenu">
+      <li><a href="http://lucene.apache.org/">Lucene</a></li>
+      <li><a href="http://hadoop.apache.org/">Hadoop</a></li>
+    </ul>
+  </div>
+</div>
+
+  <div id="content-wrap" class="clearfix">
+   <div id="main">
+    <p>Mahout has a Top K Parallel FPGrowth implementation. It is based on the paper <a href="http://infolab.stanford.edu/~echang/recsys08-69.pdf">http://infolab.stanford.edu/~echang/recsys08-69.pdf</a>
+ with some optimisations in mining the data.</p>
+<p>Given a huge transaction list, the algorithm finds all unique features (sets
+of field values) and eliminates those features whose frequency in the whole
+dataset is less than minSupport. Using the remaining N features, we find
+the top K closed patterns for each of them, generating a total of up to NxK
+patterns. The FPGrowth algorithm is a generic implementation, so any
+Object type can denote a feature. The current implementation requires you to use
+a String as the object type. You may implement a version for any object by
+creating Iterators, Convertors and TopKPatternWritable for that particular
+object. For more information, please refer to the package
+org.apache.mahout.fpm.pfpgrowth.convertors.string</p>
+<p>For example:</p>
+<div class="codehilite"><pre>FPGrowth&lt;String&gt; fp = new FPGrowth&lt;String&gt;();
+Set&lt;String&gt; features = new HashSet&lt;String&gt;();
+fp.generateTopKStringFrequentPatterns(
+    new StringRecordIterator(
+        new FileLineIterable(new File(input), encoding, false), pattern),
+    fp.generateFList(
+        new StringRecordIterator(
+            new FileLineIterable(new File(input), encoding, false), pattern),
+        minSupport),
+    minSupport,
+    maxHeapSize,
+    features,
+    new StringOutputConvertor(
+        new SequenceFileOutputCollector&lt;Text, TopKStringPatterns&gt;(writer))
+);
+</pre></div>
+
+
+<ul>
+<li>The first argument is the iterator over transactions; in this case it is an
+Iterator&lt;List&lt;String&gt;&gt;</li>
+<li>The second argument is the output of the generateFList function, which
+returns the frequent items and their frequencies from the given database
+transaction iterator</li>
+<li>The third argument is the minimum support of the patterns to be generated</li>
+<li>The fourth argument is the maximum number of patterns to be mined for
+each feature</li>
+<li>The fifth argument is the set of features for which the frequent patterns
+have to be mined</li>
+<li>The last argument is an output collector which takes key/value pairs
+ of feature and top K patterns in the format [String,
+List&lt;Pair&lt;List&lt;String&gt;, Long&gt;&gt;] and writes them to the appropriate writer
+class, which takes care of storing the object, in this case in the SequenceFile
+output format</li>
+</ul>
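+<p>The role of the fourth argument can be pictured with a bounded min-heap that
+keeps only the K best-supported patterns seen so far. The sketch below is a
+simplified illustration and not Mahout's own heap class: the class name
+TopKPatterns, the method topK and the use of Map.Entry to pair an item set with
+its support count are all assumptions made for this example.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper, not part of Mahout: keeps the k patterns with the highest support. */
final class TopKPatterns {

    static List<Map.Entry<List<String>, Long>> topK(
            Iterable<Map.Entry<List<String>, Long>> candidates, int k) {
        Comparator<Map.Entry<List<String>, Long>> bySupport = Map.Entry.comparingByValue();
        // Min-heap: the weakest retained pattern sits at the head and is evicted first.
        PriorityQueue<Map.Entry<List<String>, Long>> heap = new PriorityQueue<>(bySupport);
        for (Map.Entry<List<String>, Long> candidate : candidates) {
            heap.add(candidate);
            if (heap.size() > k) {
                heap.poll(); // drop the pattern with the lowest support
            }
        }
        List<Map.Entry<List<String>, Long>> result = new ArrayList<>(heap);
        result.sort(bySupport.reversed()); // highest support first
        return result;
    }
}
```

+<p>Because the heap never holds more than K+1 entries, memory per feature stays
+bounded no matter how many candidate patterns are generated.</p>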
+<p><a name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a></p>
+<h2 id="running-frequent-pattern-growth-via-command-line">Running Frequent Pattern Growth via command line</h2>
+<p>The command line launcher for string transaction data,
+org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features, including
+specifying the regex pattern for splitting a string line of a transaction
+into its constituent features.</p>
+<p>Input files have to be in the following format.</p>
+<p>&lt;optional document id&gt;TAB&lt;TOKEN1&gt;SPACE&lt;TOKEN2&gt;SPACE....</p>
+<p>Instead of a tab you could use , or | as the delimiter, because the default tokenization is done using the Java regex pattern <code>[ ,\t]*[,|\t][ ,\t]*</code>.
+You can override this parameter to parse your log files or transaction
+files (each line is a transaction). The FPGrowth algorithm mines the top K
+frequently occurring sets of items and their counts from the given input
+data.</p>
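+<p>To see how a splitter regex turns one input line into features, here is a
+small standalone sketch. It assumes the default pattern is
+[ ,\t]*[,|\t][ ,\t]* (reconstructed from the description above); the class
+TransactionSplitter and its split method are hypothetical names for this
+example and are not Mahout classes.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Mahout: splits one transaction line into its features. */
final class TransactionSplitter {

    // Assumed default: a comma, pipe or tab, optionally surrounded by spaces, commas or tabs.
    static final Pattern DEFAULT_SPLITTER = Pattern.compile("[ ,\t]*[,|\t][ ,\t]*");

    // Single-space splitter, as passed with -regex in the retail.dat examples.
    static final Pattern SPACE_SPLITTER = Pattern.compile("[ ]");

    static List<String> split(String line, Pattern splitter) {
        // Trim first so leading or trailing spaces do not produce empty features.
        return Arrays.asList(splitter.split(line.trim()));
    }
}
```

+<p>Note that under this assumed default a line containing only spaces between
+tokens is not split at all, which is why the retail.dat examples pass a
+space-based regex explicitly.</p>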
+<p>$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this
+format. 
+Another sample file is accidents.dat.gz from <a href="http://fimi.cs.helsinki.fi/data/">http://fimi.cs.helsinki.fi/data/</a>
+. As a quick test, try this:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">fpg</span> <span class="o">\</span>
+     <span class="o">-</span><span class="nb">i</span> <span class="n">core</span><span class="o">/</span><span class="n">src</span><span class="o">/</span><span class="n">test</span><span class="o">/</span><span class="n">resources</span><span class="o">/</span><span class="n">retail</span><span class="p">.</span><span class="n">dat</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">o</span> <span class="n">patterns</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">k</span> 50 <span class="o">\</span>
+     <span class="o">-</span><span class="n">method</span> <span class="n">sequential</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">regex</span> <span class="s">&#39;[\ ]&#39;</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">s</span> 2
+</pre></div>
+
+
+<p>The minimumSupport parameter -s is the minimum number of times a pattern
+or a feature needs to occur in the dataset for it to be included in the
+patterns generated. You can speed up the process by using a large value of
+s. There are cases where you will have fewer than k patterns for a
+particular feature, as the rest do not meet the minimum support
+criterion.</p>
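+<p>The effect of the minimum support threshold on individual features can be
+sketched as a counting pass followed by a filter. This is a simplified
+illustration under stated assumptions, not Mahout's generateFList: the class
+FrequencyList and the method frequentItems are hypothetical, and support is
+taken here to mean the number of transactions that contain the item.</p>

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Mahout: counts items and drops the infrequent ones. */
final class FrequencyList {

    static Map<String, Long> frequentItems(List<List<String>> transactions, long minSupport) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> transaction : transactions) {
            // De-duplicate so that an item counts at most once per transaction (assumption).
            for (String item : new HashSet<>(transaction)) {
                counts.merge(item, 1L, Long::sum);
            }
        }
        // Remove every item whose support is below the threshold.
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }
}
```

+<p>A larger minSupport removes more items in this first pass, so fewer
+candidates remain for the pattern mining that follows.</p>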
+<p>Note that the input to the algorithm can be an uncompressed or compressed
+gz file, or even a directory containing any number of such files.
+We modified the regex to use a space to split the tokens. Note that the input
+regex string is escaped.</p>
+<p><a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a></p>
+<h2 id="running-parallel-fpgrowth">Running Parallel FPGrowth</h2>
+<p>Running parallel FPGrowth is as easy as changing the flag to -method
+mapreduce and adding the number of groups parameter, e.g. -g 20 for 20
+groups. First, let's run the above sample test in map-reduce mode:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">fpg</span> <span class="o">\</span>
+     <span class="o">-</span><span class="nb">i</span> <span class="n">core</span><span class="o">/</span><span class="n">src</span><span class="o">/</span><span class="n">test</span><span class="o">/</span><span class="n">resources</span><span class="o">/</span><span class="n">retail</span><span class="p">.</span><span class="n">dat</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">o</span> <span class="n">patterns</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">k</span> 50 <span class="o">\</span>
+     <span class="o">-</span><span class="n">method</span> <span class="n">mapreduce</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">regex</span> <span class="s">&#39;[\ ]&#39;</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">s</span> 2
+</pre></div>
+
+
+<p>The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in
+sequential mode (with 5 GB of RAM allocated). In a separate test,
+the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30
+seconds in sequential mode.</p>
+<p>Here is another dataset which, while several times larger, requires much
+less time to find frequent patterns, as there are very few. Get
+accidents.dat.gz from <a href="http://fimi.cs.helsinki.fi/data/">http://fimi.cs.helsinki.fi/data/</a>
+ and place it on your HDFS in a folder named accidents. Then run the
+Hadoop version of the FPGrowth job:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">fpg</span> <span class="o">\</span>
+     <span class="o">-</span><span class="nb">i</span> <span class="n">accidents</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">o</span> <span class="n">patterns</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">k</span> 50 <span class="o">\</span>
+     <span class="o">-</span><span class="n">method</span> <span class="n">mapreduce</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">regex</span> <span class="s">&#39;[\ ]&#39;</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">s</span> 2
+</pre></div>
+
+
+<p>Or, to run a dataset of this size in sequential mode on a single machine,
+let's give Mahout a lot more memory and only keep features with more than
+300 members:</p>
+<div class="codehilite"><pre><span class="n">export</span> <span class="n">MAHOUT_HEAPSIZE</span><span class="p">=</span><span class="o">-</span><span class="n">Xmx5000m</span>
+<span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">fpg</span> <span class="o">\</span>
+     <span class="o">-</span><span class="nb">i</span> <span class="n">accidents</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">o</span> <span class="n">patterns</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">k</span> 50 <span class="o">\</span>
+     <span class="o">-</span><span class="n">method</span> <span class="n">sequential</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">regex</span> <span class="s">&#39;[\ ]&#39;</span> <span class="o">\</span>
+     <span class="o">-</span><span class="n">s</span> 2
+</pre></div>
+
+
+<p>The numGroups parameter -g in FPGrowthJob specifies the number of groups
+into which transactions have to be decomposed. The default of 1000 works
+very well on a single-machine cluster; this may be very different on large
+clusters.</p>
+<p>Note that accidents.dat has 340 unique features. So we chose -g 10 to
+split the transactions across 10 shards where 34 patterns are mined from
+each shard. (Note: g doesn't need to divide the number of features exactly.) The algorithm
+takes care of calculating the split. For better performance on large
+datasets and clusters, try not to mine for more than 20-25 features per
+shard. Stick to the defaults on a small machine.</p>
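+<p>The arithmetic behind the split can be sketched as a ceiling division
+followed by an integer division on a feature's index. This is an illustrative
+assumption about how such a split can be computed, not a statement of Mahout's
+exact code; the class GroupSizing and both method names are hypothetical.</p>

```java
/** Hypothetical helper, not part of Mahout: assigns features to groups (shards). */
final class GroupSizing {

    /** Largest number of features any one group receives (ceiling division). */
    static int maxFeaturesPerGroup(int numFeatures, int numGroups) {
        return (numFeatures + numGroups - 1) / numGroups;
    }

    /** Group index for a feature, given its zero-based index in the feature list. */
    static int groupOf(int featureIndex, int maxPerGroup) {
        return featureIndex / maxPerGroup;
    }
}
```

+<p>With 340 features and -g 10 this gives 34 features per group; with 341
+features the ceiling rounds up to 35, so the last group is simply smaller than
+the others.</p>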
+<p>The numTreeCacheEntries parameter -tc specifies the number of generated
+conditional FP-Trees to be kept in memory so that subsequent operations do
+not need to regenerate them. Increasing this number increases memory
+consumption but might improve speed up to a certain point. This depends
+entirely on the dataset in question. A value of 5-10 is recommended for
+mining up to the top 100 patterns for each feature.</p>
+<p><a name="ParallelFrequentPatternMining-Viewingtheresults"></a></p>
+<h2 id="viewing-the-results">Viewing the results</h2>
+<p>The output will be dumped to a SequenceFile in the frequentpatterns
+directory in Text=&gt;TopKStringPatterns format. Run this command to see a few
+of the Frequent Patterns:</p>
+<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">seqdumper</span> <span class="o">\</span>
+     <span class="o">-</span><span class="nb">i</span> <span class="n">patterns</span><span class="o">/</span><span class="n">frequentpatterns</span><span class="o">/</span><span class="n">part</span><span class="o">-</span>?<span class="o">-</span>00000 <span class="o">\</span>
+     <span class="o">-</span><span class="n">n</span> 4
+</pre></div>
+
+
+<p>or replace -n 4 with -c for the count of patterns.</p>
+<p>Open questions: how does one experiment and monitor with these various
+parameters?</p>
+   </div>
+  </div>     
+</div> 
+  <footer class="footer" align="center">
+    <div class="container">
+      <p>
+        Copyright &copy; 2014 The Apache Software Foundation, Licensed under
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </footer>
+  
+  <script src="/js/jquery-1.9.1.min.js"></script>
+  <script src="/js/bootstrap.min.js"></script>
+  <script>
+    (function() {
+      var cx = '012254517474945470291:vhsfv7eokdc';
+      var gcse = document.createElement('script');
+      gcse.type = 'text/javascript';
+      gcse.async = true;
+      gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
+          '//www.google.com/cse/cse.js?cx=' + cx;
+      var s = document.getElementsByTagName('script')[0];
+      s.parentNode.insertBefore(gcse, s);
+    })();
+  </script>
+</body>
+</html>

Added: websites/staging/mahout/trunk/content/users/misc/perceptron-and-winnow.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/misc/perceptron-and-winnow.html (added)
+++ websites/staging/mahout/trunk/content/users/misc/perceptron-and-winnow.html Sun Apr 13 19:20:27 2014
@@ -0,0 +1,281 @@
+<!DOCTYPE html>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <title>Apache Mahout: Scalable machine learning and data mining</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+  <meta name="Distribution" content="Global">
+  <meta name="Robots" content="index,follow">
+  <meta name="keywords" content="apache, apache hadoop, apache lucene,
+        business data mining, cluster analysis,
+        collaborative filtering, data extraction, data filtering, data framework, data integration,
+        data matching, data mining, data mining algorithms, data mining analysis, data mining data,
+        data mining introduction, data mining software,
+        data mining techniques, data representation, data set, datamining,
+        feature extraction, fuzzy k means, genetic algorithm, hadoop,
+        hierarchical clustering, high dimensional, introduction to data mining, kmeans,
+        knowledge discovery, learning approach, learning approaches, learning methods,
+        learning techniques, lucene, machine learning, machine translation, mahout apache,
+        mahout taste, map reduce hadoop, mining data, mining methods, naive bayes,
+        natural language processing,
+        supervised, text mining, time series data, unsupervised, web data mining">
+  <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico">
+  <script type="text/javascript" src="/js/prototype.js"></script>
+  <script type="text/javascript" src="/js/effects.js"></script>
+  <script type="text/javascript" src="/js/search.js"></script>
+  <script type="text/javascript" src="/js/slides.js"></script>
+
+  <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
+  <link href="/css/bootstrap-responsive.css" rel="stylesheet">
+  <link rel="stylesheet" href="/css/global.css" type="text/css">
+	
+  <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown -->
+	<script type="text/x-mathjax-config">
+	MathJax.Hub.Config({
+		tex2jax: {
+			skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+		}
+	});
+	MathJax.Hub.Queue(function() {
+		var all = MathJax.Hub.getAllJax(), i;
+		for(i = 0; i < all.length; i += 1) {
+			all[i].SourceElement().parentNode.className += ' has-jax';
+		}
+	});
+	</script>
+	<script type="text/javascript"
+	  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
+	</script>	
+</head>
+
+<body id="home" data-twttr-rendered="true">
+  <div id="wrap">
+   <div id="header">
+    <div id="logo"><a href="/overview.html"></a></div>
+  <div id="search">
+    <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right">    
+      <input value="http://mahout.apache.org" name="sitesearch" type="hidden">
+      <input class="search-query" name="q" id="query" type="text">
+      <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" />
+    </form>
+  </div>
+
+    <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;">
+      <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <!-- <a class="brand" href="#">Apache Community Development Project</a> -->
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="/">Home</a></li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/general/downloads.html">Downloads</a>
+                  <li><a href="/general/who-we-are.html">Who we are</a>
+                  <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> 
+                  <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li>
+                  <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a>
+                  <li><a href="/general/professional-support.html">Professional Support</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Resources</li>
+                  <li><a href="/general/reference-reading.html">Reference Reading</a>
+		  <li><a href="/general/faq.html">FAQ</a>
+		  <li class="divider"></li>
+		  <li class="nav-header">Legal</li>
+		  <li><a href="http://www.apache.org/licenses/">License</a></li>
+		  <li><a href="http://www.apache.org/security/">Security</a></li>
+                  <li><a href="/general/privacy-policy.html">Privacy Policy</a>
+                </ul>
+              </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/developers/developer-resources.html">Developer resources</a></li>
+                  <li><a href="/developers/version-control.html">Version control</a></li>
+                  <li><a href="/developers/buildingmahout.html">Build from source</a></li>
+                  <li><a href="/developers/issue-tracker.html">Issue tracker</a></li>
+      		  <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Contributions</li>
+                  <li><a href="/developers/how-to-contribute.html">How to contribute</a></li>
+                  <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li>
+                  <li><a href="/developers/gsoc.html">GSoC</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">For committers</li>
+                  <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li>
+                  <li><a href="/developers/patch-check-list.html">Patch check list</a></li>
+                  <li><a href="/developers/how-to-release.html">How to release</a></li>
+                  <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li>
+                </ul>
+               </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a>
+                 <ul class="dropdown-menu">
+                  <li><a href="/users/basics/algorithms.html">List of algorithms</a>
+                  <li><a href="/users/basics/quickstart.html">Quickstart</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Working with text</li>
+                  <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a>
+                  <li><a href="/users/basics/collocations.html">Collocations</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Dimensionality reduction</li>
+                  <li><a href="/users/basics/dimensional-reduction.html">Singular Value Decomposition</a></li>
+                  <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Topic Models</li>      
+                  <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Mahout On Spark</li>      
+                  <li><a href="/users/sparkbindings/home.html">Scala &amp; Spark Bindings *new*</a></li>
+                </ul>
+                 </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+
+              		<li><a href="/users/classification/bayesian.html">Naive Bayes</a></li>
+                  <li><a href="/users/stuff/hidden-markov-models.html">Hidden Markov Models</a></li>
+                  <li><a href="/users/classification/logistic-regression.html">Logistic Regression</a></li>
+                  <li><a href="/users/stuff/partial-implementation.html">Random Forest</a></li>
+
+                  <li class="divider"></li>
+                  <li class="nav-header">Examples</li>
+                  <li><a href="/users/classification/wikipedia-bayes-example.html">Wikipedia example</a></li>
+                  <li><a href="/users/clustering/twenty-newsgroups.html">20 newsgroups example</a></li>
+                  <li><a href="/users/classification/breiman-example.html">Breiman example</a></li>
+                </ul></li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/clustering/k-means-clustering.html">k-Means</a></li>
+                <li><a href="/users/clustering/canopy-clustering.html">Canopy</a></li>
+                <li><a href="/users/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li>
+                <li><a href="/users/clustering/spectral-clustering.html">Spectral Clustering</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Commandline usage</li>
+                <li><a href="/users/clustering/k-means-commandline.html">Options for k-Means</a></li>
+                <li><a href="/users/clustering/canopy-commandline.html">Options for Canopy</a></li>
+            		<li><a href="/users/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Examples</li>
+                <li><a href="/users/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Post processing</li>
+                <li><a href="/users/clustering/cluster-dumper.html">Cluster Dumper tool</a></li>
+                <li><a href="/users/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li>
+                </ul></li>
+                <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/recommender/quickstart.html">Quickstart</a></li>
+                <li><a href="/users/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li>
+                <li><a href="/users/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li>
+	        <li><a href="/users/recommender/recommender-documentation.html">General</a></li>
+                </ul></li>
+           </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+</div>
+
+ <div id="sidebar">
+  <div id="sidebar-wrap">
+    <h2>Twitter</h2>
+	<ul class="sidemenu">
+		<li>
+<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a>
+<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
+</li>
+	</ul>
+    <h2>Apache Software Foundation</h2>
+    <ul class="sidemenu">
+      <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li>
+      <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li>
+      <li><a href="http://www.apache.org/dev/">Developer Resources</a></li>
+      <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
+      <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+    </ul>
+    <h2>Related Projects</h2>
+    <ul class="sidemenu">
+      <li><a href="http://lucene.apache.org/">Lucene</a></li>
+      <li><a href="http://hadoop.apache.org/">Hadoop</a></li>
+    </ul>
+  </div>
+</div>
+
+  <div id="content-wrap" class="clearfix">
+   <div id="main">
+    <p><a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a></p>
+<h1 id="classification-with-perceptron-or-winnow">Classification with Perceptron or Winnow</h1>
+<p>Both algorithms are comparatively simple linear classifiers. Given training
+data in some n-dimensional vector space, annotated with binary labels,
+the algorithms are guaranteed to find a separating hyperplane
+if one exists. In contrast to the Perceptron, Winnow works only for binary
+feature vectors.</p>
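+<p>As a minimal sketch of the two update rules (not Mahout's implementation),
+the Python code below trains a Perceptron with additive updates and a Winnow
+variant with multiplicative updates on binary feature vectors. The function
+names, the learning rate, the promotion factor <code>alpha</code>, the
+threshold choice and the toy data are all assumptions made for illustration.</p>

```python
from itertools import product
from typing import Sequence


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron_predict(w, b, x) -> int:
    """Predict 1 if the example lies on the positive side of the hyperplane."""
    return 1 if dot(w, x) + b > 0 else 0


def train_perceptron(examples, epochs=100, rate=1.0):
    """Additive updates: on a mistake, move the hyperplane towards the example."""
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            if perceptron_predict(w, b, x) != y:
                sign = 1.0 if y == 1 else -1.0
                w = [wi + rate * sign * xi for wi, xi in zip(w, x)]
                b += rate * sign
                mistakes += 1
        if mistakes == 0:  # an epoch without mistakes: the data is separated
            break
    return w, b


def winnow_predict(w, theta, x) -> int:
    """Predict 1 if the summed weight of the active features exceeds theta."""
    return 1 if dot(w, x) > theta else 0


def train_winnow(examples, epochs=100, alpha=2.0):
    """Multiplicative updates on binary features (assumed variant: divide on demotion)."""
    dim = len(examples[0][0])
    w = [1.0] * dim
    theta = float(dim)  # assumed threshold: the number of features
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            if winnow_predict(w, theta, x) != y:
                # Promote active features on a missed positive, demote on a false positive.
                factor = alpha if y == 1 else 1.0 / alpha
                w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w, theta


# Hypothetical toy data: the label is 1 when either of the first two features is set.
TOY_DATA = [(x, 1 if x[0] or x[1] else 0) for x in product((0, 1), repeat=3)]

perceptron_weights, perceptron_bias = train_perceptron(TOY_DATA)
winnow_weights, winnow_theta = train_winnow(TOY_DATA)
```

+<p>The only difference between the two training loops is the update: the
+Perceptron adds or subtracts the example, while Winnow multiplies or divides
+the weights of the active features, which is why it expects binary inputs.</p>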
+<p>For more information on the Perceptron see for instance:
+http://en.wikipedia.org/wiki/Perceptron</p>
+<p>Concise course notes on both algorithms:
+http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture24.pdf</p>
+<p>Although the algorithms are comparatively simple, they still work well
+for text classification and are fast to train even on huge example sets.
+In contrast to Naive Bayes, they do not assume that all
+features (in the domain of text classification: all terms in a document)
+are independent.</p>
+<p><a name="PerceptronandWinnow-Strategyforparallelisation"></a></p>
+<h2 id="strategy-for-parallelisation">Strategy for parallelisation</h2>
+<p>The current parallelisation strategy is simple: provided there is enough
+training data, split it and train a separate classifier on each split.
+The resulting hyperplanes are then averaged.</p>
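+<p>The split, train and average steps can be sketched as follows. This is an
+illustration under stated assumptions rather than the code in the patch: the
+helper names, the round-robin split and the sequential loop over the splits
+are all hypothetical, and the bias is stored as the last weight so that a
+hyperplane is a single list of numbers.</p>

```python
def train_perceptron(examples, epochs=20):
    """Train one Perceptron; the returned list holds the weights plus a trailing bias."""
    dim = len(examples[0][0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in examples:
            xs = list(x) + [1.0]  # constant feature that carries the bias
            predicted = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            if predicted != y:
                sign = 1.0 if y == 1 else -1.0
                w = [wi + sign * xi for wi, xi in zip(w, xs)]
    return w


def split_examples(examples, parts):
    """Round-robin split of the training data into `parts` non-empty shards."""
    if parts < 1 or parts > len(examples):
        raise ValueError("parts must be between 1 and the number of examples")
    return [examples[i::parts] for i in range(parts)]


def average_hyperplanes(models):
    """Component-wise mean of several weight vectors of equal length."""
    count = len(models)
    return [sum(column) / count for column in zip(*models)]


def train_by_averaging(examples, parts, epochs=20):
    """Train one classifier per shard and average the resulting hyperplanes."""
    shards = split_examples(examples, parts)
    # The shards are independent, so this loop is the part that could run in parallel.
    models = [train_perceptron(shard, epochs) for shard in shards]
    return average_hyperplanes(models)


# Hypothetical toy data: the label is 1 when the first coordinate is positive.
DATA = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 1),
        ((1.0, 1.0), 1), ((2.0, 0.0), 1), ((-1.0, 1.0), 0)]

averaged = train_by_averaging(DATA, parts=2)
```

+<p>Each shard yields a hyperplane that fits only its own examples, so the
+average is not guaranteed to separate the full training set; that is the
+price of training the splits independently.</p>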
+<p><a name="PerceptronandWinnow-Roadmap"></a></p>
+<h2 id="roadmap">Roadmap</h2>
+<p>The patch currently contains only the code for the classifier itself.
+Unit tests and at least one example based on the WebKB dataset are planned
+for the serial version by the end of November. Parallelisation
+will be added after that.</p>
+   </div>
+  </div>     
+</div> 
+  <footer class="footer" align="center">
+    <div class="container">
+      <p>
+        Copyright &copy; 2014 The Apache Software Foundation, Licensed under
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </footer>
+  
+  <script src="/js/jquery-1.9.1.min.js"></script>
+  <script src="/js/bootstrap.min.js"></script>
+  <script>
+    (function() {
+      var cx = '012254517474945470291:vhsfv7eokdc';
+      var gcse = document.createElement('script');
+      gcse.type = 'text/javascript';
+      gcse.async = true;
+      gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
+          '//www.google.com/cse/cse.js?cx=' + cx;
+      var s = document.getElementsByTagName('script')[0];
+      s.parentNode.insertBefore(gcse, s);
+    })();
+  </script>
+</body>
+</html>

Added: websites/staging/mahout/trunk/content/users/misc/testing.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/misc/testing.html (added)
+++ websites/staging/mahout/trunk/content/users/misc/testing.html Sun Apr 13 19:20:27 2014
@@ -0,0 +1,289 @@
+<!DOCTYPE html>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <title>Apache Mahout: Scalable machine learning and data mining</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+  <meta name="Distribution" content="Global">
+  <meta name="Robots" content="index,follow">
+  <meta name="keywords" content="apache, apache hadoop, apache lucene,
+        business data mining, cluster analysis,
+        collaborative filtering, data extraction, data filtering, data framework, data integration,
+        data matching, data mining, data mining algorithms, data mining analysis, data mining data,
+        data mining introduction, data mining software,
+        data mining techniques, data representation, data set, datamining,
+        feature extraction, fuzzy k means, genetic algorithm, hadoop,
+        hierarchical clustering, high dimensional, introduction to data mining, kmeans,
+        knowledge discovery, learning approach, learning approaches, learning methods,
+        learning techniques, lucene, machine learning, machine translation, mahout apache,
+        mahout taste, map reduce hadoop, mining data, mining methods, naive bayes,
+        natural language processing,
+        supervised, text mining, time series data, unsupervised, web data mining">
+  <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico">
+  <script type="text/javascript" src="/js/prototype.js"></script>
+  <script type="text/javascript" src="/js/effects.js"></script>
+  <script type="text/javascript" src="/js/search.js"></script>
+  <script type="text/javascript" src="/js/slides.js"></script>
+
+  <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
+  <link href="/css/bootstrap-responsive.css" rel="stylesheet">
+  <link rel="stylesheet" href="/css/global.css" type="text/css">
+	
+  <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown -->
+	<script type="text/x-mathjax-config">
+	MathJax.Hub.Config({
+		tex2jax: {
+			skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+		}
+	});
+	MathJax.Hub.Queue(function() {
+		var all = MathJax.Hub.getAllJax(), i;
+		for(i = 0; i < all.length; i += 1) {
+			all[i].SourceElement().parentNode.className += ' has-jax';
+		}
+	});
+	</script>
+	<script type="text/javascript"
+	  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
+	</script>	
+</head>
+
+<body id="home" data-twttr-rendered="true">
+  <div id="wrap">
+   <div id="header">
+    <div id="logo"><a href="/overview.html"></a></div>
+  <div id="search">
+    <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right">    
+      <input value="http://mahout.apache.org" name="sitesearch" type="hidden">
+      <input class="search-query" name="q" id="query" type="text">
+      <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" />
+    </form>
+  </div>
+
+    <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;">
+      <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <!-- <a class="brand" href="#">Apache Community Development Project</a> -->
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="/">Home</a></li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/general/downloads.html">Downloads</a>
+                  <li><a href="/general/who-we-are.html">Who we are</a>
+                  <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> 
+                  <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li>
+                  <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a>
+                  <li><a href="/general/professional-support.html">Professional Support</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Resources</li>
+                  <li><a href="/general/reference-reading.html">Reference Reading</a>
+		  <li><a href="/general/faq.html">FAQ</a>
+		  <li class="divider"></li>
+		  <li class="nav-header">Legal</li>
+		  <li><a href="http://www.apache.org/licenses/">License</a></li>
+		  <li><a href="http://www.apache.org/security/">Security</a></li>
+                  <li><a href="/general/privacy-policy.html">Privacy Policy</a>
+                </ul>
+              </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/developers/developer-resources.html">Developer resources</a></li>
+                  <li><a href="/developers/version-control.html">Version control</a></li>
+                  <li><a href="/developers/buildingmahout.html">Build from source</a></li>
+                  <li><a href="/developers/issue-tracker.html">Issue tracker</a></li>
+      		  <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Contributions</li>
+                  <li><a href="/developers/how-to-contribute.html">How to contribute</a></li>
+                  <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li>
+                  <li><a href="/developers/gsoc.html">GSoC</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">For committers</li>
+                  <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li>
+                  <li><a href="/developers/patch-check-list.html">Patch check list</a></li>
+                  <li><a href="/developers/how-to-release.html">How to release</a></li>
+                  <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li>
+                </ul>
+               </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a>
+                 <ul class="dropdown-menu">
+                  <li><a href="/users/basics/algorithms.html">List of algorithms</a>
+                  <li><a href="/users/basics/quickstart.html">Quickstart</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Working with text</li>
+                  <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a>
+                  <li><a href="/users/basics/collocations.html">Collocations</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Dimensionality reduction</li>
+                  <li><a href="/users/basics/dimensional-reduction.html">Singular Value Decomposition</a></li>
+                  <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Topic Models</li>      
+                  <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Mahout On Spark</li>      
+                  <li><a href="/users/sparkbindings/home.html">Scala &amp; Spark Bindings *new*</a></li>
+                </ul>
+                 </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+
+              		<li><a href="/users/classification/bayesian.html">Naive Bayes</a></li>
+                  <li><a href="/users/stuff/hidden-markov-models.html">Hidden Markov Models</a></li>
+                  <li><a href="/users/classification/logistic-regression.html">Logistic Regression</a></li>
+                  <li><a href="/users/stuff/partial-implementation.html">Random Forest</a></li>
+
+                  <li class="divider"></li>
+                  <li class="nav-header">Examples</li>
+                  <li><a href="/users/classification/wikipedia-bayes-example.html">Wikipedia example</a></li>
+                  <li><a href="/users/clustering/twenty-newsgroups.html">20 newsgroups example</a></li>
+                  <li><a href="/users/classification/breiman-example.html">Breiman example</a></li>
+                </ul></li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/clustering/k-means-clustering.html">k-Means</a></li>
+                <li><a href="/users/clustering/canopy-clustering.html">Canopy</a></li>
+                <li><a href="/users/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li>
+                <li><a href="/users/clustering/spectral-clustering.html">Spectral Clustering</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Commandline usage</li>
+                <li><a href="/users/clustering/k-means-commandline.html">Options for k-Means</a></li>
+                <li><a href="/users/clustering/canopy-commandline.html">Options for Canopy</a></li>
+            		<li><a href="/users/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Examples</li>
+                <li><a href="/users/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Post processing</li>
+                <li><a href="/users/clustering/cluster-dumper.html">Cluster Dumper tool</a></li>
+                <li><a href="/users/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li>
+                </ul></li>
+                <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a href="/users/recommender/quickstart.html">Quickstart</a></li>
+                <li><a href="/users/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li>
+                <li><a href="/users/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li>
+	        <li><a href="/users/recommender/recommender-documentation.html">General</a></li>
+                </ul></li>
+           </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+</div>
+
+ <div id="sidebar">
+  <div id="sidebar-wrap">
+    <h2>Twitter</h2>
+	<ul class="sidemenu">
+		<li>
+<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a>
+<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
+</li>
+	</ul>
+    <h2>Apache Software Foundation</h2>
+    <ul class="sidemenu">
+      <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li>
+      <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li>
+      <li><a href="http://www.apache.org/dev/">Developer Resources</a></li>
+      <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
+      <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+    </ul>
+    <h2>Related Projects</h2>
+    <ul class="sidemenu">
+      <li><a href="http://lucene.apache.org/">Lucene</a></li>
+      <li><a href="http://hadoop.apache.org/">Hadoop</a></li>
+    </ul>
+  </div>
+</div>
+
+  <div id="content-wrap" class="clearfix">
+   <div id="main">
+    <p><a name="Testing-Intro"></a></p>
+<h1 id="intro">Intro</h1>
+<p>As Mahout matures, solid testing procedures are needed.  This page and its
+children capture test plans along with ideas for improving our testing.</p>
+<p><a name="Testing-TestPlans"></a></p>
+<h1 id="test-plans">Test Plans</h1>
+<ul>
+<li><a href="0.6.html">0.6</a></li>
+<li>Test plans for the 0.6 release:
+there are no special plans beyond unit tests and user testing of the
+Hadoop jobs.</li>
+</ul>
+<p><a name="Testing-TestIdeas"></a></p>
+<h1 id="test-ideas">Test Ideas</h1>
+<p><a name="Testing-Regressions/Benchmarks/Integrations"></a></p>
+<h2 id="regressionsbenchmarksintegrations">Regressions/Benchmarks/Integrations</h2>
+<ul>
+<li>Algorithmic quality and speed are not tested, except in a few instances.
+Such tests often require much longer run times (minutes to hours), a
+running Hadoop cluster, and downloads of large datasets (in the megabytes).</li>
+<li>Standardized speed tests are hard to compare across different hardware.</li>
+<li>Unit tests of external integrations require access to external systems: HDFS,
+S3, JDBC, Cassandra, etc.</li>
+</ul>
+<p>Apache Jenkins is not able to support these environments. Commercial
+donations would help. </p>
+<p><a name="Testing-UnitTests"></a></p>
+<h2 id="unit-tests">Unit Tests</h2>
+<p>Mahout's current tests are almost entirely unit tests. Algorithm tests
+generally feed a few numbers through a code path and verify that the expected
+numbers come out. <code>mvn test</code> runs these tests. There is "positive" coverage
+of a great many utilities and algorithms. A much smaller percentage includes
+"negative" coverage (bogus setups, inputs, combinations).</p>
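+<p>To make the distinction concrete, the sketch below pairs one "positive"
+test with one "negative" test. Mahout's own tests are Java tests run by
+Maven; this Python version only illustrates the idea, and the
+<code>euclidean_distance</code> function and the test names are hypothetical.</p>

```python
import math
import unittest


def euclidean_distance(a, b):
    """Hypothetical function under test: distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class EuclideanDistanceTest(unittest.TestCase):
    def test_known_distance(self):
        # "Positive" coverage: supply a few numbers, check the expected number comes out.
        self.assertAlmostEqual(euclidean_distance([0.0, 0.0], [3.0, 4.0]), 5.0)

    def test_mismatched_lengths_are_rejected(self):
        # "Negative" coverage: a bogus input must fail loudly, not return a number.
        with self.assertRaises(ValueError):
            euclidean_distance([1.0], [1.0, 2.0])
```

+<p>Saved as a module, the two cases can be run with
+<code>python -m unittest</code>.</p>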
+<p><a name="Testing-Other"></a></p>
+<h2 id="other">Other</h2>
+   </div>
+  </div>     
+</div> 
+  <footer class="footer" align="center">
+    <div class="container">
+      <p>
+        Copyright &copy; 2014 The Apache Software Foundation, Licensed under
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache and the Apache feather logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </footer>
+  
+  <script src="/js/jquery-1.9.1.min.js"></script>
+  <script src="/js/bootstrap.min.js"></script>
+  <script>
+    (function() {
+      var cx = '012254517474945470291:vhsfv7eokdc';
+      var gcse = document.createElement('script');
+      gcse.type = 'text/javascript';
+      gcse.async = true;
+      gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
+          '//www.google.com/cse/cse.js?cx=' + cx;
+      var s = document.getElementsByTagName('script')[0];
+      s.parentNode.insertBefore(gcse, s);
+    })();
+  </script>
+</body>
+</html>


