Subject: svn commit: r1544109 - /mahout/site/mahout_cms/trunk/content/users/clustering/dirichlet-process-clustering.mdtext
Date: Thu, 21 Nov 2013 11:07:08 -0000
To: commits@mahout.apache.org
From: isabel@apache.org
Reply-To: dev@mahout.apache.org
Message-Id: <20131121110708.CF5C1238896F@eris.apache.org>

Author: isabel
Date: Thu Nov 21 11:07:08 2013
New Revision: 1544109

URL: http://svn.apache.org/r1544109
Log:
MAHOUT-1245 - fixing images

Modified:
    mahout/site/mahout_cms/trunk/content/users/clustering/dirichlet-process-clustering.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/clustering/dirichlet-process-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/clustering/dirichlet-process-clustering.mdtext?rev=1544109&r1=1544108&r2=1544109&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/clustering/dirichlet-process-clustering.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/clustering/dirichlet-process-clustering.mdtext Thu Nov 21 11:07:08 2013
@@ -1,6 +1,7 @@
 Title: Dirichlet Process Clustering
+
-# Overview
+# Overview Dirichlet process clustering
 
 The Dirichlet Process Clustering algorithm performs Bayesian mixture modeling.
 
@@ -99,10 +100,8 @@ Invocation using the command line takes
  -a0 -x \
  -k \
- -md
-\
- -mp \
+ -md \
+ -mp \
  -dm
  -ow
  -cl
@@ -171,9 +170,9 @@ The points are generated as follows:
 In the first image, the points are plotted and the 3-sigma boundaries of their generator are superimposed. It is, of course, impossible to tell which model actually generated each point as there is some probability -
-perhaps small - that any of the models could have generated every point.
+perhaps small - that any of the models could have generated every point
 
-!SampleData.png!
+![dirichlet](../../images/SampleData.png)
 
 In the next image, the Dirichlet Process Clusterer is run against the sample points using a NormalModelDistribution with m=\[0.0, 0.0\](0.0,-0.0\.html) sd=1.0. This distribution represents the least amount of prior
@@ -190,7 +189,7 @@ As Dirichlet clustering is an iterative
 . These illustrate the cluster convergence process over the last several iterations and can be helpful in tuning the algorithm.
 
-!DirichletN.png!
+![dirichlet](../../images/DirichletN.png)
 
 The next image improves upon this situation by using a SampledNormalDistribution.
 In this distribution, the prior models have
@@ -201,14 +200,14 @@ different pdf for each point and the ite
 more-likely models given this value. The result is a decent capture of the sample data parameters but there is still some over-fitting.
 
-!DirichletSN.png!
+![dirichlet](../../images/DirichletSN.png)
 
 The above image was run through 20 iterations and the cluster assignments are clearly moving indicating the clustering is not yet converged. The next image runs the same model for 40 iterations, producing an accurate model of the input data.
 
-!DirichletSN40.png!
+![dirichlet](../../images/DirichletSN40.png)
 
 The next image uses an AsymmetricSampledNormalDistribution in which the model's standard deviation is also represented as a 2-d vector. This causes
@@ -218,7 +217,7 @@ the actual sample data quite well. Had w
 generated in a similar manner then this distribution would have been the most logical model.
 
-!DirichletASN.png!
+![dirichlet](../../images/DirichletASN.png)
 
 In order to explore an asymmetrical sample data distribution, the following image shows a number of points generated according to the following
@@ -231,14 +230,14 @@ parameters. Again, the generator's 3-sig
 * 300 samples m=\[0.0, 2.0\](0.0,-2.0\.html) sd=\[0.1, 0.5\]
 
-!AsymmetricSampleData.png!
+![dirichlet](../../images/AsymmetricSampleData.png)
 
 The following image shows the results of applying the symmetrical SampledNormalDistribution to the asymmetrically-generated sample data. It does a valiant effort but does not capture a very good set of models because the circular model assumption does not fit the data.
 
-!2dDirichletSN.png!
+![dirichlet](../../images/2dDirichletSN.png)
 
 Finally, the AsymmetricSampledNormalDistribution is run against the asymmetrical sample data. Though there is some over-fitting, it does a
@@ -248,8 +247,8 @@ slightly different results. Compare the
 for 20 iterations with another run of numClusters=40 models for 40 iterations.
 
-!2dDirichletASN.png!
-!2dDirichletASN4040.png!
+![dirichlet](../../images/2dDirichletASN.png)
+![dirichlet](../../images/2dDirichletASN4040.png)
 
 # References
 
@@ -263,4 +262,4 @@ model is found.
 The Neal and Blei references from the McCullagh and Yang paper are also good. Zoubin Gharamani has some very [nice tutorials out which describe why non-parametric Bayesian approaches to problems are very cool](http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf)
-, there are video versions about as well.
+, there are video versions about as well.
\ No newline at end of file
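
The image changes in this diff all follow one pattern: a line of the form `!Name.png!` becomes `![dirichlet](../../images/Name.png)`. The commit does not say how the edit was made, so the following is only a minimal sketch of how the same rewrite could be done with a regular expression. The function name `convert_image_macros` and its parameters are hypothetical and not part of any Mahout or CMS tooling.

```python
import re

# Matches a line consisting only of an old-style image macro such as
# "!SampleData.png!" (optionally followed by trailing spaces or tabs).
IMAGE_MACRO = re.compile(r"^!([\w.-]+\.png)![ \t]*$", re.MULTILINE)


def convert_image_macros(text, alt="dirichlet", prefix="../../images/"):
    """Rewrite "!Name.png!" lines as Markdown image links.

    The alt text and path prefix default to the values seen in the diff;
    both are plain parameters so other pages could use different ones.
    """
    return IMAGE_MACRO.sub(
        lambda m: f"![{alt}]({prefix}{m.group(1)})", text
    )


if __name__ == "__main__":
    sample = "some text\n\n!2dDirichletASN.png!\n!2dDirichletASN4040.png!\n"
    print(convert_image_macros(sample))
```

Anchoring the pattern to whole lines keeps ordinary prose containing exclamation marks untouched; only lines that are nothing but an image macro are rewritten.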