From: Ted Dunning
Date: Sun, 24 Feb 2013 09:11:24 -0800
Subject: Re: Plotting cluster quality
To: user@mahout.apache.org, david_murgatroyd@hotmail.com

I spoke off-line to Dan and he confirmed your inference. Color was just
there for visual esthetics.

On Sun, Feb 24, 2013 at 6:18 AM, David Murgatroyd wrote:

> >What does color mean here? What about width of the box?
> FWIW, I infer color is solely for visual distinction -- rotating through
> orange, red, yellow, pink from left to right. I infer width is proportional
> to count of items in each cluster, though apparently not linearly.
>
> I agree that a single plot comparing the algorithms is important since the
> purpose of the plot is to compare the algorithms rather than better
> understand the data on which they've been run. I haven't thought of a good
> way to do that while still having a cluster-by-cluster visual element.
>
> On Fri, Feb 22, 2013 at 12:47 PM, Ted Dunning wrote:
>
> > What does color mean here?
> >
> > What about width of the box?
> >
> > When you say median or mean of all cluster distances, do you mean across
> > that single run?
> >
> > I think that this plot is fine as it is except that it needs a legend that
> > explains all of these issues. My general rule of thumb is that most
> > figures should have what I call a "Kipling caption". See the caption of
> > the first image here: http://www.boop.org/jan/justso/butter.htm to see what
> > I mean by this.
> > Imagine that there is a very mathematically inclined 4 year old who is
> > looking at your diagram and quizzing you about every part. Answer all
> > their questions in the caption and you have a Kipling caption.
> >
> > For comparing different runs of the clustering or different algorithms, I
> > think that a cumulative distribution plot (using plot.ecdf) with all of the
> > different algorithms on one plot would be the best comparison tool.
> >
> > On Fri, Feb 22, 2013 at 8:33 AM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
> >
> > > As most of the regulars know, I'm working with Ted Dunning on a new
> > > clustering framework for Mahout that should land in 0.8.
> > >
> > > Part of my work is comparing the clustering quality of the new code
> > > with the existing Mahout implementation.
> > >
> > > I compiled a CSV of the quality data [1]. I ran 5 runs of the
> > > clustering on the 20 newsgroups data set comparing Mahout KMeans (km),
> > > Ball KMeans (bkm), Streaming KMeans (skm) and Streaming KMeans
> > > followed by Ball KMeans (bskm).
> > >
> > > I'm now looking at making some appealing plots for the data. For
> > > instance, I think I want to make box plots of individual clustering
> > > runs. Here's an example [2] of what a clustering looks like for one
> > > run of Mahout's standard k-means.
> > >
> > > There's a box for each cluster, the mean distance is the thick line,
> > > the limits are the 1st and 3rd quartiles and the whiskers are the min
> > > and max distances.
> > > The blue horizontal line is the mean of all average cluster distances.
> > > The green horizontal line is the median of all average cluster distances.
> > >
> > > I intend to make similar plots for the other runs and then
> > > aggregate the means of the runs into box plots for the different
> > > classes of k-means.
> > > The main result being that streaming k-means + ball k-means (as done
> > > in the MR) gives a high quality clustering.
> > >
> > > How do you feel about this plot? Is it too dense? Too colorful? Should
> > > I not draw the median any more?
> > > What are some other good ways of plotting the quality given the data set?
> > >
> > > Thanks!
> > >
> > > [1] https://github.com/dfilimon/mahout/blob/skm/examples/src/main/resources/kmeans-comparison-nospace.csv
> > > [2] http://swarm.cs.pub.ro/~dfilimon/skm-mahout/Mahout%20KMeans%20Run%201.pdf
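
Ted's suggestion earlier in the thread, a single cumulative distribution plot with every algorithm on it, refers to R's plot.ecdf. As a rough illustration of what those curves contain, here is a minimal Python sketch that computes an empirical CDF per algorithm. The `ecdf` helper and the distance values are invented for illustration only; they are not taken from Dan's CSV, and only the labels `km` and `bskm` come from the thread.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("ecdf needs at least one value")

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


# Hypothetical point-to-centroid distances, made up for this sketch.
distances = {
    "km": [0.9, 1.1, 1.3, 1.6, 2.0],
    "bskm": [0.7, 0.8, 1.0, 1.2, 1.5],
}

# One ECDF per algorithm, so all of them can be read off the same axis.
curves = {name: ecdf(d) for name, d in distances.items()}

for x in (0.8, 1.0, 1.2, 1.5):
    row = "  ".join(f"{name}={curves[name](x):.2f}" for name in sorted(curves))
    print(f"distance <= {x}: {row}")
```

In this sketch a curve that rises sooner means a larger share of points sits close to its centroid. Whether that pattern holds for the real 20 newsgroups runs would have to be checked against the actual data.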