From: Jeff Eastman
Date: Wed, 07 Apr 2010 11:20:33 -0700
To: mahout-user@lucene.apache.org
Subject: Re: MAHOUT-236 Cluster Evaluation Tools?

Hi Robin,

Interesting paper. I'm already beginning to see how to MR the representative point selection. The rest will hopefully become clearer with more study. Lots of MR jobs are needed to: a) get the data into Vectors, b) iterate (e.g. kmeans) over the data to produce a set of clusters, c) cluster the data, d) iterate over the clustered data to derive representative points for each cluster, and finally e) produce the CDbw.

And, of course, all of this is iterated again with different values for the clustering algorithm's parameters. That should keep the lights on at PG&E, producing power for the server farms.

Robin Anil wrote:
> Hi Jeff,
> This is a good paper with a simple measure of cluster quality
> based on intra-cluster density and inter-cluster separation. It's
> pretty easy to compute. We need to make it a map/reduce job.
> http://docs.google.com/viewer?a=v&q=cache:z5p9n04cBQEJ:www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf+clustering+quality&hl=en&gl=in&pid=bl&srcid=ADGEESiC-ocW6IWrKR4cb1t1ZqkzRKQ3tDv4UFBkVaUKU0gG3kADcPWIjs-60A0912nu8MFPsVM3pf9jKrP98dL-B-BaiOC9LObBS3VkJK6Mu6josZtVegLxp3BftduD3hFxtGOVZK_b&sig=AHIEtbSZwtgw9wmJoojQn7Dlz5OL67vICw
> Robin
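
Step d) above, deriving representative points for each cluster, might be sketched in memory roughly as follows. This is only an illustration under stated assumptions: it uses a farthest-first selection (first the point farthest from the cluster centroid, then repeatedly the point farthest from the representatives chosen so far), which may or may not match the procedure in the paper or whatever ends up in Mahout. The function names, the Euclidean distance, and the `(cluster_id, point)` input format are all hypothetical. The grouping step stands in for the map phase and the per-cluster selection for the reduce phase.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length tuples (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def representative_points(points, count):
    """Pick up to `count` representatives by farthest-first selection.

    Assumption: the first pick is the point farthest from the centroid;
    each later pick maximizes its distance to the nearest chosen point.
    """
    center = centroid(points)
    chosen = []
    candidates = list(points)
    while candidates and len(chosen) < count:
        anchors = chosen or [center]
        best = max(candidates,
                   key=lambda p: min(distance(p, a) for a in anchors))
        chosen.append(best)
        candidates.remove(best)
    return chosen


def representatives_by_cluster(assignments, count):
    """Group (cluster_id, point) pairs, then select per cluster.

    The grouping mimics a map phase keyed by cluster id; the selection
    mimics the reduce phase for each key.
    """
    clusters = {}
    for cluster_id, point in assignments:
        clusters.setdefault(cluster_id, []).append(point)
    return {cid: representative_points(pts, count)
            for cid, pts in clusters.items()}


if __name__ == "__main__":
    clustered = [(0, (0.0, 0.0)), (0, (1.0, 0.0)), (0, (4.0, 0.0)),
                 (1, (10.0, 10.0)), (1, (10.0, 12.0))]
    print(representatives_by_cluster(clustered, 2))
```

A real map/reduce version would presumably need one pass over the clustered data per representative added, which fits the remark above about the number of jobs involved.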