From: Jeff Eastman
Date: Sat, 27 Jun 2009 08:42:03 -0700
To: mahout-user@lucene.apache.org
Subject: Re: kMeans Help

I think this comment is on the right track. During an iteration, each
cluster is created with a center and no points. Then, as each point is
compared against the cluster centers, it is added to the closest
cluster. If the initial center is considered to be a point, then it will
bias the new centroid calculation towards its center, incorrectly, as
shown below. One could argue that the centroid of a degenerate cluster
with no points ought to be its center and not a zero vector, but
clusters with points should have centroids that do not include it.

nfantone wrote:
> On Sat, Jun 27, 2009 at 8:10 AM, Grant Ingersoll wrote:
>
>> On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
>>
>>> The semantics of constructing a Cluster are odd to me. Do I always have
>>> to immediately add a point to the Cluster in order for it to be "real",
>>> despite the fact that I added a Center? Isn't adding a Center effectively
>>> giving the Cluster one point?
>
> Perhaps I misunderstood you, but I think that assigning a new point
> (by calling addPoint(Vector)) to a Cluster does not mean you are
> "adding a center".
> A center is specified at the beginning of the
> algorithm and every iteration, after including a set of new points,
> recalculates that center by determining a new mean - which is now the
> centroid of that particular Cluster. So, clearly, the center itself is
> a proper point in the Cluster and you don't need to add it after being
> selected as that in order for it to be "real".
>
>> And if you add the center, why isn't it the centroid until other points are
>> added?
>
> Again, the centroid is the result of a recalculation of a mean and
> may or may not be a real point. By having just one point in a Cluster
> - that is to say, its center - there's no "recalculation" to be done.
> Conceptually, you could say the centroid lies, in fact, in the center
> - though, it's not relevant to the algorithm.
>
> A final example. Let's say you create a Cluster C with point (1,1) as
> its center. Then, you add (3,3) to it.
>
> Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)
>
> Now, you create another Cluster C' with the same center, but decide to
> add the point again. Then, (3,3) is added.
>
> Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid: (5/3, 5/3)
>
> Ok, that was an unnecessary example. Got it. But it shows that C and C'
> are not the same cluster, based on the fact that point repetition
> contributes to the general mean.
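
For comparison with the quoted example, here is a minimal Java sketch of
the other behavior discussed in this thread: the center is not counted
as a point, and a cluster with no points falls back to its center
instead of a zero vector. SketchCluster and its fields are hypothetical
names made up for this illustration; this is not Mahout's Cluster class,
and it uses plain double[] rather than Mahout's Vector.

```java
import java.util.Arrays;

// Hypothetical illustration only; NOT Mahout's Cluster implementation.
final class SketchCluster {
    private final double[] center;     // center assigned when the cluster is created
    private final double[] pointTotal; // running sum of the points added so far
    private int numPoints;             // count of added points (center excluded)

    SketchCluster(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length];
    }

    // Adds a point to the running sum; the center itself is never added here.
    void addPoint(double[] point) {
        for (int i = 0; i < pointTotal.length; i++) {
            pointTotal[i] += point[i];
        }
        numPoints++;
    }

    // Mean of the added points; a degenerate (empty) cluster returns its center.
    double[] computeCentroid() {
        if (numPoints == 0) {
            return center.clone();
        }
        double[] centroid = new double[pointTotal.length];
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] = pointTotal[i] / numPoints;
        }
        return centroid;
    }

    public static void main(String[] args) {
        // Degenerate cluster: no points, so the centroid is the center.
        SketchCluster empty = new SketchCluster(new double[] {1, 1});
        System.out.println(Arrays.toString(empty.computeCentroid())); // [1.0, 1.0]

        // Center (1,1) plus the point (3,3): the center does not bias the mean.
        SketchCluster c = new SketchCluster(new double[] {1, 1});
        c.addPoint(new double[] {3, 3});
        System.out.println(Arrays.toString(c.computeCentroid())); // [3.0, 3.0]

        // Explicitly adding the center as a point reproduces the (2,2) of Cluster C.
        SketchCluster biased = new SketchCluster(new double[] {1, 1});
        biased.addPoint(new double[] {1, 1});
        biased.addPoint(new double[] {3, 3});
        System.out.println(Arrays.toString(biased.computeCentroid())); // [2.0, 2.0]
    }
}
```

Under these assumptions the centroid of C comes out as (3,3) rather than
(2,2), which is the bias being described: counting the center as a point
pulls the new centroid back towards the old center.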