mahout-user mailing list archives

From nfantone <nfant...@gmail.com>
Subject Re: kMeans Help
Date Sat, 27 Jun 2009 14:18:46 GMT
On Sat, Jun 27, 2009 at 8:10 AM, Grant Ingersoll<gsingers@apache.org> wrote:
>
> On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
>
>>
>> The semantics of constructing a Cluster are odd to me.  Do I always have
>> to immediately add a point to the Cluster in order for it to be "real",
>> despite the fact that I added a Center?  Isn't adding a Center effectively
>> giving the Cluster one point?
>>

Perhaps I misunderstood you, but I think that assigning a new point
to a Cluster (by calling addPoint(Vector)) does not mean you are
"adding a center". A center is specified at the beginning of the
algorithm, and every iteration, after including a set of new points,
recalculates that center by computing a new mean, which becomes the
centroid of that particular Cluster. So the center itself already
counts as a point in the Cluster, and you don't need to add it again
after it has been selected for the Cluster to be "real".

> And if you add the center, why isn't it the centroid until other points are
> added?
>

Again, the centroid is the result of recalculating a mean and
may or may not be a real point. With just one point in a Cluster
- that is to say, its center - there's no "recalculation" to be done.
Conceptually, you could say the centroid coincides with the center,
though that's not relevant to the algorithm.

A final example. Let's say you create a Cluster C with point (1,1) as
its center. Then, you add (3,3) to it.

Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)

Now, you create another Cluster C' with the same center, but decide to
also add that center, (1,1), explicitly as a point. Then, (3,3) is added.

Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid: (5/3, 5/3)

OK, that was an unnecessary example. Got it. But it shows that C and C'
are not the same cluster, because repeated points contribute to the
overall mean.
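
To check the arithmetic in the example, here is a minimal sketch in plain
Python. It is not Mahout's Cluster class: the centroid() helper and the
list-of-tuples representation are assumptions made only for illustration,
with the center counted as one of the cluster's points, as described above.

```python
def centroid(points):
    """Return the arithmetic mean of a non-empty list of equal-length points."""
    if not points:
        raise ValueError("a cluster needs at least one point")
    n = len(points)
    dims = len(points[0])
    # Average each coordinate independently across all points.
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


# Cluster C: center (1,1), then (3,3) added.
c = [(1, 1), (3, 3)]

# Cluster C': same center, the center added again as a point, then (3,3).
c_prime = [(1, 1), (1, 1), (3, 3)]

# A cluster holding only its center: the mean is the center itself.
single = [(1, 1)]

print(centroid(c))        # (2.0, 2.0)
print(centroid(c_prime))  # roughly (1.667, 1.667), i.e. (5/3, 5/3)
print(centroid(single))   # (1.0, 1.0)
```

The repeated (1,1) in C' pulls the mean toward the center, which is why C
and C' end up with different centroids.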
