mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: kMeans Help
Date Sun, 28 Jun 2009 21:55:01 GMT

On Jun 28, 2009, at 4:56 PM, Grant Ingersoll wrote:

> I get all of this, my point is that when you rehydrate the Cluster,  
> it doesn't properly report the centroid per my email all because  
> numPoints == 0 and pointTotal is a a vector that is the same as the  
> passed in center vector, but initialized to 0.
>

In other words, the simple act of serializing a Cluster to HDFS and  
then reconstituting it should not alter the result one gets, which I  
believe is what happens if one dumps out the clusters that have been  
calculated after the whole process is done.


>
>
> On Jun 27, 2009, at 11:42 AM, Jeff Eastman wrote:
>
>> I think this comment is on the right track. During an iteration,  
>> each cluster is created with a center and no points. Then, as each  
>> point is compared against the cluster centers, it is added to the  
>> closest cluster. If the initial center is considered to be a point,  
>> then it will bias the new centroid calculation towards its center,  
>> incorrectly, as shown below.
>>
>> One could argue that the centroid of a degenerate cluster with no  
>> points ought to be its center and not a zero vector, but clusters  
>> with points should have centroids that do not include it.
>>
>> nfantone wrote:
>>> On Sat, Jun 27, 2009 at 8:10 AM, Grant  
>>> Ingersoll<gsingers@apache.org> wrote:
>>>
>>>> On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
>>>>
>>>>
>>>>> The semantics of constructing a Cluster are odd to me.  Do I  
>>>>> always have
>>>>> to immediately add a point to the Cluster in order for it to be  
>>>>> "real",
>>>>> despite the fact that I added a Center?  Isn't adding a Center  
>>>>> effectively
>>>>> giving the Cluster one point?
>>>>>
>>>>>
>>>
>>> Perhaps I misunderstood you, but I think that by assigning a new  
>>> point
>>> (by calling addPoint(Vector)) to a Cluster does not mean you are
>>> "adding a center". A center is specified at the beginning of the
>>> algorithm and every iteration, after including a set of new points,
>>> recalculates that center by determining a new means - which is now  
>>> the
>>> centroid of that particular Cluster. So, clearly, the center  
>>> itself is
>>> a proper point in the Cluster and you don't need to add it after  
>>> being
>>> selected as that in order for it to be "real".
>>>
>>>
>>>> And if you add the center, why isn't it the centroid until other  
>>>> points are
>>>> added?
>>>>
>>>>
>>>
>>> Again, the centroid is the result of a recalculation of a means and
>>> may or may not be a real point. By having just one point in a  
>>> Cluster
>>> - that is to say, its center - there's no "recalculation" to be  
>>> done.
>>> Conceptually, you could say the centroid lies, in fact, in the  
>>> center
>>> - though, it's not relevant to the algorithm.
>>>
>>> A final example. Let's say you create a Cluster C with point (1,1)  
>>> as
>>> its center. Then, you add (3,3) to it.
>>>
>>> Cluster C: (1,1);(3,3) - original center: (1,1) - centroid: (2,2)
>>>
>>> Now, you create another Cluster C' with the same center, but  
>>> decide to
>>> add the point again. Then, (3,3) is added.
>>>
>>> Cluster C': (1,1);(1,1);(3,3) - original center: (1,1) - centroid  
>>> (5/3, 5/3).
>>>
>>> Ok, that was an unnecesary example. Got it. But it shows that C  
>>> and C'
>>> are not the same cluster, based on the fact that point repetition
>>> contribute to a general means.
>>>
>>>
>>>
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Mime
View raw message