mahout-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Emitting distance from centroid for K-Means
Date Thu, 14 Jul 2011 00:34:37 GMT

On Jul 13, 2011, at 6:42 PM, Jeff Eastman wrote:

> +1 Patch looks reasonable enough. You'd need to modify the other clustering algorithms to achieve uniformity.

Not sure it needs uniformity, but I can do it. As you pointed out, some of the other implementations don't have the same info, so they need not go to the trouble of producing it. Also, the change only applies to the output of --clustering, so it shouldn't affect the iterations, right?

> 
> The assumption about input seeds originally came from using Canopy to prime KMeans, but it has become the prior set of clusters since the algorithms have converged on common formats & models. Each iteration reads in the set of clusters-n and outputs clusters-n+1, so changing this would have broad impact. FuzzyK and Dirichlet use the same iteration semantics, and the ClusterIterator depends on this for unification with the classification interfaces.
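
[Editor's note: the clusters-n → clusters-n+1 iteration described above can be sketched in plain Java. This is an in-memory illustration only; Mahout actually runs this as MapReduce jobs over SequenceFiles, and the class and method names below are hypothetical, not Mahout API.]

```java
// Illustrative in-memory sketch of one KMeans iteration: the centers read
// from "clusters-n" are used to assign points, and the recomputed centers
// become "clusters-(n+1)". Not Mahout code; names are hypothetical.
public class KMeansIteration {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // One iteration: assign each point to its nearest center, then return
    // the mean of each cluster's points as the next set of centers.
    public static double[][] iterate(double[][] points, double[][] centers) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best])) {
                    best = c;
                }
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) {
                sums[best][i] += p[i];
            }
        }
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < dim; i++) {
                // Keep the old center if no points were assigned to it.
                next[c][i] = counts[c] == 0 ? centers[c][i] : sums[c][i] / counts[c];
            }
        }
        return next;
    }
}
```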



> 
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org] 
> Sent: Wednesday, July 13, 2011 3:08 PM
> To: dev@mahout.apache.org
> Subject: Re: Emitting distance from centroid for K-Means
> 
> I put up a patch; do you think it looks reasonable? I'm not totally thrilled by it, but it is a start.
> 
> On a related note, is there any reason why the input seeds can't be Vectors as an alternative to Cluster?
> 
> -Grant
> 
> On Jul 13, 2011, at 5:38 PM, Jeff Eastman wrote:
> 
>> Mostly. Clustering assigns points to one or more clusters, and it uses the distance measure or model pdf to do this. So the distance from each point to the cluster center is calculated in this step but thrown away once the assignment(s) is (are) made. This information could be output to another file, or a different version could output the distance directly instead of the pdf. I don't know what that would mean for Dirichlet, though, since it only plays with pdf values.
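
[Editor's note: the point Jeff makes here, that the per-point distance is already computed at assignment time and then discarded, can be illustrated with a small sketch. This is plain Java, not Mahout's KMeansClusterer; all names are hypothetical. The patch's idea is simply to keep this by-product and emit it rather than drop it.]

```java
// Sketch of nearest-centroid assignment that keeps the distance instead of
// throwing it away. Illustrative only; not Mahout API.
public class NearestCentroid {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns {nearestClusterIndex, distanceToThatCenter}: the assignment
    // plus the distance the thread proposes emitting as the weight.
    public static double[] assign(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = euclidean(point, centers[0]);
        for (int c = 1; c < centers.length; c++) {
            double d = euclidean(point, centers[c]);
            if (d < bestDist) {
                best = c;
                bestDist = d;
            }
        }
        return new double[] {best, bestDist};
    }
}
```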
>> 
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org] 
>> Sent: Wednesday, July 13, 2011 1:36 PM
>> To: dev@mahout.apache.org
>> Subject: Re: Emitting distance from centroid for K-Means
>> 
>> Isn't --clustering the post processing step that already does it?
>> 
>> On Jul 13, 2011, at 4:31 PM, Jeff Eastman wrote:
>> 
>>> Well, distance is dependent upon the distance measure you want to use. A post-processing step could easily calculate this. The ClusterEvaluator may have some methods that could be useful. It calculates a set of representative points for each cluster and calculates interCluster and intraCluster densities from that.
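
[Editor's note: for a rough feel of what an intra-cluster measure looks like, here is one simple notion, the mean distance of a cluster's points to its centroid. This is illustrative only; Mahout's ClusterEvaluator works from representative points and uses its own density definitions, and the name below is hypothetical.]

```java
// One simple intra-cluster compactness measure: mean Euclidean distance of
// the cluster's points to their centroid. Illustrative only; this is not
// the definition used by Mahout's ClusterEvaluator.
public class IntraCluster {

    public static double meanDistanceToCentroid(double[][] points) {
        int dim = points[0].length;
        double[] centroid = new double[dim];
        for (double[] p : points) {
            for (int i = 0; i < dim; i++) {
                centroid[i] += p[i] / points.length;
            }
        }
        double total = 0.0;
        for (double[] p : points) {
            double sum = 0.0;
            for (int i = 0; i < dim; i++) {
                double d = p[i] - centroid[i];
                sum += d * d;
            }
            total += Math.sqrt(sum);
        }
        return total / points.length;
    }
}
```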
>>> 
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:gsingers@apache.org] 
>>> Sent: Wednesday, July 13, 2011 1:28 PM
>>> To: dev@mahout.apache.org
>>> Subject: Re: Emitting distance from centroid for K-Means
>>> 
>>> Good to know. Next question: what's the preferred way, then, to get out either the distance or what Ted said?
>>> 
>>> -Grant
>>> 
>>> On Jul 13, 2011, at 4:25 PM, Ted Dunning wrote:
>>> 
>>>> I take back what I said.
>>>> 
>>>> Jeff is correct.
>>>> 
>>>> On Wed, Jul 13, 2011 at 1:23 PM, Jeff Eastman <jeastman@narus.com> wrote:
>>>> 
>>>>> The weight is the probability the vector is a member of the cluster. For FuzzyK and Dirichlet it is fractional; for KMeans it is 1, since the algorithm is maximum likelihood and each point is only assigned to a single cluster.
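
[Editor's note: Jeff's distinction can be sketched numerically. The fuzzy formula below is the standard FuzzyKMeans membership with fuzziness m > 1, not a copy of Mahout's code, and the class name is hypothetical.]

```java
// Illustrative only: why KMeans emits weight 1 while FuzzyK emits a
// fraction. Names are hypothetical, not Mahout API.
public class ClusterWeights {

    // Hard (KMeans-style) assignment: the winning cluster gets weight 1.
    public static double hardWeight() {
        return 1.0;
    }

    // Fuzzy (FuzzyK-style) membership of a point in cluster i, given its
    // distances to all cluster centers and fuzziness m > 1:
    //   u_i = (1/d_i)^(2/(m-1)) / sum_k (1/d_k)^(2/(m-1))
    // The memberships across clusters sum to 1.
    public static double fuzzyWeight(double[] distances, int i, double m) {
        double exp = 2.0 / (m - 1.0);
        double sum = 0.0;
        for (double d : distances) {
            sum += Math.pow(1.0 / d, exp);
        }
        return Math.pow(1.0 / distances[i], exp) / sum;
    }
}
```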
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>>>>> Sent: Wednesday, July 13, 2011 1:11 PM
>>>>> To: dev@mahout.apache.org
>>>>> Subject: Emitting distance from centroid for K-Means
>>>>> 
>>>>> Does it make sense to output the distance to the cluster as the weight in the KMeansClusterer.outputPointWithClusterInfo method instead of 1? What's the purpose of the 1 as the weight?
>>>>> 
>>>>> -Grant
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> --------------------------
>>> Grant Ingersoll
>>> 
>>> 
>>> 
>> 
>> --------------------------
>> Grant Ingersoll
>> 
>> 
>> 
> 
> --------------------------
> Grant Ingersoll
> 
> 
> 

--------------------------
Grant Ingersoll



