mahout-dev mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: Emitting distance from centroid for K-Means
Date Wed, 13 Jul 2011 21:38:36 GMT
Mostly. Clustering assigns points to one or more clusters, and it uses the distance measure
or model pdf to do this. So the distance from each point to the cluster center is calculated
in this step but thrown away once the assignments are made. This information could be
output to another file, or a different version could output the distance directly instead of
the pdf. I don't know what that would mean for Dirichlet, however, since it only plays with
pdf values.
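
To illustrate the idea of keeping the distance rather than discarding it, here is a minimal
stand-alone sketch. It is not Mahout code: the class NearestCentroid, its Assignment type, and
the choice of Euclidean distance are all assumptions made for the example, and a real version
would use whichever distance measure the clustering run was configured with.

```java
// Hypothetical stand-alone sketch; it does not use any Mahout classes.
final class NearestCentroid {

    /** Result of assigning one point: the chosen cluster and the distance to its centroid. */
    static final class Assignment {
        final int clusterId;
        final double distance;

        Assignment(int clusterId, double distance) {
            this.clusterId = clusterId;
            this.distance = distance;
        }
    }

    /** Euclidean distance; an assumed stand-in for the configured distance measure. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Assigns the point to its nearest centroid and returns the distance
     * alongside the cluster id instead of throwing it away.
     */
    static Assignment assign(double[] point, double[][] centroids) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centroids.length; k++) {
            double d = euclidean(point, centroids[k]);
            if (d < bestDistance) {
                bestDistance = d;
                best = k;
            }
        }
        return new Assignment(best, bestDistance);
    }
}
```

A caller could write the returned distance to a side file next to the cluster id, which leaves
the weight of 1 in the existing output untouched.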

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Wednesday, July 13, 2011 1:36 PM
To: dev@mahout.apache.org
Subject: Re: Emitting distance from centroid for K-Means

Isn't --clustering the post processing step that already does it?

On Jul 13, 2011, at 4:31 PM, Jeff Eastman wrote:

> Well, distance is dependent upon the distance measure you want to use. A post-processing
> step could easily calculate this. The ClusterEvaluator may have some methods that could be
> useful. It calculates a set of representative points for each cluster and calculates interCluster
> and intraCluster densities from that.
> 
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org] 
> Sent: Wednesday, July 13, 2011 1:28 PM
> To: dev@mahout.apache.org
> Subject: Re: Emitting distance from centroid for K-Means
> 
> Good to know.  Next question, what's the preferred way, then, to get out either the distance
> or what Ted said?
> 
> -Grant
> 
> On Jul 13, 2011, at 4:25 PM, Ted Dunning wrote:
> 
>> I take back what I said.
>> 
>> Jeff is correct.
>> 
>> On Wed, Jul 13, 2011 at 1:23 PM, Jeff Eastman <jeastman@narus.com> wrote:
>> 
>>> The weight is the probability the vector is a member of the cluster. For
>>> FuzzyK and Dirichlet it is fractional, for KMeans it is 1 as the algorithm
>>> is maximum likelihood and each point is only assigned to a single cluster.
>>> 
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>>> Sent: Wednesday, July 13, 2011 1:11 PM
>>> To: dev@mahout.apache.org
>>> Subject: Emitting distance from centroid for K-Means
>>> 
>>> Does it make sense to output the distance to the cluster as the weight in
>>> the KMeansClusterer.outputPointWithClusterInfo method instead of 1?  What's
>>> the purpose of the 1 as the weight?
>>> 
>>> -Grant
>>> 
>>> 
>>> 
> 
> --------------------------
> Grant Ingersoll
> 
> 
> 

--------------------------
Grant Ingersoll



