mahout-user mailing list archives

From Juan Francisco Contreras Gaitan <juanfcocontre...@hotmail.com>
Subject RE: String clustering and other newbie questions
Date Tue, 01 Sep 2009 12:06:47 GMT

OK, I see. Sorry for my lack of knowledge on these matters (I am going to read all the
documentation you gave me closely).

But if I understood you correctly, and as far as I know, Mahout has its own k-means implementation.
Could I then use it for my purposes instead of a DP-like setup?

Thank you very much, Isabel.

Regards,
jfcg

> Date: Tue, 1 Sep 2009 08:23:05 +0200
> From: isabel@apache.org
> To: mahout-user@lucene.apache.org
> Subject: Re: String clustering and other newbie questions
> 
> On Mon, 31 Aug 2009 14:02:08 +0200
> Juan Francisco Contreras Gaitan <juanfcocontreras@hotmail.com> wrote:
> 
> > Thank you very much for your answer, but I don't think I understand
> > it very well. Could you give me some more details?
> 
> Taking up that question. Ted, please correct me wherever I'm
> wrong.
> 
> 
> > For example, what does 'DP' stand for?
> 
> DP stands for Dirichlet process, sometimes also referred to as the
> "Chinese restaurant process". There is a nice Wikipedia page on
> Dirichlet processes themselves: http://en.wikipedia.org/wiki/Dirichlet_process
> 
> How they were employed to implement a clustering algorithm in Mahout
> is explained on one of our wiki pages (including references to the
> original papers):
> 
> http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html
> 
> 
> > You can see an example of what I would like to
> > do in my previous answer.
> 
> In a k-means-like setup, you would implement your own distance
> (Levenshtein in your case) and use that to assign items to clusters
> during the E(xpectation) step. After that you would employ your own
> implementation of a centroid selection algorithm for recomputing
> cluster centroids during the M(aximisation) step.
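
As a rough illustration of the k-means-like loop described above, here is a minimal Python sketch. It is not Mahout code: the function names (levenshtein, assign, recompute_centers, cluster_strings) are made up for this example, and picking the medoid (the cluster member with the smallest total distance to the other members) is only one assumed choice of "centroid selection algorithm" for strings.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def assign(strings, centers):
    """Assignment step: hard-assign each string to its nearest center."""
    clusters = [[] for _ in centers]
    for s in strings:
        nearest = min(range(len(centers)),
                      key=lambda k: levenshtein(s, centers[k]))
        clusters[nearest].append(s)
    return clusters


def recompute_centers(clusters, old_centers):
    """Update step: pick each cluster's medoid (an assumed choice)."""
    centers = []
    for members, old in zip(clusters, old_centers):
        if not members:
            centers.append(old)  # keep the old center for an empty cluster
            continue
        centers.append(min(
            members,
            key=lambda c: sum(levenshtein(c, m) for m in members)))
    return centers


def cluster_strings(strings, k, max_iter=20, seed=0):
    """Alternate both steps until the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(strings)), k)  # random initial centers
    for _ in range(max_iter):
        new_centers = recompute_centers(assign(strings, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(strings, centers)
```

Calling cluster_strings(["apple", "apply", "ample", "banana", "bandana", "bananas"], k=2) returns the chosen centers and the clusters. As with ordinary k-means, the result depends on the random initial centers.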
> 
> In a DP-like setup it would look a little different: during the E step,
> instead of having k cluster centers, computing distances to these
> centers and doing hard assignments, you would have k cluster models
> and compute the probability of each string being generated by each
> model. During the M step you would then recompute each cluster model
> based on how likely each string was found to be generated by that model.
> To arrive at a final assignment, once the assignment probabilities
> become stable, you could choose to assign each point to the model with
> the highest probability.
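
To make the contrast with hard assignments concrete, here is a small Python sketch of the soft-assignment loop. It is neither Mahout's Dirichlet clustering nor a real Dirichlet process (there is no prior over the number of clusters and no sampling). The "cluster model" here is an assumption made purely for illustration: a prototype string plus a mixture weight, with a string's likelihood taken to be proportional to exp(-beta * edit distance). All names and the beta parameter are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def responsibilities(strings, prototypes, weights, beta=1.0):
    """E step: probability of each model for each string (rows sum to 1)."""
    table = []
    for s in strings:
        scores = [w * math.exp(-beta * levenshtein(s, p))
                  for p, w in zip(prototypes, weights)]
        total = sum(scores)
        table.append([x / total for x in scores])
    return table


def update_models(strings, resp, k):
    """M step: re-fit each model, weighting strings by their probabilities."""
    prototypes, weights = [], []
    for j in range(k):
        # Prototype: the string with the smallest probability-weighted
        # total distance to all strings (an assumed update rule).
        prototypes.append(min(
            strings,
            key=lambda c: sum(r[j] * levenshtein(c, s)
                              for s, r in zip(strings, resp))))
        weights.append(sum(r[j] for r in resp) / len(strings))
    return prototypes, weights


def soft_cluster(strings, prototypes, iterations=10, beta=1.0):
    """Alternate E and M steps, then assign each string to its best model."""
    k = len(prototypes)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        resp = responsibilities(strings, prototypes, weights, beta)
        prototypes, weights = update_models(strings, resp, k)
    resp = responsibilities(strings, prototypes, weights, beta)
    hard = [max(range(k), key=lambda j: r[j]) for r in resp]
    return prototypes, resp, hard
```

The last two lines of soft_cluster correspond to the final step in the description: once the probabilities have settled, each string is assigned to the model that gives it the highest probability.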
> 
>  
> Isabel
