mahout-user mailing list archives

From Juan Francisco Contreras Gaitan <>
Subject RE: String clustering and other newbie questions
Date Tue, 01 Sep 2009 12:06:47 GMT

OK, I see. Sorry for my lack of knowledge on these matters (I am going to read all the documentation
you gave me closely).

But if I understood you correctly, and as far as I know, Mahout has its own k-means implementation.
Could I then use it for my purposes instead of a DP-like setup?

Thank you very much, Isabel.


> Date: Tue, 1 Sep 2009 08:23:05 +0200
> From:
> To:
> Subject: Re: String clustering and other newbie questions
> On Mon, 31 Aug 2009 14:02:08 +0200
> Juan Francisco Contreras Gaitan <> wrote:
> > Thank you very much for your answer, but I think I can't understand
> > it very well. Could you give me some more details?
> Taking up that question, Ted, please correct me anywhere where I'm
> wrong.
> > For example, what does 'DP' stand for?
> DP stands for Dirichlet process, sometimes also referred to as the "Chinese
> restaurant process". There is a nice Wikipedia page on Dirichlet
> processes themselves:
> How they were employed to implement a clustering
> algorithm in Mahout is explained on one of our wiki pages (including
> references to the original papers):
> > You can see an example of what I would like to
> > do in my previous answer.
> In a k-means-like setup, you would implement your own distance
> (Levenshtein in your case) and use that to assign items to clusters
> during the E(stimation)-step. After that you would employ your own
> implementation of a centroid selection algorithm for recomputing
> cluster centroids during the M(aximisation)-step.
> In a DP-like setup it would look a little different: during the E-step,
> instead of having k cluster centers, computing distances to these
> centers and doing hard assignments, you would have k cluster models
> and compute the probability of each string being generated by each
> model. During the M-step you would then recompute each cluster model
> based on how likely each string was found to be generated by that model.
> To arrive at a final assignment, after the assignment probabilities
> become stable, you could choose to assign each point to the model with
> the highest probability.
> Isabel
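
To make the two setups described above more concrete, here is a minimal Python sketch. It is not Mahout code and does not use Mahout's API; every function name in it is made up for illustration. It assumes that the "centroid" of a group of strings is chosen as the medoid (the member with the smallest total edit distance to the others), which is only one possible centroid selection rule. The `soft_assign` function is a simple stand-in for the probabilistic E-step: it turns distances into normalised weights with an assumed `temperature` parameter, and it is not a Dirichlet process model.

```python
import math


def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def assign(strings, centers):
    """E-like step: hard-assign each string to its nearest center."""
    clusters = {c: [] for c in centers}
    for s in strings:
        nearest = min(centers, key=lambda c: levenshtein(s, c))
        clusters[nearest].append(s)
    return clusters


def medoid(members):
    """M-like step (assumed rule): member with the smallest total distance."""
    return min(members, key=lambda m: sum(levenshtein(m, o) for o in members))


def cluster_strings(strings, initial_centers, max_iter=20):
    """Alternate hard assignment and medoid selection until centers stop changing."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        clusters = assign(strings, centers)
        # Keep the old center if a cluster ended up empty.
        new_centers = [medoid(clusters[c]) if clusters[c] else c for c in centers]
        if new_centers == centers:
            break
        centers = new_centers
    return assign(strings, centers)


def soft_assign(s, centers, temperature=1.0):
    """Soft stand-in for the probabilistic E-step: weights that sum to 1."""
    weights = [math.exp(-levenshtein(s, c) / temperature) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]
```

Calling `cluster_strings(words, initial_centers)` returns a dictionary that maps each final center to the strings assigned to it. For a final hard assignment from the soft version, you would pick the center with the largest weight returned by `soft_assign`.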
