mahout-dev mailing list archives

From Jeff Eastman <>
Subject Re: [jira] Commented: (MAHOUT-30) dirichlet process implementation
Date Thu, 13 Nov 2008 01:38:51 GMT
Hi Ted,

Indeed, it was precisely all that sampling that confounded me for so 
long, especially in untyped R. All the other clustering algorithms can 
be thought of as sampling too, but their pdfs == 1 for the model chosen. 
I think it was actually a conversation with a statistics guy at Yahoo!, 
when I gave a Mahout intro last summer, that got me thinking outside of 
that box. He noted that, for large data sets, it is really not necessary 
to process all the points to get meaningful clusters; sampling from them 
is enough. That took a few months to really sink in, and then the aha 
moment happened. I think I posted to this list at that point. Your 
refactoring of my initial abstractions cemented the deal :).

If I continue down the path of using running sums to compute the new 
model parameters, I think I can eliminate materializing the set of 
points assigned to each model in recomputeModels(). I need to add an 
observe() method to the Model interface and do some more refactoring of 
ModelDistribution; it is all a little half-baked right now. I'll post a 
patch to Jira if I get it working, but the basic idea is to create a new 
set of prior models in the assignPointsToModels() method, ask the 
assigned model to observe() each point as I iterate through them, and 
then just compute posterior parameters in recomputeModels(). Of course, 
I'll have to figure out how to compute the new mixtures differently, 
without z, but I have some ideas.
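
A minimal sketch of what that observe()/running-sums idea could look like (hypothetical names and a simple diagonal Gaussian; Mahout's actual interfaces differ): the model folds each assigned point into running sums, so posterior parameters can be computed afterwards without keeping the point set around.

```java
// Hypothetical sketch, not Mahout's actual API: a model that accumulates
// running sums as points are observed, so posterior parameters can be
// derived without materializing the set of assigned points.
interface Model {
    void observe(double[] point);   // fold one assigned point into the running sums
    void computeParameters();       // derive posterior parameters from the sums
    double pdf(double[] point);     // likelihood of a point under this model
}

class GaussianModel implements Model {
    private int count;          // number of observed points
    private double[] sum;       // running sum of coordinates
    private double[] sumSq;     // running sum of squared coordinates
    private double[] mean;
    private double[] variance;

    GaussianModel(int dims) {
        sum = new double[dims];
        sumSq = new double[dims];
        mean = new double[dims];
        variance = new double[dims];
    }

    public void observe(double[] point) {
        count++;
        for (int i = 0; i < point.length; i++) {
            sum[i] += point[i];
            sumSq[i] += point[i] * point[i];
        }
    }

    public void computeParameters() {
        for (int i = 0; i < sum.length; i++) {
            mean[i] = sum[i] / count;
            variance[i] = sumSq[i] / count - mean[i] * mean[i];
        }
    }

    public double pdf(double[] point) {
        // Product of independent per-dimension Gaussian densities.
        double p = 1.0;
        for (int i = 0; i < point.length; i++) {
            double sd = Math.sqrt(Math.max(variance[i], 1e-9)); // guard zero variance
            double d = (point[i] - mean[i]) / sd;
            p *= Math.exp(-0.5 * d * d) / (sd * Math.sqrt(2 * Math.PI));
        }
        return p;
    }
}
```
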

I'll keep you all posted,

Ted Dunning (JIRA) wrote:
> Ted Dunning commented on MAHOUT-30:
> -----------------------------------
> Jeff,
> These look like really nice refactorings.  The process is nice and clear.
> The only key trick that may confuse people is that each step is a sampling.  Thus assignment
> to clusters does NOT assign to the best cluster, it picks a cluster at random, biased by the
> mixture parameters and model pdf's.  Likewise, model computation does NOT compute the best
> model, it samples from the distribution given by the data.  Same is true for the mixture parameters.
> Your code does this.  I just think that this is a hard point for people to understand
> in these techniques.
>> dirichlet process implementation
>> --------------------------------
>>                 Key: MAHOUT-30
>>                 URL:
>>             Project: Mahout
>>          Issue Type: New Feature
>>          Components: Clustering
>>            Reporter: Isabel Drost
>>         Attachments: MAHOUT-30.patch
>> Copied over from original issue:
>>> Further extension can also be made by assuming an infinite mixture model. The
>>> implementation is only slightly more difficult and the result is a (nearly)
>>> non-parametric clustering algorithm.
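
The sampled assignment Ted describes above, picking a cluster at random with probability proportional to mixture weight times model pdf rather than taking the argmax, can be sketched like this (a hypothetical helper, not Mahout's code):

```java
import java.util.Random;

// Hypothetical sketch of the sampled assignment step: rather than choosing
// the best-scoring cluster, draw a cluster index at random with probability
// proportional to (mixture weight) * (model likelihood of the point).
class ClusterSampler {
    static int sampleCluster(double[] mixture, double[] pdfs, Random rng) {
        double[] weights = new double[mixture.length];
        double total = 0.0;
        for (int k = 0; k < mixture.length; k++) {
            weights[k] = mixture[k] * pdfs[k];   // unnormalized posterior of cluster k
            total += weights[k];
        }
        double draw = rng.nextDouble() * total;  // uniform draw on [0, total)
        for (int k = 0; k < weights.length; k++) {
            draw -= weights[k];
            if (draw < 0) {
                return k;                        // landed in cluster k's slice
            }
        }
        return weights.length - 1;               // guard against rounding at the edge
    }
}
```

Because the draw is biased rather than greedy, low-probability clusters still occasionally receive points, which is what keeps the Gibbs-style iteration exploring the model space.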
