Hi Ted,
Indeed, it was precisely all that sampling that confounded me for so
long, especially in untyped R. All the other clustering algorithms can
be thought of as sampling too, but with pdfs that are effectively 1 for the model chosen.
I think it was actually a conversation with a statistics guy at Yahoo!
when I gave a Mahout intro last summer that got me thinking outside of
that box. He noted that, for large data sets, it is really not necessary
to process all the points to get meaningful clusters; just to sample
from them. That took a few months to really sink in, and then the aha
moment happened. I think I did a posting to this list at that point. Your
refactoring of my initial abstractions cemented the deal :).
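To make the sampling point concrete, here is a minimal, hypothetical Java sketch (not the patch's actual API; sampleCluster and the mixture/pdf arrays are illustrative stand-ins) of what "assignment is a sampling" means: instead of taking the argmax, each point draws a cluster at random, biased by the mixture weights and the models' pdfs.

```java
import java.util.Random;

public class SampledAssignment {

  // Draw a cluster index with probability proportional to mixture[k] * pdf[k].
  // Both arrays are hypothetical stand-ins for the mixture parameters and the
  // per-model densities evaluated at one point.
  static int sampleCluster(double[] mixture, double[] pdf, Random rng) {
    double[] weights = new double[mixture.length];
    double total = 0.0;
    for (int k = 0; k < mixture.length; k++) {
      weights[k] = mixture[k] * pdf[k];
      total += weights[k];
    }
    double u = rng.nextDouble() * total; // uniform draw on [0, total)
    double cum = 0.0;
    for (int k = 0; k < weights.length; k++) {
      cum += weights[k];
      if (u < cum) {
        return k;
      }
    }
    return weights.length - 1; // guard against rounding at the top end
  }

  public static void main(String[] args) {
    Random rng = new Random();
    // Degenerate case: all the weight on cluster 0, so the draw is forced.
    System.out.println(
        sampleCluster(new double[] {1.0, 0.0}, new double[] {0.5, 0.5}, rng)); // prints 0

    // Biased case: cluster 1 carries most of the mass, so it should win
    // most, but deliberately not all, of the draws.
    int[] counts = new int[2];
    for (int i = 0; i < 10000; i++) {
      counts[sampleCluster(new double[] {0.7, 0.3}, new double[] {0.1, 0.9}, rng)]++;
    }
    System.out.println(counts[0] + " vs " + counts[1]);
  }
}
```

The "best cluster" everyone expects would just be the argmax over weights; the whole trick of the sampler is that it deliberately does not do that.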
If I continue down the path of using running sums to compute the new
model parameters, I think I can eliminate materializing the set of
points that are assigned to each model in recomputeModels(). I need to
add an observe() method to the Model interface and do some more
refactoring of ModelDistribution, and it is all a little half-baked right
now. I'll post that to JIRA if I get it working, but the basic idea
would be to create a new set of prior models in the
assignPointsToModels() method, ask the assigned model to observe() as I
iterate through the points, and then just compute posterior parameters
in recomputeModels. Of course, I'll have to figure out how to compute
the new mixtures differently, without z, but I have some ideas.
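To sketch what I mean by running sums (illustrative names only: observe(), computePosterior(), and NormalModel below are my stand-ins, not the current Model/ModelDistribution interfaces, and I take posterior point estimates here rather than sampling parameters from the posterior as the algorithm actually would):

```java
// Each model folds assigned points into running sums (count, sum, sum of
// squares) instead of materializing the set of assigned points, then derives
// new parameters from those sums alone.
interface Model {
  void observe(double x);  // fold one assigned point into the running sums
  void computePosterior(); // derive new parameters from the sums
  double pdf(double x);
}

class NormalModel implements Model {
  double mean = 0.0;
  double stdDev = 1.0;
  private double s0 = 0.0; // count of observed points
  private double s1 = 0.0; // sum of points
  private double s2 = 0.0; // sum of squared points

  public void observe(double x) {
    s0++;
    s1 += x;
    s2 += x * x;
  }

  public void computePosterior() {
    if (s0 > 0) {
      mean = s1 / s0;
      // max() guards against a slightly negative variance from rounding
      stdDev = Math.sqrt(Math.max(s2 / s0 - mean * mean, 1e-9));
    }
    s0 = s1 = s2 = 0.0; // reset the sums for the next iteration
  }

  public double pdf(double x) {
    double d = (x - mean) / stdDev;
    return Math.exp(-0.5 * d * d) / (stdDev * Math.sqrt(2.0 * Math.PI));
  }
}

public class RunningSumsSketch {
  public static void main(String[] args) {
    NormalModel m = new NormalModel();
    for (double x : new double[] {2.0, 4.0}) {
      m.observe(x);
    }
    m.computePosterior();
    System.out.println("mean=" + m.mean + " stdDev=" + m.stdDev); // prints mean=3.0 stdDev=1.0
  }
}
```

The per-model counts (s0 above) also look like the natural raw material for recomputing the mixture without materializing z, but that part is still hand-waving on my end.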
I'll keep you all posted,
Jeff
Ted Dunning (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12646967#action_12646967 ]
>
> Ted Dunning commented on MAHOUT-30:
> 
>
> Jeff,
>
> These look like really nice refactorings. The process is nice and clear.
>
> The only key trick that may confuse people is that each step is a sampling. Thus assignment
> to clusters does NOT assign to the best cluster, it picks a cluster at random, biased by the
> mixture parameters and model pdf's. Likewise, model computation does NOT compute the best
> model, it samples from the distribution given by the data. Same is true for the mixture parameters.
>
> Your code does this. I just think that this is a hard point for people to understand
> in these techniques.
>
>
>> Dirichlet process implementation
>> 
>>
>> Key: MAHOUT-30
>> URL: https://issues.apache.org/jira/browse/MAHOUT-30
>> Project: Mahout
>> Issue Type: New Feature
>> Components: Clustering
>> Reporter: Isabel Drost
>> Attachments: MAHOUT-30.patch
>>
>>
>> Copied over from original issue:
>>
>>> Further extension can also be made by assuming an infinite mixture model. The
>>> implementation is only slightly more difficult and the result is a (nearly)
>>> nonparametric clustering algorithm.
>>>
>
>
