mahout-dev mailing list archives

From Grant Ingersoll <>
Subject Re: Dirchlet
Date Wed, 02 Nov 2011 23:56:52 GMT

On Nov 2, 2011, at 5:29 PM, Jeff Eastman wrote:

> I think the scalability problems you are seeing are a consequence of using the default
> GaussianCluster models. These models perform especially poorly for large text clustering
> problems such as email. The pdf() calculation over wide topic vectors does a lot of
> complicated math for each term pdf and then underflows on the combined pdf() product to
> boot. I've updated to use a CosineDistanceMeasure and a DistanceMeasureCluster instead and
> the performance has improved over 100x on Reuters. So has, evidently, the quality of the
> clustering. See recent posts "Dirichlet Process Clustering not working".

I shall try that on the script.
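
For readers following the thread, the underflow Jeff describes can be reproduced with a small sketch. This is plain Python, not Mahout code: the function names, the 5000-term vector width and the standard deviation are all assumptions chosen for illustration. It shows that multiplying one Gaussian density per term collapses to 0.0 on a wide vector, that the same quantity stays finite in log space, and that a cosine distance avoids the per-term density math altogether.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def product_pdf(xs, means, std):
    """Naive joint density: multiply the per-term densities together."""
    p = 1.0
    for x, m in zip(xs, means):
        p *= gaussian_pdf(x, m, std)
    return p


def log_pdf(xs, means, std):
    """The same joint density in log space: a sum of per-term log densities."""
    log_norm = -math.log(std * math.sqrt(2.0 * math.pi))
    return sum(log_norm - 0.5 * ((x - m) / std) ** 2 for x, m in zip(xs, means))


def cosine_distance(a, b):
    """1 - cos(angle between a and b).

    Returning 1.0 for a zero-length vector is this sketch's own convention,
    not necessarily what Mahout's CosineDistanceMeasure does.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    dims = 5000            # hypothetical vocabulary size
    doc = [0.0] * dims     # a point sitting exactly on the cluster centre
    centre = [0.0] * dims
    std = 10.0             # hypothetical per-term standard deviation

    print("product of pdfs:", product_pdf(doc, centre, std))
    print("sum of log pdfs:", log_pdf(doc, centre, std))
    print("cosine distance:", cosine_distance([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
```

Even the best possible match (a document exactly at the centre) gets a product of 0.0 here, so every cluster looks equally unlikely. Working in log space fixes the underflow but still costs one density evaluation per term, whereas the distance-based approach needs only a dot product and two norms.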