mahout-user mailing list archives

From cont...@dhuebner.com
Subject DirichletDriver vector cardinality and heap usage
Date Mon, 13 May 2013 10:34:13 GMT
I am trying to run Dirichlet process clustering on the co-occurrence 
matrix produced by RowSimilarityJob. Because RowSimilarityJob writes 
RandomAccessSparseVectors with a cardinality of Integer.MAX_VALUE, I 
pass the same cardinality to the DistributionDescription and run the 
clustering with the following code:


ModelDistribution<VectorWritable> modelDist =
    new GaussianClusterDistribution(new VectorWritable(new DenseVector(2)));
DistributionDescription description = new DistributionDescription(
    modelDist.getClass().getName(),
    RandomAccessSparseVector.class.getName(),
    CosineDistanceMeasure.class.getName(),
    Integer.MAX_VALUE);

DirichletDriver.run(conf, cooccurrenceMatrixPath, clusteringOutput,
    description, 10, 20, 1.0, true, true, 0, false);



Passing Integer.MAX_VALUE to the DistributionDescription makes heap 
usage explode. Is there a way to work around this problem?
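
For scale, here is a rough estimate of what a cardinality of 
Integer.MAX_VALUE would cost, assuming (I have not confirmed this in the 
Mahout source) that each model's vectors end up fully populated with 
8-byte doubles. HeapEstimate and its methods are hypothetical helpers 
written only for this illustration, not part of Mahout. The 10 is the 
fifth argument of the run call above, which I take to be the number of 
models, and one vector per model is an assumed lower bound.

```java
// Hypothetical helper for illustration only; not part of Mahout.
class HeapEstimate {
    // Assumption: one 8-byte double per element, ignoring object and index overhead.
    static final long BYTES_PER_DOUBLE = 8L;

    // Bytes needed if every element of a vector with this cardinality is stored.
    static long denseVectorBytes(int cardinality) {
        return (long) cardinality * BYTES_PER_DOUBLE;
    }

    // Bytes needed for numModels models, each holding vectorsPerModel such vectors.
    static long totalBytes(int cardinality, int numModels, int vectorsPerModel) {
        return denseVectorBytes(cardinality) * numModels * vectorsPerModel;
    }

    public static void main(String[] args) {
        long perVector = denseVectorBytes(Integer.MAX_VALUE);
        long total = totalBytes(Integer.MAX_VALUE, 10, 1);
        double gib = 1L << 30;
        System.out.printf("per vector: %d bytes (%.1f GiB)%n", perVector, perVector / gib);
        System.out.printf("10 models:  %d bytes (%.1f GiB)%n", total, total / gib);
    }
}
```

Under these assumptions a single fully populated vector already needs 
about 16 GiB, so the estimate only shows the order of magnitude, not 
where the allocation actually happens.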
