mahout-user mailing list archives

Subject DirichletDriver vector cardinality and heap usage
Date Mon, 13 May 2013 10:34:13 GMT
I am trying to run Dirichlet Process Clustering on the co-occurrence 
matrix produced by RowSimilarityJob. Since RowSimilarityJob creates 
RandomAccessSparseVectors with a cardinality of Integer.MAX_VALUE, I 
used the following code to run the clustering:

ModelDistribution<VectorWritable> modelDist =
    new GaussianClusterDistribution(new VectorWritable(new DenseVector(2)));
DistributionDescription description = new DistributionDescription(
    ..., CosineDistanceMeasure.class.getName(), Integer.MAX_VALUE);
DirichletDriver.run(..., cooccurrenceMatrixPath, clusteringOutput,
    description, 10, 20, 1.0, true, true, 0, false);

Using Integer.MAX_VALUE for the DistributionDescription makes heap 
usage explode. Is there a way to work around this problem?
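
For a sense of scale, here is a rough back-of-the-envelope sketch. It 
assumes (I have not verified this in the Mahout source) that each cluster 
model ends up holding at least one dense double array as long as the 
cardinality, and that the 10 in the call above is the number of models. The 
class and method names are made up for illustration and are not part of Mahout.

```java
// Hypothetical helper, not Mahout code: estimates the heap needed if every
// cluster model allocates one dense double[] of length `cardinality`.
class DenseFootprint {

    // Bytes for `models` dense double arrays, each `cardinality` entries long.
    static long denseBytes(long cardinality, int models) {
        return cardinality * Double.BYTES * models;
    }

    public static void main(String[] args) {
        long oneModel = denseBytes(Integer.MAX_VALUE, 1);
        long tenModels = denseBytes(Integer.MAX_VALUE, 10); // assumed model count

        double gib = (double) (1L << 30);
        System.out.printf("one model:  %.1f GiB%n", oneModel / gib);
        System.out.printf("ten models: %.1f GiB%n", tenModels / gib);
    }
}
```

Under that assumption a single model already needs roughly 16 GiB for one 
dense array, so any dense allocation at this cardinality would exceed a 
typical task heap.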
