mahout-user mailing list archives

From Jure Jeseničnik
Subject Canopy memory consumption
Date Wed, 17 Nov 2010 11:54:11 GMT
Hi Guys.

What I'm trying to do is basic news clustering: grouping news items about the same
topic into clusters. I have the data in a database, so I took the following approach:

1.       Wrote a small program that puts the data from the db into a Lucene Index.

2.       Created vectors from index with the following command:
mahout lucene.vector -d newsindex -f text -o input/out.txt -t dict.txt -i link -n 2

3.       Ran canopy, to get initial clusters:
mahout canopy -i input/ -o output-canopy/ -t1 1 -t2 1.4 -ow

4.       Ran the kmeans to perform the final clustering:
mahout kmeans -i input/ -o output-kmeans/ -c output-canopy/clusters-0 -x 10 -cl -ow

5.       Ran clusterdump to view the results:
mahout clusterdump -s output-kmeans/clusters-2 -d dict.txt -p output-kmeans/clusteredPoints -dt text -b 100 -n 10 > result.txt

When I run this with about 1000 records (8000 distinct terms), the results are just perfect.
I get exactly the clusters I want. The problems start when I try the same steps with a bit
more data.

With 6000 records (28000 terms), or even half of that, the process fails at the canopy
step with a Java heap space OutOfMemoryError. The MAHOUT_HEAPSIZE variable on my local
machine is set to 1024. I even tried running it on our development Hadoop cluster with approximately
the same amount of memory, but it failed with the same error.

I realize that software needs a certain amount of memory to work properly, but I find it hard
to believe that 1 GB is not enough to process a 3.1 MB file, which is the size of the
vectors file produced by the second step. We're hoping to use this solution on hundreds
of thousands of records, and I can't help but wonder what sort of hardware we'll need
in order to process them if such memory consumption is normal.
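
To put rough numbers on this, the sketch below estimates the memory needed if every
document vector were stored densely, that is, one 8-byte double per dictionary term.
That dense layout is only an assumption for illustration; I have not verified how the
canopy job actually holds vectors in memory, and the helper names are made up.

```python
# Back-of-the-envelope estimate only.
# Assumption: each vector is held as a dense array of 8-byte doubles,
# one slot per dictionary term. This is hypothetical, not confirmed
# behaviour of Mahout's canopy implementation.
BYTES_PER_DOUBLE = 8


def dense_bytes(num_vectors: int, num_terms: int) -> int:
    """Bytes needed if every vector stored one double per term."""
    return num_vectors * num_terms * BYTES_PER_DOUBLE


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


small = dense_bytes(1000, 8000)    # the run that worked
large = dense_bytes(6000, 28000)   # the run that failed

print(f"1000 x 8000:  {to_mib(small):.0f} MiB")   # 61 MiB
print(f"6000 x 28000: {to_mib(large):.0f} MiB")   # 1282 MiB
```

Under that assumption the failing run would need about 1282 MiB, which is more than the
1024 MB heap, while the working run would need only about 61 MiB. The sparse vectors
file itself is just 3.1 MB, so this would only explain the error if the vectors end up
dense somewhere along the way.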

Am I missing something here? Are there any other settings that I should be taking into consideration?

And one more thing: I tried the meanshift implementation, and it seems to work fine
with that much data.


