mahout-user mailing list archives

From DAN HELM <>
Subject Error: Java heap space on mahout cvb command
Date Fri, 25 May 2012 23:07:37 GMT
I’m running the new (since Mahout 0.6) CVB algorithm (LDA variation).

Previously I successfully clustered the Reuters 21K collection.  For that case I ran the algorithm
for 10 iterations into 60 clusters. Now I want to cluster a different test collection of 80K files.
Some of the documents are larger than the Reuters files, but most are not particularly large.

When attempting to cluster that collection, I get a “Java heap space” error at the start of
the first iteration of the “mahout cvb” run.  I wanted to run for 4 iterations and generate
200 clusters.
The command I ran was:

mahout cvb -i /tmp/sparse-vectors-cvb -o /tmp/cvb -k 200 -ow -x 4 -dt /tmp/doc-topic-cvb -dict /tmp/out-seqdir-sparse-cvb/dictionary.file-0 -mt /tmp/topicModelState
Right before running that command I ran the following two commands to convert my sparse vectors
(earlier steps not shown here) to the form needed by the cvb command:

mahout rowid -i /tmp/out-seqdir-sparse-cvb/tf-vectors -o /tmp/sparse-vectors-cvb
hadoop fs -mv /tmp/sparse-vectors-cvb/docIndex /tmp/sparse-vectors-index-cvb

(Note: the second step was needed to move the generated docIndex file out of the cvb input
directory so that the cvb command would not blow up.)
The pertinent error log excerpt follows:
12/05/25 08:47.03 INFO cvb.CVB0Driver: Current iteration number: 0
12/05/25 08:47.03 INFO About to run iteration 1 of 4
12/05/25 08:47.03 INFO About to run: Iteration 1 of 4, input path: /tmp/topicModelState/model-0
12/05/25 08:47.03 INFO input.FileInputFormat: Total input paths to process: 1
12/05/25 08:47.03 INFO mapred.JobClient: Running job: job id
12/05/25 08:47.03 INFO mapred.JobClient: map 0% reduce 0%
12/05/25 08:47.03 INFO mapred.JobClient: Task Id : attempt id, Status : FAILED
12/05/25 08:47.03 Error: Java heap space
I kept lowering the number of documents to be clustered until it finally worked with fewer
than 10K files.  I also changed the number of clusters to generate (k) to 40, though I don't
think k was the issue.  I am interested in clustering very large sets with CVB (possibly
hundreds of thousands of files or more), so I hope cvb can scale to that.
I ran the above on a 3-node cluster.
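
One way to sanity-check whether k matters is a back-of-envelope estimate of the model size. The sketch below assumes (this is an assumption, not something shown in the logs) that each map task holds the topic model as a dense k x numTerms array of 8-byte doubles. The function name and the 100,000-term dictionary size are hypothetical placeholders, since the real dictionary size is not given here.

```python
# Rough estimate of the memory one dense topic-term model would need.
# ASSUMPTION: the model is a dense num_topics x num_terms array of
# 8-byte doubles; this has not been confirmed against the cvb run above.

BYTES_PER_DOUBLE = 8


def model_size_mb(num_topics: int, num_terms: int) -> float:
    """Return the size in MB of a dense num_topics x num_terms double array."""
    return num_topics * num_terms * BYTES_PER_DOUBLE / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical dictionary size; substitute the real number of terms
    # from the dictionary file to get a meaningful figure.
    hypothetical_num_terms = 100_000
    for k in (40, 60, 200):
        size = model_size_mb(k, hypothetical_num_terms)
        print(f"k={k}: ~{size:.0f} MB per model copy")
```

Under that assumption the estimate grows linearly in both k and the dictionary size, so comparing the result with the heap available to each map task would show whether the model alone could account for the error.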
Thanks, Dan