mahout-user mailing list archives

From DAN HELM
Subject Re: Error: Java heap space on mahout cvb command
Date Fri, 25 May 2012 23:25:09 GMT
Hi Andy,
I ran this at work so don't have the data and log now but somehow I seem to recall log output
(after the rowid step) saying there were around 90K terms/columns in the resulting matrix...but
I would have to check next week.
So, I guess the key is to jack up the map task heap space to support a dense matrix?
So per your O(num topics * num terms) below, I guess "k - #topics" could also have been a
culprit, in particular when I had k=200.
Out of curiosity, if one were to cluster 1 million documents, what would be a reasonable k? 
I guess it depends on the nature of the data (domain) and application but it would seem if
k is too small then the clusters would be way too fat and noisy.


From: Andy Schlaikjer
To: DAN HELM
Sent: Friday, May 25, 2012 7:12 PM
Subject: Re: Error: Java heap space on mahout cvb command
Hi Dan,

Each map task must have enough heap to store a dense matrix O(num topics *
num terms). Size of input documents shouldn't matter unless you've got
really huge (sparse) term vectors.
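
As a rough illustration of the O(num topics * num terms) estimate above, here is a minimal sketch of the arithmetic. It assumes 8 bytes per matrix entry (a Java double) and uses the approximate 90K vocabulary size recalled elsewhere in this thread; the function names are made up for this example, and the real per-task footprint will be higher because of JVM overhead and any additional copies of the model that the implementation keeps.

```python
def dense_model_bytes(num_topics: int, num_terms: int, bytes_per_entry: int = 8) -> int:
    """Raw size of one dense num_topics x num_terms matrix.

    Assumes 8-byte entries (doubles); ignores JVM object overhead.
    """
    return num_topics * num_terms * bytes_per_entry


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


NUM_TERMS = 90_000  # approximate vocabulary size recalled in this thread

# Compare the values of k mentioned in this thread.
for k in (40, 60, 200):
    size = dense_model_bytes(k, NUM_TERMS)
    print(f"k={k}: about {to_mib(size):.0f} MiB per dense matrix copy")
```

Under these assumptions the matrix grows linearly with k, so going from k=40 to k=200 multiplies the per-task requirement by five, independent of how many documents are in the input.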

What's the size of your input vocabulary?


On Fri, May 25, 2012 at 4:07 PM, DAN HELM wrote:

> I’m running the new (since Mahout 0.6) CVB algorithm (LDA variation).
> Previously I successfully clustered the Reuters 21K collection.  For that
> case I ran the algorithm for 10 iterations into 60 clusters. Now I want to
> cluster a different 80K file test collection.  Some of the documents are
> larger than the Reuters files but most are not particularly large files.
> When attempting to cluster that collection, I get a “Java heap space”
> error at the start of the first iteration of the “mahout cvb” run.  I wanted
> to run for 4 iterations and generate 200 clusters.
> The command I ran was:
> mahout cvb -i /tmp/sparse-vectors-cvb -o /tmp/cvb -k 200 -ow -x 4 -dt
> /tmp/doc-topic-cvb -dict /tmp/out-seqdir-sparse-cvb/dictionary.file-0 -mt
> /tmp/topicModelState
> Right before running that command I ran the following two commands to
> convert my sparse vectors (earlier steps not shown here) to the proper form
> needed for cvb command:
> mahout rowid -i /tmp/out-seqdir-sparse-cvb/tf-vectors -o
> /tmp/sparse-vectors-cvb
> hadoop fs -mv /tmp/sparse-vectors-cvb/docIndex
> /tmp/sparse-vectors-index-cvb (note: this step was needed to move the
> generated docIndex file out so the cvb command would not blow up).
> The pertinent error log excerpt follows:
> ....
> ....
> 12/05/25 08:47.03 INFO cvb.CVB0Driver: Current iteration number: 0
> 12/05/25 08:47.03 INFO About to run iteration 1 of 4
> 12/05/25 08:47.03 INFO About to run: Iteration 1 of 4, input path:
> /tmp/topicModelState/model-0
> 12/05/25 08:47.03 INFO input.FileInputFormat: Total input paths to
> process: 1
> 12/05/25 08:47.03 INFO mapred.JobClient: Running job: job id
> 12/05/25 08:47.03 INFO mapred.JobClient: map 0% reduce 0%
> 12/05/25 08:47.03 INFO mapred.JobClient: Task Id : attempt id, Status :
> 12/05/25 08:47.03 Error: Java heap space
> ....
> ....
> I kept on lowering the number of documents to be clustered until it
> finally worked when I had fewer than 10K files.  I also changed the number
> of clusters to generate (k) to 40 (I don't think this was an issue).  I am
> interested in being able to cluster very large sets with CVB (possibly
> hundreds of thousands of files or more), so I hope cvb can scale to that.
> I ran the above on a 3 node cluster.
> Thanks, Dan