mahout-user mailing list archives

From DAN HELM <danielh...@verizon.net>
Subject Re: Error: Java heap space on mahout cvb command
Date Fri, 25 May 2012 23:25:09 GMT
Hi Andy,
 
I ran this at work so don't have the data and log now but somehow I seem to recall log output
(after the rowid step) saying there were around 90K terms/columns in the resulting matrix...but
I would have to check next week.
 
So, I guess the key is to jack up the map task heap space to support a dense matrix?
So per your O(num topics * num terms) below, I guess "k - #topics" could also have been a
culprit, in particular when I had k=200.
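
As a rough sanity check on that O(num topics * num terms) figure, here is a minimal Python sketch of the arithmetic. It assumes each matrix entry is an 8-byte double and uses the roughly 90K terms I recalled above, which I still need to confirm. The function names are made up for illustration, and the real per-task footprint will be higher than this lower bound once JVM overhead and any extra copies of the model are counted.

```python
def dense_model_bytes(num_topics, num_terms, bytes_per_value=8):
    """Lower-bound size of a dense num_topics x num_terms matrix.

    bytes_per_value=8 assumes double-precision entries (an assumption,
    not something confirmed from the Mahout source).
    """
    return num_topics * num_terms * bytes_per_value


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    num_terms = 90_000  # approximate vocabulary size recalled from the rowid log
    for k in (40, 60, 200):  # topic counts mentioned in this thread
        size = to_mib(dense_model_bytes(k, num_terms))
        print(f"k={k}: about {size:.1f} MiB for the dense topic-term matrix")
```

Under those assumptions k=200 comes to about 137 MiB for the matrix alone, versus about 27 MiB at k=40, so a small map task heap could plausibly be exhausted at k=200 before any documents are processed.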
 
Out of curiosity, if one were to cluster 1 million documents, what would be a reasonable k? 
I guess it depends on the nature of the data (domain) and application but it would seem if
k is too small then the clusters would be way too fat and noisy.

Thanks.
  

________________________________
 From: Andy Schlaikjer <andrew.schlaikjer@gmail.com>
To: user@mahout.apache.org; DAN HELM <danielhelm@verizon.net> 
Sent: Friday, May 25, 2012 7:12 PM
Subject: Re: Error: Java heap space on mahout cvb command
  
Hi Dan,

Each map task must have enough heap to store a dense matrix O(num topics *
num terms). Size of input documents shouldn't matter unless you've got
really huge (sparse) term vectors.

What's the size of your input vocabulary?

Andy


On Fri, May 25, 2012 at 4:07 PM, DAN HELM <danielhelm@verizon.net> wrote:

> I’m running the new (since Mahout 0.6) CVB algorithm (LDA variation).
>
> Previously I successfully clustered the Reuters 21K collection.  For that
> case I ran the algorithm for 10 iterations into 60 clusters. Now I want to
> cluster a different 80K file test collection.  Some of the documents are
> larger than the Reuters files but most are not particularly large files.
>
> When attempting to cluster that collection, I get a “Java heap space”
> error at start of first iteration of the “mahout cvb” run.  I wanted to run
> for 4 iterations and generate 200 clusters.
>
> The command I ran was:
>
> mahout cvb -i /tmp/sparse-vectors-cvb -o /tmp/cvb -k 200 -ow -x 4 -dt
> /tmp/doc-topic-cvb -dict /tmp/out-seqdir-sparse-cvb/dictionary.file-0 -mt
> /tmp/topicModelState
>
> Right before running that command I ran the following two commands to
> convert my sparse vectors (earlier steps not shown here) to the proper form
> needed for cvb command:
>
> mahout rowid -i /tmp/out-seqdir-sparse-cvb/tf-vectors -o
> /tmp/sparse-vectors-cvb
>
> hadoop fs -mv /tmp/sparse-vectors-cvb/docIndex
> /tmp/sparse-vectors-index-cvb (note: this step was needed to move the
> generated docIndex file out so the cvb command would not blow up).
>
> The pertinent error log excerpt follows:
> ....
> ....
> 12/05/25 08:47.03 INFO cvb.CVB0Driver: Current iteration number: 0
> 12/05/25 08:47.03 INFO About to run iteration 1 of 4
> 12/05/25 08:47.03 INFO About to run: Iteration 1 of 4, input path:
> /tmp/topicModelState/model-0
> 12/05/25 08:47.03 INFO input.FileInputFormat: Total input paths to
> process: 1
> 12/05/25 08:47.03 INFO mapred.JobClient: Running job: job id
> 12/05/25 08:47.03 INFO mapred.JobClient: map 0% reduce 0%
> 12/05/25 08:47.03 INFO mapred.JobClient: Task Id : attempt id, Status :
> FAILED
> 12/05/25 08:47.03 Error: Java heap space
> ....
> ....
>
> I kept lowering the number of documents to be clustered until it
> finally worked when I had fewer than 10K files.  I also changed the number
> of clusters to generate (k) to 40 (I don't think this was an issue).  I am
> interested in being able to cluster very large sets with CVB (possibly
> hundreds of thousands of files or more), so I hope cvb can scale to that.
>
> I ran the above on a 3 node cluster.
>
> Thanks, Dan