Hi Andy,
I ran this at work, so I don't have the data and logs on hand, but I seem to recall the log output
(after the rowid step) saying there were around 90K terms/columns in the resulting matrix...but
I would have to check next week.
So, I guess the key is to jack up the map task heap space to support a dense matrix?
So, per your O(num topics * num terms) note below, I guess k (the number of topics) could also
have been a culprit, in particular when I had k=200.
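For a rough sense of scale, here is a back-of-the-envelope sketch of that matrix size in Python. It assumes 8-byte double entries and the roughly 90K terms mentioned above; the function name is made up for illustration, and the figure ignores JVM overhead and any extra copies of the model a map task might hold, so treat it as a lower bound. For k=200 it works out to about 144 MB for a single matrix.

```python
def dense_matrix_bytes(num_topics, num_terms, bytes_per_entry=8):
    """Lower bound on memory for one dense num_topics x num_terms matrix.

    bytes_per_entry=8 is an assumption (double-precision values).
    """
    return num_topics * num_terms * bytes_per_entry


# Approximate vocabulary size recalled from the rowid log (unverified).
NUM_TERMS = 90_000

# k values mentioned in this thread: 40 and 60 worked, 200 failed.
for k in (40, 60, 200):
    megabytes = dense_matrix_bytes(k, NUM_TERMS) / 1e6
    print(f"k={k}: ~{megabytes:.0f} MB per dense topic-term matrix")
```

Since the estimate is linear in both k and the vocabulary size, trimming either one reduces the heap needed per map task proportionally.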
Out of curiosity, if one were to cluster 1 million documents, what would be a reasonable k?
I guess it depends on the nature of the data (domain) and application, but it would seem that if
k is too small then the clusters would be way too fat and noisy.
Thanks.
________________________________
From: Andy Schlaikjer <andrew.schlaikjer@gmail.com>
To: user@mahout.apache.org; DAN HELM <danielhelm@verizon.net>
Sent: Friday, May 25, 2012 7:12 PM
Subject: Re: Error: Java heap space on mahout cvb command
Hi Dan,
Each map task must have enough heap to store a dense matrix O(num topics *
num terms). Size of input documents shouldn't matter unless you've got
really huge (sparse) term vectors.
What's the size of your input vocabulary?
Andy
On Fri, May 25, 2012 at 4:07 PM, DAN HELM <danielhelm@verizon.net> wrote:
> I’m running the new (since Mahout 0.6) CVB algorithm (LDA variation).
>
> Previously I successfully clustered the Reuters 21K collection. For that
> case I ran the algorithm for 10 iterations into 60 clusters. Now I want to
> cluster a different 80K-file test collection. Some of the documents are
> larger than the Reuters files but most are not particularly large files.
>
> When attempting to cluster that collection, I get a “Java heap space”
> error at the start of the first iteration of the “mahout cvb” run. I wanted
> to run for 4 iterations and generate 200 clusters.
>
> The command I ran was:
>
> mahout cvb -i /tmp/sparsevectorscvb -o /tmp/cvb -k 200 -ow -x 4 -dt
> /tmp/doctopiccvb -dict /tmp/outseqdirsparsecvb/dictionary.file0 -mt
> /tmp/topicModelState
>
> Right before running that command I ran the following two commands to
> convert my sparse vectors (earlier steps not shown here) to the proper form
> needed for cvb command:
>
> mahout rowid -i /tmp/outseqdirsparsecvb/tfvectors -o
> /tmp/sparsevectorscvb
>
> hadoop fs -mv /tmp/sparsevectorscvb/docIndex
> /tmp/sparsevectorsindexcvb (note: this step was needed to move the
> generated docIndex file out so the cvb command would not blow up).
>
> The pertinent error log excerpt follows:
> ....
> ....
> 12/05/25 08:47.03 INFO cvb.CVB0Driver: Current iteration number: 0
> 12/05/25 08:47.03 INFO About to run iteration 1 of 4
> 12/05/25 08:47.03 INFO About to run: Iteration 1 of 4, input path:
> /tmp/topicModelState/model0
> 12/05/25 08:47.03 INFO input.FileInputFormat: Total input paths to
> process: 1
> 12/05/25 08:47.03 INFO mapred.JobClient: Running job: job id
> 12/05/25 08:47.03 INFO mapred.JobClient: map 0% reduce 0%
> 12/05/25 08:47.03 INFO mapred.JobClient: Task Id : attempt id, Status :
> FAILED
> 12/05/25 08:47.03 Error: Java heap space
> ....
> ....
>
> I kept lowering the number of documents to be clustered until it
> finally worked with fewer than 10K files. I also changed the number
> of clusters to generate (k) to 40 (I don't think this was an issue). I am
> interested in being able to cluster very large sets with CVB (possibly
> hundreds of thousands of files or more), so I hope cvb can scale to that.
>
> I ran the above on a 3 node cluster.
>
> Thanks, Dan
