mahout-user mailing list archives

From Marco <zentrop...@yahoo.co.uk>
Subject Re: Latent Dirichlet Allocatio (cvb)
Date Wed, 31 Jul 2013 14:02:13 GMT
Oops! That did the trick.

Nonetheless, I think the wiki should mention that you have to run "rowid" and generate the
matrix first.

After waiting for more than an hour I got an error at this step:
Writing final document/topic inference from lda/matrix/matrix to jojoba/do-output

The error is: org.apache.mahout.math.IndexException: Index 90011 is outside allowable range of [0,90000)

Here is how I launched it:
mahout cvb -i jojoba/matrix/matrix -dict jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1

The other weird thing is that the job described as "Writing final topic/term distributions from
jojoba/mt/model-2 to jojoba/to-output" ran successfully, but if I now run vectordump I always
get a Java heap space error.



________________________________
From: Suneel Marthi <suneel_marthi@yahoo.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentropa80@yahoo.co.uk>
Sent: Wednesday, 31 July 2013 11:01
Subject: Re: Latent Dirichlet Allocatio (cvb)
 

RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex (IntWritable, Text).

So you should see two files generated: jojoba/matrix/matrix and jojoba/matrix/docIndex.

It seems you have been feeding docIndex as input to cvb, which would cause this exception;
it's the matrix that needs to be fed as input to cvb.

So the input to cvb needs to be "jojoba/matrix/matrix".
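
Put together, the sequence can be sketched roughly as below. This is only an illustration: it reuses the commands and flag values already quoted in this thread (the values are not tuned recommendations), while the `run` wrapper and the `BASE` and `CVB_INPUT` variables are hypothetical helper names added here. By default the script only prints the commands; setting RUN=1 would execute them, assuming `mahout` is on the PATH.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set RUN=1 to really invoke mahout (assumes `mahout` is on the PATH).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@" || exit 1      # stop at the first failing step
    else
        echo "$@"
    fi
}

BASE=jojoba   # working directory used in this thread

# 1) Text files -> sequence files.
run mahout seqdirectory -i "$BASE/text-files" -o "$BASE/seqfiles"

# 2) Sequence files -> term-frequency vectors.
run mahout seq2sparse -i "$BASE/seqfiles" -o "$BASE/vectors" -wt tf -nv

# 3) Vectors -> matrix. RowId writes two files under $BASE/matrix:
#    matrix (IntWritable, VectorWritable) and docIndex (IntWritable, Text).
run mahout rowid -i "$BASE/vectors/tf-vectors/" -o "$BASE/matrix"

# 4) cvb reads the matrix file itself, not the parent directory
#    (the parent directory also contains docIndex).
#    Flag values are copied unchanged from the original post.
CVB_INPUT="$BASE/matrix/matrix"
run mahout cvb -i "$CVB_INPUT" -dict "$BASE/vectors/dictionary.file-0" \
    -o "$BASE/to-output" -dt "$BASE/do-output" -k 190 -nt 90000 \
    -mt "$BASE/mt" --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
```

Keeping the cvb input in its own variable makes it easy to see that it points at the matrix file rather than at the rowid output directory.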

Give that a try and let us know.




________________________________
From: Marco <zentropa80@yahoo.co.uk>
To: "user@mahout.apache.org" <user@mahout.apache.org> 
Sent: Wednesday, July 31, 2013 4:20 AM
Subject: Latent Dirichlet Allocatio (cvb)


Hi, I'm new here, so please forgive my limited experience with Mahout.

We're trying to use Mahout (on our Hadoop cluster) to compute topics for almost 14,000
documents.

I've been following this wiki page (http://goo.gl/DcPVjB) but I'm still getting errors.

Here's what I'm doing:

1) creating sequence files from text files (mahout seqdirectory -i jojoba/text-files -o jojoba/seqfiles)
2) creating vectors from sequence files (mahout seq2sparse -i jojoba/seqfiles -o jojoba/vectors -wt tf -nv)
3) launching CVB like this:
mahout cvb -i jojoba/vectors/tf-vectors/ -dict jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1

and I get: Exception in thread "main" java.lang.InterruptedException: Failed to complete iteration 1 stage 1

I later learned here (http://stackoverflow.com/questions/14757162/run-cvb-in-mahout-0-8/)
that I should actually feed cvb a matrix and not the vectors (shouldn't that be clearly stated
in the wiki?).
So then I ran:
mahout rowid -i jojoba/vectors/tf-vectors/ -o jojoba/matrix

3bis) I reran CVB with jojoba/matrix as input and now get
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable

What am I missing?

Thanks a lot for your help