mahout-user mailing list archives

From Marco <zentrop...@yahoo.co.uk>
Subject Re: Latent Dirichlet Allocation (cvb)
Date Wed, 31 Jul 2013 15:05:30 GMT
Already looked there. No cvb example or vectordump :(




________________________________
From: Suneel Marthi <suneel_marthi@yahoo.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentropa80@yahoo.co.uk>

Sent: Wednesday, 31 July 2013 16:55
Subject: Re: Latent Dirichlet Allocation (cvb)
 

@Marco, look at examples/bin/cluster-reuters.sh for reference on how to run cvb (or any other
clustering algo in Mahout)
and also on how to invoke the vectordump with the option flags.




________________________________
From: Jake Mannix <jake.mannix@gmail.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentropa80@yahoo.co.uk>

Sent: Wednesday, July 31, 2013 10:51 AM
Subject: Re: Latent Dirichlet Allocation (cvb)


On Wed, Jul 31, 2013 at 7:44 AM, Marco <zentropa80@yahoo.co.uk> wrote:

> OK, I'll re-run it without that -nt (which I supposed was NOT optional).
>

Well, it's not optional if you don't supply a dictionary (which is
optional) - one of the two is necessary, or else the system doesn't know
how big to make the model.
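
A minimal sketch of the re-run under that advice, assuming the paths and flag values quoted later in this thread: the cvb call keeps -dict and drops -nt, and takes the matrix file produced by rowid as its input. The DRY_RUN variable and the run helper are illustrative additions (not part of Mahout); by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: paths and flag values are copied from this thread.
# DRY_RUN and run() are hypothetical helpers, not part of Mahout.
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "$@"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

BASE=jojoba
CVB_INPUT="$BASE/matrix/matrix"   # the matrix file written by rowid, not docIndex

# Turn the tf vectors into a matrix plus a docIndex.
run mahout rowid -i "$BASE/vectors/tf-vectors/" -o "$BASE/matrix"

# Run cvb with the dictionary and without -nt, so the number of terms
# is inferred from the dictionary.
run mahout cvb -i "$CVB_INPUT" -dict "$BASE/vectors/dictionary.file-0" \
  -o "$BASE/to-output" -dt "$BASE/do-output" -k 190 -mt "$BASE/mt" \
  --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
```

Setting DRY_RUN=0 would actually launch the jobs; leaving it unset just shows the two command lines for review.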


> meanwhile I've re-run it on a smaller dataset and, though it ran
> successfully (and faster!), when I run vectordump I always get a heap space
> error, even though we've raised MAHOUT_HEAPSIZE to 10000m
>

When you use vectordump, what flags are you giving it?  There may be a bug
here.  Also, what version of Mahout are you using?


>
>
>
>
> ________________________________
> From: Jake Mannix <jake.mannix@gmail.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentropa80@yahoo.co.uk>
> Cc: Suneel Marthi <suneel_marthi@yahoo.com>
> Sent: Wednesday, 31 July 2013 16:34
> Subject: Re: Latent Dirichlet Allocation (cvb)
>
>
> If you're supplying a dictionary file (as you are), I'd suggest not
> specifying the "-nt 90000" option - you're apparently specifying a numTerms
> less than the actual number of terms in some of your vectors.  If you
> supply the -dict option, it'll infer the number of terms from reading the
> dictionary, and you don't need to specify it.
>
>
> On Wed, Jul 31, 2013 at 7:02 AM, Marco <zentropa80@yahoo.co.uk> wrote:
>
> > oops! that did the trick.
> >
> > nonetheless i think the fact that you have to do "rowid" and generate the
> > matrix should be added to the wiki.
> >
> > after waiting for more than an hour I got an error on
> > Writing final document/topic inference from lda/matrix/matrix to
> > jojoba/do-output
> >
> > the error is : org.apache.mahout.math.IndexException: Index 90011 is
> > outside allowable range of [0,90000)
> >
> > Here is how I launched it:
> > mahout cvb -i jojoba/matrix/matrix -dict jojoba/vectors/dictionary.file-0
> > -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt
> > --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
> >
> > weird thing is also that the job described as "Writing final topic/term
> > distributions from jojoba/mt/model-2 to jojoba/to-output" ran successfully,
> > but if I now do a vectordump I always get a Java heap space error
> >
> >
> >
> > ________________________________
> > From: Suneel Marthi <suneel_marthi@yahoo.com>
> > To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco <zentropa80@yahoo.co.uk>
> > Sent: Wednesday, 31 July 2013 11:01
> > Subject: Re: Latent Dirichlet Allocation (cvb)
> >
> >
> > RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex
> > (IntWritable, Text).
> >
> > So you should be seeing 2 files generated -  jojoba/matrix/matrix and
> > jojoba/matrix/docIndex.
> >
> > Seems like you have been feeding docIndex as input to cvb, which would
> > cause this exception; it's the matrix that needs to be fed as input to cvb.
> >
> > So the input to cvb needs to be "jojoba/matrix/matrix".
> >
> > Give that a try and let us know.
> >
> >
> >
> >
> > ________________________________
> > From: Marco <zentropa80@yahoo.co.uk>
> > To: "user@mahout.apache.org" <user@mahout.apache.org>
> > Sent: Wednesday, July 31, 2013 4:20 AM
> > Subject: Latent Dirichlet Allocation (cvb)
> >
> >
> > Hi, I'm new here so forgive my little experience with Mahout.
> >
> > We're trying to use Mahout (on our hadoop cluster) for calculating topics
> > on almost 14000 documents.
> >
> > I've been following this wiki page (http://goo.gl/DcPVjB) but I'm still
> > getting errors.
> >
> > Here's what I'm doing:
> >
> > 1) creating sequence files from text files (mahout seqdirectory -i
> > jojoba/text-files -o jojoba/seqfiles)
> > 2) creating vectors from sequence files (mahout seq2sparse -i
> > jojoba/seqfiles -o jojoba/vectors -wt tf -nv)
> > 3) launching CVB like this:
> > mahout cvb -i jojoba/vectors/tf-vectors/ -dict
> > jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output
> > -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37
> > -block 1
> >
> > and I get Exception in thread "main" java.lang.InterruptedException:
> > Failed to complete iteration 1 stage 1
> >
> > I later learned here (
> > http://stackoverflow.com/questions/14757162/run-cvb-in-mahout-0-8/) that
> > I should actually feed cvb a matrix and not the vectors (shouldn't it be
> > clearly stated in the wiki?).
> > So then I ran:
> > mahout rowid -i jojoba/vectors/tf-vectors/ -o jojoba/matrix
> >
> > 3bis) I rerun CVB giving jojoba/matrix as input and I now get
> > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> > org.apache.mahout.math.VectorWritable
> >
> > What am I missing?
> >
> > Thanks a lot for your help
> >
>
>
>
> --
>
>   -jake
>



-- 

  -jake