From: Suneel Marthi
Reply-To: Suneel Marthi
To: "user@mahout.apache.org", Marco
Subject: Re: Latent Dirichlet Allocatio (cvb)
Date: Wed, 31 Jul 2013 08:05:27 -0700 (PDT)
Message-ID: <1375283127.29566.YahooMailNeo@web163501.mail.gq1.yahoo.com>
In-Reply-To: <1375282878.40340.YahooMailNeo@web172405.mail.ir2.yahoo.com>
Mailing-List: user@mahout.apache.org

Please work off of Mahout 0.8; there are a lot of fixes and improvements that went into CVB0 in this release.
Correct me here, Jake?

________________________________
From: Marco
To: "user@mahout.apache.org"
Sent: Wednesday, July 31, 2013 11:01 AM
Subject: Re: Latent Dirichlet Allocatio (cvb)

Running:
mahout vectordump -i jojoba/to-output -d jojoba/vectors/dictionary.file-0 -dt sequencefile --vectorSize 10 -sort jojoba/to-output

It's Mahout 0.7 (we're using Cloudera CDH4.2).

________________________________
From: Jake Mannix
To: "user@mahout.apache.org"; Marco
Sent: Wednesday, 31 July 2013 16:51
Subject: Re: Latent Dirichlet Allocatio (cvb)

On Wed, Jul 31, 2013 at 7:44 AM, Marco wrote:

> OK. I'll re-run it without that -nt (which I supposed was NOT optional).

Well, it's not optional if you don't supply a dictionary (which is
optional) - one of the two is necessary, or else the system doesn't know
how big to make the model.

> Meanwhile I've re-run it on a smaller dataset, and though it ran
> successfully (and faster!), when I run vectordump I always get a heap space
> issue, even though we've updated MAHOUT_HEAPSIZE to 10000m.

When you use vectordump, what flags are you giving it? There may be a bug
here. Also, what version of Mahout are you using?

> ________________________________
> From: Jake Mannix
> To: "user@mahout.apache.org"; Marco <zentropa80@yahoo.co.uk>
> Cc: Suneel Marthi
> Sent: Wednesday, 31 July 2013 16:34
> Subject: Re: Latent Dirichlet Allocatio (cvb)
>
> If you're supplying a dictionary file (as you are), I'd suggest not
> specifying the "-nt 90000" option - you're apparently specifying a numTerms
> less than the actual number of terms in some of your vectors. If you
> supply the -dict option, it'll infer the number of terms from reading the
> dictionary, and you don't need to specify it.
>
> On Wed, Jul 31, 2013 at 7:02 AM, Marco wrote:
>
> > Oops! That did the trick.
> >
> > Nonetheless, I think the fact that you have to run "rowid" and generate the
> > matrix should be added to the wiki.
> >
> > After waiting for more than an hour I got an error on
> > "Writing final document/topic inference from lda/matrix/matrix to
> > jojoba/do-output".
> >
> > The error is: org.apache.mahout.math.IndexException: Index 90011 is
> > outside allowable range of [0,90000)
> >
> > Here is how I launched it:
> > mahout cvb -i jojoba/matrix/matrix -dict jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
> >
> > The weird thing is that the job described as "Writing final topic/term
> > distributions from jojoba/mt/model-2 to jojoba/to-output" ran successfully,
> > but if I now run vectordump I always get a Java heap space error.
> >
> > ________________________________
> > From: Suneel Marthi
> > To: "user@mahout.apache.org"; Marco <zentropa80@yahoo.co.uk>
> > Sent: Wednesday, 31 July 2013 11:01
> > Subject: Re: Latent Dirichlet Allocatio (cvb)
> >
> > The RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex
> > (IntWritable, Text).
> >
> > So you should be seeing 2 files generated - jojoba/matrix/matrix and
> > jojoba/matrix/docIndex.
> >
> > It seems you have been feeding docIndex as input to cvb, which would
> > cause this exception; it's the matrix that needs to be fed as input to cvb.
> >
> > So the input to cvb needs to be "jojoba/matrix/matrix".
> >
> > Give that a try and let us know.
> >
> > ________________________________
> > From: Marco
> > To: "user@mahout.apache.org"
> > Sent: Wednesday, July 31, 2013 4:20 AM
> > Subject: Latent Dirichlet Allocatio (cvb)
> >
> > Hi, I'm new here, so forgive my limited experience with Mahout.
> >
> > We're trying to use Mahout (on our Hadoop cluster) to calculate topics
> > on almost 14000 documents.
> >
> > I've been following this wiki page (http://goo.gl/DcPVjB) but am still
> > getting errors.
> >
> > Here's what I'm doing:
> >
> > 1) creating sequence files from text files (mahout seqdirectory -i jojoba/text-files -o jojoba/seqfiles)
> > 2) creating vectors from the sequence files (mahout seq2sparse -i jojoba/seqfiles -o jojoba/vectors -wt tf -nv)
> > 3) launching CVB like this:
> > mahout cvb -i jojoba/vectors/tf-vectors/ -dict jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
> >
> > and I get Exception in thread "main" java.lang.InterruptedException:
> > Failed to complete iteration 1 stage 1
> >
> > I later learned here
> > (http://stackoverflow.com/questions/14757162/run-cvb-in-mahout-0-8/) that
> > I should actually feed cvb a matrix and not the vectors (shouldn't that be
> > clearly stated in the wiki?).
> > So then I ran:
> > mahout rowid -i jojoba/vectors/tf-vectors/ -o jojoba/matrix
> >
> > 3bis) I re-ran CVB giving jojoba/matrix as input and I now get
> > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> > org.apache.mahout.math.VectorWritable
> >
> > What am I missing?
> >
> > Thanks a lot for your help
> >
>
> --
>
>   -jake

--

  -jake
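
For reference, the advice in this thread can be collected into one sequence of commands. This is only a sketch: every mahout flag and path is copied from the commands quoted above, with the two suggested changes applied (cvb reads jojoba/matrix/matrix produced by rowid, and -nt is dropped because -dict is supplied). The helper names `run` and `lda_pipeline` and the variables `RUN` and `BASE` are hypothetical, introduced here for illustration, and nothing has been checked against a Mahout 0.7 or 0.8 installation. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the pipeline as discussed in the thread (assumption: flags are
# exactly those quoted in the messages; not verified against a Mahout install).
# Dry run by default: commands are printed. Set RUN=1 to execute them.

BASE=${BASE:-jojoba}   # hypothetical variable: root directory used in the thread

run() {
  # Execute the command only when RUN=1, otherwise just print it.
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

lda_pipeline() {
  # 1) text files -> sequence files
  run mahout seqdirectory -i "$BASE/text-files" -o "$BASE/seqfiles"
  # 2) sequence files -> term-frequency vectors plus dictionary
  run mahout seq2sparse -i "$BASE/seqfiles" -o "$BASE/vectors" -wt tf -nv
  # 3) vectors -> matrix (IntWritable, VectorWritable) and docIndex
  run mahout rowid -i "$BASE/vectors/tf-vectors/" -o "$BASE/matrix"
  # 4) cvb on the matrix file itself (not docIndex, not the parent directory);
  #    -nt is omitted because the dictionary is given with -dict
  run mahout cvb -i "$BASE/matrix/matrix" \
    -dict "$BASE/vectors/dictionary.file-0" \
    -o "$BASE/to-output" -dt "$BASE/do-output" \
    -k 190 -mt "$BASE/mt" --maxIter 2 -mipd 1 \
    -a 0.01 -e 0.01 -seed 37 -block 1
}

lda_pipeline
```

The vectordump heap-space problem is still open at this point in the thread, so that step is left out of the sketch.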