mahout-user mailing list archives

From vineeth <vineethrak...@gmail.com>
Subject Re: Using LDA in Mahout 0.7
Date Tue, 30 Oct 2012 03:56:18 GMT
Hello Dan,

Thank you for this reference. I was unable to get LDA running on Mahout 
0.7, so I downgraded to 0.5 and ran LDA there, and it worked. 
Maybe I should try this approach.

Vineeth
On 12-10-29 02:02 PM, Diego Ceccarelli wrote:
> Thanks Dan, that solved it.
>
> On Sun, Oct 28, 2012 at 10:40 PM, DAN HELM <danielhelm@verizon.net> wrote:
>> Hi Diego,
>> A number of us had the same issue when first working with the new CVB
>> algorithm. The vector keys for CVB need to be Integers. You can use the
>> rowid utility to convert the output from seq2sparse to the form needed by
>> CVB, e.g.,
>> http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
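>> For concreteness, a sketch of that conversion (the paths here are
>> illustrative, not taken from your run): rowid rewrites the Text-keyed
>> vectors from seq2sparse into an IntWritable-keyed matrix that cvb accepts:
>>
>> ./bin/mahout rowid -i /tmp/vector/tf-vectors -o /tmp/matrix
>> ./bin/mahout cvb -i /tmp/matrix/matrix -o /tmp/lda-output -k 100 -ow
>>
>> rowid writes two outputs under /tmp/matrix: 'matrix' (the
>> IntWritable-keyed vectors) and 'docIndex' (the mapping back to the
>> original Text keys).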
>> Dan
>>
>> From: Diego Ceccarelli <diego.ceccarelli@gmail.com>
>> To: user@mahout.apache.org
>> Sent: Sunday, October 28, 2012 5:21 PM
>> Subject: Using LDA in Mahout 0.0.7
>>
>> Dear all,
>>
>> I'm trying to use the LDA framework in Mahout and I'm running into
>> some trouble.
>> I read these tutorials [1,2] and decided to apply LDA to a collection of
>> one million tweets to see how it works. I indexed them with Lucene as
>> suggested in [2], then discovered that in the latest version this is not
>> supported and I had to use a sequence file.
>> I saw the 'seqdirectory' util in [2], but it's a bit impractical to create
>> one million documents, each containing a single tweet. So I wrote a small
>> Java app that takes a file where each line is a document and creates a
>> sequence file <Text,Text> containing the id (the line number) and the
>> tweet.
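>> As a rough sketch of such a converter (assuming the Hadoop 1.x-era
>> SequenceFile.Writer API on the classpath; the class name and argument
>> handling are made up for illustration):
>>
>> import java.io.BufferedReader;
>> import java.io.FileReader;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.Text;
>>
>> public class TweetsToSequenceFile {
>>   // args[0]: input text file (one tweet per line)
>>   // args[1]: output SequenceFile path
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = new Configuration();
>>     FileSystem fs = FileSystem.get(conf);
>>     SequenceFile.Writer writer = new SequenceFile.Writer(
>>         fs, conf, new Path(args[1]), Text.class, Text.class);
>>     BufferedReader reader = new BufferedReader(new FileReader(args[0]));
>>     String tweet;
>>     long id = 0;
>>     while ((tweet = reader.readLine()) != null) {
>>       // Key is the line number, value is the raw tweet text.
>>       writer.append(new Text(Long.toString(id++)), new Text(tweet));
>>     }
>>     reader.close();
>>     writer.close();
>>   }
>> }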
>> Then I used the seq2sparse util:
>>
>> ./bin/mahout seq2sparse -i ../lda-hello-world/tweet-sequence-file -o
>> /tmp/vector -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
>>
>> and created the vectors (it succeeded without problems).
>>
>> Now I discovered that lda is now called cvb (why did you change the name?
>> It's a bit confusing...), so I tried to run the command, but I got this
>> error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.hadoop.io.IntWritable
>> (full stack trace here [3])
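>> (As an aside, one illustrative way to see the mismatch is Mahout's
>> seqdumper utility, which prints the key and value classes of a sequence
>> file; run against the tf-vectors from the seq2sparse step above, it shows
>> Text keys, while cvb expects IntWritable:
>>
>> ./bin/mahout seqdumper -i /tmp/vector/tf-vectors | head
>> )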
>>
>> I also tried the local version:
>>
>> ./bin/mahout cvb0_local -i /tmp/vector/tf-vectors  -d
>> /tmp/vector/dictionary.file-0 --numTopics 100 --docOutputFile /tmp/out
>> --topicOutputFile /tmp/topic
>>
>> (why are the parameter names different???)
>> But I got a similar error:
>> Exception in thread "main" java.lang.ClassCastException: java.lang.Integer
>> cannot be cast to java.lang.String
>> (full stack trace here [4])
>>
>> Where am I going wrong? Could you please help me?
>> Thanks
>> Diego
>>
>> [1] https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
>> [2] https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>> [3] http://pastebin.com/nV3T74fe
>> [4] http://pastebin.com/JH1xQHuC
>>
>>
>
>

