mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: 2 questions about lda implementation
Date Tue, 08 May 2012 17:00:25 GMT
Hi Ivan,

  First off, let me say that you should probably start migrating to the new
LDA implementation that came in 0.6. It is invoked via the "mahout cvb..."
command, or by directly launching o.a.m.clustering.lda.cvb.CVB0Driver in your
code. The old LDA you're referencing will be going away soon.

  But for now, I'll try to answer your questions on the old impl:

On Tue, May 8, 2012 at 8:54 AM, ivan obeso <sendero.luminoso@gmail.com> wrote:

> I'm using Mahout 0.6. I have run the "mahout lda..." command-line tool to
> apply LDA to a corpus. Now I want to call it from my Java program, and I'm
> having a lot of problems because it crashes. Can someone give me example
> Java code that runs correctly?
>
> Looking at the output of LDA, I have 2 folders:
> - docTopics: contains a Text key (the document ID) and a Vector value
> (the membership of this document in each topic).
> - state-n: I assume that the IntPairWritable is (topicID, wordID), so each
> topic has one entry for every word in the corpus. I don't know what the
> DoubleWritable value is. I think it's the membership between the topic and
> the word, but I don't know what kind of measure is used. For example,
> here is a split that I have printed:
>

You're correct here: the values are unnormalized log( p(wordId | topicId) )
values.  To recover probabilities, you need to exponentiate them and then
normalize, so that summing over all the values for a given topicId gives 1.
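
As a rough sketch of that exponentiate-and-normalize step (not code from
Mahout itself): the class name TopicWordNormalizer and the nested-map layout
below are hypothetical, and the sketch assumes you have already read the
(topicId, wordId) -> value pairs out of the state-n files into memory. It
subtracts the per-topic maximum before exponentiating, which is only a
numerical-stability tweak and cancels out in the normalization.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: turns unnormalized log values
// into per-topic probability distributions over words.
public class TopicWordNormalizer {

    /**
     * @param logValues topicId -> (wordId -> unnormalized log p(word | topic))
     * @return topicId -> (wordId -> probability); each topic's values sum to 1
     */
    public static Map<Integer, Map<Integer, Double>> normalize(
            Map<Integer, Map<Integer, Double>> logValues) {
        Map<Integer, Map<Integer, Double>> result = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> topic : logValues.entrySet()) {
            // Find the largest log value for this topic (stability tweak only).
            double max = Double.NEGATIVE_INFINITY;
            for (double v : topic.getValue().values()) {
                max = Math.max(max, v);
            }
            // Exponentiate each shifted value and accumulate the sum.
            double sum = 0.0;
            Map<Integer, Double> probs = new HashMap<>();
            for (Map.Entry<Integer, Double> word : topic.getValue().entrySet()) {
                double e = Math.exp(word.getValue() - max);
                probs.put(word.getKey(), e);
                sum += e;
            }
            // Divide by the sum so the values for this topic add up to 1.
            for (Map.Entry<Integer, Double> word : probs.entrySet()) {
                word.setValue(word.getValue() / sum);
            }
            result.put(topic.getKey(), probs);
        }
        return result;
    }

    public static void main(String[] args) {
        // A few of the topic-4 values from the printed split, for illustration.
        // A real run would include every word for the topic.
        Map<Integer, Double> topic4 = new HashMap<>();
        topic4.put(17851, -7.102634146221668);
        topic4.put(17852, 3.440324743165531);
        topic4.put(17853, 1.118778127312393);

        Map<Integer, Map<Integer, Double>> logValues = new HashMap<>();
        logValues.put(4, topic4);

        for (Map.Entry<Integer, Double> e : normalize(logValues).get(4).entrySet()) {
            System.out.println("(4, " + e.getKey() + ") " + e.getValue());
        }
    }
}
```

Note that the normalization is only meaningful when the map for a topic
contains all of that topic's words; normalizing over a partial split gives
probabilities relative to that subset only.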


> ...
> (4, 17847) -28.424714110200803
> (4, 17848) -32.54168874531223
> (4, 17849) -51.954687480087074
> (4, 17850) -1.8811618929248652E-12
> (4, 17851) -7.102634146221668
> (4, 17852) 3.440324743165531
> (4, 17853) 1.118778127312393
> (4, 17854) 2.2973859313207385
> (4, 17855) 2.1602327860824015
> (4, 17856) -2.5362957334351677E-6
> (4, 17857) -32.80559170476965
> (4, 17858) -1.9791269423308222E-7
> ...
>
> Can somebody help explain this to me?
>



-- 

  -jake
