mahout-user mailing list archives

From David Hall <d...@cs.berkeley.edu>
Subject Re: extract p(doc|topic) from LDA
Date Mon, 07 Jun 2010 18:42:49 GMT
On Mon, Jun 7, 2010 at 5:06 AM, Avishay Livne1 <AVISHAYL@il.ibm.com> wrote:
> I modified
> $MAHOUT_HOME/utils/src/main/java/org/apache/mahout/clustering/lda/LDAPrintTopics.java
> so that the score is printed alongside each word, but the interpretation
> of the scores is somewhat obscure.
> I see values in the range of -8 to +6. I assumed the values should
> represent P(word | topic) or log(P(word | topic)), but these values fall
> in a different range.
> How should I interpret these values? Is there a simple way to retrieve
> P(word | topic)?

Sorry about that. The scores are log p(word|topic) + constant, because
they're normalized online during the E-step, and so the serialized
values don't need to be normalized. You can normalize them by
computing the log-sum-exp of all of those values for a topic (the log
of the sum of their exponentials) and subtracting it from each score.
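
As a rough illustration of that normalization, here is a minimal Python
sketch. It is not part of Mahout: the function name and the sample scores
are made up for the example, and it assumes you have collected the scores
for every word in one topic, not just the top words that get printed.

```python
import math

def normalize_log_scores(scores):
    """Convert unnormalized log scores (log p + constant) to log probabilities.

    Hypothetical helper, not a Mahout API. `scores` must hold the scores of
    all words in a single topic.
    """
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    # Subtracting the log-sum-exp cancels the unknown constant.
    return [s - log_sum for s in scores]

# Made-up scores for a four-word topic, in the -8 to +6 range seen above.
topic_scores = [6.0, 2.5, -1.0, -8.0]
log_probs = normalize_log_scores(topic_scores)
probs = [math.exp(lp) for lp in log_probs]  # p(word|topic), sums to 1
```

Because the constant cancels, adding any fixed offset to all the scores of
a topic leaves the resulting probabilities unchanged.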

>
> Thanks,
> Avishay.
>
> From:    Avishay Livne1/Haifa/IBM@IBMIL
> To:      user@mahout.apache.org
> Date:    06/06/2010 03:16 PM
> Subject: extract p(doc|topic) from LDA
>
> Hi,
>
> I'm trying to use LDA for a collaborative filtering task, where I need to
> predict the rating a user (document) will give to a movie (word).
> I ran LDA and constructed T topics, but I can only print the most frequent
> words (movies) per topic.
> Is it possible to extract p(document|topic) or p(word|topic) from LDA's
> output? (document = new user, word = movie).
>
> Best regards,
> Avishay
