mahout-dev mailing list archives

From "Oleksandr Petrov (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-458) The LDA output does not include the topic-probability distribution per document (p(z|d)). It outputs only the topics and corresponding words.
Date Tue, 28 Sep 2010 20:22:35 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915905#action_12915905 ]

Oleksandr Petrov commented on MAHOUT-458:
-----------------------------------------

I ran into exactly the same problem as Himanshu did, although I disagree about the necessity.

First, I need filenames to be in any format. When handling large datasets, I often need
to add letters as markers, or periods, underscores and dashes.
Second, given a dictionary in a non-text format and the input vectors, you can always
map the words back easily and determine which vectors a given word/gram belongs
to (right from the vector). That's a map/reduce job. Space efficiency may become more important
than representativeness.
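
A minimal sketch of that map/reduce idea, in plain Python rather than Mahout's own classes. The dictionary layout (term index to word), the sparse-vector layout (term index to weight), and the file names are all assumptions made for illustration, not Mahout's actual formats:

```python
from collections import defaultdict


def map_vector(doc_name, vector):
    """Map step: emit (term_index, doc_name) for every non-zero entry."""
    for term_index, weight in vector.items():
        if weight != 0:
            yield term_index, doc_name


def reduce_postings(pairs, dictionary):
    """Reduce step: group document names by term, resolving index -> word."""
    postings = defaultdict(set)
    for term_index, doc_name in pairs:
        postings[dictionary[term_index]].add(doc_name)
    return {word: sorted(docs) for word, docs in postings.items()}


# Hypothetical dictionary and input vectors; the names deliberately
# contain periods, underscores and dashes.
dictionary = {0: "topic", 1: "model", 2: "cluster"}
vectors = {
    "doc_A-01.txt": {0: 2.0, 1: 1.0},
    "doc_B.02": {1: 3.0, 2: 1.0},
}

pairs = [pair for name, vec in vectors.items() for pair in map_vector(name, vec)]
word_to_docs = reduce_postings(pairs, dictionary)
```

Here `word_to_docs` maps each word to the vectors it occurs in, so nothing beyond the dictionary and the vectors themselves has to be stored.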

That said, the design philosophy of Mahout is unclear and somewhat inconsistent: LDA is implemented
this way, whereas K-means does include the names of the source vector/file. DirichletCluster is implemented
yet another way: it's generic and is not derived (at least in 0.3) from ClusterBase. That kind
of inconsistency is a potential source of big problems. Every driver should share exactly the
same top-level design, even if a lot of things differ "under the hood".

Himanshu, thank you a whole lot for the contribution, you've done a great job on that. 

> The LDA output does not include the topic-probability distribution per document (p(z|d)).
> It outputs only the topics and corresponding words.
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-458
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-458
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Himanshu Gahlot
>             Fix For: 0.4
>
>         Attachments: MAHOUT-458.patch
>
>
> The current implementation of LDA outputs only topics and their words. Many applications
> need the p(z|d) values of a document to use this vector as a reduced representation of the
> document (dimensionality reduction of document). We need to introduce a new key which would
> keep track of the gamma values for each document (as obtained from the document.infer() method)
> and write these to the output stream; finally, PrintLDATopics should output these values
> per document id. Outputting the probabilities of words in a topic would also provide
> a more meaningful output.
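
As an illustration of the request above, here is a rough sketch of turning per-document gamma values into p(z|d). It assumes the gamma values are the parameters of the per-document variational Dirichlet posterior, in which case the expected topic proportions are gamma_k divided by the sum over all topics. The function name and the sample numbers are hypothetical, and this is plain Python, not the code in the attached patch:

```python
def topic_distribution(gamma):
    """Normalize per-document gamma values into p(z|d).

    Assumes gamma holds Dirichlet parameters, so the posterior mean
    for topic k is gamma[k] / sum(gamma).
    """
    total = sum(gamma)
    if total <= 0:
        raise ValueError("gamma values must sum to a positive number")
    return [g / total for g in gamma]


# Hypothetical gamma vectors for two documents and three topics.
doc_gammas = {
    "doc-1": [8.0, 1.0, 1.0],
    "doc-2": [0.5, 0.5, 3.0],
}

# One row per document id: a reduced, topic-space representation.
doc_topics = {doc_id: topic_distribution(g) for doc_id, g in doc_gammas.items()}
```

Each row of `doc_topics` sums to 1 and can be used directly as the low-dimensional representation of the document.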

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

