mahout-dev mailing list archives

From "Vasil Vasilev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-683) LDA Vectorization
Date Fri, 29 Apr 2011 08:12:03 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026901#comment-13026901 ]

Vasil Vasilev commented on MAHOUT-683:
--------------------------------------

Hi Jake,
I didn't know about MAHOUT-458.
Having taken a look, it seems that your patch collects the same information that I extract via
LDAVectorizer. Maybe it really is a better approach to make this part of the LDA inference
process.
However, I think it is worth ending up with a set of vectors (one vector per document) that
can be used directly for clustering with some of the existing algorithms (KMeans, Canopy,
etc.).

One question: I noticed that you normalize the Gamma vector before writing it - why is that?
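
For context on the question above, here is a minimal sketch of what normalizing a gamma vector does. It assumes the standard variational formulation of LDA, in which the entries of a document's gamma vector sum to roughly the document length plus the prior mass, so the raw vector grows with document size. The class and method names are hypothetical; this is not code from either patch and not a Mahout API, and it does not claim to be the reason for the normalization in MAHOUT-458.

```java
import java.util.Arrays;

/** Hypothetical helper, for illustration only; not part of Mahout. */
public final class GammaNormalizer {
    private GammaNormalizer() {}

    /** Returns a copy of gamma scaled so that its entries sum to 1. */
    public static double[] normalize(double[] gamma) {
        double sum = 0.0;
        for (double g : gamma) {
            sum += g;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("gamma must have a positive sum");
        }
        double[] proportions = new double[gamma.length];
        for (int k = 0; k < gamma.length; k++) {
            proportions[k] = gamma[k] / sum;
        }
        return proportions;
    }

    public static void main(String[] args) {
        // Two made-up documents with the same topic mix but different lengths.
        double[] shortDoc = {2.0, 6.0};
        double[] longDoc = {20.0, 60.0};
        System.out.println(Arrays.toString(normalize(shortDoc))); // [0.25, 0.75]
        System.out.println(Arrays.toString(normalize(longDoc)));  // [0.25, 0.75]
    }
}
```

Under that assumption, the normalized vectors are comparable across documents of different lengths, whereas distances between raw gamma vectors would be dominated by length.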

> LDA Vectorization
> -----------------
>
>                 Key: MAHOUT-683
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-683
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Vasil Vasilev
>            Priority: Minor
>              Labels: LDA., Vectorization
>         Attachments: MAHOUT-683.patch
>
>
> Currently the result of the LDA clustering algorithm is a state that describes the probability
> that words in a corpus of documents belong to given topics. This probability is calculated
> for the whole corpus.
> It is interesting, however, to know the average number of words in a given document that
> come from a given topic. This information comes from the gamma vector in the LDA inference
> process. This vector can be used as a representation of the given document for further
> clustering (using algorithms like KMeans, Dirichlet, etc.). In this manner the dimensions of
> a document are reduced to the number of topics specified to the LDA clustering algorithm.
> With the proposed implementation, a set of vectors with reduced dimensions (one vector per
> document), representing the set of documents, is produced from a corpus of documents
> described as vectors and from the last state of the LDA inference process.
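
As an illustration of the idea in the description, the sketch below treats each document as a vector with one entry per topic and assigns it to the nearest of a few centroids, which is the assignment step of a KMeans-style algorithm. The class name, method names, and all numbers are made up for the example; none of this is taken from MAHOUT-683.patch or from Mahout's clustering code.

```java
/** Hypothetical illustration of clustering on per-document topic vectors; not Mahout code. */
public final class TopicVectorClustering {
    private TopicVectorClustering() {}

    /** Squared Euclidean distance between two vectors of equal length. */
    public static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same number of topics");
        }
        double total = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            total += diff * diff;
        }
        return total;
    }

    /** Index of the centroid closest to the document's topic vector. */
    public static int nearestCentroid(double[] doc, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = squaredDistance(doc, centroids[c]);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three topics, so every document is a 3-dimensional vector
        // regardless of vocabulary size.
        double[][] centroids = {
            {0.9, 0.1, 0.0},
            {0.1, 0.1, 0.8},
        };
        double[] doc = {0.7, 0.2, 0.1};
        System.out.println(nearestCentroid(doc, centroids)); // 0
    }
}
```

The point of the sketch is the dimensionality: the distance computation runs over the number of topics, not over the size of the vocabulary.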

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
