hivemall-issues mailing list archives

From takuti <...@git.apache.org>
Subject [GitHub] incubator-hivemall issue #66: [WIP][HIVEMALL-91] Implement Online LDA
Date Thu, 06 Apr 2017 05:34:31 GMT
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/66
  
    @myui I considered designing a prediction UDAF, but IMO your suggestion above,
`sum(t.value * m.score) as score`, is better for now.
    
    To compute the topic distribution from the `lambda` values (i.e., the LDA model),
I would actually prefer to run the E step for each test sample, as [scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/decomposition/online_lda.py#L546-L577)
and the current `getTopicDistribution()` (80a31539bf653e50471346777842ff9478ae352d) do. However,
that requires the prediction UDAF to know the hyper-parameters used for training (e.g., number
of topics, alpha), which is essentially infeasible. In addition, users sometimes want the
posterior probability and label for every topic, as shown below, so a single-column UDAF
output is not sufficient.
    
    | docid | label | prob |
    |:---:|:---:|:---|
    | 1 | 0 | 0.9957867647115234 |
    | 1 | 1 | 0.004213235288476648 |
    | 2 | 0 | 0.0014898943734896843 |
    | 2 | 1 | 0.9985101056265103 |
    
    See [this gist](https://gist.github.com/takuti/d24324e76d4b2ec7dc4b1d50a4d192d8) for details.
    
    Of course, since we do not run the "expectation" step as the theory prescribes, `prob` is
only an approximation. Still, I guess it's sufficient in practice.
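
    For illustration only, the approximation described above can be sketched in plain Python.
This is not Hivemall code: the function name `approx_topic_probs`, the toy `model` and `docs`
data, and the final per-document normalization are all assumptions, the last one made so that
the probabilities of each document sum to 1 as in the table. The weighted sum mirrors
`sum(t.value * m.score)`, and no E step is run.

```python
from collections import defaultdict


def approx_topic_probs(doc_words, model):
    """Approximate per-topic probabilities for one document.

    doc_words: {word: value}           (hypothetical test-sample features)
    model:     {(label, word): score}  (hypothetical lambda-like scores)
    """
    scores = defaultdict(float)
    for (label, word), score in model.items():
        if word in doc_words:
            # Same weighted sum as `sum(t.value * m.score)`, grouped by label.
            scores[label] += doc_words[word] * score
    total = sum(scores.values())
    if total == 0:
        return {}
    # Assumed normalization: divide by the per-document total.
    return {label: s / total for label, s in scores.items()}


# Toy data, made up for this sketch.
model = {(0, "apple"): 3.0, (1, "apple"): 1.0,
         (0, "car"): 1.0, (1, "car"): 3.0}
docs = {1: {"apple": 2.0}, 2: {"car": 1.0, "apple": 1.0}}

# One output row per (docid, label), like the multi-row table above.
for docid, words in docs.items():
    for label, prob in sorted(approx_topic_probs(words, model).items()):
        print(docid, label, prob)
```

    Each document yields one row per topic label, which is why a single-column UDAF result
cannot carry this output.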


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
