GitHub user jingyimei opened a pull request:
https://github.com/apache/madlib/pull/232
Multiple LDA improvements and fixes
Co-author: Nikhil Kak (nkak@pivotal.io)
This PR addresses the following issues:
JIRAs
MADLIB-1160
MADLIB-1201
1. Ensure that the output of lda_train is consistent with the output of lda_get_word_topic_count
2. Add a helper function that maps each wordid to the corresponding topicid
assigned in the output table.
3. Address LDA topicid index inconsistency issue
4. Fix LDA lda_get_topic_desc getting wrong top_k words issue
All the commits are independent of each other and can be reviewed separately, which might
be easier than reviewing the files.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/madlib/madlib lda_output_fix_final
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/232.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #232
----
commit a99883dc60877974e1a651a48489a08ec66584a3
Author: Jingyi Mei and Nikhil Kak <jmei+nkak@...>
Date: 2018-01-30T21:58:11Z
Fix lda output inconsistency bug and add install check test
JIRA: MADLIB-1201
Fixed the issue of the output of lda_train and lda_get_word_topic_count
not matching each other.
Also added an install check that validates that the output of lda_train and
lda_get_word_topic_count are consistent with each other.
See JIRA for more details and an example.
commit f0664230153ebe254e4c98e51ebc41bc7faaf327
Author: Jingyi Mei <jmei@...>
Date: 2018-01-31T02:20:59Z
LDA: Add helper function to map wordid and topicid
JIRA: MADLIB-1160
This commit adds a helper function that maps each wordid to the
corresponding topicid assigned in the output table. Duplicate lines
are removed from the final result.
Also adds a workaround for GPDB 4.3 svec.
In GPDB 4.3, we cannot call madlib.svec directly on text.
Instead, we have to call madlib.svec_from_string to convert the
text. This commit fixes this issue so that the new helper function
madlib.lda_get_word_topic_mapping works on both GPDB 5 and GPDB 4.
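The mapping described in this commit can be illustrated with a minimal Python sketch. It is not the MADlib implementation. It assumes that each document row carries a list of unique wordids, a parallel list of counts, and one topicid per word occurrence; the function name word_topic_mapping and this layout are hypothetical and chosen only for illustration.

```python
def word_topic_mapping(docid, words, counts, topic_assignment):
    """Return sorted, unique (docid, wordid, topicid) tuples for one document.

    Assumed layout (hypothetical): `words` holds unique wordids, `counts`
    holds how often each one occurs, and `topic_assignment` holds one
    topicid per word occurrence, in the same order.
    """
    # Repeat each wordid by its count so it lines up with the
    # per-occurrence topic assignments.
    expanded = [w for w, c in zip(words, counts) for _ in range(c)]
    if len(expanded) != len(topic_assignment):
        raise ValueError("counts do not match topic_assignment length")
    # Building a set drops duplicate (wordid, topicid) pairs.
    return sorted({(docid, w, t) for w, t in zip(expanded, topic_assignment)})


rows = word_topic_mapping(1, [3, 7], [2, 1], [0, 0, 1])
print(rows)
```

Here wordid 3 occurs twice with topicid 0, so the pair appears only once in the result.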
commit a062acbf85d7044eaa37627a3904e456ab4aa309
Author: Jingyi Mei <jmei@...>
Date: 2018-01-31T20:21:10Z
Address LDA topicid index inconsistency issue
JIRA: MADLIB-1160
This commit fixes the topicid inconsistency between madlib.lda_train
and madlib.lda_get_topic_desc, where the former used a 0-based index
and the latter a 1-based index. Now both start at 0.
commit 7569049ba6bea5c4526db91478cbb165c79a2e60
Author: Jingyi Mei <jmei@...>
Date: 2018-01-31T20:32:19Z
Fix LDA lda_get_topic_desc getting wrong top_k words issue
JIRA: MADLIB-1160
Previously, madlib.lda_get_topic_desc returned only the top k - 1 words in
the result table. This commit fixes it to return the top k.
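The intended behavior after the last two commits (0-based topicids and exactly top k words per topic) can be sketched in Python. This is only an illustration under stated assumptions, not the code in the PR: the function top_words_per_topic and the word_topic_counts[wordid][topicid] layout are hypothetical.

```python
def top_words_per_topic(word_topic_counts, top_k):
    """Map each 0-based topicid to its top_k wordids by count.

    `word_topic_counts[wordid][topicid]` is an assumed (hypothetical)
    layout holding how often a word was assigned to a topic.
    """
    num_topics = len(word_topic_counts[0]) if word_topic_counts else 0
    result = {}
    for topicid in range(num_topics):  # topicids start at 0
        # Rank wordids by their count for this topic, highest first.
        ranked = sorted(
            range(len(word_topic_counts)),
            key=lambda wordid: word_topic_counts[wordid][topicid],
            reverse=True,
        )
        # Slice exactly top_k entries, not top_k - 1.
        result[topicid] = ranked[:top_k]
    return result


desc = top_words_per_topic([[5, 0], [3, 1], [1, 4]], 2)
print(desc)
```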
----