madlib-dev mailing list archives

From jingyimei <...@git.apache.org>
Subject [GitHub] madlib pull request #232: Multiple LDA improvements and fixes
Date Wed, 07 Feb 2018 22:43:40 GMT
GitHub user jingyimei opened a pull request:

    https://github.com/apache/madlib/pull/232

    Multiple LDA improvements and fixes

    Co-author: Nikhil Kak (nkak@pivotal.io)
    
    This PR addresses the following issues:
    
    JIRAs
    MADLIB-1160
    MADLIB-1201
    
    1. Ensure that the output of lda_train is consistent with the output of lda_get_word_topic_count
    2. Add a helper function that maps each wordid to the corresponding topicid assigned in the output table.
    3. Address LDA topicid index inconsistency issue
    4. Fix LDA lda_get_topic_desc getting wrong top_k words issue
    
    All the commits are independent of each other and can be reviewed separately, which might be easier than reviewing the files.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib lda_output_fix_final

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/232.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #232
    
----
commit a99883dc60877974e1a651a48489a08ec66584a3
Author: Jingyi Mei and Nikhil Kak <jmei+nkak@...>
Date:   2018-01-30T21:58:11Z

    Fix lda output inconsistency bug and add install check test
    
    JIRA: MADLIB-1201
    
    Fixed the issue of the output of lda_train and lda_get_word_topic_count
    not matching each other.
    
    Also added an install check that validates that the output of lda_train and
    lda_get_word_topic_count are consistent with each other.
    See the JIRA for more details and an example.

commit f0664230153ebe254e4c98e51ebc41bc7faaf327
Author: Jingyi Mei <jmei@...>
Date:   2018-01-31T02:20:59Z

    LDA: Add helper function to map wordid and topicid
    
    JIRA: MADLIB-1160
    
    This commit adds a helper function that maps each wordid to the
    corresponding topicid assigned in the output table. Duplicate lines
    are removed from the final result.
    
    Also adds a workaround for GPDB4.3 svec
    
    In GPDB4.3, we cannot call madlib.svec directly on a text
    format. Instead, we have to call madlib.svec_from_string to convert the
    text. This commit fixes this issue so the new helper function
    madlib.lda_get_word_topic_mapping can work on both gpdb5 and gpdb4.
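
The deduplication described in this commit can be illustrated with a minimal Python sketch. It is not the MADlib implementation: the function name `word_topic_mapping` and the input layout (aligned lists of word ids and topic assignments per document) are assumptions made for illustration only.

```python
def word_topic_mapping(documents):
    """Return the sorted, de-duplicated (wordid, topicid) pairs.

    `documents` is a hypothetical list of (words, topic_assignment) pairs
    in which both sequences are aligned position by position.
    """
    pairs = set()  # a set drops duplicate (wordid, topicid) pairs
    for words, topics in documents:
        if len(words) != len(topics):
            raise ValueError("words and topic assignments must align")
        pairs.update(zip(words, topics))
    return sorted(pairs)


# Illustrative data only: word 3 is assigned topic 0 three times overall.
docs = [
    ([0, 3, 3, 5], [1, 0, 0, 2]),
    ([3, 5], [0, 1]),
]
mapping = word_topic_mapping(docs)
print(mapping)
```

A word can still appear more than once in the result when it is assigned to different topics; only identical pairs are collapsed.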

commit a062acbf85d7044eaa37627a3904e456ab4aa309
Author: Jingyi Mei <jmei@...>
Date:   2018-01-31T20:21:10Z

    Address LDA topicid index inconsistency issue
    
    JIRA: MADLIB-1160
    
    This commit fixes the topicid inconsistency between madlib.lda_train
    and madlib.lda_get_topic_desc, where the former uses a 0-based index
    and the latter uses a 1-based index. Now both start at 0.
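
Why the mismatch matters can be shown with a small Python sketch. The helper `label_topics` and the sample words are hypothetical and do not reflect MADlib's internals; the sketch only shows how a 0-based topic id looked up in a 1-based description table points at the wrong topic or at none.

```python
def label_topics(topic_rows, start=0):
    """Key each topic description by its topic id, counting from `start`."""
    return {topicid: row for topicid, row in enumerate(topic_rows, start=start)}


# Illustrative topic descriptions, in model order.
topic_rows = [["cat", "dog"], ["stock", "bond"], ["rain", "snow"]]

zero_based = label_topics(topic_rows)           # ids 0, 1, 2
one_based = label_topics(topic_rows, start=1)   # ids 1, 2, 3

assigned_topic = 1  # a 0-based id, as produced during training
print(zero_based[assigned_topic])  # description that matches the assignment
print(one_based[assigned_topic])   # shifted by one: a different topic
```

With both sides counting from 0, the id written during training can be used directly as the lookup key for the description.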

commit 7569049ba6bea5c4526db91478cbb165c79a2e60
Author: Jingyi Mei <jmei@...>
Date:   2018-01-31T20:32:19Z

    Fix LDA lda_get_topic_desc getting wrong top_k words issue
    
    JIRA: MADLIB-1160
    
    Previously, madlib.lda_get_topic_desc returned only the top k - 1 words
    in the result table. This commit fixes it to return the top k.
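
The expected behavior can be sketched in Python as follows. This is an illustration under assumptions, not the MADlib code: `top_k_words` and the sample probabilities are hypothetical, and the actual cause of the off-by-one in MADlib is not stated in this message.

```python
def top_k_words(word_probs, k):
    """Return the k most probable (word, probability) pairs.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(word_probs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]  # slicing to k - 1 here would drop the k-th word


# Illustrative per-topic word probabilities.
probs = {"apple": 0.4, "pear": 0.3, "plum": 0.2, "fig": 0.1}
top3 = top_k_words(probs, 3)
print(top3)
```

A check of this kind (the result has exactly k rows when at least k words exist) is a natural candidate for an install-check style test.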

----

