From: jingyimei
To: dev@madlib.apache.org
Reply-To: dev@madlib.apache.org
Subject: [GitHub] madlib pull request #232: Multiple LDA improvements and fixes
Content-Type: text/plain
Message-Id: <20180212233920.09518E178C@git1-us-west.apache.org>
Date: Mon, 12 Feb 2018 23:39:20 +0000 (UTC)

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/232#discussion_r167720593

--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
\b Arguments
data_table
-
TEXT. The name of the table storing the training dataset. Each row is +
TEXT. Name of the table storing the training dataset. Each row is in the form <docid, wordid, count> where \c docid, \c wordid, and \c count - are non-negative integers. - + are non-negative integers. The \c docid column refers to the document ID, the \c wordid column is the word ID (the index of a word in the vocabulary), and \c count is the - number of occurrences of the word in the document. - - Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these - exact names in the data_table.
+ number of occurrences of the word in the document. Please note: + + - \c wordid must be + contiguous integers going from from 0 to \c voc_size − \c 1. + - column names for \c docid, \c wordid, and \c count are currently fixed, + so you must use these exact names in the data_table. + + The function Term Frequency + can be used to generate vocabulary in the required format from raw documents. +
model_table
-
TEXT. The name of the table storing the learned models. This table has one row and the following columns. +
TEXT. This is an output table generated by LDA which contains the learned model. + It has one row with the following columns: - + - + - + - +
voc_sizeINTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size − \c 1. A data validation routine is called to validate the dataset.INTEGER. Size of the vocabulary. As mentioned above for the input + table, \c wordid consists of contiguous integers going + from 0 to \c voc_size − \c 1. +
topic_num INTEGER. Number of topics.
alphaDOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).DOUBLE PRECISION. Dirichlet prior for the per-document + topic multinomial.
betaDOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01).DOUBLE PRECISION. Dirichlet prior for the per-topic + word multinomial.
modelBIGINT[].BIGINT[]. The encoded model description (not human readable).
output_data_table
-
TEXT. The name of the table to store the output data. It has the following columns: +
TEXT. The name of the table generated by LDA that stores + the output data. It has the following columns: - + - + - + - + - + - +
docidINTEGER.INTEGER. Document id from input 'data_table'.
wordcountINTEGER.INTEGER. Count of number of words in the document, + including repeats. For example, if a word appears 3 times + in the document, it is counted 3 times.
wordsINTEGER[].INTEGER[]. Array of \c wordid in the document, not + including repeats. For example, if a word appears 3 times + in the document, it appears only once in the \c words array.
countsINTEGER[].INTEGER[]. Frequency of occurance of a word in the document, + indexed the same as the \c words array above. For example, if the + 2nd element of the \c counts array is 4, it means that the word + in the 2nd element of the \c words array occurs 4 times in the + document.
topic_countINTEGER[].INTEGER[]. Array of the count of words in the document + that correspond to each topic.
topic_assignmentINTEGER[].INTEGER[]. Array indicating which topic each word in the + document corresponds to. This array is of length \c wordcount.
voc_size
-
INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size − \c 1. A data validation routine is called to validate the dataset.
+
INTEGER. Size of the vocabulary. As mentioned above for the + input 'data_table', \c wordid consists of continous integers going + from 0 to \c voc_size − \c 1. +
topic_num
-
INTEGER. Number of topics.
+
INTEGER. Desired number of topics.
iter_num
-
INTEGER. Number of iterations (e.g. 60).
+
INTEGER. Desired number of iterations.
alpha
-
DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num).
+
DOUBLE PRECISION. Dirichlet prior for the per-document topic + multinomial (e.g., 50/topic_num is a typical value to start with).
--- End diff --

I found that different libraries use different starting values for alpha, e.g. 1/k, 5/k, and 0.1. We could mention that this value (50/k) is the one suggested in the Griffiths and Steyvers paper.

---
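
To make the comparison concrete, here is a minimal sketch that tabulates the starting values mentioned above for a given number of topics k. The helper name `alpha_candidates` is hypothetical and is not part of MADlib; it only evaluates the formulas quoted in this comment and does not recommend one over another.

```python
def alpha_candidates(topic_num):
    """Return the alpha starting values discussed above for k = topic_num.

    Hypothetical helper for illustration only; not a MADlib function.
    """
    if topic_num <= 0:
        raise ValueError("topic_num must be a positive integer")
    k = float(topic_num)
    return {
        "50/k": 50.0 / k,  # value attributed above to Griffiths and Steyvers
        "5/k": 5.0 / k,    # alternative seen in other libraries
        "1/k": 1.0 / k,    # alternative seen in other libraries
        "0.1": 0.1,        # fixed value, independent of k
    }


if __name__ == "__main__":
    # Print the candidates for a model with 100 topics.
    for label, value in alpha_candidates(100).items():
        print(f"{label:>5} -> {value}")
```

With 100 topics, the 50/k heuristic gives 0.5, which is noticeably larger than the 1/k and 5/k alternatives, so it may be worth stating the source of the suggestion in the documentation.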