Github user jingyimei commented on a diff in the pull request:
https://github.com/apache/madlib/pull/232#discussion_r167720593
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
\b Arguments
<dl class="arglist">
<dt>data_table</dt>
- <dd>TEXT. The name of the table storing the training dataset. Each row is
+ <dd>TEXT. Name of the table storing the training dataset. Each row is
in the form <tt><docid, wordid, count></tt> where \c docid,
\c wordid, and \c count
- are non-negative integers.
-
+ are non-negative integers.
The \c docid column refers to the document ID, the \c wordid column is the
word ID (the index of a word in the vocabulary), and \c count is the
- number of occurrences of the word in the document.
-
- Please note that column names for \c docid, \c wordid, and \c count are currently
fixed, so you must use these
- exact names in the data_table.</dd>
+ number of occurrences of the word in the document. Please note:
+
+ - \c wordid must be
+ contiguous integers going from from 0 to \c voc_size − \c 1.
+ - column names for \c docid, \c wordid, and \c count are currently fixed,
+ so you must use these exact names in the data_table.
+
+ The function <a href="group__grp__text__utilities.html">Term Frequency</a>
+ can be used to generate vocabulary in the required format from raw documents.
+ </dd>
<dt>model_table</dt>
- <dd>TEXT. The name of the table storing the learned models. This table has
one row and the following columns.
+ <dd>TEXT. This is an output table generated by LDA which contains the learned
model.
+ It has one row with the following columns:
<table class="output">
<tr>
<th>voc_size</th>
- <td>INTEGER. Size of the vocabulary. Note that the \c wordid should
be continous integers starting from 0 to \c voc_size − \c 1. A data validation
routine is called to validate the dataset.</td>
+ <td>INTEGER. Size of the vocabulary. As mentioned above for the
input
+ table, \c wordid consists of contiguous integers going
+ from 0 to \c voc_size − \c 1.
+ </td>
</tr>
<tr>
<th>topic_num</th>
<td>INTEGER. Number of topics.</td>
</tr>
<tr>
<th>alpha</th>
- <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic
multinomial (e.g. 50/topic_num).</td>
+ <td>DOUBLE PRECISION. Dirichlet prior for the per-document
+ topic multinomial.</td>
</tr>
<tr>
<th>beta</th>
- <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word
multinomial (e.g. 0.01).</td>
+ <td>DOUBLE PRECISION. Dirichlet prior for the per-topic
+ word multinomial.</td>
</tr>
<tr>
<th>model</th>
- <td>BIGINT[].</td>
+ <td>BIGINT[]. The encoded model description (not human readable).</td>
</tr>
</table>
</dd>
<dt>output_data_table</dt>
- <dd>TEXT. The name of the table to store the output data. It has the following
columns:
+ <dd>TEXT. The name of the table generated by LDA that stores
+ the output data. It has the following columns:
<table class="output">
<tr>
<th>docid</th>
- <td>INTEGER.</td>
+ <td>INTEGER. Document id from input 'data_table'.</td>
</tr>
<tr>
<th>wordcount</th>
- <td>INTEGER.</td>
+ <td>INTEGER. Count of number of words in the document,
+ including repeats. For example, if a word appears 3 times
+ in the document, it is counted 3 times.</td>
</tr>
<tr>
<th>words</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Array of \c wordid in the document, not
+ including repeats. For example, if a word appears 3 times
+ in the document, it appears only once in the \c words array.</td>
</tr>
<tr>
<th>counts</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Frequency of occurance of a word in the document,
+ indexed the same as the \c words array above. For example, if the
+ 2nd element of the \c counts array is 4, it means that the word
+ in the 2nd element of the \c words array occurs 4 times in the
+ document.</td>
</tr>
<tr>
<th>topic_count</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Array of the count of words in the document
+ that correspond to each topic.</td>
</tr>
<tr>
<th>topic_assignment</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Array indicating which topic each word in the
+ document corresponds to. This array is of length \c wordcount.</td>
</tr>
</table>
</dd>
<dt>voc_size</dt>
- <dd>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous
integers starting from 0 to \c voc_size − \c 1. A data validation routine is called
to validate the dataset.</dd>
+ <dd>INTEGER. Size of the vocabulary. As mentioned above for the
+ input 'data_table', \c wordid consists of continous integers going
+ from 0 to \c voc_size − \c 1.
+ </dd>
<dt>topic_num</dt>
- <dd>INTEGER. Number of topics.</dd>
+ <dd>INTEGER. Desired number of topics.</dd>
<dt>iter_num</dt>
- <dd>INTEGER. Number of iterations (e.g. 60).</dd>
+ <dd>INTEGER. Desired number of iterations.</dd>
<dt>alpha</dt>
- <dd>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial
(e.g. 50/topic_num).</dd>
+ <dd>DOUBLE PRECISION. Dirichlet prior for the per-document topic
+ multinomial (e.g., 50/topic_num is a typical value to start with).</dd>
--- End diff --
I found that different libraries use different starting values, e.g. 1/k, 5/k, and 0.1. We
could mention that this value (50/k) is the one suggested in the Griffiths and Steyvers paper.
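
To give a rough sense of how far apart these starting values are, here is a minimal sketch that evaluates each heuristic for a given number of topics. The helper name `alpha_candidates` is hypothetical and is not part of MADlib; it only restates the values mentioned in this comment.

```python
def alpha_candidates(topic_num):
    """Return the alpha starting values mentioned above for k = topic_num.

    Hypothetical helper for illustration only; not a MADlib function.
    """
    if topic_num <= 0:
        raise ValueError("topic_num must be a positive integer")
    k = float(topic_num)
    return {
        "50/k": 50.0 / k,   # value attributed to Griffiths and Steyvers
        "5/k": 5.0 / k,
        "1/k": 1.0 / k,
        "0.1": 0.1,         # fixed value, independent of k
    }


if __name__ == "__main__":
    # Compare the heuristics for an arbitrary example of 20 topics.
    for name, value in alpha_candidates(20).items():
        print(f"{name}: {value}")
```

For the same k, 50/k and 1/k differ by a factor of 50, so it seems worth stating in the docs where the suggested default comes from.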
---