Github user jingyimei commented on a diff in the pull request:
https://github.com/apache/madlib/pull/232#discussion_r167709835
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
\b Arguments
<dl class="arglist">
<dt>data_table</dt>
- <dd>TEXT. The name of the table storing the training dataset. Each row is
+ <dd>TEXT. Name of the table storing the training dataset. Each row is
in the form <tt><docid, wordid, count></tt> where \c docid,
\c wordid, and \c count
- are non-negative integers.
-
+ are non-negative integers.
The \c docid column refers to the document ID, the \c wordid column is the
word ID (the index of a word in the vocabulary), and \c count is the
- number of occurrences of the word in the document.
-
- Please note that column names for \c docid, \c wordid, and \c count are currently
fixed, so you must use these
- exact names in the data_table.</dd>
+ number of occurrences of the word in the document. Please note:
+
+ - \c wordid must be
+ contiguous integers going from 0 to \c voc_size − \c 1.
+ - column names for \c docid, \c wordid, and \c count are currently fixed,
+ so you must use these exact names in the data_table.
+
+ The function <a href="group__grp__text__utilities.html">Term Frequency</a>
+ can be used to generate vocabulary in the required format from raw documents.
+ </dd>
<dt>model_table</dt>
- <dd>TEXT. The name of the table storing the learned models. This table has
one row and the following columns.
+ <dd>TEXT. This is an output table generated by LDA which contains the learned
model.
+ It has one row with the following columns:
<table class="output">
<tr>
<th>voc_size</th>
- <td>INTEGER. Size of the vocabulary. Note that the \c wordid should
be continous integers starting from 0 to \c voc_size − \c 1. A data validation
routine is called to validate the dataset.</td>
+ <td>INTEGER. Size of the vocabulary. As mentioned above for the
input
+ table, \c wordid consists of contiguous integers going
+ from 0 to \c voc_size − \c 1.
+ </td>
</tr>
<tr>
<th>topic_num</th>
<td>INTEGER. Number of topics.</td>
</tr>
<tr>
<th>alpha</th>
- <td>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic
multinomial (e.g. 50/topic_num).</td>
+ <td>DOUBLE PRECISION. Dirichlet prior for the per-document
+ topic multinomial.</td>
</tr>
<tr>
<th>beta</th>
- <td>DOUBLE PRECISION. Dirichlet parameter for the per-topic word
multinomial (e.g. 0.01).</td>
+ <td>DOUBLE PRECISION. Dirichlet prior for the per-topic
+ word multinomial.</td>
</tr>
<tr>
<th>model</th>
- <td>BIGINT[].</td>
+ <td>BIGINT[]. The encoded model description (not human readable).</td>
</tr>
</table>
</dd>
<dt>output_data_table</dt>
- <dd>TEXT. The name of the table to store the output data. It has the following
columns:
+ <dd>TEXT. The name of the table generated by LDA that stores
+ the output data. It has the following columns:
<table class="output">
<tr>
<th>docid</th>
- <td>INTEGER.</td>
+ <td>INTEGER. Document id from input 'data_table'.</td>
</tr>
<tr>
<th>wordcount</th>
- <td>INTEGER.</td>
+ <td>INTEGER. Count of number of words in the document,
+ including repeats. For example, if a word appears 3 times
+ in the document, it is counted 3 times.</td>
</tr>
<tr>
<th>words</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Array of \c wordid in the document, not
+ including repeats. For example, if a word appears 3 times
+ in the document, it appears only once in the \c words array.</td>
</tr>
<tr>
<th>counts</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Frequency of occurrence of a word in the document,
+ indexed the same as the \c words array above. For example, if the
+ 2nd element of the \c counts array is 4, it means that the word
+ in the 2nd element of the \c words array occurs 4 times in the
+ document.</td>
</tr>
<tr>
<th>topic_count</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Array of the count of words in the document
+ that correspond to each topic.</td>
</tr>
<tr>
<th>topic_assignment</th>
- <td>INTEGER[].</td>
+ <td>INTEGER[]. Array indicating which topic each word in the
+ document corresponds to. This array is of length \c wordcount.</td>
</tr>
</table>
</dd>
<dt>voc_size</dt>
- <dd>INTEGER. Size of the vocabulary. Note that the \c wordid should be continous
integers starting from 0 to \c voc_size − \c 1. A data validation routine is called
to validate the dataset.</dd>
+ <dd>INTEGER. Size of the vocabulary. As mentioned above for the
+ input 'data_table', \c wordid consists of contiguous integers going
+ from 0 to \c voc_size − \c 1.
+ </dd>
<dt>topic_num</dt>
- <dd>INTEGER. Number of topics.</dd>
+ <dd>INTEGER. Desired number of topics.</dd>
<dt>iter_num</dt>
- <dd>INTEGER. Number of iterations (e.g. 60).</dd>
+ <dd>INTEGER. Desired number of iterations.</dd>
<dt>alpha</dt>
- <dd>DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial
(e.g. 50/topic_num).</dd>
+ <dd>DOUBLE PRECISION. Dirichlet prior for the per-document topic
+ multinomial (e.g., 50/topic_num is a typical value to start with).</dd>
<dt>beta</dt>
- <dd>DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial
(e.g. 0.01).</dd>
+ <dd>DOUBLE PRECISION. Dirichlet prior for the per-topic
+ word multinomial (e.g., 0.01 is a typical value to start with).</dd>
</dl>
@anchor predict
@par Prediction Function
-Prediction—labelling test documents using a learned LDA model—is
accomplished with the following function:
+Prediction involves labelling test documents using a learned LDA model:
<pre class="syntax">
lda_predict( data_table,
model_table,
- output_table
+ output_predict_table
);
</pre>
-
-This function stores the prediction results in
-<tt><em>output_table</em></tt>. Each row in the table stores
the topic
-distribution and the topic assignments for a document in the dataset. The
-table has the following columns:
-<table class="output">
- <tr>
- <th>docid</th>
- <td>INTEGER.</td>
- </tr>
- <tr>
- <th>wordcount</th>
- <td>INTEGER.</td>
- </tr>
- <tr>
- <th>words</th>
- <td>INTEGER[]. List of word IDs in this document.</td>
- </tr>
- <tr>
- <th>counts</th>
- <td>INTEGER[]. List of word counts in this document.</td>
- </tr>
- <tr>
- <th>topic_count</th>
- <td>INTEGER[]. Of length topic_num, list of topic counts in this document.</td>
- </tr>
- <tr>
- <th>topic_assignment</th>
- <td>INTEGER[]. Of length wordcount, list of topic index for each word.</td>
- </tr>
-</table>
+\b Arguments
+<dl class="arglist">
+<dt>data_table</dt>
+ <dd>TEXT. Name of the table storing the test dataset
+ (new document to be labeled).
+ </dd>
+<dt>model_table</dt>
+ <dd>TEXT. The model table generated by the training process.
+ </dd>
+<dt>output_predict_table</dt>
+ <dd>TEXT. The prediction output table.
+ Each row in the table stores the topic
+ distribution and the topic assignments for a
+ document in the dataset. This table has the exact
+ same columns and interpretation as
+ the 'output_data_table' from the training function above.
+ </dd>
+</dl>
@anchor perplexity
-@par Perplexity Function
-This module provides a function for computing the perplexity.
+@par Perplexity
+Perplexity describes how well the model fits the data by
+computing word likelihoods averaged over the test documents.
+This function returns a single perplexity value.
<pre class="syntax">
lda_get_perplexity( model_table,
- output_data_table
+ output_predict_table
);
</pre>
+\b Arguments
+<dl class="arglist">
+<dt>model_table</dt>
+ <dd>TEXT. The model table generated by the training process.
+ </dd>
+<dt>output_predict_table</dt>
+ <dd>TEXT. The prediction output table generated by the
+ predict function above.
+ </dd>
+</dl>
+
+@anchor helper
+@par Helper Functions
+
+The helper functions can help to interpret the output
+from LDA training and LDA prediction.
+
+<b>Topic description by top-k words</b>
--- End diff --
Top-k with highest probability. I saw you mention it later in the example, and I feel we
can also mention it here; it only takes three more words.
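
To make the suggested wording concrete, here is a minimal Python sketch of what "top-k words with highest probability" would mean for a single topic. The word list and probabilities are made up for illustration, and the helper name `top_k_words` is hypothetical; this is not how MADlib computes it internally.

```python
import heapq


def top_k_words(word_probs, k):
    """Return the k (word, probability) pairs with the highest probability."""
    # nlargest returns the pairs ordered from highest to lowest probability.
    return heapq.nlargest(k, word_probs.items(), key=lambda item: item[1])


# Hypothetical per-topic word probabilities (illustrative values only).
topic_word_probs = {
    "code": 0.30,
    "data": 0.25,
    "graph": 0.20,
    "image": 0.15,
    "input": 0.10,
}

top3 = top_k_words(topic_word_probs, 3)
print(top3)
```

With k = 3, only the three most probable words for the topic are kept, which is the meaning the heading would spell out.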
---