Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167720593

--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
\b Arguments
voc_size | -INTEGER. Size of the vocabulary. Note that the \c wordid should be continuous integers starting from 0 to \c voc_size − \c 1. A data validation routine is called to validate the dataset. | +INTEGER. Size of the vocabulary. As mentioned above for the input + table, \c wordid consists of contiguous integers going + from 0 to \c voc_size − \c 1. + |
---|---|---|
topic_num | INTEGER. Number of topics. | |
alpha | -DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num). | +DOUBLE PRECISION. Dirichlet prior for the per-document + topic multinomial. |
beta | -DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01). | +DOUBLE PRECISION. Dirichlet prior for the per-topic + word multinomial. |
model | -BIGINT[]. | +BIGINT[]. The encoded model description (not human readable). |
docid | -INTEGER. | +INTEGER. Document id from input 'data_table'. |
---|---|---|
wordcount | -INTEGER. | +INTEGER. Count of number of words in the document, + including repeats. For example, if a word appears 3 times + in the document, it is counted 3 times. |
words | -INTEGER[]. | +INTEGER[]. Array of \c wordid in the document, not + including repeats. For example, if a word appears 3 times + in the document, it appears only once in the \c words array. |
counts | -INTEGER[]. | +INTEGER[]. Frequency of occurrence of a word in the document, + indexed the same as the \c words array above. For example, if the + 2nd element of the \c counts array is 4, it means that the word + in the 2nd element of the \c words array occurs 4 times in the + document. |
topic_count | -INTEGER[]. | +INTEGER[]. Array of the count of words in the document + that correspond to each topic. |
topic_assignment | -INTEGER[]. | +INTEGER[]. Array indicating which topic each word in the + document corresponds to. This array is of length \c wordcount. |
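
To make the relationship between the output columns concrete, here is a minimal Python sketch of how \c wordcount, \c words, \c counts and \c topic_count relate for one document. It only illustrates the column descriptions above and is not MADlib's implementation. The helper names `encode_document` and `topic_counts`, the sample data, the sorted order of \c words and the 0-based topic ids are all assumptions made for the example.

```python
from collections import Counter


def encode_document(word_ids):
    """Derive wordcount, words and counts from a document's list of word ids.

    Hypothetical helper: sorting the distinct word ids is an assumption,
    the column descriptions do not state an ordering for `words`.
    """
    freq = Counter(word_ids)
    words = sorted(freq)               # each distinct wordid appears once
    counts = [freq[w] for w in words]  # counts[i] is how often words[i] occurs
    wordcount = len(word_ids)          # total words, repeats included
    return wordcount, words, counts


def topic_counts(topic_assignment, topic_num):
    """Count how many words of the document are assigned to each topic.

    Assumes topic ids run from 0 to topic_num - 1 (an assumption).
    """
    result = [0] * topic_num
    for topic in topic_assignment:
        result[topic] += 1
    return result


# Made-up document: word 3 appears three times, word 0 twice, word 7 once.
doc = [3, 0, 3, 7, 3, 0]
wordcount, words, counts = encode_document(doc)

# Made-up topic assignment with one entry per word, so its length is wordcount.
assignment = [1, 0, 1, 1, 0, 1]
per_topic = topic_counts(assignment, topic_num=2)

print(wordcount, words, counts, per_topic)
```

Two invariants follow from the descriptions: the entries of \c counts sum to \c wordcount, and so do the entries of \c topic_count, because every word in the document receives exactly one topic.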