fmcquillan99 edited a comment on issue #432: MADLIB-1351: Added stopping criteria on perplexity to LDA
URL: https://github.com/apache/madlib/pull/432#issuecomment-549600980

Retest after latest commits
(1)
Please add `num_iterations` to the output table. This is needed now because
we have a perplexity tolerance, so training may not run the maximum number of iterations
specified. The model table should look like:
```
model_table
...
model BIGINT[]. The encoded model ...etc...
num_iterations INTEGER. Number of iterations that training ran for,
which may be less than the maximum value specified in the parameter 'iter_num' if
the perplexity tolerance was reached.
perplexity DOUBLE PRECISION[] Array of ...etc....
...
```
Now looks like:
```
-[ RECORD 1 ]----+---------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 9
perplexity       | {196.148467882,192.142777576,193.872066117}
perplexity_iters | {3,6,9}
```
OK
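To illustrate why `num_iterations` can be smaller than `iter_num`, here is a minimal Python sketch of a tolerance-based stopping loop. It is not MADlib's implementation: the function name `run_until_converged`, the callable `perplexity_at`, and the stopping rule (stop when two successive perplexity evaluations differ by less than `perplexity_tol`) are assumptions, chosen to be consistent with the outputs reported in this comment. The sample values are the perplexities from the 100-iteration run in item (3).

```python
def run_until_converged(perplexity_at, iter_num, evaluate_every, perplexity_tol):
    """Hypothetical driver loop (not MADlib code).

    perplexity_at(i) stands in for "train one more iteration, then
    evaluate perplexity at iteration i".
    """
    perplexity = []        # values recorded at each evaluation
    perplexity_iters = []  # iteration numbers at which they were recorded
    num_iterations = 0
    for it in range(1, iter_num + 1):
        num_iterations = it
        if evaluate_every > 0 and it % evaluate_every == 0:
            perplexity.append(perplexity_at(it))
            perplexity_iters.append(it)
            # Assumed rule: stop once two successive evaluations differ
            # by less than the tolerance.
            if (len(perplexity) >= 2
                    and abs(perplexity[-1] - perplexity[-2]) < perplexity_tol):
                break
    return {
        "num_iterations": num_iterations,
        "perplexity": perplexity,
        "perplexity_iters": perplexity_iters,
    }


# Perplexity values copied from the run in item (3) of this comment.
reported = [
    195.582090721, 192.071728778, 191.048336558, 194.186905186,
    195.150503634, 191.566207005, 191.199131632, 185.533220287,
    189.910983656, 184.981903783, 185.753724338, 183.043524383,
    189.125703696, 191.460991339, 189.193774612, 189.182916247,
]

result = run_until_converged(
    lambda it: reported[it - 1],  # replay the reported values
    iter_num=100,
    evaluate_every=1,
    perplexity_tol=0.1,
)
print(result["num_iterations"])  # 16
```

Under this assumed rule the loop stops at iteration 16, because the last two values differ by about 0.011, which is below the tolerance of 0.1.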
(2)
The parameter 'perplexity_tol' can be any value >= 0.0. Currently it errors out below a
value of 0.1, which is not correct. I may want to set it to 0.0 so that training runs
for the full number of iterations. So please change it to error out only if 'perplexity_tol' < 0.
```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,                     -- vocabulary size
                         5,                       -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every 2 iterations
                         0.0                      -- Set tolerance to 0 so runs full number of iterations
                       );
```
produces
```
-[ RECORD 1 ]----+---------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 20
perplexity       | {191.992070922,188.198782019,187.433873268,184.973287318,184.491077644,176.27420008,180.63646659,180.456641184,179.574266867,179.152413582}
perplexity_iters | {2,4,6,8,10,12,14,16,18,20}
```
OK
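The validation requested in (2) amounts to a single bounds check. A minimal Python sketch follows, assuming a hypothetical helper named `validate_perplexity_tol`; the actual MADlib validation code and its error message are not shown in this comment, and passing `None` through unchanged is likewise an assumption.

```python
def validate_perplexity_tol(perplexity_tol):
    """Hypothetical check: accept any tolerance >= 0, reject negatives.

    None is passed through unchanged, on the assumption that it means
    "use the default" (this mirrors passing NULL in SQL).
    """
    if perplexity_tol is not None and perplexity_tol < 0:
        raise ValueError(
            "perplexity_tol must be >= 0, got {}".format(perplexity_tol)
        )
    return perplexity_tol


def rejects(value):
    """Return True if the value is rejected by the check above."""
    try:
        validate_perplexity_tol(value)
    except ValueError:
        return True
    return False


print(rejects(0.0))   # False: 0.0 is allowed, so training can run in full
print(rejects(0.05))  # False: values between 0 and 0.1 are allowed
print(rejects(-0.1))  # True: only negative values are errors
```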
(3)
Last iteration value for perplexity does not match final perplexity value:
```
DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable models, originally
developed for analyzing the semantic content of large document corpora.'),
(1, 'By the late 1960s, the balance between pitching and hitting had swung in favor of
the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average
of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with computational statistics;
a discipline that also specializes in prediction-making. It has strong ties to mathematical
optimization, which deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific
Coast in the west, from the Redwood Douglas fir forests of the northwest, to the Mojave Desert
areas in the southeast. The center of the state is dominated by the Central Valley, a major
agricultural area.'),
(4, 'One of the many applications of Bayes'' theorem is Bayesian inference, a particular
approach to statistical inference. When applied, the probabilities involved in Bayes'' theorem
may have different probability interpretations. With the Bayesian probability interpretation
the theorem expresses how a degree of belief, expressed as a probability, should rationally
change to account for availability of related evidence. Bayesian inference is fundamental
to Bayesian statistics.'),
(5, 'When data are unlabelled, supervised learning is not possible, and an unsupervised
learning approach is required, which attempts to find natural clustering of the data to groups,
and then map new data to these formed groups. The support-vector clustering algorithm, created
by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed
in the support vector machines algorithm, to categorize unlabeled data, and is one of the
most widely used clustering algorithms in industrial applications.'),
(6, 'Deep learning architectures such as deep neural networks, deep belief networks, recurrent
neural networks and convolutional neural networks have been applied to fields including computer
vision, speech recognition, natural language processing, audio recognition, social network
filtering, machine translation, bioinformatics, drug design, medical image analysis, material
inspection and board game programs, where they have produced results comparable to and in
some cases superior to human experts.'),
(7, 'A multilayer perceptron is a class of feedforward artificial neural network. An MLP
consists of at least three layers of nodes: an input layer, a hidden layer and an output layer.
Except for the input nodes, each node is a neuron that uses a nonlinear activation function.
MLP utilizes a supervised learning technique called backpropagation for training.'),
(8, 'In mathematics, an ellipse is a plane curve surrounding two focal points, such that
for all points on the curve, the sum of the two distances to the focal points is a constant.'),
(9, 'In artificial neural networks, the activation function of a node defines the output
of that node given an input or set of inputs.'),
(10, 'In mathematics, graph theory is the study of graphs, which are mathematical structures
used to model pairwise relations between objects. A graph in this context is made up of vertices
(also called nodes or points) which are connected by edges (also called links or lines). A
distinction is made between undirected graphs, where edges link two vertices symmetrically,
and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics)
for more detailed definitions and for other variations in the types of graph that are commonly
considered. Graphs are one of the prime objects of study in discrete mathematics.'),
(11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a machine intentionally
designed to perform a simple task in an indirect and overly complicated way. Usually, these
machines consist of a series of simple unrelated devices; the action of each triggers the
initiation of the next, eventually resulting in achieving a stated goal.'),
(12, 'In statistics, the logistic model (or logit model) is used to model the probability
of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick.
This can be extended to model several classes of events such as determining whether an image
contains a cat, dog, lion, etc... Each object being detected in the image would be assigned
a probability between 0 and 1 and the sum adding to one.'),
(13, 'k-means clustering is a method of vector quantization, originally from signal processing,
that is popular for cluster analysis in data mining. k-means clustering aims to partition
n observations into k clusters in which each observation belongs to the cluster with the nearest
mean, serving as a prototype of the cluster.'),
(14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a nonparametric
method used for classification and regression.');
ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words =
regexp_split_to_array(lower(
regexp_replace(contents, E'[,.;\']','', 'g')
), E'[\\s+]');
DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table
```
Train
```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,                     -- vocabulary size
                         5,                       -- number of topics
                         100,                     -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,                       -- Evaluate perplexity every n iterations
                         0.1                      -- Stopping perplexity tolerance
                       );
SELECT voc_size, topic_num, alpha, beta, num_iterations, perplexity, perplexity_iters FROM lda_model_perp;
-[ RECORD 1 ]----+---------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 16
perplexity       | {195.582090721,192.071728778,191.048336558,194.186905186,195.150503634,191.566207005,191.199131632,185.533220287,189.910983656,184.981903783,185.753724338,183.043524383,189.125703696,191.460991339,189.193774612,189.182916247}
perplexity_iters | {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
```
Perplexity on input data
```
SELECT madlib.lda_get_perplexity( 'lda_model_perp',
'lda_output_data_perp'
);
 lda_get_perplexity
--------------------
   189.182916246556
(1 row)
```
which matches the last value in the array for the training function.
OK
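The comparison in (3) can be automated as a small regression check. The sketch below is an assumption about how one might script it, not part of the MADlib test suite; the helper name `last_matches_final` and the tolerance of `1e-6` are hypothetical. A tolerance is needed because the array in the model table is printed with fewer digits than the value returned by `lda_get_perplexity`.

```python
import math


def last_matches_final(perplexity, final_perplexity, abs_tol=1e-6):
    """Check that the last recorded training perplexity agrees with the
    perplexity computed afterwards, up to display rounding."""
    return math.isclose(perplexity[-1], final_perplexity, abs_tol=abs_tol)


# Last three values of the perplexity array reported above.
training_tail = [191.460991339, 189.193774612, 189.182916247]
# Value returned by lda_get_perplexity above.
final_value = 189.182916246556

matches = last_matches_final(training_tail, final_value)
print(matches)  # True
```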
(6) still has an issue
```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,                     -- vocabulary size
                         5,                       -- number of topics
                         20,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2                        -- Evaluate perplexity every n iterations
                       );
Done.
(psycopg2.ProgrammingError) function madlib.lda_train(unknown, unknown, unknown, integer, integer, integer, integer, numeric, integer) does not exist
LINE 1: SELECT madlib.lda_train( 'documents_tf', -- documen...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
[SQL: "SELECT madlib.lda_train( 'documents_tf', -- documents table in the form of term frequency\n
 'lda_model_perp', -- model table created by LDA training (not human readable)\n
 'lda_output_data_perp', -- readable output data table\n
 384, -- vocabulary size\n
 5, -- number of topics\n
 20, -- number of iterations\n
 5, -- Dirichlet prior for the per-doc topic multinomial (alpha)\n
 0.01, -- Dirichlet prior for the per-topic word multinomial (beta)\n
 2 -- Evaluate perplexity every n iterations\n
 );"]
```
This should give the same results as:
```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,                     -- vocabulary size
                         5,                       -- number of topics
                         20,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every n iterations
                         NULL
                       );
```
which actually does work if you put `NULL` for the last param.

