From: GitBox
To: dev@madlib.apache.org
Reply-To: dev@madlib.apache.org
Subject: [GitHub] [madlib] fmcquillan99 edited a comment on issue #432: MADLIB-1351 : Added stopping criteria on perplexity to LDA
Message-ID: <157291184657.1732.2320115852761185378.gitbox@gitbox.apache.org>
Date: Mon, 04 Nov 2019 23:57:26 -0000

fmcquillan99 edited a comment on issue #432: MADLIB-1351 : Added stopping criteria on perplexity to LDA
URL: https://github.com/apache/madlib/pull/432#issuecomment-549600980

-----------------------------------------------------------------

Re-test after latest commits

(1) Please add `num_iterations` to the output table. This is needed now that we have a perplexity tolerance, so training may not run for the maximum number of iterations specified. The model table should look like:

```
model_table
...
model            BIGINT[]. The encoded model ...etc...
num_iterations   INTEGER. Number of iterations that training ran for, which may be
                 less than the maximum value specified in the parameter 'iter_num'
                 if the perplexity tolerance was reached.
perplexity       DOUBLE PRECISION[]. Array of ...etc....
...
```

Now looks like:

```
-[ RECORD 1 ]----+--------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 9
perplexity       | {196.148467882,192.142777576,193.872066117}
perplexity_iters | {3,6,9}
```

OK

(2) The parameter 'perplexity_tol' can be any value >= 0.0. Currently it errors out below a value of 0.1, which is not correct: I may want to set it to 0.0 so that training runs for the full number of iterations. So please change it to error out only if 'perplexity_tol' < 0.

```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,   -- vocabulary size
                         5,     -- number of topics
                         10,    -- number of iterations
                         5,     -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,  -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,     -- Evaluate perplexity every 2 iterations
                         0.0    -- Set tolerance to 0 so runs full number of iterations
                       );
```

produces

```
-[ RECORD 1 ]----+----------------------------------------------------------------------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 20
perplexity       | {191.992070922,188.198782019,187.433873268,184.973287318,184.491077644,176.27420008,180.63646659,180.456641184,179.574266867,179.152413582}
perplexity_iters | {2,4,6,8,10,12,14,16,18,20}
```

OK

(3) Last iteration value for perplexity does not match final perplexity value:

```
DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
(1, 'By the late
1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area.'),
(4, 'One of the many applications of Bayes'' theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes'' theorem may have different probability interpretations. With the Bayesian probability interpretation the theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.'),
(5, 'When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups.
The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.'),
(6, 'Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.'),
(7, 'A multilayer perceptron is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.'),
(8, 'In mathematics, an ellipse is a plane curve surrounding two focal points, such that for all points on the curve, the sum of the two distances to the focal points is a constant.'),
(9, 'In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.'),
(10, 'In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines).
A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.'),
(11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a machine intentionally designed to perform a simple task in an indirect and overly complicated way. Usually, these machines consist of a series of simple unrelated devices; the action of each triggers the initiation of the next, eventually resulting in achieving a stated goal.'),
(12, 'In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc... Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.'),
(13, 'k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.'),
(14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.');

ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(
    regexp_replace(contents, E'[,.;\']','', 'g')
    ), E'[\\s+]');

DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table
```

Train

```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,   -- vocabulary size
                         5,     -- number of topics
                         100,   -- number of iterations
                         5,     -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,  -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,     -- Evaluate perplexity every n iterations
                         0.1    -- Stopping perplexity tolerance
                       );
SELECT voc_size, topic_num, alpha, beta, perplexity, perplexity_iters from lda_model_perp;

-[ RECORD 1 ]----+------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 16
perplexity       |
{195.582090721,192.071728778,191.048336558,194.186905186,195.150503634,191.566207005,191.199131632,185.533220287,189.910983656,184.981903783,185.753724338,183.043524383,189.125703696,191.460991339,189.193774612,189.182916247}
perplexity_iters | {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
```

Perplexity on input data

```
SELECT madlib.lda_get_perplexity( 'lda_model_perp',
                                  'lda_output_data_perp' );

 lda_get_perplexity
--------------------
   189.182916246556
(1 row)
```

which matches the last value in the array from the training function.

OK

(6) still has an issue

```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,   -- vocabulary size
                         5,     -- number of topics
                         20,    -- number of iterations
                         5,     -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,  -- Dirichlet prior for the per-topic word multinomial (beta)
                         2      -- Evaluate perplexity every n iterations
                       );
```

errors out with

```
(psycopg2.ProgrammingError) function madlib.lda_train(unknown, unknown, unknown, integer, integer, integer, integer, numeric, integer) does not exist
LINE 1: SELECT madlib.lda_train( 'documents_tf',          -- documen...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
[SQL: "SELECT madlib.lda_train( 'documents_tf', -- documents table in the form of term frequency\n 'lda_model_perp', -- model table created by LDA training (not human readable)\n 'lda_output_data_perp', -- readable output data table \n 384, -- vocabulary size\n 5, -- number of topics\n 20, -- number of iterations\n 5, -- Dirichlet prior for the per-doc topic multinomial (alpha)\n 0.01, -- Dirichlet prior for the per-topic word multinomial (beta)\n 2 -- Evaluate perplexity every n iterations\n );"]
```

This should give the same results as:

```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,   -- vocabulary size
                         5,     -- number of topics
                         20,    -- number of iterations
                         5,     -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,  -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,     -- Evaluate perplexity every n iterations
                         NULL
                       );
```

which does in fact work if you put `NULL` for the last param.

----------------------------------------------------------------

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services
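Addendum to (2): the tolerance semantics argued for above — any `perplexity_tol >= 0.0` is legal, only negative values error out, and `0.0` forces the full iteration count — can be sketched in Python. This is a minimal illustration of the stopping rule as inferred from the outputs in this comment; the names (`run_iterations`, `validate_tol`, `perplexity_fn`) are hypothetical and are not MADlib internals:

```python
def validate_tol(tol):
    # Requested behaviour: error out only when perplexity_tol < 0.
    if tol < 0.0:
        raise ValueError("perplexity_tol must be >= 0")

def run_iterations(perplexity_fn, max_iter, evaluate_every, tol):
    """Run up to max_iter iterations, evaluating perplexity every
    `evaluate_every` iterations, and stop early once the absolute change
    between consecutive evaluations drops below tol.  Because the
    comparison is strict '<', tol = 0.0 never triggers an early stop,
    so all max_iter iterations run.
    Returns (num_iterations, perplexity, perplexity_iters)."""
    validate_tol(tol)
    perplexity, perplexity_iters = [], []
    for it in range(1, max_iter + 1):
        # ...one training sweep over the corpus would happen here...
        if it % evaluate_every == 0:
            perplexity.append(perplexity_fn(it))
            perplexity_iters.append(it)
            if len(perplexity) >= 2 and abs(perplexity[-1] - perplexity[-2]) < tol:
                return it, perplexity, perplexity_iters
    return max_iter, perplexity, perplexity_iters
```

With `max_iter=20`, `evaluate_every=2`, and `tol=0.0`, this yields `num_iterations = 20` and ten perplexity evaluations at iterations {2,4,...,20}, matching the shape of the output shown in (2).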