From: GitBox
To: dev@madlib.apache.org
Reply-To: dev@madlib.apache.org
Subject: [GitHub] [madlib] fmcquillan99 edited a comment on issue #432: MADLIB-1351 : Added stopping criteria on perplexity to LDA
Message-ID: <157291184657.1732.2320115852761185378.gitbox@gitbox.apache.org>
Date: Mon, 04 Nov 2019 23:57:26 -0000

fmcquillan99 edited a comment on issue #432: MADLIB-1351 : Added stopping criteria on perplexity to LDA
URL: https://github.com/apache/madlib/pull/432#issuecomment-549600980

-----------------------------------------------------------------

Re-test after latest commits

(1) Please add `num_iterations` to the output table. This is needed now that we have a perplexity tolerance, so training may not run for the maximum number of iterations specified. The model table should look like:

```
model_table
...
model            BIGINT[]. The encoded model ...etc...
num_iterations   INTEGER. Number of iterations that training ran for, which may be
                 less than the maximum value specified in the parameter 'iter_num'
                 if the perplexity tolerance was reached.
perplexity       DOUBLE PRECISION[]. Array of ...etc....
...
```

Now looks like:

```
-[ RECORD 1 ]----+--------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 9
perplexity       | {196.148467882,192.142777576,193.872066117}
perplexity_iters | {3,6,9}
```

OK

(2) The parameter 'perplexity_tol' can be any value >= 0.0. Currently it errors out below a value of 0.1, which is not correct: I may want to set it to 0.0 so that training runs for the full number of iterations. So please change it to error out only if 'perplexity_tol' < 0.

```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,   -- vocabulary size
                         5,     -- number of topics
                         10,    -- number of iterations
                         5,     -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,  -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,     -- Evaluate perplexity every 2 iterations
                         0.0    -- Set tolerance to 0 so runs full number of iterations
                       );
```

produces

```
-[ RECORD 1 ]----+----------------------------------------------------------------------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 20
perplexity       | {191.992070922,188.198782019,187.433873268,184.973287318,184.491077644,176.27420008,180.63646659,180.456641184,179.574266867,179.152413582}
perplexity_iters | {2,4,6,8,10,12,14,16,18,20}
```

OK

(3) Last iteration value for perplexity does not match final perplexity value:

```
DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
(1, 'By the late
1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area.'),
(4, 'One of the many applications of Bayes'' theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes'' theorem may have different probability interpretations. With the Bayesian probability interpretation the theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.'),
(5, 'When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups.
The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.'),
(6, 'Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.'),
(7, 'A multilayer perceptron is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.'),
(8, 'In mathematics, an ellipse is a plane curve surrounding two focal points, such that for all points on the curve, the sum of the two distances to the focal points is a constant.'),
(9, 'In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.'),
(10, 'In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines).
A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.'),
(11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a machine intentionally designed to perform a simple task in an indirect and overly complicated way. Usually, these machines consist of a series of simple unrelated devices; the action of each triggers the initiation of the next, eventually resulting in achieving a stated goal.'),
(12, 'In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc... Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.'),
(13, 'k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.'),
(14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.');

ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(
    regexp_replace(contents, E'[,.;\']','', 'g')
    ), E'[\\s+]');

DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table
```

Train

```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,   -- vocabulary size
                         5,     -- number of topics
                         100,   -- number of iterations
                         5,     -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,  -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,     -- Evaluate perplexity every n iterations
                         0.1    -- Stopping perplexity tolerance
                       );
SELECT voc_size, topic_num, alpha, beta, perplexity, perplexity_iters from lda_model_perp;

-[ RECORD 1 ]----+------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 16
perplexity       |
{195.582090721,192.071728778,191.048336558,194.186905186,195.150503634,191.566207005,191.199131632,185.533220287,189.910983656,184.981903783,185.753724338,183.043524383,189.125703696,191.460991339,189.193774612,189.182916247}
perplexity_iters | {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
```

Perplexity on input data

```
SELECT madlib.lda_get_perplexity( 'lda_model_perp',
                                  'lda_output_data_perp' );

 lda_get_perplexity
--------------------
   189.182916246556
(1 row)
```

which matches the last value in the array from the training function.

OK

(6) still has an issue

```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,   -- vocabulary size
                         5,     -- number of topics
                         20,    -- number of iterations
                         5,     -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,  -- Dirichlet prior for the per-topic word multinomial (beta)
                         2      -- Evaluate perplexity every n iterations
                       );
```

errors out with

```
(psycopg2.ProgrammingError) function madlib.lda_train(unknown, unknown, unknown, integer, integer, integer, integer, numeric, integer) does not exist
LINE 1: SELECT madlib.lda_train( 'documents_tf',          -- documen...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
[SQL: "SELECT madlib.lda_train( 'documents_tf', -- documents table in the form of term frequency\n 'lda_model_perp', -- model table created by LDA training (not human readable)\n 'lda_output_data_perp', -- readable output data table \n 384, -- vocabulary size\n 5, -- number of topics\n 20, -- number of iterations\n 5, -- Dirichlet prior for the per-doc topic multinomial (alpha)\n 0.01, -- Dirichlet prior for the per-topic word multinomial (beta)\n 2 -- Evaluate perplexity every n iterations\n );"]
```

This should give the same results as:

```
DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,   -- vocabulary size
                         5,     -- number of topics
                         20,    -- number of iterations
                         5,     -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,  -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,     -- Evaluate perplexity every n iterations
                         NULL
                       );
```

which does in fact work if you put `NULL` for the last param.

----------------------------------------------------------------

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services
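Addendum to (2): the tolerance semantics argued for above — any `perplexity_tol >= 0.0` is legal, only negative values error out, and `0.0` forces the full iteration count — can be sketched in Python. This is a minimal illustration of the stopping rule as inferred from the outputs in this comment; the names (`run_iterations`, `validate_tol`, `perplexity_fn`) are hypothetical and are not MADlib internals:

```python
def validate_tol(tol):
    # Requested behaviour: error out only when perplexity_tol < 0.
    if tol < 0.0:
        raise ValueError("perplexity_tol must be >= 0")

def run_iterations(perplexity_fn, max_iter, evaluate_every, tol):
    """Run up to max_iter iterations, evaluating perplexity every
    `evaluate_every` iterations, and stop early once the absolute change
    between consecutive evaluations drops below tol.  Because the
    comparison is strict '<', tol = 0.0 never triggers an early stop,
    so all max_iter iterations run.
    Returns (num_iterations, perplexity, perplexity_iters)."""
    validate_tol(tol)
    perplexity, perplexity_iters = [], []
    for it in range(1, max_iter + 1):
        # ...one training sweep over the corpus would happen here...
        if it % evaluate_every == 0:
            perplexity.append(perplexity_fn(it))
            perplexity_iters.append(it)
            if len(perplexity) >= 2 and abs(perplexity[-1] - perplexity[-2]) < tol:
                return it, perplexity, perplexity_iters
    return max_iter, perplexity, perplexity_iters
```

With `max_iter=20`, `evaluate_every=2`, and `tol=0.0`, this yields `num_iterations = 20` and ten perplexity evaluations at iterations {2,4,...,20}, matching the shape of the output shown in (2).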