Subject: svn commit: r1543854 - /mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext
Date: Wed, 20 Nov 2013 16:05:55 -0000
To: commits@mahout.apache.org
From: isabel@apache.org

Author: isabel
Date: Wed Nov 20 16:05:55 2013
New Revision: 1543854

URL: http://svn.apache.org/r1543854
Log:
MAHOUT-1245 - reformat collocations page

Modified:
    mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext

Modified:
mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext?rev=1543854&r1=1543853&r2=1543854&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext Wed Nov 20 16:05:55 2013
@@ -1,4 +1,5 @@
 Title: Collocations
+
 # Collocations in Mahout
@@ -6,12 +7,12 @@ A collocation is defined as a sequence o
 more often than would be expected by chance. Statistically relevant combinations of terms identify additional lexical units which can be treated as features in a vector-based representation of a text. A detailed
-discussion of collocations can be found on wikipedia [1](http://en.wikipedia.org/wiki/Collocation)
-.
-
+discussion of collocations can be found on [Wikipedia](http://en.wikipedia.org/wiki/Collocation).
+
+See there for a more detailed discussion of collocations in the [Reuters example](http://comments.gmane.org/gmane.comp.apache.mahout.user/5685).
-## Log-Likelihood based Collocation Identification
+## Theory behind implementation: Log-Likelihood based Collocation Identification
 Mahout provides an implementation of a collocation identification algorithm which scores collocations using log-likelihood ratio. The log-likelihood
@@ -20,7 +21,7 @@ term combinations in the text. Collocati
 particular corpus will generally be more useful as features.
 Calculating the LLR is very straightforward and is described concisely in
-Ted Dunning's blog post [2](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
+[Ted Dunning's blog post](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
 . Ted describes the series of counts reqired to calculate the LLR for two events A and B in order to determine if they co-occur more often than pure chance.
 These counts include the number of times the events co-occur (k11),
@@ -100,65 +101,57 @@ times.
 bin/mahout seq2sparse
 Usage:
- [--minSupport --analyzerName --chunkSize
- --output --input --minDF
---maxDFPercent
- --weight --norm --minLLR
---numReducers
- --maxNGramSize --overwrite --help
- --sequentialAccessVector]
+ [--minSupport --analyzerName --chunkSize
+ --output --input --minDF
+ --maxDFPercent --weight --norm --minLLR
+ --numReducers --maxNGramSize --overwrite --help
+ --sequentialAccessVector]
 Options
- --minSupport (-s) minSupport (Optional) Minimum Support. Default
- Value: 2
- --analyzerName (-a) analyzerName The class name of the analyzer
- --chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. 100-10000
-MB
- --output (-o) output The output directory
- --input (-i) input input dir containing the documents in
- sequence file format
- --minDF (-md) minDF The minimum document frequency.
-Default
- is 1
- --maxDFPercent (-x) maxDFPercent The max percentage of docs for the
-DF.
- Can be used to remove really high
- frequency terms. Expressed as an
-integer
- between 0 and 100. Default is 99.
- --weight (-wt) weight The kind of weight to use. Currently
-TF
+
+ --minSupport (-s) minSupport (Optional) Minimum Support. Default Value: 2
+
+ --analyzerName (-a) analyzerName The class name of the analyzer
+
+ --chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. 100-10000MB
+
+ --output (-o) output The output directory
+
+ --input (-i) input Input dir containing the documents in sequence file format
+
+ --minDF (-md) minDF The minimum document frequency. Default is 1
+
+ --maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF. Can be used to remove
+ really high frequency terms. Expressed as an
+ integer between 0 and 100. Default is 99.
+
+ --weight (-wt) weight The kind of weight to use. Currently TF
 or TFIDF
- --norm (-n) norm The norm to use, expressed as either
-a
+
+ --norm (-n) norm The norm to use, expressed as either a
 float or "INF" if you want to use the
- Infinite norm. Must be greater or
-equal
- to 0. The default is not to
-normalize
+ Infinite norm. Must be greater orequal
+ to 0. The default is not to normalize
+
 --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood
- Ratio(Float) Default is 1.0
+ Ratio(Float) Default is 1.0
+
 --numReducers (-nr) numReducers (Optional) Number of reduce tasks. Default Value: 1
- --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams
-to
- create (2 = bigrams, 3 = trigrams,
-etc)
- Default Value:2
- --overwrite (-w) If set, overwrite the output
-directory
+
+ --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams to
+ create (2 = bigrams, 3 = trigrams, etc)
+ Default Value:2
+
+ --overwrite (-w) If set, overwrite the output directory
 --help (-h) Print out help
- --sequentialAccessVector (-seq) (Optional) Whether output vectors
-should
- be SequentialAccessVectors If set
-true
+ --sequentialAccessVector (-seq) (Optional) Whether output vectors should
+ be SequentialAccessVectors If set true
 else false
 ### CollocDriver
-*TODO*
-
 bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
@@ -166,32 +159,38 @@ true
 [--input --output --maxNGramSize --overwrite --minSupport --minLLR --numReducers --analyzerName --preprocess --unigram --help]
+
 Options
+
 --input (-i) input The Path for input files.
+
 --output (-o) output The Path write output to
- --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams
-to
- create (2 = bigrams, 3 = trigrams,
-etc)
- Default Value:2
- --overwrite (-w) If set, overwrite the output
-directory
+
+ --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngramsto
+ create (2 = bigrams, 3 = trigrams,etc)
+ Default Value:2
+
+ --overwrite (-w) If set, overwrite the outputdirectory
+
 --minSupport (-s) minSupport (Optional) Minimum Support. Default Value: 2
- --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood
- Ratio(Float) Default is 1.0
+
+ --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood
+ Ratio(Float) Default is 1.0
+
 --numReducers (-nr) numReducers (Optional) Number of reduce tasks. Default Value: 1
+
 --analyzerName (-a) analyzerName The class name of the analyzer
- --preprocess (-p) If set, input is
-SequenceFile
- where the value is the document,
-which
+
+ --preprocess (-p) If set, input is SequenceFile
+ where the value is the document, which
 will be tokenized using the specified
- analyzer.
- --unigram (-u) If set, unigrams will be emitted in
-the
- final output alongside collocations
+ analyzer.
+
+ --unigram (-u) If set, unigrams will be emitted inthe
+ final output alongside collocations
+
 --help (-h) Print out help
@@ -227,8 +226,11 @@ Once this is done, ngrams are split into
 head_key(EMPTY) -> (head subgram, head frequency)
+
 head_key(ngram) -> (ngram, ngram frequency)
+
 tail_key(EMPTY) -> (tail subgram, tail frequency)
+
 tail_key(ngram) -> (ngram, ngram frequency)
@@ -374,17 +376,3 @@ CollocDriver, unigrams (single tokens) w
 each token's frequency will be calculated. As with ngrams, unigrams are subject to filtering with minSupport and minLLR.
-
-## References
-
-\[1\](1\.html) - http://en.wikipedia.org/wiki/Collocation
-\[2\](2\.html) - http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
-
-
-
-## Discussion
-
-* http://comments.gmane.org/gmane.comp.apache.mahout.user/5685 - Reuters
-example