h1. Collocations in Mahout

A collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance. Statistically significant combinations of terms identify additional lexical units which can be treated as features in a vector-based representation of a text. A detailed discussion of collocations can be found on Wikipedia [1|http://en.wikipedia.org/wiki/Collocation].

h2. Log-Likelihood based Collocation Identification

Mahout provides an implementation of a collocation identification algorithm which scores collocations using the log-likelihood ratio (LLR). The log-likelihood score indicates the relative usefulness of a collocation with respect to other term combinations in the text; collocations with the highest scores in a particular corpus will generally be more useful as features.

Calculating the LLR is very straightforward and is described concisely in Ted Dunning's blog post [2|http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html]. Ted describes the series of counts required to calculate the LLR for two events A and B in order to determine whether they co-occur more often than pure chance would predict: the number of times the events co-occur (k11), the number of times each event occurs without the other (k12 and k21), and the number of times neither event occurs (k22). These counts are summarized in the following table:

|| || Event A || Everything but Event A ||
|| Event B | A and B together (k11) | B but not A (k12) |
|| Everything but Event B | A but not B (k21) | Neither A nor B (k22) |
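Given these four counts, the ratio itself can be computed from entropy terms, as Dunning's post describes. The following is a minimal, self-contained Java sketch of that calculation; the class and method names here are illustrative (Mahout ships an equivalent utility, but this is not its API):

{code}
// Minimal sketch of Dunning's log-likelihood ratio over the four counts
// in the table above. Names are illustrative, not Mahout's API.
public final class LogLikelihoodSketch {

  // x * log(x), defined as 0 when x == 0
  private static double xLogX(long x) {
    return x == 0L ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy over raw counts:
  // H = xLogX(total) - sum(xLogX(k))
  private static double entropy(long... counts) {
    long total = 0L;
    double sumXLogX = 0.0;
    for (long k : counts) {
      total += k;
      sumXLogX += xLogX(k);
    }
    return xLogX(total) - sumXLogX;
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Equivalent to 2 * N * (H(rows) + H(cols) - H(matrix)) with normalized
    // entropies; larger scores indicate stronger association.
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // Counts are made up for illustration: a pair seen together 110 times,
    // head-only 2442 times, tail-only 111 times, and 29114 other pairs.
    System.out.println(logLikelihoodRatio(110, 2442, 111, 29114));
  }
}
{code}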
For the purposes of collocation identification, it is useful to begin by thinking in terms of word pairs, or bigrams. In this case the leading or 'head' term of the pair corresponds to A from the table above, the trailing or 'tail' term corresponds to B, and 'Neither A nor B' is the total number of word pairs in the corpus less those containing A, B, or both.

Given the word pair 'oscillation overthruster', the log-likelihood ratio is computed by looking at the number of occurrences of that word pair in the corpus, the number of word pairs that begin with 'oscillation' but end with something other than 'overthruster', the number of word pairs that end with 'overthruster' but begin with something other than 'oscillation', and the number of word pairs in the corpus that contain neither 'oscillation' nor 'overthruster'.

This can be extended from bigrams to trigrams, 4-grams and beyond. In these cases, the current algorithm uses the first token of the ngram as the head and the remaining n-1 tokens, the (n-1)-gram as it were, as the tail. Given the trigram 'hong kong cavaliers', 'hong' is treated as the head while 'kong cavaliers' is treated as the tail. Future versions of this algorithm will allow for variations in which tokens of the ngram are treated as the head and tail. Beyond ngrams, it is often useful to inspect cases where individual words occur around other interesting features of the text, such as sentence boundaries.

h2. Generating NGrams

The tools in which the collocation identification algorithm is embedded either consume tokenized text as input or allow an implementation of the Lucene Analyzer class to be specified to perform tokenization. The tokens are passed through a Lucene ShingleFilter to produce ngrams of the desired length.

Given the text "Alice was beginning to get very tired" as an example, Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get', 'very' and 'tired', while a ShingleFilter with a max ngram size of 3 produces the shingles 'alice beginning', 'alice beginning get', 'beginning get', 'beginning get very', 'get very', 'get very tired' and 'very tired'. Note that both bigrams and trigrams are produced here; a future enhancement to the existing algorithm would involve limiting the output to a particular gram size, as opposed to solely specifying a maximum ngram size.
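To make the shingling concrete, the following sketch reproduces the example above. It is written against the Lucene 2.9/3.0 API (ShingleFilter lives in the contrib analyzers module, and the attribute classes changed in later Lucene versions); the input string stands in for the StandardAnalyzer output shown above:

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ShingleDemo {
  public static void main(String[] args) throws Exception {
    // Tokens as StandardAnalyzer would produce them for
    // "Alice was beginning to get very tired" (lowercased, stopwords removed).
    TokenStream tokens =
        new WhitespaceTokenizer(new StringReader("alice beginning get very tired"));

    // A max shingle size of 3 yields both bigrams and trigrams; unigram
    // output is disabled so only the shingles themselves are emitted.
    ShingleFilter shingles = new ShingleFilter(tokens, 3);
    shingles.setOutputUnigrams(false);

    TermAttribute term = shingles.addAttribute(TermAttribute.class);
    while (shingles.incrementToken()) {
      System.out.println(term.term()); // 'alice beginning', 'alice beginning get', ...
    }
  }
}
{code}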
h2. Running the Collocation Identification Algorithm

There are a couple of ways to run the LLR-based collocation algorithm in Mahout.

h3. When creating vectors from a sequence file

The LLR collocation identifier is integrated into the process used to create vectors from sequence files of text keys and values. Collocations are generated when the --maxNGramSize (-ng) option is set to 2 or greater, or when it is left unspecified, in which case it defaults to 2. The --minLLR option can be used to control the cutoff that prevents collocations below the specified LLR score from being emitted, and the --minSupport argument can be used to filter out collocations that appear fewer than the specified number of times.

{code}
bin/mahout seq2sparse

Usage: [--minSupport --analyzerName --chunkSize --output --input --minDF
 --maxDFPercent --weight --norm --minLLR --numReducers --maxNGramSize
 --overwrite --help --sequentialAccessVector]

Options
  --minSupport (-s) minSupport       (Optional) Minimum Support. Default Value: 2
  --analyzerName (-a) analyzerName   The class name of the analyzer
  --chunkSize (-chunk) chunkSize     The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output               The output directory
  --input (-i) input                 Input dir containing the documents in
                                     sequence file format
  --minDF (-md) minDF                The minimum document frequency. Default is 1
  --maxDFPercent (-x) maxDFPercent   The max percentage of docs for the DF.
                                     Can be used to remove really high
                                     frequency terms. Expressed as an integer
                                     between 0 and 100. Default is 99.
  --weight (-wt) weight              The kind of weight to use. Currently TF
                                     or TFIDF
  --norm (-n) norm                   The norm to use, expressed as either a
                                     float or "INF" if you want to use the
                                     Infinite norm. Must be greater or equal
                                     to 0. The default is not to normalize
  --minLLR (-ml) minLLR              (Optional) The minimum Log Likelihood
                                     Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers    (Optional) Number of reduce tasks.
                                     Default Value: 1
  --maxNGramSize (-ng) ngramSize     (Optional) The maximum size of ngrams to
                                     create (2 = bigrams, 3 = trigrams, etc).
                                     Default Value: 2
  --overwrite (-w)                   If set, overwrite the output directory
  --help (-h)                        Print out help
  --sequentialAccessVector (-seq)    (Optional) Whether output vectors should
                                     be SequentialAccessVectors. True if set;
                                     false otherwise
{code}

h3. CollocDriver

*TODO*

{code}
bin/mahout org.apache.mahout.utils.nlp.collocations.llr.CollocDriver

Usage: [--input --output --maxNGramSize --overwrite --minSupport --minLLR
 --numReducers --analyzerName --preprocess --unigram --help]

Options
  --input (-i) input                 The Path for input files
  --output (-o) output               The Path to write output to
  --maxNGramSize (-ng) ngramSize     (Optional) The maximum size of ngrams to
                                     create (2 = bigrams, 3 = trigrams, etc).
                                     Default Value: 2
  --overwrite (-w)                   If set, overwrite the output directory
  --minSupport (-s) minSupport       (Optional) Minimum Support. Default Value: 2
  --minLLR (-ml) minLLR              (Optional) The minimum Log Likelihood
                                     Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers    (Optional) Number of reduce tasks.
                                     Default Value: 1
  --analyzerName (-a) analyzerName   The class name of the analyzer
  --preprocess (-p)                  If set, input is SequenceFile where the
                                     value is the document, which will be
                                     tokenized using the specified analyzer.
  --unigram (-u)                     If set, unigrams will be emitted in the
                                     final output alongside collocations
  --help (-h)                        Print out help
{code}
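As a concrete invocation, something like the following would score bigrams and trigrams from already-tokenized StringTuple sequence files, keeping only collocations that occur at least twice and score at least 50. The paths and threshold values here are illustrative:

{code}
bin/mahout org.apache.mahout.utils.nlp.collocations.llr.CollocDriver \
  --input /path/to/tokenized-documents \
  --output /path/to/collocation-output \
  --maxNGramSize 3 --minSupport 2 --minLLR 50 --overwrite
{code}

To tokenize raw text documents as part of the run, add --preprocess and an --analyzerName as described in the usage above.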
h2. Algorithm details

This section describes the implementation of the collocation identification algorithm in terms of the map-reduce phases used to generate ngrams and count the frequencies required to perform the log-likelihood calculation. Unless otherwise noted, classes indicated in CamelCase can be found in the mahout-utils module under the package org.apache.mahout.utils.nlp.collocations.llr.

The algorithm is implemented in two map-reduce passes:

h3. Pass 1: CollocDriver.generateCollocations(...)

Generates ngrams and counts frequencies for entire ngrams and their left and right sub-grams.

h4. Map: CollocMapper

Key: Text, the document id
Value: StringTuple, the tokens from the corresponding document

Each call to the mapper passes in the full set of tokens for the corresponding document using a StringTuple. The ShingleFilter is run across these tokens to produce ngrams of the desired length, and ngram frequencies are collected across the entire document. Once this is done, each ngram is split into head and tail portions and a key of type GramKey is generated; this key is used later to join ngrams with their heads and tails in the reducer phase. The GramKey is a composite key made up of a string ngram fragment as the primary key and a secondary key used for grouping and sorting in the reduce phase. The secondary key is either EMPTY, in the case where the head or tail of an ngram is being collected as the value, or it contains the byte[] form of the ngram, in the case where the ngram itself is being collected as the value:

{code}
head_key(EMPTY) -> (head subgram, head frequency)
head_key(ngram) -> (ngram, ngram frequency)
tail_key(EMPTY) -> (tail subgram, tail frequency)
tail_key(ngram) -> (ngram, ngram frequency)
{code}

Subgram and ngram values are packaged in Gram objects. For each ngram found, the Count.NGRAM_TOTAL counter is incremented. When the pass is complete, this counter holds the total number of ngrams encountered in the input, which is used as part of the LLR calculation.

h4. Combiner: CollocCombiner

This phase merges the counts for unique ngrams or ngram fragments across multiple documents. The combiner treats the entire GramKey as the key, so identical tuples from separate documents are passed into a single call to the combiner's reduce method.

h4. Reduce: CollocReducer

The CollocReducer employs the Hadoop secondary-sort strategy to avoid caching ngram tuples in memory while calculating frequencies. The GramKeyPartitioner ensures that tuples with the same primary key are sent to the same reducer, while the GramKeyGroupComparator ensures that the iterator provided to the reduce method first returns the fragment value and then returns ngram values grouped by ngram. This eliminates the need to cache the values returned by the iterator in order to calculate total frequencies for both fragments and ngrams.

There will be multiple frequencies for each (fragment_key, fragment) or (fragment_key, ngram) tuple: one from each map task in which the particular fragment was found. The input is traversed in the following order:

{code}
(head subgram, frequency 1)
(head subgram, frequency 2)
...
(head subgram, frequency N)

(ngram 1, frequency 1)
(ngram 1, frequency 2)
...
(ngram 1, frequency N)

(ngram 2, frequency 1)
(ngram 2, frequency 2)
...
(ngram 2, frequency N)

...

(ngram N, frequency 1)
(ngram N, frequency 2)
...
(ngram N, frequency N)
{code}

where all of the ngrams above share the same head. Data is presented in the same manner for the tail subgrams.

As the values for a subgram or ngram are traversed, frequencies are accumulated. Once all values for a subgram or ngram have been processed, the resulting key/value pairs are passed to the collector, provided the ngram frequency is equal to or greater than the specified minSupport. Whenever an ngram is skipped for this reason, the Skipped.LESS_THAN_MIN_SUPPORT counter is incremented.

Pairs are passed to the collector in the following format:

{code}
(ngram, ngram frequency) -> (subgram, subgram frequency)
{code}

In this manner, the output becomes an unsorted version of the following:

{code}
(ngram 1, frequency) -> (ngram 1 head, head frequency)
(ngram 1, frequency) -> (ngram 1 tail, tail frequency)
(ngram 2, frequency) -> (ngram 2 head, head frequency)
(ngram 2, frequency) -> (ngram 2 tail, tail frequency)
...
(ngram N, frequency) -> (ngram N head, head frequency)
(ngram N, frequency) -> (ngram N tail, tail frequency)
{code}

h3. Pass 2: CollocDriver.computeNGramsPruneByLLR(...)

Pass 1 calculated full frequencies for ngrams and subgrams and organized the data in such a way as to make the LLR calculation efficient. Pass 2 performs that calculation.

h4. Map Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper)

This phase is a no-op; the data is passed through unchanged. The rest of the work for the LLR calculation is done in the reduce phase.

h4. Reduce Phase: LLRReducer

*TODO*
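Conceptually, this phase receives each ngram together with its accumulated head and tail subgram frequencies and assembles the contingency-table counts described at the top of this page, using Count.NGRAM_TOTAL as the grand total. The following Java sketch shows that count assembly; the names are illustrative, and this is not the actual LLRReducer code:

{code}
// Rough sketch of assembling the contingency table for one ngram; names are
// illustrative, not the actual LLRReducer implementation.
public final class NGramScoringSketch {

  /**
   * @param ngramFreq  frequency of the full ngram
   * @param headFreq   total frequency of the head subgram (event A)
   * @param tailFreq   total frequency of the tail subgram (event B)
   * @param ngramTotal the Count.NGRAM_TOTAL counter value from pass 1
   */
  public static double scoreNGram(long ngramFreq, long headFreq,
                                  long tailFreq, long ngramTotal) {
    long k11 = ngramFreq;                                      // A and B together
    long k12 = tailFreq - ngramFreq;                           // B but not A
    long k21 = headFreq - ngramFreq;                           // A but not B
    long k22 = ngramTotal - (headFreq + tailFreq - ngramFreq); // neither A nor B
    // Reuses the LLR sketch from the top of this page.
    return LogLikelihoodSketch.logLikelihoodRatio(k11, k12, k21, k22);
  }
}
{code}

Ngrams whose resulting score falls below the --minLLR cutoff described above are dropped; the remainder are emitted along with their LLR scores.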
h2. References

[1] http://en.wikipedia.org/wiki/Collocation
[2] http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html