h1. Collocations in Mahout

A collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance. Statistically significant combinations of terms identify additional lexical units which can be treated as features in a vector-based representation of a text. A detailed discussion of collocations can be found on Wikipedia [1|http://en.wikipedia.org/wiki/Collocation].

h2. Log-Likelihood based Collocation Identification

Mahout provides an implementation of a collocation identification algorithm which scores collocations using the log-likelihood ratio (LLR). The log-likelihood score indicates the relative usefulness of a collocation with respect to other term combinations in the text; collocations with the highest scores in a particular corpus will generally be more useful as features.

Calculating the LLR is very straightforward and is described concisely in Ted Dunning's blog post [2|http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html]. Ted describes the series of counts required to calculate the LLR for two events A and B in order to determine whether they co-occur more often than pure chance would predict: the number of times the events co-occur (k11), the number of times each event occurs without the other (k12 and k21), and the number of times neither event occurs (k22). These counts are summarized in the following table:

|| || Event A || Everything but Event A ||
|| Event B | A and B together (k11) | B but not A (k12) |
|| Everything but Event B | A but not B (k21) | Neither A nor B (k22) |
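Given these four counts, the ratio itself can be computed from entropy terms, as Dunning's post describes. The following is a minimal, self-contained Java sketch of that calculation; the class and method names here are illustrative (Mahout ships an equivalent utility, but this is not its API):

{code}
// Minimal sketch of Dunning's log-likelihood ratio over the four counts
// in the table above. Names are illustrative, not Mahout's API.
public final class LogLikelihoodSketch {

  // x * log(x), defined as 0 when x == 0
  private static double xLogX(long x) {
    return x == 0L ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy over raw counts:
  // H = xLogX(total) - sum(xLogX(k))
  private static double entropy(long... counts) {
    long total = 0L;
    double sumXLogX = 0.0;
    for (long k : counts) {
      total += k;
      sumXLogX += xLogX(k);
    }
    return xLogX(total) - sumXLogX;
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Equivalent to 2 * N * (H(rows) + H(cols) - H(matrix)) with normalized
    // entropies; larger scores indicate stronger association.
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // Counts are made up for illustration: a pair seen together 110 times,
    // head-only 2442 times, tail-only 111 times, and 29114 other pairs.
    System.out.println(logLikelihoodRatio(110, 2442, 111, 29114));
  }
}
{code}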
For the purposes of collocation identification, it is useful to begin by thinking in terms of word pairs, or bigrams. In this case the leading or 'head' term of the pair corresponds to A from the table above, the trailing or 'tail' term corresponds to B, and 'Neither A nor B' is the total number of word pairs in the corpus less those containing A, B, or both.

Given the word pair 'oscillation overthruster', the log-likelihood ratio is computed by looking at the number of occurrences of that word pair in the corpus, the number of word pairs that begin with 'oscillation' but end with something other than 'overthruster', the number of word pairs that end with 'overthruster' but begin with something other than 'oscillation', and the number of word pairs in the corpus that contain neither 'oscillation' nor 'overthruster'.

This can be extended from bigrams to trigrams, 4-grams and beyond. In these cases, the current algorithm uses the first token of the ngram as the head and the remaining n-1 tokens, the (n-1)-gram as it were, as the tail. Given the trigram 'hong kong cavaliers', 'hong' is treated as the head while 'kong cavaliers' is treated as the tail. Future versions of this algorithm will allow for variations in which tokens of the ngram are treated as the head and tail. Beyond ngrams, it is often useful to inspect cases where individual words occur around other interesting features of the text, such as sentence boundaries.

h2. Generating NGrams

The tools in which the collocation identification algorithm is embedded either consume tokenized text as input or allow an implementation of the Lucene Analyzer class to be specified to perform tokenization. The tokens are passed through a Lucene ShingleFilter to produce ngrams of the desired length.

Given the text "Alice was beginning to get very tired" as an example, Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get', 'very' and 'tired', while a ShingleFilter with a max ngram size of 3 produces the shingles 'alice beginning', 'alice beginning get', 'beginning get', 'beginning get very', 'get very', 'get very tired' and 'very tired'. Note that both bigrams and trigrams are produced here; a future enhancement to the existing algorithm would involve limiting the output to a particular gram size, as opposed to solely specifying a maximum ngram size.
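To make the shingling concrete, the following sketch reproduces the example above. It is written against the Lucene 2.9/3.0 API (ShingleFilter lives in the contrib analyzers module, and the attribute classes changed in later Lucene versions); the input string stands in for the StandardAnalyzer output shown above:

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ShingleDemo {
  public static void main(String[] args) throws Exception {
    // Tokens as StandardAnalyzer would produce them for
    // "Alice was beginning to get very tired" (lowercased, stopwords removed).
    TokenStream tokens =
        new WhitespaceTokenizer(new StringReader("alice beginning get very tired"));

    // A max shingle size of 3 yields both bigrams and trigrams; unigram
    // output is disabled so only the shingles themselves are emitted.
    ShingleFilter shingles = new ShingleFilter(tokens, 3);
    shingles.setOutputUnigrams(false);

    TermAttribute term = shingles.addAttribute(TermAttribute.class);
    while (shingles.incrementToken()) {
      System.out.println(term.term()); // 'alice beginning', 'alice beginning get', ...
    }
  }
}
{code}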
h2. Running the Collocation Identification Algorithm

There are a couple of ways to run the LLR-based collocation algorithm in Mahout.

h3. When creating vectors from a sequence file

The LLR collocation identifier is integrated into the process used to create vectors from sequence files of text keys and values. Collocations are generated when the --maxNGramSize (-ng) option is set to 2 or greater, or when it is left unspecified, in which case it defaults to 2. The --minLLR option can be used to control the cutoff that prevents collocations below the specified LLR score from being emitted, and the --minSupport argument can be used to filter out collocations that appear fewer than the specified number of times.

{code}
bin/mahout seq2sparse

Usage: [--minSupport --analyzerName --chunkSize --output --input --minDF
 --maxDFPercent --weight --norm --minLLR --numReducers --maxNGramSize
 --overwrite --help --sequentialAccessVector]

Options
  --minSupport (-s) minSupport       (Optional) Minimum Support. Default Value: 2
  --analyzerName (-a) analyzerName   The class name of the analyzer
  --chunkSize (-chunk) chunkSize     The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output               The output directory
  --input (-i) input                 Input dir containing the documents in
                                     sequence file format
  --minDF (-md) minDF                The minimum document frequency. Default is 1
  --maxDFPercent (-x) maxDFPercent   The max percentage of docs for the DF.
                                     Can be used to remove really high
                                     frequency terms. Expressed as an integer
                                     between 0 and 100. Default is 99.
  --weight (-wt) weight              The kind of weight to use. Currently TF
                                     or TFIDF
  --norm (-n) norm                   The norm to use, expressed as either a
                                     float or "INF" if you want to use the
                                     Infinite norm. Must be greater or equal
                                     to 0. The default is not to normalize
  --minLLR (-ml) minLLR              (Optional) The minimum Log Likelihood
                                     Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers    (Optional) Number of reduce tasks.
                                     Default Value: 1
  --maxNGramSize (-ng) ngramSize     (Optional) The maximum size of ngrams to
                                     create (2 = bigrams, 3 = trigrams, etc).
                                     Default Value: 2
  --overwrite (-w)                   If set, overwrite the output directory
  --help (-h)                        Print out help
  --sequentialAccessVector (-seq)    (Optional) Whether output vectors should
                                     be SequentialAccessVectors. True if set;
                                     false otherwise
{code}

h3. CollocDriver

*TODO*

{code}
bin/mahout org.apache.mahout.utils.nlp.collocations.llr.CollocDriver

Usage: [--input --output --maxNGramSize --overwrite --minSupport --minLLR
 --numReducers --analyzerName --preprocess --unigram --help]

Options
  --input (-i) input                 The Path for input files
  --output (-o) output               The Path to write output to
  --maxNGramSize (-ng) ngramSize     (Optional) The maximum size of ngrams to
                                     create (2 = bigrams, 3 = trigrams, etc).
                                     Default Value: 2
  --overwrite (-w)                   If set, overwrite the output directory
  --minSupport (-s) minSupport       (Optional) Minimum Support. Default Value: 2
  --minLLR (-ml) minLLR              (Optional) The minimum Log Likelihood
                                     Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers    (Optional) Number of reduce tasks.
                                     Default Value: 1
  --analyzerName (-a) analyzerName   The class name of the analyzer
  --preprocess (-p)                  If set, input is SequenceFile where the
                                     value is the document, which will be
                                     tokenized using the specified analyzer.
  --unigram (-u)                     If set, unigrams will be emitted in the
                                     final output alongside collocations
  --help (-h)                        Print out help
{code}
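As a concrete invocation, something like the following would score bigrams and trigrams from already-tokenized StringTuple sequence files, keeping only collocations that occur at least twice and score at least 50. The paths and threshold values here are illustrative:

{code}
bin/mahout org.apache.mahout.utils.nlp.collocations.llr.CollocDriver \
  --input /path/to/tokenized-documents \
  --output /path/to/collocation-output \
  --maxNGramSize 3 --minSupport 2 --minLLR 50 --overwrite
{code}

To tokenize raw text documents as part of the run, add --preprocess and an --analyzerName as described in the usage above.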
h2. Algorithm details

This section describes the implementation of the collocation identification algorithm in terms of the map-reduce phases used to generate ngrams and count the frequencies required to perform the log-likelihood calculation. Unless otherwise noted, classes indicated in CamelCase can be found in the mahout-utils module under the package org.apache.mahout.utils.nlp.collocations.llr.

The algorithm is implemented in two map-reduce passes:

h3. Pass 1: CollocDriver.generateCollocations(...)

Generates ngrams and counts frequencies for entire ngrams and their left and right sub-grams.

h4. Map: CollocMapper

Key: Text, the document id
Value: StringTuple, the tokens from the corresponding document

Each call to the mapper passes in the full set of tokens for the corresponding document using a StringTuple. The ShingleFilter is run across these tokens to produce ngrams of the desired length, and ngram frequencies are collected across the entire document. Once this is done, each ngram is split into head and tail portions and a key of type GramKey is generated; this key is used later to join ngrams with their heads and tails in the reducer phase. The GramKey is a composite key made up of a string ngram fragment as the primary key and a secondary key used for grouping and sorting in the reduce phase. The secondary key is either EMPTY, in the case where the head or tail of an ngram is being collected as the value, or it contains the byte[] form of the ngram, in the case where the ngram itself is being collected as the value:

{code}
head_key(EMPTY) -> (head subgram, head frequency)
head_key(ngram) -> (ngram, ngram frequency)
tail_key(EMPTY) -> (tail subgram, tail frequency)
tail_key(ngram) -> (ngram, ngram frequency)
{code}

Subgram and ngram values are packaged in Gram objects. For each ngram found, the Count.NGRAM_TOTAL counter is incremented. When the pass is complete, this counter holds the total number of ngrams encountered in the input, which is used as part of the LLR calculation.

h4. Combiner: CollocCombiner

This phase merges the counts for unique ngrams or ngram fragments across multiple documents. The combiner treats the entire GramKey as the key, so identical tuples from separate documents are passed into a single call to the combiner's reduce method.

h4. Reduce: CollocReducer

The CollocReducer employs the Hadoop secondary-sort strategy to avoid caching ngram tuples in memory while calculating frequencies. The GramKeyPartitioner ensures that tuples with the same primary key are sent to the same reducer, while the GramKeyGroupComparator ensures that the iterator provided to the reduce method first returns the fragment value and then returns ngram values grouped by ngram. This eliminates the need to cache the values returned by the iterator in order to calculate total frequencies for both fragments and ngrams.

There will be multiple frequencies for each (fragment_key, fragment) or (fragment_key, ngram) tuple: one from each map task in which the particular fragment was found. The input is traversed in the following order:

{code}
(head subgram, frequency 1)
(head subgram, frequency 2)
...
(head subgram, frequency N)

(ngram 1, frequency 1)
(ngram 1, frequency 2)
...
(ngram 1, frequency N)

(ngram 2, frequency 1)
(ngram 2, frequency 2)
...
(ngram 2, frequency N)

...

(ngram N, frequency 1)
(ngram N, frequency 2)
...
(ngram N, frequency N)
{code}

where all of the ngrams above share the same head. Data is presented in the same manner for the tail subgrams.

As the values for a subgram or ngram are traversed, frequencies are accumulated. Once all values for a subgram or ngram have been processed, the resulting key/value pairs are passed to the collector, provided the ngram frequency is equal to or greater than the specified minSupport. Whenever an ngram is skipped for this reason, the Skipped.LESS_THAN_MIN_SUPPORT counter is incremented.

Pairs are passed to the collector in the following format:

{code}
(ngram, ngram frequency) -> (subgram, subgram frequency)
{code}

In this manner, the output becomes an unsorted version of the following:

{code}
(ngram 1, frequency) -> (ngram 1 head, head frequency)
(ngram 1, frequency) -> (ngram 1 tail, tail frequency)
(ngram 2, frequency) -> (ngram 2 head, head frequency)
(ngram 2, frequency) -> (ngram 2 tail, tail frequency)
...
(ngram N, frequency) -> (ngram N head, head frequency)
(ngram N, frequency) -> (ngram N tail, tail frequency)
{code}

h3. Pass 2: CollocDriver.computeNGramsPruneByLLR(...)

Pass 1 calculated full frequencies for ngrams and subgrams and organized the data in such a way as to make the LLR calculation efficient. Pass 2 performs that calculation.

h4. Map Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper)

This phase is a no-op; the data is passed through unchanged. The rest of the work for the LLR calculation is done in the reduce phase.

h4. Reduce Phase: LLRReducer

*TODO*
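Conceptually, this phase receives each ngram together with its accumulated head and tail subgram frequencies and assembles the contingency-table counts described at the top of this page, using Count.NGRAM_TOTAL as the grand total. The following Java sketch shows that count assembly; the names are illustrative, and this is not the actual LLRReducer code:

{code}
// Rough sketch of assembling the contingency table for one ngram; names are
// illustrative, not the actual LLRReducer implementation.
public final class NGramScoringSketch {

  /**
   * @param ngramFreq  frequency of the full ngram
   * @param headFreq   total frequency of the head subgram (event A)
   * @param tailFreq   total frequency of the tail subgram (event B)
   * @param ngramTotal the Count.NGRAM_TOTAL counter value from pass 1
   */
  public static double scoreNGram(long ngramFreq, long headFreq,
                                  long tailFreq, long ngramTotal) {
    long k11 = ngramFreq;                                      // A and B together
    long k12 = tailFreq - ngramFreq;                           // B but not A
    long k21 = headFreq - ngramFreq;                           // A but not B
    long k22 = ngramTotal - (headFreq + tailFreq - ngramFreq); // neither A nor B
    // Reuses the LLR sketch from the top of this page.
    return LogLikelihoodSketch.logLikelihoodRatio(k11, k12, k21, k22);
  }
}
{code}

Ngrams whose resulting score falls below the --minLLR cutoff described above are dropped; the remainder are emitted along with their LLR scores.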
h2. References

[1] http://en.wikipedia.org/wiki/Collocation
[2] http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html