mahout-commits mailing list archives

From isa...@apache.org
Subject svn commit: r1543854 - /mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext
Date Wed, 20 Nov 2013 16:05:55 GMT
Author: isabel
Date: Wed Nov 20 16:05:55 2013
New Revision: 1543854

URL: http://svn.apache.org/r1543854
Log:
MAHOUT-1245 - reformat collocations page

Modified:
    mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext?rev=1543854&r1=1543853&r2=1543854&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/basics/collocations.mdtext Wed Nov 20 16:05:55 2013
@@ -1,4 +1,5 @@
 Title: Collocations
+
 <a name="Collocations-CollocationsinMahout"></a>
 # Collocations in Mahout
 
@@ -6,12 +7,12 @@ A collocation is defined as a sequence o
 more often than would be expected by chance. Statistically relevant
 combinations of terms identify additional lexical units which can be
 treated as features in a vector-based representation of a text. A detailed
-discussion of collocations can be found on wikipedia [1](http://en.wikipedia.org/wiki/Collocation)
-.
- 
+discussion of collocations can be found on [Wikipedia](http://en.wikipedia.org/wiki/Collocation).
+
+See there for a more detailed discussion of collocations in the [Reuters example](http://comments.gmane.org/gmane.comp.apache.mahout.user/5685).
 
 <a name="Collocations-Log-LikelihoodbasedCollocationIdentification"></a>
-## Log-Likelihood based Collocation Identification
+## Theory behind implementation: Log-Likelihood based Collocation Identification
 
 Mahout provides an implementation of a collocation identification algorithm
 which scores collocations using log-likelihood ratio. The log-likelihood
@@ -20,7 +21,7 @@ term combinations in the text. Collocati
 particular corpus will generally be more useful as features.
 
 Calculating the LLR is very straightforward and is described concisely in
-Ted Dunning's blog post [2](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
+[Ted Dunning's blog post](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
 . Ted describes the series of counts reqired to calculate the LLR for two
 events A and B in order to determine if they co-occur more often than pure
 chance. These counts include the number of times the events co-occur (k11),
@@ -100,65 +101,57 @@ times. 
     bin/mahout seq2sparse
     
     Usage:									    
-     [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize	  
 
-    <chunkSize> --output <output> --input <input> --minDF <minDF>
---maxDFPercent	  
-    <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
---numReducers  
-    <numReducers> --maxNGramSize <ngramSize> --overwrite --help		    
-    --sequentialAccessVector]
+         [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
+          --output <output> --input <input> --minDF <minDF>
+          --maxDFPercent<maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
+          --numReducers  <numReducers> --maxNGramSize <ngramSize> --overwrite --help
+          --sequentialAccessVector]
     Options 								    
-      --minSupport (-s) minSupport	      (Optional) Minimum Support. Default   
-    				      Value: 2				    
-      --analyzerName (-a) analyzerName    The class name of the analyzer	    
-      --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000
-MB  
-      --output (-o) output		      The output directory		    
-      --input (-i) input		      input dir containing the documents in 
-    				      sequence file format		    
-      --minDF (-md) minDF		      The minimum document frequency. 
-Default  
-    				      is 1				    
-      --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the
-DF.    
-    				      Can be used to remove really high     
-    				      frequency terms. Expressed as an
-integer  
-    				      between 0 and 100. Default is 99.     
-      --weight (-wt) weight 	      The kind of weight to use. Currently
-TF   
+
+      --minSupport (-s) minSupport	  (Optional) Minimum Support. Default Value: 2				   

+
+      --analyzerName (-a) analyzerName    The class name of the analyzer
+
+      --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000MB
+
+      --output (-o) output		 The output directory
+
+      --input (-i) input		   Input dir containing the documents in sequence file format
+
+      --minDF (-md) minDF		  The minimum document frequency. Default is 1
+
+      --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF. Can be used to remove
+                                          really high frequency terms. Expressed as an
+                                          integer between 0 and 100. Default is 99.     
+
+      --weight (-wt) weight 	      The kind of weight to use. Currently TF   
     				      or TFIDF				    
-      --norm (-n) norm		      The norm to use, expressed as either
-a    
+
+      --norm (-n) norm		      The norm to use, expressed as either a    
     				      float or "INF" if you want to use the 
-    				      Infinite norm.  Must be greater or
-equal  
-    				      to 0.  The default is not to
-normalize    
+    				      Infinite norm.  Must be greater orequal  
+    				      to 0.  The default is not to normalize    
+
       --minLLR (-ml) minLLR 	      (Optional)The minimum Log Likelihood  
-    				      Ratio(Float)  Default is 1.0	    
+    				      Ratio(Float)  Default is 1.0
+	    
       --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.    
     				      Default Value: 1			    
-      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams
-to  
-    				      create (2 = bigrams, 3 = trigrams,
-etc)   
-    				      Default Value:2			    
-      --overwrite (-w)		      If set, overwrite the output
-directory    
+
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to  
+    				      create (2 = bigrams, 3 = trigrams, etc)   
+    				      Default Value:2			 
+   
+      --overwrite (-w)		      If set, overwrite the output directory    
       --help (-h)			      Print out help			    
-      --sequentialAccessVector (-seq)     (Optional) Whether output vectors
-should	
-    				      be SequentialAccessVectors If set
-true	
+      --sequentialAccessVector (-seq)     (Optional) Whether output vectors should	
+    				      be SequentialAccessVectors If set true	
     				      else false 
 
 
 <a name="Collocations-CollocDriver"></a>
 ### CollocDriver
 
-*TODO*
-
 
     bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
     
@@ -166,32 +159,38 @@ true	
      [--input <input> --output <output> --maxNGramSize <ngramSize> --overwrite
   
     --minSupport <minSupport> --minLLR <minLLR> --numReducers <numReducers>
    
     --analyzerName <analyzerName> --preprocess --unigram --help]
+
     Options 								    
+
       --input (-i) input		      The Path for input files. 	    
+
       --output (-o) output		      The Path write output to		    
-      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams
-to  
-    				      create (2 = bigrams, 3 = trigrams,
-etc)   
-    				      Default Value:2			    
-      --overwrite (-w)		      If set, overwrite the output
-directory    
+
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngramsto  
+    				      create (2 = bigrams, 3 = trigrams,etc)   
+    				      Default Value:2			
+    
+      --overwrite (-w)		      If set, overwrite the outputdirectory    
+
       --minSupport (-s) minSupport	      (Optional) Minimum Support. Default   
     				      Value: 2				    
-      --minLLR (-ml) minLLR 	      (Optional)The minimum Log Likelihood  
-    				      Ratio(Float)  Default is 1.0	    
+
+      --minLLR (-ml) minLLR 	      (Optional)The minimum Log Likelihood
+    				      Ratio(Float)  Default is 1.0	  
+  
       --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.    
     				      Default Value: 1			    
+
       --analyzerName (-a) analyzerName    The class name of the analyzer	    
-      --preprocess (-p)		      If set, input is
-SequenceFile<Text,Text>  
-    				      where the value is the document, 
-which	
+
+      --preprocess (-p)		      If set, input is SequenceFile<Text,Text>  
+    				      where the value is the document, which	
     				      will be tokenized using the specified 
-    				      analyzer. 			    
-      --unigram (-u)		      If set, unigrams will be emitted in
-the   
-    				      final output alongside collocations   
+    				      analyzer. 			
+    
+      --unigram (-u)		      If set, unigrams will be emitted inthe   
+    				      final output alongside collocations
+   
       --help (-h)			      Print out help	      
 
 
@@ -227,8 +226,11 @@ Once this is done, ngrams are split into
 
 
     head_key(EMPTY) -> (head subgram, head frequency)
+
     head_key(ngram) -> (ngram, ngram frequency) 
+
     tail_key(EMPTY) -> (tail subgram, tail frequency)
+
     tail_key(ngram) -> (ngram, ngram frequency)
 
 
@@ -374,17 +376,3 @@ CollocDriver, unigrams (single tokens) w
 each token's frequency will be calculated. As with ngrams, unigrams are
 subject to filtering with minSupport and minLLR.
 
-<a name="Collocations-References"></a>
-## References
-
-\[1\](1\.html)
- http://en.wikipedia.org/wiki/Collocation
-\[2\](2\.html)
- http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
-
-
-<a name="Collocations-Discussion"></a>
-## Discussion
-
-* http://comments.gmane.org/gmane.comp.apache.mahout.user/5685 - Reuters
-example
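
For readers of this commit, the log-likelihood ratio that the modified page describes can be sketched as below. This is a minimal illustration, not Mahout's code: the page text in this diff names only k11 (the number of times A and B co-occur); k12, k21 and k22 are the remaining cells of the 2x2 contingency table from Ted Dunning's blog post, and the function names `x_log_x`, `entropy` and `llr` are hypothetical.

```python
import math


def x_log_x(x: int) -> float:
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: int) -> float:
    """Unnormalized entropy of a set of counts: N ln N - sum(k ln k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio for a 2x2 contingency table of events A and B.

    k11: A and B together      k12: A without B
    k21: B without A           k22: neither A nor B
    (cell layout assumed from Dunning's post, not from this diff)
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    # Hypothetical counts for a bigram whose two words mostly appear together.
    print(llr(10, 5, 5, 1000))
```

A score computed this way would be compared against the `--minLLR` threshold, which the usage text above documents as defaulting to 1.0. A table whose rows and columns are independent scores approximately zero.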


