mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Quick tour of Mahout text processing from the command line
Date Tue, 20 Mar 2012 19:42:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Quick tour of Mahout text processing from the command line (https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+Mahout+text+processing+from+the+command+line)

Added by Pat Ferrel:
---------------------------------------------------------------------
h1. {color:#000000}{*}Quick tour of Mahout text processing from the command line{*}{color}

{color:#000000}This is a concise quick tour of using the mahout command line to generate text
analysis data. It follows examples from the{color} [Mahout in Action|http://manning.com/owen/] {color:#000000}book
and uses the Reuters-21578 data set. This is one simple path through vectorizing text, creating
clusters, and calculating similar documents. The examples will work locally or distributed
on a Hadoop cluster. With the small data set provided, a local installation is probably fast
enough.{color}

h1. {color:#000000}{*}Generate Mahout sequence files from text{*}{color}

{color:#000000}Get the{color} [Reuters-21578|http://www.daviddlewis.com/resources/testcollections/reuters21578/]
{color:#000000}files and extract them in “./reuters”. They are in SGML format.{color}
# {color:#000000}First convert from SGML to text:{color}
{color:#000000}mvn \-e \-q exec:java \-Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters"
\-Dexec.args="reuters/ reuters-extracted/"{color}
{color:#000000}If you plan to run this example on a Hadoop cluster, you will need to copy the
files to HDFS, which is not covered here.{color}
# {color:#000000}Now turn the raw text in a directory into Mahout sequence files. Adjust \-i to wherever reuters-extracted/ was written in the previous step:{color}
{color:#000000}mahout seqdirectory \{color}
{color:#000000}   -c UTF-8 \{color}
{color:#000000}   -i examples/reuters-extracted/ \{color}
{color:#000000}   -o reuters-seqfiles{color}
# {color:#000000}Examine the sequence files with seqdumper:{color}
{color:#000000}mahout seqdumper \-s reuters-seqfiles/chunk-0 \| more{color}
{color:#000000}You should see something like this:{color}

{color:#000000}Input Path: reuters-seqfiles/chunk-0{color}

{color:#000000}Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text{color}

{color:#000000}Key: /-tmp/reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79{color}

{color:#000000}BAHIA COCOA REVIEW{color}

{color:#000000}Showers continued throughout the week in the Bahia cocoa zone, alleviating
the drought since early January and improving prospects for the coming temporao, although
normal ...{color}
# {color:#000000}Create TF-IDF vectors:{color}
{color:#000000}mahout seq2sparse \{color}
{color:#000000}   -i reuters-seqfiles/ \{color}
{color:#000000}   -o reuters-vectors/ \{color}
{color:#000000}   -ow \-chunk 100 \{color}
{color:#000000}   -x 90 \{color}
{color:#000000}   -seq \{color}
{color:#000000}   -a com.finderbots.analyzers.LuceneStemmingAnalyzer \{color}
{color:#000000}   -ml 50 \{color}
{color:#000000}   -n 2 \{color}
{color:#000000}   -nv{color}
{color:#000000}This uses a custom Lucene analyzer that chains several token filters
to stem tokens and to drop numbers, stop words (from a list), and short words. \-n 2 (L2 normalization) works best with cosine
distance, which is used here for both clustering and similarity. \-x 90 means that a token
appearing in more than 90% of the docs is treated as a stop word. \-ml 50 sets the minimum log-likelihood ratio for keeping n-grams; it only matters when the n-gram size is greater than 1, which is not the case here.{color}
{color:#000000}Note: create named vectors (\-nv), or it will be difficult to map docs
to clusters.{color}
# {color:#000000}Examine the vectors if you like, but they are not really human readable:{color}
{color:#000000}mahout seqdumper \-s reuters-vectors/tfidf-vectors/part-r-00000{color}
# {color:#000000}Examine the tokenized docs to make sure the custom analyzer worked as intended:{color}
{color:#000000}mahout seqdumper \{color}
{color:#000000}   -s reuters-vectors/tokenized-documents/part-m-00000{color}
{color:#000000}This should show each doc as clean tokenized text: stemmed, with no numbers,
etc.{color}
# {color:#000000}Make sure to look at the dictionary. It maps every token to the integer
that references it. All the vectors use the integer, not the token, so a lookup is required
to see what is really inside a vector.{color}
{color:#000000}mahout seqdumper \{color}
{color:#000000}   -s reuters-vectors/dictionary.file-0 \{color}
{color:#000000}   \| more{color}
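
The dictionary lookup described above can be scripted. The following is a minimal sketch, not part of Mahout. It assumes the seqdumper output for the dictionary has been saved to a text file with one "Key: token: Value: id" line per entry (the same layout as the other dumps shown on this page), and that a vector is passed in as a bracketed list of id:weight pairs. The function name lookup_terms, the file name dictionary-dump.txt, and the sample tokens and ids are made up for illustration.

```shell
# Hypothetical sample of a saved seqdumper dump of dictionary.file-0.
# Assumed line layout: "Key: <token>: Value: <integer id>".
cat > dictionary-dump.txt <<'EOF'
Key: bahia: Value: 372
Key: cocoa: Value: 966
Key: drought: Value: 3027
EOF

# lookup_terms DICT_DUMP VECTOR
# Prints one "token<TAB>weight" line per vector element. Ids that are not
# in the dictionary dump are printed as "?<id>".
lookup_terms() {
  printf '%s\n' "$2" | tr -d '[]' | tr ',' '\n' |
  awk '
    NR == FNR {
      # First input: the dictionary dump. Remember the token for each id.
      if ($1 == "Key:" && $3 == "Value:") {
        token = $2
        sub(/:$/, "", token)
        term[$4] = token
      }
      next
    }
    NF {
      # Second input (stdin): one "id:weight" pair per line.
      split($1, pair, ":")
      printf "%s\t%s\n", ((pair[1] in term) ? term[pair[1]] : "?" pair[1]), pair[2]
    }
  ' "$1" -
}

lookup_terms dictionary-dump.txt '[372:0.318, 966:0.396, 8816:0.452]'
```

With a real dump, redirect the seqdumper output to a file first and pass that file as the first argument. Tokens that contain whitespace would need a more careful parser than this one.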

h1. {color:#000000}{*}Cluster the documents using kmeans{*}{color}

# {color:#000000}Calculate the clusters and assign each document to one of them:{color}
{color:#000000}mahout kmeans \{color}
{color:#000000}   -i reuters-vectors/tfidf-vectors/ \{color}
{color:#000000}   -c reuters-kmeans-centroids \{color}
{color:#000000}   -cl \{color}
{color:#000000}   -o reuters-kmeans-clusters \{color}
{color:#000000}   -k 20 \{color}
{color:#000000}   -ow \{color}
{color:#000000}   -x 10 \{color}
{color:#000000}   -dm org.apache.mahout.common.distance.CosineDistanceMeasure{color}
{color:#000000}This calculates cluster centroids and puts them in the output dir. It then finds
which vectors are included in the final clusters and puts them in output/clusteredPoints.
If you leave out \-cl you will not get the mapping of doc to cluster.{color}
{color:#000000}Note: fuzzy kmeans is quite sensitive to the fuzziness parameter and can
produce meaningless clusters, so look at the \-m parameter to fkmeans before trying it. m = 2 produced
garbage results.{color}
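# {color:#000000}Examine the clusters and perhaps even do some analysis of how good the clusters
are. The number in the clusters-N-final directory name depends on how many iterations were run, so check the output dir for the actual name:{color}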
{color:#000000}mahout clusterdump \{color}
{color:#000000}   -d reuters-vectors/dictionary.file-0 \{color}
{color:#000000}   -dt sequencefile \{color}
{color:#000000}   -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \{color}
{color:#000000}   -n 20 \{color}
{color:#000000}   -b 100 \{color}
{color:#000000}   -p reuters-kmeans-clusters/clusteredPoints/{color}
# {color:#000000}The clusteredPoints dir has the docs mapped into clusters, and if you created
vectors with names (seq2sparse \-nv) you’ll see them. You also have the distance from the
centroid using the distance measure supplied to the clustering driver. To look at this use
seqdumper:{color}
{color:#000000}mahout seqdumper \{color}
{color:#000000}   -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 \{color}
{color:#000000}   \| more{color}

{color:#000000}You will see that the file contains{color}

{color:#000000}   key: clusterid, value: wt = likelihood that the
vector is in the cluster, distance from centroid, named vector belonging to the cluster, vector
data.{color}

{color:#000000}For kmeans the likelihood will be 1.0 or 0. For example:{color}

{color:#000000}   Key: 21477: Value: wt: 1.0distance: 0.9420744909793364
 vec: /-tmp/reut2-000.sgm-158.txt = \[372:0.318, 966:0.396, 3027:0.230, 8816:0.452,
8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270, 14371:0.413\]{color}
{color:#000000}Clusters, of course, do not come with names. A simple solution is to construct a
name from the top terms in the centroid output from clusterdump.{color}
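
To get from the clusteredPoints dump to a readable doc-to-cluster mapping, something like the following sketch may help. It is not part of Mahout. It assumes the seqdumper output has been saved to a text file with each point on a single line in the "Key: clusterid: Value: wt: ... vec: name = ..." layout shown above, and that doc names contain no spaces. The function names cluster_members and cluster_sizes, the file name clustered-points-dump.txt, and the second and third sample lines are made up for illustration.

```shell
# Hypothetical sample of a saved clusteredPoints dump, one point per line.
# The first line is the example from the text with a shortened vector;
# the other two lines are invented.
cat > clustered-points-dump.txt <<'EOF'
Key: 21477: Value: wt: 1.0distance: 0.9420744909793364 vec: /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396]
Key: 21477: Value: wt: 1.0distance: 0.8 vec: /-tmp/reut2-000.sgm-159.txt = [372:0.5]
Key: 9: Value: wt: 1.0distance: 0.7 vec: /-tmp/reut2-001.sgm-7.txt = [966:0.9]
EOF

# cluster_members DUMP_FILE
# Prints "clusterid<TAB>doc name" for every dumped point.
cluster_members() {
  awk '
    $1 == "Key:" {
      id = $2
      sub(/:$/, "", id)
      # The doc name is the field that follows the "vec:" marker.
      for (i = 3; i < NF; i++) {
        if ($i == "vec:") {
          print id "\t" $(i + 1)
          break
        }
      }
    }
  ' "$1"
}

# cluster_sizes DUMP_FILE
# Prints "clusterid<TAB>number of docs", sorted by cluster id.
cluster_sizes() {
  cluster_members "$1" |
  awk -F '\t' '{ count[$1]++ } END { for (id in count) print id "\t" count[id] }' |
  sort -n
}

cluster_members clustered-points-dump.txt
cluster_sizes clustered-points-dump.txt
```

The per-cluster counts give a quick sanity check: a single huge cluster next to many tiny ones is often a sign that the clustering parameters need another look.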

h1. {color:#000000}{*}Calculate several similar docs to each doc in the data{*}{color}

{color:#000000}This takes every doc in the data set and calculates its 10 most
similar docs. It is like a “find more like this” search, but calculated in the
background. It seems to be fast and requires only three mapreduce passes.{color}
# {color:#000000}First create a matrix from the vectors:{color}
{color:#000000}mahout rowid \{color}
{color:#000000}   -i reuters-vectors/tfidf-vectors/part-r-00000 \{color}
{color:#000000}   -o reuters-matrix{color}
{color:#000000}You’ll get output announcing the number of columns/dimensions in the doc
collection stored in the matrix. It looks like this:{color}
{color:#000000}Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix{color}
{color:#000000}Save the number of columns since it is needed in the next step. Also note that
this creates a reuters-matrix/docIndex file where the rowids are mapped to docids. In the
case of this example it will be rowid-->file name since we created named vectors in seq2sparse.{color}
{color:#000000}Note: This does not create a Mahout Matrix class but a sequence file so use
seqdumper to examine the results.{color}
# {color:#000000}Create a list of similar docs for each row of the matrix above:{color}
{color:#000000}mahout rowsimilarity \{color}
{color:#000000}   -i reuters-matrix/matrix \{color}
{color:#000000}   -o reuters-named-similarity \{color}
{color:#000000}   -r 19515 \{color}
{color:#000000}   --similarityClassname SIMILARITY_COSINE \{color}
{color:#000000}   -m 10 \{color}
{color:#000000}   -ess{color}
{color:#000000}This will generate the 10 most similar docs to each doc in the collection.{color}
# {color:#000000}Examine the similarity list:{color}
{color:#000000}mahout seqdumper \-s reuters-named-similarity/part-r-00000 \| more{color}
{color:#000000}This should look something like this:{color}
{color:#000000}   Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,{color}
{color:#000000}  12793:0.22009858979452146,3275:0.1871791030103281,{color}
{color:#000000}  14613:0.3534278632679437,4411:0.2516380602790199,{color}
{color:#000000}  17520:0.3139731583634198,13611:0.18968888212315968,{color}
{color:#000000}  14354:0.17673965754661425,0:1.0000000000000004}{color}
{color:#000000}For each rowid there is a list of ten rowids and similarity values. These correspond
to documents and the similarity computed by the measure given in \--similarityClassname. In this case they are cosines
of the angle between the doc and each similar doc, so higher means more similar. Note that a doc can appear in its own list with a value of about 1.0, as row 0 does above. Look in reuters-matrix/docIndex to find the rowid
to docid mapping. It should look something like this:{color}
{color:#000000}   Key: 0: Value: /-tmp/reut2-000.sgm-0.txt{color}
{color:#000000}   Key: 1: Value: /-tmp/reut2-000.sgm-1.txt{color}
{color:#000000}   Key: 2: Value: /-tmp/reut2-000.sgm-10.txt{color}
{color:#000000}   Key: 3: Value: /-tmp/reut2-000.sgm-100.txt{color}
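
The rowid-to-docid lookup can be applied to the whole similarity list at once. The following is a minimal sketch, not part of Mahout. It assumes both seqdumper outputs (docIndex and the similarity list) have been saved to text files, with each similarity row on a single line in the "Key: rowid: Value: {rowid:value,...}" layout shown above, and that doc names contain no spaces. The function name similar_docs, the two file names, and the similarity values in the sample are made up for illustration.

```shell
# Hypothetical saved dumps. The docIndex lines follow the example in the
# text; the similarity rows and their values are invented.
cat > doc-index-dump.txt <<'EOF'
Key: 0: Value: /-tmp/reut2-000.sgm-0.txt
Key: 1: Value: /-tmp/reut2-000.sgm-1.txt
Key: 2: Value: /-tmp/reut2-000.sgm-10.txt
EOF
cat > similarity-dump.txt <<'EOF'
Key: 0: Value: {2:0.25,1:0.5,0:1.0}
Key: 1: Value: {0:0.5,7:0.125,1:1.0}
EOF

# similar_docs DOCINDEX_DUMP SIMILARITY_DUMP
# Prints "doc<TAB>similar doc<TAB>similarity" for every pair, skipping the
# entry where a doc is paired with itself. Row ids missing from the
# docIndex dump are printed as "row <id>".
similar_docs() {
  awk '
    function label(id) { return (id in name) ? name[id] : "row " id }
    NR == FNR {
      # First file: docIndex dump, "Key: <rowid>: Value: <doc name>".
      if ($1 == "Key:") {
        id = $2
        sub(/:$/, "", id)
        name[id] = $4
      }
      next
    }
    $1 == "Key:" {
      row = $2
      sub(/:$/, "", row)
      # Keep only the text between the braces, then drop any spaces.
      open = index($0, "{")
      close_at = index($0, "}")
      list = substr($0, open + 1, close_at - open - 1)
      gsub(/[ \t]/, "", list)
      n = split(list, pairs, ",")
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, ":")
        if (kv[1] == row) continue
        print label(row) "\t" label(kv[1]) "\t" kv[2]
      }
    }
  ' "$1" "$2"
}

similar_docs doc-index-dump.txt similarity-dump.txt
```

Pairs are printed in the order they appear in the dump, which is not sorted by similarity.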
