Subject: [CONF] Apache Mahout > Quick tour of Mahout text processing from the command line
Date: Tue, 20 Mar 2012 15:42:00 -0400 (EDT)
From: confluence@apache.org

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Quick tour of Mahout text processing from the command line (https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+Mahout+text+processing+from+the+command+line)

Added by Pat Ferrel:
---------------------------------------------------------------------

h1. *Quick tour of Mahout text processing from the command line*

This is a concise quick tour of using the mahout command line to generate text analysis data. It follows examples from the [Mahout in Action|http://manning.com/owen/] book and uses the Reuters-21578 data set. It is one simple path through vectorizing text, creating clusters, and calculating similar documents. The examples will work locally or distributed on a Hadoop cluster; with the small data set provided, a local installation is probably fast enough.

h1. *Generate Mahout sequence files from text*

Get the [Reuters-21578|http://www.daviddlewis.com/resources/testcollections/reuters21578/] files and extract them into "./reuters".
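Conceptually, the extraction step that follows just strips the markup from each file and keeps the story text. A crude Python illustration of the idea (not what ExtractReuters actually does internally):

```python
import re

def strip_sgml(sgml):
    # drop anything that looks like a tag, then collapse whitespace
    text = re.sub(r"<[^>]+>", " ", sgml)
    return " ".join(text.split())
```

For example, strip_sgml("<TITLE>BAHIA COCOA REVIEW</TITLE>") yields just the title text.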
They are in SGML format.

# We first convert from SGML to text:

mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"

If you plan to run this example on a Hadoop cluster you will need to copy the files to HDFS, which is not covered here.

# Now turn the raw text in a directory into mahout sequence files:

mahout seqdirectory \
   -c UTF-8 \
   -i examples/reuters-extracted/ \
   -o reuters-seqfiles

# Examine the sequence files with seqdumper:

mahout seqdumper -s reuters-seqfiles/chunk-0 | more

You should see something like this:

Input Path: reuters-seqfiles/chunk-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: /-tmp/reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79
BAHIA COCOA REVIEW
Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal ...

# Create tfidf vectors:

mahout seq2sparse \
   -i reuters-seqfiles/ \
   -o reuters-vectors/ \
   -ow -chunk 100 \
   -x 90 \
   -seq \
   -a com.finderbots.analyzers.LuceneStemmingAnalyzer \
   -ml 50 \
   -n 2 \
   -nv

This uses a custom Lucene analyzer which incorporates several token filters to stem and to toss numbers, stop words (from a list), and small words.
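The effect of that filter chain can be sketched in plain Python (an illustration of the idea only, not the actual analyzer; the stop list and minimum length below are made up):

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # tiny made-up list

def filter_tokens(text, min_len=3):
    kept = []
    for token in text.lower().split():
        if token.isdigit():        # toss numbers
            continue
        if token in STOP_WORDS:    # toss stop words
            continue
        if len(token) < min_len:   # toss small words
            continue
        kept.append(token)         # a real analyzer would also stem here
    return kept
```

So a line like "The cocoa crop of 1987 is up" reduces to just the content-bearing tokens.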
n = 2 is best for cosine distance, which we are using in clustering and for similarity. x = 90 means that if a token appears in 90% of the docs it is considered a stop word. ml = 50 sets the minimum log-likelihood ratio, which only matters when generating ngrams.

Note: generate named vectors (-nv) or it is difficult to map docs to clusters.

# Examine the vectors if you like, though they are not really human readable:

mahout seqdumper -s reuters-vectors/tfidf-vectors/part-r-00000

# Examine the tokenized docs to make sure the custom analyzer did the right thing:

mahout seqdumper \
   -s reuters-vectors/tokenized-documents/part-m-00000

This should show each doc with nice clean tokenized text: stemmed, with no numbers, etc.

# Make sure to look at the dictionary. It has every token with the integer that references it. All the vectors use the integer, not the token, so a lookup is required to see what is really inside a vector.

mahout seqdumper \
   -s reuters-vectors/dictionary.file-0 \
   | more

h1. *Cluster the documents using kmeans*

# Calculate clusters and assign documents to them:

mahout kmeans \
   -i reuters-vectors/tfidf-vectors/ \
   -c reuters-kmeans-centroids \
   -cl \
   -o reuters-kmeans-clusters \
   -k 20 \
   -ow \
   -x 10 \
   -dm org.apache.mahout.common.distance.CosineDistanceMeasure

This calculates cluster centroids and puts them in the output dir; it then finds which vectors are included in the final clusters and puts them in output/clusteredPoints.
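That assignment pass boils down to giving each vector to its nearest centroid under the chosen distance measure. A minimal Python sketch of the idea (illustrative, not Mahout's implementation):

```python
import math

def cosine_distance(a, b):
    # a, b are sparse vectors as {term_id: weight} dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def assign_cluster(doc, centroids):
    # centroids: {cluster_id: sparse vector}; returns the nearest cluster_id
    return min(centroids, key=lambda c: cosine_distance(doc, centroids[c]))
```

Note that cosine distance ignores vector length, which is why the unit-length (n = 2) tfidf vectors suit it.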
If you leave out -cl you will not get the mapping of doc to cluster.

Note: fuzzy kmeans is pretty sensitive to the fuzziness measure and can produce meaningless clusters, so look at the -m parameter to fkmeans before trying it; m = 2 produced garbage results.

# Examine the clusters and perhaps even do some analysis of how good the clusters are:

mahout clusterdump \
   -d reuters-vectors/dictionary.file-0 \
   -dt sequencefile \
   -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
   -n 20 \
   -b 100 \
   -p reuters-kmeans-clusters/clusteredPoints/

# The clusteredPoints dir has the docs mapped into clusters, and if you created vectors with names (seq2sparse -nv) you'll see them. You also have the distance from the centroid, using the distance measure supplied to the clustering driver. To look at this use seqdumper:

mahout seqdumper \
   -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 \
   | more

You will see that the file contains

   key: clusterid, value: wt = % likelihood the vector is in the cluster, distance from centroid, named vector belonging to the cluster, vector data.

For kmeans the likelihood will be 1.0 or 0. For example:

   Key: 21477: Value: wt: 1.0 distance: 0.9420744909793364  vec: /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396, 3027:0.230, 8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270, 14371:0.413]

Clusters, of course, cannot have names. A simple solution is to construct a name from the top terms in the centroid output from clusterdump.

h1. *Calculate several similar docs to each doc in the data*

This takes all docs in the data set and, for each, calculates the 10 most similar docs. It is like a "find more like this" type of search, but calculated in the background. It seems to be fast and requires only three mapreduce passes.

# First create a matrix from the vectors:

mahout rowid \
   -i reuters-vectors/tfidf-vectors/part-r-00000 \
   -o reuters-matrix

You'll get output announcing the number of columns/dimensions in the doc collection stored in the matrix. It looks like this:

Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix

Save the number of columns since it is needed in the next step. Also note that this creates a reuters-matrix/docIndex file where the rowids are mapped to docids.
In the case of this example it will be rowid-->file name, since we created named vectors in seq2sparse.

Note: This does not create a Mahout Matrix class but a sequence file, so use seqdumper to examine the results.

# Create a collection of similar docs for each row of the matrix above:

mahout rowsimilarity \
   -i reuters-matrix/matrix \
   -o reuters-similarity \
   -r 19515 \
   --similarityClassname SIMILARITY_COSINE \
   -m 10 \
   -ess

This will generate the 10 most similar docs to each doc in the collection.

# Examine the similarity list:

mahout seqdumper -s reuters-similarity/part-r-00000 | more

It should look something like this:

   Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
  12793:0.22009858979452146,3275:0.1871791030103281,
  14613:0.3534278632679437,4411:0.2516380602790199,
  17520:0.3139731583634198,13611:0.18968888212315968,
  14354:0.17673965754661425,0:1.0000000000000004}

For each rowid there is a list of ten rowids and similarity values. These correspond to documents and similarities produced by the --similarityClassname; in this case they are cosines of the angle between doc and similar doc. Look in reuters-matrix/docIndex to find the rowid-to-docid mapping. It should look something like this:

   Key: 0: Value: /-tmp/reut2-000.sgm-0.txt
   Key: 1: Value: /-tmp/reut2-000.sgm-1.txt
   Key: 2: Value: /-tmp/reut2-000.sgm-10.txt
   Key: 3: Value: /-tmp/reut2-000.sgm-100.txt
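For intuition, what rowsimilarity computes can be sketched brute-force in Python (O(n^2), so only workable on tiny collections; Mahout's three mapreduce passes exist precisely to avoid this):

```python
import math

def cosine(a, b):
    # a, b are sparse vectors as {term_id: tfidf_weight} dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def row_similarity(rows, max_similar=10):
    # rows: {rowid: sparse vector}; returns, per row, the most similar
    # rowids with their cosines (the row itself shows up with cosine ~1.0,
    # just as in the seqdumper output above)
    out = {}
    for i, vi in rows.items():
        sims = sorted(((j, cosine(vi, vj)) for j, vj in rows.items()),
                      key=lambda p: p[1], reverse=True)
        out[i] = sims[:max_similar]
    return out
```

The rowids in the result are then resolved to file names through the docIndex mapping shown above.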