Subject: [CONF] Apache Mahout > Quick tour of Mahout text processing from the command line
Date: Tue, 20 Mar 2012 15:42:00 -0400 (EDT)
From: confluence@apache.org

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Quick tour of Mahout text processing from the command line (https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+Mahout+text+processing+from+the+command+line)

Added by Pat Ferrel:
---------------------------------------------------------------------

h1. *Quick tour of Mahout text processing from the command line*

This is a concise quick tour of using the mahout command line to generate text analysis data. It follows examples from the [Mahout in Action|http://manning.com/owen/] book and uses the Reuters-21578 data set. It is one simple path through vectorizing text, creating clusters, and calculating similar documents. The examples will work locally or distributed on a Hadoop cluster; with the small data set provided, a local installation is probably fast enough.

h1. *Generate Mahout sequence files from text*

Get the [Reuters-21578|http://www.daviddlewis.com/resources/testcollections/reuters21578/] files and extract them into "./reuters".
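Conceptually, the extraction step that follows just strips the markup from each file and keeps the story text. A crude Python illustration of the idea (not what ExtractReuters actually does internally):

```python
import re

def strip_sgml(sgml):
    # drop anything that looks like a tag, then collapse whitespace
    text = re.sub(r"<[^>]+>", " ", sgml)
    return " ".join(text.split())
```

For example, strip_sgml("<TITLE>BAHIA COCOA REVIEW</TITLE>") yields just the title text.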
They are in SGML format.

# We first convert from SGML to text:

mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"

If you plan to run this example on a Hadoop cluster you will need to copy the files to HDFS, which is not covered here.

# Now turn the raw text in a directory into mahout sequence files:

mahout seqdirectory \
   -c UTF-8 \
   -i examples/reuters-extracted/ \
   -o reuters-seqfiles

# Examine the sequence files with seqdumper:

mahout seqdumper -s reuters-seqfiles/chunk-0 | more

You should see something like this:

Input Path: reuters-seqfiles/chunk-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: /-tmp/reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79
BAHIA COCOA REVIEW
Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal ...

# Create tfidf vectors:

mahout seq2sparse \
   -i reuters-seqfiles/ \
   -o reuters-vectors/ \
   -ow -chunk 100 \
   -x 90 \
   -seq \
   -a com.finderbots.analyzers.LuceneStemmingAnalyzer \
   -ml 50 \
   -n 2 \
   -nv

This uses a custom Lucene analyzer which incorporates several token filters to stem and to toss numbers, stop words (from a list), and small words.
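The effect of that filter chain can be sketched in plain Python (an illustration of the idea only, not the actual analyzer; the stop list and minimum length below are made up):

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # tiny made-up list

def filter_tokens(text, min_len=3):
    kept = []
    for token in text.lower().split():
        if token.isdigit():        # toss numbers
            continue
        if token in STOP_WORDS:    # toss stop words
            continue
        if len(token) < min_len:   # toss small words
            continue
        kept.append(token)         # a real analyzer would also stem here
    return kept
```

So a line like "The cocoa crop of 1987 is up" reduces to just the content-bearing tokens.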
n = 2 is best for cosine distance, which we are using in clustering and for similarity. x = 90 means that if a token appears in 90% of the docs it is considered a stop word. ml = 50 sets the minimum log-likelihood ratio, which only matters when generating ngrams.

Note: generate named vectors (-nv) or it is difficult to map docs to clusters.

# Examine the vectors if you like, though they are not really human readable:

mahout seqdumper -s reuters-vectors/tfidf-vectors/part-r-00000

# Examine the tokenized docs to make sure the custom analyzer did the right thing:

mahout seqdumper \
   -s reuters-vectors/tokenized-documents/part-m-00000

This should show each doc with nice clean tokenized text: stemmed, with no numbers, etc.

# Make sure to look at the dictionary. It has every token with the integer that references it. All the vectors use the integer, not the token, so a lookup is required to see what is really inside a vector.

mahout seqdumper \
   -s reuters-vectors/dictionary.file-0 \
   | more

h1. *Cluster the documents using kmeans*

# Calculate clusters and assign documents to them:

mahout kmeans \
   -i reuters-vectors/tfidf-vectors/ \
   -c reuters-kmeans-centroids \
   -cl \
   -o reuters-kmeans-clusters \
   -k 20 \
   -ow \
   -x 10 \
   -dm org.apache.mahout.common.distance.CosineDistanceMeasure

This calculates cluster centroids and puts them in the output dir; it then finds which vectors are included in the final clusters and puts them in output/clusteredPoints.
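That assignment pass boils down to giving each vector to its nearest centroid under the chosen distance measure. A minimal Python sketch of the idea (illustrative, not Mahout's implementation):

```python
import math

def cosine_distance(a, b):
    # a, b are sparse vectors as {term_id: weight} dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def assign_cluster(doc, centroids):
    # centroids: {cluster_id: sparse vector}; returns the nearest cluster_id
    return min(centroids, key=lambda c: cosine_distance(doc, centroids[c]))
```

Note that cosine distance ignores vector length, which is why the unit-length (n = 2) tfidf vectors suit it.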
If you leave out -cl you will not get the mapping of doc to cluster.

Note: fuzzy kmeans is pretty sensitive to the fuzziness measure and can produce meaningless clusters, so look at the -m parameter to fkmeans before trying it; m = 2 produced garbage results.

# Examine the clusters and perhaps even do some analysis of how good the clusters are:

mahout clusterdump \
   -d reuters-vectors/dictionary.file-0 \
   -dt sequencefile \
   -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
   -n 20 \
   -b 100 \
   -p reuters-kmeans-clusters/clusteredPoints/

# The clusteredPoints dir has the docs mapped into clusters, and if you created vectors with names (seq2sparse -nv) you'll see them. You also have the distance from the centroid, using the distance measure supplied to the clustering driver. To look at this use seqdumper:

mahout seqdumper \
   -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 \
   | more

You will see that the file contains

   key: clusterid, value: wt = % likelihood the vector is in the cluster, distance from centroid, named vector belonging to the cluster, vector data.

For kmeans the likelihood will be 1.0 or 0. For example:

   Key: 21477: Value: wt: 1.0 distance: 0.9420744909793364  vec: /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396, 3027:0.230, 8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270, 14371:0.413]

Clusters, of course, cannot have names. A simple solution is to construct a name from the top terms in the centroid output from clusterdump.

h1. *Calculate several similar docs to each doc in the data*

This takes all docs in the data set and, for each, calculates the 10 most similar docs. It is like a "find more like this" type of search, but calculated in the background. It seems to be fast and requires only three mapreduce passes.

# First create a matrix from the vectors:

mahout rowid \
   -i reuters-vectors/tfidf-vectors/part-r-00000 \
   -o reuters-matrix

You'll get output announcing the number of columns/dimensions in the doc collection stored in the matrix. It looks like this:

Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix

Save the number of columns since it is needed in the next step. Also note that this creates a reuters-matrix/docIndex file where the rowids are mapped to docids.
In the case of this example it will be rowid-->file name, since we created named vectors in seq2sparse.

Note: This does not create a Mahout Matrix class but a sequence file, so use seqdumper to examine the results.

# Create a collection of similar docs for each row of the matrix above:

mahout rowsimilarity \
   -i reuters-matrix/matrix \
   -o reuters-similarity \
   -r 19515 \
   --similarityClassname SIMILARITY_COSINE \
   -m 10 \
   -ess

This will generate the 10 most similar docs to each doc in the collection.

# Examine the similarity list:

mahout seqdumper -s reuters-similarity/part-r-00000 | more

It should look something like this:

   Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
  12793:0.22009858979452146,3275:0.1871791030103281,
  14613:0.3534278632679437,4411:0.2516380602790199,
  17520:0.3139731583634198,13611:0.18968888212315968,
  14354:0.17673965754661425,0:1.0000000000000004}

For each rowid there is a list of ten rowids and similarity values. These correspond to documents and similarities produced by the --similarityClassname; in this case they are cosines of the angle between doc and similar doc. Look in reuters-matrix/docIndex to find the rowid-to-docid mapping. It should look something like this:

   Key: 0: Value: /-tmp/reut2-000.sgm-0.txt
   Key: 1: Value: /-tmp/reut2-000.sgm-1.txt
   Key: 2: Value: /-tmp/reut2-000.sgm-10.txt
   Key: 3: Value: /-tmp/reut2-000.sgm-100.txt
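For intuition, what rowsimilarity computes can be sketched brute-force in Python (O(n^2), so only workable on tiny collections; Mahout's three mapreduce passes exist precisely to avoid this):

```python
import math

def cosine(a, b):
    # a, b are sparse vectors as {term_id: tfidf_weight} dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def row_similarity(rows, max_similar=10):
    # rows: {rowid: sparse vector}; returns, per row, the most similar
    # rowids with their cosines (the row itself shows up with cosine ~1.0,
    # just as in the seqdumper output above)
    out = {}
    for i, vi in rows.items():
        sims = sorted(((j, cosine(vi, vj)) for j, vj in rows.items()),
                      key=lambda p: p[1], reverse=True)
        out[i] = sims[:max_similar]
    return out
```

The rowids in the result are then resolved to file names through the docIndex mapping shown above.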