Date: Fri, 20 Dec 2013 18:05:54 -0600
Subject: Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
From: "Scott C. Cote"

Suneel,

Thank you for your help. :) Thought I was completely in the ditch. If you are interested: inline with your comments are demonstrations that I finally have it (and the commands that I used) ...

YAQ (Yet another question):

How do I see with the dumper the documents that belong in a given cluster?
I issued the command:

mahout seqdumper -i reuters-kmeans-clusters/clusters-3-final/part-r-00000

Which yields data like:

Input Path: part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.iterator.ClusterWritable
Key: 0: Value: org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
Key: 1: Value: org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
Key: 2: Value: org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
...
Key: 19: Value: org.apache.mahout.clustering.iterator.ClusterWritable@193936e1
Count: 20

Was hoping to see something that associated a centroid/cluster with its members.

Given that there are 20 centroids, how do I break out the files into, say, 20 folders - one folder per centroid - so that I know their associations (I'm assuming that the clusters don't overlap)? Or is there a sequence file generated somewhere that definitively associates the vectors with each cluster?

Here is what I do know:

I know that the clusters are not given names and it is suggested that we use the top terms of the cluster to define a name.

According to the tour, I should be able to see a likelihood that a given vector is in a cluster. But

mahout seqdumper -i reuters-kmeans-clusters/clusteredPoints/part-m-00000 | more

Yields:

Input Path: part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 10266: Value: 1.0: /reut2-000.sgm-0.txt = [62:0.085, 222:0.043, 291:0.084, 1411:0.083, 1421:0.087, 1451:0.085, 1456:0.092, 1457:0.092, 1462:0.135, 1512:0.070, 1543:0.104, 2962:0.037 ...

which does NOT look like the output in the tour (did I miss something again?). But I'll try to interpret the output as saying vector with key 62 has a cosine distance of .085 from key 10266 - is that right? What do I need to look at? MiA sheds no light on this part that I have found.
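One possible way to get the per-cluster grouping out of the clusteredPoints dump, sketched in Python. It rests on assumptions that nobody in this thread has confirmed: that the Key is the id of the cluster the point was assigned to, that the number right after "Value:" is the membership weight, and that the bracketed pairs are term index:tf-idf entries of the document vector rather than distances between documents. The function name and the rows marked hypothetical are made up for illustration.

```python
import re
from collections import defaultdict

# Line shape assumed from the clusteredPoints dump above:
#   Key: <cluster id>: Value: <weight>: <document name> = [<term index>:<tf-idf>, ...
POINT_LINE = re.compile(r"^Key: (\d+): Value: ([0-9.]+): (\S+) = \[")


def group_by_cluster(lines):
    """Collect (document name, weight) pairs under each Key (assumed cluster id)."""
    clusters = defaultdict(list)
    for line in lines:
        match = POINT_LINE.match(line.strip())
        if match:  # header lines such as "Input Path: ..." are skipped
            cluster_id = int(match.group(1))
            clusters[cluster_id].append((match.group(3), float(match.group(2))))
    return dict(clusters)


sample = [
    "Input Path: part-m-00000",
    # Copied (truncated) from the dump above.
    "Key: 10266: Value: 1.0: /reut2-000.sgm-0.txt = [62:0.085, 222:0.043, 291:0.084",
    # Hypothetical rows, invented only to show the grouping.
    "Key: 10266: Value: 1.0: /hypothetical-a.txt = [5:0.100",
    "Key: 7: Value: 1.0: /hypothetical-b.txt = [9:0.200",
]

clusters = group_by_cluster(sample)
for cluster_id, members in sorted(clusters.items()):
    print(cluster_id, [name for name, _ in members])
```

To try it on real data, the seqdumper output could be redirected to a text file and its lines passed to group_by_cluster; copying each listed document into a folder named after its Key would then give the one-folder-per-centroid layout, if the assumptions above hold.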
NOTE: I wrote a very simple, non-scalable k-means Java routine that found the clusters in a set of points (2 dimensional) and tracked which point belongs to which cluster (no overlap). Want to do the same with Mahout.

Looking forward to your response to get me over this next hump ...

SCott

On 12/20/13 4:30 PM, "Suneel Marthi" wrote:

>Sorry Scott I should have looked at this more closely. I apologize.
>
>1. You are doing a seqdumper of the matrix (which is generated from the
>rowid job and is not the output of the rowsimilarity job).
>
>   Rowid Job generates a MxN matrix where M - no. of documents and N -
>terms associated with each document
>
>   The value of a cell in the Matrix is the tf-idf weight of the term.
>
>   So in the following output:
>
>   {Code}
>
>Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
>
>{Code}
>
>means for document 2 what follows are the terms:tf-idf weights.
>
>To see the term corresponding to 41625 look at dictionary.file-0 for the
>corresponding key.
>
>Hope that clarifies and clears the confusion here.

To your point, a dump of the dictionary sequence file coupled with a tail shows:

Dec 20, 2013 4:38:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 1078 ms (Minutes: 0.017966666666666666)
Key: zuccherifici: Value: 41798
Key: zuckerman: Value: 41799
Key: zuercher: Value: 41800
Key: zulia: Value: 41801
Key: zurich: Value: 41802
Key: zurn: Value: 41803
Key: zverev: Value: 41804
Key: zweig: Value: 41805
Key: zy: Value: 41806
Count: 41807

This is what I get for only looking at the beginning of the file and not really taking the time to understand the nature of the file.

>
>2. In order to see the most similar documents for a given document you
>should be looking at a seqdumper of the output from rowsimilarity which
>in ur case would be the output in reuters-similarity. That should give
>the 10 most similar documents and their cosine distances from the
>referenced document.
mahout seqdumper -i reuters-similarity/part-r-* | more

Yields

Input Path: reuters-similarity/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:0.9999999999999999,13611:0.17446750688012366,13430:0.15853208358190823,17520:0.19351644052283437,18330:0.15898358188286904,4411:0.20851636244169733,13403:0.1663674094837415,14458:0.17265033919444714,14613:0.15365176945223238,11399:0.19745333923929734}
Key: 1: Value: {9858:0.32081902404236906,9704:0.2485999435029943,9833:0.30851564542610826,19789:0.37458607189215337,10056:0.2885413911200995,10601:0.25986402839977124,11858:0.3057183602839999,17412:0.30330496505095894,1:0.9999999999999998,9702:0.26198579353949075}
Key: 2: Value: {2:1.0000000000000004,1087:0.28125327148896956,10390:0.2690057046963114,10022:0.27668518648436297,6746:0.26969982074464605,12886:0.27032675431539793,13168:0.25889934686395943,997:0.26225673856545156,1392:0.2673559453473729,20614:0.3009916279814217}
...

:)

>
>There's an error on the wiki link instructions, the seqdumper should have
>been on rowsimilarity/part-r-* and not on matrix/matrix for determining
>similar documents.
>
>Hope this helps. Sorry again for the confusion.
>
>
>On Friday, December 20, 2013 4:51 PM, Scott C. Cote wrote:
>
>Suneel and others,
>
>I am still getting the strange results when I do the tour. Suneel: I
>manually wiped out the temp folder and also deleted the reuters-XXX
>folders.
>Also, per your advice I added the -ow option to all of the commands.
>NOTE: The step to create a matrix would NOT take a -ow option
>
>I have tried again, and am still seeing references to documents that do
>not exist.
>
>The tail end of reuters-matrix/docIndex looks like (mahout seqdumper -i
>reuters-matrix/docIndex | tail):
>
>INFO: Program took 1077 ms (Minutes: 0.01795)
>Key: 21569: Value: /reut2-021.sgm-91.txt
>Key: 21570: Value: /reut2-021.sgm-92.txt
>Key: 21571: Value: /reut2-021.sgm-93.txt
>Key: 21572: Value: /reut2-021.sgm-94.txt
>Key: 21573: Value: /reut2-021.sgm-95.txt
>Key: 21574: Value: /reut2-021.sgm-96.txt
>Key: 21575: Value: /reut2-021.sgm-97.txt
>Key: 21576: Value: /reut2-021.sgm-98.txt
>Key: 21577: Value: /reut2-021.sgm-99.txt
>Count: 21578
>
>And the following snippet exists inside reuters-matrix/matrix and
>references key 41625 (which is larger than any key in docIndex).
>
>Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
>
>--->>>>> So in this email, I have listed the following pieces of
>information: 1. Commands, 2. Env vars, 3. Sw version info
>
>Again, thank you in advance for your help.
>
>Scott
>
>INFO Below:
>
>1. sequence of commands with relevant logged output points (omitted the
>sequence dump commands):
>
>mv reuters xreuters
>rm -r temp
>
>rm -r reuters-*
>mv xreuters reuters
>mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"
>mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
>mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv
>#
># added the -cd option per instructions in the Mahout In Action (MiA) so
># the convergence threshold is .1 (originally this was default value but no
># effect on the unexpected results)
># instead of default value of .5 because cosines lie within 0 and 1.
>#
>mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
>mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt -p reuters-kmeans-clusters/clusteredPoints/
>
>mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 -o reuters-matrix
>#
># the prior step had 21578 rows and 41807 columns
># 41807 came from the prior step columns output
># 10 most similar docs to each doc in the collection
>#
>mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity -r 41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess
>
>
>2. env vars are as follows:
>
>MAHOUT_LOCAL=yes
>TERM_PROGRAM=Apple_Terminal
>MAHOUT_HOME=/Users/scottccote/mahout
>TERM=xterm-256color
>SHELL=/bin/bash
>TMPDIR=/var/folders/ym/9dhjygdj3mz8ys73_2r2rc500000gn/T/
>Apple_PubSub_Socket_Render=/tmp/launch-82C1fm/Render
>HADOOP_PREFIX=/Users/scottccote/hadoop
>TERM_PROGRAM_VERSION=309
>TERM_SESSION_ID=A5B10188-433E-419A-A263-65BDDEABB9CF
>USER=scottccote
>COMMAND_MODE=unix2003
>SSH_AUTH_SOCK=/tmp/launch-XEgaqv/Listeners
>__CF_USER_TEXT_ENCODING=0x1F5:0:0
>Apple_Ubiquity_Message=/tmp/launch-N1BDIz/Apple_Ubiquity_Message
>PATH=/opt/local/bin:/opt/local/sbin:/usr/local/mysql/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Users/scottccote/hadoop/bin:/Users/scottccote/hadoop/sbin:/Users/scottccote/mahout/bin:/Users/scottccote/mongodb/bin
>PWD=/Users/scottccote/Documents/toy-workspace/MiA
>HADOOP_VERSION=1.1.2
>JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
>EDITOR=/usr/bin/vi
>HADOOP_CONF_DIR=/Users/scottccote/hadoop/conf
>LANG=en_US.UTF-8
>HADOOP_OPTS=-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk
>
>
>3. Software/OS Version Info:
>version of mahout is (property of pom.xml in mahout home): 0.8
>
>version of java (java -version): java version "1.6.0_65", Java(TM) SE
>Runtime Environment (build 1.6.0_65-b14-462-11M4609), Java HotSpot(TM)
>64-Bit Server VM (build 20.65-b04-462, mixed mode)
>
>Version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin
>Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013;
>root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
>
>
>On 12/19/13 1:08 PM, "Suneel Marthi" wrote:
>
>>I don't see a need for uploading ur commands. Clean up HDFS (both output
>>and temp folders) and try running the 5 steps again - extract reuters,
>>seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>>
>>Please use '-ow' option while running each of the jobs.
>>
>>
>>On Thursday, December 19, 2013 2:04 PM, Scott C. Cote wrote:
>>
>>I manually deleted the temp folder too (After 2 failed starts).
>>
>>Would it be helpful for me to upload my shells that encapsulate all of
>>the commands posted on the tour? They reflect the current state of
>>reuters and .8 mahout.
>>And if I did - how would I do it?
>>
>>Thanks,
>>
>>SCott
>>
>>
>>On 12/19/13 1:00 PM, "Suneel Marthi" wrote:
>>
>>>Yep, that's what has happened in ur case. The wiki doesn't have it, but
>>>please specify the -ow (overwrite) option while running the
>>>RowsimilarityJob. That should clear up both the output and temp folders
>>>before running the job.
>>>
>>>
>>>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi wrote:
>>>
>>>Haha... that could explain it, Rowsimilarityjob creates temp files
>>>during execution. If ur laptop 'sleeped' then the temp files still
>>>persist and running the job again wouldn't overwrite the old temp files
>>>(i need to verify that).
>>>
>>>It should be good enough to run the Rowsimilarity job again.
>>>
>>>
>>>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote wrote:
>>>
>>>Suneel,
>>>
>>>I'm going to do the similarity part of the tour over - my laptop was
>>>"sleeped" in the middle of the run of the rowsimilarity job.
>>>Maybe the job is sensitive to that ... :( Normally - a server would not
>>>go to sleep nor would it run in local mode.
>>>
>>>Sorry that I didn't think of that sooner.
>>>Will let you know my outcome.
>>>
>>>Am planning on redoing by deleting the contents and the folder titled
>>>"reuters-similarity"
>>>
>>>Please let me know if that is not good enough.
>>>
>>>Thanks again.
>>>
>>>SCott
>>>
>>>
>>>On 12/19/13 11:53 AM, "Suneel Marthi" wrote:
>>>
>>>>What you are seeing is the output matrix of the RowSimilarity job.
>>>>You are right there should be 21578 documents only in the reuters corpus.
>>>>
>>>>a) How many documents do you have in your docIndex? DocIndex is one of
>>>>the artifacts of the RowIDJob and should have been executed prior to
>>>>the RowSimilarity Job. You can run seqdumper on docIndex to see the
>>>>output.
>>>>
>>>>b) Also what was the message at the end of the RowId job. It should
>>>>read something like 'Wrote out matrix with 21578 rows and 19515 columns
>>>>to reuters-matrix/matrix'.
>>>>
>>>>
>>>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote wrote:
>>>>
>>>>All,
>>>>
>>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>>analysis using the Mahout command line". Thank you to whoever
>>>>contributed to that page.
>>>>
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>
>>>>Went all the way from beginning to end of the page with "seemingly" no
>>>>hiccups.
>>>>At the very end of the "tour", I became confused because the command:
>>>>
>>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>>
>>>>Allowed me to see output (snippet)
>>>>
>>>>> Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>>
>>>>Reading through that snippet of data made me think that there exists a
>>>>document with row id 41154 with cosine value of ~0.0658 (the last
>>>>element in the snippet).
>>>>
>>>>The problem is that the folder
>>>>
>>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>>
>>>>Only has 21578 files in it. Indeed, my dictionary file (output
>>>>command used shown below)
>>>>
>>>>> mahout seqdumper -i reuters-matrix/docIndex | tail
>>>>
>>>>Has a max key of
>>>>
>>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>>> Count: 21578
>>>>
>>>>So I cannot find the document with key value 41154. What does the
>>>>41154 relate to?
>>>>
>>>>Obviously I have misunderstood something that I did - or need to do -
>>>>in the tour.
>>>>Can someone please shine a light on where I strayed? I have scripted
>>>>every step that I took and can share them here if desired (I noticed
>>>>that some of the output file names changed since the page was written -
>>>>so I made adjustments).
>>>>
>>>>Regards,
>>>>
>>>>SCott
>>>>
>>>>PS Thanks TD for helping me earlier
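
For later readers of this thread: as Suneel explains above, the numbers inside a reuters-matrix/matrix row (41154, 41625 and so on) are term indices, and the matching terms live in dictionary.file-0. A minimal Python sketch of that lookup follows. It assumes the dictionary dump has the "Key: <term>: Value: <index>" shape shown in the dictionary tail earlier in the thread; the function names and the sample row with its weights are hypothetical and only illustrate the idea.

```python
import re

# Assumed line shape of a seqdumper run on reuters-vectors/dictionary.file-0,
# taken from the dictionary tail quoted in this thread.
DICT_LINE = re.compile(r"^Key: (.+): Value: (\d+)$")


def load_dictionary(lines):
    """Build a term index -> term mapping from dictionary dump lines."""
    index_to_term = {}
    for line in lines:
        match = DICT_LINE.match(line.strip())
        if match:  # log lines and the trailing "Count: ..." line are skipped
            index_to_term[int(match.group(2))] = match.group(1)
    return index_to_term


def decode_row(vector_text, index_to_term):
    """Turn '{index:weight,...}' into (term, weight) pairs; unknown indices stay numeric."""
    pairs = []
    for item in vector_text.strip("{}").split(","):
        if not item.strip():  # the dump leaves a trailing comma before '}'
            continue
        index, weight = item.split(":")
        pairs.append((index_to_term.get(int(index), int(index)), float(weight)))
    return pairs


dictionary_lines = [
    "INFO: Program took 1078 ms (Minutes: 0.017966666666666666)",
    "Key: zurich: Value: 41802",
    "Key: zweig: Value: 41805",
    "Count: 41807",
]
index_to_term = load_dictionary(dictionary_lines)

# Hypothetical row body: these weights do not come from the thread.
print(decode_row("{41802:0.25,41805:0.5,}", index_to_term))
```

Run against a full dictionary dump, the same lookup should tell which term 41154 or 41625 stands for, which is the answer to the original question in this message.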