mahout-user mailing list archives

From "Scott C. Cote" <scottcc...@gmail.com>
Subject Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
Date Thu, 19 Dec 2013 18:46:22 GMT
Suneel,

I'm going to redo the similarity part of the tour. My laptop went to
sleep in the middle of the rowsimilarity job run, and the job may be
sensitive to that. :(  Normally a server would neither go to sleep nor
run in local mode.

Sorry that I didn't think of that sooner.
Will let you know my outcome.

I plan to redo it after deleting the "reuters-similarity" folder and its
contents.

Please let me know if that is not good enough.
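
For local mode, the cleanup described above could look like the following minimal sketch. The helper name clean_output is hypothetical; the folder name is the one mentioned in this message. On a cluster the folder would live in HDFS and would need the corresponding HDFS delete instead.

```shell
#!/bin/sh
# clean_output (hypothetical helper): remove a previous job's output
# directory, including its contents, if it exists. Local filesystem only.
clean_output() {
    out_dir="$1"
    if [ -d "$out_dir" ]; then
        rm -rf -- "$out_dir"
    fi
}

# Usage, run from the working directory of the tour:
#   clean_output reuters-similarity
```

Calling the helper on a folder that does not exist is a no-op, so it is safe to run before every retry.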

Thanks again.

SCott

On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_marthi@yahoo.com> wrote:

>What you are seeing is the output matrix of the RowSimilarity job. You
>are right: there should be only 21578 documents in the Reuters corpus.
>
>a) How many documents do you have in your docIndex? DocIndex is one of
>the artifacts of the RowIdJob, which should have been run before the
>RowSimilarity job. You can run seqdumper on docIndex to see the output.
>
>b) Also, what was the message at the end of the RowIdJob? It should read
>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>reuters-matrix/matrix'.
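
One way to check item b), as a minimal sketch: assuming the console output of the RowId job was saved to a file (the name rowid.log in the comment is hypothetical), the row and column counts can be pulled out of the final message with sed. The function name matrix_dims is made up for illustration, and the pattern assumes the message wording quoted above.

```shell
#!/bin/sh
# matrix_dims (hypothetical helper): read job log text on stdin and print
# "<rows> <columns>" from the final "Wrote out matrix ..." message.
# Prints nothing if no such line is present.
matrix_dims() {
    sed -n 's/.*Wrote out matrix with \([0-9][0-9]*\) rows and \([0-9][0-9]*\) columns.*/\1 \2/p'
}

# Example using the message text quoted above:
echo "Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix" | matrix_dims
# prints: 21578 19515

# With a saved log (hypothetical file name):
#   matrix_dims < rowid.log
```

The first number should match the Count reported by seqdumper on docIndex.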
>
>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
><scottccote@gmail.com> wrote:
> 
>All,
>
>I am a newbie Mahout user and am trying to use the "Quick tour of text
>analysis using the Mahout command line". Thank you to whoever contributed
>to that page.
>
>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>
>I went through the page from beginning to end with seemingly no hiccups.
>At the very end of the tour, I became confused because the command:
>
>> mahout seqdumper -i reuters-matrix/matrix | more
>
>produced this output (snippet):
>
>> Key: 1: Value: /reut2-000.sgm-1.txt:{
>> 312:0.1250488193181003,2962:0.07532412503846121,4403:0.22792379999043863,
>> 5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,
>> 13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,
>> 15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,
>> 16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,
>> 17978:0.11932381316475806,18019:0.1051527785317777,22224:0.12309146422711122,
>> 22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,
>> 25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,
>> 28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,
>> 29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,
>> 31071:0.17024007142554856,31386:0.22792379999043863,31433:0.1478802530196623,
>> 31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,
>> 32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,
>> 37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,
>> 39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,
>> 40427:0.18975048184863322,41154:0.06582064373931332,}
>
>Reading through that snippet made me think that there is a document with
>row ID 41154 and a cosine value of ~0.0658 (the last element in the
>snippet).
>
>The problem is that the folder
>
>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>
>only has 21578 files in it. Indeed, my docIndex file (dumped with the
>command shown below)
>
>> mahout seqdumper -i reuters-matrix/docIndex  | tail
>
>has a max key of
>
>> Key: 21576: Value: /reut2-021.sgm-98.txt
>> Key: 21577: Value: /reut2-021.sgm-99.txt
>> Count: 21578
>
>So I cannot find the document with key 41154. What does the 41154
>relate to?
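
The comparison described above can be sketched in shell. This is only an illustration: max_index is a hypothetical helper, the row is an abbreviated form of the snippet (first two and last elements only), and the sketch assumes seqdumper prints each row on a single line.

```shell
#!/bin/sh
# max_index (hypothetical helper): print the largest integer index in a
# sparse vector dump read on stdin, e.g. "...:{312:0.12,2962:0.07,}".
max_index() {
    # Put each index:value pair on its own line, keep only the index,
    # then sort numerically and take the last (largest) one.
    tr '{,' '\n\n' | sed -n 's/^\([0-9][0-9]*\):[0-9].*/\1/p' | sort -n | tail -n 1
}

# Abbreviated form of the row shown above (first two and last elements only).
row='Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,41154:0.06582064373931332,}'
doc_count=21578   # the "Count:" reported by seqdumper on docIndex

largest=$(printf '%s\n' "$row" | max_index)
if [ "$largest" -ge "$doc_count" ]; then
    echo "largest index $largest is outside the docIndex key range 0..$((doc_count - 1))"
fi
```

The sketch only confirms the mismatch between the indices in the matrix rows and the keys in docIndex; it does not say what the indices refer to.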
>
>Obviously I have misunderstood something that I did - or need to do - in
>the tour. Can someone please shine a light on where I strayed? I have
>scripted every step that I took and can share the scripts here if desired
>(I noticed that some of the output file names have changed since the page
>was written, so I made adjustments).
>
>Regards,
>
>SCott  
>
>PS  Thanks TD for helping me earlier


