mahout-user mailing list archives

From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis
Date Thu, 19 Dec 2013 19:08:27 GMT
I don't see a need to upload your commands.  Clean up HDFS (both the output and temp folders)
and try running the 5 steps again: extract reuters, seqdirectory, seq2sparse, rowid job,
rowsimilarity job.

Please use the '-ow' option while running each of the jobs.
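
The clean-up-and-rerun sequence above could be sketched as a dry run that only prints what would be executed. This is a rough sketch, not the tour's actual commands: the `temp` folder name, the `hadoop fs -rm -r` removal command, and the `<options from the tour>` placeholder are assumptions, and whether every job accepts `-ow` should be checked against each job's help output. The Reuters extraction step is left out because it is not a Mahout job.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
OUT=reuters-similarity     # output folder named in this thread
TMP=temp                   # assumed temp folder name
RM="hadoop fs -rm -r"      # assumption; in local mode 'rm -rf' would apply

plan() {
  # Step 0: remove stale output and temp folders from the interrupted run.
  echo "$RM $OUT $TMP"
  # Steps 2-5 of the tour, each with the overwrite flag appended.
  for step in seqdirectory seq2sparse rowid rowsimilarity; do
    echo "mahout $step <options from the tour> -ow"
  done
}

plan
```

Printing the plan first makes it easy to confirm the folder names before anything is deleted.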







On Thursday, December 19, 2013 2:04 PM, Scott C. Cote <scottccote@gmail.com> wrote:
 
I manually deleted the temp folder too (After 2 failed starts).

Would it be helpful for me to upload my shell scripts that encapsulate all of the
commands posted on the tour?  They reflect the current state of reuters
and Mahout 0.8.
And if I did - how would I do it?

Thanks,

SCott


On 12/19/13 1:00 PM, "Suneel Marthi" <suneel_marthi@yahoo.com> wrote:

>Yep, that's what happened in your case. The wiki doesn't mention it, but
>please specify the -ow (overwrite) option when running the
>RowSimilarityJob. That should clear both the output and temp folders
>before the job runs.
>
>
>
>
>
>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
><suneel_marthi@yahoo.com> wrote:
> 
>Haha... that could explain it. RowSimilarityJob creates temp files during
>execution. If your laptop went to sleep, the temp files would still persist,
>and running the job again wouldn't overwrite the old temp files (I need to
>verify that).
>
>It should be good enough to run the RowSimilarity job again.
>
>
>
>
>
>
>
>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
><scottccote@gmail.com> wrote:
> 
>Suneel,
>
>I'm going to redo the similarity part of the tour - my laptop went to
>sleep in the middle of the rowsimilarity job run.
>Maybe the job is sensitive to that ...  :(  Normally a server would not
>go to sleep, nor would it run in local mode.
>
>Sorry that I didn't think of that sooner.
>Will let you know my outcome.
>
>I am planning to redo it after deleting the folder titled
>"reuters-similarity" and its contents.
>
>Please let me know if that is not good enough.
>
>Thanks again.
>
>SCott
>
>
>On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_marthi@yahoo.com> wrote:
>
>>What you are seeing is the output matrix of the RowSimilarity job.  You
>>are right that there should be only 21578 documents in the reuters corpus.
>>
>>a) How many documents do you have in your docIndex?  DocIndex is one of
>>the artifacts of the RowIdJob, which should have been executed prior to the
>>RowSimilarity job. You can run seqdumper on docIndex to see the output.
>>
>>b) Also, what was the message at the end of the RowId job? It should read
>>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>>reuters-matrix/matrix'.
>>
>>
>>
>>
>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
>><scottccote@gmail.com> wrote:
>> 
>>All,
>>
>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>analysis using the Mahout command line".  Thank you to whoever contributed
>>to that page.
>>
>>>
>>>https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>
>>Went all the way from beginning to end of the page with "seemingly" no
>>hiccups.
>>At the very end of the "tour", I became confused because the command:
>>
>>> mahout seqdumper -i reuters-matrix/matrix | more
>>
>>Allowed me to see output (snippet)
>>
>>> Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,
>>>   2962:0.07532412503846121,4403:0.22792379999043863,
>>>   5405:0.0964390139170019,5997:0.030023608542497426,
>>>   10108:0.12628552842745744,13043:0.14709923014699935,
>>>   13653:0.07372109235301716,13750:0.1888955967611108,
>>>   15886:0.1543819831189062,15901:0.10756083643096839,
>>>   15969:0.36601581899071867,16138:0.12548750176412274,
>>>   16553:0.11490460601515046,17734:0.10869648237816114,
>>>   17978:0.11932381316475806,18019:0.1051527785317777,
>>>   22224:0.12309146422711122,22456:0.1371221887995933,
>>>   22837:0.19295627853659875,25480:0.061693610076373216,
>>>   25958:0.09251293588851367,26105:0.10304941346400417,
>>>   26507:0.12327184002913602,28332:0.1794774670703689,
>>>   28335:0.10843140748339948,28480:0.08018737549811794,
>>>   29541:0.11169278315306423,30534:0.18480378614987836,
>>>   30921:0.1987470224449987,31071:0.17024007142554856,
>>>   31386:0.22792379999043863,31433:0.1478802530196623,
>>>   31815:0.06001469365693789,32099:0.1284458798636675,
>>>   32334:0.10973793576935256,32385:0.12143572490835457,
>>>   34782:0.030407287755940444,35425:0.035819767691229826,
>>>   37264:0.20518922008525398,37355:0.2879544482952078,
>>>   37818:0.10819820350102567,39273:0.10347873039101099,
>>>   39831:0.08810699655751153,39979:0.09528250026282217,
>>>   40427:0.18975048184863322,41154:0.06582064373931332,}
>>
>>Reading through that snippet of data made me think that there exists a
>>document with row ID 41154 and a cosine value of ~0.0658 (the last element
>>in the snippet).
>>
>>The problem is that the folder
>>
>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>
>>Only has 21578 files in it.  Indeed, my dictionary file (the command
>>used is shown below)
>>
>>> mahout seqdumper -i reuters-matrix/docIndex  | tail
>>
>>Has a max key of
>>
>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>> Count: 21578
>>
>>So I cannot find the document with key value 41154.  What does the
>>41154 relate to?
>>
>>Obviously I have misunderstood something that I did - or need to do - in
>>the tour.  Can someone please shine a light on where I strayed?  I have
>>scripted every step that I took and can share them here if desired (I
>>noticed that some of the output file names changed since the page was
>>written - so I made adjustments).
>>
>>Regards,
>>
>>SCott  
>>
>>PS  Thanks TD for helping me earlier
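
The sanity check described in the original question (comparing the largest index in a dumped vector against the document count from docIndex) can be sketched in shell. This is only an illustration: the sample line is shortened to three entries with rounded values, and the variable names are made up for the example. Whether the indices are actually meant to be document IDs depends on which matrix is being dumped, so an out-of-range index is a prompt to look closer rather than proof of an error.

```shell
#!/bin/sh
# Sample line shaped like the seqdumper output quoted in the thread
# (shortened to three entries; values rounded for the example).
line='Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.125,2962:0.075,41154:0.066,}'
doc_count=21578   # "Count:" reported by seqdumper on docIndex

# Keep only the {...} body, split it on commas, and track the largest index.
max_index=$(printf '%s\n' "$line" \
  | sed 's/^.*{//; s/}.*$//' \
  | tr ',' '\n' \
  | awk -F: 'NF == 2 && $1 + 0 > max { max = $1 + 0 } END { print max }')

# Document keys in docIndex run from 0 to doc_count - 1.
if [ "$max_index" -ge "$doc_count" ]; then
  echo "largest index $max_index is outside 0..$((doc_count - 1))"
fi
```

In a real run the sample line would be replaced by piping the output of `mahout seqdumper -i reuters-matrix/matrix` through the same filter.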