mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: How to find the k most similar docs
Date Fri, 09 Mar 2012 17:50:21 GMT
I assume that the other matrix operations consume and produce
<Text, MatrixWritable>? If so, how do you create <Text, MatrixWritable>
from the output of rowid, which is <IntWritable, VectorWritable>?

Also, while we are at it, how do you use vectordump? Running "bin/mahout
vectordump --help" produces output that is unreadable. I would have
guessed that vectordump works on either <IntWritable, VectorWritable>
(the output of rowid) or <Text, VectorWritable> (the contents of
tfidf-vectors/part-r-00000), but it doesn't seem to work on either
using "bin/mahout vectordump -s path-to-file".

Thanks
Pat

On 3/9/12 4:26 AM, Suneel Marthi wrote:
> Pat,
>
> MatrixDump expects an input file of <Text, MatrixWritable>. The matrix that gets
> created from RowIdJob is <IntWritable, VectorWritable>, and you cannot run MatrixDump
> to see the contents of the matrix. You need to use seqdumper as you had done.
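
To make the key-type mismatch concrete, here is a toy Python model of what the rowid step does to the keys. It is only a sketch, not Mahout code: the function name assign_row_ids and the sample data are made up, and the second output (an index from int id back to the original text key) reflects my understanding of the docIndex file that rowid writes next to the matrix.

```python
# Toy model of the key conversion performed by rowid (not Mahout code).
# Input:  (text_key, sparse_vector) pairs, like <Text, VectorWritable>.
# Output: an int-keyed matrix plus an index from int id back to the text key.

def assign_row_ids(named_vectors):
    """Replace text keys with sequential integer row ids (hypothetical helper)."""
    matrix = []      # (int_id, vector) pairs, like <IntWritable, VectorWritable>
    doc_index = []   # (int_id, text_key) pairs, like <IntWritable, Text>
    for row_id, (text_key, vector) in enumerate(named_vectors):
        matrix.append((row_id, vector))
        doc_index.append((row_id, text_key))
    return matrix, doc_index


# Made-up sample standing in for tfidf-vectors/part-r-00000.
tfidf_vectors = [
    ("/doc-a", {0: 0.5, 7: 1.2}),
    ("/doc-b", {3: 0.9}),
]
matrix, doc_index = assign_row_ids(tfidf_vectors)
```

Each record in the result is a single row keyed by an integer, which is a different shape from one record holding a whole matrix under a text key.
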
>
>
>
> ________________________________
> From: Pat Ferrel <pat@occamsmachete.com>
> To: user@mahout.apache.org
> Sent: Thursday, March 8, 2012 7:14 PM
> Subject: Re: How to find the k most similar docs
>
> OK, back to the beginning. I went through the entire sequence again with the notable
> exception that I did not create named vectors. I also tweaked some of the seq2sparse parameters.
>
>     bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a
>     org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf
>     -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
>
> After doing a rowid on the tfidf vectors I still get an error doing matrixdump on wp-matrix/matrix.
> Am I using it wrong? Taking on faith that a matrix was created, I performed the rowsimilarity
> job and now get a far bigger file that looks OK:
>
>     bin/mahout rowsimilarity -r 311433 -i wp-matrix/matrix -o
>     wp-similarity -ess -s SIMILARITY_COSINE -m 10
>     MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>     Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
>     HADOOP_CONF_DIR=/usr/local/hadoop/conf
>     MAHOUT-JOB:
>     /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
>     12/03/08 15:48:35 INFO common.AbstractJob: Command line arguments:
>     {--endPhase=2147483647, --excludeSelfSimilarity=false,
>     --input=wp-matrix/matrix, --maxSimilaritiesPerRow=10,
>     --numberOfColumns=311433, --output=wp-similarity,
>     --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
>     12/03/08 15:48:36 INFO input.FileInputFormat: Total input paths to
>     process : 1
>     12/03/08 15:48:36 INFO mapred.JobClient: Running job:
>     job_201203071745_0040
>     12/03/08 15:48:37 INFO mapred.JobClient:  map 0% reduce 0%
>     12/03/08 15:48:58 INFO mapred.JobClient:  map 17% reduce 0%
>     12/03/08 15:49:01 INFO mapred.JobClient:  map 27% reduce 0%
>     12/03/08 15:49:04 INFO mapred.JobClient:  map 40% reduce 0%
>     12/03/08 15:49:07 INFO mapred.JobClient:  map 47% reduce 0%
>     12/03/08 15:49:10 INFO mapred.JobClient:  map 60% reduce 0%
>     12/03/08 15:49:13 INFO mapred.JobClient:  map 70% reduce 0%
>     12/03/08 15:49:16 INFO mapred.JobClient:  map 80% reduce 0%
>     12/03/08 15:49:19 INFO mapred.JobClient:  map 92% reduce 0%
>     12/03/08 15:49:22 INFO mapred.JobClient:  map 100% reduce 0%
>     12/03/08 15:49:46 INFO mapred.JobClient:  map 100% reduce 33%
>     12/03/08 15:49:52 INFO mapred.JobClient:  map 100% reduce 100%
>     12/03/08 15:49:57 INFO mapred.JobClient: Job complete:
>     job_201203071745_0040
>     12/03/08 15:49:57 INFO mapred.JobClient: Counters: 26
>     12/03/08 15:49:57 INFO mapred.JobClient:   Job Counters
>     12/03/08 15:49:57 INFO mapred.JobClient:     Launched reduce tasks=1
>     12/03/08 15:49:57 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=55564
>     12/03/08 15:49:57 INFO mapred.JobClient:     Total time spent by all
>     reduces waiting after reserving slots (ms)=0
>     12/03/08 15:49:57 INFO mapred.JobClient:     Total time spent by all
>     maps waiting after reserving slots (ms)=0
>     12/03/08 15:49:57 INFO mapred.JobClient:     Rack-local map tasks=1
>     12/03/08 15:49:57 INFO mapred.JobClient:     Launched map tasks=1
>     12/03/08 15:49:57 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13565
>     12/03/08 15:49:57 INFO mapred.JobClient:   File Output Format Counters
>     12/03/08 15:49:57 INFO mapred.JobClient:     Bytes Written=45587186
>     12/03/08 15:49:57 INFO mapred.JobClient:   FileSystemCounters
>     12/03/08 15:49:57 INFO mapred.JobClient:     FILE_BYTES_READ=99732287
>     12/03/08 15:49:57 INFO mapred.JobClient:     HDFS_BYTES_READ=17156393
>     12/03/08 15:49:57 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=138104586
>     12/03/08 15:49:57 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=45587207
>     12/03/08 15:49:57 INFO mapred.JobClient:   File Input Format Counters
>     12/03/08 15:49:57 INFO mapred.JobClient:     Bytes Read=17156283
>     12/03/08 15:49:57 INFO mapred.JobClient:     org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>     12/03/08 15:49:57 INFO mapred.JobClient:     ROWS=4838
>     12/03/08 15:49:57 INFO mapred.JobClient:   Map-Reduce Framework
>     12/03/08 15:49:57 INFO mapred.JobClient:     Reduce input groups=294936
>     12/03/08 15:49:57 INFO mapred.JobClient:     Map output materialized
>     bytes=38326948
>     12/03/08 15:49:57 INFO mapred.JobClient:     Combine output
>     records=2242965
>     12/03/08 15:49:57 INFO mapred.JobClient:     Map input records=4838
>     12/03/08 15:49:57 INFO mapred.JobClient:     Reduce shuffle
>     bytes=38326948
>     12/03/08 15:49:57 INFO mapred.JobClient:     Reduce output
>     records=294933
>     12/03/08 15:49:57 INFO mapred.JobClient:     Spilled Records=3432447
>     12/03/08 15:49:57 INFO mapred.JobClient:     Map output bytes=83168813
>     12/03/08 15:49:57 INFO mapred.JobClient:     Combine input
>     records=5912090
>     12/03/08 15:49:57 INFO mapred.JobClient:     Map output records=3964061
>     12/03/08 15:49:57 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
>     12/03/08 15:49:57 INFO mapred.JobClient:     Reduce input records=294936
>     12/03/08 15:49:58 INFO input.FileInputFormat: Total input paths to
>     process : 1
>     12/03/08 15:49:58 INFO mapred.JobClient: Running job:
>     job_201203071745_0041
>     12/03/08 15:49:59 INFO mapred.JobClient:  map 0% reduce 0%
>     12/03/08 15:50:19 INFO mapred.JobClient:  map 8% reduce 0%
>     12/03/08 15:50:22 INFO mapred.JobClient:  map 12% reduce 0%
>     12/03/08 15:50:25 INFO mapred.JobClient:  map 15% reduce 0%
>     12/03/08 15:50:28 INFO mapred.JobClient:  map 21% reduce 0%
>     12/03/08 15:50:31 INFO mapred.JobClient:  map 23% reduce 0%
>     12/03/08 15:50:34 INFO mapred.JobClient:  map 28% reduce 0%
>     12/03/08 15:50:37 INFO mapred.JobClient:  map 32% reduce 0%
>     12/03/08 15:50:40 INFO mapred.JobClient:  map 33% reduce 0%
>     12/03/08 15:50:43 INFO mapred.JobClient:  map 35% reduce 0%
>     12/03/08 15:50:46 INFO mapred.JobClient:  map 40% reduce 0%
>     12/03/08 15:50:49 INFO mapred.JobClient:  map 42% reduce 0%
>     12/03/08 15:50:52 INFO mapred.JobClient:  map 47% reduce 0%
>     12/03/08 15:50:55 INFO mapred.JobClient:  map 48% reduce 0%
>     12/03/08 15:50:58 INFO mapred.JobClient:  map 55% reduce 0%
>     12/03/08 15:51:01 INFO mapred.JobClient:  map 57% reduce 0%
>     12/03/08 15:51:04 INFO mapred.JobClient:  map 62% reduce 0%
>     12/03/08 15:51:07 INFO mapred.JobClient:  map 67% reduce 0%
>     12/03/08 15:51:10 INFO mapred.JobClient:  map 69% reduce 0%
>     12/03/08 15:51:13 INFO mapred.JobClient:  map 75% reduce 0%
>     12/03/08 15:51:20 INFO mapred.JobClient:  map 80% reduce 0%
>     12/03/08 15:51:23 INFO mapred.JobClient:  map 81% reduce 0%
>     12/03/08 15:51:26 INFO mapred.JobClient:  map 86% reduce 0%
>     12/03/08 15:51:29 INFO mapred.JobClient:  map 88% reduce 0%
>     12/03/08 15:51:31 INFO mapred.JobClient:  map 92% reduce 0%
>     12/03/08 15:51:34 INFO mapred.JobClient:  map 94% reduce 0%
>     12/03/08 15:51:37 INFO mapred.JobClient:  map 98% reduce 0%
>     12/03/08 15:51:40 INFO mapred.JobClient:  map 100% reduce 0%
>     12/03/08 15:52:19 INFO mapred.JobClient:  map 100% reduce 70%
>     12/03/08 15:52:26 INFO mapred.JobClient:  map 100% reduce 100%
>     12/03/08 15:52:31 INFO mapred.JobClient: Job complete:
>     job_201203071745_0041
>     12/03/08 15:52:31 INFO mapred.JobClient: Counters: 27
>     12/03/08 15:52:31 INFO mapred.JobClient:   Job Counters
>     12/03/08 15:52:31 INFO mapred.JobClient:     Launched reduce tasks=1
>     12/03/08 15:52:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=124769
>     12/03/08 15:52:31 INFO mapred.JobClient:     Total time spent by all
>     reduces waiting after reserving slots (ms)=0
>     12/03/08 15:52:31 INFO mapred.JobClient:     Total time spent by all
>     maps waiting after reserving slots (ms)=0
>     12/03/08 15:52:31 INFO mapred.JobClient:     Rack-local map tasks=1
>     12/03/08 15:52:31 INFO mapred.JobClient:     Launched map tasks=1
>     12/03/08 15:52:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16543
>     12/03/08 15:52:31 INFO mapred.JobClient:   File Output Format Counters
>     12/03/08 15:52:31 INFO mapred.JobClient:     Bytes Written=73395270
>     12/03/08 15:52:31 INFO mapred.JobClient:   FileSystemCounters
>     12/03/08 15:52:31 INFO mapred.JobClient:     FILE_BYTES_READ=509127834
>     12/03/08 15:52:31 INFO mapred.JobClient:     HDFS_BYTES_READ=45587326
>     12/03/08 15:52:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=577589760
>     12/03/08 15:52:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=73395270
>     12/03/08 15:52:31 INFO mapred.JobClient:   File Input Format Counters
>     12/03/08 15:52:31 INFO mapred.JobClient:     Bytes Read=45587186
>     12/03/08 15:52:31 INFO mapred.JobClient:     org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>     12/03/08 15:52:31 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>     12/03/08 15:52:31 INFO mapred.JobClient:     COOCCURRENCES=65114863
>     12/03/08 15:52:31 INFO mapred.JobClient:   Map-Reduce Framework
>     12/03/08 15:52:31 INFO mapred.JobClient:     Reduce input groups=4837
>     12/03/08 15:52:31 INFO mapred.JobClient:     Map output materialized
>     bytes=68416023
>     12/03/08 15:52:31 INFO mapred.JobClient:     Combine output
>     records=79108
>     12/03/08 15:52:31 INFO mapred.JobClient:     Map input records=294933
>     12/03/08 15:52:31 INFO mapred.JobClient:     Reduce shuffle
>     bytes=68416023
>     12/03/08 15:52:31 INFO mapred.JobClient:     Reduce output records=4837
>     12/03/08 15:52:31 INFO mapred.JobClient:     Spilled Records=117235
>     12/03/08 15:52:31 INFO mapred.JobClient:     Map output bytes=694645784
>     12/03/08 15:52:31 INFO mapred.JobClient:     Combine input
>     records=4038329
>     12/03/08 15:52:31 INFO mapred.JobClient:     Map output records=3964058
>     12/03/08 15:52:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
>     12/03/08 15:52:31 INFO mapred.JobClient:     Reduce input records=4837
>     12/03/08 15:52:32 INFO input.FileInputFormat: Total input paths to
>     process : 1
>     12/03/08 15:52:32 INFO mapred.JobClient: Running job:
>     job_201203071745_0042
>     12/03/08 15:52:33 INFO mapred.JobClient:  map 0% reduce 0%
>     12/03/08 15:52:52 INFO mapred.JobClient:  map 3% reduce 0%
>     12/03/08 15:52:55 INFO mapred.JobClient:  map 5% reduce 0%
>     12/03/08 15:52:58 INFO mapred.JobClient:  map 7% reduce 0%
>     12/03/08 15:53:01 INFO mapred.JobClient:  map 9% reduce 0%
>     12/03/08 15:53:04 INFO mapred.JobClient:  map 10% reduce 0%
>     12/03/08 15:53:07 INFO mapred.JobClient:  map 12% reduce 0%
>     12/03/08 15:53:10 INFO mapred.JobClient:  map 14% reduce 0%
>     12/03/08 15:53:13 INFO mapred.JobClient:  map 17% reduce 0%
>     12/03/08 15:53:16 INFO mapred.JobClient:  map 18% reduce 0%
>     12/03/08 15:53:19 INFO mapred.JobClient:  map 21% reduce 0%
>     12/03/08 15:53:22 INFO mapred.JobClient:  map 23% reduce 0%
>     12/03/08 15:53:25 INFO mapred.JobClient:  map 25% reduce 0%
>     12/03/08 15:53:28 INFO mapred.JobClient:  map 27% reduce 0%
>     12/03/08 15:53:31 INFO mapred.JobClient:  map 29% reduce 0%
>     12/03/08 15:53:34 INFO mapred.JobClient:  map 31% reduce 0%
>     12/03/08 15:53:37 INFO mapred.JobClient:  map 33% reduce 0%
>     12/03/08 15:53:40 INFO mapred.JobClient:  map 35% reduce 0%
>     12/03/08 15:53:43 INFO mapred.JobClient:  map 37% reduce 0%
>     12/03/08 15:53:46 INFO mapred.JobClient:  map 39% reduce 0%
>     12/03/08 15:53:49 INFO mapred.JobClient:  map 41% reduce 0%
>     12/03/08 15:53:52 INFO mapred.JobClient:  map 43% reduce 0%
>     12/03/08 15:53:55 INFO mapred.JobClient:  map 46% reduce 0%
>     12/03/08 15:53:58 INFO mapred.JobClient:  map 48% reduce 0%
>     12/03/08 15:54:01 INFO mapred.JobClient:  map 50% reduce 0%
>     12/03/08 15:54:04 INFO mapred.JobClient:  map 53% reduce 0%
>     12/03/08 15:54:07 INFO mapred.JobClient:  map 55% reduce 0%
>     12/03/08 15:54:10 INFO mapred.JobClient:  map 57% reduce 0%
>     12/03/08 15:54:13 INFO mapred.JobClient:  map 60% reduce 0%
>     12/03/08 15:54:16 INFO mapred.JobClient:  map 63% reduce 0%
>     12/03/08 15:54:19 INFO mapred.JobClient:  map 65% reduce 0%
>     12/03/08 15:54:22 INFO mapred.JobClient:  map 68% reduce 0%
>     12/03/08 15:54:25 INFO mapred.JobClient:  map 71% reduce 0%
>     12/03/08 15:54:28 INFO mapred.JobClient:  map 74% reduce 0%
>     12/03/08 15:54:31 INFO mapred.JobClient:  map 77% reduce 0%
>     12/03/08 15:54:34 INFO mapred.JobClient:  map 81% reduce 0%
>     12/03/08 15:54:37 INFO mapred.JobClient:  map 84% reduce 0%
>     12/03/08 15:54:40 INFO mapred.JobClient:  map 88% reduce 0%
>     12/03/08 15:54:43 INFO mapred.JobClient:  map 93% reduce 0%
>     12/03/08 15:54:46 INFO mapred.JobClient:  map 99% reduce 0%
>     12/03/08 15:54:49 INFO mapred.JobClient:  map 100% reduce 0%
>     12/03/08 15:55:01 INFO mapred.JobClient:  map 100% reduce 100%
>     12/03/08 15:55:06 INFO mapred.JobClient: Job complete:
>     job_201203071745_0042
>     12/03/08 15:55:06 INFO mapred.JobClient: Counters: 25
>     12/03/08 15:55:06 INFO mapred.JobClient:   Job Counters
>     12/03/08 15:55:06 INFO mapred.JobClient:     Launched reduce tasks=1
>     12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=133985
>     12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
>     reduces waiting after reserving slots (ms)=0
>     12/03/08 15:55:06 INFO mapred.JobClient:     Total time spent by all
>     maps waiting after reserving slots (ms)=0
>     12/03/08 15:55:06 INFO mapred.JobClient:     Launched map tasks=1
>     12/03/08 15:55:06 INFO mapred.JobClient:     Data-local map tasks=1
>     12/03/08 15:55:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10311
>     12/03/08 15:55:06 INFO mapred.JobClient:   File Output Format Counters
>     12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Written=580158
>     12/03/08 15:55:06 INFO mapred.JobClient:   FileSystemCounters
>     12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_READ=14921344
>     12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_READ=73395400
>     12/03/08 15:55:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=15396906
>     12/03/08 15:55:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=580158
>     12/03/08 15:55:06 INFO mapred.JobClient:   File Input Format Counters
>     12/03/08 15:55:06 INFO mapred.JobClient:     Bytes Read=73395270
>     12/03/08 15:55:06 INFO mapred.JobClient:   Map-Reduce Framework
>     12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input groups=4837
>     12/03/08 15:55:06 INFO mapred.JobClient:     Map output materialized
>     bytes=431573
>     12/03/08 15:55:06 INFO mapred.JobClient:     Combine output
>     records=96955
>     12/03/08 15:55:06 INFO mapred.JobClient:     Map input records=4837
>     12/03/08 15:55:06 INFO mapred.JobClient:     Reduce shuffle bytes=0
>     12/03/08 15:55:06 INFO mapred.JobClient:     Reduce output records=4837
>     12/03/08 15:55:06 INFO mapred.JobClient:     Spilled Records=166369
>     12/03/08 15:55:06 INFO mapred.JobClient:     Map output bytes=153928302
>     12/03/08 15:55:06 INFO mapred.JobClient:     Combine input
>     records=7418380
>     12/03/08 15:55:06 INFO mapred.JobClient:     Map output records=7326262
>     12/03/08 15:55:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
>     12/03/08 15:55:06 INFO mapred.JobClient:     Reduce input records=4837
>     12/03/08 15:55:06 INFO driver.MahoutDriver: Program took 391379 ms
>     (Minutes: 6.522983333333333)
>
> Performing seqdumper on the output looks reasonable.
>
> Maybe named vectors are the problem?
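
For reference, the result that the rowsimilarity run above is after (cosine similarity, at most N similar rows kept per row) can be sketched in a few lines of Python. This is a brute-force illustration on made-up data, not the cooccurrence-based MapReduce algorithm that Mahout runs. The function names, the exclude_self parameter, and the choice to drop zero similarities are my own assumptions.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {column: weight}."""
    dot = sum(w * v[i] for i, w in u.items() if i in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(rows, k, exclude_self=True):
    """For every row, keep the k rows with the highest non-zero cosine similarity."""
    result = {}
    for a, u in rows.items():
        scored = []
        for b, v in rows.items():
            if exclude_self and a == b:
                continue
            s = cosine(u, v)
            if s > 0.0:
                scored.append((s, b))
        # nlargest returns the pairs sorted by descending similarity.
        result[a] = [(b, s) for s, b in heapq.nlargest(k, scored)]
    return result


# Made-up rows: row id -> sparse tf-idf style vector.
rows = {
    0: {0: 1.0, 1: 1.0},
    1: {0: 1.0},
    2: {1: 1.0, 2: 1.0},
    3: {5: 2.0},
}
top = top_k_similar(rows, k=1)
```

Here row 3 shares no columns with any other row, so it ends up with an empty list. Comparing every pair of rows like this only works for small inputs.
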
>
>
> On 3/7/12 8:50 AM, Sebastian Schelter wrote:
>> Hi Pat,
>>
>> Something is going completely wrong. The first pass over the data of
>> RowSimilarityJob transposes the input matrix. From the output of the
>> first jobs, it seems as if your input is a 4838 x 3 matrix only:
>>
>> Map input records=4838
>> Map output records=3
>> Combine input records=3
>> Combine output records=3
>> Reduce input records=3
>>
>> Could you have a detailed look at the input to RowSimilarityJob?
>>
>> --sebastian
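
A small Python sketch of the transposition Sebastian describes may help when reading the counters. It assumes rows are stored as sparse {column: weight} maps; the function name and data are invented for illustration, and it is not Mahout code. The point is that a transpose yields one record per column that has at least one entry, so the number of records coming out of that pass reflects how many distinct columns the input actually uses.

```python
from collections import defaultdict


def transpose(rows):
    """Turn {row: {col: weight}} into {col: {row: weight}}."""
    columns = defaultdict(dict)
    for row_id, vector in rows.items():
        for col_id, weight in vector.items():
            columns[col_id][row_id] = weight
    return dict(columns)


# Four made-up rows that only ever touch columns 0, 1 and 2.
rows = {
    0: {0: 1.0, 2: 3.0},
    1: {2: 4.0},
    2: {0: 5.0, 1: 6.0},
    3: {},
}
columns = transpose(rows)
num_columns_seen = len(columns)
```

With this input the transpose has three records, one per used column, no matter how many rows went in.
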
>>
>>
>> On 07.03.2012 17:38, Pat Ferrel wrote:
>>>       12/03/06 17:02:42 INFO mapred.JobClient:     Map input records=0
