mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: How to find the k most similar docs
Date Wed, 07 Mar 2012 16:38:54 GMT
I have been experimenting with different analyzers and n-grams to clean 
up the vectors. Here is a run on a high-dimensionality set of vectors 
with a loose analyzer (I think it was the default). The output of the 
rowid job was:

    pat@occam2:~/mahout-distribution-0.6$ bin/mahout rowid -i
    wikipedia-tfidf-custom-analyzer/tfidf-vectors/ -o wikipedia-matrix
    --tempDir temp
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    HADOOP_CONF_DIR=/usr/local/hadoop/conf
    MAHOUT-JOB:
    /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
    12/03/06 16:53:29 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647,
    --input=wikipedia-tfidf-custom-analyzer/tfidf-vectors/,
    --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
    12/03/06 16:53:30 INFO util.NativeCodeLoader: Loaded the
    native-hadoop library
    12/03/06 16:53:30 INFO zlib.ZlibFactory: Successfully loaded &
    initialized native-zlib library
    12/03/06 16:53:30 INFO compress.CodecPool: Got brand-new compressor
    12/03/06 16:53:30 INFO compress.CodecPool: Got brand-new compressor
    12/03/06 16:53:30 INFO vectors.RowIdJob: Wrote out matrix with 4838
    rows and 286907 columns to wikipedia-matrix/matrix
    12/03/06 16:53:30 INFO driver.MahoutDriver: Program took 1248 ms
    (Minutes: 0.0208)

Then I removed temp (shouldn't the jobs do that?) and ran the 
rowsimilarity job:

    pat@occam2:~/mahout-distribution-0.6$ bin/mahout rowsimilarity -i
    wikipedia-matrix/matrix -o wikipedia-similarity -r 286907
    --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    HADOOP_CONF_DIR=/usr/local/hadoop/conf
    MAHOUT-JOB:
    /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
    12/03/06 17:00:55 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647, --excludeSelfSimilarity=true,
    --input=wikipedia-matrix/matrix, --maxSimilaritiesPerRow=10,
    --numberOfColumns=286907, --output=wikipedia-similarity,
    --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
    12/03/06 17:00:56 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/06 17:00:56 INFO mapred.JobClient: Running job:
    job_201203061645_0006
    12/03/06 17:00:57 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/06 17:01:13 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/06 17:01:25 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/06 17:01:30 INFO mapred.JobClient: Job complete:
    job_201203061645_0006
    12/03/06 17:01:30 INFO mapred.JobClient: Counters: 26
    12/03/06 17:01:30 INFO mapred.JobClient:   Job Counters
    12/03/06 17:01:30 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/06 17:01:30 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=13502
    12/03/06 17:01:30 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/06 17:01:30 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/06 17:01:30 INFO mapred.JobClient:     Rack-local map tasks=1
    12/03/06 17:01:30 INFO mapred.JobClient:     Launched map tasks=1
    12/03/06 17:01:30 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10496
    12/03/06 17:01:30 INFO mapred.JobClient:   File Output Format Counters
    12/03/06 17:01:30 INFO mapred.JobClient:     Bytes Written=97
    12/03/06 17:01:30 INFO mapred.JobClient:   FileSystemCounters
    12/03/06 17:01:30 INFO mapred.JobClient:     FILE_BYTES_READ=40
    12/03/06 17:01:30 INFO mapred.JobClient:     HDFS_BYTES_READ=122407
    12/03/06 17:01:30 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45437
    12/03/06 17:01:30 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=118
    12/03/06 17:01:30 INFO mapred.JobClient:   File Input Format Counters
    12/03/06 17:01:30 INFO mapred.JobClient:     Bytes Read=122290
    12/03/06 17:01:30 INFO mapred.JobClient:
    org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
    12/03/06 17:01:30 INFO mapred.JobClient: ROWS=4838
    12/03/06 17:01:30 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/06 17:01:30 INFO mapred.JobClient:     Reduce input groups=3
    12/03/06 17:01:30 INFO mapred.JobClient:     Map output materialized
    bytes=32
    12/03/06 17:01:30 INFO mapred.JobClient:     Combine output records=3
    12/03/06 17:01:30 INFO mapred.JobClient:     Map input records=4838
    12/03/06 17:01:30 INFO mapred.JobClient:     Reduce shuffle bytes=32
    12/03/06 17:01:30 INFO mapred.JobClient:     Reduce output records=0
    12/03/06 17:01:30 INFO mapred.JobClient:     Spilled Records=6
    12/03/06 17:01:30 INFO mapred.JobClient:     Map output bytes=33
    12/03/06 17:01:30 INFO mapred.JobClient:     Combine input records=3
    12/03/06 17:01:30 INFO mapred.JobClient:     Map output records=3
    12/03/06 17:01:30 INFO mapred.JobClient:     SPLIT_RAW_BYTES=117
    12/03/06 17:01:30 INFO mapred.JobClient:     Reduce input records=3
    12/03/06 17:01:30 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/06 17:01:31 INFO mapred.JobClient: Running job:
    job_201203061645_0007
    12/03/06 17:01:32 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/06 17:01:49 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/06 17:02:01 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/06 17:02:06 INFO mapred.JobClient: Job complete:
    job_201203061645_0007
    12/03/06 17:02:06 INFO mapred.JobClient: Counters: 25
    12/03/06 17:02:06 INFO mapred.JobClient:   Job Counters
    12/03/06 17:02:06 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/06 17:02:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12989
    12/03/06 17:02:06 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Launched map tasks=1
    12/03/06 17:02:06 INFO mapred.JobClient:     Data-local map tasks=1
    12/03/06 17:02:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10341
    12/03/06 17:02:06 INFO mapred.JobClient:   File Output Format Counters
    12/03/06 17:02:06 INFO mapred.JobClient:     Bytes Written=97
    12/03/06 17:02:06 INFO mapred.JobClient:   FileSystemCounters
    12/03/06 17:02:06 INFO mapred.JobClient:     FILE_BYTES_READ=22
    12/03/06 17:02:06 INFO mapred.JobClient:     HDFS_BYTES_READ=237
    12/03/06 17:02:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=45937
    12/03/06 17:02:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=97
    12/03/06 17:02:06 INFO mapred.JobClient:   File Input Format Counters
    12/03/06 17:02:06 INFO mapred.JobClient:     Bytes Read=97
    12/03/06 17:02:06 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/06 17:02:06 INFO mapred.JobClient:     Reduce input groups=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Map output materialized
    bytes=14
    12/03/06 17:02:06 INFO mapred.JobClient:     Combine output records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Map input records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Reduce shuffle bytes=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Reduce output records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Spilled Records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Map output bytes=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Combine input records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     Map output records=0
    12/03/06 17:02:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
    12/03/06 17:02:06 INFO mapred.JobClient:     Reduce input records=0
    12/03/06 17:02:07 INFO input.FileInputFormat: Total input paths to
    process : 1
    12/03/06 17:02:07 INFO mapred.JobClient: Running job:
    job_201203061645_0008
    12/03/06 17:02:08 INFO mapred.JobClient:  map 0% reduce 0%
    12/03/06 17:02:25 INFO mapred.JobClient:  map 100% reduce 0%
    12/03/06 17:02:37 INFO mapred.JobClient:  map 100% reduce 100%
    12/03/06 17:02:42 INFO mapred.JobClient: Job complete:
    job_201203061645_0008
    12/03/06 17:02:42 INFO mapred.JobClient: Counters: 25
    12/03/06 17:02:42 INFO mapred.JobClient:   Job Counters
    12/03/06 17:02:42 INFO mapred.JobClient:     Launched reduce tasks=1
    12/03/06 17:02:42 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12971
    12/03/06 17:02:42 INFO mapred.JobClient:     Total time spent by all
    reduces waiting after reserving slots (ms)=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Total time spent by all
    maps waiting after reserving slots (ms)=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Launched map tasks=1
    12/03/06 17:02:42 INFO mapred.JobClient:     Data-local map tasks=1
    12/03/06 17:02:42 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10322
    12/03/06 17:02:42 INFO mapred.JobClient:   File Output Format Counters
    12/03/06 17:02:42 INFO mapred.JobClient:     Bytes Written=97
    12/03/06 17:02:42 INFO mapred.JobClient:   FileSystemCounters
    12/03/06 17:02:42 INFO mapred.JobClient:     FILE_BYTES_READ=22
    12/03/06 17:02:42 INFO mapred.JobClient:     HDFS_BYTES_READ=227
    12/03/06 17:02:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44039
    12/03/06 17:02:42 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=97
    12/03/06 17:02:42 INFO mapred.JobClient:   File Input Format Counters
    12/03/06 17:02:42 INFO mapred.JobClient:     Bytes Read=97
    12/03/06 17:02:42 INFO mapred.JobClient:   Map-Reduce Framework
    12/03/06 17:02:42 INFO mapred.JobClient:     Reduce input groups=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Map output materialized
    bytes=14
    12/03/06 17:02:42 INFO mapred.JobClient:     Combine output records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Map input records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Reduce shuffle bytes=14
    12/03/06 17:02:42 INFO mapred.JobClient:     Reduce output records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Spilled Records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Map output bytes=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Combine input records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     Map output records=0
    12/03/06 17:02:42 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
    12/03/06 17:02:42 INFO mapred.JobClient:     Reduce input records=0
    12/03/06 17:02:42 INFO driver.MahoutDriver: Program took 107225 ms
    (Minutes: 1.7870833333333334)

It seems to have executed correctly. I ran it on a small cluster, but 
even so it was awfully fast. The ROWS counter is there, but not the other two.

How is the output stored, and what does it represent? I would expect a 
sequence of row ids as keys, each with ten row ids as values. I used named 
vectors, if that matters.
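To make the question concrete, here is a toy sketch in plain Python (not Mahout code, and the document names and weights below are made up) of what I would expect each output record to represent: key = row id, value = a sparse vector whose non-zero entries map similar-row-id to cosine similarity, keeping at most -m entries per row and excluding self-similarity:

```python
import math

def cosine(u, v):
    # u, v are sparse vectors: dicts of term-index -> tf-idf weight
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def row_similarity(rows, max_per_row=10):
    # rows: dict of row-id -> sparse vector. Returns, per row, the
    # analogue of an <IntWritable, VectorWritable> output record.
    out = {}
    for i, u in rows.items():
        sims = [(j, cosine(u, v)) for j, v in rows.items() if j != i]
        sims.sort(key=lambda p: -p[1])          # most similar first
        out[i] = dict(sims[:max_per_row])       # keep top -m, no self
    return out

docs = {
    0: {0: 1.0, 1: 2.0},   # hypothetical doc 0
    1: {0: 1.0, 1: 2.1},   # nearly identical to doc 0
    2: {2: 3.0},           # shares no terms with the others
}
sims = row_similarity(docs, max_per_row=10)
```

With no threshold set, even a similarity of 0.0 should survive into the top ten, which is why I'd expect roughly ten entries per row rather than an empty file.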

The output is of the correct type but empty. Here is the seqdumper output; 
notice Count: 0, and the file is only 97 bytes.

    pat@occam2:~/mahout-distribution-0.6$ bin/mahout seqdumper -s
    wikipedia-similarity/part-r-00000
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    HADOOP_CONF_DIR=/usr/local/hadoop/conf
    MAHOUT-JOB:
    /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
    12/03/07 08:31:59 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647, --seqFile=wikipedia-similarity/part-r-00000,
    --startPhase=0, --tempDir=temp}
    Input Path: wikipedia-similarity/part-r-00000
    Key class: class org.apache.hadoop.io.IntWritable Value Class: class
    org.apache.mahout.math.VectorWritable
    Count: 0
    12/03/07 08:31:59 INFO driver.MahoutDriver: Program took 603 ms
    (Minutes: 0.01005)


On 3/6/12 11:09 PM, Sebastian Schelter wrote:
> Hi Pat,
>
> You are right, these results look strange. RowSimilarityJob has 3 custom
> counters (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES); can you give us
> the numbers for these?
>
> --sebastian
>
> On 07.03.2012 02:14, Pat Ferrel wrote:
>> Ok, making progress. I created a matrix using rowid and got the
>> following output:
>>
>>     Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
>>     wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
>>     ...
>>     12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
>>     {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
>>     --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
>>     2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
>>     from SCDynamicStore
>>     12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
>>     native-hadoop library for your platform... using builtin-java
>>     classes where applicable
>>     12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
>>     12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
>>     12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
>>     rows and 87325 columns to wikipedia-matrix/matrix
>>     12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
>>     (Minutes: 0.0293)
>>
>> So a doc matrix with 4838 docs and 87325 dimensions. Next I ran
>> RowSimilarityJob
>>
>>     Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
>>     -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
>>     --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
>>
>> This gives me output in wikipedia-similarity/part-m-00000 but the size
>> is 97 bytes? Shouldn't it have created 4838 * 10 results? Ten per row? I
>> set no threshold so I'd expect it to pick the 10 nearest even if they
>> are far away.
>>
>> BTW what is the output format?
>>
>> On 3/5/12 11:48 AM, Suneel Marthi wrote:
>>> Pat,
>>>
>>> Your input to RowSimilarity seems to be the tfidf-vectors directory,
>>> which is <Text, VectorWritable>.
>>>
>>> Before executing the RowSimilarity job you need to run the RowIdJob,
>>> which creates a matrix of <IntWritable, VectorWritable>. This matrix
>>> should be the input to RowSimilarity.
>>>
>>> Also, from your command you seem to be missing the --tempDir argument;
>>> you would need that too.
>>>
>>> Suneel
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Sebastian Schelter<ssc@apache.org>
>>> *To:* user@mahout.apache.org
>>> *Sent:* Monday, March 5, 2012 2:32 PM
>>> *Subject:* Re: How to find the k most similar docs
>>>
>>> That's the problem:
>>>
>>> org.apache.hadoop.io.Text cannot be cast to
>>> org.apache.hadoop.io.IntWritable
>>>
>>> RowSimilarityJob expects <IntWritable, VectorWritable> as input; it seems
>>> you supply <Text, VectorWritable>.
>>>
>>> --sebastian
>>>
>>> On 05.03.2012 20:29, Pat Ferrel wrote:
>>>> org.apache.hadoop.io.Text cannot be
>>>>     cast to org.apache.hadoop.io.IntWritable
>>>
>>>
>
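For anyone following along, the re-keying described in the quoted exchange can be modeled with a toy sketch in plain Python (not the actual RowIdJob code; the document names and weights are made up): tfidf-vectors is keyed by document name (Text), while RowSimilarityJob wants sequential integer row ids (IntWritable), so rowid assigns each document an index and also writes a docIndex so results can be mapped back to document names:

```python
def to_matrix(tfidf_vectors):
    # tfidf_vectors: list of (doc_name, sparse_vector) pairs, as in
    # a <Text, VectorWritable> sequence file
    doc_index = {}   # row id -> original Text key (the docIndex output)
    matrix = {}      # row id -> vector (the matrix output,
                     # analogous to <IntWritable, VectorWritable>)
    for row_id, (name, vec) in enumerate(tfidf_vectors):
        doc_index[row_id] = name
        matrix[row_id] = vec
    return matrix, doc_index

vectors = [("/wiki/Apple", {0: 1.5}), ("/wiki/Banana", {1: 0.7})]
matrix, doc_index = to_matrix(vectors)
```

The matrix output is what feeds RowSimilarityJob; the docIndex is what you join against afterwards to recover which documents the integer row ids refer to.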
