mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: How to find the k most similar docs
Date Mon, 05 Mar 2012 19:29:35 GMT
I'm using Mahout 0.6, compiled from source via 'mvn install'. I used
Suneel's code below to get NumberOfColumns.

When I try to run the rowsimilarity job via:

    bin/mahout rowsimilarity -i wikipedia-clusters/tfidf-vectors/ -o /wikipedia-similarity -r 87325 -s SIMILARITY_COSINE -m 10 -ess true

I get the following error:

    12/03/04 19:14:32 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --excludeSelfSimilarity=true, --input=wikipedia-clusters/tfidf-vectors/, --maxSimilaritiesPerRow=10, --numberOfColumns=87325, --output=/wikipedia-similarity, --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
    2012-03-04 19:14:32.376 java[1090:1903] Unable to load realm info from SCDynamicStore
    12/03/04 19:14:33 INFO input.FileInputFormat: Total input paths to process : 1
    12/03/04 19:14:33 INFO mapred.JobClient: Running job: job_local_0001
    12/03/04 19:14:33 INFO mapred.MapTask: io.sort.mb = 100
    12/03/04 19:14:33 INFO mapred.MapTask: data buffer = 79691776/99614720
    12/03/04 19:14:33 INFO mapred.MapTask: record buffer = 262144/327680
    12/03/04 19:14:34 WARN mapred.LocalJobRunner: job_local_0001
    java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
         at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:154)
         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

The cast error (as I understand it) usually happens when you pass in a
class name incorrectly. This seems likely, since the cooccurrence
similarity package is the one being used?

I've probably missed something obvious about how to pass in the
similarity measure to use?
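
For reference while reading the option descriptions quoted below, here is a minimal in-memory sketch in plain Java of what I understand -s SIMILARITY_COSINE, -m, -ess and -tr are meant to compute for a single row. It is not Mahout's implementation (RowSimilarityJob is a MapReduce job over sparse vectors), and it does not address the ClassCastException above. The class TopKCosineSketch and its method names are made up for illustration, and the strict "greater than" threshold is an assumption taken from Suneel's wording.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** In-memory illustration only; hypothetical class, not part of Mahout. */
public class TopKCosineSketch {

    /** One result: the index of a similar row and its similarity to the query row. */
    public static final class Neighbor {
        public final int row;
        public final double similarity;

        Neighbor(int row, double similarity) {
            this.row = row;
            this.similarity = similarity;
        }
    }

    /** Cosine similarity of two dense vectors of equal length; 0 if either is all zeros. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Returns the k best (not the first k) neighbors of rows[query].
     * excludeSelf mirrors -ess, k mirrors -m, threshold mirrors -tr (assumed strict).
     */
    public static List<Neighbor> topK(double[][] rows, int query, int k,
                                      boolean excludeSelf, double threshold) {
        List<Neighbor> result = new ArrayList<>();
        for (int j = 0; j < rows.length; j++) {
            if (excludeSelf && j == query) {
                continue; // skip the trivial self-match (similarity ~1.0 for cosine)
            }
            double sim = cosine(rows[query], rows[j]);
            if (sim > threshold) {
                result.add(new Neighbor(j, sim));
            }
        }
        // Sort by similarity, highest first, then keep at most k entries.
        result.sort(Comparator.comparingDouble((Neighbor n) -> n.similarity).reversed());
        return new ArrayList<>(result.subList(0, Math.min(k, result.size())));
    }

    public static void main(String[] args) {
        double[][] docs = {
            {1, 1, 0},
            {1, 1, 0.5},
            {0, 0, 1},
            {1, 0, 0}
        };
        for (Neighbor n : topK(docs, 0, 2, true, 0.0)) {
            System.out.println("row " + n.row + " similarity " + n.similarity);
        }
    }
}
```

With the four toy rows in main, the two nearest neighbors of row 0 are row 1 and then row 3; row 2 shares no terms with row 0, so its similarity is 0 and it is dropped by the threshold.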


On 2/19/12 9:00 PM, Suneel Marthi wrote:
> Hi Pat,
>
>
> 1. Please look at the discussion thread at
> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/browser
> for a description of what the RowSimilarityJob does. The RowSimilarityJob
> implementation is based on the research paper at
> http://www.csee.ogi.edu/~zak/cs506-pslc/docsim.pdf
>
> I'll add the details on the mahout wiki page sometime this week.
>
> 2. 'maxSimilaritiesPerRow' returns the best similarities (not the first); if not
> specified, it defaults to the top 100.
>
> 3. If you would like to discard the similarities per row below a certain value, you can
> specify a threshold with -tr, which limits the results to only those documents that have
> a similarity value greater than the threshold.
>
>     The similarity values you get in the final output should give you an idea of what
> T1 and T2 should be. In my particular use case I was only interested in documents that
> had a similarity measure of 0.7 or greater, hence 0.7 would be my T2; the most similar
> documents had a similarity value of 0.99999 (which was what I used as my T1).
>
> 4. 'numberOfColumns' is not optional, but I tend to agree with you that it should be
> inferred automatically from the size of the input vector if not specified. This could be
> an enhancement to add to the RowSimilarityJob.
>
>     The code snippet below gets the number of columns in a matrix if not specified by
> the user.
>
>     Path inputMatrixPath = new Path(getInputPath());
>     SequenceFile.Reader sequenceFileReader = new SequenceFile.Reader(fs, inputMatrixPath, conf);
>     int NumberOfColumns = getDimensions(sequenceFileReader);
>     sequenceFileReader.close();
>
>     // Reads the first row of the matrix and returns its vector size (the column count).
>     private int getDimensions(SequenceFile.Reader reader)
>         throws IOException, InstantiationException, IllegalAccessException {
>       Class keyClass = reader.getKeyClass();
>       Writable row = (Writable) keyClass.newInstance();
>       if (!reader.getValueClass().equals(VectorWritable.class)) {
>         throw new IllegalArgumentException("Value type of sequencefile must be a VectorWritable");
>       }
>       VectorWritable vw = new VectorWritable();
>       if (!reader.next(row, vw)) {
>         log.error("matrix must have at least one row");
>         throw new IllegalStateException();
>       }
>       Vector v = vw.get();
>       return v.size();
>     }
>
> 5. RowSimilarityJob also has an excludeSelfSimilarity option (false by default); you
> need to set it so that you don't end up comparing a document with itself and getting a
> similarity measure of 1.0 (if using the cosine measure).
>
> Let me know if you have any more questions.
>      
>
>
>
>
> ________________________________
>   From: Sebastian Schelter<ssc@apache.org>
> To: user@mahout.apache.org
> Sent: Sunday, February 19, 2012 4:33 PM
> Subject: Re: How to find the k most similar docs
>
> Hi Pat,
>
> 'numberOfColumns' is not optional but is only used by a few
> similarityMeasures (such as loglikelihood ratio).
> 'maxSimilaritiesPerRow' retains the top similarities.
>
> --sebastian
>
>
> On 19.02.2012 22:11, Pat Ferrel wrote:
>> This looks perfect, thanks.
>>
>> I had planned to do the RowSimilarityJob after clustering to reduce the
>> rows from the entire corpus to only those in a cluster. You mention
>> using the distance between similar rows to get an idea of the distances
>> for canopy clustering. This seems a very good idea since I have no other
>> good way to generate T1 and T2. The downside is that I have to do
>> RowSimilarityJob on all docs in the corpus. I assume that since you have
>> done this on 10 Million docs that the benefit in getting good canopies
>> outweighs doing similarity on all docs as far as processing resources
>> needed?
>>
>> I am new to reading mapreduce code so may I ask some noob questions:
>>    * is the best documentation here?
>>   
>> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/math/hadoop/similarity/RowSimilarityJob.html#run(java.lang.String[])
>>
>>    * the command line arguments include: numberOfColumns, shouldn't that
>>      be easily extracted from the input matrix? is this optional? How do
>>      I tell which argument is optional from the docs?
>>    * the argument maxSimilaritiesPerRow could return first or best, it is
>>      difficult to see which.
>>
>> I have the source but perhaps due to the string based binding I am
>> finding it hard to track down what code is run so any tips for reading
>> the code or docs are greatly appreciated.
>>
>>
>> On 2/18/12 1:27 PM, Suneel Marthi wrote:
>>> You might want to look at the RowSimilarityJob in Mahout to determine
>>> document similarity.
>>>
>>>
>>> Here's what you would do:-
>>>
>>> Assuming that your documents have already been vectorized, first
>>> convert the vectors into an M*N matrix by calling the RowIdJob in
>>> Mahout where M = No. of rows (or documents in your case) and N= No. of
>>> columns (or the unique terms).
>>>
>>>
>>> Then run the RowSimilarity job on the matrix generated in the previous
>>> step by specifying a cosine similarity measure, this should generate
>>> an output that gives the most similar documents for each of the
>>> documents and the similarity distance between them. RowSimilarityJob
>>> is a mapreduce job so you should be able to run this on a really large
>>> corpus (I had run this on 10 million web pages).
>>> The output of the RowSimilarity along with the similarity distances
>>> that are generated between document pairs should give an idea as to
>>> what the values of T1 and T2 should be when running canopy clustering.
>>> And the number of clusters generated by running canopy would
>>> eventually be fed into k-means as you had mentioned.
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>>     From: Pat Ferrel<pat@occamsmachete.com>
>>> To: user@mahout.apache.org
>>> Sent: Saturday, February 18, 2012 2:39 PM
>>> Subject: How to find the k most similar docs
>>> Given documents that are vectorized into Mahout vectors, have stop
>>> words removed, and a TFIDF dictionary, what is the best distributed
>>> way to get k nearest documents using a measure like cosine similarity
>>> (or the others provided in Mahout)? I will be doing this for every
>>> document in the corpus so the question is partly how best to do this
>>> given the existing mahout+hadoop framework. What is the intuition
>>> about processing resources needed?
>>>
>>> Expansion: At some point I'd like to extend this idea to find similar
>>> clusters but expect that the same method should work only with
>>> centroids instead of doc vectors. Also I expect to do canopy
>>> clustering to feed into kmeans clustering. I'll perform the similarity
>>> measure only on docs in the same cluster. I think I understand how to
>>> do this preprocessing so the question is primarily the k most similar
>>> docs and/or centroids. This sounds like k nearest neighbors, if so is
>>> this the best way to do it in
>>>     mahout+hadoop?
