mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Suneel Marthi <suneel_mar...@yahoo.com>
Subject Re: Question on RowSimilarityJob
Date Thu, 02 Feb 2012 15:51:17 GMT
Thanks for the details, Sebastian. This is indeed very helpful and I completely understand
now what you had on your whiteboard.

I had been normalizing my vectors during seq2sparse (with L2 norm since I would be using Cosine/Tanimoto
Similarity in the RowSimilarityJob). But since the first step in the RowSimilarity job is
normalizing the vectors I don't have to do that in seq2sparse; doing that would not impact
the final outcome anyways.  


This has been a very useful thread and thanks for all your help.



________________________________
 From: Sebastian Schelter <ssc@apache.org>
To: Vicky <javaxmlsoapdev@yahoo.com> 
Cc: "user@mahout.apache.org" <user@mahout.apache.org> 
Sent: Thursday, February 2, 2012 8:08 AM
Subject: Re: Question on RowSimilarityJob
 
The heart of RowSimilarityJob is simple cooccurrence counting (with some
tweaks and optimizations). I'll give a simplified example:

We have a 3x3 matrix (could represent items rated by users, docs with
word counts e.g.)

row1: column1, column2
row2: column1, column3
row3: column1, column2

In the first MapReduce pass, we created an inverted index over the
columns (which means we transpose the matrix)

column1: row1, row2, row3
column2: row1, row3
column3: row2

In the second MapReduce pass, we the mapper emits all coocurring row
pairs per column:

for column1:

(row1,row2)
(row1,row3)
(row2,row1)
(row2,row3)
(row3,row1)
(row3,row2)

for column2:

(row1,row3)
(row3,row1)

for column3 there's nothing to emit.

The reducer receives all cooccurrences for a particular row and simply
needs to sum them up:

for row1:

(row1,row2)
(row1,row3)
(row1,row3)

row1,row2 -> 1 cooccurrence
row1,row3 -> 2 cooccurrences

for row2:

(row2,row1)
(row2,row3)

row2,row1 -> 1 cooccurrence
row2,row3 -> 1 cooccurrence

for row3:

(row3,row1)
(row3,row2)
(row3,row1)

row3,row1 -> 2 cooccurrences
row3,row2 -> 1 cooccurrence


Hope that helps with understanding the approach.

--sebastian




On 02.02.2012 12:47, Vicky wrote:
> Sebastian,
> when you said,
>>> The computation should be executed in a way that only rows that have at least
one non-zero value in the same dimension (column) are compared
> 
> 
> if there are three columns in the document (columnA,columnB,columnC,columnD).  
> 
> Doc1: ColumnA, columnB, columnC,columnD
> Doc2:ColumnA,columnB,columnC,columnD
> 
> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB,
Doc1:columnC against Doc2:columnC and likewise?
> 
> 
> 
> 
> ----- Original Message -----
> From: Sebastian Schelter <ssc@apache.org>
> To: user@mahout.apache.org
> Cc: 
> Sent: Wednesday, February 1, 2012 6:06 AM
> Subject: Re: Question on RowSimilarityJob
> 
> Here's a brief description of RowSimilarityJob:
> 
> The goal is to compute all pairwise similarities between the rows of a
> sparse matrix A.
> 
> The computation should be executed in a way that only rows that have at
> least one non-zero value in the same dimension (column) are compared. We
> need this to avoid a quadratic number of pairwise comparisons.
> Furthermore we should be able to 'embed' arbitrary similarity measures
> and we should always be able to use a combiner in all MapReduce steps.
> 
> The computation is executed using three MapReduce passes:
> 
> In the first step, the rows of A are preprocessed via
> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
> scaled to unit-length), a single number for each row of A is computed
> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
> 
> The second steps operates on the rows of A' (the columns of A). The
> mapper sums up all pairwise cooccurrences using
> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
> for one row and uses the similarity measure and the precomputed numbers
> from step one to compute all similarities via
> VectorSimilarityMeasure.similarity().
> 
> The third step ensures that only the top k similar rows per row are kept.
> 
> It's hard to see from the code but actually the job performs the matrix
> multiplication AA' via outer products with a modified (similarity
> measure specific) dot product.
> 
> /s
> 
> 
> 
> On 31.01.2012 22:40, Suneel Marthi wrote:
>> Sebastian,
>>
>> Question on the RowSimilarity job.
>>
>> a) I created sequence files from my document corpus.
>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document
frequency = 1 and minimum support = 1
>> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix
where M = number of  documents in my collection, N = cardinality of the vector - Correct?
>>
>> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity
measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N.

>>
>>
>>      I am not sure I completely understand as to what's happening in the RowSimilarity
Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf 
and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg)
for a while to understand what's happening, but guess I need some help.
>>
>> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
>>
>> Thanks for your help.
>>
>>
>>
>> ________________________________
>>   From: Sebastian Schelter <ssc@apache.org>
>> To: user@mahout.apache.org 
>> Sent: Friday, January 20, 2012 12:58 PM
>> Subject: Re: Question on RowSimilarityJob
>>  
>> Hi,
>>
>> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
>> (documents in your use case) to keep per document.
>> 'excludeSelfSimilarity' means that rows (documents) should not be
>> compared to themselves.
>>
>> Sry for the lack of documentation, RowSimilarityJob was originally only
>> an internal job for the recommendation code. I'll try to add something
>> on the wiki in the next days.
>>
>> --sebastian
>>
>>
>> On 20.01.2012 17:38, Suneel Marthi wrote:
>>> I am working on determining document similarity of a corpus I am working with
using RowSimilarity.
>>>
>>> Questions:-
>>>
>>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity'
mean?
>>> b) Are there any docs available on RowSimilarityJob available, this is the best
I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>>>
>>> c) Also do we have any docs on RowIdJob ?
>>>
>>> Thanks and Regards,
>>> Suneel
>>>
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message