Thanks for the details, Sebastian. This is indeed very helpful and I completely understand
now what you had on your whiteboard.
I had been normalizing my vectors during seq2sparse (with L2 norm since I would be using Cosine/Tanimoto
Similarity in the RowSimilarityJob). But since the first step in the RowSimilarity job is
normalizing the vectors I don't have to do that in seq2sparse; doing that would not impact
the final outcome anyways.
This has been a very useful thread and thanks for all your help.
________________________________
From: Sebastian Schelter <ssc@apache.org>
To: Vicky <javaxmlsoapdev@yahoo.com>
Cc: "user@mahout.apache.org" <user@mahout.apache.org>
Sent: Thursday, February 2, 2012 8:08 AM
Subject: Re: Question on RowSimilarityJob
The heart of RowSimilarityJob is simple cooccurrence counting (with some
tweaks and optimizations). I'll give a simplified example:
We have a 3x3 matrix (could represent items rated by users, docs with
word counts e.g.)
row1: column1, column2
row2: column1, column3
row3: column1, column2
In the first MapReduce pass, we created an inverted index over the
columns (which means we transpose the matrix)
column1: row1, row2, row3
column2: row1, row3
column3: row2
In the second MapReduce pass, we the mapper emits all coocurring row
pairs per column:
for column1:
(row1,row2)
(row1,row3)
(row2,row1)
(row2,row3)
(row3,row1)
(row3,row2)
for column2:
(row1,row3)
(row3,row1)
for column3 there's nothing to emit.
The reducer receives all cooccurrences for a particular row and simply
needs to sum them up:
for row1:
(row1,row2)
(row1,row3)
(row1,row3)
row1,row2 > 1 cooccurrence
row1,row3 > 2 cooccurrences
for row2:
(row2,row1)
(row2,row3)
row2,row1 > 1 cooccurrence
row2,row3 > 1 cooccurrence
for row3:
(row3,row1)
(row3,row2)
(row3,row1)
row3,row1 > 2 cooccurrences
row3,row2 > 1 cooccurrence
Hope that helps with understanding the approach.
sebastian
On 02.02.2012 12:47, Vicky wrote:
> Sebastian,
> when you said,
>>> The computation should be executed in a way that only rows that have at least
one nonzero value in the same dimension (column) are compared
>
>
> if there are three columns in the document (columnA,columnB,columnC,columnD).
>
> Doc1: ColumnA, columnB, columnC,columnD
> Doc2:ColumnA,columnB,columnC,columnD
>
> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB,
Doc1:columnC against Doc2:columnC and likewise?
>
>
>
>
>  Original Message 
> From: Sebastian Schelter <ssc@apache.org>
> To: user@mahout.apache.org
> Cc:
> Sent: Wednesday, February 1, 2012 6:06 AM
> Subject: Re: Question on RowSimilarityJob
>
> Here's a brief description of RowSimilarityJob:
>
> The goal is to compute all pairwise similarities between the rows of a
> sparse matrix A.
>
> The computation should be executed in a way that only rows that have at
> least one nonzero value in the same dimension (column) are compared. We
> need this to avoid a quadratic number of pairwise comparisons.
> Furthermore we should be able to 'embed' arbitrary similarity measures
> and we should always be able to use a combiner in all MapReduce steps.
>
> The computation is executed using three MapReduce passes:
>
> In the first step, the rows of A are preprocessed via
> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
> scaled to unitlength), a single number for each row of A is computed
> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>
> The second steps operates on the rows of A' (the columns of A). The
> mapper sums up all pairwise cooccurrences using
> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
> for one row and uses the similarity measure and the precomputed numbers
> from step one to compute all similarities via
> VectorSimilarityMeasure.similarity().
>
> The third step ensures that only the top k similar rows per row are kept.
>
> It's hard to see from the code but actually the job performs the matrix
> multiplication AA' via outer products with a modified (similarity
> measure specific) dot product.
>
> /s
>
>
>
> On 31.01.2012 22:40, Suneel Marthi wrote:
>> Sebastian,
>>
>> Question on the RowSimilarity job.
>>
>> a) I created sequence files from my document corpus.
>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document
frequency = 1 and minimum support = 1
>> c) Ran the RowId job on the vectors generated in (2)  this gives me an M * N matrix
where M = number of documents in my collection, N = cardinality of the vector  Correct?
>>
>> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity
measure and 'N' as the number of columns  this gives me an M * R matrix where R < N.
>>
>>
>> I am not sure I completely understand as to what's happening in the RowSimilarity
Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
and have been staring at your whiteboard (http://ssc.io/wpcontent/uploads/2011/08/v2.jpg)
for a while to understand what's happening, but guess I need some help.
>>
>> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
>>
>> Thanks for your help.
>>
>>
>>
>> ________________________________
>> From: Sebastian Schelter <ssc@apache.org>
>> To: user@mahout.apache.org
>> Sent: Friday, January 20, 2012 12:58 PM
>> Subject: Re: Question on RowSimilarityJob
>>
>> Hi,
>>
>> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
>> (documents in your use case) to keep per document.
>> 'excludeSelfSimilarity' means that rows (documents) should not be
>> compared to themselves.
>>
>> Sry for the lack of documentation, RowSimilarityJob was originally only
>> an internal job for the recommendation code. I'll try to add something
>> on the wiki in the next days.
>>
>> sebastian
>>
>>
>> On 20.01.2012 17:38, Suneel Marthi wrote:
>>> I am working on determining document similarity of a corpus I am working with
using RowSimilarity.
>>>
>>> Questions:
>>>
>>> a) What do the parameters  'maxSimilaritiesPerRow' and 'excludeSelfSimilarity'
mean?
>>> b) Are there any docs available on RowSimilarityJob available, this is the best
I could find on Sebastian's blog  http://ssc.io/rowsimilarityjobonsteroids/ .
>>>
>>> c) Also do we have any docs on RowIdJob ?
>>>
>>> Thanks and Regards,
>>> Suneel
>>>
