mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dan Brickley <dan...@danbri.org>
Subject Re: Question on RowSimilarityJob
Date Thu, 02 Feb 2012 12:59:27 GMT
On 1 February 2012 12:06, Sebastian Schelter <ssc@apache.org> wrote:
> Here's a brief description of RowSimilarityJob:
>
> The goal is to compute all pairwise similarities between the rows of a
> sparse matrix A.
>
> The computation should be executed in a way that only rows that have at
> least one non-zero value in the same dimension (column) are compared. We
> need this to avoid a quadratic number of pairwise comparisons.
> Furthermore we should be able to 'embed' arbitrary similarity measures
> and we should always be able to use a combiner in all MapReduce steps.
>
> The computation is executed using three MapReduce passes:
>
> In the first step, the rows of A are preprocessed via
> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
> scaled to unit-length), a single number for each row of A is computed
> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>
> The second steps operates on the rows of A' (the columns of A). The
> mapper sums up all pairwise cooccurrences using
> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
> for one row and uses the similarity measure and the precomputed numbers
> from step one to compute all similarities via
> VectorSimilarityMeasure.similarity().
>
> The third step ensures that only the top k similar rows per row are kept.
>
> It's hard to see from the code but actually the job performs the matrix
> multiplication AA' via outer products with a modified (similarity
> measure specific) dot product.

Thanks for this. I've taken liberty of dropping it in the Wiki at
https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob
(and linking this thread). Hope that's ok...

cheers,

Dan

Mime
View raw message