mahout-dev mailing list archives

From "Sebastian Schelter (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (MAHOUT-767) Improve RowSimilarityJob performance for count-based distance measures
Date Mon, 25 Jul 2011 18:13:09 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070647#comment-13070647 ]

Sebastian Schelter edited comment on MAHOUT-767 at 7/25/11 6:11 PM:
--------------------------------------------------------------------

Patch with first proof-of-concept code. It introduces AlgebraicRowSimilarityJob.

Instead of emitting (n*(n-1))/2 pairs from each inverted index entry, it emits n "stripes".
Each stripe consists of two vectors: the first holds the partial dot products/counts and
the second holds the norms of the cooccurring rows. These stripes can easily be merged by
a combiner.
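
To make the difference concrete, here is a minimal Python sketch of the two emission
strategies. It is not the code from the patch: the function names, the dict-based "vectors"
and the choice to give each row a stripe covering only the rows that sort after it are
assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def emit_pairs(column):
    """Pairwise approach: one record per unordered pair of rows in the entry."""
    # column maps row id -> value for one inverted index entry.
    return [((a, b), column[a] * column[b])
            for a, b in combinations(sorted(column), 2)]


def emit_stripes(column, norms):
    """Stripe approach: one record per row, keyed by that row.

    Each stripe is (partial dot products, norms of the cooccurring rows).
    Assumption: a stripe only covers rows that sort after its key, so the
    stripe of the last row is empty.
    """
    rows = sorted(column)
    stripes = []
    for i, a in enumerate(rows):
        others = rows[i + 1:]
        dots = {b: column[a] * column[b] for b in others}
        stripe_norms = {b: norms[b] for b in others}
        stripes.append((a, (dots, stripe_norms)))
    return stripes


def merge_stripes(left, right):
    """Combiner step: add up partial dot products, keep one norm per row."""
    dots = defaultdict(float)
    merged_norms = {}
    for stripe_dots, stripe_norms in (left, right):
        for row, value in stripe_dots.items():
            dots[row] += value
        merged_norms.update(stripe_norms)
    return dict(dots), merged_norms


if __name__ == "__main__":
    n = 100
    column = {row: 1.0 for row in range(n)}
    norms = {row: 1.0 for row in range(n)}
    print(len(emit_pairs(column)), "pairs vs", len(emit_stripes(column, norms)), "stripes")
```

Because stripes from different index entries share the same key (the row id), a combiner
can merge them map-side with merge_stripes before anything is sent to the reducers.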

So we emit fewer objects and hopefully combine a lot of them, which should improve
performance.

I attached implementations for LLR, Tanimoto, Cosine and Cooccurrence count. Euclidean distance
and Pearson correlation are still missing, but we should be able to add them later (see AlgebraicVectorSimilarity).
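
As an illustration of why these measures fit the scheme, the sketch below shows how a final
similarity can be computed from nothing more than an accumulated sum and one per-row
quantity. The function names are hypothetical and this is not the AlgebraicVectorSimilarity
interface from the patch. For Cosine the per-row quantity is assumed to be the Euclidean
norm; for Tanimoto on binary rows it is assumed to be the number of non-zero entries.

```python
import math


def cosine(dot, norm_a, norm_b):
    """Cosine similarity from a summed dot product and two Euclidean norms."""
    return dot / (norm_a * norm_b)


def tanimoto(cooccurrences, count_a, count_b):
    """Tanimoto coefficient for binary rows: |A and B| / |A or B|."""
    return cooccurrences / (count_a + count_b - cooccurrences)


if __name__ == "__main__":
    row_a = [1.0, 2.0, 0.0]
    row_b = [2.0, 0.0, 1.0]

    # One partial product per column; these can be summed in any grouping,
    # which is what allows a combiner to pre-aggregate them.
    partials = [a * b for a, b in zip(row_a, row_b)]
    dot = sum(partials)

    norm_a = math.sqrt(sum(a * a for a in row_a))
    norm_b = math.sqrt(sum(b * b for b in row_b))
    print(cosine(dot, norm_a, norm_b))
    print(tanimoto(cooccurrences=2, count_a=3, count_b=4))
```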

The patch has unit tests, but as I don't currently have access to a testing cluster (this will
change in the next few weeks), it would be great if someone could verify that this code performs
better than the existing approach. Seeing some numbers would be awesome.

> Improve RowSimilarityJob performance for count-based distance measures
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-767
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-767
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>             Fix For: 0.6
>
>         Attachments: MAHOUT-767.patch
>
>
> (See http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7
> for background)
> Currently, the RowSimilarityJob defers the calculation of the similarity metric until
> the reduce phase, while emitting many Cooccurrence objects.  For similarity metrics that are
> algebraic (http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should be able
> to do much of the computation during the Mapper part of this phase and also take advantage
> of a Combiner.
> We should use a marker interface to know whether a similarity metric is algebraic and
> then make use of an appropriate Mapper implementation, otherwise we can fall back on our existing
> implementation.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira