mahout-dev mailing list archives

From "Sebastian Schelter (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix
Date Mon, 21 Jun 2010 09:22:23 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880757#action_12880757
] 

Sebastian Schelter commented on MAHOUT-418:
-------------------------------------------

I attached a new patch that adds equals() and hashCode() to WeightedRowPair. Thank you for
pointing that out.
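
For readers without the patch at hand, here is a minimal sketch of what a consistent equals()/hashCode() pair could look like for such a class. The class name WeightedRowPairSketch and its three fields (two row indices and a weight) are assumptions for illustration; the actual WeightedRowPair in the patch may be structured differently.

```java
import java.util.Objects;

// Hypothetical stand-in for WeightedRowPair; the fields are assumed, not taken from the patch.
final class WeightedRowPairSketch {
  private final int rowA;
  private final int rowB;
  private final double weight;

  WeightedRowPairSketch(int rowA, int rowB, double weight) {
    this.rowA = rowA;
    this.rowB = rowB;
    this.weight = weight;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof WeightedRowPairSketch)) {
      return false;
    }
    WeightedRowPairSketch that = (WeightedRowPairSketch) other;
    // Double.compare keeps equals() consistent with the boxed hash used below.
    return rowA == that.rowA && rowB == that.rowB
        && Double.compare(weight, that.weight) == 0;
  }

  @Override
  public int hashCode() {
    // Equal objects must produce equal hash codes, so hash exactly the compared fields.
    return Objects.hash(rowA, rowB, weight);
  }
}
```

Without both methods, two pairs holding the same values would be treated as different keys in hash-based collections.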

I'm not sure whether the code in o.a.m.cf.taste.hadoop.similarity should be removed yet.
Although it implements the same algorithm as this patch, there are some differences in the
details. By merging them we would lose some optimizations in the cf-specific implementation,
but I agree with Sean that it is desirable to have the cf code use standard
matrix operations.

Differences between the two implementations:

 * vectors use ints as indices, preferences use longs as IDs, so those IDs would need to be
mapped to ints and back (I think the distributed recommender job is already doing that, so
that shouldn't be a big problem)

 * o.a.m.math.hadoop.similarity.RowSimilarityJob writes the whole result matrix to disk (although
it should be symmetric and no information would be lost if only half of it was written) because
we need the whole matrix to be available for following operations and integration into DistributedRowMatrix

 * o.a.m.cf.taste.hadoop.similarity.SimilarityJob automatically assumes the similarity of
an item to itself as NaN (and doesn't compute it) whereas a similarity matrix created by RowSimilarityJob
actively computes and includes these values (because it's a mathematical operation and should
be agnostic of the fact that its main use case is collaborative filtering)
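
To illustrate the first difference above, here is a minimal in-memory sketch of mapping long IDs to int indices and back. The class IdIndexSketch and its methods are hypothetical names, and this is not necessarily how the distributed recommender job does the mapping; a real job would also have to persist the mapping between M/R steps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: assigns consecutive int indices to long IDs in order of first use.
final class IdIndexSketch {
  private final Map<Long, Integer> idToIndex = new HashMap<>();
  private final List<Long> indexToId = new ArrayList<>();

  // Returns the int index for a long ID, creating a new one on first sight.
  int indexOf(long id) {
    Integer index = idToIndex.get(id);
    if (index == null) {
      index = indexToId.size();
      idToIndex.put(id, index);
      indexToId.add(id);
    }
    return index;
  }

  // Maps an int index back to the original long ID.
  long idOf(int index) {
    return indexToId.get(index);
  }
}
```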

A possible solution for the cf use case (that would allow merging the implementations) would
be to have the RowSimilarityJob do the computation and then, in another M/R run, pick out
only the matrix entries we're interested in.

If you want it that way, I can implement that.
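
As a rough illustration of that second pass, here is an in-memory sketch that takes a full symmetric similarity matrix and keeps only the entries above the diagonal, which drops the self-similarities and the mirrored half. It assumes those are the entries the cf code is interested in; the names SimilarityFilterSketch, Entry and keepUpperTriangle are hypothetical, and a real implementation would run as an M/R step over the job's output rather than over a double[][].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-filter for a full, symmetric similarity matrix.
final class SimilarityFilterSketch {

  // One retained matrix cell.
  static final class Entry {
    final int row;
    final int column;
    final double value;

    Entry(int row, int column, double value) {
      this.row = row;
      this.column = column;
      this.value = value;
    }
  }

  // Keeps only cells with column > row: no diagonal, no mirrored duplicates.
  static List<Entry> keepUpperTriangle(double[][] similarities) {
    List<Entry> kept = new ArrayList<>();
    for (int row = 0; row < similarities.length; row++) {
      for (int column = row + 1; column < similarities[row].length; column++) {
        kept.add(new Entry(row, column, similarities[row][column]));
      }
    }
    return kept;
  }
}
```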

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the mailing list
started by Kris Jack about computing a document similarity matrix, I tried to generalize the
approach we're already using to compute the item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix in a distributed
manner. It uses a SequenceFile<IntWritable,VectorWritable> as input and outputs such
a file too. Custom similarity implementations can be supplied; I've already implemented tanimoto
and cosine for demo and testing purposes. The algorithm is based on the one presented here:
http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by running it
with a reasonably large input. I'm also worried that it might buffer too much data in certain
steps.
> If you decide to include it in mahout, some more efforts and decisions (like more tests,
more similarity measures, integration with DistributedRowMatrix) would need to be made, I
guess.
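
For orientation, the column-wise idea behind such a job can be sketched in memory as follows. This is only my reading of the general approach (invert the matrix into per-column postings, then let every pair of rows that shares a column contribute to their dot product), shown for cosine similarity. It is not the code from the patch; the names PairwiseCosineSketch, Posting and cosine are hypothetical, and the dense n x n result is used purely for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; a distributed job would split these phases into M/R steps.
final class PairwiseCosineSketch {

  // A non-zero cell as seen from its column: which row it belongs to and its value.
  private static final class Posting {
    final int row;
    final double value;

    Posting(int row, double value) {
      this.row = row;
      this.value = value;
    }
  }

  // rows.get(i) maps a column index to the non-zero value of row i in that column.
  static double[][] cosine(List<Map<Integer, Double>> rows) {
    int n = rows.size();
    double[] norms = new double[n];
    Map<Integer, List<Posting>> postings = new HashMap<>();

    // Phase 1: invert the matrix and compute each row's Euclidean norm.
    for (int row = 0; row < n; row++) {
      double sumOfSquares = 0.0;
      for (Map.Entry<Integer, Double> cell : rows.get(row).entrySet()) {
        double value = cell.getValue();
        sumOfSquares += value * value;
        postings.computeIfAbsent(cell.getKey(), k -> new ArrayList<>())
            .add(new Posting(row, value));
      }
      norms[row] = Math.sqrt(sumOfSquares);
    }

    // Phase 2: only rows that share a column contribute to each other's dot product.
    double[][] dots = new double[n][n];
    for (List<Posting> column : postings.values()) {
      for (Posting a : column) {
        for (Posting b : column) {
          dots[a.row][b.row] += a.value * b.value;
        }
      }
    }

    // Phase 3: normalize dot products into cosine similarities (0 for empty rows).
    double[][] similarities = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        double denominator = norms[i] * norms[j];
        similarities[i][j] = denominator == 0.0 ? 0.0 : dots[i][j] / denominator;
      }
    }
    return similarities;
  }
}
```

The nested loop over each column's postings is where a column with many non-zero entries produces a quadratic number of pairs, which is the kind of step where buffering could become a concern.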

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

