mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: RowSimilarity ?'s
Date Tue, 19 Jul 2011 15:20:38 GMT
On Tue, Jul 19, 2011 at 12:24 AM, Sebastian Schelter <ssc@apache.org> wrote:

> Class 1 would be count based similarity measures like Tanimoto-coefficient
> or LLR that can be easily combined by summing the partial counts.
>
> Class 2 would be measures that only need the cooccurrences between the
> vectors like Pearson-Correlation or Euclidean distance or Cosine if the
> vectors are normalized, it should be possible to find intelligent (yet a bit
> hacky) ways to combine their intermediate data.
>
> Class 3 would be measures that are possibly user-supplied and need the
> "weight" of the input vectors as well as all the cooccurrences.
>

I think that, with a bit of algebra, the Euclidean and cosine cases can
go into class 1.

Probably Pearson as well.
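
A minimal sketch of the algebra this reply seems to point at, not code from
Mahout: if each column contributes partial products that are simply summed
(the way class 1 counts are), then cosine, Euclidean distance and Pearson can
all be finished from those sums. The function names are hypothetical, and the
Pearson variant assumes correlation over all n dimensions with missing entries
treated as zeros, which the thread does not specify.

```python
import math
from collections import defaultdict


def partial_products(columns):
    """Sum per-column contributions, the way partial counts would be summed.

    columns: iterable of dicts {row_id: value}, one dict per column.
    Returns (dot, sq_norm, total): dot[(r, s)] is the dot product of rows
    r and s (with r < s), sq_norm[r] is the squared norm of row r, and
    total[r] is the plain sum of row r's entries.
    """
    dot = defaultdict(float)
    sq_norm = defaultdict(float)
    total = defaultdict(float)
    for col in columns:
        items = sorted(col.items())
        for i, (r, v) in enumerate(items):
            sq_norm[r] += v * v
            total[r] += v
            for s, w in items[i + 1:]:
                dot[(r, s)] += v * w
    return dot, sq_norm, total


def cosine(dot_ab, sq_a, sq_b):
    # cos(a, b) = a.b / (|a| |b|)
    return dot_ab / math.sqrt(sq_a * sq_b)


def euclidean(dot_ab, sq_a, sq_b):
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b ; clamp tiny negative rounding error
    return math.sqrt(max(sq_a + sq_b - 2.0 * dot_ab, 0.0))


def pearson(dot_ab, sq_a, sq_b, sum_a, sum_b, n):
    # Assumption: correlation over all n dimensions, zeros included.
    num = n * dot_ab - sum_a * sum_b
    den = math.sqrt((n * sq_a - sum_a ** 2) * (n * sq_b - sum_b ** 2))
    return num / den


# Rows a = [1, 2, 0] and b = [2, 0, 1], stored column by column.
columns = [{"a": 1.0, "b": 2.0}, {"a": 2.0}, {"b": 1.0}]
dot, sq, tot = partial_products(columns)
cos_ab = cosine(dot[("a", "b")], sq["a"], sq["b"])
euc_ab = euclidean(dot[("a", "b")], sq["a"], sq["b"])
pea_ab = pearson(dot[("a", "b")], sq["a"], sq["b"], tot["a"], tot["b"], n=3)
```

The only per-pair quantity is the summed product; the norms and sums are
per-row values that can be computed once and joined in afterwards.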

> I also remember that we once had someone on the list that used
> RowSimilarityJob for precomputing the similarities between millions of
> documents. Unfortunately I couldn't find the conversation yet. IIRC he
> successfully applied a very aggressive sampling strategy.
>

That could have been me.  I didn't use RowSimilarityJob, but I used to
handle 50 million users and 10-20 million documents using a similar approach
(emit pairs and counts).
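
One way to picture "emit pairs and counts" is the single-machine sketch below.
It is an illustration under assumptions, not the system described in this
message: each user history is assumed to be a list of document ids, the
per-user cap `max_items_per_user` is a hypothetical stand-in for the sampling
strategy mentioned in the quoted text, and Tanimoto is used only as an example
of a count-based measure.

```python
import random
from collections import Counter
from itertools import combinations


def cooccurrence_counts(user_histories, max_items_per_user=None, seed=0):
    """Emit every document pair seen together in one user's history and
    count the pairs, along with per-document occurrence counts.

    Histories longer than max_items_per_user (a hypothetical knob) are
    randomly down-sampled, which bounds the quadratic number of pairs
    emitted for very active users.
    """
    rng = random.Random(seed)
    pair_counts = Counter()
    item_counts = Counter()
    for history in user_histories:
        items = sorted(set(history))
        if max_items_per_user is not None and len(items) > max_items_per_user:
            items = sorted(rng.sample(items, max_items_per_user))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return pair_counts, item_counts


def tanimoto(pair_count, count_a, count_b):
    # |A and B| / |A or B|, computed purely from counts
    return pair_count / (count_a + count_b - pair_count)


histories = [["d1", "d2", "d3"], ["d1", "d2"], ["d2", "d3"]]
pairs, items = cooccurrence_counts(histories)
sim_d1_d2 = tanimoto(pairs[("d1", "d2")], items["d1"], items["d2"])
```

In a MapReduce setting the loop body would correspond to the map side and the
two counters to the summing reducers.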
