mahout-user mailing list archives

From Sean Owen
Subject Re: best similarity metric for collaborative filtering
Date Tue, 26 Apr 2011 16:12:06 GMT
That reduces to something like the Jaccard / Tanimoto coefficient -- not
precisely, since you're dividing by the product of the lengths (norms) of
those vectors rather than the size of their "union", but practically similar.
And that's implemented as TanimotoCoefficientSimilarity.
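
To make the comparison concrete, here is a minimal sketch in plain Python (not Mahout code; the function names and the sample item IDs are made up for illustration). It treats each user as the set of items they bought, which is equivalent to a 0/1 vector, and computes both measures. The numerator is the same in both cases; only the denominator differs.

```python
import math

def cosine_binary(a, b):
    """Cosine similarity of two 0/1 vectors given as sets of item IDs.

    For 0/1 vectors the dot product is the overlap size and each
    vector's norm is the square root of its set size.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: overlap size over union size."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Hypothetical purchase histories.
alice = {"item1", "item2", "item3"}
bob = {"item2", "item3", "item4"}

print(cosine_binary(alice, bob))  # 2 / sqrt(3 * 3), about 0.667
print(tanimoto(alice, bob))       # 2 / 4 = 0.5
```

The two scores differ in value, but both grow with the overlap and shrink as the histories get larger, which is why they tend to behave similarly in practice.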

Perhaps my point is that in Mahout (well, the recommender end of the world),
binary data is not {0,1} data but {null,1} data: an item is either present or
simply absent, never an explicit 0. That's on purpose, mostly for performance.
And then everything else is implemented in terms of this more or less equally
valid model of the world. I had thought the question was "how do you do this
in Mahout".
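
As a rough illustration of the {null,1} idea (again plain Python, not Mahout's actual data structures; the catalog and helper names are hypothetical), the sketch below stores only the items a user has and shows that a set-based Tanimoto gives the same answer as one computed over dense 0/1 vectors, without ever touching the absent entries.

```python
def tanimoto_dense(u, v):
    """Tanimoto over dense 0/1 lists; visits every catalog position."""
    both = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    return both / either if either else 0.0

def tanimoto_sparse(a, b):
    """Tanimoto over sets of present items; absent items are never stored."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def to_dense(items, catalog):
    """Expand a set of item IDs into a 0/1 list over the whole catalog."""
    return [1 if item in items else 0 for item in catalog]

# Hypothetical catalog and purchase histories.
catalog = ["a", "b", "c", "d", "e"]
alice = {"a", "b", "c"}
bob = {"b", "c", "d"}

dense_score = tanimoto_dense(to_dense(alice, catalog), to_dense(bob, catalog))
sparse_score = tanimoto_sparse(alice, bob)
print(dense_score == sparse_score)  # the two representations agree
```

The dense version does work proportional to the catalog size, while the set version only does work proportional to what the two users actually have, which is the performance argument in a nutshell.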

On Tue, Apr 26, 2011 at 5:03 PM, Ted Dunning wrote:

> Setting didn't-buy to 0 and getting a valid cosine distance is pretty common
> in these scenarios.
> I still prefer what Sean is recommending in terms of LLR for item to item
> links, but the cosine version does make sense to support, especially for
> purchase histories.
> Even better would be to remember number of times an item was offered as well
> as the number of times it was purchased.  This allows regression techniques
> to be applied, often with good results.
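
For reference, the LLR score mentioned in the quoted message for item-to-item links can be sketched as the standard log-likelihood ratio on a 2x2 co-occurrence table. This is a minimal plain-Python version in the entropy form, not taken from Mahout's source; the function names and the count labels are assumptions for illustration: k11 is the number of users with both items, k12 and k21 the users with only one of them, and k22 the users with neither.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users with both items     k12: users with item A only
    k21: users with item B only    k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

print(llr(1, 1, 1, 1))     # counts consistent with independence: score near 0
print(llr(100, 5, 5, 100)) # items that mostly occur together: large score
```

A score near zero means the co-occurrence counts look like what independence would produce; larger scores indicate a stronger association (in either direction), so in practice the score is usually combined with a check that the items co-occur more often than expected.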
