Yes, this is the essential problem with some similarity metrics like
Pearson correlation. In its pure form, it takes no account of the size
of the data set on which the calculation is based. (That's why the
framework has a crude variation which you can invoke with
Weighting.WEIGHTED, to factor this in.)
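To make the idea concrete, here's a small sketch of "significance
weighting": shrink the correlation toward 0 when few co-rated items
support it. The shrink formula and the threshold of 50 are illustrative
choices on my part, not necessarily the exact formula behind
Weighting.WEIGHTED.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation over co-rated items.
    Note it takes no account of how many items there are."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def weighted_pearson(xs, ys, full_weight_at=50):
    """Shrink the correlation when it rests on few co-rated items.
    A correlation of 1.0 over 3 items ends up much weaker than the
    same correlation over 50+ items."""
    n = len(xs)
    return pearson(xs, ys) * min(n, full_weight_at) / full_weight_at
```

With 3 co-rated items, a perfect correlation of 1.0 is scaled down to
3/50 = 0.06, so a long noisy overlap can beat a short perfect one.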
I think your proposal perhaps goes too far the other way, completely
favoring "count". But it's not crazy or anything, and it probably works
reasonably well on some data sets.
There are many ways you could modify these stock algorithms to account
for the effects you have in mind. Most of what's in the framework is
just the basic ideas that come from canonical books and papers.
Here's another idea to play with: instead of weighting an item's
score by average similarity to the user's preferred items, weight by
average minus standard deviation. This tends to penalize candidate
items that are similar to only a few of the user's items, since there
will be only a few data points and the standard deviation will be
larger.
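The idea above is a one-liner; a sketch (the single-data-point fallback
is my own choice, since a sample standard deviation needs at least two
points):

```python
import statistics

def score(similarities):
    """Score a candidate item by the mean of its similarities to the
    user's preferred items, penalized by their standard deviation.
    Items supported by few, scattered data points get dragged down."""
    mean = statistics.mean(similarities)
    if len(similarities) < 2:
        return mean  # stdev is undefined for one point; hedge as you like
    return mean - statistics.stdev(similarities)
```

For example, four consistent similarities of 0.8 keep their full score
of 0.8, while a two-point set of {0.9, 0.1} with the same kind of mean
is penalized heavily.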
Matrix factorization / SVD-based approaches are deeper magic: more
complex, more computation, much harder math, but theoretically more
powerful. I'd see how far you can get on a basic user-user approach
(or item-item) as a baseline and then go dig into these.
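For reference, the whole user-user baseline fits in a few lines. This
is a sketch, not Mahout's implementation; cosine over co-rated items is
just the simplest similarity choice here, and all names are made up:

```python
import math

def cosine(xs, ys):
    """Cosine similarity over two aligned rating vectors."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return num / den if den else 0.0

def user_user_recommend(ratings, target, top_n=2):
    """Score each item the target hasn't rated by the similarity-weighted
    average of other users' ratings of it; return the top_n item IDs."""
    others = {u: r for u, r in ratings.items() if u != target}
    # Similarity of each other user to the target, over co-rated items.
    sims = {}
    for u, r in others.items():
        common = sorted(set(r) & set(ratings[target]))
        if common:
            sims[u] = cosine([ratings[target][i] for i in common],
                             [r[i] for i in common])
    # Weighted-average prediction for each unseen item.
    candidates = {i for r in others.values() for i in r} - set(ratings[target])
    scores = {}
    for item in candidates:
        num = den = 0.0
        for u, s in sims.items():
            if item in others[u]:
                num += s * others[u][item]
                den += abs(s)
        if den:
            scores[item] = num / den
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Swapping the outer loop to iterate over items the user rated, against
other items' rating columns, gives the item-item variant.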
On Sat, Feb 19, 2011 at 12:02 PM, Chris Schilling <chris@cellixis.com> wrote:
> Hey Sean,
>
> Thank you for the detailed reply. Interesting points. I think I have approached some
> of these points in my subsequent emails.
>
> You bring up the case where all the users hate the same item. What about the case where
> very few (a single?) similar users love a place? In that case, is this really a better
> recommendation than the popular vote? Where is the middle ground? I think it's an
> interesting point. I'll see how the SVD performs.
>
