mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Email and Collab. Filtering
Date Tue, 06 Sep 2011 21:25:36 GMT
My rationale for being such a binary bigot is that, in my experience, one
signal always dominates pretty much completely.  Other signals are pretty
much just noise (too little engagement), are subject to spammy misdirection
(bad titles on videos, for instance), or are too rare to give any
significant lift (user ratings versus views/engagements).

In cases where the alternative signal is more voluminous than the engagement
that I am interested in, it is invariably very noisy.  This is guaranteed
since I would otherwise have used the higher volume signal.  In every case I
have tried, using the high volume, high noise signal degraded performance
significantly because it made it hard to find the clean signal.  The low
volume signals have never led to any gain and often were strange enough that
they hurt things badly.  Besides, they typically are much less than 10% of
the data.

Aside from the general data quality and availability issues, there are the
computational issues.  Having binary data allows me to use much faster and
cooler algorithms like LLR (log-likelihood ratio) similarity.
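
A minimal sketch of the kind of LLR score referred to above, assuming
binary data where each item is represented only by the set of users who
engaged with it.  It uses the standard G^2 log-likelihood ratio over a 2x2
co-occurrence table; the function names and the set-based layout are
illustrative assumptions, not code from this thread.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # G^2 statistic for a 2x2 contingency table:
    #   k11 = users who touched both items
    #   k12 = users who touched only item A
    #   k21 = users who touched only item B
    #   k22 = users who touched neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def cooccurrence_llr(users_a, users_b, total_users):
    # Build the 2x2 table from two sets of user ids (binary engagement).
    both = len(users_a & users_b)
    only_a = len(users_a - users_b)
    only_b = len(users_b - users_a)
    neither = total_users - both - only_a - only_b
    return llr(both, only_a, only_b, neither)
```

Because the input is just "engaged or not", the whole computation reduces
to four counts per item pair, which is what makes it cheap.  A table with
no association, such as llr(1, 1, 1, 1), scores zero.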

The upshot is that I don't see anything but downside for including rating or
synthetic rating data.

I should add, of course, before lightning strikes that your mileage may
vary.

On Tue, Sep 6, 2011 at 12:56 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> Ted,
>
> Been meaning to follow up on this...
>
> On Aug 22, 2011, at 11:29 AM, Ted Dunning wrote:
>
> > On Mon, Aug 22, 2011 at 8:21 AM, Daniel Xiaodan Zhou <danithaca@gmail.com> wrote:
> >
> >> I think this is reasonable. Some suggestions:
> >>
> >> 1. Instead of using the total number of interactions as cell value, map
> >> the number to a 1-5 score based on histogram
> >>
> >
> > I would map to {0,1} rather than a fake rating scale.
>
> What's your reasoning for this, versus, something like number of replies?
>  My somewhat naive intuition thought that I would want to somehow capture
> the fact that a particular user has interacted more frequently with an item
> vs. simply a boolean preference.  Or, is it just that in the big scheme of
> things, it won't matter much, so why complicate it?
>
> Thanks,
> Grant
>
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>
>
