mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Email and Collab. Filtering
Date Wed, 07 Sep 2011 01:08:01 GMT
> I should add, of course, before lightning strikes that your mileage may
vary.

Know Your Data. I got far with KNIME (visual programming for statistics) and
still use it, before breaking down and using Excel. There's a brilliant
book called "Think Stats: Probability and Statistics for Programmers". It is
brilliant not in execution but in concept: it plays upon programmers' deep
allergy to Excel et al. by doing all of its examples in Python
scripts. "Excel" and "spreadsheet" were not in the index. Skinny little book
for $30, but you can download it for free.

http://greenteapress.com/thinkstats/

The problem with ratings comes down to a very basic principle of marketing
surveys: the raters are a self-selected sample. If you want valid results
from a marketing survey, you pick the sample; the respondents do not pick
themselves. But there is a binary feature per user: are the ratings 3.0
(the default), or was the user engaged enough to push the button? If you do
some charting on this with your dataset, you may find that because the UI is
set for a default of 3.0, you can separate two sets of users very reliably.
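
A minimal sketch of that per-user binary feature, assuming the ratings are
available as (user, item, rating) tuples and that the UI default really is
3.0. The function name and the data layout are hypothetical, not taken from
any particular dataset:

```python
from collections import defaultdict

DEFAULT_RATING = 3.0  # assumption: the value the UI pre-fills


def split_by_engagement(ratings, default=DEFAULT_RATING):
    """Split users into (engaged, passive) sets.

    A user counts as engaged if at least one of their ratings differs
    from the UI default, and as passive if every rating equals it.
    """
    by_user = defaultdict(list)
    for user, _item, rating in ratings:
        by_user[user].append(rating)

    engaged, passive = set(), set()
    for user, values in by_user.items():
        if any(value != default for value in values):
            engaged.add(user)
        else:
            passive.add(user)
    return engaged, passive


# Hypothetical toy data: "a" never moved off the default, "b" did once.
sample = [
    ("a", "x", 3.0),
    ("a", "y", 3.0),
    ("b", "x", 5.0),
    ("b", "y", 3.0),
]
engaged_users, passive_users = split_by_engagement(sample)
```

Charting the size of each set, or the share of default ratings per user, is
one way to check whether the two groups separate as cleanly in your data.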

On Tue, Sep 6, 2011 at 2:25 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> My rationale for being such a binary bigot is that I have found that (in my
> experience) one signal always dominates pretty much completely.  Other
> signals are pretty much just noise (too little engagement), are subject to
> spammy misdirection (bad titles on videos, for instance), or are too rare to
> give any significant lift (user ratings versus views/engagements).
>
> In cases where the alternative signal is more voluminous than the
> engagement that I am interested in, it is invariably very noisy.  This is
> guaranteed since I would otherwise have used the higher volume signal.  In
> every case I have tried, using the high volume, high noise signal degraded
> performance significantly because it made it hard to find the clean signal.
> The low volume signals have never led to any gain and often were strange
> enough that they hurt things badly.  Besides, they typically are much less
> than 10% of the data.
>
> Aside from the general data quality and availability issues, there are the
> computational issues.  Having binary data allows me to use much faster and
> cooler algorithms like LLR.
>
> The upshot is that I don't see anything but downside for including rating
> or synthetic rating data.
>
> I should add, of course, before lightning strikes that your mileage may
> vary.
>
> On Tue, Sep 6, 2011 at 12:56 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>
> > Ted,
> >
> > Been meaning to follow up on this...
> >
> > On Aug 22, 2011, at 11:29 AM, Ted Dunning wrote:
> >
> > > On Mon, Aug 22, 2011 at 8:21 AM, Daniel Xiaodan Zhou <danithaca@gmail.com> wrote:
> > >
> > >> I think this is reasonable. Some suggestions:
> > >>
> > >> 1. Instead of using the total number of interactions as cell value,
> > >> map the number to a 1-5 score based on histogram
> > >>
> > >
> > > I would map to {0,1} rather than a fake rating scale.
> >
> > What's your reasoning for this, versus, something like number of replies?
> > My somewhat naive intuition thought that I would want to somehow capture
> > the fact that a particular user has interacted more frequently with an
> > item vs. simply a boolean preference.  Or, is it just that in the big
> > scheme of things, it won't matter much, so why complicate it?
> >
> > Thanks,
> > Grant
> >
> >
> > --------------------------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> > Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >
> >
>
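
On the LLR point in Ted's quoted message above: "LLR" presumably refers to
the log-likelihood ratio test on a 2x2 table of co-occurrence counts, which
needs only binary (did / did not interact) data. The following is a rough
sketch under that assumption, not Mahout's own code. The function names and
the dict-of-sets layout for interactions are hypothetical:

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy: N ln N - sum(k ln k), where N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: users with both items      k12: users with A but not B
    k21: users with B but not A     k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def cooccurrence_llr(interactions, item_a, item_b):
    """Score an item pair from binary data: {user: set of items touched}."""
    n = len(interactions)
    a = sum(1 for items in interactions.values() if item_a in items)
    b = sum(1 for items in interactions.values() if item_b in items)
    both = sum(
        1 for items in interactions.values()
        if item_a in items and item_b in items
    )
    return llr(both, a - both, b - both, n - a - b + both)
```

A score near zero means the two items co-occur about as often as chance
would predict; larger scores flag pairs whose co-occurrence is unlikely to
be accidental.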



-- 
Lance Norskog
goksron@gmail.com
