mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: [Taste] Sanity Check and Questions
Date Fri, 19 Jun 2009 11:16:32 GMT
This is all good stuff, here, Ted.  Thank you.

For the task at hand, I am focusing on what is available in Taste as
an expression of some level of capability for doing CF.

Two things that aren't clear to me just yet from the Taste APIs are:
1. Given a new user with no ratings, recommend items.  I see the  
recommenders have an estimatePreference() method, maybe that helps.  I  
suppose the other option is to assume the user rates all items as  
average and go from there.

2.  As a related approach, given a user visiting an item, recommend
other items.  For this, I imagine that if I transpose the model to go
from items->users, I can get a set of recommended users.  Then, from
those users (reverting back to the original model), I can get
recommended items.
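
To make the two-hop idea in (2) concrete, here is a minimal sketch in plain Python rather than against the Taste API. The in-memory layout (a dict of user -> {item: rating}) and the function name recommend_from_item are assumptions for illustration only, and the scoring (a plain sum of ratings) is the simplest thing that could work, not something Taste is known to do.

```python
from collections import defaultdict


def recommend_from_item(ratings, seed_item, top_n=3):
    """ratings: dict user -> {item: rating} (hypothetical in-memory layout)."""
    # Transposed view of the model: item -> set of users who rated it.
    users_by_item = defaultdict(set)
    for user, prefs in ratings.items():
        for item in prefs:
            users_by_item[item].add(user)

    # Hop 1: the users associated with the item being visited.
    audience = users_by_item.get(seed_item, set())

    # Hop 2: back in the original model, sum those users' ratings
    # for every other item.
    scores = defaultdict(float)
    for user in audience:
        for item, rating in ratings[user].items():
            if item != seed_item:
                scores[item] += rating

    # Highest score first; ties broken by item name for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]
```

Summing raw ratings like this will favor items that are simply popular, which is the top-40 problem raised in the quoted message below.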



On Jun 18, 2009, at 4:43 PM, Ted Dunning wrote:

> Grant,
>
> The data you described is pretty simple and should produce good
> results at all levels of overlap.  That it does not is definitely a
> problem.  In fact, I would recommend making the data harder to deal
> with by giving non-Lincoln items highly variable popularities and then
> making the groundlings rate items according to their popularity.  This
> will produce an apparent pattern in which any neighbor group that
> includes non-Lincoln fans appears to like popular items.  The correct
> inference should, however, be that any neighbor group that has a large
> number of Lincoln fans seems to like popular items less than expected.
>
> For problems like this, I have had good luck with using measures that
> are robust in the face of noise (aka small counts) and in the face of
> large external trends (aka the top-40 problem).  The simplest one that
> I know of is the generalized multinomial log-likelihood ratio
> <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>
> that you hear me nattering about so often.  LSI does a decent job of
> dealing with the top-40, but has trouble with small counts.  LDA and
> related probabilistic methods should work somewhat better than
> log-likelihood ratio, but are considerably more complex to implement.
>
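
For reference, a minimal Python sketch of the log-likelihood ratio for a 2x2 table of counts, written in the entropy form. The cell naming is an assumption made here for illustration: k11 = users in the neighborhood who liked the item, k12 = users in the neighborhood who did not, k21 = users outside who liked it, k22 = users outside who did not.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)) over the counts.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))
```

The tables (1, 0, 0, 1) and (100, 0, 0, 100) have the same frequency ratios, but the second scores far higher because it rests on far more evidence. A table whose rows have identical proportions, such as (10, 10, 10, 10), scores zero.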
> The key here is to compare counts within the local neighborhood to
> counts outside the neighborhood.  Things that are significantly
> different about the neighborhood relative to the rest of the world are
> candidates for recommendation.  Things to avoid when looking for
> interesting differences include:
>
> - correlation measures such as Pearson's R (based on a normal
>   distribution approximation and unscaled, so it suffers from both
>   small-count and top-40 problems)
>
> - anomaly measures based simply on frequency ratios (very sensitive to
>   small-count problems, and they don't account for top-40 at all)
>
> What I would recommend for a nearest neighbor approach is to continue
> with the current neighbor retrieval, but switch to a log-likelihood
> ratio for generating recommendations.
>
> What I would recommend for a production system would be to scrap the
> nearest neighbor approach entirely and go to a cooccurrence matrix
> based approach.  This costs much less to compute at recommendation
> time and is very robust against both small counts and top-40 issues.
>
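
A rough sketch of what a cooccurrence matrix approach could look like, again in plain Python. The input layout (user -> set of items) and both function names are assumptions for illustration. The matrix is built once offline, and recommendation is then just a lookup and a sum.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(histories):
    """histories: dict user -> set of items (hypothetical input layout)."""
    cooc = defaultdict(Counter)
    for items in histories.values():
        # Count every unordered pair of items seen together for one user.
        for a, b in combinations(sorted(items), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def recommend(cooc, user_items, top_n=3):
    scores = Counter()
    for item in user_items:
        for other, count in cooc.get(item, {}).items():
            if other not in user_items:  # skip items the user already has
                scores[other] += count
    # Highest score first; ties broken by item name for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]
```

Raw counts on their own still favor popular items. This sketch does not attempt the filtering of the matrix (for example, with the log-likelihood ratio mentioned above) that would presumably be needed to get the robustness described.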
> On Thu, Jun 18, 2009 at 9:37 AM, Sean Owen <srowen@gmail.com> wrote:
>
>> Still seems a little funny. I am away from the code, otherwise I
>> would check; I forget if I ever implemented weighting in the standard
>> correlation similarity metrics. A pure correlation does not account
>> for whether you overlapped in 10 or 1000 items. This sort of
>> weighting exists elsewhere, but I forget whether it does here. It
>> might help.
>>
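
One way the overlap weighting described above could look, as a hedged Python sketch. The linear damping and the cutoff of 50 shared items are assumptions chosen for illustration, not what Taste implements. The point is only that a correlation computed over very few shared items is scaled down.

```python
import math


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for constant ratings
    return cov / math.sqrt(vx * vy)


def weighted_similarity(prefs_a, prefs_b, full_weight_at=50):
    """prefs_*: dict item -> rating. full_weight_at is a made-up cutoff."""
    common = sorted(set(prefs_a) & set(prefs_b))
    if len(common) < 2:
        return 0.0
    r = pearson([prefs_a[i] for i in common], [prefs_b[i] for i in common])
    # Scale down correlations that rest on only a few shared items.
    return r * min(len(common), full_weight_at) / full_weight_at
```

Two users who share only two items get a raw Pearson value of exactly +1 or -1 whenever their ratings differ at all. With this weighting that becomes 0.04 in magnitude, while an overlap of 50 or more items keeps its full correlation.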
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

