mahout-user mailing list archives

From Uwe Reimann <>
Subject Re: Lucene for UserSimilarity
Date Tue, 24 May 2011 10:31:25 GMT
On 24.05.2011 11:39, Sean Owen wrote:
> On Tue, May 24, 2011 at 10:17 AM, Uwe Reimann<>  wrote:
>> Since the user provides new preferences at a high rate, I expect to change
>> the neighborhood of an individual user rapidly. Using CachingUserSimilarity
>> or CachingUserNeighborhood probably won't work here. Using a
>> ClusteringRecommender seems to be an option here in order to search against
>> some clusters instead of against many users. The clusters should be recalculated
>> periodically in the background.
> (You can have the cache clear just entries for the current user.)
> Neighborhoods ought to be stable-ish. I would not expect that one new
> data point would significantly change who your most similar users are.
That probably depends on how many data points were available before. I
suspect the 5th data point, for example, has a greater impact than the
105th. Is there a lower limit (above 1) on the number of data points a
user must have before recommendations make sense?

> So you can probably get away with periodically recomputing these,
> perhaps frequently, but not necessarily at every update.
I could trigger the recalculation once the knowledge about the current
user has changed by, say, 25%. That way the recomputation rate would
decrease as a user accumulates more preferences.
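
To make the 25% idea concrete, here is a minimal sketch in plain Java. It assumes that "knowledge changed by 25%" means the number of new preferences relative to the count at the last recomputation; the class RecomputeTrigger and its method are hypothetical names and not part of Mahout.

```java
/** Hypothetical helper (not a Mahout class): decides when a user's neighborhood is stale. */
final class RecomputeTrigger {
    private final double threshold;    // e.g. 0.25 for "25% new knowledge"
    private int prefsAtLastRecompute;  // preference count when the neighborhood was last built
    private int prefsNow;              // current preference count

    RecomputeTrigger(double threshold) {
        this.threshold = threshold;
    }

    /** Records one new preference; returns true if the neighborhood should be rebuilt now. */
    boolean recordPreference() {
        prefsNow++;
        // Assumption: with no baseline yet, the first preference counts as a full change.
        double change = prefsAtLastRecompute == 0
                ? 1.0
                : (double) (prefsNow - prefsAtLastRecompute) / prefsAtLastRecompute;
        if (change >= threshold) {
            prefsAtLastRecompute = prefsNow;  // new baseline after the rebuild
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RecomputeTrigger trigger = new RecomputeTrigger(0.25);
        for (int i = 1; i <= 12; i++) {
            System.out.println("preference " + i + " -> recompute: " + trigger.recordPreference());
        }
    }
}
```

With this rule a new user triggers a rebuild on almost every update, while a user with around 100 preferences needs roughly 25 new ones, which matches the intuition that early data points matter more.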
> You do need to use the latest preferences in recommendation, of
> course, but that's separate from calculating a neighborhood.
>> Dislikes should be considered during similarity search. I'd like to express
>> those as negative preference values. PearsonCorrelationSimilarity should be
>> ok with that, right?
> Yes.
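
A small illustration of why negative values are unproblematic: Pearson correlation centers each user's values on their mean, so the sign of the raw preferences does not matter. The sketch below is a plain-Java textbook computation over co-rated items, not Mahout's PearsonCorrelationSimilarity; the class name PearsonSketch and the map-based input are assumptions for the example.

```java
import java.util.Map;

/** Hypothetical stand-alone sketch; not Mahout's implementation. */
final class PearsonSketch {

    /** Pearson correlation over items both users rated; NaN if it is undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double y = b.get(e.getKey());
            if (y == null) {
                continue;  // only items rated by both users contribute
            }
            double x = e.getValue();
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN;  // too little overlap to correlate
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        double denom = Math.sqrt(varA * varB);
        return denom == 0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        // Likes as +1.0, dislikes as -1.0, keyed by item ID.
        Map<Long, Double> u1 = Map.of(1L, 1.0, 2L, -1.0, 3L, 1.0);
        Map<Long, Double> u2 = Map.of(1L, -1.0, 2L, 1.0, 3L, -1.0);
        System.out.println("u1 vs u1: " + pearson(u1, u1));
        System.out.println("u1 vs u2: " + pearson(u1, u2));
    }
}
```

Note that the result is undefined (NaN here) when the overlap is below two items or when one user's co-rated values are all identical, which is relevant for users with very few preferences.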
>> Since I expect to have very low overlap in items between (especially new)
>> users, I'd like to take the item's category into account during similarity
>> search. User u1, who likes items i1 of category c1 should get item i2 of
>> category c1 recommended if user u2 likes that. Both users would have a
>> preference value for category c1 in common. This should clearly be possible
>> by just providing the calculated preference values for the category items.
> You are describing more of an item-based recommender and indeed I
> think that could be better here since it avoids cold-start problems
> better. (I prefer it as well.) You might look at
> GenericItemBasedRecommender and ItemSimilarity instead.
I did some testing of the different recommenders on a real data set from 
a bookmarking site. GenericBooleanPrefItemBasedRecommender did not work 
very well for me. It seemed to recommend the top links.
GenericUserBasedRecommender worked much better (after some tweaking)
and recommended links that actually fit my interests. I might need to do
some more testing here.

> Your thinking about using Lucene almost surely also applies to
> item-item similarity.
>> I think I need to provide different DataModels to the different stages of
>> recommendation calculation: 1) one which includes likes and dislikes for
>> items and categories for similarity search, 2) one which includes just the
>> liked items to pick the recommendations from and 3) one which includes all
>> items of a user (liked, disliked and skipped ones) for filtering out the
>> user's items using an IDRescorer.
> I think one DataModel is fine. You want to include all data in
> similarity calculations (1). It is also good to have all items
> available in recommendation (2); you don't want to exclude an item
> just because someone didn't like it. And in (3) you do not need to
> filter out items the user has rated; that's done already.
(1) would include categories, which should not be recommended; that's why
(2) is used to pick the recommendations from. (2) would contain
the liked items of every user, which includes items that are disliked by
other users. (3) is for filtering out items that the user has not rated
but that have been presented to them before.
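
One way the filtering in (1) and (3) could work on top of a single DataModel is sketched below, assuming the category pseudo-items and the already-presented items are available as ID sets. The Rescorer interface is a local stand-in shaped like Mahout's IDRescorer (rescore and isFiltered) so that the example compiles without Mahout; SeenItemsRescorer and its fields are hypothetical names.

```java
import java.util.Set;

/** Local stand-in shaped like Mahout's IDRescorer; declared here only to keep the sketch self-contained. */
interface Rescorer {
    double rescore(long id, double originalScore);
    boolean isFiltered(long id);
}

/** Hypothetical rescorer: hides category pseudo-items and items already shown to the user. */
final class SeenItemsRescorer implements Rescorer {
    private final Set<Long> presentedItemIds;  // shown to this user before, rated or not
    private final Set<Long> categoryItemIds;   // pseudo-items that only exist for similarity

    SeenItemsRescorer(Set<Long> presentedItemIds, Set<Long> categoryItemIds) {
        this.presentedItemIds = presentedItemIds;
        this.categoryItemIds = categoryItemIds;
    }

    @Override
    public double rescore(long id, double originalScore) {
        return originalScore;  // scores stay untouched; this rescorer only filters
    }

    @Override
    public boolean isFiltered(long id) {
        return presentedItemIds.contains(id) || categoryItemIds.contains(id);
    }

    public static void main(String[] args) {
        Rescorer rescorer = new SeenItemsRescorer(Set.of(1L, 2L), Set.of(100L));
        for (long id : new long[] {1L, 5L, 100L}) {
            System.out.println("item " + id + " filtered: " + rescorer.isFiltered(id));
        }
    }
}
```

In an actual Mahout setup the class would implement IDRescorer directly and be passed along with the recommendation request; the sketch only shows the filtering logic.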
