mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: [Taste] Sanity Check and Questions
Date Thu, 18 Jun 2009 15:42:25 GMT
My first reaction is that this could make sense, if my guesses below are
accurate.

The 'philes show up as very similar, good. Since they have all rated mostly
the same things, they don't contribute much to the possible recommendations
for each other. If they have all rated all but one Lincoln article then they
will generate at best one new recommendation for each other!

When the normal user sneaks in, suddenly there are many more possible items
to recommend. Naturally, they round out the list. You ask for 10 recs, and
you get the 1 good rec plus the next 9 best recs, all basically from that
user.
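
To make the point above concrete, here is a minimal sketch of how the pool of candidate items changes when one broad rater joins a neighborhood of near-identical users. It is plain Java, not Taste code; the class name, method name, and item IDs are made up for illustration, and it only models "items some neighbor rated that the target user has not".

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Taste/Mahout.
class CandidatePool {

    /** Items rated by at least one neighbor but not yet by the target user. */
    static Set<String> candidates(Set<String> ratedByUser,
                                  List<Set<String>> ratedByNeighbors) {
        Set<String> pool = new LinkedHashSet<>();
        for (Set<String> neighborItems : ratedByNeighbors) {
            pool.addAll(neighborItems);
        }
        pool.removeAll(ratedByUser);
        return pool;
    }

    public static void main(String[] args) {
        // Made-up item IDs: the target user and a fellow 'phile overlap almost fully.
        Set<String> me = Set.of("lincoln-1", "lincoln-2");
        Set<String> phile = Set.of("lincoln-1", "lincoln-2", "lincoln-3");
        // A "normal" user who rated broadly across unrelated articles.
        Set<String> normal = Set.of("lincoln-1", "cats", "dogs", "trains");

        // Only the one Lincoln article the target user has not rated.
        System.out.println(candidates(me, List.of(phile)));
        // The same article plus everything the broad rater touched.
        System.out.println(candidates(me, List.of(phile, normal)));
    }
}
```

With only the fellow 'phile in the neighborhood the pool holds a single item; adding the broad rater supplies every remaining slot in a top-10 list.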

That is, I imagine the estimated pref values of the other 9 are notably
lower. In practice, therefore, you might choose to chop off recs whose
estimated pref is below a certain value, or maybe truncate the list when the
estimate from one to the next drops significantly.
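
Both cut-off ideas can be sketched in a few lines. This is an assumption-laden illustration rather than anything Taste provides: the estimates are plain doubles already sorted from best to worst, and the class name, method names, and threshold values are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the thresholds are arbitrary example values.
class RecTruncation {

    /** Keep only estimates at or above minEstimate. */
    static List<Double> dropBelow(List<Double> estimates, double minEstimate) {
        List<Double> kept = new ArrayList<>();
        for (double estimate : estimates) {
            if (estimate >= minEstimate) {
                kept.add(estimate);
            }
        }
        return kept;
    }

    /**
     * Cut the list at the first place where the estimate falls by more than
     * maxDrop from one entry to the next. Expects descending order.
     */
    static List<Double> cutAtGap(List<Double> estimates, double maxDrop) {
        for (int i = 1; i < estimates.size(); i++) {
            if (estimates.get(i - 1) - estimates.get(i) > maxDrop) {
                return new ArrayList<>(estimates.subList(0, i));
            }
        }
        return new ArrayList<>(estimates);
    }

    public static void main(String[] args) {
        // One strong rec followed by a cluster of much weaker ones.
        List<Double> estimates = List.of(4.8, 2.1, 2.0, 1.9);
        System.out.println(dropBelow(estimates, 3.0));
        System.out.println(cutAtGap(estimates, 1.5));
    }
}
```

A fixed floor is simpler to reason about; the gap rule adapts to users whose estimates all run high or low, at the cost of returning the whole list when the scores decline smoothly.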

Am I right? If so, one response is that this funky behavior is mostly a
function of the particular data you have constructed.

On Jun 18, 2009 4:30 PM, "Grant Ingersoll" <gsingers@apache.org> wrote:

I'm working on a demo of Mahout, and part of it covers collaborative filtering.
For the CF part, I'm taking the lead from an idea from Ted about a way to
demonstrate how CF works conceptually. (Ted please correct me if my
understanding is incorrect)

I took a subset of Wikipedia articles (2302, available at
http://people.apache.org/~gsingers/wikipedia/chunks.tar.gz, created by the
WikipediaXMLSplitter in the example directory).  Next, I picked a topic of
interest, in this case all docs containing the phrase "Abraham Lincoln", and
I made the assumption that there are 10 users out of a total of 1000 who are
"Lincolnphiles" and have thereby rated most of the articles (17 total) on
the topic.  The ratings range between -5 and 5 (as doubles).  For the most
part, the Lincolnphiles like the same things, to varying degrees.  (Note: I
assigned these ratings by hand and thus "stacked the deck".)  The
Lincolnphiles are really obsessed and did not rate any other documents.
However, not all of them rated all 17 articles.  Next, I assumed the other
990 users are randomly rating across all the documents and in the same
range.  Thus, for every article in the set, I randomly grabbed X users and
then have them randomly assign a degree of like or dislike in the range
mentioned.

I then implemented a basic recommender according to the Taste docs under
User-based recommenders section.  I then pass in the user id of one of the
Lincolnphiles.  The results I get back are a bit surprising in that none of
the recommendations are for other items rated highly by the Lincolnphiles,
despite the fact that, when setting the neighborhood to be 10, all of the
other Lincolnphiles are in the neighborhood plus one non-Lincolnphile.  I
would expect the recommendations to be for items that are not rated by my
Lincolnphile, but have been rated by the other Lincolnphiles, or at least
some of them, but in fact none of the recommendations are for Lincoln docs.

OK, so I then played around a bit with the neighborhood size.  If I make it
9 (the number of other Lincolnphiles in the system) or less, I get what I
expected.  So, it seems the one non-Lincolnphile rated a lot more items than
all the Lincolnphiles.  Is that why that user's items seem to dominate the
recommendations?  Comparing the non-Lincoln user with my Lincolnphile, I see
two items that both rated: one that they both really liked and one that they
disagreed on.
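
As a side note on the two shared items: a textbook Pearson correlation computed over only two co-rated items always comes out as +1 or -1 (or is undefined if either user gave both items the same rating), so it says little about real agreement. The sketch below shows this with made-up ratings. It is a plain-Java illustration with a hypothetical class name, not Taste's PearsonCorrelationSimilarity, whose result may differ, in particular because the snippet in this message enables a preference inferrer.

```java
// Hypothetical illustration of textbook Pearson over co-rated items only.
class TwoItemPearson {

    /** Pearson correlation of two equal-length rating arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // Evaluates to NaN when either user has zero variance.
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Made-up ratings on two shared items: both like the first,
        // and they disagree strongly on the second.
        double[] lincolnphile = {5.0, 4.0};
        double[] otherUser = {4.5, -3.0};
        System.out.println(pearson(lincolnphile, otherUser));

        // Swapping the other user's ratings flips the sign.
        System.out.println(pearson(lincolnphile, new double[] {-3.0, 4.5}));
    }
}
```

If the similarity is effectively driven by so few shared items, a broad rater can land in a neighborhood almost by accident, which is one possible reading of the behavior described above.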

I'm not exactly sure what my questions are, other than the one about an
active user dominating like-minded but less active raters, and what the
appropriate thing to do there is, if anything.  Mostly I wanted to make sure
this all makes sense.

Also, is there any notion in Taste similar to Lucene's explain method (
http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Searcher.html#explain(org.apache.lucene.search.Query,%20int)
)?

After this sanity check, my next goal is to show how a new Lincolnphile
coming into the system would be guided to other content on Lincoln.

[And yes, once done, this code will be publicly available, but it will be a
little while]

Here's my snippet of code for recommending, pretty much verbatim from the
Taste docs:
UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
// Optional:
userSimilarity.setPreferenceInferrer(
    new AveragingPreferenceInferrer(dataModel));

UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(neighSize, userSimilarity, dataModel);
Collection<User> users = neighborhood.getUserNeighborhood(userId);
for (User neighbor : users) {
  System.out.println("Neighbor: " + neighbor);
}

Recommender recommender =
    new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);

List<RecommendedItem> recommendations =
    cachingRecommender.recommend(userId, 10);
System.out.println("Recommendations:");
for (RecommendedItem item : recommendations) {
  Item theItem = item.getItem();
  String title = idsToTitle.get(theItem.getID().toString());
  System.out.println("Doc Id: " + theItem + " Title: " + title);
}

Cheers,
Grant
