mahout-user mailing list archives

From Otis Gospodnetic <>
Subject Re: Trimming Taste input (memory consumption)
Date Wed, 22 Oct 2008 17:43:44 GMT

In hopes of getting some feedback about possible improvements (e.g. which data to keep, which
to trim, how far back to go, etc.), here are some numbers I'm working with:

# number of unique users: $ cut -d, -f1 input.txt | sort | uniq | wc -l

# number of unique items: $ cut -d, -f2 input.txt | sort | uniq | wc -l

# total number of data points (the "user,item,1.0" triplets): $ wc -l input.txt 
1664289 input.txt

Each triplet represents one user->item view.  Here is the cumulative view count
by item popularity rank:

Top items   Views
1-10        98485
1-100       118047
1-200       119223
1-1000      120100

This means the top 10 most popular items account for 98485 views, and so on.  So the top 100
items account for the vast majority of views.
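
For anyone who wants to reproduce this kind of cumulative count, here is a rough sketch built on the same cut/sort/uniq pipeline. The function name top_n_views is made up for illustration, and it assumes the file holds "user,item,1.0" lines with no commas or spaces inside the item IDs.

```shell
# Hypothetical helper: total views captured by the N most-viewed items.
# $1 = input file of "user,item,rating" lines, $2 = N
top_n_views() {
  cut -d, -f2 "$1" |        # keep only the item column
    sort | uniq -c |        # count views per item
    sort -rn |              # most-viewed items first
    head -n "$2" |          # keep the top N items
    awk '{ sum += $1 } END { print sum + 0 }'   # add up their view counts
}
```

Something like `top_n_views input.txt 100` would then print the number of views covered by the top 100 items.
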
I'm working with about 1 day's worth of data.  I think this is a problem, because it doesn't
give me info about user->item views from before, and I think that translates to losing
some user-user overlap data to compute better recommendations.  Is this correct?

I'm dealing with news, so the least popular items for a given day seem to really be old news
items (they are past their prime, so to speak).

Because I don't want to recommend old news, I *think* I can chop off some of the tail at some(?)
expense of quality.
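
One way to chop the tail, as a sketch only: drop every triplet whose item has fewer than some minimum number of views. The name trim_tail and the threshold are assumptions for illustration; the right cutoff is exactly the open question here, and this uses view count as a stand-in for item age.

```shell
# Hypothetical helper: keep only triplets whose item was viewed at least MIN times.
# $1 = input file of "user,item,rating" lines, $2 = MIN
# The file is read twice: once to count views per item, once to filter.
trim_tail() {
  awk -F, -v min="$2" '
    NR == FNR { views[$2]++; next }   # first pass: count views per item
    views[$2] >= min                  # second pass: print lines for items at or above MIN
  ' "$1" "$1"
}
```

For example, `trim_tail input.txt 2 > trimmed.txt` would remove the items that were read only once.
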
Now that I see the distribution of items more clearly, I am also wondering if feeding the
most popular items into the recommendation engine is really valuable.  Items are very popular
because lots of people consumed them.  This produces a lot of overlap between users, which
is good, but maybe it's too good for its own good (kind of like the Harry Potter problem)?
I wonder if it would make sense not to include (and thus not recommend) the most popular
items?  Hm, doesn't sound right, because of my 705K users only about 98K have seen the top
10 items already.  But would it make sense to artificially lower their rating, to put a damper
on them?
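
To make the damping idea concrete, here is a minimal sketch that rewrites each rating as 1 / (1 + ln(views of that item)), so an item seen once keeps 1.0 and heavily viewed items get progressively lower values. The formula and the name damp_popular are my own assumptions, not anything Taste provides, and whether this helps would depend on the similarity measure used and would need to be evaluated.

```shell
# Hypothetical helper: replace each rating with a popularity-damped value.
# $1 = input file of "user,item,rating" lines
# Damped rating = 1 / (1 + ln(views)), an illustrative formula only.
damp_popular() {
  LC_ALL=C awk -F, '
    NR == FNR { views[$2]++; next }   # first pass: count views per item
    { printf "%s,%s,%.4f\n", $1, $2, 1 / (1 + log(views[$2])) }   # second pass: rewrite rating
  ' "$1" "$1"
}
```

Running `damp_popular input.txt > damped.txt` would produce a file in the same "user,item,rating" layout.
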

I'm thinking out loud, so any thoughts and feedback would be appreciated.

Sematext -- -- Lucene - Solr - Nutch

----- Original Message ----
> From: Otis Gospodnetic <>
> To:
> Sent: Wednesday, October 22, 2008 12:52:41 PM
> Subject: Trimming Taste input (memory consumption)
> Hi,
> I've finally fed Taste some real data (in terms of volume, users, and item 
> preference distribution) and quickly hit the memory limits of my development 
> laptop. :).  Now I'm trying to see what, if anything, I can trim from the input 
> set (the user,item,rating triplets) to lower the memory consumption. N.b. I 
> don't actually have rating information - my ratings are all just "1.0" 
> indicating that the item has been seen/read/consumed.
> I ran one of these to see the item popularity distribution:
> $ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | less
> And quickly saw the expected Zipfian distribution.  Big head of several very 
> popular items and a loooong tail of items that have been seen/read/consumed only 
> a few times.
> So here are my questions:
> - Is there a point in keeping and loading very unpopular items (e.g.
> the ones read only once)?  I think keeping those might help very few
> people discover very obscure items, so removing them will hurt this
> small subset of people a bit, but this will not affect the majority of
> people.  Is this thinking correct?
> - I'm dealing with items where their freshness counts.  I don't want to 
> recommend items older than N days - think news stories.  Assume I have the age 
> of each item.  I could certainly then remove old items as I don't ever want to 
> recommend them, but if I remove them, won't that hurt the quality of 
> recommendations, simply because I'll lose users' "item consumption history"?
> Thanks,
> Otis
> --
> Sematext -- -- Lucene - Solr - Nutch
