mahout-dev mailing list archives

From Peng Cheng <>
Subject Re: Regarding Online Recommenders
Date Thu, 18 Jul 2013 19:06:35 GMT
Strange, it's just a little bit larger than the libimseti dataset (17M 
ratings). Did you encounter an outOfMemory or GCTimeOut exception? 
Allocating more heap space usually helps.

Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:
> It was about 2.5M users and 500K items with 25M actions over 6 months of data.
> On Jul 18, 2013, at 10:15 AM, Peng Cheng <> wrote:
> If I remember right, a highlight of the 0.8 release is an online clustering algorithm. I'm
> not sure if it can be used in an item-based recommender, but this is definitely something I
> would like to pursue. It's probably the only advantage a non-hadoop implementation can offer
> in the future.
> Many non-hadoop recommenders are pretty fast. But the existing in-memory GenericDataModel
> and FileDataModel are largely implemented for sandboxes; IMHO they are the scalability bottleneck.
> May I ask about the scale of your dataset? How many ratings does it have?
> Yours Peng
> On 13-07-18 12:14 PM, Sebastian Schelter wrote:
>> Well, with itembased the only problem is new items. New users can
>> immediately be served by the model (although this is not well supported by
>> the API in Mahout). For the majority of use cases I saw, it is perfectly
>> fine to have a short delay until new items "enter" the recommender, usually
>> this happens after a retraining in batch. You have to care for cold-start
>> and collect some interactions anyway.
>> 2013/7/18 Pat Ferrel <>
>>> Yes, what Myrrix does is good.
>>> My last aside was a wish for an item-based online recommender not only
>>> factorized. Ted talks about using Solr for this, which we're experimenting
>>> with alongside Myrrix. I suspect Solr works but it does require a bit of
>>> tinkering and doesn't have quite the same set of options--no llr similarity
>>> for instance.
>>> On the same subject I recently attended a workshop in Seattle for UAI2013
>>> where Walmart reported similar results using a factorized recommender. They
>>> had to increase the factor number past where it would perform well. Along
>>> the way they saw increasing performance measuring precision offline. They
>>> eventually gave up on a factorized solution. This decision seems odd but
>>> anyway… Walmart's data set and ours are both quite diverse. The
>>> best idea is probably to create different recommenders for separate parts
>>> of the catalog but if you create one model on all items our intuition is
>>> that item-based works better than factorized. Again caveat--no A/B tests to
>>> support this yet.
>>> Doing an online item-based recommender would quickly run into scaling
>>> problems, no? We put together the simple Mahout in-memory version and it
>>> could not really handle more than a down-sampled few months of our data.
>>> Down-sampling lost us 20% of our precision scores so we moved to the hadoop
>>> version. Now we have use-cases for an online recommender that handles
>>> anonymous new users and that takes the story full circle.
>>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <> wrote:
>>> Hi Pat
>>> I think we should provide a simple support for recommending to anonymous
>>> users. We should have a method recommendToAnonymous() that takes a
>>> PreferenceArray as argument. For itembased recommenders, it's
>>> straightforward to compute recommendations, for userbased you have to
>>> search through all users once, for latent factor models, you have to fold
>>> the user vector into the low dimensional space.
>>> I think Sean already added this method in myrrix and I have some code for
>>> my kornakapi project (a simple weblayer for mahout).
>>> Would such a method fit your needs?
>>> Best,
>>> Sebastian
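The item-based case described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Mahout, Myrrix or kornakapi code: the class name `AnonymousItemBased`, the dense `sim` matrix and the plain arrays standing in for a PreferenceArray are all hypothetical. Each unseen item is scored as the similarity-weighted average of the anonymous user's preferences.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: item-based recommendations for a user who is not in the model. */
final class AnonymousItemBased {

    /**
     * sim     - precomputed item-item similarity matrix (assumed dense and in memory)
     * itemIds - items the anonymous user has interacted with
     * prefs   - preference values, parallel to itemIds
     * n       - maximum number of recommendations
     * Returns indices of up to n unseen items, best score first.
     */
    static List<Integer> recommendToAnonymous(double[][] sim, int[] itemIds, double[] prefs, int n) {
        int numItems = sim.length;
        boolean[] seen = new boolean[numItems];
        for (int id : itemIds) {
            seen[id] = true;
        }

        double[] scores = new double[numItems];
        List<Integer> candidates = new ArrayList<>();
        for (int c = 0; c < numItems; c++) {
            if (seen[c]) {
                continue; // never recommend what the user already has
            }
            double weighted = 0.0;
            double norm = 0.0;
            for (int k = 0; k < itemIds.length; k++) {
                double s = sim[c][itemIds[k]];
                weighted += s * prefs[k];
                norm += Math.abs(s);
            }
            if (norm > 0.0) { // skip items with no similarity to the user's history
                scores[c] = weighted / norm;
                candidates.add(c);
            }
        }
        candidates.sort(Comparator.comparingDouble((Integer c) -> scores[c]).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    public static void main(String[] args) {
        // Toy similarity matrix over four items (made-up values).
        double[][] sim = {
            {1.0, 0.9, 0.1, 0.1},
            {0.9, 1.0, 0.2, 0.0},
            {0.1, 0.2, 1.0, 0.4},
            {0.1, 0.0, 0.4, 1.0}
        };
        // The anonymous user liked item 0 (5.0) and disliked item 2 (1.0).
        System.out.println(recommendToAnonymous(sim, new int[] {0, 2}, new double[] {5.0, 1.0}, 2));
    }
}
```

No user ID is involved at any point, which is why the item-based case needs no model update to serve a new user.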
>>> 2013/7/17 Pat Ferrel <>
>>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>> I assume the latent factors model is calculated offline still in batch
>>>> mode, then there are periodic updates? How are the updates handled? Do you
>>>> plan to require batch model refactorization for any update? Or perform some
>>>> partial update by maybe just transforming new data into the LF space
>>>> already in place then doing full refactorization every so often in batch
>>>> mode?
>>>> By 'anonymous users' I mean users with some history that is not yet
>>>> incorporated in the LF model. This could be history from a new user asked
>>>> to pick a few items to start the rec process, or an old user with some new
>>>> action history not yet in the model. Are you going to allow for passing the
>>>> entire history vector or userID+incremental new history to the recommender?
>>>> I hope so.
>>>> For what it's worth we did a comparison of Mahout Item based CF to Mahout
>>>> ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of
>>>> data. The data was purchase data from a diverse ecom source with a large
>>>> variety of products from electronics to clothes. We found Item based CF did
>>>> far better than ALS. As we increased the number of latent factors the
>>>> results got better but were never within 10% of item based (we used MAP as
>>>> the offline metric). Not sure why but maybe it has to do with the diversity
>>>> of the item types.
>>>> I understand that a full item based online recommender has very different
>>>> tradeoffs and anyway others may not have seen this disparity of results.
>>>> Furthermore we don't have A/B test results yet to validate the offline
>>>> metric.
>>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <> wrote:
>>>> Peng,
>>>> This is the reason I separated out the DataModel, and only put the learner
>>>> stuff there. The learner I mentioned yesterday just stores the
>>>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>>>> where preferences are stored.
>>>> I, kind of, agree with the multi-level DataModel approach:
>>>> One for iterating over "all" preferences, one for if one wants to deploy a
>>>> recommender and perform a lot of top-N recommendation tasks.
>>>> (Or one DataModel with a strategy that might reduce existing memory
>>>> consumption, while still providing fast access, I am not sure. Let me try a
>>>> matrix-backed DataModel approach)
>>>> Gokhan
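To put a number on the parameter count mentioned above, here is a small worked example. The user and item counts are the 2.5M and 500K figures reported elsewhere in this thread. The choice of 50 latent factors and of 4-byte floats is purely an assumption for illustration, and `FactorModelSize` is a hypothetical name.

```java
/** Hypothetical sizing helper for a latent factor model's parameters. */
final class FactorModelSize {

    /** (noOfUsers + noOfItems) * noOfLatentFactors, as stated in the message above. */
    static long parameterCount(long noOfUsers, long noOfItems, int noOfLatentFactors) {
        return (noOfUsers + noOfItems) * noOfLatentFactors;
    }

    /** Memory needed if every parameter is stored as a 4-byte float (an assumption). */
    static long bytesAsFloats(long noOfUsers, long noOfItems, int noOfLatentFactors) {
        return parameterCount(noOfUsers, noOfItems, noOfLatentFactors) * Float.BYTES;
    }

    public static void main(String[] args) {
        long users = 2_500_000L; // from the thread
        long items = 500_000L;   // from the thread
        int factors = 50;        // assumed, not from the thread
        System.out.println(parameterCount(users, items, factors) + " parameters");
        System.out.printf("%.0f MB as floats%n", bytesAsFloats(users, items, factors) / 1e6);
    }
}
```

Under these assumptions the learner holds 150 million parameters, about 600 MB, regardless of how many preferences sit in the DataModel.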
>>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <> wrote:
>>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>>> representation, 12x more memory is a no-go. The techniques used in
>>>>> FactorizablePreferences allow a much more memory-efficient representation,
>>>>> tested on the KDD Music dataset, which is approx. 2.5 times Netflix and
>>>>> fits into 3GB with that approach.
>>>>> 2013/7/16 Ted Dunning <>
>>>>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>> Note also that this is before you have done any work.
>>>>>> Ideally, 100million observations should take << 1GB.
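A back-of-the-envelope check of the "<< 1GB" figure, as a sketch only. It assumes a CSR-style layout in primitive arrays (one int offset per user, one int item index and one byte value per rating) and roughly 480,000 users. The layout, the user count and the name `CompactRatingsEstimate` are assumptions for illustration, not how FactorizablePreferences or any Mahout DataModel is actually implemented.

```java
/** Hypothetical estimate of memory for a compact, primitive-array rating store. */
final class CompactRatingsEstimate {

    /**
     * Bytes for a CSR-style layout: (numUsers + 1) int offsets,
     * one int item index per rating, and one byte per rating value.
     */
    static long csrBytes(long numUsers, long numRatings) {
        long offsets = (numUsers + 1) * Integer.BYTES;
        long itemIndices = numRatings * Integer.BYTES;
        long values = numRatings * Byte.BYTES;
        return offsets + itemIndices + values;
    }

    public static void main(String[] args) {
        long numUsers = 480_000L;        // assumed user count
        long numRatings = 100_000_000L;  // the 100 million observations mentioned above
        System.out.printf("%.1f MB%n", csrBytes(numUsers, numRatings) / 1e6);
    }
}
```

With these assumptions the total is about 502 MB, and it is dominated by the 4-byte item indices; a per-entry object or hash map layout pays object and pointer overhead on top of that.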
>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <> wrote:
>>>>>>> The second idea is indeed splendid, we should separate time-complexity-first
>>>>>>> and space-complexity-first implementations. What I'm not sure about
>>>>>>> is whether we really need to create two interfaces instead of
>>>>>>> Personally, I think 12G heap space is not that high, right? Most laptops
>>>>>>> can already handle that (emphasis on laptop). And if we replace the map
>>>>>>> (the culprit of high memory consumption) with a list/linkedList, it would
>>>>>>> simply degrade the time complexity to O(n) for a linear search, not bad
>>>>>>> either. The current DataModel is the result of careful thought and has
>>>>>>> undergone extensive testing; it is easier to expand on top of it instead of
>>>>>>> subverting it.
