mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Mahout performance issues
Date Thu, 01 Dec 2011 18:10:50 GMT
If I remember correctly, you have 12M users and 18M interactions.

If I interpret the plots correctly there is one single item that
accounts for 8.5M interactions (nearly half of the overall interactions)
and more than two thirds of the users like it?

If that is true, this item will co-occur with virtually every other
item in the dataset, ruining the runtime as you will have to estimate
the preference for every item each time you compute recommendations.

Normally the sampling done by SamplingCandidateItemsStrategy should hit
such 'top-sellers' harder than the rest and therefore mitigate the
impact of them on the runtime, but I guess your dataset has so few
per-user interactions overall that the sampling doesn't really help here.
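
To illustrate what "hitting top-sellers harder" means, here is a minimal
sketch of capped sampling. This is not Mahout's actual code; the class and
method names are made up for illustration. An item with few users keeps all
of them, while an item with millions of users is cut down to the cap, so the
popular item loses by far the largest fraction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout.
final class CappedSampling {

    /**
     * Returns at most maxSize elements of ids, chosen uniformly at random.
     * Lists at or below the cap are returned unchanged (as a copy).
     */
    static List<Long> sampleAtMost(List<Long> ids, int maxSize, Random random) {
        if (ids.size() <= maxSize) {
            return new ArrayList<>(ids);
        }
        List<Long> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, maxSize));
    }

    /** Fraction of the original list that survives the cap. */
    static double keptFraction(int originalSize, int maxSize) {
        if (originalSize == 0) {
            return 1.0;
        }
        return Math.min(1.0, (double) maxSize / originalSize);
    }
}
```

With a cap of 100, an item with 3 users keeps all of them, while an item
with 8.5M users keeps only a tiny fraction. The cap only limits how many
users are looked at per item, though; it does not by itself stop the top
item from showing up in almost every user's neighbourhood.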

This top item is also of no real value as everybody seems to already
know it and was able to find it. You can't really learn a lot from an
item that everybody likes.

Can you check my findings and try to simply throw the item away?
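
One way to try this is to filter the raw interactions before the data model
is built. The sketch below is only an assumption about how the data might be
held: each interaction is a {userID, itemID} pair, and the class name, method
names, and the share threshold are hypothetical and not part of Mahout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-processing helper, not part of Mahout.
final class DominantItemFilter {

    /**
     * Finds items whose share of all interactions exceeds maxShare.
     * Each interaction is a {userID, itemID} pair.
     */
    static Set<Long> findDominantItems(List<long[]> interactions, double maxShare) {
        Map<Long, Integer> countsPerItem = new HashMap<>();
        for (long[] pair : interactions) {
            countsPerItem.merge(pair[1], 1, Integer::sum);
        }
        double limit = maxShare * interactions.size();
        Set<Long> dominant = new HashSet<>();
        for (Map.Entry<Long, Integer> entry : countsPerItem.entrySet()) {
            if (entry.getValue() > limit) {
                dominant.add(entry.getKey());
            }
        }
        return dominant;
    }

    /** Returns the interactions that do not involve any of the given items. */
    static List<long[]> removeItems(List<long[]> interactions, Set<Long> itemsToDrop) {
        List<long[]> kept = new ArrayList<>();
        for (long[] pair : interactions) {
            if (!itemsToDrop.contains(pair[1])) {
                kept.add(pair);
            }
        }
        return kept;
    }
}
```

With the numbers above (8.5M of 18M interactions, about 47%), a threshold
such as 0.4 would single out the top item. Users whose only interaction was
with that item end up with no data after filtering, which is worth keeping
in mind when comparing results.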

--sebastian



On 01.12.2011 16:16, Sebastian Schelter wrote:

> 
> --sebastian
> 
> On 01.12.2011 16:12, Sean Owen wrote:
>> You can 'tickle' the cache asynchronously if you like.
>>
>> I am still not clear on why you are doing so many item-item similarity
>> calculations. Your change ought to let you do 1, or 10, or 100 per
>> calculation if you like. That, we know, is fast. And a few hundred
>> similarities should start to give reasonable recommendations.
>>
>> What is preventing you from making this tradeoff (with your change)?
>> Yes, it is essential for reasonable performance.
>>
>> On Thu, Dec 1, 2011 at 3:06 PM, Daniel Zohar <dissoman@gmail.com> wrote:
>>
>>> Hi Manuel,
>>> I haven't got to the point where CachingItemSimilarity kicks in. That is, I
>>> will have to run a lot of recommendations in order to get a real benefit
>>> from it. I would first like to optimize the 'cold start' so that it at least
>>> serves in a reasonable time. Usually a cache is used to prevent repeated
>>> calculations, but personally I don't think it's a replacement for optimized
>>> performance. Don't you agree?
>>>
>>> Also, I will try to profile the app now as you suggest and send the results
>>> asap.
>>>
>>> Thanks!
>>>
>>> On Thu, Dec 1, 2011 at 4:56 PM, Manuel Blechschmidt <
>>> Manuel.Blechschmidt@gmx.de> wrote:
>>>
>>>> Hi Daniel,
>>>> actually you are running the profile inside Tomcat. You should take a
>>>> snapshot and then drill down to the functions where the actual
>>>> recommendation takes place. The current screenshots also contain some
>>>> profiles from Tomcat threads which are sleeping a lot and therefore
>>>> taking a lot of time.
>>>>
>>>> Further, the screenshots do not show how often the different functions
>>>> are called.
>>>>
>>>> You have to profile multiple requests on their own. The
>>>> CachingItemSimilarity gets filled, therefore it should go faster and faster.
>>>>
>>>> On 01.12.2011, at 15:11, Daniel Zohar wrote:
>>>>
>>>>> @Manuel thanks for the tips. I have installed VisualVM and the results
>>>>> follow.
>>>>> I did two samplings:
>>>>> - With the optimized SamplingCandidateItemsStrategy (
>>>>> http://pastebin.com/6n9C8Pw1):
>>>>> http://static.inky.ws/image/934/image.jpg
>>>>> - Without the optimized SamplingCandidateItemsStrategy:
>>>>> http://static.inky.ws/image/935/image.jpg
>>>>>
>>>>
>>>> The big hot spot is the function FastIDSet.find():
>>>>
>>>> Optimized: 13,759 s
>>>> Unoptimized: 246,487 s
>>>>
>>>> So you see that your optimization already got you a speedup of roughly
>>>> 18x (246,487 / 13,759 is about 17.9).
>>>>
>>>> Did you play around with the CachingItemSimilarity cache sizes?
>>>>
>>>> /Manuel
>>>>
>>>> --
>>>> Manuel Blechschmidt
>>>> Dortustr. 57
>>>> 14467 Potsdam
>>>> Mobil: 0173/6322621
>>>> Twitter: http://twitter.com/Manuel_B
>>>>
>>>>
>>>
>>
> 

