mahout-user mailing list archives

From Thomas Rewig <>
Subject Re: Memory and Speed Questions for Item-Based-Recommender
Date Fri, 10 Jul 2009 12:18:39 GMT
Thank you for your fast reply!

Sean Owen:
> On Fri, Jul 10, 2009 at 10:03 AM, Thomas Rewig<> wrote:
>>     Question 1:
>>     The similarity matrix uses 400 MB of memory in the MySQL DB, but
>>     loading it as a GenericItemSimilarity via the ItemCorrelation
>>     takes 8 GB of RAM. Is it possible/plausible that this matrix uses
>>     more than 20 times as much memory in RAM as in the database, or
>>     have I done something wrong?
> I could believe this. 100,000 items means about 5,000,000,000
> item-item pairs are possible. Many are not kept, but seeing as each
> one requires 30 or so bytes of memory, I am not surprised that it
> could take 8GB.
> That's really a lot to keep in memory. I might suggest, instead, that
> you not pre-compute the similarities, but instead compute them as
> needed and cache (use CachingItemSimilarity). That way you are not
> spending so much memory on pairs that may never get used, but still
> get much of the speed improvement.
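
The estimate above can be sanity-checked with quick arithmetic (assuming, as Sean does, roughly 30 bytes per retained pair):

```java
// Back-of-envelope check of the memory estimate above.
public class PairMemoryEstimate {
    public static void main(String[] args) {
        long items = 100000L;
        long maxPairs = items * (items - 1) / 2;       // all unordered item-item pairs
        long bytesPerPair = 30L;                       // rough in-memory cost per kept pair
        long heapBytes = 8L * 1024 * 1024 * 1024;      // 8 GB heap
        long pairsThatFit = heapBytes / bytesPerPair;
        System.out.println(maxPairs);                  // 4999950000, i.e. ~5 billion
        System.out.println(pairsThatFit);              // 286331153, i.e. ~286 million
        // so an 8 GB heap holds only about 5.7% of all possible pairs
    }
}
```

Keeping every pair would need roughly 150 GB at that rate, which is why pruning or lazy caching matters here.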
At the moment, to get the similarity matrix I do this:

    * create a DataModel (MySQLDB) in this way: aItem,
      aItemCharacteristic, aItemValue (each aItem has 40
      aItemCharacteristics; later there will be more)
    * set a UserSimilarity - Pearson or Euclidean
    * compute all similarities in a multithreaded way: aCorrelation =
      aUserSimilarity.userSimilarity(user1, user2); - this is stressful
      for the CPU, but it is done in 4 hours - not bad for
      n!/(2!(n-2)!) combinations ;-)
    * save the pairs that correlate by more than 0.95
    * load them into a GenericItemSimilarity to use in an item-based
      recommender
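
The precomputation step above can be sketched self-contained; here cosine over plain feature vectors stands in for the Taste DataModel and Pearson similarity, and the multithreading uses a plain fixed pool:

```java
import java.util.Map;
import java.util.concurrent.*;

public class PrecomputeSimilarities {

    // Stand-in for the Taste similarity: cosine over an item's characteristic vector.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Compute all n*(n-1)/2 unordered pairs on a fixed thread pool and keep
    // only those above the threshold; (i, j) is packed into one long key.
    static Map<Long, Double> precompute(double[][] items, double threshold, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<Long, Double> kept = new ConcurrentHashMap<>();
        for (int i = 0; i < items.length; i++) {
            final int row = i;
            pool.execute(() -> {
                for (int j = row + 1; j < items.length; j++) {
                    double sim = cosine(items[row], items[j]);
                    if (sim > threshold) {
                        kept.put(((long) row << 32) | j, sim);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return kept;
    }
}
```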

OK, I will test with the CachingUserSimilarity. If I understand you
right, this means I:

    * create a DataModel_1 (MySQLDB) in this way: aItem,
      aItemCharacteristic, aItemValue (each aItem has 40
      aItemCharacteristics)
    * create a UserSimilarity so that I get the similarity of the
      aItems (if I used an ItemSimilarity I would get the similarity of
      the aItemCharacteristics ... right?)
    * create a CachingUserSimilarity and put DataModel_1 and the
      UserSimilarity in there
    * create a DataModel_2 (MySQLDB) in this way:
    * create the Neighborhood
    * create a UserBasedRecommender and put the Neighborhood,
      DataModel_2 and the CachingUserSimilarity in there
    * create a CachingRecommender
    * et voilà :-) I have a working, memory-sparing recommender
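
The caching step in this list comes down to computing each pair lazily and memoizing it; a minimal stdlib sketch of that idea (an illustration of the principle only, not Mahout's CachingUserSimilarity):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

// Wraps an expensive pairwise similarity and memoizes results on first use,
// so memory is only spent on pairs that are actually requested.
public class MemoizingSimilarity {

    private final BiFunction<Long, Long, Double> delegate;
    private final Map<Long, Double> cache = new ConcurrentHashMap<>();

    public MemoizingSimilarity(BiFunction<Long, Long, Double> delegate) {
        this.delegate = delegate;
    }

    public double similarity(long id1, long id2) {
        // Order the pair so (a, b) and (b, a) share one cache entry.
        long lo = Math.min(id1, id2), hi = Math.max(id1, id2);
        long key = (lo << 32) | hi;
        return cache.computeIfAbsent(key, k -> delegate.apply(lo, hi));
    }

    public int cachedPairs() {
        return cache.size();
    }
}
```

This is why the cached variant spends memory only on pairs that recommendations actually touch, instead of all ~5 billion.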

But I can't do that with an item-based recommender, because I have no
ItemCorrelation (since the similarity of the aItemCharacteristics
doesn't matter), is that right? So the sentence in the docs - "So,
item-based recommenders can use pre-computed similarity values in the
computations, which make them much faster. For large data sets,
item-based recommenders are more appropriate" - doesn't apply to me.
Or does it?

At the moment I have a test set of 500,000 users and 100,000 items. The
item similarity is computed with Taste, but from external data.

Sean Owen:
>>   Question:
>>   Is there a way to increase the speed of a recommendation? (Use
>>   InnoDB? Compute fewer items? ... somehow ;-) ...?)
> Your indexes are right. Are you using a connection pool? That is
> really important.
Yes, I do use a connection pool:

        // wrap the plain DataSource in Taste's pooling wrapper; the pooled
        // DataSource itself is what the JDBC DataModel should receive
        this.cPoolDS = new ConnectionPoolDataSource(dataSource);
        this.aConnection = cPoolDS.getConnection();

Sean Owen:
> How many users do you have? if you have relatively few users, you
> might use a user-based recommender instead. Or, consider a slope-one
> recommender.
At the moment there are 5 times more users than items - later this
could change to 1.5 million items and 150,000 users, but first my tests
must work. I tested the slope-one recommender back when Taste wasn't
yet in Mahout, and I found that the recommendations didn't work for me.
Has something changed there? ... Maybe I should give it another try.
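
For reference, the weighted slope-one scheme Sean mentions is simple enough to sketch from scratch (a toy stdlib version, not Mahout's implementation):

```java
import java.util.*;

// Toy weighted slope-one: learn the average rating difference between item
// pairs, then predict an unseen item from a user's known ratings.
public class SlopeOne {

    // diffs.get(i).get(j) = [sum of (rating_i - rating_j), co-rating count]
    private final Map<Long, Map<Long, double[]>> diffs = new HashMap<>();

    public void train(Collection<Map<Long, Double>> allUserRatings) {
        for (Map<Long, Double> ratings : allUserRatings) {
            for (Map.Entry<Long, Double> e1 : ratings.entrySet()) {
                for (Map.Entry<Long, Double> e2 : ratings.entrySet()) {
                    if (e1.getKey().equals(e2.getKey())) continue;
                    double[] sc = diffs
                            .computeIfAbsent(e1.getKey(), k -> new HashMap<>())
                            .computeIfAbsent(e2.getKey(), k -> new double[2]);
                    sc[0] += e1.getValue() - e2.getValue();
                    sc[1] += 1;
                }
            }
        }
    }

    // Weighted average of (user's rating + average diff), weighted by how
    // many users co-rated the two items.
    public double predict(Map<Long, Double> userRatings, long item) {
        Map<Long, double[]> row = diffs.get(item);
        if (row == null) return Double.NaN;
        double num = 0, den = 0;
        for (Map.Entry<Long, Double> e : userRatings.entrySet()) {
            double[] sc = row.get(e.getKey());
            if (sc == null) continue;
            num += (e.getValue() + sc[0] / sc[1]) * sc[1];
            den += sc[1];
        }
        return den == 0 ? Double.NaN : num / den;
    }
}
```

The diff table is per item pair, so its memory grows like the item count squared in the worst case, which is the same trade-off discussed above.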

Sean Owen:
> It sounds like you have a lot of items, so the way item-based
> recommenders work, it will be slow.
> Using CachingItemSimilarity could help. I am surprised that a
> FileDataModel isn't much faster, since it loads data in memory. That
> suggests to me that the database isn't the bottleneck.
> Are you using multiple threads to compute recommendations
> simultaneously? you certainly can, to take advantage of the 4 cores.
Yes I do, but internally in Taste every .recommend() call runs on a
single thread. Is that right?
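
Since each call is serial, throughput comes from fanning user requests out over a fixed pool; a self-contained sketch (the Recommender interface below is a stand-in, not Taste's):

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelRecommend {

    // Stand-in for a recommender whose recommend() call is internally serial.
    interface Recommender {
        List<Long> recommend(long userID, int howMany);
    }

    // Run one recommend() call per user concurrently on a fixed pool, so
    // 4 cores can serve 4 users at once even though each call is serial.
    static Map<Long, List<Long>> recommendAll(Recommender rec, List<Long> userIDs,
                                              int howMany, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<Long, Future<List<Long>>> futures = new LinkedHashMap<>();
        for (long id : userIDs) {
            futures.put(id, pool.submit(() -> rec.recommend(id, howMany)));
        }
        Map<Long, List<Long>> results = new LinkedHashMap<>();
        for (Map.Entry<Long, Future<List<Long>>> e : futures.entrySet()) {
            results.put(e.getKey(), e.getValue().get());
        }
        pool.shutdown();
        return results;
    }
}
```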

best regards
Thomas Rewig
