mahout-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Recommendations from flat data
Date Fri, 01 May 2009 04:22:14 GMT

Hello,

Some feedback from my Taste experience: Tanimoto was the bottleneck for me, too.  I used
the highly sophisticated kill -QUIT pid method to determine that.  The resulting thread
dumps always caught Taste in the Tanimoto part of the code.

Do you know, roughly, what that nontrivial amount might be? e.g. 10% or more?


Also, does the "nearly instantaneous" refer to calling Taste with a single recommend request
at a time?  I'm asking because I recently did some heavy-duty benchmarking, and things were
definitely not instantaneous once I increased the number of concurrent requests.  To make
things fast (e.g. under 100 ms avg.) and run in a reasonable amount of memory, I had to resort
to removing noise users and items from the input before reading the data model... which means users
who look like noise to the system (and that's a lot of them, in order to keep things fast and
limit memory usage) will not get recommendations.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Sean Owen <srowen@gmail.com>
> To: mahout-user@lucene.apache.org
> Sent: Thursday, April 30, 2009 7:18:28 PM
> Subject: Re: Recommendations from flat data
> 
> After digging in this evening I have some answers I think.
> 
> First, can you use the very latest code from Subversion? Because the
> DataModel you use has actually been removed and rolled into FileDataModel.
> 
> That's also worth doing because I checked in a change tonight that should cut
> peak memory usage while constructing a FileDataModel by a nontrivial amount.
> 
> I was able to run recommendations over 10M data points in 768M of memory
> tonight.
> 
> It does take some time to parse and build the model. After that, a
> recommendation is nearly instantaneous with any similarity metric. Are you
> sure Tanimoto was taking longer - meaning, did you test over a lot of
> recommendations?
> 
> Either way, there are certainly some params you can tweak to trade a bit of
> accuracy (maybe) for speed. Look at the sampling-rate param on the user
> neighborhood implementation. Set it to something like 10% and it should get
> much faster - of course, this doesn't change the startup overhead.
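[Editor's note: the sampling idea above amounts to scoring only a random fraction of candidate users before building the neighborhood. A minimal standalone sketch of that trade-off; the class and method names here are illustrative, not Taste's actual API (in Taste itself, the sampling rate is a constructor parameter on the user-neighborhood implementation):]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class NeighborhoodSampling {

    // Keep roughly samplingRate of the candidate users. Fewer users means
    // fewer similarity computations per request, at a (maybe) small cost
    // in recommendation accuracy.
    static List<Long> sampleUsers(List<Long> userIDs, double samplingRate, long seed) {
        Random r = new Random(seed);
        List<Long> kept = new ArrayList<>();
        for (Long id : userIDs) {
            if (r.nextDouble() < samplingRate) {
                kept.add(id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> ids = List.of(1L, 2L, 3L, 4L, 5L);
        // samplingRate = 1.0 keeps everyone; 0.1 would keep ~10% on average.
        System.out.println(sampleUsers(ids, 1.0, 42L).size());
    }
}
```

With a rate of 0.1 you compute similarities against only about a tenth of the users per request, which is why it speeds up recommendation time but does nothing for model-construction (startup) cost.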
> 
> On Apr 30, 2009 7:52 PM, "Sean Owen" wrote:
> 
> Hm, something is off indeed. Tanimoto should be notably faster than a
> cosine-measure correlation -- it's doing a simple, optimized set
> intersection and union rather than iterating over a bunch of
> preference values. While 5M data points will consume a
> reasonable amount of memory, I would not expect it to exhaust a 1GB
> heap -- it should be in the hundreds of megs.
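[Editor's note: for reference, the Tanimoto coefficient Sean describes is just |A ∩ B| / |A ∪ B| over the two users' item-ID sets, which is why it avoids touching preference values entirely. A minimal standalone sketch; the class name is illustrative and this is not Taste's optimized implementation:]

```java
import java.util.HashSet;
import java.util.Set;

public class TanimotoSketch {

    // Tanimoto coefficient: |A ∩ B| / |A ∪ B| over item-ID sets.
    // No preference values are read - only set membership matters.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        Set<Long> u1 = Set.of(1L, 2L, 3L, 4L);
        Set<Long> u2 = Set.of(3L, 4L, 5L);
        // intersection = {3, 4} (size 2), union size = 5
        System.out.println(tanimoto(u1, u2)); // 0.4
    }
}
```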
> 
> If you can run only the recommender in the JVM, obviously that frees
> up memory. I would probably remove the caching wrapper too if memory
> is at a premium, but that's not your problem. If you are running on a
> 64-bit machine in 64-bit mode, try 32-bit mode (-d32) to reduce the
> object overhead in the JVM.
> 
> From there, you could load the data in a DB instead and use a
> JDBC-based DataModel, since that doesn't load in memory. You could
> also try adapting my NetflixDataModel which reads from data organized
> in directories on disk.
> 
> But no, something just doesn't seem right; your current setup should be
> OK.  I think I need to try to replicate this with a similarly sized data
> set and see what's up.
> 
> On Thu, Apr 30, 2009 at 5:48 PM, Paul Loy wrote: > Hi
> Sean, > > that worked f...

