mahout-user mailing list archives

From Sean Owen <>
Subject Re: Recommendations from flat data
Date Thu, 30 Apr 2009 23:18:28 GMT
After digging in this evening, I think I have some answers.

First, can you use the very latest code from Subversion? Because the
DataModel you use has actually been removed and rolled into FileDataModel.

This is also because I checked in a change tonight that should cut down peak
memory usage while constructing a FileDataModel by a nontrivial amount.

I was able to run recommendations over 10M data points in 768M of memory.

It does take some time to parse and build the model. After that the
recommendation is nearly instantaneous with any similarity metric. Are you
sure Tanimoto was taking a longer time - meaning did you test over a lot of

Either way there are certainly some params you can tweak to trade a bit of
accuracy (maybe) for speed. Look at the sampling rate param on the user
neighborhood implementation. Set it to like 10% and it should get much
faster - of course this doesn't change startup overhead though.
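
To make the sampling idea concrete, here is a minimal sketch of how sampling candidates can cut the work of finding a user's nearest neighbors. This is not Mahout's neighborhood code: the class, method, and parameter names below are hypothetical, and the similarity function is a toy stand-in. The point is only that similarity gets computed for roughly samplingRate of the candidates, which is where the speedup (and the possible loss of accuracy) comes from.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.function.LongToDoubleFunction;

// Illustrative sketch only; not Mahout's neighborhood code. All names are hypothetical.
final class SampledNeighborhoodSketch {

    // Pairs a candidate user ID with its similarity to the target user.
    private static final class Scored {
        final long userId;
        final double similarity;

        Scored(long userId, double similarity) {
            this.userId = userId;
            this.similarity = similarity;
        }
    }

    // Returns up to n candidate IDs with the highest similarity to the target,
    // looking only at a random sample of the candidates.
    static List<Long> nearest(long[] candidates, LongToDoubleFunction similarityToTarget,
                              int n, double samplingRate, Random random) {
        List<Scored> scored = new ArrayList<>();
        for (long id : candidates) {
            // Similarity is computed only for candidates that pass the sampling check.
            if (samplingRate >= 1.0 || random.nextDouble() < samplingRate) {
                scored.add(new Scored(id, similarityToTarget.applyAsDouble(id)));
            }
        }
        // Most similar first.
        scored.sort(Comparator.comparingDouble((Scored s) -> s.similarity).reversed());
        List<Long> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, scored.size()); i++) {
            result.add(scored.get(i).userId);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] users = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        LongToDoubleFunction sim = id -> id / 10.0; // toy similarity, assumption for the demo
        // No sampling: every candidate is scored.
        System.out.println(nearest(users, sim, 3, 1.0, new Random(42))); // [10, 9, 8]
        // 10% sampling: far fewer similarity computations, possibly a worse neighborhood.
        System.out.println(nearest(users, sim, 3, 0.1, new Random(42)));
    }
}
```

As noted, sampling only affects the per-recommendation work; it does nothing for the time spent parsing and building the model.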

On Apr 30, 2009 7:52 PM, "Sean Owen" <> wrote:

Hm, something is off indeed. Tanimoto should be notably faster than a
cosine measure correlation -- it's doing a simple, optimized set
intersection and union rather than iterating over a bunch of
preference values. While 5M data points is going to consume a
reasonable amount of memory, I would not guess it would exhaust a 1GB
heap -- should be in the hundreds of megs.
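
For reference, a minimal sketch of what a Tanimoto computation boils down to, assuming each user is represented by the set of item IDs they have a preference for. This is not Mahout's implementation (the class and method names are hypothetical); it only shows why no preference values need to be touched: the coefficient is the size of the intersection divided by the size of the union.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; not Mahout's Tanimoto similarity class.
final class TanimotoSketch {

    // Tanimoto coefficient: |A intersect B| / |A union B| over the item IDs
    // each user has expressed a preference for. Preference values are ignored.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        // Iterate over the smaller set and probe the larger one.
        Set<Long> smaller = a.size() <= b.size() ? a : b;
        Set<Long> larger = (smaller == a) ? b : a;
        int intersection = 0;
        for (Long item : smaller) {
            if (larger.contains(item)) {
                intersection++;
            }
        }
        // The union size follows from the two set sizes and the intersection.
        int union = a.size() + b.size() - intersection;
        return (double) intersection / union;
    }

    public static void main(String[] args) {
        Set<Long> user1 = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> user2 = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        // Two shared items out of four distinct items.
        System.out.println(tanimoto(user1, user2)); // prints 0.5
    }
}
```

A cosine-style measure, by contrast, has to walk the preference values of the items both users share and accumulate products and squared sums.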

If you can run only the recommender in the JVM, obviously that frees
up memory. I would probably remove the caching wrapper too if memory
is at a premium, but that's not your problem. If you are running on a
64-bit machine in 64-bit mode, try 32-bit mode (-d32) to reduce the
object overhead in the JVM.

From there, you could load the data in a DB instead and use a
JDBC-based DataModel, since that doesn't load in memory. You could
also try adapting my NetflixDataModel which reads from data organized
in directories on disk.

But no, something just doesn't seem right; your current setup should be
OK. I think I need to try to replicate this with a similarly sized data
set and see what's up.

On Thu, Apr 30, 2009 at 5:48 PM, Paul Loy <> wrote:
> Hi Sean,
>
> that worked f...
