mahout-user mailing list archives

From Bill McCormick <billmc...@gmail.com>
Subject Scalability of ParallelALSFactorizationJob with implicit feedback
Date Mon, 11 Jun 2012 18:42:45 GMT
Hi all,

We're interested in using Mahout for a recommendation system for a largish
online storefront.

The initial recommendations are based on download/purchase history, so we
were trying out the ParallelALSFactorizationJob, which seems to give good
results.
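
For reference, we're driving the job roughly like the following. Treat this
as a sketch of our setup rather than exact code: the paths and parameter
values are placeholders, and option names may differ slightly between Mahout
versions.

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

public class RunFactorization {
  public static void main(String[] args) throws Exception {
    // Input is userID,itemID,rating triples; paths and values are placeholders.
    ToolRunner.run(new ParallelALSFactorizationJob(), new String[] {
        "--input", "/data/downloads.csv",
        "--output", "/output/als",
        "--numFeatures", "20",
        "--numIterations", "10",
        "--lambda", "0.065",
        "--implicitFeedback", "true",
        "--alpha", "40"
    });
  }
}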

The initial test run was limited to 100,000 users and the job ran with no
problems.

The next test set was structured differently: around 4M download records and
around 1.5M users (rather than a fixed number of users, it was the set of
downloads over a fixed period of time). On this job the Hadoop tasks hung in
garbage collection.

I started looking at memory usage, and I noticed that the existing
implementation attempts to compute the product of the user factor matrix
transpose with itself in memory.  (It also looks like it does this on every
mapper, instead of once per iteration.)
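
To make that concrete, the shape of the computation as I read it is roughly
the following (a sketch using Mahout's math classes with my own variable
names, not the actual job code):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class GramianFootprint {
  public static void main(String[] args) {
    int numUsers = 1500000;   // roughly our second test set
    int numFeatures = 20;

    // U is the full user factor matrix: numUsers x numFeatures, dense.
    // Materializing it like this on every mapper is what blows up as
    // numUsers grows.
    Matrix u = new DenseMatrix(numUsers, numFeatures);

    // The product itself is only numFeatures x numFeatures (20 x 20 here);
    // it's holding all of U at once that costs the memory.
    Matrix gram = u.transpose().times(u);

    System.out.println("Result is " + numFeatures + " x " + numFeatures
        + ", but U alone needed ~"
        + (8L * numUsers * numFeatures / (1 << 20)) + " MB");
  }
}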

Our full data set has on the order of 100M users, so this isn't going to
work as is (the user factor matrix alone would be 100M users x 20 factors
x 8 bytes per entry = 16 GB).

I'm pondering implementing a new version that does the large matrix
computations in a less memory-intensive fashion; a rough sketch of what I
have in mind is below, after the questions. Before I go too far, I was
hoping this list could provide some input:

- is my analysis correct?
- is someone already working on this?
- if we go ahead with this, is the Mahout project interested in accepting
the new implementation once it's done?
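
To make the last question more concrete, here is the kind of thing I have in
mind: stream over the rows of the user factor matrix and accumulate the
numFeatures x numFeatures product as a running sum of rank-one updates, so a
task only ever needs memory for the small result rather than the full
numUsers x numFeatures matrix. The class, method, and row iterator below are
my own names, not existing Mahout code:

import java.util.Iterator;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

public class StreamingGramian {

  /**
   * Accumulates U' * U one row at a time. Only the numFeatures x numFeatures
   * result is kept in memory, so the footprint is independent of the number
   * of users.
   */
  public static Matrix gramian(Iterator<Vector> userFactorRows, int numFeatures) {
    Matrix gram = new DenseMatrix(numFeatures, numFeatures);
    while (userFactorRows.hasNext()) {
      Vector row = userFactorRows.next();
      // rank-one update: gram += row * row'
      for (int i = 0; i < numFeatures; i++) {
        double ri = row.getQuick(i);
        for (int j = 0; j < numFeatures; j++) {
          gram.setQuick(i, j, gram.getQuick(i, j) + ri * row.getQuick(j));
        }
      }
    }
    return gram;
  }
}

Each mapper would compute a partial sum like this over its slice of users,
and a combine/reduce step would add the 20 x 20 partials, so the full factor
matrix never has to be resident on any one node.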

Thank you very much.

-- 
Bill McCormick
