mahout-user mailing list archives

From: Sean Owen <>
Subject: Re: Running Taste Web example without the webserver
Date: Fri, 24 Jul 2009 09:11:46 GMT
Ah yeah, I thought it might be a spam filter issue.

Yeah, unfortunately the current code uses features only in Hadoop 0.20.
You could run Hadoop 0.20 locally, upgrade your cluster (it can run
older-style jobs, I believe), or else roll back the code in the
package that touches Hadoop by one revision. That previous revision
should work on 0.18.3, I believe.

Hundreds of millions of users is big indeed. Sounds like you have way
more users than items. This tells me that any user-based algorithm is
probably out of the question. The model certainly can't be loaded into
memory on one machine. We could work on ways to compute all pairs of
similarities in a distributed way, but that's trillions of
similarities, even after filtering out some unnecessary work.
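
To put rough numbers on that, using the user counts from your message
below, the all-pairs count is n * (n - 1) / 2:

    // Rough count of user-user similarity pairs at your scale.
    public class UserPairCount {
      public static void main(String[] args) {
        long[] userCounts = {70000000L, 500000000L}; // 70M now, 500M later
        for (long n : userCounts) {
          double pairs = n * (n - 1) / 2.0;
          System.out.printf("%,d users -> %.1e user-user pairs%n", n, pairs);
        }
      }
    }

That's about 2.4e15 pairs at 70M users; even if filtering let you skip
999 out of every 1,000 pairs, you'd still be computing trillions of
similarities.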

Item-based recommenders are more realistic. It would still take a long
time to compute item-item similarities given the number of users you
have, but at least you're only computing thousands to millions of such
similarities. Grant is right -- perhaps you can use approaches
unrelated to the preference data to compute item-item similarity.
Given a fixed set of item-item similarities, it is fast to compute
recommendations for any one user. It doesn't require loading the model
into memory. Hence, you could then use the pseudo-distributed Hadoop
framework I've pointed out to spread these computations for each user
across many machines.
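
As a minimal sketch of what I mean, with the classes in Taste's current
packages (exact signatures may differ in your version, and the data
file name is made up):

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    // Recommends from a fixed, precomputed item-item similarity table;
    // nothing like a full user-user model is held in memory.
    public class ItemBasedSketch {
      public static void main(String[] args) throws Exception {
        // preferences.csv holds userID,itemID,value rows (illustrative name)
        DataModel model = new FileDataModel(new File("preferences.csv"));

        // Precomputed similarities, e.g. derived from item attributes as
        // Grant suggested; these two entries are just illustrative.
        GenericItemSimilarity similarity =
            new GenericItemSimilarity(Arrays.asList(
                new GenericItemSimilarity.ItemItemSimilarity(1L, 2L, 0.8),
                new GenericItemSimilarity.ItemItemSimilarity(1L, 3L, 0.3)));

        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);

        // Fast per-user step, given the fixed similarity table.
        List<RecommendedItem> recs = recommender.recommend(123L, 10);
        for (RecommendedItem rec : recs) {
          System.out.println(rec.getItemID() + " : " + rec.getValue());
        }
      }
    }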

For this, you can test locally for sure. One machine can process
recommendations just fine, given a fixed set of item-item similarities
and an item-based recommender. Heck, you don't even need Hadoop for
this. I would see how well the recommendations work first, before
figuring out Hadoop.
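
For example, Taste's evaluator can do that hold-out test on one
machine. A sketch, again assuming the current class names; the Pearson
similarity here is just a stand-in for whatever item-item similarity
you end up using:

    import java.io.File;

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    // Trains on 90% of the preference data and scores the recommender's
    // estimates against the held-out 10%.
    public class LocalEvalSketch {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("preferences.csv"));
        RecommenderEvaluator evaluator =
            new AverageAbsoluteDifferenceRecommenderEvaluator();
        RecommenderBuilder builder = new RecommenderBuilder() {
          public Recommender buildRecommender(DataModel model)
              throws TasteException {
            // Swap in your item-based recommender and similarity here.
            return new GenericItemBasedRecommender(
                model, new PearsonCorrelationSimilarity(model));
          }
        };
        double score = evaluator.evaluate(builder, null, model, 0.9, 1.0);
        System.out.println("Average absolute difference: " + score);
      }
    }

Lower is better: the score is the average gap between estimated and
actual preference values on the held-out data.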

There are also slope-one algorithms. I think they give good results,
and they are going to behave much like item-based recommenders in this
case. Slope-one requires precomputing a large matrix data structure
(there is a separate Hadoop job to do that in a distributed way), but
it's also pretty fast at runtime. At your scale the precomputation is
going to require Hadoop, so I would try this next, after the
item-based approach.
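
A sketch of the non-distributed version, again with current class names
(at your scale, the separate Hadoop job would replace the in-memory
precomputation the constructor does here):

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    // Builds the item-item diff matrix in memory (in the constructor),
    // then answers per-user recommendation queries quickly.
    public class SlopeOneSketch {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("preferences.csv"));
        SlopeOneRecommender recommender = new SlopeOneRecommender(model);
        List<RecommendedItem> recs = recommender.recommend(123L, 10);
        for (RecommendedItem rec : recs) {
          System.out.println(rec.getItemID() + " : " + rec.getValue());
        }
      }
    }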

On Fri, Jul 24, 2009 at 12:09 AM, Aurora Skarra-Gallagher <> wrote:
> Hi,
> Thank you for responding. My spam filter was "out to get me" and your
> responses were caught by it.
> I will investigate the Hadoop integration piece, specifically
> RecommenderJob. Currently, the Hadoop grid I'm working with is using
> 0.18.3. Will that pose a problem? I noticed some threads about
> versions of Hadoop less than 0.19 not working.
> We are looking at starting with 70M users and scaling up to 500M
> eventually. It is hard for me to estimate the number of items. We
> could be starting out with 100, but as these items are entities that
> we extract, there could be tens of thousands eventually. I would
> guess that most users would have fewer than 100 of these.
> Does that help? I would be interested in your input on the algorithms
> and also in being a guinea pig for the code you're developing, if it
> makes sense.
> -Aurora
