mahout-user mailing list archives

From Sebastian Schelter <>
Subject Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?
Date Sun, 02 Jan 2011 10:08:44 GMT
Hi Stefano, happy new year too!

The running time of RecommenderJob is proportional neither to the number
of users you want to compute recommendations for nor to the number of
recommendations per user. Those parameters only influence the last step
of the job; most of the time is spent earlier, computing item-item
similarities, which happens independently of the number of users you
want recommendations for and of the number of recommendations per user.

We have some parameters to control the amount of data considered in the
recommendation process. Have you tried adjusting them to your needs? If
you haven't, I think playing with those is the best place for you to
start:

  --maxPrefsPerUser maxPrefsPerUser
	Maximum number of preferences considered per user in final
	recommendation phase

  --maxSimilaritiesPerItem maxSimilaritiesPerItem
	Maximum number of similarities considered per item

  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem
	try to cap the number of cooccurrences per item to this number
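For reference, an invocation that sets these caps might look like the sketch below. The jar path, input/output locations, users file, similarity class, and the specific cap values are assumptions for illustration; adapt them to your cluster and dataset:

```shell
# Sketch of a RecommenderJob run with the capping parameters set.
# Paths, the similarity class, and the cap values are assumptions --
# adjust them to your installation and data.
hadoop jar mahout-core-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input ratings.csv \
  --output output \
  --usersFile users.txt \
  --similarityClassname SIMILARITY_COOCCURRENCE \
  --numRecommendations 10 \
  --maxPrefsPerUser 500 \
  --maxSimilaritiesPerItem 50 \
  --maxCooccurrencesPerItem 100
```

Smaller values for these caps shorten the expensive similarity-computation phase at the cost of some recommendation quality, so it is worth experimenting to find a setting that fits your data.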

It would be very cool if you could keep us up to date on your progress
and maybe share some numbers. There are a lot of things in
RecommenderJob that we could optimize to increase its performance and
scalability, and I think we'd be happy to patch it for you if you
encounter a problem.


Am 02.01.2011 10:36, schrieb Stefano Bellasio:
> Hi guys, happy new year :) Well, after several weeks of testing I finally have a complete
> Amazon EC2/Hadoop working environment, thanks to the Cloudera EC2 scripts. Right now I'm doing
> some tests with MovieLens (the 10M version) and I just need to compute recommendations with
> different similarities via RecommenderJob; all is ok. I ran an Amazon EC2 cluster with 3 instances,
> 1 master node and 2 worker nodes (large instances), but even though I know the recommender is not
> fast, I was thinking that 3 instances are [...] The process took about 3 hours to complete
> for 1 user (I specified the user that needs recommendations with a user.txt file)... and just
> 10 recommendations. So my question is: what is the correct setup for my cluster? How many
> nodes? How many data nodes and so on? Is there something I can do to speed up this [...] My
> goal is to recommend with a dataset of about 20/30 GB and 200 million [...] so I'm worried
> about that.
> Thanks :) Stefano
