mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?
Date Thu, 06 Jan 2011 16:52:40 GMT
Those numbers seem "reasonable" to a first approximation, maybe a
little higher than I would have expected given past experience.

You should be able to increase speed with more nodes, sure, but I use
3 for testing too.

The jobs are I/O bound for sure. I don't think you will see an
appreciable difference with different algorithms.

Yes, the amount of data used in the similarity computation is the big
factor for run time. You probably need to tell it to keep fewer item-item
pairs with the "max" parameters you mentioned earlier.
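For example -- these particular cap values are just illustrative, not
tuned recommendations -- tightening them on your command would look
something like:

hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
  --maxSimilaritiesPerItem 50 --maxPrefsPerUser 20 --maxCooccurrencesPerItem 50 \
  -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt

Lower caps mean fewer item-item pairs flow through the similarity and
aggregation steps, trading some recommendation quality for run time.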

mapred.map.tasks controls the number of mappers -- or at least
suggests it to Hadoop.
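A minimal sketch of passing that hint, assuming the same job as above
(the value 12 is arbitrary):

hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.map.tasks=12 -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio ...

Hadoop treats it only as a hint; the actual number of map tasks is
largely driven by the number of input splits.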

What do you mean about the time of computation? The job tracker shows
you when the individual tasks start and finish.
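If you want exact timings after the fact instead of watching the web
UI, the job history tool should work too -- a sketch, assuming the job
history files are still under your output directory:

hadoop job -history data/movielens_2gennaio

That prints start and finish times for the job and its individual tasks.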

On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio
<stefanobellasio@gmail.com> wrote:
> Hi guys, I'm doing some tests these days and I have some questions. Here are my
> environment and basic configuration:
>
> 1) An Amazon EC2 cluster started with the Cloudera script and Apache Whirr; I'm using 3
> large-instance worker nodes plus one master node to control the cluster.
> 2) The MovieLens data sets: 100k, 1M and 10M ratings ... my tests right now are on the
> 10M version.
>
> This is the command I'm using to start the job on my cluster:
>
> hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
>   -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
>   --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 --maxCooccurrencesPerItem 100 \
>   -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt
>
> I'm trying different values for:
>
> maxSimilaritiesPerItem
> maxPrefsPerUser
> maxCooccurrencesPerItem
>
> and using about 10 users each time. With this command and the 10M data set, my cluster
> took more than 4 hours (with 3 nodes) to produce recommendations. Is that a good time?
>
>
> Well, right now I have two goals, and I'm posting here to ask for your help with some
> problems :) My primary goal is to run item-based recommendations and see how changing the
> parameters affects the running time and performance of my cluster. I also need to look at
> the similarities; I will be testing three of them: cosine, Pearson, and co-occurrence. Good
> choices? I also noted that all of the similarity computation is done in RAM (right?), so my
> matrix is built and stored in RAM. Is there another way to do that?
>
> - I need to understand what kind of scalability I get with more nodes (3 for now,
> I can go up to 5). I think the similarity calculation takes most of the time, am I right?
>
> - I know there is something like mapred.task to define how many instances some task can
> use ... do I need that? How can I specify it?
>
> - I need to see the exact time of each computation. I'm looking at the JobTracker, but it
> seems that never happens in my cluster even when a job (with mapping and reducing) is
> running. Is there another way to know the exact time of each computation?
>
> - Finally, I will take all the data and try to plot it to find trends based on the
> number of nodes, time, and data set size.
>
> Well, any suggestions you want to give me are welcome :) Thank you guys
>
>
