mahout-user mailing list archives

From Sean Owen <>
Subject Re: Hints for Best Practices for Jobs with amazon EMR
Date Thu, 14 Apr 2011 11:43:59 GMT
A few assorted hints --

You don't need high-memory instances. You do want instances with higher
I/O performance, all else equal. m1.large is a good choice.

I'd also make sure the number of mappers / reducers is at least the number
of instances, in every case. If you see a job has chosen to deploy only 10
reducers for whatever reason when you have 20 instances running, that's
suboptimal. We can talk about how to change that number. In fact, on
m1.large it will by default run 2 reducers per instance, so in theory you
want twice as many reducers as instances.
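One way to override the reducer count (a sketch, not a verified recipe: it assumes Mahout's jobs are run through Hadoop's ToolRunner, which accepts generic `-D` options, and the bucket paths below are placeholders) is to pass the Hadoop property before the job's own arguments:

```shell
# Sketch: request 16 reducers (2 per instance on an 8-node m1.large cluster)
# by passing a generic Hadoop option ahead of the job arguments.
# The s3:// paths are placeholders, not real locations.
ruby elastic-mapreduce \
  --jar s3://your-bucket/mahout-core-0.5-SNAPSHOT-job.jar \
  --arg -Dmapred.reduce.tasks=16 \
  --arg -i --arg s3://your-bucket/input/data_in.csv \
  --arg -o --arg s3://your-bucket/output/ \
  -j JobId
```

The `-D` option must come before the named arguments (`-i`, `-o`, etc.) for Hadoop's generic option parser to pick it up.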

On Thu, Apr 14, 2011 at 11:18 AM, Thomas Rewig <> wrote:

>  Hello,
> right now I'm testing Mahout (Taste) jobs on AWS EMR.
> I wonder if anyone has any experience with the best cluster size and
> the best EC2 instance types. Are there any best practices for Mahout (Taste)
> jobs?
> In my first test I used a small 22 MB user-item model and computed an
> ItemSimilarityJob with 3 small EC2 instances:
> ruby elastic-mapreduce --create --alive --slave-instance-type m1.small
> --master-instance-type m1.small --num-instances 3  --name
> mahout-0.5-itemSimJob-TEST
> ruby elastic-mapreduce
> --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar
> --main-class
> --arg -i --arg s3://some-uri/input/data_small_in.csv
> --arg -o --arg s3://some-uri/output/data_out_small.csv
> --arg -s
> --arg -m
> --arg 500
> --arg -mo
> --arg 500
> -j JobId
> Here everything worked well, even if it took a few minutes.
> In a second test I used a bigger 200 MB user-item model and did the same
> with a cluster of large instances:
> ruby elastic-mapreduce --create --alive --slave-instance-type m1.large
> --master-instance-type m1.large --num-instances 8  --name
> mahout-0.5-itemSimJob-TEST2
> I logged in on the master node with ssh and watched the syslog. For the
> first few hours everything looked OK, and then it seemed to stop at 63% of
> a reduce step. I waited a few hours, but nothing happened, so I terminated
> the job. I couldn't even find any errors in the logs.
> So here are my questions:
> 1. Are there any proven best-practice cluster sizes and instance types
> (Standard, High-Memory, or High-CPU instances) that work well for big
> recommender jobs, or do I have to test this for every different job I run?
> 2. Would it have a positive effect if I split my big data_in.csv into
> many small CSVs?
> Does anyone have any experience with this and some hints?
> Thanks in advance
> Thomas
