mahout-user mailing list archives

From Thomas Rewig <tre...@mufin.com>
Subject Re: Hints for Best Practices for Jobs with amazon EMR
Date Thu, 14 Apr 2011 14:59:50 GMT
Hi Sebastian,

My data model contains 17733658 data points, with 230116 unique users
(which become the items I(Ux) after inversion) and 208760 unique items
(which become the users U(Ix)). The data is in some sense both dense and
sparse, because I am experimenting with merging two datasets and
inverting (transposing) the result, so that I can use the
ItemSimilarityJob:

e.g.:

I = Item
U = User

Dataset1 (the sparse one):
   I1 I2 I3 I4
U1 9        8
U2 7     4
U3    8     5
U5 5     9

Dataset2 (the dense one, but with far fewer items than Dataset1):

   I5 I6
U1 1  2
U2 3  2
U3 2
U4 5  3
U5 1  1

Inverting the merged dataset (1+2) so that users become items and vice versa:

      I(U1) I(U2) I(U3) I(U4) I(U5)
U(I1) 9     7                 5
U(I2)             8
U(I3)       4                 9
U(I4) 8           5
U(I5) 1     3     2     5     1
U(I6) 2     2           3     1

So yes, you're right: because of this inversion I have users with a lot
of preferences (nearly as many as there are users in Dataset2), and I
can understand why the system seems to stall.
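
For reference, the inversion step itself is simple: assuming the usual
userID,itemID,value CSV layout that the job reads, transposing just
means swapping the first two columns. A minimal sketch of that step
(file names are only placeholders):

# swap the userID and itemID columns to transpose the preference matrix
awk -F',' 'BEGIN { OFS = "," } { print $2, $1, $3 }' data_in.csv > data_inverted.csv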

Maybe inverting the data isn't a good approach for this purpose, and I
will have to write my own UserUserSimilarityJob. (At the moment I have
no idea how to do this, because I have only just started with Hadoop
and MapReduce, but I can try ;-) ).
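
Or maybe I can avoid writing one from scratch: if I understand
correctly, ItemSimilarityJob delegates the pairwise computation to
RowSimilarityJob, so running RowSimilarityJob directly on the
un-transposed user vectors should yield user-user similarities. Just a
sketch of what I imagine the call would look like (the class name and
S3 paths are my assumptions, the argument names are taken from the job
log below, and RowSimilarityJob expects SequenceFiles of row vectors as
input, so the CSV would have to be converted first):

ruby elastic-mapreduce
--jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar
--main-class org.apache.mahout.math.hadoop.similarity.RowSimilarityJob
--arg --input --arg s3://some-uri/input/user-vectors/
--arg --output --arg s3://some-uri/output/user-user-similarities/
--arg --numberOfColumns --arg 208760
--arg --similarityClassname --arg SIMILARITY_LOGLIKELIHOOD
--arg --maxSimilaritiesPerRow --arg 500
-j JobId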

Do you have any other hints I can try?

>
> Can you say how many data points your data contains and how dense
> they are? 200 MB doesn't seem like that much; it shouldn't take hours
> with 8 m1.large instances.
>
> Can you give us the values of the following counters?
>
> MaybePruneRowsMapper: Elements.USED
> MaybePruneRowsMapper: Elements.NEGLECTED
>
> CooccurrencesMapper: Counter.COOCCURRENCES

I'm not sure whether I can find the data you want in the logs, but
maybe this log sample helps:

MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
MaybePruneRowsMapper: Elements.USED = 6821670

I can't find Counter.COOCCURRENCES.
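
In case it matters, I search the step logs for the counter values on
the master node roughly like this (the log path is what it looks like
on my cluster and may differ):

# pull the relevant counter lines out of the EMR step syslogs
grep -E 'NEGLECTED|USED|COOCCURRENCES' /mnt/var/log/hadoop/steps/*/syslog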


INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 92%
INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 100%
INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
INFO org.apache.hadoop.mapred.JobClient (main):   org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
INFO org.apache.hadoop.mapred.JobClient (main):     NEGLECTED=8798627
INFO org.apache.hadoop.mapred.JobClient (main):     USED=6821670
INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
INFO org.apache.hadoop.mapred.JobClient (main):     Launched reduce tasks=24
INFO org.apache.hadoop.mapred.JobClient (main):     Rack-local map tasks=3
INFO org.apache.hadoop.mapred.JobClient (main):     Launched map tasks=24
INFO org.apache.hadoop.mapred.JobClient (main):     Data-local map tasks=21
INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_READ=45452916
INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_READ=120672701
INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_WRITTEN=106567950
INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_WRITTEN=51234800
INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input groups=208760
INFO org.apache.hadoop.mapred.JobClient (main):     Combine output records=0
INFO org.apache.hadoop.mapred.JobClient (main):     Map input records=230201
INFO org.apache.hadoop.mapred.JobClient (main):     Reduce shuffle bytes=60461985
INFO org.apache.hadoop.mapred.JobClient (main):     Reduce output records=208760
INFO org.apache.hadoop.mapred.JobClient (main):     Spilled Records=13643340
INFO org.apache.hadoop.mapred.JobClient (main):     Map output bytes=136433400
INFO org.apache.hadoop.mapred.JobClient (main):     Combine input records=0
INFO org.apache.hadoop.mapred.JobClient (main):     Map output records=6821670
INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input records=6821670
INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
INFO org.apache.hadoop.mapred.JobClient (main):  map 0% reduce 0%
INFO org.apache.hadoop.mapred.JobClient (main):  map 4% reduce 0%

>
> By the way, I see that you're located in Berlin. I have some free
> time in the next two weeks; if you want, we could meet for a coffee
> and you'll get some free consultation!
>
It would be really great to meet you, but only our head office is in
Berlin. I am in Dresden, and although that is not far away, it does not
look like I can get to Berlin. Maybe it will work out later when I
visit the head office. I am sure you could explain a lot to me.



Thanks in advance
Thomas






>
>
> On 14.04.2011 12:18, Thomas Rewig wrote:
>>  Hello
>> Right now I'm testing Mahout (Taste) jobs on AWS EMR.
>> I wonder if anyone has experience with the best cluster size and the
>> best EC2 instance types. Are there any best practices for Mahout
>> (Taste) jobs?
>>
>> In my first test I used a small 22 MB user-item model and computed
>> an ItemSimilarityJob with 3 small EC2 instances:
>>
>> ruby elastic-mapreduce --create --alive --slave-instance-type 
>> m1.small --master-instance-type m1.small --num-instances 3  --name 
>> mahout-0.5-itemSimJob-TEST
>>
>>
>> ruby elastic-mapreduce
>> --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar
>> --main-class 
>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>> --arg -i --arg s3://some-uri/input/data_small_in.csv
>> --arg -o --arg s3://some-uri/output/data_out_small.csv
>> --arg -s
>> --arg SIMILARITY_LOGLIKELIHOOD
>> --arg -m
>> --arg 500
>> --arg -mo
>> --arg 500
>> -j JobId
>>
>> Here everything worked well, even though it took a few minutes.
>>
>> In a second test I used a bigger 200 MB user-item model and did the
>> same with a cluster of large instances:
>>
>> ruby elastic-mapreduce --create --alive --slave-instance-type 
>> m1.large --master-instance-type m1.large --num-instances 8  --name 
>> mahout-0.5-itemSimJob-TEST2
>>
>> I logged in to the master node via SSH and watched the syslog. For
>> the first few hours everything looked OK, but then the job seemed to
>> stop at 63% of a reduce step. I waited a few more hours, but nothing
>> happened, so I terminated the job. I couldn't even find any errors
>> in the logs.
>>
>> So here are my questions:
>> 1. Are there any proven best-practice cluster sizes and instance
>> types (Standard, High-Memory, or High-CPU instances) that work well
>> for big recommender jobs, or do I have to test this for every
>> different job I run?
>> 2. Would it have a positive effect if I split my big data_in.csv
>> into many small CSVs?
>>
>> Does anyone have experience with this and some hints?
>>
>> Thanks in advance
>> Thomas
>>
>>
>>
>
>

