mahout-user mailing list archives

From: Sebastian Schelter <...@apache.org>
Subject: Re: Hints for Best Practices for Jobs with amazon EMR
Date: Thu, 14 Apr 2011 21:43:55 GMT
Hi Thomas,

Now I'd say the long running time comes from the items of dataset two
that have lots of user preferences. It seems your data is too dense to
compare all pairs of users with ItemSimilarityJob.
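
(Back-of-envelope: a preference row with n entries produces about
n*(n-1)/2 co-occurring pairs. After your inversion, each item from
dataset two becomes a row with close to the full user count as entries,
so a single row with, say, 200000 entries already yields about
2 * 10^10 pairs.)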

What exactly is the problem you're trying to solve with computing 
similar users? Do you need that as input for the computation of 
recommendations? Maybe we'll find another approach for you on this list.
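
For instance, if it's really user-user similarities you need, you might
not have to write a custom job at all: user-user similarity is just row
similarity over the original (non-inverted) user-item matrix, so you
could run RowSimilarityJob directly. A rough, untested sketch (the
class path is from memory, the S3 paths are placeholders, and note that
RowSimilarityJob expects its input as a SequenceFile of Mahout Vectors,
one row per user, not a CSV):

ruby elastic-mapreduce \
  --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
  --main-class org.apache.mahout.math.hadoop.similarity.RowSimilarityJob \
  --arg -i --arg s3://some-uri/input/user_vectors \
  --arg -o --arg s3://some-uri/output/user_similarities \
  --arg --numberOfColumns --arg 208760 \
  --arg --similarityClassname --arg SIMILARITY_LOGLIKELIHOOD \
  --arg --maxSimilaritiesPerRow --arg 500 \
  -j JobId

Be aware, though, that this alone doesn't remove the density problem:
the items from dataset two that almost every user has touched would
still dominate the runtime.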

--sebastian

On 14.04.2011 16:59, Thomas Rewig wrote:
> Hi Sebastian,
>
> My data model contains 17733658 data points, with 230116 unique users
> and 208760 unique items (after the inversion below, each user Ux
> becomes an item I(Ux) and each item Ix becomes a user U(Ix)).
> The data points are dense in one part and sparse in the other, because
> I am testing merging two datasets and inverting the result so that I
> can use the ItemSimilarityJob:
>
> e.g.:
>
> I = Item
> U = User
>
> Dataset1 (the sparse one):
>       I1  I2  I3  I4
> U1     9   -   -   8
> U2     7   -   4   -
> U3     -   8   -   5
> U5     5   -   9   -
>
> Dataset2 (the dense one, but with far fewer items than Dataset1):
>
>       I5  I6
> U1     1   2
> U2     3   2
> U3     2   -
> U4     5   3
> U5     1   1
>
> Inverted Dataset(1+2), so users are items and vice versa:
>
>        I(U1) I(U2) I(U3) I(U4) I(U5)
> U(I1)    9     7     -     -     5
> U(I2)    -     -     8     -     -
> U(I3)    -     4     -     -     9
> U(I4)    8     -     5     -     -
> U(I5)    1     3     2     5     1
> U(I6)    2     2     -     3     1
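>
> (The inversion itself is just a column swap on the CSV level; assuming
> the usual userID,itemID,pref layout and placeholder file names,
> something like
>
>   awk -F, '{print $2","$1","$3}' data_in.csv > data_in_inverted.csv
>
> is all it takes.)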
>
> So yes, you're right: because of this inversion I have users with lots
> of preferences (nearly as many as there are users in Dataset2), and I
> can understand why the system seems to stall.
>
> Maybe inverting the data isn't a good approach for this purpose and I
> have to write my own UserUserSimilarityJob. (At the moment I have no
> idea how to do this, because I just started with Hadoop and MapReduce,
> but I can try ;-) ).
>
> Do you have some other hints I can try?
>
>>
>> Can you say how many data points your data contains and how dense
>> they are? 200 MB doesn't seem like much; it shouldn't take hours with
>> 8 m1.large instances.
>>
>> Can you give us the values of the following counters?
>>
>> MaybePruneRowsMapper: Elements.USED
>> MaybePruneRowsMapper: Elements.NEGLECTED
>>
>> CooccurrencesMapper: Counter.COOCCURRENCES
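>>
>> A quick way to pull these out of the job's syslog would be something
>> like:
>>
>>   grep -E 'USED|NEGLECTED|COOCCURRENCES' syslog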
>
> I'm not sure if I can find the data you want in the logs, but maybe
> this log sample helps:
>
> MaybePruneRowsMapper: Elements.NEGLECTED = 8798627
> MaybePruneRowsMapper: Elements.USED = 6821670
>
> I can't find Counter.COOCCURRENCES.
>
>
> INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 92%
> INFO org.apache.hadoop.mapred.JobClient (main):  map 100% reduce 100%
> INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201104130912_0004
> INFO org.apache.hadoop.mapred.JobClient (main): Counters: 20
> INFO org.apache.hadoop.mapred.JobClient (main):   org.apache.mahout.cf.taste.hadoop.MaybePruneRowsMapper$Elements
> INFO org.apache.hadoop.mapred.JobClient (main):     NEGLECTED=8798627
> INFO org.apache.hadoop.mapred.JobClient (main):     USED=6821670
> INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
> INFO org.apache.hadoop.mapred.JobClient (main):     Launched reduce tasks=24
> INFO org.apache.hadoop.mapred.JobClient (main):     Rack-local map tasks=3
> INFO org.apache.hadoop.mapred.JobClient (main):     Launched map tasks=24
> INFO org.apache.hadoop.mapred.JobClient (main):     Data-local map tasks=21
> INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
> INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_READ=45452916
> INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_READ=120672701
> INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_WRITTEN=106567950
> INFO org.apache.hadoop.mapred.JobClient (main):     HDFS_BYTES_WRITTEN=51234800
> INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
> INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input groups=208760
> INFO org.apache.hadoop.mapred.JobClient (main):     Combine output records=0
> INFO org.apache.hadoop.mapred.JobClient (main):     Map input records=230201
> INFO org.apache.hadoop.mapred.JobClient (main):     Reduce shuffle bytes=60461985
> INFO org.apache.hadoop.mapred.JobClient (main):     Reduce output records=208760
> INFO org.apache.hadoop.mapred.JobClient (main):     Spilled Records=13643340
> INFO org.apache.hadoop.mapred.JobClient (main):     Map output bytes=136433400
> INFO org.apache.hadoop.mapred.JobClient (main):     Combine input records=0
> INFO org.apache.hadoop.mapred.JobClient (main):     Map output records=6821670
> INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input records=6821670
> INFO org.apache.mahout.common.AbstractJob (main): Command line arguments: {--endPhase=2147483647, --maxSimilaritiesPerRow=501, --numberOfColumns=230201, --similarityClassname=SIMILARITY_LOGLIKELIHOOD, --startPhase=0, --tempDir=temp}
> INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
> INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 24
> INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 24
> INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201104130912_0005
> INFO org.apache.hadoop.mapred.JobClient (main):  map 0% reduce 0%
> INFO org.apache.hadoop.mapred.JobClient (main):  map 4% reduce 0%
>
>>
>> By the way, I see that you're located in Berlin. I have some free
>> time in the next two weeks; if you want, we could meet for a coffee
>> and you'll get some free consultation!
>>
> It would be really great to meet you, but only our head office is in
> Berlin. I am in Dresden, and although that is not far away, it does
> not look like I can get to Berlin. Maybe it will work out later when I
> visit the head office. I am sure you could explain a lot to me.
>
>
>
> Thanks in advance
> Thomas
>
>>
>> On 14.04.2011 12:18, Thomas Rewig wrote:
>>> Hello,
>>> Right now I'm testing Mahout (Taste) jobs on AWS EMR.
>>> I wonder if anyone has experience with the best cluster size and the
>>> best EC2 instance types. Are there any best practices for Mahout
>>> (Taste) jobs?
>>>
>>> For my first test I used a small 22 MB user-item model and computed
>>> an ItemSimilarityJob with 3 small EC2 instances:
>>>
>>> ruby elastic-mapreduce --create --alive \
>>>   --slave-instance-type m1.small --master-instance-type m1.small \
>>>   --num-instances 3 --name mahout-0.5-itemSimJob-TEST
>>>
>>>
>>> ruby elastic-mapreduce \
>>>   --jar s3://some-uri/mahout/mahout-core-0.5-Snapshot-job.jar \
>>>   --main-class org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
>>>   --arg -i --arg s3://some-uri/input/data_small_in.csv \
>>>   --arg -o --arg s3://some-uri/output/data_out_small.csv \
>>>   --arg -s --arg SIMILARITY_LOGLIKELIHOOD \
>>>   --arg -m --arg 500 \
>>>   --arg -mo --arg 500 \
>>>   -j JobId
>>>
>>> Here everything worked well, even if it took a few minutes.
>>>
>>> In a second test I used a bigger 200 MB user-item model and did the
>>> same with a cluster of large instances:
>>>
>>> ruby elastic-mapreduce --create --alive \
>>>   --slave-instance-type m1.large --master-instance-type m1.large \
>>>   --num-instances 8 --name mahout-0.5-itemSimJob-TEST2
>>>
>>> I logged in to the master node with ssh and looked at the syslog.
>>> For the first few hours everything looked OK, and then the job
>>> seemed to stall at 63% of a reduce step. I waited a few hours, but
>>> nothing happened, so I terminated the job. I couldn't even find any
>>> errors in the logs.
>>>
>>> So here are my questions:
>>> 1. Are there any proven best-practice cluster sizes and instance
>>> types (standard, high-memory, or high-CPU instances) that work well
>>> for big recommender jobs, or do I have to test this for every
>>> different job I run?
>>> 2. Would it have a positive effect if I split my big data_in.csv
>>> into many small CSVs?
>>>
>>> Does anyone have experience with this and some hints?
>>>
>>> Thanks in advance
>>> Thomas

