mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Performance issues in Mahout recommendations
Date Fri, 06 Jun 2014 10:06:50 GMT
1M ratings take up something like 20 megabytes. At that data size it
does not make any sense to use Hadoop. Just try the single-machine
implementation.
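
As a rough check of the 20 megabyte figure, here is a back-of-the-envelope sketch. The 20 bytes per rating is an assumption (about the length of one "userID,itemID,value" CSV line plus the newline), not a measured value, and the class and method names are made up for illustration.

```java
public class RatingsSizeEstimate {
    // Assumption: one CSV line such as "12345,67890,4.5\n" is roughly 20 bytes.
    static final int ASSUMED_BYTES_PER_RATING = 20;

    // Estimated size of the raw ratings file in megabytes (1 MB = 1,000,000 bytes).
    static double estimateMegabytes(long ratings) {
        return ratings * (double) ASSUMED_BYTES_PER_RATING / 1_000_000.0;
    }

    public static void main(String[] args) {
        // The 1 million ratings from the test run.
        System.out.println(estimateMegabytes(1_000_000L) + " MB");
        // The 6 million ratings the data set may grow to.
        System.out.println(estimateMegabytes(6_000_000L) + " MB");
    }
}
```

Under that assumption, even the 6 million ratings mentioned in the thread come to about 120 megabytes, which is still a small input for a single machine.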

--sebastian
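
For reference, a minimal sketch of what the single-machine route could look like with Mahout's Taste API. It assumes the Mahout core jar is on the classpath and that ratings.csv holds one "userID,itemID[,preference]" record per line. The class name and the command-line argument are hypothetical. The log-likelihood similarity and the 3 recommendations mirror the options of the original command.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SingleMachineRecommender {
    public static void main(String[] args) throws Exception {
        // Load all ratings into memory; the file path is an assumption.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Same similarity measure as SIMILARITY_LOGLIKELIHOOD in the Hadoop job.
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);

        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        // Hypothetical usage: the user ID is passed as the first argument.
        long userId = Long.parseLong(args[0]);

        // Ask for 3 items, matching --numRecommendations 3.
        List<RecommendedItem> items = recommender.recommend(userId, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + "\t" + item.getValue());
        }
    }
}
```

Because the data model stays in memory, repeated recommend() calls do not pay the per-job startup cost of a MapReduce run.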



On 06/06/2014 12:01 PM, Warunika Ranaweera wrote:
> Hi Sebastian,
>
> Thanks for your prompt response. It's just a sample data set from our
> database and it may expand up to 6 million ratings. Since the performance
> was low for a smaller data set, I thought it would be even worse for a
> larger data set. As per your suggestion, I also applied the same command on
> 1 million user ratings for approx. 6000 users and got the same performance
> level.
>
> What is the average running time for the Mahout distributed recommendation
> job on 1 million ratings? Does it usually take more than 1 minute?
>
> Thanks in advance,
> Warunika
>
>
> On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter <ssc@apache.org> wrote:
>
>> You should not use Hadoop for such a tiny dataset. Use the
>> GenericItemBasedRecommender on a single machine in Java.
>>
>> --sebastian
>>
>>
>> On 06/06/2014 11:10 AM, Warunika Ranaweera wrote:
>>
>>> Hi,
>>>
>>> I am using Mahout's recommenditembased algorithm on a data set with nearly
>>> 10,000 (implicit) user ratings. This is the command I used:
>>>
>>> mahout recommenditembased --input ratings.csv --output recommendation
>>>   --usersFile users.dat --tempDir temp
>>>   --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 3
>>>
>>>
>>> Although the output is successfully generated, this process takes nearly 7
>>> minutes to produce recommendations for a single user. The Hadoop cluster
>>> has 8 nodes and the machine on which Mahout is invoked is an AWS EC2
>>> c3.2xlarge server. When I tracked the MapReduce jobs, I noticed that no more
>>> than one machine is utilized at a time, and the recommenditembased command
>>> runs 9 MapReduce jobs altogether, at approx. 45 seconds per job.
>>>
>>> Since this is too slow for real-time recommendations, it would be really
>>> helpful to know whether I'm missing any additional commands or
>>> configurations that enable faster performance.
>>>
>>> Thanks,
>>> Warunika
>>>
>>>
>>
>

