mahout-user mailing list archives

From Warunika Ranaweera <warunik...@gmail.com>
Subject Re: Performance issues in Mahout recommendations
Date Tue, 24 Jun 2014 05:41:11 GMT
Hi Pat,

Thanks for your detailed reply. I tried some of your suggestions,
especially the in-memory recommender using the Mahout libraries, and it
works well for now. Once our data grows large enough to degrade the
in-memory recommender's performance, we plan to move to the distributed
recommender.
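
For illustration, the idea behind a single-machine, item-based recommender
on implicit data can be sketched in plain Java. This is not Mahout's
GenericItemBasedRecommender, and it does not use log-likelihood similarity;
it is a simplified stand-in that scores unseen items by raw co-occurrence
with the user's history. The class name, method names, and sample IDs are
all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for an in-memory item-based recommender.
// It ranks items by co-occurrence counts, not by Mahout's similarity measures.
class InMemoryItemRecommender {
    // userId -> items the user has interacted with (implicit feedback)
    private final Map<Long, Set<Long>> itemsByUser = new HashMap<>();

    void addInteraction(long userId, long itemId) {
        itemsByUser.computeIfAbsent(userId, k -> new HashSet<>()).add(itemId);
    }

    // Returns up to howMany unseen items, highest co-occurrence score first.
    List<Long> recommend(long userId, int howMany) {
        Set<Long> seen = itemsByUser.getOrDefault(userId, Collections.emptySet());
        Map<Long, Integer> scores = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> entry : itemsByUser.entrySet()) {
            if (entry.getKey() == userId) {
                continue; // skip the target user's own row
            }
            Set<Long> others = entry.getValue();
            int overlap = 0;
            for (Long item : seen) {
                if (others.contains(item)) {
                    overlap++;
                }
            }
            if (overlap == 0) {
                continue; // this user shares no items with the target user
            }
            // Each unseen item gains one point per shared item, which sums to
            // its co-occurrence count with the items in the user's history.
            for (Long candidate : others) {
                if (!seen.contains(candidate)) {
                    scores.merge(candidate, overlap, Integer::sum);
                }
            }
        }
        List<Long> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Long.compare(a, b); // ties: lower ID first
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        InMemoryItemRecommender rec = new InMemoryItemRecommender();
        long[][] interactions = {
            {1, 10}, {1, 20},
            {2, 10}, {2, 20}, {2, 30},
            {3, 10}, {3, 40},
            {4, 20}, {4, 30}
        };
        for (long[] pair : interactions) {
            rec.addInteraction(pair[0], pair[1]);
        }
        System.out.println(rec.recommend(1L, 3)); // prints [30, 40]
    }
}
```

Everything lives in hash maps, so a request for one user is a single pass
over the in-memory data with no job startup cost. The same property is why
this style of recommender slows down once the data outgrows one machine.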

Thanks for your help,
Warunika


On Fri, Jun 6, 2014 at 7:10 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:

> In the original case you were using a Hadoop command-line tool, which
> produces all recs for all users, not just one. Since the recs are ALL
> calculated, they just need to be stored and retrieved, which is very fast.
> Put them in a DB and, when the user visits, show the precalculated recs,
> which is as fast as a single DB fetch.
>
> Sebastian talks about the in-memory recommender for one machine and
> medium-sized datasets. It will produce recommendations for a specific user
> very fast as long as the data is not too big; beyond that point,
> performance drops off.
>
> The third way to do this is to break out the core data structure created
> by ItemSimilarityJob, translate the Mahout IDs into your item IDs, and
> index it with Solr. Then you can use a user’s history as a realtime query
> to Solr, which will return an ordered list of recs. This scales
> indefinitely as Solr scales and is very fast. It is also nice because you
> can bias results toward metadata like category, genre, or catalog section
> with the query; no new model creation is required. You’ll find a tool to
> help with this in mahout/examples or here:
> https://github.com/pferrel/solr-recommender
>
> One of those should fit; they are all fast in the right environment. They
> all require some background, non-realtime model calculation, but this is
> done only periodically.
>
>
> On Jun 6, 2014, at 5:33 AM, Sebastian Schelter <ssc@apache.org> wrote:
>
> Mahout has single machine and distributed recommenders.
>
>
> On 06/06/2014 02:31 PM, Warunika Ranaweera wrote:
> > I agree with your suggestion though. I have already implemented a Java
> > recommender and it performed better. But, due to scalability problems
> > that are predicted to occur in the future, we thought of moving to
> > Mahout. However, it seems like, for now, it's better to go with the
> > single machine implementation.
> >
> > Thanks for your suggestions,
> > Warunika
> >
> >
> >
> > On Fri, Jun 6, 2014 at 3:36 PM, Sebastian Schelter <ssc@apache.org> wrote:
> >
> >> 1M ratings take up something like 20 megabytes. At this data size it
> >> does not make any sense to use Hadoop. Just try the single machine
> >> implementation.
> >>
> >> --sebastian
> >>
> >>
> >>
> >>
> >> On 06/06/2014 12:01 PM, Warunika Ranaweera wrote:
> >>
> >>> Hi Sebastian,
> >>>
> >>> Thanks for your prompt response. It's just a sample data set from our
> >>> database and it may expand up to 6 million ratings. Since the
> >>> performance was low for a smaller data set, I thought it would be even
> >>> worse for a larger data set. As per your suggestion, I also applied the
> >>> same command on 1 million user ratings for approx. 6000 users and got
> >>> the same performance level.
> >>>
> >>> What is the average running time for the Mahout distributed
> >>> recommendation job on 1 million ratings? Does it usually take more than
> >>> 1 minute?
> >>>
> >>> Thanks in advance,
> >>> Warunika
> >>>
> >>>
> >>> On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter <ssc@apache.org> wrote:
> >>>
> >>>> You should not use Hadoop for such a tiny dataset. Use the
> >>>> GenericItemBasedRecommender on a single machine in Java.
> >>>>
> >>>> --sebastian
> >>>>
> >>>>
> >>>> On 06/06/2014 11:10 AM, Warunika Ranaweera wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am using Mahout's recommenditembased algorithm on a data set with
> >>>>> nearly 10,000 (implicit) user ratings. This is the command I used:
> >>>>>
> >>>>> mahout recommenditembased --input ratings.csv --output recommendation \
> >>>>>   --usersFile users.dat --tempDir temp \
> >>>>>   --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 3
> >>>>>
> >>>>>
> >>>>> Although the output is successfully generated, this process takes
> >>>>> nearly 7 minutes to produce recommendations for a single user. The
> >>>>> Hadoop cluster has 8 nodes and the machine on which Mahout is invoked
> >>>>> is an AWS EC2 c3.2xlarge server. When I tracked the mapreduce jobs, I
> >>>>> noticed that no more than one machine is utilized at a time, and the
> >>>>> *recommenditembased* command takes 9 mapreduce jobs altogether, with
> >>>>> approx. 45 seconds taken per job.
> >>>>>
> >>>>> Since the performance is too slow for real-time recommendations, it
> >>>>> would be really helpful to know whether I'm missing any additional
> >>>>> commands or configurations that enable faster performance.
> >>>>>
> >>>>> Thanks,
> >>>>> Warunika
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
>