mahout-user mailing list archives

From Daniel Zohar <disso...@gmail.com>
Subject Re: Mahout performance issues
Date Wed, 30 Nov 2011 14:19:17 GMT
Hi Sean,
First of all let me thank you for all your help thus far :)

I am using Mahout 0.5.
The application is not live yet, so I assume multi-threading is not a
concern for now.

I definitely see that the bottleneck is in the similarities computations.
Looking at TopItems.getTopItems, I can see that the method iterates over
all the 'possible items' and evaluates each one with the Estimator, which
in turn iterates over all the past user choices for every possible item.
Now let's say a user chose 50 items in the past and has 100 possible items;
that's already 5,000 item-item similarities to calculate. If I didn't cap
the possible items, it could wind up at much larger numbers.
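To make the numbers concrete, here is a toy cost model (not Mahout code,
just the shape of the nested iteration in getTopItems as I understand it):

```java
// Hypothetical cost model: for each candidate item, the Estimator
// computes one similarity per item the user has already chosen,
// so the total work is candidates x past choices.
public class SimilarityCostSketch {
    static long estimateSimilarityCalls(int pastChoices, int candidateItems) {
        long calls = 0;
        for (int c = 0; c < candidateItems; c++) {   // one estimate per candidate
            for (int p = 0; p < pastChoices; p++) {  // one similarity per past choice
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // 50 past choices x 100 candidates = 5,000 similarity computations
        System.out.println(estimateSimilarityCalls(50, 100));
    }
}
```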

I would also like to add that although the solution I posted before
improves performance, it severely degrades the quality of the
recommendations, since it checks a smaller pool of possible items.
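For reference, the idea behind my rewrite is roughly the following (a
simplified, self-contained sketch with made-up types, not the actual
Mahout classes; the real version is in the pastebin linked below):

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a capped doGetCandidateItems: collect items
// co-chosen with the user's items, but stop once a global cap is hit.
// coChosenItems maps each item to the items chosen alongside it by
// other users (a stand-in for walking the preference data).
public class CappedCandidates {
    static Set<Long> candidates(long[] userChoices,
                                Map<Long, long[]> coChosenItems,
                                int maxCandidates) {
        Set<Long> out = new LinkedHashSet<>();
        for (long item : userChoices) {
            for (long candidate : coChosenItems.getOrDefault(item, new long[0])) {
                if (out.size() >= maxCandidates) {
                    return out;  // global cap reached: stop early
                }
                out.add(candidate);
            }
        }
        return out;
    }
}
```

This is exactly why quality suffers: candidates discovered after the cap
is hit are never even considered.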

Thanks!

On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <srowen@gmail.com> wrote:

> I have a few more thoughts.
>
> First, I was wrong about what the first parameter to
> SamplingCandidateStrategy means. It's effectively a minimum, rather than
> a maximum; setting it to 1 just means it will sample at least 1 pref. I
> think you figured that out. I think values like (5, 1) are probably about
> right for you.
>
> I see that your change is to further impose a global cap on the number of
> candidate items returned. I understand the logic of that -- Sebastian what
> do you think? (PS you can probably make that run slightly faster by using
> LongPrimitiveIterator instead of Iterator<Long>.)
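Side note on that tip: iterating IDs as primitive longs avoids boxing each
ID into a Long object on every step. A minimal stand-in for the interface
(not Mahout's actual class) to illustrate what the change buys:

```java
import java.util.NoSuchElementException;

// Illustrative stand-in for a primitive-long iterator: nextLong()
// returns a long directly, so no Long object is allocated per element,
// unlike Iterator<Long>, which boxes every ID it hands back.
public class PrimitiveIterationSketch {
    interface LongPrimitiveIterator {
        boolean hasNext();
        long nextLong();
    }

    static LongPrimitiveIterator over(long[] ids) {
        return new LongPrimitiveIterator() {
            int i = 0;
            public boolean hasNext() { return i < ids.length; }
            public long nextLong() {
                if (!hasNext()) throw new NoSuchElementException();
                return ids[i++];  // primitive long, no boxing
            }
        };
    }

    static long sum(LongPrimitiveIterator it) {
        long s = 0;
        while (it.hasNext()) s += it.nextLong();  // no Long allocations in this loop
        return s;
    }
}
```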
>
>
> Something still feels a bit off here; that's a very long time. Your JVM
> params are impeccable, and you have a good amount of RAM and a strong
> machine.
>
> Since you're getting speed-up by directly reducing the number of candidate
> items, I get the idea that your similarity computations are the bottleneck.
> Does any of your profiling confirm that?
>
> Are you using the latest code? I can think of one change in the last few
> months that I added (certainly since 0.5) that would speed up
> LogLikelihoodSimilarity a fair bit. I know you're using 'boolean' data,
> so this ought to be very fast.
>
>
> I'll also say that the computation here is not multi-threaded. I had always
> sort of thought that, at scale, you'd get parallelism from handling
> multiple concurrent requests. It would be possible to rewrite a lot of the
> internals to compute top recs using multiple threads. That might make
> individual requests return faster on a multi-core machine, though it
> wouldn't increase overall throughput.
>
>
> On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <dissoman@gmail.com> wrote:
>
> > Hello all,
> > This email follows the correspondence on StackExchange between myself
> > and Sean Owen. Please see
> > http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
> >
> > I'm building a boolean-based recommendation engine with the following
> > data:
> >
> >   - 12M users
> >   - 2M items
> >   - 18M user-item (boolean) choices
> >
> > The following code is used to build the recommender:
> >
> > DataModel dataModel = new FileDataModel(new File(dataFile));
> > ItemSimilarity itemSimilarity = new CachingItemSimilarity(new
> > LogLikelihoodSimilarity(dataModel), dataModel);
> > CandidateItemsStrategy candidateItemsStrategy = new
> > SamplingCandidateItemsStrategy(20, 5);
> > MostSimilarItemsCandidateItemsStrategy
> > mostSimilarItemsCandidateItemsStrategy = new
> > SamplingCandidateItemsStrategy(20, 5);
> >
> > this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel,
> > itemSimilarity,
> > candidateItemsStrategy,mostSimilarItemsCandidateItemsStrategy);
> >
> > My app runs on a Tomcat with the following JVM arguments:
> > *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC
> > -XX:+UseParallelOldGC*
> >
> > Recommendations with the code above work very well for users who have
> > made 1-2 choices in the past, but can take over a minute when a user
> > has made tens of choices, especially if one of those choices is a very
> > popular item (i.e. one chosen by many other users).
> >
> > Even when using the *SamplingCandidateItemsStrategy* with (1, 1)
> > arguments, I still did not manage to achieve fast results.
> >
> > The only way I managed to get somewhat OK results (max recommendation
> > time ~4 secs) was by rewriting the *SamplingCandidateItemsStrategy* so
> > that *doGetCandidateItems* returns a limited number of items. Here is
> > the doGetCandidateItems method as I rewrote it:
> > http://pastebin.com/6n9C8Pw1
> >
> > **I think a good response time for recommendations should be less than
> > a second (preferably less than 500 milliseconds).**
> > How can I make Mahout perform better? I have a feeling some optimization
> > is needed both in the *CandidateItemsStrategy* and in the *Recommender*
> > itself.
> > Thanks in advance!
> > Daniel
> >
>
