mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: mahout collections updates
Date Tue, 12 Mar 2013 22:18:37 GMT
Ok now that's even weirder than I thought!  You're still calling vector
methods, you're not doing direct array access or anything...

On Tuesday, March 12, 2013, Sebastian Schelter wrote:

> I wrote a small benchmark as we investigated this issue recently for
> computing recommendations from an ALS factorization.
>
> https://gist.github.com/sscdotopen/5147521
>
> Here's some results on my notebook, indicating a 5x to 6x performance
> improvement:
>
> dot() 570ms, direct 98ms
> dot() 568ms, direct 97ms
> dot() 559ms, direct 97ms
> dot() 566ms, direct 98ms
> dot() 574ms, direct 98ms
> dot() 582ms, direct 101ms
> dot() 581ms, direct 98ms
> dot() 576ms, direct 98ms
> dot() 574ms, direct 97ms
> dot() 574ms, direct 99ms
> dot() 564ms, direct 98ms
> dot() 576ms, direct 97ms
> dot() 580ms, direct 100ms
>
>
>
> On 12.03.2013 23:00, Sean Owen wrote:
> > It's almost certainly the overhead of the iterator creation and iterator
> > methods.
> > DenseVector.dot() is not specialized and the simple dot product method
> here
> > could as well be placed there. Then the call to DenseVector.dot() would
> be
> > equally unsurprisingly fast.
> >
> >
> > On Tue, Mar 12, 2013 at 9:56 PM, Jake Mannix <jake.mannix@gmail.com<javascript:;>>
> wrote:
> >
> >> But then where does it slow down?  It just wraps a double[]
> >>
> >> On Tuesday, March 12, 2013, Sebastian Schelter wrote:
> >>
> >>> I looked into DenseVector and it doesn't use any primitive collections,
> >>> so ignore my last mail :)
> >>>
> >>> On 12.03.2013 22:16, Sebastian Schelter wrote:
> >>>> As a sidenote: I was kinda shocked recently, that switching from
> >>>> DenseVector's dot() method to a direct dot product computation gave
a
> >> 3x
> >>>> increase in performance in
> >>>> org.apache.mahout.cf.taste.hadoop.als.RecommenderJob.
> >>>>
> >>>> It seems like we really have a performance problem for some usecases.
> >>>>
> >>>> On 12.03.2013 22:04, Dawid Weiss wrote:
> >>>>>> The primary use case for mahout collections is directly *inside*
of
> >>>>>> our Vector interface.  Which is to say, it's not directly exposed
to
> >>>>>> most users, and we don't really expose the ability to do guava
> >>> collections
> >>>>>> stuff on them at all: We Do Math. :)  So in particular, we don't
> >> expose
> >>>>>
> >>>>> Fair enough. But you might want to expose some of it at some point
> and
> >>>>> if this happens it
> >>>>> may just be ready for you.
> >>>>>
> >>>>>> Question is whether there's anything to be gained by just swapping
> >>>>>> our own collections *out* for something else, like HPPC or fastutil.
> >>>>>
> >>>>> Depends. Speed optimizations may be one reason -- you'd need to
check
> >>>>> if the code gains anything by using these libraries compared to
> Mahout
> >>>>> collections. While microbenchmarks may show large differences my
bet
> >>>>> is that overall results, taking into account
> >>>>> computations and, God forbid, I/O, will be within noise range unless
> >>>>> you're really using these data structures a *lot* in tight loops.
The
> >>>>> only practical benefit I see is getting rid of a chunk of code you
> >>>>> don't wish to
> >>>>> maintain (like you said: missing features, unit tests, etc.). But
I
> >>>>> don't negate there is some entertainment value in going back to
such
> >>>>> fundamental data structures and trying to squeeze the last bit of
> >>>>> performance out of them. :)
> >>>>>
> >>>>> Dawid
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >> --
> >>
> >>   -jake
> >>
> >
>
>

-- 

  -jake

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message