From: Sebastian Schelter
Reply-To: ssc@apache.org
Date: Tue, 12 Mar 2013 23:42:19 +0100
To: dev@mahout.apache.org
Subject: Re: mahout collections updates
Message-ID: <513FAF4B.2080406@apache.org>
References: <513F9B35.7000707@apache.org> <513F9C57.4070206@apache.org> <513FA709.8030503@apache.org>

Using this code in DenseVector gives me similar performance:

  @Override
  public double dot(Vector x) {
    if (!x.isDense()) {
      return super.dot(x);
    } else {
      double sum = 0;
      final int size = size();
      for (int n = 0; n < size; n++) {
        sum += values[n] * x.getQuick(n);
      }
      return sum;
    }
  }

On 12.03.2013 23:18, Jake Mannix wrote:
> Ok now that's even weirder than I thought! You're still calling vector
> methods, you're not doing direct array access or anything...
>
> On Tuesday, March 12, 2013, Sebastian Schelter wrote:
>
>> I wrote a small benchmark as we investigated this issue recently for
>> computing recommendations from an ALS factorization.
>>
>> https://gist.github.com/sscdotopen/5147521
>>
>> Here's some results on my notebook, indicating a 5x to 6x performance
>> improvement:
>>
>> dot() 570ms, direct 98ms
>> dot() 568ms, direct 97ms
>> dot() 559ms, direct 97ms
>> dot() 566ms, direct 98ms
>> dot() 574ms, direct 98ms
>> dot() 582ms, direct 101ms
>> dot() 581ms, direct 98ms
>> dot() 576ms, direct 98ms
>> dot() 574ms, direct 97ms
>> dot() 574ms, direct 99ms
>> dot() 564ms, direct 98ms
>> dot() 576ms, direct 97ms
>> dot() 580ms, direct 100ms
>>
>> On 12.03.2013 23:00, Sean Owen wrote:
>>> It's almost certainly the overhead of the iterator creation and iterator
>>> methods.
>>> DenseVector.dot() is not specialized and the simple dot product method here
>>> could as well be placed there. Then the call to DenseVector.dot() would be
>>> equally unsurprisingly fast.
>>>
>>> On Tue, Mar 12, 2013 at 9:56 PM, Jake Mannix wrote:
>>>
>>>> But then where does it slow down? It just wraps a double[]
>>>>
>>>> On Tuesday, March 12, 2013, Sebastian Schelter wrote:
>>>>
>>>>> I looked into DenseVector and it doesn't use any primitive collections,
>>>>> so ignore my last mail :)
>>>>>
>>>>> On 12.03.2013 22:16, Sebastian Schelter wrote:
>>>>>> As a sidenote: I was kinda shocked recently, that switching from
>>>>>> DenseVector's dot() method to a direct dot product computation gave a 3x
>>>>>> increase in performance in
>>>>>> org.apache.mahout.cf.taste.hadoop.als.RecommenderJob.
>>>>>>
>>>>>> It seems like we really have a performance problem for some usecases.
>>>>>>
>>>>>> On 12.03.2013 22:04, Dawid Weiss wrote:
>>>>>>>> The primary use case for mahout collections is directly *inside* of
>>>>>>>> our Vector interface. Which is to say, it's not directly exposed to
>>>>>>>> most users, and we don't really expose the ability to do guava
>>>>>>>> collections stuff on them at all: We Do Math. :) So in particular,
>>>>>>>> we don't expose
>>>>>>>
>>>>>>> Fair enough. But you might want to expose some of it at some point and
>>>>>>> if this happens it may just be ready for you.
>>>>>>>
>>>>>>>> Question is whether there's anything to be gained by just swapping
>>>>>>>> our own collections *out* for something else, like HPPC or fastutil.
>>>>>>>
>>>>>>> Depends. Speed optimizations may be one reason -- you'd need to check
>>>>>>> if the code gains anything by using these libraries compared to Mahout
>>>>>>> collections. While microbenchmarks may show large differences my bet
>>>>>>> is that overall results, taking into account computations and, God
>>>>>>> forbid, I/O, will be within noise range unless you're really using
>>>>>>> these data structures a *lot* in tight loops. The only practical
>>>>>>> benefit I see is getting rid of a chunk of code you don't wish to
>>>>>>> maintain (like you said: missing features, unit tests, etc.). But I
>>>>>>> don't negate there is some entertainment value in going back to such
>>>>>>> fundamental data structures and trying to squeeze the last bit of
>>>>>>> performance out of them. :)
>>>>>>>
>>>>>>> Dawid
>>>>>>>
>>>>
>>>> --
>>>>
>>>> -jake
>>>>
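
The iterator overhead that Sean Owen suspects in the thread can be illustrated with a small standalone sketch. Everything in it is an assumption made for illustration: the class DotSketch and its methods are hypothetical, it works on plain double[] arrays rather than Mahout's Vector or DenseVector, and the boxed Iterator<Double> is only a simplified stand-in for whatever element iterator the generic dot() uses. It shows that both paths compute the same sum while the generic one pays for an iterator object and per-element method calls; it is not a benchmark and makes no claim about the timings quoted above.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration only; this is NOT Mahout's DenseVector.
final class DotSketch {

    // Generic path: one operand is walked through an iterator object,
    // so every element costs hasNext()/next() calls plus boxing.
    static double iteratorDot(double[] a, double[] b) {
        checkSameLength(a, b);
        Iterator<Double> it = new Iterator<Double>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < a.length;
            }

            @Override
            public Double next() {
                if (i >= a.length) {
                    throw new NoSuchElementException();
                }
                return a[i++]; // autoboxed to Double
            }
        };
        double sum = 0;
        int n = 0;
        while (it.hasNext()) {
            sum += it.next() * b[n++];
        }
        return sum;
    }

    // Specialized path: a plain indexed loop over both arrays.
    static double directDot(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0;
        for (int n = 0; n < a.length; n++) {
            sum += a[n] * b[n];
        }
        return sum;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors differ in length");
        }
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {4.0, 5.0, 6.0};
        System.out.println(iteratorDot(a, b)); // prints 32.0
        System.out.println(directDot(a, b));   // prints 32.0
    }
}
```

To compare the two paths for speed, a harness such as JMH with warm-up iterations would be a safer choice than a hand-rolled timing loop, since the JIT can remove much of the iterator cost in small tests.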