Date: Fri, 2 Dec 2011 18:19:44 +0000
Subject: Re: Mahout performance issues
From: Sean Owen
To: user@mahout.apache.org

On Fri, Dec 2, 2011 at 6:07 PM, Daniel Zohar wrote:

> I definitely agree that the correctness should not be broken. My solution
> is not meant to decrease the number of possible items like you stated in
> your example. It was meant to reduce the amount of item-user associations
> (while preserving user-item associations) which will result in much less
> effort on intersectionSize(). Even in the case that we have two popular
>

My point is that intersectionSize() is called as part of a similarity computation. Yes, that's the bottleneck. But that happens after the stage where candidate items are identified. And you are talking about changing the candidate identification stage, which is not the bottleneck.

I think your change *happens* to also reduce the number of similarity computations, since it assumes some are 0 when they are not! Sure, that saves time, in the same way that you'll finish an exam faster if you don't answer half the questions.

I am instead suggesting that we optimize intersectionSize() so that for all of these 1-item cases, the answer is computed extremely fast. That also addresses the bottleneck, of course.

I suppose this could be proven or disproven quickly: do you get the same speedup with the change I committed, without your change? If you do, great, we have a solution. If not, then I am wrong and you have an example that pinpoints where the new bottleneck is.
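
A minimal sketch of the kind of fast path described above, for illustration only. It assumes plain java.util.Set<Long> in place of Mahout's own ID-set classes, and the class name IntersectionSketch is hypothetical. It is not necessarily what the committed change does. The idea is to iterate over the smaller set and probe the larger one, so a 1-item set costs a single hash lookup:

```java
import java.util.Set;

// Hypothetical helper, not Mahout code: shows one way to make the
// small-set case of an intersection count cheap.
final class IntersectionSketch {

    private IntersectionSketch() {}

    /** Counts the IDs present in both sets. */
    static int intersectionSize(Set<Long> a, Set<Long> b) {
        // Iterate over the smaller set and probe the larger one, so the
        // cost is proportional to the smaller size.
        Set<Long> small = a.size() <= b.size() ? a : b;
        Set<Long> large = (small == a) ? b : a;

        if (small.isEmpty()) {
            return 0;
        }

        // Fast path for the 1-item case: a single hash lookup.
        if (small.size() == 1) {
            return large.contains(small.iterator().next()) ? 1 : 0;
        }

        // General case: one lookup per element of the smaller set.
        int count = 0;
        for (Long id : small) {
            if (large.contains(id)) {
                count++;
            }
        }
        return count;
    }
}
```

Unlike dropping associations, this returns the same count as a naive intersection, so similarity values are unchanged. Only the work done when one side is tiny goes down.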