From: Nicolás Fantone (JIRA)
To: mahout-dev@lucene.apache.org
Date: Fri, 7 Aug 2009 08:31:15 -0700 (PDT)
Subject: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
Message-ID: <375272908.1249659075238.JavaMail.jira@brutus>
In-Reply-To: <801364025.1242991785747.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740587#action_12740587 ]

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

{quote}
The point illustrated by the String loop example has nothing to do with how variables are declared, and everything to do with the difference between String and StringBuilder. It doesn't seem to address the point previously raised.
{quote}

Not quite right. The difference between String and StringBuilder IS EXACTLY the difference between instantiating thousands of objects and re-using just one, which is, I believe, the matter at hand here.

{quote}
In fact the first is ever so slightly worse since it sets s to null, but the value is unused. But it is worse for another reason: s continues to point to "anotherString: 249999" after the loop terminates, which is also pointless.
{quote}

If you create new Strings in a loop, then you'll have as many objects as iterations, holding "anotherString: 0", "anotherString: 1", ..., "anotherString: 121410", and so on, all waiting to be garbage collected, which may not even happen in the short term. Even more pointless, following your logic.

{quote}
Hence I would undo that part of the patch unless there is another purpose to it I missed.
{quote}

Perhaps someone could run a profiler with and without the latest patch? I tend to think the gain in execution speed would not be significant, if there is any at all, as some of you have stated. However, unless code readability is a priority, I see no harm in changing something that can only help performance.

{quote}
This isn't an example of unrolling is it?
{quote}

That's right. It is not. It is about the cost of instantiation vs. reusability of short-lived objects.
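For readers without the earlier messages at hand, the two loop styles being argued about might look roughly like the sketch below. This is a reconstruction, not the code from the thread or the patch: the class and method names are hypothetical, and only the "anotherString: " label and the 250000 iteration count (implied by "anotherString: 249999") come from the discussion.

```java
// Hypothetical reconstruction of the "String loop example" discussed above.
final class StringReuseSketch {

    // Allocates a new String on every iteration; each earlier value becomes
    // garbage as soon as s is reassigned.
    static String lastLabelWithStrings(int iterations) {
        String s = null;
        for (int i = 0; i < iterations; i++) {
            s = "anotherString: " + i;
        }
        return s;
    }

    // Reuses one StringBuilder, resetting its length instead of allocating
    // a new object per iteration.
    static String lastLabelWithBuilder(int iterations) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < iterations; i++) {
            sb.setLength(0);
            sb.append("anotherString: ").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 250000;
        System.out.println(lastLabelWithStrings(n)); // anotherString: 249999
        System.out.println(lastLabelWithBuilder(n)); // anotherString: 249999
    }
}
```

Both methods return the same final value; they differ only in how many objects are allocated along the way. Whether that difference is measurable depends on the JIT and the garbage collector, which is why profiling with and without the patch is the sensible way to settle it.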
> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
>                 Key: MAHOUT-121
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-121
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. The complete dataset has a few tens of thousands of feature items, but each vector has only a couple of hundred feature items. For this, there is an optimization in distance calculation, a link to which I found in the archives of the Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors with a few hundred items each. I ran canopy generation with Euclidean distance and t1, t2 values of 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer and Double objects instead of primitives is computationally expensive. I changed the sparse vector implementation to use primitive collections by Trove [ http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum up, these two optimizations reduced cluster generation time by 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.
> Licensing of Trove seems to be an issue which needs to be addressed.
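The issue text does not spell out the distance optimization itself. One common form of it, assumed here, expands the squared Euclidean distance as |x - c|^2 = |x|^2 + |c|^2 - 2 (x . c), so that with both squared norms cached, each distance costs only a dot product over the sparse vector's non-zero entries. The sketch below illustrates that idea with primitive arrays; the class and method names are hypothetical and this is not the code from any of the attached patches.

```java
// Hypothetical sketch of a sparse-aware Euclidean distance; not Mahout code.
final class SparseDistanceSketch {

    // Squared Euclidean length; meant to be computed once per vector and cached.
    static double squaredNorm(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v * v;
        }
        return sum;
    }

    // Dot product that visits only the sparse vector's non-zero entries.
    // indices[k] is the position of values[k] in the dense vector.
    static double dot(int[] indices, double[] values, double[] dense) {
        double sum = 0.0;
        for (int k = 0; k < indices.length; k++) {
            sum += values[k] * dense[indices[k]];
        }
        return sum;
    }

    // Distance from the cached squared norms and one sparse dot product.
    static double distance(int[] indices, double[] values, double sparseSqNorm,
                           double[] dense, double denseSqNorm) {
        double sq = sparseSqNorm + denseSqNorm - 2.0 * dot(indices, values, dense);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.sqrt(Math.max(0.0, sq));
    }

    public static void main(String[] args) {
        int[] indices = {1, 4};          // sparse point x = [0, 3, 0, 0, 4, 0]
        double[] values = {3.0, 4.0};
        double[] centroid = {1.0, 1.0, 0.0, 0.0, 2.0, 0.0};

        double d = distance(indices, values, squaredNorm(values),
                            centroid, squaredNorm(centroid));
        System.out.println(d); // 3.0
    }
}
```

Storing indices and values in `int[]` and `double[]` rather than boxed `Integer` and `Double` collections is in the same spirit as the Trove change described above, without taking on the library or its licensing question.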
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.