From: "Rob Eden (JIRA)"
To: mahout-dev@lucene.apache.org
Date: Wed, 19 Aug 2009 08:33:14 -0700 (PDT)
Subject: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
Message-ID: <1146676633.1250695994968.JavaMail.jira@brutus>
In-Reply-To: <801364025.1242991785747.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745082#action_12745082 ]

Rob Eden commented on MAHOUT-121:
---------------------------------

Hi guys. I'm the lead developer for the Trove project. Shashi mentioned the problem you're having with Trove's license. I appreciate the interest and would like to accommodate your usage. I'm going to speak to the original developer about dual-licensing.

My specific question is, would the MPL be an acceptable license for usage or does it have to be APL? (Not implying that I necessarily have a problem with APL... just checking options.)

> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
> Key: MAHOUT-121
> URL: https://issues.apache.org/jira/browse/MAHOUT-121
> Project: Mahout
> Issue Type: Improvement
> Components: Matrix
> Reporter: Shashikant Kore
> Assignee: Grant Ingersoll
> Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. The complete dataset has a few tens of thousands of feature items, but each vector has only a couple of hundred feature items. For this, there is an optimization in distance calculation, a link to which I found in the archives of the Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors with a few hundred items each. I ran canopy generation with Euclidean distance and t1 and t2 values of 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer and Double objects instead of primitives is computationally expensive. I changed the sparse vector implementation to use the primitive collections from Trove [ http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum up, these two optimizations reduced cluster generation time by 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.
> Licensing of Trove seems to be an issue which needs to be addressed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
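
For readers following the quoted issue description, here is a minimal sketch of the two ideas it combines. It assumes the distance optimization takes the common form of expanding the squared Euclidean distance as |a|^2 + |b|^2 - 2(a.b), with each vector's squared norm cached and the dot product visiting only non-zero entries; the message does not say whether the attached patches do exactly this. It also stores entries in primitive int[]/double[] arrays as a dependency-free stand-in for the Trove collections. The names SparseDistanceSketch, SparseVec, dot and euclidean are hypothetical and are not Mahout or Trove APIs.

```java
public final class SparseDistanceSketch {

    /** Hypothetical sparse vector: sorted feature ids with parallel primitive values. */
    static final class SparseVec {
        final int[] indices;      // strictly increasing feature ids (assumed by dot())
        final double[] values;    // values[i] is the weight of feature indices[i]
        final double squaredNorm; // computed once, reused by every distance call

        SparseVec(int[] indices, double[] values) {
            if (indices.length != values.length) {
                throw new IllegalArgumentException("indices and values must have equal length");
            }
            this.indices = indices;
            this.values = values;
            double sum = 0.0;
            for (double v : values) {
                sum += v * v;
            }
            this.squaredNorm = sum;
        }
    }

    /** Dot product by merging the two sorted index arrays; only shared features contribute. */
    static double dot(SparseVec a, SparseVec b) {
        int i = 0;
        int j = 0;
        double sum = 0.0;
        while (i < a.indices.length && j < b.indices.length) {
            int ia = a.indices[i];
            int ib = b.indices[j];
            if (ia == ib) {
                sum += a.values[i] * b.values[j];
                i++;
                j++;
            } else if (ia < ib) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    /** Euclidean distance via |a|^2 + |b|^2 - 2(a.b), using the cached squared norms. */
    static double euclidean(SparseVec a, SparseVec b) {
        double squared = a.squaredNorm + b.squaredNorm - 2.0 * dot(a, b);
        // Rounding can push the result slightly below zero for near-identical vectors.
        return Math.sqrt(Math.max(0.0, squared));
    }

    public static void main(String[] args) {
        SparseVec a = new SparseVec(new int[] {1, 7, 42}, new double[] {1.0, 2.0, 3.0});
        SparseVec b = new SparseVec(new int[] {7, 42, 9000}, new double[] {2.0, 1.0, 4.0});
        // Squared norms are 14 and 21, the dot product is 7, so the distance is sqrt(21).
        System.out.println(euclidean(a, b));
    }
}
```

The cost of each distance call then depends on the number of non-zero entries in the two vectors rather than on the full feature space, and no Integer or Double objects are allocated along the way.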